Utilizing the IOMMU Scalably

Posted on February 4, 2016

For my CS854 class I have to read a trio of research papers each week and post summaries, which are then updated after the class in which a student presents the paper. Don't rely on my summaries for anything, but they might interest some of you, so I'm posting them here.

Just read the paper.

This post is on: Utilizing the IOMMU Scalably. https://www.usenix.org/conference/atc15/technical-session/presentation/peleg


Oh cool, DMA + MMU = IOMMU. I didn't know those existed, but then I've only dealt with DMA on really low-level hardware, like the GBA.

The IOMMU exists to get the protection of virtual memory while using DMA (direct memory access) to copy buffers around. This is useful for a NIC (network interface card), and performance matters because we'd like to hit multi-Gb/s of network data transfer.

Currently we have two bottlenecks:

  1. Assignment of IOVAs. (Virtual addresses used for IO).
  2. Management of IOMMU's TLB (Flushing is slooow).

Solutions:

  1. Dynamic identity mappings, removing IOVAs
    • What's that mean?
  2. Allocating IOVAs using kmalloc.
    • Rely on the already optimized part of the kernel.
  3. Per-core caching of IOVAs allocated by a globally locked IOVA allocator.
    • Most allocations can then skip the global allocator's lock entirely.

The problem, restated, is that the NIC wants to DMA stuff, which is much faster than tying up a CPU with copies. That means mapping virtual addresses (IOVAs) onto the buffers so the device can DMA into them. But then we want to remove the mapping later.

Currently, we map a new IOVA for each buffer and unmap it afterwards. This causes lots of IOTLB invalidation traffic.
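If I sketch that per-buffer dance in kernel-flavoured C, it looks roughly like this. This is my own toy sketch, not code from the paper: the function and buffer names are made up, but dma_map_single / dma_unmap_single are the real Linux DMA API a NIC driver would go through.

```c
#include <linux/dma-mapping.h>
#include <linux/errno.h>

/* Hypothetical per-packet receive path: every buffer gets its own
 * IOVA mapping, which is torn down again once the packet is handled. */
static int my_rx_one_packet(struct device *dev, void *buf, size_t len)
{
	dma_addr_t iova;

	/* Create an IOVA -> buffer mapping the NIC can DMA into. */
	iova = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
	if (dma_mapping_error(dev, iova))
		return -ENOMEM;

	/* ... hand 'iova' to the NIC, wait for the DMA to complete ... */

	/* Tear the mapping down: this is where IOVA allocation and IOTLB
	 * invalidation costs show up, millions of times per second. */
	dma_unmap_single(dev, iova, len, DMA_FROM_DEVICE);
	return 0;
}
```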

If we have a static mapping, performance is acceptable, but that's not great for security, I think? I wasn't quite sure why you need to unmap the buffers and keep dynamically allocating them. Oh, it's for protection from misbehaving devices.

Linux keeps a queue of invalidations and then does them all at once, but the batching data structure is lock-protected and can be a bottleneck.

IOMMUs on x86 use a page-table structure just like the MMU's, with 4 levels. There's also an IOTLB that caches translations. It must be flushed whenever we modify a translation.
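For intuition, here's how a 4-level walk carves up a 48-bit IOVA. This is my own sketch of the usual x86-style layout (9 bits of index per level plus a 12-bit page offset), not anything from the paper:

```c
#include <stdint.h>

/* Split a 48-bit IOVA into the four 9-bit table indices and the
 * 12-bit page offset used by an x86-style 4-level page table. */
struct iova_indices {
	unsigned int l4, l3, l2, l1; /* index into each table level */
	unsigned int offset;         /* offset within the 4 KiB page */
};

static struct iova_indices split_iova(uint64_t iova)
{
	struct iova_indices ix;

	ix.offset = iova & 0xfff;      /* bits 0-11  */
	ix.l1 = (iova >> 12) & 0x1ff;  /* bits 12-20 */
	ix.l2 = (iova >> 21) & 0x1ff;  /* bits 21-29 */
	ix.l3 = (iova >> 30) & 0x1ff;  /* bits 30-38 */
	ix.l4 = (iova >> 39) & 0x1ff;  /* bits 39-47 */
	return ix;
}
```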

Oh cool, you can use the IOMMU with virtualization to let the guest OS directly control some device. I've got to learn more about virtualization sometime.

Dynamic mappings are there to protect the OS from devices. Creating and destroying millions of IOVAs a second is slow, who woulda guessed.

Normally Linux operates in deferred-invalidation mode: after an unmap, it happily returns without waiting for the mapping to be fully purged from the IOMMU; the invalidation is only queued.
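The way I picture that deferred path (a toy sketch, not the actual intel-iommu code; the single spinlock around the queue is exactly the scalability problem the paper points at):

```c
#include <linux/spinlock.h>

#define FLUSH_BATCH 256

/* One global, lock-protected queue of pending IOTLB invalidations. */
static DEFINE_SPINLOCK(flush_lock);
static struct { unsigned long iova; size_t pages; } flush_queue[FLUSH_BATCH];
static unsigned int flush_count;

static void queue_iotlb_invalidation(unsigned long iova, size_t pages)
{
	spin_lock(&flush_lock);               /* every core contends here */
	flush_queue[flush_count].iova = iova;
	flush_queue[flush_count].pages = pages;
	if (++flush_count == FLUSH_BATCH) {
		/* ... flush the whole IOTLB once instead of per-entry ... */
		flush_count = 0;
	}
	spin_unlock(&flush_lock);
	/* The unmap returns here; the IOVA is still visible to the device
	 * until the batched flush actually happens. */
}
```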

The kmalloc solution mostly works, but it's hard to reclaim the intermediate page-table pages afterwards.


4.1 Dynamic Identity Mapping

Ah, so we use the fact that the buffers are normally physically contiguous, and just use an identity (1-to-1) mapping, i.e. the IOVA of a buffer is its physical address. Since we still map and unmap the regions on demand, it's called dynamic identity mapping.

It's got some drawbacks, and isn't great: the same physical page can back several in-flight buffers, so you need to keep reference counts on mappings, and you can end up with conflicting access permissions for the same page.
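The bookkeeping I imagine this needs looks something like the following. Purely my own sketch of section 4.1: lookup_or_create(), install_pte() and remove_pte_and_flush() are hypothetical helpers, not kernel API.

```c
/* Toy bookkeeping for dynamic identity mapping: the IOVA of a buffer is
 * simply its physical address, but one physical page may back several
 * in-flight buffers, so each IOMMU mapping is reference-counted. */
struct identity_mapping {
	unsigned long pfn;   /* physical frame number == IOVA frame number */
	int refcount;        /* in-flight DMA buffers touching this page   */
	unsigned int perms;  /* union of requested permissions (R and/or W) */
};

static unsigned long identity_map_page(unsigned long pfn, unsigned int perms)
{
	struct identity_mapping *m = lookup_or_create(pfn);  /* hypothetical */

	if (m->refcount++ == 0) {
		m->perms = perms;
		install_pte(pfn, pfn, perms);   /* IOVA page == physical page */
	} else if (perms & ~m->perms) {
		/* The conflicting-permissions problem: a second buffer wants
		 * more access to the same page, so the PTE has to be widened. */
		m->perms |= perms;
		install_pte(pfn, pfn, m->perms);
	}
	return pfn << 12;    /* the IOVA is just the physical address */
}

static void identity_unmap_page(unsigned long pfn)
{
	struct identity_mapping *m = lookup_or_create(pfn);

	if (--m->refcount == 0)
		remove_pte_and_flush(pfn);  /* only now invalidate the IOTLB */
}
```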


4.2 IOVA-kmalloc

Use kmalloc to hand out addresses and reuse each returned address as an IOVA, relying on kmalloc to guarantee uniqueness. This sorta wastes physical memory, but only about 8 bytes of it per page of IOVA allocated.
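A toy sketch of the trick as I understand it (not the paper's actual patch; the function names are mine):

```c
#include <linux/slab.h>

/* IOVA-kmalloc idea: kmalloc a tiny buffer, roughly one byte per IOVA
 * page needed, and reuse the buffer's address as the IOVA page frame
 * number. kmalloc already guarantees those addresses are unique, so the
 * IOVA range is exclusively ours until we kfree the buffer. */
static unsigned long iova_kmalloc_alloc(size_t npages, void **cookie)
{
	void *buf = kmalloc(npages, GFP_KERNEL); /* ~8 bytes minimum in practice */

	if (!buf)
		return 0;
	*cookie = buf;
	return (unsigned long)buf;   /* treated as the first IOVA page frame */
}

static void iova_kmalloc_free(void *cookie)
{
	kfree(cookie);   /* freeing the bytes frees the IOVA range */
}
```

The nice part is that we piggyback on a part of the kernel that's already heavily optimized for scalability instead of writing a new allocator.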


4.3 Scalable IOVA Allocation with Magazines

Basic Idea: per-core cache of previously deallocated IOVA ranges.

This avoids needing to acquire the global lock on most allocations.

A common scenario is that one core allocates multiple IOVAs and another core deallocates all of them, which could cause a buildup of cached IOVA ranges at the freeing core.

A magazine (as in, one holding ammo) is a bundle of M elements. When a core tries to allocate but its magazine is empty, it grabs a full magazine from the depot. Thus the cache-miss rate is bounded by 1/M.
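A stripped-down sketch of the per-core fast path (the magazine scheme itself is Bonwick's; the names and placeholder fallback here are mine, not the paper's code):

```c
#include <linux/percpu.h>
#include <linux/spinlock.h>

#define MAG_SIZE 128    /* M: IOVA ranges held per magazine */

struct magazine {
	unsigned int count;
	unsigned long iovas[MAG_SIZE];
};

/* Each core keeps a magazine of recently freed IOVA ranges... */
static DEFINE_PER_CPU(struct magazine, my_mag);

/* ...and a lock-protected global depot holds full/empty magazines. */
static DEFINE_SPINLOCK(depot_lock);

static unsigned long iova_alloc_fast(void)
{
	struct magazine *mag = get_cpu_ptr(&my_mag);  /* pin to this core */
	unsigned long iova = 0;

	if (mag->count > 0) {
		iova = mag->iovas[--mag->count]; /* common case: no global lock */
	} else {
		/* Miss: swap the empty magazine for a full one from the depot,
		 * or fall back to the old globally locked allocator. At most
		 * one miss per M allocations. */
		spin_lock(&depot_lock);
		/* ... exchange magazines / refill from the global allocator ... */
		spin_unlock(&depot_lock);
	}
	put_cpu_ptr(&my_mag);
	return iova;   /* 0 stands in for the slow path in this sketch */
}
```

The depot is also what handles the allocate-on-one-core, free-on-another pattern: the freeing core just pushes its full magazines back to the depot instead of hoarding them.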


Conclusion: These three techniques are all useful, but have different trade-offs. Still better than stock Linux, though, and without any significant security risks(?), so maybe they'll make their way into the kernel?

Managing the memory used to store the page table well is still an open problem.