IceFS Cubes

Posted on March 4, 2016

For my CS854 class I have to read a trio of research papers each week and post summaries, which I then update after the class where a student presents the paper. Don't rely on my summaries for anything, but they might interest some of you, so I'm posting them here.

Just read the paper.

Physical Disentanglement in a Container-Based File System

This paper introduces IceFS, which can separate directories into cubes (Ice Cubes, get it?). These cubes are isolated from each other, so failures in one don't affect any other. Also, calls to fsync() in one cube don't affect the performance of the others.

Wouldn't it be nice if you could run separate tasks in separate directories, and they didn't interfere? That's what IceFS is working towards.

Let's define what we're talking about. The paper calls it entanglement when the metadata or data from two different tasks are stored on the same block. (Not at all related to the quantum kind of entanglement, where you can't write the state as the Kronecker product of two independent quantum states :P).

As an example of what happens without this isolation, running SQLite and Varmail at the same time cuts their performance to one half and one tenth, respectively. I'd be interested in knowing whether IceFS would help Docker instances. If anyone has lots of free time, re-run their tests with SQLite and Varmail in their own Docker instances too. I'd love to hear about the result.

This is because Varmail calls fsync after its short writes, and SQLite has very large writes. These large writes get synced when Varmail calls fsync, causing fsync to take much longer and preventing better batching of writes on SQLite's side.
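To convince myself of the fsync story, here is a toy model. It is only a sketch under my own assumptions, not how ext3 or IceFS are actually implemented: the class names, the byte counts, and the idea that fsync cost equals the bytes flushed are all made up for illustration.

```python
from collections import defaultdict


class SharedJournal:
    """Toy model: one running transaction shared by every task."""

    def __init__(self):
        self.pending = defaultdict(int)  # task name -> dirty bytes

    def write(self, task, nbytes):
        self.pending[task] += nbytes

    def fsync(self, task):
        # Committing the shared transaction flushes everyone's dirty data.
        flushed = sum(self.pending.values())
        self.pending.clear()
        return flushed


class PerCubeJournal:
    """Toy model: each cube (here, one per task) has its own transaction."""

    def __init__(self):
        self.pending = defaultdict(int)  # task name -> dirty bytes

    def write(self, task, nbytes):
        self.pending[task] += nbytes

    def fsync(self, task):
        # Only the caller's own dirty data is flushed.
        return self.pending.pop(task, 0)


def varmail_fsync_cost(journal):
    """Bytes flushed by Varmail's fsync while SQLite has data pending."""
    journal.write("sqlite", 64 * 1024 * 1024)  # one large write (made-up size)
    journal.write("varmail", 4 * 1024)         # one short write (made-up size)
    return journal.fsync("varmail")


shared_cost = varmail_fsync_cost(SharedJournal())
isolated_cost = varmail_fsync_cost(PerCubeJournal())
print(shared_cost, isolated_cost)
```

In the shared model Varmail's fsync pays for SQLite's large write as well as its own, while in the per-cube model it only pays for its own short write.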

Another example: if you have a bunch of VMs with file system images in different directories, and a fault causes the host file system to become read-only, this introduces downtime for all the VMs. If each directory were its own cube, however, this would only take down one VM, and it would be faster to fsck that small cube instead of the entire partition.
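The VM example can be sketched the same way. This is a hypothetical model of my own: `Cube`, `handle_fault`, and the block counts are invented, and the number of blocks is just a stand-in for how long fsck would take.

```python
class Cube:
    """One directory tree, e.g. the directory holding a VM's disk image."""

    def __init__(self, name, blocks):
        self.name = name
        self.blocks = blocks      # number of blocks, stands in for fsck cost
        self.read_only = False


def handle_fault(cubes, faulty, isolated):
    """Mark affected cubes read-only and return how many blocks fsck must scan."""
    if isolated:
        affected = [c for c in cubes if c is faulty]
    else:
        affected = list(cubes)    # whole file system goes read-only
    for c in affected:
        c.read_only = True
    return sum(c.blocks for c in affected)


def make_cubes():
    return [Cube("vm1", 1000), Cube("vm2", 1000), Cube("vm3", 1000)]


shared = make_cubes()
shared_fsck = handle_fault(shared, shared[0], isolated=False)
shared_down = [c.name for c in shared if c.read_only]

iso = make_cubes()
iso_fsck = handle_fault(iso, iso[0], isolated=True)
iso_down = [c.name for c in iso if c.read_only]

print(shared_down, shared_fsck)
print(iso_down, iso_fsck)
```

Without isolation every VM goes down and fsck has to scan everything. With one cube per VM, only the faulty cube goes read-only and only its blocks need checking.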

Now, I have some concerns here. The paper talks about faults in the VM that can take out the host OS's filesystem, but I would already hope that the VM is providing isolation. I don't quite understand what kind of faults would do that.

I guess if the host OS's drive suffers from a block failure in one of the blocks backing the VM, that could have the effect, but do you really want to just fsck and keep running, or start replacing the drives right away?

The Cube Abstraction

We would like to group sets of files and directories together into a cube. Each cube will be physically isolated from the others, and will not have an impact on other cubes if it fails or has an incompatible workload.

The key trick is to make sure the metadata for each cube and its files doesn't get stored in a block that is used by anything else. This does increase overhead slightly, as we increase fragmentation, but there is a payoff.
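Here is a small sketch of that trade-off. It is my own simplification, not IceFS's real allocator: `BLOCK_SLOTS`, the function names, and the list-of-lists "disk" are all hypothetical. Each entry is labelled with the cube that owns it, and a block is entangled when it holds entries from more than one cube.

```python
BLOCK_SLOTS = 4  # hypothetical: metadata entries that fit in one block


def blocks_shared(entries):
    """Pack entries into blocks in arrival order, ignoring ownership."""
    blocks = []
    for cube in entries:
        if not blocks or len(blocks[-1]) == BLOCK_SLOTS:
            blocks.append([])
        blocks[-1].append(cube)
    return blocks


def blocks_per_cube(entries):
    """Pack entries so that every block holds entries from one cube only."""
    open_block = {}  # cube -> its current, possibly partially filled, block
    blocks = []
    for cube in entries:
        block = open_block.get(cube)
        if block is None or len(block) == BLOCK_SLOTS:
            block = []
            blocks.append(block)
            open_block[cube] = block
        block.append(cube)
    return blocks


def entangled(blocks):
    """Return the blocks that hold entries from more than one cube."""
    return [b for b in blocks if len(set(b)) > 1]


entries = ["a", "b", "a", "c", "b", "a"]  # owner of each metadata entry
shared = blocks_shared(entries)
isolated = blocks_per_cube(entries)
print(len(shared), len(entangled(shared)))
print(len(isolated), len(entangled(isolated)))
```

The per-cube packing uses more, partially filled blocks for the same entries, which is the fragmentation overhead, but no block is shared between cubes.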

We also need to unbundle transactions, to solve the fsync problem.

Last concerns

How does this compare to LVM (Logical Volume Manager)? I feel like the paper really should have discussed that, since they seem to have similar goals.

One downside of LVM might be that it requires remounting the volume to increase the maximum size.

I don't actually know anything about LVM, other than it made it harder to get grub running again after I messed up my Linux install.

Oh, and how often do you need to fsck a drive anyways? If we somehow managed to make fsck run in zero time, what would be the upside? Is there significant downtime due to fsck running?

Though those isolation results seem neat, maybe this could get absorbed by Docker if there are real improvements.