It's been almost a year since I initially wrote my Getting Rust on the GBA series, and it's time I revisited it, and consolidated my explorations into one set of instructions.
A few things have changed in the Rust ecosystem since then, and I'm going to adapt this guide to use them.
xargo is billed as "Effortless cross compilation of Rust programs to custom bare-metal targets like ARM Cortex-M", which is perfect for this project. And it lives up to its name.
Now, instead of downloading a copy of the Rust source code, cross compiling core for the GBA, and sticking the library in some magic place, we can just say `xargo build --target=gba` instead of `cargo build --target=gba.json`.
But in addition to using this new tool, I've also updated my version of rustc, and the old `gba.json` file stopped working. It turns out the `data-layout` field needed to be changed. Thanks to the IRC channel, I found the new data layout, and everything seems to work just fine now.
I also had to update the `lang.rs` file, as the `stack_exhausted` fn is no longer required or understood. The newest version is already pushed to GitHub, so take a look.
This is a problem that came up in my research, and I'm documenting it because I think it's a pretty interesting one. (Not just because I'll forget why I did this if I don't write about it, no, not at all :P) Since it arose from my research, it's unfortunately not removed from context; the best I can do is give you a rundown of the moving pieces before I use them.
I've been working on a formalization of [Featherweight Java][FJ], which is what you get when you strip Java down until you hit something the size of the lambda calculus. It plays a similar foundational role, but with methods on objects in place of lambdas.
// Two basic classes. We don't sully ourselves with Ints or other base types.
class A extends Object {
  A() { super(); }
}
class B extends Object {
  B() { super(); }
}
class Pair extends Object {
  Object fst;
  Object snd;
  // Constructors are always trivial,
  // and just set each field to the matching parameter.
  Pair(Object fst, Object snd) {
    super(); this.fst=fst; this.snd=snd;
  }
  // Every method is just a single expression.
  // We also eschew mutation, always constructing a new object instead.
  Pair setfst(Object newfst) {
    return new Pair(newfst, this.snd);
  }
}
But most of my work has been in Coq, trying to represent this and then prove things about it.
For this post, we focus almost exclusively on the class table, which is represented as a list `[(class, (parent_class, fields, methods))]`. So for this example, one representation of the class table would be
Definition example_CT := [
  (Pair, (Object, [fst, snd], [constr_Pair, setfst])),
  (B, (Object, [], [constr_B])),
  (A, (Object, [], [constr_A]))]
where I don't actually care about how you represent individual methods or fields.
Note that we could have swapped the rows for `A` and `B` and it wouldn't really have mattered. It would make no sense to talk about classes that aren't in the class table (other than `Object`), so we have a predicate `ok_type_ CT C` which just says that either `C` is in `CT`, or `C = Object`.
We do want to enforce that you can't make loops like `[ (C, (D, fs1, ms1)), (D, (C, fs2, ms2)) ]`, as we want every chain of inheritance to terminate in `Object` at some point. We also want to rule out redefining a class, as in `[ (C, (D, fs1, ms1)), (C, (E, fs2, ms2)) ]`.
I call this constraint `directed_ct`, and it's defined inductively over the class table list.
Inductive directed_ct : ctable -> Prop :=
| directed_ct_nil : directed_ct nil
| directed_ct_cons :
forall (C D : cname) (fs : flds) (ms: mths) (ct : ctable),
directed_ct ct ->
C \notin (keys ct) -> (* No duplicate bindings *)
ok_type_ ct D -> (* No forward references *)
directed_ct ((C, (D, fs, ms)) :: ct).
We have a few utility functions for indexing into the class table, notably `binds`, where we say `binds C (D, fs, ms) CT` to mean there exists an entry `(C, (D, fs, ms))` in `CT`. `extends_` is a way to say this without naming `fs` and `ms` directly.
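For concreteness, here is roughly how I'd expect `binds` and `extends_` to be stated over the association-list class table (this is my own sketch; the definitions in the actual development may differ):

```coq
(* Sketch only: binds says the table has an entry for C with value v,
   and extends_ existentially hides the fields and methods. *)
Definition binds (C : cname) (v : cname * flds * mths) (ct : ctable) : Prop :=
  In (C, v) ct.

Definition extends_ (ct : ctable) (C D : cname) : Prop :=
  exists fs ms, binds C (D, fs, ms) ct.
```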
Now that the preliminaries are covered, I'm going to show two different definitions for a strict subclassing relation.
Inductive ssub_p (CT:ctable) : typ -> typ -> Prop :=
| ssub_p_trans : forall A B C,
ssub_p CT A B ->
ssub_p CT B C ->
ssub_p CT A C
| ssub_p_extends : forall C D, extends_ CT C D -> ssub_p CT C D.
Inductive ssub_ : ctable -> typ -> typ -> Prop :=
| ssub_trans : forall CT A B C,
ssub_ CT A B ->
ssub_ CT B C ->
ssub_ CT A C
| ssub_extends : forall CT C D, extends_ CT C D -> ssub_ CT C D.
These are almost the same, but `ssub_` is said to be *indexed* over `CT`, whereas `ssub_p` is *parametric* in the choice of `CT`.
I found this Stack Overflow answer to be quite helpful. Let me just quote a little from it:
Parameters are merely indicative that the type is somewhat generic, and behaves parametrically with regards to the argument supplied.
What this means, for instance, is that the type `List T` will have the same shapes regardless of which `T` you consider: `nil`, `cons t0 nil`, `cons t1 (cons t2 nil)`, etc. The choice of `T` only affects which values can be plugged in for `t0`, `t1`, `t2`.
Indices on the other hand may affect which inhabitants you may find in the type! That's why we say they index a family of types, that is, each indice tells you which type in the family you are looking at (in that sense, a parameter is a degenerate case where all the indices point to the same family).
For instance, the type family `Fin n` of finite sets of size `n` contains very different structures depending on your choice of `n`. The index `0` indexes an empty set. The index `1` indexes a set with one element. In that sense, the knowledge of the value of the index may carry important information! Usually, you can learn which constructors may or may not have been used by looking at an index. That's how pattern-matching in dependently-typed languages can eliminate non-feasible patterns, and extract information out of the triggering of a pattern.
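Here's the `Fin` family from the quote written out in Coq (the standard definition, not something from my development), so you can see the index doing work:

```coq
(* fin n : the numbers strictly below n.  The index differs between
   constructors, so fin 0 is empty while fin 1 has exactly one element. *)
Inductive fin : nat -> Set :=
  | FZ : forall n, fin (S n)
  | FS : forall n, fin n -> fin (S n).
```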
This still didn't fully clear it up for me; I needed to look at the induction schemes for each to really get it. If you haven't looked at the generated induction schemes for inductive types yet, I would recommend first checking out the Certified Programming with Dependent Types chapter on this.
ssub_p_ind
: forall (CT : ctable) (P : typ -> typ -> Prop),
(forall A B C : typ, (* trans *)
ssub_p CT A B -> P A B ->
ssub_p CT B C -> P B C ->
P A C) ->
(forall C D : cname, (* extends *)
extends_ CT C D -> P C D) ->
forall C D : typ,
ssub_p CT C D -> P C D
ssub__ind
: forall P : ctable -> typ -> typ -> Prop,
(forall (CT : ctable) (A B C : typ), (* trans *)
ssub_ CT A B -> P CT A B ->
ssub_ CT B C -> P CT B C ->
P CT A C) ->
(forall (CT : ctable) (C D : cname), (* extends *)
extends_ CT C D -> P CT C D) ->
forall (CT : ctable) (C D : typ),
ssub_ CT C D -> P CT C D
I know that's a big block of code to look at, but it's the differences I want to highlight. Let's just look at the conclusions of each. For `ssub_p_ind` we get `forall C D, ssub_p CT C D -> P C D`, but for `ssub__ind` we get `forall CT C D, ssub_ CT C D -> P CT C D`.
The same goes for the two cases for transitivity and direct extension: `ssub__ind` always does a `forall CT`, whereas `ssub_p_ind` has just the one `forall CT` at the very start.
I know that `ssub__ind` is more general, as I was able to prove `ssub_p_ind` given `ssub__ind` by specializing the inductive cases to the choice of `CT`, but I was not able to prove the other direction. I posit that it is impossible, but I am happy to hear any counterexamples.
This seems like good news: we can just do all our work with `ssub__ind`, and everything will work out, as it's stronger. However, in practice, Coq runs into some trouble if you try to do this. Here's a lemma that's much easier to solve with `ssub_p_ind`.
Lemma no_ssub_with_empty_table C D
(H_sub: ssub_ nil C D)
:
False.
This simply states that if you haven't declared any classes beyond `Object`, you can't have any strict subclassing relationships. Using the parametric induction scheme:
Proof.
induction H_sub using ssub_p_ind.
- (* The inductive hypothesis is immediately false, easy *)
exact IHH_sub1.
- (* We get a term H : extends_ nil C D, which unfolds to
     exists fs ms, (C, (D, fs, ms)) \in nil, another easy contradiction. *)
unfold extends_.
auto.
Qed.
It's quite trivial, as in both cases we get easy contradictions. However, if we try it with the other induction scheme, we get
Proof.
induction H_sub.
- (* Still get a False hypothesis *)
exact IHH_sub1.
- (* But now we get H : extends_ CT C D, which doesn't give us
     a contradiction at all *)
Abort.
What went wrong? Well, remember that we are trying to fill in an argument for `ssub__ind`. Let's take another look at the definition of the extends case:
...
(forall (CT : ctable) (C D : cname), (* extends *)
extends_ CT C D -> P CT C D) ->
...
That's right: we have to show that this holds for all such `CT`. This severely cramps our style. Let's see if there's a way to force it to work anyway, because why not. This time we'll start with `refine`, and be super explicit.
(* Exactly the same as above *)
Proof.
refine (ssub__ind
(* P *) (fun CT C D => False)
(* H_trans *) _
(* H_extends *) _
nil C D H_sub).
-
(* goal :
forall (CT : ctable) (t1 t2 t3 : typ), ssub_ CT t1 t2 -> False -> ssub_ CT t2 t3 -> False -> False
*)
auto. (* False -> False, easy *)
-
(*
forall (CT : ctable) (t1 t2 : cname), extends_ CT t1 t2 -> False
*)
(* still screwed! *)
Abort.
Now that we see what we've done, let's try something more clever. Let's add a condition to `P` that `CT = nil`, so in the second case we only have to prove:
forall (CT : ctable) (t1 t2 : cname), extends_ CT t1 t2 -> CT = nil -> False.
That seems much more reasonable. Let's try it!
Proof.
intros H_sub.
refine (ssub__ind
(* P *) (fun CT C D => CT = nil -> False) (* added that *)
(* H_trans *) _
(* H_extends *) _
nil C D H_sub eq_refl). (* We also had to add the term (eq_refl: nil = nil) *)
- auto. (* still trivial *)
- (* forall (CT : ctable) (t1 t2 : cname), extends_ CT t1 t2 -> CT = nil -> False *)
clear.
intros CT C D H_extends H_eq.
(* We have
H_extends : extends_ CT C D
H_eq : CT = nil
*)
rewrite H_eq in H_extends.
(* H_extends: extends_ nil C D *)
(* we are in the same place as above, easy. *)
unfold extends_.
auto.
Qed.
I think this counts as an application of the convoy pattern, as seen here. (Speaking of that site, I would love to have overlays for proofs like that blog does, and I have some ideas on how to generate them automatically. But work comes first.)
Now, let me show you where this falls apart.
Lemma strengthen_ssub (CT:ctable) C D A B ms fs
(H_dir: directed_ct ((A, (B, fs, ms)) :: CT))
(H_ok_C_s: ok_type_ ((A, (B, fs, ms)) :: CT) C)
(H_noobj: Object \notin dom ((A, (B, fs, ms)) :: CT))
(H_neq: A <> C)
(H_sub: ssub_ CT C D)
(H_ok_D: ok_type_ CT D)
: ssub_ ((A, (B, fs, ms)) :: CT) C D.
This lemma cannot be proven with `ssub_p_ind`; we need the choice of `CT` to be more determined. I think. I tried pretty hard, and even asked in the IRC channel, to solve it the parametric way. But I always ended up with fresh variables getting generated for `C` and `D`, which prevented me from ruling out that they were `A` and `B`.
Here's a snapshot of it failing.
But I did manage to prove it using `ssub__ind`, though it also required passing the hypotheses directly.
Here's a lemma for the symmetric case:
Lemma strengthen_ssub_case_2 (CT : ctable)
(C : cname) (D : cname)
(A : cname) (B : cname)
(E : cname) (F : cname)
(ms1 : mths) (fs1 : flds)
(fs2 : flds) (ms2 : mths)
: ssub_ CT C D ->
directed_ct CT ->
Object \notin dom CT ->
A <> C ->
E <> C ->
A \notin keys CT ->
E \notin keys CT ->
ssub_ ((A, (B, fs1, ms1)) :: (E, (F, fs2, ms2)) :: CT) C D.
I feel conflicted about naming these. I've mostly used `C` and `D` as the classes that show up in the final statement, but `G` and `H` don't sound like class names as much as `A` and `B` do, so I don't use four consecutive letters for the secondary class names. Bleh.
Proof.
refine ((ssub__ind
(* P *) (fun CT X Y => forall
(H_dir: directed_ct CT)
(H_noobj: Object \notin dom CT)
(H_neq1: A <> X)
(H_neq2: E <> X)
(H_notin1: A \notin keys CT)
(H_notin2: E \notin keys CT ),
ssub_ ((A, (B, fs1, ms1))::(E, (F, fs2, ms2))::CT) X Y)
(* H_Trans *) _
(* H_Extend *) _)
CT C D).
I left the hypotheses to the lemma as implications rather than naming them, as they only get applied to the result of `ssub__ind` and would need to be cleared anyway. I could rearrange the order and put a `forall CT C D,` first to avoid feeding those arguments to `ssub__ind` and clearing them afterwards, but I think it's better to name them.
- (* trans *)
clear CT C D.
intros CT t1 t2 t3.
intros H_sub_1 IHH_sub_1 H_sub_2 IHH_sub_2.
intros. (* as named above in P. *)
apply ssub_trans with (t2:=t2).
+
apply IHH_sub_1; assumption.
+ (* Need A <> t2, E <> t2 *)
assert (t2 \in keys CT). {
apply ssub_child_in_table with (D := t3); assumption.
}
assert (A <> t2). {
destruct (A == t2).
subst.
contradiction.
auto.
}
assert (E <> t2). {
destruct (E == t2).
subst.
contradiction.
auto.
}
apply IHH_sub_2; assumption.
It's actually easier to prove the transitivity case when you have fewer hypotheses in `P`. I started out with `P` just concluding `ssub_ ((A, (B, fs1, ms1))::(E, (F, fs2, ms2))::CT) X Y`, and the transitivity case was just `apply ssub_trans; auto`, but I required `A <> C` and `E <> C` for the `extends_` case. Once I added those, I then had to prove them for the middle class introduced by transitivity, which I knew little about.
To show that `A <> t2`, I noted that `t2` has to be in the class table somewhere, as it is a subclass, while `A` and `E` are not in the rest of the class table, as they are at the front.
- (* extends *)
clear dependent CT;
clear dependent C;
clear dependent D.
intros CT C D H_extends.
intros.
unfold_extends H_extends.
apply ssub_extends.
unfold extends_.
exists fs0, ms0.
auto.
Qed.
And the extends case is just based on looking up `C` in the table, which doesn't change when we add two different entries in front of it.
This proof had me stumped for quite a while, until I explicitly wrote out `P` and did the induction manually.
So sometimes you need the additional generality of indexed types. However, the extra generality can cause Coq to do a worse job with the `induction` tactic, so you shouldn't blindly default to using indices when your data structure really is fully parametric.
For my CS854 class I have to read a trio of research papers each week and post summaries, which are then updated after the class where a student presents the paper. Don't rely on my summaries for anything, but they might interest some of you, so I'm posting them here.
Just read the paper.
Physical Disentanglement in a Container-Based File System
This paper introduces IceFS, which can separate directories into cubes (ice cubes, get it?). These cubes are isolated from each other, so failures in one don't affect any other. Also, calls to fsync() in one cube don't affect the performance of others.
Wouldn't it be nice if you could run separate tasks in separate directories, and they didn't interfere? That's what IceFS is working towards.
Let's define what we're talking about. The paper calls it entanglement when the metadata or data from two different tasks are stored in the same block. (Not at all related to the entanglement where you can't write the matrix as the Kronecker product of two independent quantum states :P)
As an example of entanglement hurting, running SQLite and Varmail at the same time cuts their performance to a half and a tenth, respectively. I'd be interested in knowing whether this positively affects Docker instances. If anyone has lots of free time, re-run their tests with SQLite and Varmail in their own Docker instances too; I'd love to hear about the result.
This is because Varmail calls fsync after its short writes, while SQLite has very large writes. Those large writes get synced when Varmail calls fsync, causing fsync to take much longer, and preventing better batching of writes on SQLite's side.
Another example: if you have a bunch of VMs with file system images in different directories, then a fault which causes the host file system to become read-only introduces downtime for all the VMs. If each directory were its own cube, however, this would only take down one VM, and it would be faster to fsck that small cube instead of the entire partition.
Now, I have some concerns here. The paper talks about faults in the VM that can take out the host OS's filesystem, but I would already hope that the VM is providing isolation; I don't quite understand what kind of faults would do that. I guess if the host OS's drive suffers a block failure in one of the blocks backing the VM, that could have this effect, but do you really want to just fsck and keep running, rather than start replacing the drive right away?
We would like to group sets of files and directories together into a cube. Each cube will be physically isolated from the others, and will not have an impact on other cubes if it fails or has an incompatible workload. The key trick is to make sure the metadata for each cube and its files doesn't get stored in a block that is used by anything else. This increases overhead slightly, as we increase fragmentation, but there is a payoff.
We also need to unbundle transactions, to solve the fsync problem.
How does this compare to LVM (the Logical Volume Manager)? I feel like the paper really should have discussed that, since they seem to have similar goals.
One downside of LVM might be that it requires remounting the volume to increase the maximum size.
I don't actually know anything about LVM, other than it made it harder to get grub running again after I messed up my Linux install.
Oh, and how often do you need to fsck a drive anyways? If we somehow managed to make fsck run in zero time, what would be the upside? Is there significant downtime due to fsck running?
Though those isolation results seem neat, maybe this could get absorbed by Docker if there are real improvements.
This paper introduces the idea of split-level I/O scheduling. This tries to gather the most useful information possible, and use it to build a better I/O scheduler. It splits the scheduling logic across handlers at three layers of the storage stack: block, system call, and page cache.
There are a number of different policies that we could be trying to implement:
These are all different goals, and require different things from a scheduler.
Let's take a look at the I/O stack. It looks something like the diagram from thomas-krenn.com, created by Werner Fischer.
The major parts are
Let us first consider what classic schedulers are like. There has been a lot of classic work, for example consideration of how the disk head moves on a hard drive, and re-arranging blocks to be written to minimize seek time.
There are two downsides to working at just the block level. First, the file system mandates that some requests not be re-ordered, because they are critical to preserving consistency in case of a crash; by the time the requests hit the scheduler, it's too late for the scheduler to have a say. Second, the scheduler has no information about which process is making a request, so it cannot do proper accounting.
One way to fix those two problems is to place the scheduler above the file system, at the system call level. This fixes those two problems, but runs into other ones. In particular, the system-call level scheduler will not know about the page cache, nor other information that would help when trying to schedule tasks most efficiently. Also, the file system will issue metadata requests and journaling writes, so one I/O request can multiply below the system call level, making it hard to accurately estimate costs.
So, why not both?
The idea is to implement handlers at the system call, page-cache, and block layers, so we can get the benefits of both kinds of handlers. One downside would be the added complexity, as well as some small overhead.
The system call level handlers can tag the I/O operation with the process that will be billed for it, while we still have a block level scheduler that can use that information to properly handle priority levels.
This is called Cause Mapping, and it lets us implement fairness properly. Without it, we can get situations where the priority of the task doesn't matter, as all the I/O comes from a writeback thread.
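The tagging idea can be sketched as follows (a toy model with my own names, not the paper's actual interfaces):

```python
# Minimal sketch of cause mapping: the system-call layer tags each
# dirtied page with the process that caused the write, so that when a
# single writeback thread later flushes everything, the block layer can
# bill the true cause instead of billing the writeback thread for all of it.

dirty_pages = []   # (page, causing_pid) pairs awaiting writeback
charged = {}       # bytes billed to each originating process

def syscall_write(pid, page):
    # System-call layer: remember who caused this write.
    dirty_pages.append((page, pid))

def writeback_flush(charge):
    # The block layer sees one kernel thread doing all the I/O, but the
    # tag lets it attribute each request to its true cause.
    while dirty_pages:
        page, cause = dirty_pages.pop()
        charge(cause, len(page))

def charge(pid, nbytes):
    charged[pid] = charged.get(pid, 0) + nbytes
```

Without the tag, `charge` would only ever see the writeback thread's identity, which is exactly the fairness failure described above.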
Writes aren't always written to disk immediately; often the changes stay in the page cache for a while. (This makes a lot of sense for things like mmap'd pages, where you are changing bytes at a time. I'm not entirely sure whether the I/O is buffered in the same way when you're just appending to a file over and over again.)
The OS just tracks which pages are "dirty", in that they don't match what's on disk, and periodically has a background thread write the dirty pages to disk.
What's happening in that example with CFQ is that all the dirty pages are being written to disk by this background thread, and the block level scheduler only sees a bunch of writes from the same thread. That leads to the failure of fairness.
The Cause Mapping introduced at the system call level scheduler can be used to solve this problem, making the split-level schedulers better in this respect.
Another problem solved is cost estimation: the system call level scheduler has no idea whether the reads/writes will just be absorbed by the page cache or have to go all the way to disk, so it's hard to be completely fair. Write amplification (the transaction overhead, for example) also obscures how expensive each operation will be.
But by the time the requests make it to the block level scheduler, it's much easier to predict the cost of the operations. (Of course, it'd be even easier to judge from the hard disk controller directly, but that's a bit too late to be useful, since the operation would have already been scheduled at that point.) At the block level, a scheduler is less likely to overestimate the cost (due to a cache hit) or underestimate it (due to journaling), and is much more accurate.
Unfortunately, the block level scheduler can be too late to affect anything (i.e. perform reordering), as writes can be buffered for 30 seconds before being flushed and thus exposed to the block level scheduler, at which point there isn't much it can do.
This system starts off with a guess at the system call layer, and refines it as more information becomes available.
How to Get More Value From Your File System Directory Cache
This paper explores, well, how to get more value from your file system directory cache?
First, let's explore what needs caching, then what the cache is, then how to get more value from it.
In a word: Paths. Lots of system calls care about file paths. In POSIX, in order to open a file, you need to have search (execute) permissions on all of the parent directories. This is seemingly unavoidably linear time in the number of parents, as you would need to check each one in turn, and do a lot of pointer chasing.
Lots of things, like Linux Security Modules rely deeply on this model, so it seems hard to just replace it. You need to make sure the new method preserves the effects we care about.
The directory cache is an LRU cache of the most recently accessed directories, so we don't need to hit the disk repeatedly if we access the same directory repeatedly. Each directory entry (dentry from now on) maps a path to an inode (kept in RAM) with all the metadata about a file. In addition to the LRU list, we also have a tree structure matching that of the filesystem, a hash table from the pair of (parent dentry address, file name) to dentry, and an alias list, to track hard links to an inode.
But even a cache hit is slow: compare 1.1µs for stat with 0.04µs for getppid, or 0.3µs for a 4KB pread.
It can also cache negative dentries, to prove that a file does not exist. This can speed up checks that fail too, which is pretty important.
This structure is still linear time in the number of path components.
Create a system wide hash table from full, canonicalized paths to dentries, called the direct lookup hash table (DLHT). This is a cache, so it is populated lazily, and entries can be invalidated by operations such as renaming a directory.
In addition, we cache the results of previous prefix checks, in the prefix check cache (PCC). This depends on the permissions of the process, so we only share this between processes with identical permissions. Use a version number with each entry to detect stale entries.
We now have a faster fast path: directly look up the dentry in the DLHT, then check its permissions in the PCC, and win at speed. No more linear time costs. And if that fails, fall back to the old path walk.
Version number based locks. Everywhere.
Before a mutation, like renaming a directory, the operation walks all (cached) children and bumps their version counters. That prevents old PCC entries from applying to them anymore.
Also, remove all those dentries from the DLHT, as they could be invalid now.
Then, we have to consider if there is a slow path request currently in flight while the mutation is happening. If we don't do anything, it might add a cache entry for a directory that has been moved. We can keep a global invalidation counter, and only cache things if it hasn't been bumped during the lookup.
If you get exactly 2^32 mutations happening during one very slow lookup, you have other problems.
Also, use existing locking structures to make sure that the slow path handles concurrent requests well.
This is fundamentally a change that makes reading cheaper but writing more expensive. It does seem like a nice tradeoff in that respect, since there are a lot more reads of a file's permissions than there are writes.
This paper also introduced a separate concept: directory completeness. This is a single bit stored on a directory's dentry to mark whether all of the directory's children are definitely in the cache. If it's set, calls to `ls` can skip checking on disk.
This is cool, but not really related to the rest of the paper.
If there are any bugs, this could have rather large implications for security. If file permissions don't work properly, you're going to have a bad time.
I feel like the signature stuff to avoid string comparisons is good, but could be separated out. Nothing else really depends on it, and it could be implemented on its own.
Data Sharing or Resource Contention: Toward Performance Transparency on Multicore Systems
Idea: use the hardware counters to see when tasks are sharing data through memory, and put those tasks on the same core to make them go faster. Or put tasks which each want to use a lot of different memory on different cores, to reduce cache space contention.
The paper uses 4 counters:
Note: Last Level Cache means the one that is furthest from the CPU, as in, the last one you check before going to main memory.
Oh neat, DRAM also benefits from spatial locality. For something called Random Access Memory, that's a little surprising, but it makes sense.
Their SAM technique focuses more on making full use of the memory bandwidth for each core, while still keeping intra-core sharing in mind.
I don't have that much to say about it, without going into detail on their algorithm or going into detail on the results.
Their technique appears to work well, and beats Linux in their benchmarks.
I'm curious if it will be adopted, or if not, why?
Scalable Read-mostly Synchronization Using Passive Reader-Writer Locks
This is a new kind of Reader-Writer locks.
Some background: rwlocks are designed for when you have many readers, who are allowed to read at the same time, or one writer. These show up everywhere, even in Rust's type system, where you are allowed one `&mut` borrow or many immutable borrows. In this case, however, we want a runtime lock, not a compile time one.
Also, the traditional implementations of rwlocks used in the Linux kernel have some problems, such as increasing latency for writers, causing readers to contend, or not coping with a thread sleeping or being pre-empted in a critical section.
This paper introduces `prwlock`, passive reader-writer locks. It relies on the machine architecture having Total Store Ordering.
Compared to competing locks, `prwlock` has a fast, low latency reader path, and bounded latency on the write path. It also attempts to be easier to use than RCU.
The core of `prwlock` is a 64-bit version variable `ver`. Each writer increases the version and waits until all readers see the change.
This would work alone, but there's a chance of starvation for writers, and any crashed reader (or even one migrating between cores) could cause a deadlock.
We can use Inter-Processor Interrupts (IPIs) to request that straggling readers immediately report their status. Since IPIs are fairly cheap, this works well.
But it's possible that the reader has gone to sleep, and thus would miss the IPI. To avoid that problem, before any reader goes to sleep, it gets converted to an "Active Reader", and the lock maintains a count of how many "active readers" there are. Since sleeping isn't that common, the shared counter in the lock will not be a bottleneck.
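To make the version-and-report mechanism concrete, here's a toy, single-threaded simulation (the structure and all the names are my own sketch, not the paper's API):

```python
# Toy simulation of the passive reader-writer idea: a writer bumps a
# version number and may only proceed once every core has reported
# seeing it and no sleeping ("active") readers remain.

class PassiveRWLock:
    def __init__(self, ncores):
        self.ver = 0
        self.seen = [0] * ncores   # last version each core reported seeing
        self.active_readers = 0    # readers converted before sleeping

    def reader_report(self, core):
        # Normally happens lazily; a waiting writer forces it via an IPI.
        self.seen[core] = self.ver

    def reader_sleep(self, core):
        # A sleeping reader would miss the IPI, so it converts itself to
        # an "active reader" counted in the lock before going to sleep.
        self.reader_report(core)
        self.active_readers += 1

    def reader_wake(self, core):
        self.active_readers -= 1

    def try_write_lock(self):
        self.ver += 1
        for core in range(len(self.seen)):  # "send IPIs" to stragglers
            self.reader_report(core)
        # Writer proceeds once all cores see the new version and no
        # active (sleeping) readers are left.
        return all(s == self.ver for s in self.seen) and self.active_readers == 0
```

Note how the shared `active_readers` counter is only touched on the sleep path, which is why it doesn't become a bottleneck in the common case.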
Look at all the graphs! It's faster, good.
Strangely, it's slow to acquire a writer lock when there are no readers, but `prwlock` is faster than its competitors when there are readers.
This post is on: Utilizing the IOMMU Scalably. https://www.usenix.org/conference/atc15/technical-session/presentation/peleg
Oh cool, DMA + MMU = IOMMU. I didn't know those existed, but I guess I've only dealt with DMA on really low level hardware, like the GBA.
This exists to have the protection of virtual memory while using DMA (direct memory access) to copy buffers around. This is useful for a NIC (Network interface card), and performance matters because we'd like to hit multi-Gb/s of network data transfer.
Currently we have bottlenecks
Solutions:
The problem, restated, is that the NIC wants to DMA stuff, which is much faster than tying up a CPU. This means mapping virtual addresses for the buffers and making them accessible to be DMA'd into. But then we want to remove the mapping later.
Currently, we map a new virtual address for each buffer, and unmap it afterwards. This causes lots of traffic on the TLB.
If we have a static mapping, it's got acceptable performance, but that's not great for security reasons I think? I'm not quite sure why you need to unmap the buffers and keep dynamically allocating them. Oh it's for security from devices.
Linux keeps a queue of invalidations, and then does them all at once, but the batching datastructure is lock protected, and can be a bottleneck.
IOMMUs on x86 are just like the MMU, with 4 levels. There's also an IOTLB, for caching. This must be flushed when we modify any translation.
Oh cool, you can use the IOMMU with virtualization to let the guest OS directly control some device. I've got to learn more about virtualization sometime.
Dynamic mappings are there to protect the OS from devices. Creating and destroying millions of IOVAs a second is slow, who woulda guessed.
Normally Linux operates in deferred invalidation mode: after a request finishes, it happily returns before the mapping has actually been removed from the IOMMU; the invalidation is only queued.
The kmalloc solution mostly works, but it's hard to reclaim the intermediate pages.
4.1 Dynamic Identity Mapping
Ah, so we use the fact the buffers are normally contiguous, and we can use identity (1-to-1) mapping. Since we map and unmap the regions, it's called dynamic identity mapping.
It's got some drawbacks, and isn't great. You need to keep reference counts to pages, and have conflicting access permissions.
4.2 IOVA-kmalloc
Use kmalloc to allocate a small object, and use its physical address as the IOVA. This sorta wastes physical memory, but only about 8 bytes of it for each page of IOVA allocated.
4.3 Scalable IOVA Allocation with Magazines
Basic Idea: per-core cache of previously deallocated IOVA ranges.
This can avoid needing to acquire the global lock.
A common scenario is to have one core allocate multiple IOVAs and another core will deallocate all of them, which could cause a buildup of cached buffers at one core.
A magazine (as in, one holding ammo) is a bundle of M elements, and when a core tries to allocate from an empty magazine, it can grab a full magazine from the depot. Thus the cache miss rate is bounded by 1/M.
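Here is a toy single-threaded Rust sketch of the magazine idea (my own reconstruction with made-up types; the real allocator hands out IOVA ranges from a lock-protected depot, and this models none of the concurrency):

```rust
use std::collections::VecDeque;

const M: usize = 4; // magazine capacity; real systems tune this

// Shared "depot" of full magazines; in the kernel this is the
// lock-protected global state.
struct Depot {
    full: VecDeque<Vec<u64>>,
    next_fresh: u64, // fallback: mint brand-new IOVAs
}

// Per-core private cache: allocations and frees hit this first.
struct CoreCache {
    magazine: Vec<u64>,
}

impl Depot {
    fn refill(&mut self) -> Vec<u64> {
        // Hand out a previously returned full magazine if one exists,
        // otherwise carve a fresh range of M IOVAs.
        self.full.pop_front().unwrap_or_else(|| {
            let start = self.next_fresh;
            self.next_fresh += M as u64;
            (start..start + M as u64).collect()
        })
    }
}

impl CoreCache {
    // The depot (and its lock) is touched at most once per M requests,
    // which is the 1/M miss-rate bound.
    fn alloc(&mut self, depot: &mut Depot) -> u64 {
        if self.magazine.is_empty() {
            self.magazine = depot.refill();
        }
        self.magazine.pop().unwrap()
    }

    fn free(&mut self, depot: &mut Depot, iova: u64) {
        if self.magazine.len() == M {
            // A full magazine goes back to the depot for other cores,
            // handling the allocate-on-one-core, free-on-another pattern.
            depot.full.push_back(std::mem::take(&mut self.magazine));
        }
        self.magazine.push(iova);
    }
}
```

Returning whole magazines to the depot is what stops freed buffers piling up on the freeing core.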
Conclusion: These three techniques are all useful, but have different trade-offs. Still better than stock Linux though, and without any significant security risks(?), so maybe they'll make their way into the kernel?
Managing the memory used to store the page table well is still an open problem.
]]>Note: My apologies, dear readers, I appear to be having some issues with the font, and it's squishing poor Pusheen. If anyone has any ideas, I would like to fix this.
░░░▐▀▄░░░░░░░▄▀▌░░░▄▄▄▄▄▄▄░░░░░░░░░░░░░
░░░▌▒▒▀▄▄▄▄▄▀▒▒▐▄▀▀▒██▒██▒▀▀▄░░░░░░░░░░
░░▐▒▒▒▒▀▒▀▒▀▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▀▄░░░░░░░░
░░▌▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▄▒▒▒▒▒▒▒▒▒▒▒▒▀▄░░░░░░
▀█▒▒▒█▌▒▒█▒▒▐█▒▒▒▀▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▌░░░░░
▀▌▒▒▒▒▒▒▀▒▀▒▒▒▒▒▒▀▀▒▒▒▒▒▒▒▒▒▒▒▒▒▒▐░░░▄▄
▐▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▌▄█▒█
▐▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒█▒█▀░
▐▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒█▀░░░
▐▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▌░░░░
░▌▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▐░░░░░
░▐▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▌░░░░░
░░▌▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▐░░░░░░
░░▐▄▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▄▌░░░░░░
░░░░▀▄▄▀▀▀▀▀▄▄▀▀▀▀▀▀▀▄▄▀▀▀▀▀▄▄▀░░░░░░░░
One of my friends on Facebook posted a pic of a terminal command that printed out the ever-adorable Pusheen.
Someone else said "0/10, not on homebrew", and I knew I had a new mission.
Since this was an incredibly silly mission, I didn't actually expect to get this accepted into homebrew-core, so I made my own "tap", as it is known: my own public set of formulae. The end result is that you can now run:
brew tap tbelaire/silly-things
brew install pusheen
pusheen
And enjoy this on your very own brew-compatible computer.
"How did you perform this magic?" one might ask. I'll be happy to lay out the steps.
First off, create the command. I was super amused by the idea of cat-ing this cat with /bin/cat: a cat in /bin will print out this cat, who is not currently in a bin, but would happily jump into one if the opportunity arose.
So my "script" is just:
#!/bin/cat
░░░▐▀▄░░░░░░░▄▀▌░░░▄▄▄▄▄▄▄░░░░░░░░░░░░░
░░░▌▒▒▀▄▄▄▄▄▀▒▒▐▄▀▀▒██▒██▒▀▀▄░░░░░░░░░░
░░▐▒▒▒▒▀▒▀▒▀▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▀▄░░░░░░░░
░░▌▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▄▒▒▒▒▒▒▒▒▒▒▒▒▀▄░░░░░░
▀█▒▒▒█▌▒▒█▒▒▐█▒▒▒▀▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▌░░░░░
▀▌▒▒▒▒▒▒▀▒▀▒▒▒▒▒▒▀▀▒▒▒▒▒▒▒▒▒▒▒▒▒▒▐░░░▄▄
▐▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▌▄█▒█
▐▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒█▒█▀░
▐▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒█▀░░░
▐▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▌░░░░
░▌▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▐░░░░░
░▐▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▌░░░░░
░░▌▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▐░░░░░░
░░▐▄▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▄▌░░░░░░
░░░░▀▄▄▀▀▀▀▀▄▄▀▀▀▀▀▀▀▄▄▀▀▀▀▀▄▄▀░░░░░░░░
The way this'll work is when /usr/local/bin/pusheen is called, it will see the #!/bin/cat line, and call /bin/cat /usr/local/bin/pusheen, and then print out a cat.
So cute.
Anyways, I then stuck it in a git repository https://github.com/tbelaire/pusheen, and made a release, so homebrew could have a tarball.
Then, it was just
brew create https://github.com/tbelaire/pusheen/archive/v0.1.tar.gz
and after a little editing, I had this file:
# Documentation: https://github.com/Homebrew/homebrew/blob/master/share/doc/homebrew/Formula-Cookbook.md
#                http://www.rubydoc.info/github/Homebrew/homebrew/master/Formula
# PLEASE REMOVE ALL GENERATED COMMENTS BEFORE SUBMITTING YOUR PULL REQUEST!

class Pusheen < Formula
  desc ""
  homepage ""
  url "https://github.com/tbelaire/pusheen/archive/v0.1.tar.gz"
  version "0.1"
  sha256 "f29480b2dbb4eaa7bcb95c5698d44a242f4461965af50dcc98884404c286dbc7"

  def install
    bin.install "bin/pusheen"
  end

  test do
    # `test do` will create, run in and delete a temporary directory.
    #
    # This test will fail and we won't accept that! It's enough to just replace
    # "false" with the main program this formula installs, but it'd be nice if you
    # were more thorough. Run the test with `brew test pusheen`. Options passed
    # to `brew install` such as `--HEAD` also need to be provided to `brew test`.
    #
    # The installed folder is not in the path, so use the entire path to any
    # executables being tested: `system "#{bin}/program", "do", "something"`.
    system "pusheen"
  end
end
You can see my blatant disregard for rules as I so crassly left the generated comments alone. Such barbarism. Oh my.
Anyways, now that it's working, I copied that pusheen.rb
file from /usr/local/Library/Formula/
to my own repository I just made up, tbelaire/homebrew-silly-things, and it was off to the metaphorical races. Homebrew is smart enough to fetch that when we call brew tab tbelaire/silly-things
, so people all around the world can fix the void in their heart with Pusheen without having to leave their terminal.
And that has been your silly abuse of technology for cat related purposes for today.
]]>I first ported a few more complicated examples from TONC, but I would rather introduce one thing at a time for these blog posts, so I've written a cute little etch-a-sketch example that's based off of the first crate. We're going to take input from the user and push the dots around.
It's a little out of order compared to TONC, but you can check out the section on input for more details.
Here's what I've done to main.rs. I've just moved the three points together, and created an x and a y to choose where to draw them.
pub extern "C" fn main(_: i32, _: *const *const i8) -> i32 {
    let mut m = gfx::Mode3::new();
    // Save our copy of the state of the keys.
    let mut keys = input::Input::new();
    // Location of the cursor.
    let mut x: i32 = 120;
    let mut y: i32 = 80;
    // Avoid repeated typecasts.
    let width = gfx::Mode3::WIDTH as i32;
    let height = gfx::Mode3::HEIGHT as i32;
    let colors = [Color::rgb15(31, 0, 0),
                  Color::rgb15(0, 31, 0),
                  Color::rgb15(0, 0, 31)];
    loop {
        // Wait for vsync, so we only draw once per frame.
        gfx::vid_vsync();
        // Save the current state of the keys.
        // This keeps the previous state around,
        // so we can check for button *presses*,
        // and tell that apart from holding the button.
        keys.poll();
        // These are neat little helper functions that encapsulate
        // that pressing Left increases x and Right decreases it.
        // tri_horz() will return -1, 0, or 1.
        x += keys.tri_horz();
        y += keys.tri_vert();
        // This keeps everything positive.
        // Note that just like in C, -1 % 5 == -1, so we need to add width.
        x = (x + width) % width;
        y = (y + height) % height;

        m.dot(x, y, colors[0]);
        m.dot((x + 1) % width, y, colors[1]);
        m.dot((x + 1) % width, (y + 1) % height, colors[2]);
    }
}
We can then draw happy little loops:
I also added the ability to cycle the colors around using the shoulder buttons.
let mut color_ix = 0;
loop {
    // ...
    if keys.hit(Keys::L) {
        color_ix -= 1;
    } else if keys.hit(Keys::R) {
        color_ix += 1;
    }
    color_ix = (colors.len() + color_ix) % colors.len();
    // ...
    m.dot(x, y, colors[color_ix]);
    m.dot((x + 1) % width, y,
          colors[(color_ix + 1) % colors.len()]);
    m.dot((x + 1) % width, (y + 1) % height,
          colors[(color_ix + 2) % colors.len()]);
}
This uses keys.hit instead of keys.pressed or the tribool, so each time you press the button it shifts, but you have to release it before we can do it again.
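The press-versus-hold distinction boils down to one bitwise expression. Here is a standalone Rust sketch of that edge-detection logic (plain integers, no GBA hardware; KEY_R is a made-up constant for illustration):

```rust
// A key "hit" is a bit that is set now but wasn't on the previous poll:
// rising-edge detection on the key bitmask.
fn hit(prev: u32, curr: u32, key: u32) -> bool {
    (!prev & curr) & key != 0
}

// Illustrative flag value for the R shoulder button.
const KEY_R: u32 = 0x0100;
```

Holding a button keeps the bit set in both prev and curr, so the expression only fires on the frame the button goes down.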
Now, let's take a look at the input module that's backing all this.
use ::memmap;
use core::intrinsics::volatile_load;

/// Keys also functions as the flags for the keys.
#[repr(C)]
#[derive(Copy, Clone, Debug)]
pub enum Keys {
    A      = 0x0001,
    B      = 0x0002,
    Select = 0x0004,
    Start  = 0x0008,
    Right  = 0x0010,
    Left   = 0x0020,
    Up     = 0x0040,
    Down   = 0x0080,
    R      = 0x0100,
    L      = 0x0200,
}

/// The OR of all the keys.
pub const KEY_MASK: u32 = 0x03FF;

// ...

#[derive(Debug)]
pub struct Input {
    prev: u32,
    curr: u32,
}
We've got a nice flags enum in here, and an Input struct. This replaces the global in TONC, and we'll just allocate it in main and pass it down to where it's needed. Note that it's not Copy, since we really don't need it to be, and having a stale copy of it doesn't actually seem too useful.
fn bit_tribool(bits: u32, negative: KeyIndex, positive: KeyIndex) -> i32 {
    ((bits >> positive as u32) & 1) as i32
        - ((bits >> negative as u32) & 1) as i32
}

impl Input {
    /// You should only need one copy of this struct.
    pub fn new() -> Input {
        Input { prev: 0, curr: 0 }
    }

    /// This should be called once a frame.
    pub fn poll(&mut self) {
        self.prev = self.curr;
        self.curr = unsafe { !(volatile_load(memmap::REG_KEYINPUT) as u32) }
            & KEY_MASK;
    }

    /// hit checks if the key is now pressed, but wasn't before.
    pub fn hit(&mut self, k: Keys) -> bool {
        (!self.prev & self.curr) & (k as u32) != 0
    }

    // ...

    /// This family of functions returns -1, 0, or 1.
    /// tri_horz is 1 when Left is pressed, and -1 when Right is.
    pub fn tri_horz(&mut self) -> i32 {
        bit_tribool(self.curr, KeyIndex::Left, KeyIndex::Right)
    }

    /// tri_vert is 1 when Up is pressed, and -1 when Down is.
    pub fn tri_vert(&mut self) -> i32 {
        bit_tribool(self.curr, KeyIndex::Up, KeyIndex::Down)
    }

    // ...
}
You can check out the full version of input.rs on github.
Let's take a closer look at .poll().
pub fn poll(&mut self) {
    self.prev = self.curr;
    self.curr = unsafe { !(volatile_load(memmap::REG_KEYINPUT) as u32) }
        & KEY_MASK;
}
We save the current set of keys in prev, and then do a volatile_load from the REG_KEYINPUT register. We then immediately flip all the bits, since the GBA hardware actually clears the bits when buttons are pressed, which is weird, and we don't want to have to think about that. We also mask it with KEY_MASK, which is just the OR of all the keys' flags.
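The invert-and-mask step can be checked in isolation. A small Rust sketch with a fake register value (my own example; 0x03FE is what the register would read with only A held, since the hardware uses 0 = pressed):

```rust
/// The OR of all ten key flags, as in the input module.
const KEY_MASK: u32 = 0x03FF;
/// Flag for the A button.
const KEY_A: u32 = 0x0001;

// The GBA clears a bit while its button is held, so inverting the raw
// 16-bit read and masking off the unused high bits yields the usual
// "1 = pressed" convention.
fn decode_keys(raw: u16) -> u32 {
    !(raw as u32) & KEY_MASK
}
```

The mask matters: without it, the inverted unused high bits of the register would all read as "pressed".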
I've copied masses of constants from TONC, for example:
// memmap.rs
pub const MEM_IO : u32 = 0x04000000;
// ...
pub const REG_BASE: u32 = MEM_IO;
// ...
pub const REG_KEYINPUT: *mut u16 = (REG_BASE + 0x0130) as *mut u16; // Key status