Measuring File Locality

Let's analyze whether file accesses show significant locality in the namespace.

To understand better whether the heuristics mentioned in the last lesson make sense, let’s analyze some traces of file system access and see if indeed there is namespace locality. For some reason, there doesn’t seem to be a good study of this topic in the literature.

Specifically, we’ll use the SEER traces and analyze how “far away” file accesses were from one another in the directory tree. (See “The Design of the SEER Predictive Caching System” by G. H. Kuenning, MOBICOMM ’94, Santa Cruz, California, December 1994. According to Kuenning, this is the best overview of the SEER project, which led to, among other things, the collection of these traces.) For example, if file f is opened, and then re-opened next in the trace (before any other files are opened), the distance between these two opens in the directory tree is zero (as they are the same file). If a file f in directory dir (i.e., dir/f) is opened and followed by an open of file g in the same directory (i.e., dir/g), the distance between the two file accesses is one, as they share the same directory but are not the same file. Our distance metric, in other words, measures how far up the directory tree you have to travel to find the common ancestor of two files; the closer they are in the tree, the lower the metric.
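To make the metric concrete, here is a minimal Python sketch. The function name path_distance is hypothetical, and the sketch assumes slash-separated paths. The text does not say how to treat two files at different depths, so the sketch assumes the climb is counted from the deeper file.

```python
def path_distance(a: str, b: str) -> int:
    """Levels to climb from a file to the common ancestor of a and b.

    Assumption: when the two paths differ in depth, count the climb
    from the deeper one (the text does not cover this case).
    """
    pa = [c for c in a.split("/") if c]
    pb = [c for c in b.split("/") if c]

    # Count the leading path components the two files share.
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1

    return max(len(pa), len(pb)) - common


print(path_distance("dir/f", "dir/f"))                    # 0: same file
print(path_distance("dir/f", "dir/g"))                    # 1: same directory
print(path_distance("proj/src/foo.c", "proj/obj/foo.o"))  # 2: common ancestor is proj
```

Treating the file name as the last path component is what makes an identical path score zero and two siblings score one.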

The graph above shows the locality observed in the SEER traces over all workstations in the SEER cluster over the entirety of all traces. The graph plots the distance metric along the x-axis and shows the cumulative percentage of file opens at or below that distance along the y-axis. Specifically, for the SEER traces (marked “Trace” in the graph), you can see that about 7% of file accesses were to the file that was opened previously and that nearly 40% of file accesses were to either the same file or to one in the same directory (i.e., a distance of zero or one). Thus, the FFS locality assumption seems to make sense (at least for these traces).
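A cumulative curve like this can be computed from any ordered list of opened paths. The sketch below is illustrative only: the names path_distance and cumulative_locality and the five-entry toy trace are hypothetical, and it is not the tooling used to analyze the SEER traces. It assumes the distance is counted from the deeper of the two files.

```python
from collections import Counter


def path_distance(a, b):
    """Levels to climb to the common ancestor (assumed: from the deeper file)."""
    pa = [c for c in a.split("/") if c]
    pb = [c for c in b.split("/") if c]
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return max(len(pa), len(pb)) - common


def cumulative_locality(trace):
    """Map each distance d to the percentage of opens with distance <= d."""
    distances = [path_distance(p, q) for p, q in zip(trace, trace[1:])]
    if not distances:
        return {}
    counts = Counter(distances)
    total = len(distances)
    result, running = {}, 0
    for d in range(max(counts) + 1):
        running += counts[d]  # Counter returns 0 for distances never seen
        result[d] = 100.0 * running / total
    return result


# Hypothetical toy trace, not SEER data.
toy_trace = [
    "proj/src/foo.c",
    "proj/src/foo.c",   # distance 0 from the previous open
    "proj/src/bar.c",   # distance 1
    "proj/obj/foo.o",   # distance 2
    "proj/obj/bar.o",   # distance 1
]
print(cumulative_locality(toy_trace))  # {0: 25.0, 1: 75.0, 2: 100.0}
```

Reading the result the same way as the graph, the value at distance one is the share of opens that hit either the same file or a file in the same directory.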

Interestingly, another 25% or so of file accesses were to files that had a distance of two. This type of locality occurs when the user has structured a set of related directories in a multi-level fashion and consistently jumps between them. For example, if a user has a src directory and builds object files (.o files) into an obj directory, and both of these directories are sub-directories of a main proj directory, a common access pattern will be proj/src/foo.c followed by proj/obj/foo.o. The distance between these two accesses is two, as proj is the common ancestor. FFS does not capture this type of locality in its policies, and thus more seeking will occur between such accesses.

For comparison, the graph also shows locality for a “Random” trace. The random trace was generated by selecting files from within an existing SEER trace in random order, and calculating the distance metric between these randomly-ordered accesses. As you can see, there is less namespace locality in the random trace, as expected. However, because every pair of files eventually shares a common ancestor (e.g., the root), there is some locality, and thus random is useful as a comparison point.
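One way to build such a baseline is to shuffle the order of an existing trace and recompute the distances between consecutive opens. The sketch below takes "selecting files in random order" to mean a full shuffle, which is an assumption. The helper names, the fixed seed, and the toy trace are all hypothetical.

```python
import random


def path_distance(a, b):
    """Levels to climb to the common ancestor (assumed: from the deeper file)."""
    pa = [c for c in a.split("/") if c]
    pb = [c for c in b.split("/") if c]
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return max(len(pa), len(pb)) - common


def consecutive_distances(trace):
    """Distance between each open and the open that preceded it."""
    return [path_distance(p, q) for p, q in zip(trace, trace[1:])]


def random_baseline(trace, seed=0):
    """Return the same opens in shuffled order (seeded for repeatability)."""
    shuffled = list(trace)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Hypothetical toy trace, not SEER data.
toy_trace = [
    "proj/src/foo.c", "proj/src/bar.c", "proj/obj/foo.o",
    "docs/notes/todo.txt", "docs/notes/ideas.txt", "proj/obj/bar.o",
]
print("trace: ", consecutive_distances(toy_trace))
print("random:", consecutive_distances(random_baseline(toy_trace)))
```

Because every path here has three components, no distance in either list can exceed three, which is the "some locality" floor that shared ancestors impose even on a random ordering.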
