-
Notifications
You must be signed in to change notification settings - Fork 2
Description
One of the innovations in odgi's file format is that it contains a steps-by-segment index (I will explain what this is below), which can speed up certain operations on variation graphs. FlatGFA does not have this. The plan for this project would be to implement this auxiliary index and measure its impact on performance.
Recall that paths consist of steps, and each step "crosses" a given segment. FlatGFA stores a list of steps for each path; you can look at these steps to see which segments are involved. However, if you want to find all the paths that cross a particular segment, it's kinda slow: you have to look at every path in the entire graph and walk down its step list to see if there is a step for the segment of interest. The steps-by-segment index is meant to speed up this operation. It simply stores, for every segment, a list of every step from any path that traverses that segment. Notice that this information is redundant: we are storing something that we already "know" from a different data source (the list of steps for each path). But this redundancy is good for performance when you need it. (That's why I'm calling it an "index," like in databases: it's a redundant data structure that makes certain queries fast.)
In Rusty terms, here is what a steps-by-segment index looks like:
struct StepRef {
path: Id<Path>; // The path that this step is from.
offset: u32; // The offset within this path where this step occurs.
}
steps: Vec<StepRef>,
segment_steps: Vec<Span<StepRef>>,Basically, the idea is that StepRef is what you need to refer to a specific step in a specific path. We have one giant, flat pool of StepRefs, and every segment records a range in that pool for all the steps that cross that segment.
Let's try adding this index into FlatGFA in a few steps:
- Build an in-memory steps-by-segment index. Here, we don't change the file format at all. Instead, we just provide a function that builds a steps-by-segment index in memory to support fast queries. So if you want to use this, you have to first explicitly construct the index and then query it.
- Try this out in
odgi depth, which is a best-case scenario for benefitting from the index (it just counts the size of theSpan<StepRef>for each segment). Measure the performance increase when the index exists, and measure the time taken to construct the index. Is it a win end-to-end (doe the speed benefit outweigh the cost of index construction)? - Consider inventing an on-disk way to store the steps-by-segment index. It would be great if this were optional, i.e., FlatGFA files could decide whether they want to include the index or not. For files that do opt into storing the index, this means that commands like
odgi depthdon't have to construct the index first; they can use it directly. - Add Python bindings for the index.
- Check in with our biologist collaborators to understand other use cases for the index, including interesting set operations on path sets.