-
Notifications
You must be signed in to change notification settings - Fork 2
Description
We use the u32 type for indices throughout FlatGFA. This limits the number of things we can have in our pools to
(I also think it's plausible that there could be a GFA that has more than 4 billion total nucleotides in all its segments, but that hasn't come up yet, so we won't consider it here. But it would be nice if whatever approach we come up with for this immediate problem could also hypothetically apply to that problem.)
We would like to extend FlatGFA so it can support these larger variation graphs. I can think of a few potential approaches:
- Just switch to
u64s for step indices. This is sort of unsatisfying because it seems wasteful for small graphs, but it would be simple. - Configurably use
u32step indices for small graphs andu64indices for big graphs. The idea would be that the file's header contains a flag saying which representation to use, and when you read a file, you look at that flag and then adapt all your indexing accordingly. This would avoid wasting space for small graphs while still adapting to any size graph. It seems tricky to implement, though. - Switch to using path-local indices. Under the assumption that no single path will ever have more than 4 billion steps, let's avoid storing the global index of each step and instead store the local index: i.e., the
u32offset from the start of the path's steps. This would require au64index for the "starting step" for each path, and we would need a fast/convenient way to convert betweenu32local offsets andu64global offsets.
And there are probably other creative strategies I haven't thought of yet. This project will involve evaluating these various strategies and implementing one of them. Namely:
- Pick an approach from above (or make up a new one).
- Implement it.
- Obtain some big GFAs for testing to make sure it works.
- Measure the impact on file size and performance, for both big and small graphs.
- If we are happy with the trade-offs, merge the strategy into
main.