Remove the 32-bit limit on step counts

We use the `u32` type for indices [throughout FlatGFA](https://github.com/cucapra/pollen/blob/e0cf9284b411fe201e8168b8e4d978765e1cfc9c/flatgfa/src/pool.rs#L11). This limits the number of things we can have in our pools to $2^{32}-1$, or about 4 billion. 4 billion is probably always going to be more than enough paths, segments, and links. But there exist GFAs out there with more than 4 billion total _path steps_. As in, there are a lot of paths, and no single path has more than 4 billion steps on its own, but if you add all the steps together from all the paths, _that_ comes out to be more than 4 billion. If you try to convert such a GFA to FlatGFA, we would crash.

(I also think it's plausible that there could be a GFA that has more than 4 billion total nucleotides in all its segments, but that hasn't come up yet, so we won't consider it here. But it would be nice if whatever approach we come up with for this immediate problem could also hypothetically apply to that problem.)

We would like to extend FlatGFA so it can support these larger variation graphs. I can think of a few potential approaches:

* Just switch to `u64`s for step indices. This is sort of unsatisfying because it seems wasteful for small graphs, but it would be simple.
* Configurably use `u32` step indices for small graphs and `u64` indices for big graphs. The idea would be that the file's header contains a flag saying which representation to use, and when you read a file, you look at that flag and then adapt all your indexing accordingly. This would avoid wasting space for small graphs while still adapting to any size graph. It seems tricky to implement, though.
* Switch to using path-local indices. Under the assumption that no single path will ever have more than 4 billion steps, let's avoid storing the _global_ index of each step and instead store the _local_ index: i.e., the `u32` offset from the start of the path's steps. This would require a `u64` index for the "starting step" for each path, and we would need a fast/convenient way to convert between `u32` local offsets and `u64` global offsets.

And there are probably other creative strategies I haven't thought of yet. This project will involve evaluating these various strategies and implementing one of them. Namely:

* [ ] Pick an approach from above (or make up a new one).
* [ ] Implement it.
* [ ] Obtain some big GFAs for testing to make sure it works.
* [ ] Measure the impact on file size and performance, for both big and small graphs.
* [ ] If we are happy with the trade-offs, merge the strategy into `main`.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Remove the 32-bit limit on step counts #226

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Remove the 32-bit limit on step counts #226

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions