Skip to content

Remove the 32-bit limit on step counts #226

@sampsyo

Description

@sampsyo

We use the u32 type for indices throughout FlatGFA. This limits the number of things we can have in our pools to $2^{32}-1$, or about 4 billion. 4 billion is probably always going to be more than enough paths, segments, and links. But there exist GFAs out there with more than 4 billion total path steps. As in, there are a lot of paths, and no single path has more than 4 billion steps on its own, but if you add all the steps together from all the paths, that comes out to be more than 4 billion. If you try to convert such a GFA to FlatGFA, we would crash.

(I also think it's plausible that there could be a GFA that has more than 4 billion total nucleotides in all its segments, but that hasn't come up yet, so we won't consider it here. But it would be nice if whatever approach we come up with for this immediate problem could also hypothetically apply to that problem.)

We would like to extend FlatGFA so it can support these larger variation graphs. I can think of a few potential approaches:

  • Just switch to u64s for step indices. This is sort of unsatisfying because it seems wasteful for small graphs, but it would be simple.
  • Configurably use u32 step indices for small graphs and u64 indices for big graphs. The idea would be that the file's header contains a flag saying which representation to use, and when you read a file, you look at that flag and then adapt all your indexing accordingly. This would avoid wasting space for small graphs while still adapting to any size graph. It seems tricky to implement, though.
  • Switch to using path-local indices. Under the assumption that no single path will ever have more than 4 billion steps, let's avoid storing the global index of each step and instead store the local index: i.e., the u32 offset from the start of the path's steps. This would require a u64 index for the "starting step" for each path, and we would need a fast/convenient way to convert between u32 local offsets and u64 global offsets.

And there are probably other creative strategies I haven't thought of yet. This project will involve evaluating these various strategies and implementing one of them. Namely:

  • Pick an approach from above (or make up a new one).
  • Implement it.
  • Obtain some big GFAs for testing to make sure it works.
  • Measure the impact on file size and performance, for both big and small graphs.
  • If we are happy with the trade-offs, merge the strategy into main.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions