[Discussion] Design goal for v1.0 #361

@Moelf

This is a broader discussion that covers #319 and #314.

There are two questions:

  • what features / behavior we want for TTree/RNTuple reading
  • what the default behavior should be

I think there are essentially two kinds of behavior when it comes to accessing the data (let's put aside application-level behavior mapping, such as what EDM4hep.jl needs, and focus only on the data-access aspect of those applications):

  1. one-row at a time iteration over the entire table
  2. columnar style

1.a

for evt in df
    evt.Muon.pt
end

2.a

df.Muon.pt

But you can also have batched variations of 1 and 2:

1.b

for sub_df in Tables.partitions(df), evt in sub_df
    evt.Muon.pt
end

2.b

for sub_df in Tables.partitions(df)
    sub_df.Muon.pt
end
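
To make the four patterns concrete without opening a ROOT file, here is a minimal sketch on a toy table. It is not UnROOT's implementation: the table is a plain NamedTuple of columns, the flat column name `Muon_pt` stands in for `Muon.pt`, and `nrows`, `row` and `partitions` are hypothetical helpers (the last one only mimics what `Tables.partitions` would give you).

```julia
# Toy stand-in for a tree (assumption: NOT a LazyTree).
# Each entry of Muon_pt holds the muon pT values of one event.
df = (Muon_pt = [[10.0, 20.0], [5.0], Float64[], [7.5, 2.5, 1.0]],)

# Hypothetical helpers, defined here only for illustration.
nrows(t) = length(first(t))
row(t, i) = map(col -> col[i], t)   # one event as a NamedTuple
partitions(t, n) =
    (map(col -> view(col, r), t) for r in Iterators.partition(1:nrows(t), n))

# 1.a: one row at a time over the entire table
function sum_rows(t)
    total = 0.0
    for i in 1:nrows(t)
        evt = row(t, i)
        total += sum(evt.Muon_pt)
    end
    return total
end

# 2.a: columnar access to the whole column
sum_columns(t) = sum(sum, t.Muon_pt)

# 1.b: row iteration inside each batch
function sum_batched_rows(t, n)
    total = 0.0
    for sub in partitions(t, n), i in 1:nrows(sub)
        total += sum(row(sub, i).Muon_pt)
    end
    return total
end

# 2.b: columnar access inside each batch
sum_batched_columns(t, n) = sum(sum_columns(sub) for sub in partitions(t, n))
```

All four functions compute the same quantity; they differ only in how much of the table has to be materialized at once and in where the loop lives.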

Currently, we have been developing this package on the assumption that users want 1.a (a non-batched for loop), and all the internal designs are geared towards that. For example:

  • iterate(LazyTree) is type stable, which causes high latency when constructing the tree
  • each TBranch/Field carries N caches, which can cause high memory use
  • each TBranch/Field carries N locks (N = number of threads), which adds overhead when indexing; see benchmark
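
As a rough illustration of where the memory and indexing costs in the last two bullets come from, here is a sketch of a column that keeps one cache and one lock per slot. Everything in it is hypothetical (`PerThreadColumn`, `BASKET`, `getvalue`, and the explicit `slot` argument standing in for a thread index); it is not how TBranch/Field is actually laid out.

```julia
const BASKET = 4   # hypothetical number of entries decoded at once

# Hypothetical layout: N caches and N locks per column.
struct PerThreadColumn
    data::Vector{Float64}              # stands in for on-disk storage
    caches::Vector{Vector{Float64}}    # N buffers, so memory grows with N
    ranges::Vector{UnitRange{Int}}     # which entries each cache holds
    locks::Vector{ReentrantLock}       # N locks, taken on every index
end

PerThreadColumn(data, n) = PerThreadColumn(
    data,
    [Float64[] for _ in 1:n],
    [1:0 for _ in 1:n],
    [ReentrantLock() for _ in 1:n],
)

function getvalue(col::PerThreadColumn, i::Int, slot::Int)
    # Every single lookup pays for a lock/unlock, even on a cache hit.
    lock(col.locks[slot]) do
        r = col.ranges[slot]
        if !(i in r)
            # Cache miss: "decode" the basket containing entry i.
            start = BASKET * ((i - 1) ÷ BASKET) + 1
            r = start:min(start + BASKET - 1, length(col.data))
            col.caches[slot] = col.data[r]
            col.ranges[slot] = r
        end
        return col.caches[slot][i - first(r) + 1]
    end
end

col = PerThreadColumn(collect(1.0:10.0), 2)
getvalue(col, 5, 1)    # fills the cache of slot 1 with entries 5:8
getvalue(col, 10, 2)   # fills the cache of slot 2 with entries 9:10
```

With many columns and many threads, the number of live buffers scales as columns × N, and the lock is on the hot path of every element access.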

Proposal

I think we could reduce the latency and memory problems by dropping, by default, both the type stability of iteration and the per-thread buffers and locks. If I could drop only one, I'd drop the per-thread buffers and locks and recommend 1.b as the default pattern.
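
One way the 1.b pattern might look under multithreading, as a sketch only: each batch is handled by its own task with a task-local buffer, so no per-thread cache or lock on the column is needed. The toy vector `pt`, the function `batched_sum`, and the use of `Iterators.partition` in place of `Tables.partitions(df)` are assumptions made for illustration.

```julia
# Toy column; in practice each batch would come from Tables.partitions(df).
pt = collect(1.0:100.0)

function batched_sum(col, batchsize)
    tasks = map(Iterators.partition(eachindex(col), batchsize)) do r
        Threads.@spawn begin
            buf = col[r]      # task-local buffer: nothing shared, nothing locked
            s = 0.0
            for x in buf      # plain row loop inside the batch (pattern 1.b)
                s += x
            end
            s
        end
    end
    return sum(fetch, tasks)  # combine per-batch results
end

batched_sum(pt, 25)
```

The buffer lives only as long as its batch, so peak memory is bounded by the number of batches in flight rather than by columns × threads.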
