Description
(this is tested with uproot v5.6.5; the issue is likely present in other versions as well).
The issue & reproducer
When running the following snippet and benchmarking it with memray:

```python
import uproot

def loop_iterate(rootfile):
    with uproot.open(rootfile, array_cache=None, object_cache=None) as f:
        tree = f["tree"]
        for batch in tree.iterate(step_size="200 MB", library="ak"):
            print(repr(batch))

if __name__ == "__main__":
    loop_iterate("~/Downloads/zlib9-jagged0.root")
```

I'm getting a resident memory (RSS) consumption of up to 1.6 GB (even though the step size is 200 MB):
This is already surprising; another indication that something is off is that an explicit gc.collect() at the end of each iteration improves the RSS situation by ~2x, i.e.:
```diff
 import uproot
+import gc

 def loop_iterate(rootfile):
     with uproot.open(rootfile, array_cache=None, object_cache=None) as f:
         tree = f["tree"]
         for batch in tree.iterate(step_size="200 MB", library="ak"):
             print(repr(batch))
+            del batch
+            gc.collect()

 if __name__ == "__main__":
     loop_iterate("~/Downloads/zlib9-jagged0.root")
```

which gives up to 800 MB RSS consumption:
Why is this bad?
RSS is the physical RAM used by this process, which dask monitors to decide whether a worker should be killed due to OOM or not.
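For a quick cross-check without memray, one can also sample RSS directly once per iteration; a minimal sketch, assuming psutil is available (it is not part of the reproducer above):

```python
import psutil
import uproot

def loop_iterate_rss(rootfile):
    proc = psutil.Process()  # the current process
    with uproot.open(rootfile, array_cache=None, object_cache=None) as f:
        tree = f["tree"]
        for i, batch in enumerate(tree.iterate(step_size="200 MB", library="ak")):
            # RSS is the number dask's memory monitor acts on
            rss_gb = proc.memory_info().rss / 1e9
            print(f"iteration {i}: RSS = {rss_gb:.2f} GB")
```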
What I've found so far...
The memory usage grows in the following function: https://github.com/scikit-hep/uproot5/blob/main/src/uproot/behaviors/TBranch.py#L1440-L1452 and, more specifically, in this part of it: https://github.com/scikit-hep/uproot5/blob/main/src/uproot/behaviors/TBranch.py#L3421-L3428
One thing that does work correctly: the arrays dictionary filled by the above function is ~200 MB, that's good! However, _ranges_or_baskets_to_arrays still uses ~800 MB to fill that ~200 MB arrays dict and does not free the difference again.
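Schematically, the pattern I suspect is the usual one where temporaries outlive their usefulness: decompressed basket buffers stay referenced until the whole call returns, so the peak is temporaries plus finished arrays. A self-contained sketch of that pattern (illustrative only, not uproot's actual code):

```python
import numpy as np

def fill_arrays(num_baskets=8, basket_size=100_000_000):
    # stand-ins for decompressed basket buffers: large temporaries
    decompressed = [np.zeros(basket_size, dtype=np.uint8) for _ in range(num_baskets)]
    arrays = {}
    for i, buf in enumerate(decompressed):
        arrays[i] = buf[::4].copy()  # stand-in for the (smaller) finished array
        # 'decompressed' still references 'buf', so the temporary cannot be
        # freed here; dropping the reference would release it immediately:
        # decompressed[i] = None
    # until the function returns, temporaries (~800 MB) and results (~200 MB)
    # coexist, which matches the observed peak
    return arrays
```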
Also, the "popper-trick" that @jpivarski introduced in #1305 is what enables the manual gc.collect to help here (without it, even that won't help).
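For context, my paraphrase of that popper idea (not the actual implementation from #1305): yield elements by popping them out of the container, so the generator itself does not keep them alive after the consumer has seen them:

```python
def popper(results):
    # yield by popping, so this generator drops its own reference to each
    # array as soon as it is handed to the consumer; without this, 'results'
    # would pin every yielded array for the generator's whole lifetime and
    # even an explicit gc.collect() could not reclaim them
    while results:
        yield results.pop()
```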
So, my understanding right now is that uproot.iterate does yield correctly sized arrays, but it uses way too much memory while doing so and also doesn't free that memory properly.
Other implications
_ranges_or_baskets_to_arrays is also used in other loading functions, and some quick tests showed that the following two variants:
```python
def loop_manual(rootfile):
    with uproot.open(rootfile, array_cache=None, object_cache=None) as f:
        tree = f["tree"]
        starts = list(range(0, tree.num_entries, tree.num_entries // 20))
        stops = starts[1:] + [tree.num_entries]
        ranges = list(zip(starts, stops))
        for start, stop in ranges:
            print(f"entry {start} to {stop}")
            entry = tree.arrays(entry_start=start, entry_stop=stop, library="ak")
            print(repr(entry))

def loop_same_chunks(rootfile):
    with uproot.open(rootfile, array_cache=None, object_cache=None) as f:
        tree = f["tree"]
        chunk_starts = 0
        chunk_stops = 53687091
        for _ in range(10):
            print(f"entry {chunk_starts} to {chunk_stops}")
            entry = tree.arrays(entry_start=chunk_starts, entry_stop=chunk_stops, library="ak")
            print(repr(entry))
```

have a similar memory behavior, see e.g. the profile for loop_manual (the numerical values on the y axis are of course different because I can't exactly mirror "200 MB" steps by hand):
and for loop_same_chunks:
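As an aside, the manual loop could mirror iterate's 200 MB steps more closely via TTree.num_entries_for, which converts a memory size into an entry count; a sketch (I haven't profiled this variant):

```python
import uproot

def loop_manual_200mb(rootfile):
    with uproot.open(rootfile, array_cache=None, object_cache=None) as f:
        tree = f["tree"]
        # number of entries that corresponds to roughly 200 MB
        step = tree.num_entries_for("200 MB")
        for start in range(0, tree.num_entries, step):
            stop = min(start + step, tree.num_entries)
            print(f"entry {start} to {stop}")
            entry = tree.arrays(entry_start=start, entry_stop=stop, library="ak")
            print(repr(entry))
```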
What I want to see / was expecting
The orange and blue lines overlap and roughly follow a sawtooth shape with 200 MB jumps per iteration (and not much additional overhead in RAM).
This was originally found by @oshadura in the scope of the integration challenge; here I just attach a local reproducer with some first findings.