
(Potential) memory leak with uproot.iterate #1528

@pfackeldey

Description

(This was tested with uproot v5.6.5; the issue is likely present in other versions as well.)

The issue & reproducer

When I run the following snippet and benchmark it with memray:

import uproot

def loop_iterate(rootfile):
    with uproot.open(rootfile, array_cache=None, object_cache=None) as f:
        tree = f["tree"]
        for batch in tree.iterate(step_size="200 MB", library="ak"):
            print(repr(batch))


if __name__ == "__main__":
    loop_iterate("~/Downloads/zlib9-jagged0.root")

I see physical RAM usage (RSS) of up to 1.6 GB, even though the step size is 200 MB:

[memray profile: RSS grows to ~1.6 GB over the iteration]
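
For anyone who wants to reproduce these profiles: memray can be used either through its CLI (memray run, then memray flamegraph) or through its Python Tracker context manager. Below is a minimal sketch of the latter, wrapping the loop_iterate reproducer from above; the exact invocation used for the plots here may have differed:

import memray

if __name__ == "__main__":
    # writes an allocation trace next to the script; render it afterwards
    # with `memray flamegraph loop_iterate.bin`
    with memray.Tracker("loop_iterate.bin"):
        loop_iterate("~/Downloads/zlib9-jagged0.root")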

This is already surprising. Another indication that something is off is that an explicit gc.collect() at the end of each iteration improves the RSS situation by roughly 2x:

import uproot
+import gc

def loop_iterate(rootfile):
    with uproot.open(rootfile, array_cache=None, object_cache=None) as f:
        tree = f["tree"]
        for batch in tree.iterate(step_size="200 MB", library="ak"):
            print(repr(batch))
+           del batch
+           gc.collect()


if __name__ == "__main__":
    loop_iterate("~/Downloads/zlib9-jagged0.root")

which brings the RSS consumption down to about 800 MB at peak:

[memray profile: RSS peaks at ~800 MB with the explicit gc.collect()]

Why is this bad?

RSS is the physical RAM used by this process, and it is what dask monitors to decide whether a worker should be killed for going over its memory limit (OOM) or not.
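
For a quick cross-check outside of memray, that same RSS number can be printed from inside the loop, together with the size of the buffers each batch actually holds. This is just a sketch using psutil and awkward's nbytes; it is not part of the original reproducer:

import os

import psutil
import uproot


def loop_iterate_with_rss(rootfile):
    proc = psutil.Process(os.getpid())
    with uproot.open(rootfile, array_cache=None, object_cache=None) as f:
        tree = f["tree"]
        for batch in tree.iterate(step_size="200 MB", library="ak"):
            # nbytes counts only the buffers backing the awkward array,
            # so it should stay close to the requested 200 MB step size
            batch_mb = batch.nbytes / 1e6
            # memory_info().rss is (roughly) the resident set size that
            # dask's worker memory monitor watches
            rss_mb = proc.memory_info().rss / 1e6
            print(f"batch holds {batch_mb:.0f} MB, process RSS is {rss_mb:.0f} MB")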

What I've found so far...

The memory usage grows inside the following function: https://github.com/scikit-hep/uproot5/blob/main/src/uproot/behaviors/TBranch.py#L1440-L1452 and, more specifically, inside this part of it: https://github.com/scikit-hep/uproot5/blob/main/src/uproot/behaviors/TBranch.py#L3421-L3428

What does work correctly: the arrays dictionary filled by the above function is ~200 MB, which is good. However, _ranges_or_baskets_to_arrays still uses ~800 MB to fill that ~200 MB arrays dict and does not free that memory again afterwards.

Also, the "popper" trick that @jpivarski introduced in #1305 is what enables the manual gc.collect() to help here; without it, even that wouldn't help.
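
To make that more concrete (this is a generic sketch of the pattern as I understand it, not uproot's actual code from #1305): instead of iterating over a container, which keeps every element alive until the container itself goes away, items are popped off as they are consumed, so the last reference to each basket disappears as soon as it has been processed:

def decompress(raw_basket):
    # stand-in for the real per-basket work (decompression, interpretation, ...)
    return bytes(raw_basket)


def consume_by_popping(pending):
    # pending is a list of raw baskets; popping (rather than looping over it)
    # drops the list's reference to each basket right away, so it becomes
    # collectable as soon as it has been turned into an array
    results = []
    while pending:
        results.append(decompress(pending.pop()))
    return results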

So, my understanding right now is that uproot.iterate does yield correctly sized arrays, but it uses way too much memory while doing so and also doesn't free it properly.

Other implications

_ranges_or_baskets_to_arrays is also used in other loading functions, and some quick tests showed that the following two loops:

def loop_manual(rootfile):
    with uproot.open(rootfile, array_cache=None, object_cache=None) as f:
        tree = f["tree"]

        # split the tree into ~20 consecutive entry ranges and read each one
        # with tree.arrays
        starts = list(range(0, tree.num_entries, tree.num_entries // 20))
        stops = starts[1:] + [tree.num_entries]
        ranges = list(zip(starts, stops))
        for start, stop in ranges:
            print(f"entry {start} to {stop}")
            entry = tree.arrays(entry_start=start, entry_stop=stop, library="ak")
            print(repr(entry))


def loop_same_chunks(rootfile):
    with uproot.open(rootfile, array_cache=None, object_cache=None) as f:
        tree = f["tree"]

        # read the same fixed entry range over and over
        chunk_starts = 0
        chunk_stops = 53687091
        for _ in range(10):
            print(f"entry {chunk_starts} to {chunk_stops}")
            entry = tree.arrays(entry_start=chunk_starts, entry_stop=chunk_stops, library="ak")
            print(repr(entry))

have a similar memory behavior, see e.g. the profile for loop_manual (the numerical values on the y axis are of course different because I can't exactly mirror the "200 MB" steps by hand):

[memray profile for loop_manual]

and for loop_same_chunks:

[memray profile for loop_same_chunks]

What I want to see / was expecting

The orange and blue lines should overlap and roughly follow a sawtooth shape with 200 MB jumps per iteration (and not much additional overhead in RAM).


This was originally found by @oshadura in the scope of the integration challenge; here I'm just attaching a local reproducer with some first findings.

cc @oshadura @alexander-held
