Description
After opening a file, if a user accesses the num_frames of a scan, tifffile iterates over each page to find its offset (see step 2 in the "Details of data loading" section of the readme). This turns out to be very slow over the network (almost 200x slower than when the file is local):
In [13]: f2 = tifffile.TiffFile('/mnt/scratch06/Two-Photon/taliah/2019-04-03_12-41-44/21067_10_00003_00001.tif') # over the network
In [14]: cProfile.run('n2 = len(f2.pages)')
240111 function calls (240109 primitive calls) in 28.641 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 28.641 28.641 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 tifffile.py:2035(filehandle)
1 0.287 0.287 28.641 28.641 tifffile.py:3375(_seek)
1 0.000 0.000 28.641 28.641 tifffile.py:3567(__len__)
40000 0.053 0.000 28.080 0.001 tifffile.py:5570(read)
40001 0.065 0.000 0.209 0.000 tifffile.py:5662(seek)
19999 0.010 0.000 0.010 0.000 tifffile.py:5704(size)
1 0.000 0.000 0.000 0.000 tifffile.py:5708(closed)
40000 0.049 0.000 0.049 0.000 {built-in method _struct.unpack}
1 0.000 0.000 28.641 28.641 {built-in method builtins.exec}
101 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
3/1 0.000 0.000 28.641 28.641 {built-in method builtins.len}
19999 0.006 0.000 0.006 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
40000 28.027 0.001 28.027 0.001 {method 'read' of '_io.BufferedReader' objects}
40001 0.144 0.000 0.144 0.000 {method 'seek' of '_io.BufferedReader' objects}
In [18]: f3 = tifffile.TiffFile('/data/pipeline/21067_10_00003_00001.tif') # local
In [19]: cProfile.run('n2 = len(f3.pages)')
240111 function calls (240109 primitive calls) in 0.154 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.154 0.154 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 tifffile.py:2035(filehandle)
1 0.046 0.046 0.154 0.154 tifffile.py:3375(_seek)
1 0.000 0.000 0.154 0.154 tifffile.py:3567(__len__)
40000 0.011 0.000 0.062 0.000 tifffile.py:5570(read)
40001 0.014 0.000 0.036 0.000 tifffile.py:5662(seek)
19999 0.003 0.000 0.003 0.000 tifffile.py:5704(size)
1 0.000 0.000 0.000 0.000 tifffile.py:5708(closed)
40000 0.006 0.000 0.006 0.000 {built-in method _struct.unpack}
1 0.000 0.000 0.154 0.154 {built-in method builtins.exec}
101 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
3/1 0.000 0.000 0.154 0.154 {built-in method builtins.len}
19999 0.002 0.000 0.002 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
40000 0.051 0.000 0.051 0.000 {method 'read' of '_io.BufferedReader' objects}
40001 0.023 0.000 0.023 0.000 {method 'seek' of '_io.BufferedReader' objects}
The chain of operations is scan.num_frames -> len(TiffFile.pages) -> TiffFile.TiffPages.seek(-1). Starting from the first page, which has already been read, seek(-1) moves page by page, reading each page's offset and saving it in an index. Per page, it performs two seeks and two reads on the tiff file handle (an io.BufferedReader object); these reads take most of the time.
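The page walk described above can be sketched roughly as follows. This is a minimal illustration, not tifffile's actual code: it assumes a little-endian BigTIFF layout (8-byte tag count, 20-byte tag entries, 8-byte next-page offset), and both `walk_ifd_offsets` and `build_fake_ifd_chain` are hypothetical helpers that operate on an in-memory buffer.

```python
import io
import struct

# Assumed BigTIFF sizes (little-endian); classic TIFF uses 2, 12 and 4 instead.
TAGNO_SIZE = 8    # number of tags in the page, stored as uint64
TAG_SIZE = 20     # size of a single tag entry
OFFSET_SIZE = 8   # offset of the next page, stored as uint64


def walk_ifd_offsets(fh, first_offset):
    """Follow the chain of pages, doing two seeks and two reads per page."""
    offsets = []
    offset = first_offset
    while offset != 0:  # an offset of 0 marks the last page
        offsets.append(offset)
        fh.seek(offset)                                  # seek 1: start of page
        (tagno,) = struct.unpack('<Q', fh.read(TAGNO_SIZE))   # read 1: tag count
        fh.seek(offset + TAGNO_SIZE + tagno * TAG_SIZE)  # seek 2: skip the tags
        (offset,) = struct.unpack('<Q', fh.read(OFFSET_SIZE))  # read 2: next offset
    return offsets


def build_fake_ifd_chain(num_pages, tags_per_page=2, first_offset=16):
    """Hypothetical helper: an in-memory chain of empty pages for the demo."""
    page_size = TAGNO_SIZE + tags_per_page * TAG_SIZE + OFFSET_SIZE
    buf = bytearray(first_offset)  # stand-in for the file header
    for i in range(num_pages):
        is_last = i == num_pages - 1
        next_offset = 0 if is_last else first_offset + (i + 1) * page_size
        buf += struct.pack('<Q', tags_per_page)
        buf += bytes(tags_per_page * TAG_SIZE)  # dummy tag entries
        buf += struct.pack('<Q', next_offset)
    return io.BytesIO(bytes(buf))


fake_file = build_fake_ifd_chain(num_pages=3)
print(walk_ifd_offsets(fake_file, first_offset=16))
```

Every iteration depends on the offset read in the previous one, so the reads cannot be issued in parallel; over a network each one waits on its own round trip.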
However, each read is only 8 bytes (fh.read(tagnosize) reads the number of tags and fh.read(offsetsize) reads the actual offset), which does not amount to enough data to be a bottleneck: even assuming each 8-byte read is sent as a 96-byte TCP packet, that is only around 4 MB, which would not take 28 seconds. My guess is that the sheer number of packets is causing the problem.
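A quick back-of-the-envelope check of that guess, using only the numbers from the two profiles above (the 96-byte packet size is the same assumption as in the text):

```python
# Figures taken from the cProfile output above.
NUM_READS = 40_000
NETWORK_READ_SECONDS = 28.027  # total time in BufferedReader.read, network file
LOCAL_READ_SECONDS = 0.051     # total time in BufferedReader.read, local file
ASSUMED_PACKET_BYTES = 96      # assumption: one 96-byte packet per 8-byte read


def per_read_latency_ms(total_seconds, num_reads):
    """Average time spent in a single read call, in milliseconds."""
    return total_seconds / num_reads * 1000


network_ms = per_read_latency_ms(NETWORK_READ_SECONDS, NUM_READS)
local_ms = per_read_latency_ms(LOCAL_READ_SECONDS, NUM_READS)
payload_mb = NUM_READS * ASSUMED_PACKET_BYTES / 1e6

print(f"network: {network_ms:.3f} ms per read")
print(f"local:   {local_ms:.4f} ms per read")
print(f"total payload: {payload_mb:.2f} MB")
```

The network case works out to about 0.7 ms per read, which looks more like a per-request round trip than a bandwidth limit, so it seems consistent with the number of requests (not the amount of data) dominating.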
In any case, because all pages in ScanImage's tiff files are the same size on file, the offset from page to page will be exactly the same, so we only need to compute one offset overall (or maybe one per file, to be safe and avoid read errors if two files come from different scans). This will require changing the seek function in tifffile.TiffPages to compute the offset only once and fill in the rest of the page offsets with it.
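A rough sketch of the idea, independent of tifffile's internals: `constant_stride_offsets` is a hypothetical helper, not an existing tifffile function. It assumes that pages are laid out contiguously from the first page offset to the end of the file, with nothing trailing the last page; that assumption would need to be checked against real ScanImage files.

```python
def constant_stride_offsets(first_offset, second_offset, file_size):
    """Infer every page offset from the first two, assuming equal-sized pages.

    Assumptions (not verified against ScanImage files): pages are contiguous,
    start at first_offset, and run to the end of the file.
    """
    stride = second_offset - first_offset
    if stride <= 0:
        raise ValueError("second page must come after the first page")
    remaining = file_size - first_offset
    if remaining % stride != 0:
        # Layout does not match the assumption; caller should fall back
        # to walking the pages one by one.
        raise ValueError("file size is not a whole number of pages")
    num_pages = remaining // stride
    return [first_offset + i * stride for i in range(num_pages)]


# Toy numbers: 16-byte header followed by five 100-byte pages.
offsets = constant_stride_offsets(first_offset=16, second_offset=116, file_size=516)
print(len(offsets), offsets)
```

With this approach only the first two page offsets need to be read from the file (plus the file size), so the cost no longer grows with the number of pages. Raising on a mismatch lets the caller fall back to the page-by-page walk when a file does not follow the expected layout.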