Improve speed when opening tiff files over the network #5

@ecobost

After a file is opened, accessing the num_frames of a scan makes tifffile iterate over every page to find its offset (see step 2 in the Details of data loading section in the readme). This turns out to be very slow over the network, almost 200x slower than when the file is local (28.6 seconds vs. 0.15 seconds for the file below):

In [13]: f2 = tifffile.TiffFile('/mnt/scratch06/Two-Photon/taliah/2019-04-03_12-41-44/21067_10_00003_00001.tif')   # over the network                                                                                                 

In [14]: cProfile.run('n2 = len(f2.pages)')                                                                                                                                                                        
         240111 function calls (240109 primitive calls) in 28.641 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   28.641   28.641 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 tifffile.py:2035(filehandle)
        1    0.287    0.287   28.641   28.641 tifffile.py:3375(_seek)
        1    0.000    0.000   28.641   28.641 tifffile.py:3567(__len__)
    40000    0.053    0.000   28.080    0.001 tifffile.py:5570(read)
    40001    0.065    0.000    0.209    0.000 tifffile.py:5662(seek)
    19999    0.010    0.000    0.010    0.000 tifffile.py:5704(size)
        1    0.000    0.000    0.000    0.000 tifffile.py:5708(closed)
    40000    0.049    0.000    0.049    0.000 {built-in method _struct.unpack}
        1    0.000    0.000   28.641   28.641 {built-in method builtins.exec}
      101    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
      3/1    0.000    0.000   28.641   28.641 {built-in method builtins.len}
    19999    0.006    0.000    0.006    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    40000   28.027    0.001   28.027    0.001 {method 'read' of '_io.BufferedReader' objects}
    40001    0.144    0.000    0.144    0.000 {method 'seek' of '_io.BufferedReader' objects}

In [18]: f3 = tifffile.TiffFile('/data/pipeline/21067_10_00003_00001.tif')   # local                                                                                                                                      

In [19]: cProfile.run('n2 = len(f3.pages)')                                                                                                                                                                        
         240111 function calls (240109 primitive calls) in 0.154 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.154    0.154 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 tifffile.py:2035(filehandle)
        1    0.046    0.046    0.154    0.154 tifffile.py:3375(_seek)
        1    0.000    0.000    0.154    0.154 tifffile.py:3567(__len__)
    40000    0.011    0.000    0.062    0.000 tifffile.py:5570(read)
    40001    0.014    0.000    0.036    0.000 tifffile.py:5662(seek)
    19999    0.003    0.000    0.003    0.000 tifffile.py:5704(size)
        1    0.000    0.000    0.000    0.000 tifffile.py:5708(closed)
    40000    0.006    0.000    0.006    0.000 {built-in method _struct.unpack}
        1    0.000    0.000    0.154    0.154 {built-in method builtins.exec}
      101    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
      3/1    0.000    0.000    0.154    0.154 {built-in method builtins.len}
    19999    0.002    0.000    0.002    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    40000    0.051    0.000    0.051    0.000 {method 'read' of '_io.BufferedReader' objects}
    40001    0.023    0.000    0.023    0.000 {method 'seek' of '_io.BufferedReader' objects}

The chain of calls is scan.num_frames -> len(TiffFile.pages) -> TiffFile.TiffPages.seek(-1). Starting from the first page, which has already been read, seek(-1) moves page by page, reading each page's offset and saving it in an index. Per page, it performs two seeks and two reads on the tiff file handle (an io.BufferedReader object); these reads account for almost all of the time (28.0 of the 28.6 seconds over the network).
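To make the access pattern concrete, here is a minimal sketch of a page-by-page offset walk. It is not tifffile's code: build_ifd_chain and walk_page_offsets are hypothetical helpers, and the toy file uses a classic little-endian TIFF layout (2-byte tag count, 12-byte tag entries, 4-byte next-page offset) with zeroed tag entries, purely to show the two small reads per page.

```python
import io
import struct


def build_ifd_chain(num_pages, tags_per_page=2, payload=16):
    """Build a toy little-endian, classic-TIFF-like byte string (hypothetical helper)."""
    # Header: byte order 'II', magic number 42, offset of the first IFD (8).
    buf = bytearray(struct.pack('<2sHI', b'II', 42, 8))
    ifd_size = 2 + 12 * tags_per_page + 4
    for i in range(num_pages):
        start = len(buf)
        # The last page stores 0 as its "next page" offset.
        next_offset = start + ifd_size + payload if i < num_pages - 1 else 0
        buf += struct.pack('<H', tags_per_page)   # number of tags
        buf += bytes(12 * tags_per_page)          # zeroed tag entries
        buf += struct.pack('<I', next_offset)     # offset of the next page
        buf += bytes(payload)                     # stand-in for image data
    return bytes(buf)


def walk_page_offsets(fh):
    """Follow the chain page by page; return (offsets, number of per-page reads)."""
    fh.seek(4)
    (offset,) = struct.unpack('<I', fh.read(4))   # first page offset from the header
    offsets, reads = [], 0
    while offset:
        offsets.append(offset)
        fh.seek(offset)
        (num_tags,) = struct.unpack('<H', fh.read(2))   # read 1: number of tags
        reads += 1
        fh.seek(offset + 2 + 12 * num_tags)
        (offset,) = struct.unpack('<I', fh.read(4))     # read 2: next page offset
        reads += 1
    return offsets, reads


offsets, reads = walk_page_offsets(io.BytesIO(build_ifd_chain(5)))
print(offsets, reads)
```

Every page costs two tiny reads at distant positions, so the total number of reads grows linearly with the number of pages, which matches the 40000 reads seen in the profiles above.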

However, each read fetches only 8 bytes (fh.read(tagnosize) reads the number of tags and fh.read(offsetsize) reads the actual offset), which does not amount to enough data to be a bottleneck: even assuming each 8-byte read is sent as a 96-byte TCP packet, the 40000 reads come to only around 4 MB, which would not take 28 seconds. My guess is that it is the sheer number of packets, rather than their size, that is causing the problem.
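The back-of-the-envelope numbers can be checked with a few lines. The read counts and timings are taken from the profiles above; the 96-byte packet size is the same rough assumption made in the text, not a measured value.

```python
# Figures from the cProfile output above.
num_reads = 40000
network_read_time_s = 28.027   # tottime of BufferedReader.read over the network
local_read_time_s = 0.051      # tottime of BufferedReader.read on the local file

# Assumption from the text: each 8-byte read travels as a 96-byte packet.
packet_bytes = 96

total_mb = num_reads * packet_bytes / 1e6
network_per_read_ms = network_read_time_s / num_reads * 1e3
local_per_read_us = local_read_time_s / num_reads * 1e6

print(f"total data moved: {total_mb:.2f} MB")
print(f"network time per read: {network_per_read_ms:.2f} ms")
print(f"local time per read: {local_per_read_us:.2f} us")
```

The total is under 4 MB, while each network read costs roughly 0.7 ms, which is consistent with the time being spent on per-read overhead rather than on moving data.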

In any case, because all pages in ScanImage's tiff files are the same size on file, the offset from one page to the next is always the same, so we only need to compute one offset overall (or maybe one per file, to be safe and avoid read errors if two files come from different scans). This will require changing the seek function in tifffile.TiffPages to compute the offset only once and fill in the remaining page offsets from it.
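A minimal sketch of the idea, assuming every page occupies exactly the same number of bytes and nothing is stored after the last page. estimate_page_offsets is a hypothetical helper, not part of tifffile; deriving the page count from the file size is an extra assumption made here for illustration.

```python
def estimate_page_offsets(first_offset, second_offset, file_size):
    """Fill in all page offsets from the first two, assuming a constant stride.

    Hypothetical helper: assumes equal-sized pages laid out back to back
    from first_offset to the end of the file.
    """
    stride = second_offset - first_offset
    if stride <= 0:
        raise ValueError("second page must come after the first one")
    # Number of whole pages that fit between the first page and the end of file.
    num_pages = (file_size - first_offset) // stride
    return [first_offset + i * stride for i in range(num_pages)]


# Toy numbers: first page at byte 8, second at byte 54, file of 238 bytes.
print(estimate_page_offsets(8, 54, 238))
```

This needs only the offsets of the first two pages and the file size, so the cost no longer depends on the number of pages. A real change would probably want a sanity check, for instance reading the last estimated page once to confirm it is where the stride predicts.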
