Data Management / DM-26662

ap_verify import slow in Gen 3


    Details

    • Story Points:
      6
    • Sprint:
      AP F20-4 (September), AP F20-5 (October)
    • Team:
      Alert Production
    • Urgent?:
      No

      Description

      Ingesting HiTS2015 takes a very long time in Gen 3; after 30 minutes on lsst-devl01, it had not yet finished the preliminary import used to create an independent repository. This is unexpected, since of the affected datasets (calibs, crosstalk, defects, refcats, templates), only templates should scale up in size with the overall dataset size, and only defects and refcats should require file copies. For comparison, the entire import of CI-HiTS2015 takes 30 seconds.

      Run tests comparing the current Gen 2 and Gen 3 behavior to confirm whether the latter is much slower. If so, track down the cause.

        Attachments

          Issue Links

            Activity

            Krzysztof Findeisen created issue -
            Krzysztof Findeisen made changes -
            Field Original Value New Value
            Epic Link DM-25145 [ 435263 ]
            Krzysztof Findeisen made changes -
            Link This issue relates to DM-26526 [ DM-26526 ]
            Krzysztof Findeisen made changes -
            Description edited
            Krzysztof Findeisen made changes -
            Description edited
            Tim Jenness made changes -
            Labels gen3-middleware
            Krzysztof Findeisen made changes -
            Link This issue relates to DM-25806 [ DM-25806 ]
            Krzysztof Findeisen added a comment -

            I've confirmed that this is a difference between Gen 2 behavior (10 minutes for HiTS2015) and Gen 3 behavior (80 minutes). It is not caused or affected by having a schema mismatch between the export file and the new repository being imported into.

            All tests so far have used ingest_dataset.py, but manually calling Butler.import_ may be more effective for profiling.

            Krzysztof Findeisen made changes -
            Status To Do [ 10001 ] In Progress [ 3 ]
            John Swinbank made changes -
            Epic Link DM-25145 [ 435263 ] DM-26803 [ 439755 ]
            Ian Sullivan made changes -
            Story Points 4 6
            Tim Jenness added a comment -

            Did you disable checksum calculation in the datastore? In DM-26937 that accounted for a third of the run time. The metadata extraction for DECam is also different between Gen 2 and Gen 3, so that might account for some of the time. Knowing the hotspots for ingest would be really useful. Doing the ingest once, then exporting and reimporting, will of course be faster, but getting a profile for butler ingest-raws on a bunch of DECam data would be wonderful.

            Tim Jenness made changes -
            Link This issue relates to DM-26937 [ DM-26937 ]
            Ian Sullivan made changes -
            Sprint AP F20-4 (September) [ 1039 ] AP F20-4 (September), AP F20-5 (October) [ 1039, 1046 ]
            Krzysztof Findeisen made changes -
            Attachment cprofile-checksum.dat [ 45651 ]
            Attachment cprofile-ingest.dat [ 45652 ]
            Krzysztof Findeisen added a comment (edited) -

            Without checksums it runs very quickly (1 minute), much faster than Gen 2. With checksums it takes 47 minutes, of which 2 minutes are spent on ingestion. I've attached the cProfile files (which I don't think contain any sensitive data...) in case Tim Jenness wants to investigate further.

            Since checksums are an optional feature, we could force them to be off in ap_verify runs and call this issue done.

            On the other hand, we could treat the slow checksumming as a bug in daf_butler and try to speed it up; aside from the following suspicious lines,

              ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            160004158 2359.079    0.000 2359.079    0.000 {method 'update' of '_blake2.blake2b' objects}
            160009060   25.368    0.000  348.888    0.000 /software/lsstsw/stack_20200922/stack/miniconda3-py37_4.8.2-cb4e2dc/Linux64/daf_butler/19.0.0-170-g029acfa6+ff10c6d78d/python/lsst/daf/butler/datastores/fileLikeDatastore.py:1481(<lambda>)
            160009261  323.575    0.000  323.575    0.000 {method 'read' of '_io.BufferedReader' objects}
            

            I don't know enough about how datastores work to debug this.
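
            For context, the hot loop the profile points at is the standard chunked-hashing pattern. This is only a minimal self-contained sketch of that pattern, with compute_checksum as a hypothetical stand-in for the datastore's computeChecksum (the real implementation lives in fileLikeDatastore.py); it is not the daf_butler code itself:

            ```python
            import hashlib

            def compute_checksum(path, algorithm="blake2b", block_size=8192):
                """Hash a file in fixed-size chunks.

                Each loop iteration corresponds to one 'read' call and one
                'update' call in the profile lines above.
                """
                hasher = hashlib.new(algorithm)
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(block_size), b""):
                        hasher.update(chunk)
                return hasher.hexdigest()
            ```

            With this shape, the number of update calls should be roughly (file size / block_size) per file, which is the basis for the chunk arithmetic later in the thread.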

            Krzysztof Findeisen made changes -
            Attachment cprofile-ingest.dat [ 45652 ]
            Krzysztof Findeisen made changes -
            Attachment cprofile-nocheck.dat [ 45653 ]
            Krzysztof Findeisen made changes -
            Attachment cprofile-nocheck.dat [ 45653 ]
            Krzysztof Findeisen made changes -
            Attachment cprofile-checksum.dat [ 45651 ]
            Krzysztof Findeisen made changes -
            Attachment cprofile-checksum.dat [ 45654 ]
            Attachment cprofile-nocheck.dat [ 45655 ]
            Tim Jenness added a comment -

            Great. I turned off checksums by default last week in DM-27007. I also fixed a problem where checksums were being calculated twice when they were enabled.

            So I think the ingest time might improve by a good factor after fixing that bug. Blake2b is already meant to be a relatively fast checksum calculator.

            https://github.com/lsst/daf_butler/blob/master/python/lsst/daf/butler/datastores/fileLikeDatastore.py#L1448 has the calculation (we read the data file in 8k chunks).

            Krzysztof Findeisen added a comment (edited) -

            The import handles about 50 GB of data in 4795 files, so if everything is done in 8K chunks, that accounts for about 7 million of the 160 million calls....
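
            A back-of-the-envelope check of that chunk count, using the round figures quoted above (treating "50 GB" as 50 GiB is an assumption for the arithmetic):

            ```python
            # Expected number of 8 KiB chunks for ~50 GiB in 4795 files.
            total_bytes = 50 * 1024**3           # ~50 GiB of imported data (rounded)
            chunk_size = 8 * 1024                # 8 KiB read size
            full_chunks = total_bytes // chunk_size
            per_file_slack = 4795                # at most one partial chunk per file
            print(full_chunks + per_file_slack)  # ~6.6 million, vs 160 million observed
            ```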

            Tim Jenness added a comment -

            So 13 times too many calls. How many detectors are in these files? Could it be calculating the checksum for each detector even though it's the same file multiple times? It looks like there are 4902 calls to the checksum code but around 108 files ingested. How many files do you expect? And how many detectors?

            Krzysztof Findeisen added a comment (edited) -

            The number of calls to computeChecksum is correct: 4795 from importing from the preloaded Gen 3 repository, plus 107 from ingesting the raw files. If we include the 44 GB raws, then we expect an extra 6 million calls to the per-chunk code.

            These numbers suggest about 0.6 s per file on import and 1.2 s per file on ingest, even though the imported files are much smaller on average (calibs and template coadds, but also defects, refcat shards, etc.). In fact, since ingest takes half as long with checksums disabled, that suggests checksumming takes the same amount of time regardless of file size...

            FWIW, DECam has 62 detectors.
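
            The per-file timing figures above can be reproduced from the counts and wall-clock times quoted in this thread (45 minutes for the checksummed import, 2 minutes for raw ingest):

            ```python
            # Rough per-file timing from the figures quoted in this thread.
            import_files, import_minutes = 4795, 45   # preloaded-repo import
            ingest_files, ingest_minutes = 107, 2     # raw ingest
            import_s = import_minutes * 60 / import_files
            ingest_s = ingest_minutes * 60 / ingest_files
            print(round(import_s, 2), round(ingest_s, 2))  # 0.56 1.12
            ```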

            Tim Jenness added a comment -

            I just tried to ingest a couple of DECam files, and it only calculated the checksum once for each file (not once per detector).

            Tim Jenness added a comment -

            Ingest is going to be slower since we have to read the file metadata to work out the dataIds (I've finally added some logging to ingest in DM-27071).

            Import already knows all the dataIds.

            Tim Jenness added a comment -

            Changing the chunk size for checksum calculation from 8k to 64k resulted in a 10% performance improvement on my laptop. Of course if the data are coming over a mounted filesystem it's never going to be great. I still don't really understand what you are saying regarding the number of times bytes are read from a file in the hash update loop. I see exactly the number of iterations as I expect in single file testing.

            Is the outcome, though, that we can close this ticket since checksums are now disabled by default?

            Krzysztof Findeisen added a comment (edited) -

            For the import step, we have 4795 files with a total of 50 GiB. Assuming efficient packing, this is 6.5 million 8-KiB chunks; inefficiency adds at most 1 chunk per file, which is negligible on this scale.
            For the ingest step, we have 107 files with a total of 44 GiB. The same calculation gives 5.8 million 8-KiB chunks.
            Thus, we expect 4902 calls to computeChecksum (observed) and 12.3 million calls to read and update (too low by an order of magnitude compared to 160 million).

            I also don't understand why, if summing each chunk dominates the computation cost, the running time for import (45 minutes) is much larger than for ingest (2 minutes) – scaling closer to number of files than to number of chunks. The timings seem to include I/O overhead already, so that's not it.

            In any case, if we're leaving the checksum code as-is, then I would like to enforce no checksums in the ap_verify code. Otherwise, if the default changes again it will cripple us on large runs.

            Tim Jenness added a comment (edited) -

            The 160 million calls to the hash updater, rather than ~12 million, is very concerning to me. It suggests that there are some cases where this line goes completely haywire: https://github.com/lsst/daf_butler/blob/master/python/lsst/daf/butler/datastores/fileLikeDatastore.py#L1481. Maybe it should be reorganized (although this is a very common approach). Maybe weird things happen over mounted file systems, where you ask for 8 kB of data but get a lot less?

            Forcing checksum to be false in a datastore config seems fine.

            Tim Jenness added a comment -

            Fundamentally, read(N) only promises to return at most N bytes; in real life I've almost never seen it return fewer (apart from when reading the final part of a file). In this case it seems to be mostly returning far fewer bytes per read call, so there isn't much we can do about that.
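
            The short-read effect described above is easy to demonstrate with a hypothetical file object whose read() caps how many bytes it returns per call: the checksum is unchanged, but the number of read/update calls is multiplied (ShortReader and hash_with_count are illustrative names, not daf_butler code):

            ```python
            import hashlib
            import io

            class ShortReader(io.RawIOBase):
                """Simulates a filesystem that returns short reads (hypothetical)."""

                def __init__(self, data, max_read):
                    self._buf = io.BytesIO(data)
                    self._max = max_read

                def read(self, size=-1):
                    # Return at most max_read bytes regardless of what was asked for.
                    if size is None or size < 0 or size > self._max:
                        size = self._max
                    return self._buf.read(size)

            def hash_with_count(f, block_size=8192):
                """Chunked blake2b hash, counting the read/update loop iterations."""
                hasher, calls = hashlib.blake2b(), 0
                for chunk in iter(lambda: f.read(block_size), b""):
                    hasher.update(chunk)
                    calls += 1
                return hasher.hexdigest(), calls

            data = b"a" * 65536
            digest_full, calls_full = hash_with_count(io.BytesIO(data))          # 8 calls
            digest_short, calls_short = hash_with_count(ShortReader(data, 512))  # 128 calls
            assert digest_full == digest_short and calls_short > calls_full
            ```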

            Krzysztof Findeisen added a comment -

            Can you take a look at the workaround, Tim Jenness? It's a 4-line change.

            Krzysztof Findeisen made changes -
            Reviewers Tim Jenness [ tjenness ]
            Status In Progress [ 3 ] In Review [ 10004 ]
            Tim Jenness added a comment -

            Looks ok.

            Tim Jenness made changes -
            Status In Review [ 10004 ] Reviewed [ 10101 ]
            Krzysztof Findeisen made changes -
            Resolution Done [ 10000 ]
            Status Reviewed [ 10101 ] Done [ 10002 ]

              People

              Assignee:
              Krzysztof Findeisen
              Reporter:
              Krzysztof Findeisen
              Reviewers:
              Tim Jenness
              Watchers:
              Eric Bellm, Krzysztof Findeisen, Meredith Rawls, Tim Jenness


                  Jenkins

                  No builds found.