# ap_verify import slow in Gen 3


#### Details

• Type: Bug
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s:
• Labels: gen3-middleware
• Story Points: 6
• Sprint: AP F20-4 (September), AP F20-5 (October)
• Team:
• Urgent?: No

#### Description

Ingesting HiTS2015 takes a very long time in Gen 3; after 30 minutes on lsst-devl01, it had not yet finished the preliminary import used to create an independent repository. This is unexpected, since of the affected datasets (calibs, crosstalk, defects, refcats, templates), only templates should scale up in size with the overall dataset size, and only defects and refcats should require file copies. For comparison, the entire import of CI-HiTS2015 takes 30 seconds.

Run tests comparing the current Gen 2 and Gen 3 behavior to confirm whether the latter is much slower. If so, track down the cause.

#### Attachments

1. cprofile-checksum.dat (2.44 MB)
2. cprofile-nocheck.dat (2.44 MB)

#### Activity

Krzysztof Findeisen created issue -
Krzysztof Findeisen added a comment -

I've confirmed that this is a difference between Gen 2 behavior (10 minutes for HiTS2015) and Gen 3 behavior (80 minutes). It is not caused or affected by a schema mismatch between the export file and the new repository being imported into.

All tests so far have used ingest_dataset.py, but manually calling Butler.import_ may be more effective for profiling.

Tim Jenness added a comment -

Did you disable checksum calculation in the datastore? In DM-26937 that accounted for a third of the run time. The metadata extraction for DECam is also different between Gen 2 and Gen 3, so that might account for some of the time. Knowing the hotspots for ingest would be really useful. Doing the ingest once, then exporting and reimporting, will of course be faster, but getting a profile for butler ingest-raws on a bunch of DECam data would be wonderful.

Krzysztof Findeisen added a comment - edited

Without checksums it runs very quickly (1 minute), much faster than Gen 2. With checksums it takes 47 minutes, of which 2 minutes are spent on ingestion. I've attached the cProfile files (which I don't think contain any sensitive data...) in case Tim Jenness wants to investigate further.

Since checksums are an optional feature, we could force them to be off in ap_verify runs and call this issue done.

On the other hand, we could treat the slow checksumming as a bug in daf_butler and try to speed it up; aside from the following suspicious lines from the profile,

```
   ncalls   tottime  percall   cumtime  percall  filename:lineno(function)
160004158  2359.079    0.000  2359.079    0.000  {method 'update' of '_blake2.blake2b' objects}
160009060    25.368    0.000   348.888    0.000  /software/lsstsw/stack_20200922/stack/miniconda3-py37_4.8.2-cb4e2dc/Linux64/daf_butler/19.0.0-170-g029acfa6+ff10c6d78d/python/lsst/daf/butler/datastores/fileLikeDatastore.py:1481(<lambda>)
160009261   323.575    0.000   323.575    0.000  {method 'read' of '_io.BufferedReader' objects}
```

I don't know enough about how datastores work to debug this.

Tim Jenness added a comment -

Great. I turned off checksums by default last week in DM-27007. I also fixed a problem where checksums were being calculated twice when they were enabled.

So I think the ingest time might improve by a good factor after fixing that bug. Blake2b is already meant to be a relatively fast checksum calculator.

https://github.com/lsst/daf_butler/blob/master/python/lsst/daf/butler/datastores/fileLikeDatastore.py#L1448 has the calculation (we read the data file in 8k chunks).

Krzysztof Findeisen added a comment - edited

The import handles about 50 GB of data in 4795 files, so if everything is done in 8K chunks, that accounts for about 7 million of the 160 million calls....

Tim Jenness added a comment -

So 13 times too many calls. How many detectors are in these files? Could it be calculating the checksum for each detector even though it's the same file multiple times? It looks like there are 4902 calls to the checksum code but around 108 files ingested. How many files do you expect? And how many detectors?

Krzysztof Findeisen added a comment - edited

The number of calls to computeChecksum is correct: 4795 from importing from the preloaded Gen 3 repository, plus 107 from ingesting the raw files. If we include the 44 GB raws, then we expect an extra 6 million calls to the per-chunk code.

These numbers suggest about 0.6 s per file on import and 1.2 s per file on ingest, even though the imported files are much smaller on average (calibs and template coadds, but also defects, refcat shards, etc.). In fact, since ingest takes half as long with checksums disabled, that suggests checksumming takes the same amount of time regardless of file size...

FWIW, DECam has 62 detectors.
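The per-file timings quoted above follow directly from the totals reported earlier in this thread (a back-of-envelope check, not a measurement):

```python
# Import: ~45 minutes with checksums enabled, for 4795 mostly small files
per_file_import = 45 * 60 / 4795   # ~0.56 s per file

# Ingest: ~2 minutes for 107 raw files
per_file_ingest = 2 * 60 / 107     # ~1.12 s per file
```

If checksumming cost scaled with bytes hashed, the much smaller imported files should be far cheaper per file than the raws, which is the anomaly being pointed out.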

Tim Jenness added a comment -

I just tried to ingest a couple of DECam files, and it only tried to calculate the checksum once for each file (not once per detector).

Tim Jenness added a comment -

Ingest is going to be slower since we have to read the file metadata to work out the dataIds (I've finally added some logging to ingest in DM-27071).

Import already knows all the dataIds.

Tim Jenness added a comment -

Changing the chunk size for checksum calculation from 8k to 64k resulted in a 10% performance improvement on my laptop. Of course if the data are coming over a mounted filesystem it's never going to be great. I still don't really understand what you are saying regarding the number of times bytes are read from a file in the hash update loop. I see exactly the number of iterations as I expect in single file testing.

Is the outcome though that we can close this ticket since checksum is now disabled by default?

Krzysztof Findeisen added a comment - edited

- For the import step, we have 4795 files with a total of 50 GiB. Assuming efficient packing, this is 6.5 million 8 KiB chunks; inefficiency adds at most 1 chunk per file, which is negligible on this scale.
- For the ingest step, we have 107 files with a total of 44 GiB. The same calculation gives 5.8 million 8 KiB chunks.

Thus, we expect 4902 calls to computeChecksum (observed) and 12.3 million calls to read and update (too low by an order of magnitude compared to 160 million).
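The expected call counts can be reproduced directly from the file counts and sizes reported in this thread (a back-of-envelope check):

```python
KiB, GiB = 1024, 1024**3
chunk = 8 * KiB

# Import: 4795 files totalling ~50 GiB; ingest: 107 raw files totalling ~44 GiB.
# Adding one chunk per file bounds the cost of partial final chunks.
import_chunks = 50 * GiB // chunk + 4795
ingest_chunks = 44 * GiB // chunk + 107

expected_reads = import_chunks + ingest_chunks   # ~12.3 million
observed_reads = 160_009_261                     # from the cProfile output above
ratio = observed_reads / expected_reads          # ~13x more reads than expected
```

The factor-of-13 discrepancy only makes sense if each read() call returns far fewer than 8 KiB on average.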

I also don't understand why, if summing each chunk dominates the computation cost, the running time for import (45 minutes) is much larger than for ingest (2 minutes) – scaling closer to number of files than to number of chunks. The timings seem to include I/O overhead already, so that's not it.

In any case, if we're leaving the checksum code as-is, then I would like to enforce no checksums in the ap_verify code. Otherwise, if the default changes again it will cripple us on large runs.

Tim Jenness added a comment - edited

The 160 million calls to the hash updater, rather than ~12 million, is very concerning to me. It suggests that in some cases this line goes completely haywire: https://github.com/lsst/daf_butler/blob/master/python/lsst/daf/butler/datastores/fileLikeDatastore.py#L1481 Maybe it should be reorganized (although this is a very common approach). Maybe weird things happen over mounted file systems, where you ask for 8 kB of data but get a lot less?

Forcing checksum to be false in a datastore config seems fine.

Tim Jenness added a comment -

Fundamentally, read(N) is only guaranteed to return at most N bytes, though I've never actually seen it return fewer in real life (apart from the final part of a file). It seems that in this case it is mostly returning far fewer bytes per read call, so it seems we can't do anything about that.

Krzysztof Findeisen added a comment -

Can you take a look at the workaround, Tim Jenness? It's a 4-line change.

Tim Jenness added a comment -

Looks ok.


#### People

Assignee:
Krzysztof Findeisen
Reporter:
Krzysztof Findeisen
Reviewers:
Tim Jenness
Watchers:
Eric Bellm, Krzysztof Findeisen, Meredith Rawls, Tim Jenness