Fix Version/s: None
Sprint: AP F20-4 (September), AP F20-5 (October)
Ingesting HiTS2015 takes a very long time in Gen 3; after 30 minutes on lsst-devl01, it had not yet finished the preliminary import used to create an independent repository. This is unexpected, since of the affected datasets (calibs, crosstalk, defects, refcats, templates), only templates should scale up in size with the overall dataset size, and only defects and refcats should require file copies. For comparison, the entire import of CI-HiTS2015 takes 30 seconds.
Run tests comparing the current Gen 2 and Gen 3 behavior to confirm whether the latter is much slower. If so, track down the cause.
The 160 million calls to the hash updater, rather than ~12 million, are very concerning to me. They suggest that there are some cases where this line goes completely haywire: https://github.com/lsst/daf_butler/blob/master/python/lsst/daf/butler/datastores/fileLikeDatastore.py#L1481 Maybe it should be reorganized (although this is a very common approach). Maybe weird things happen over mounted file systems, where you ask for 8 kB of data but get a lot less?
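One way to test the short-read hypothesis is to instrument the chunked-hash loop and record how many bytes each read actually returns. The sketch below is not the daf_butler code; it assumes the common `iter(lambda: f.read(n), b"")` pattern, an 8192-byte block size, and `blake2b` as the algorithm. `checksum_with_stats` and `demo` are hypothetical names used only for illustration.

```python
import hashlib
import os
import tempfile
from collections import Counter


def checksum_with_stats(path, algorithm="blake2b", block_size=8192):
    """Hash a file in chunks and tally the size of every chunk read.

    Hypothetical helper: the algorithm and block size are assumptions,
    not values taken from the datastore code.
    """
    hasher = hashlib.new(algorithm)
    read_sizes = Counter()
    with open(path, "rb") as f:
        # iter(callable, sentinel) calls f.read until it returns b"" (EOF).
        for chunk in iter(lambda: f.read(block_size), b""):
            hasher.update(chunk)
            read_sizes[len(chunk)] += 1
    return hasher.hexdigest(), read_sizes


def demo():
    """Run the helper on a 20000-byte temporary file on local disk."""
    data = os.urandom(20000)
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data)
        path = tmp.name
    try:
        return data, checksum_with_stats(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    _, (digest, sizes) = demo()
    print(digest)
    print(dict(sizes))
```

Pointing `checksum_with_stats` at one of the slow files on the mounted file system would show whether most reads come back well under the block size. On a local file opened with default buffering, only the final chunk should be short.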
Forcing checksum to be false in a datastore config seems fine.
Fundamentally, read() only promises to return at most N bytes, but I've never actually seen it return fewer than N in real life (apart from when reading the final part of a file). In this case it seems to be returning far fewer bytes on most calls, and we can't do anything about that.
Can you take a look at the workaround, Tim Jenness? It's a 4-line change.
For the import step, we have 4795 files with a total of 50 GiB. Assuming efficient packing, this is 6.5 million 8-KiB chunks; inefficiency adds at most 1 chunk per file, which is negligible on this scale.
For the ingest step, we have 107 files with a total of 44 GiB. The same calculation gives 5.8 million 8-KiB chunks.
Thus, we expect 4902 calls to computeChecksum (as observed) and 12.3 million calls to read and update, which is an order of magnitude lower than the 160 million observed.
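The estimate can be reproduced with a few lines of arithmetic. This sketch uses only the file counts and sizes quoted above and treats the 160 million figure as a rounded observation; the variable names are illustrative. The mean bytes per read it derives is an inference from those numbers, not a measurement.

```python
GIB = 2**30
CHUNK = 8 * 2**10  # 8 KiB block size


def min_chunks(total_bytes, chunk=CHUNK):
    """Chunks needed if the data were perfectly packed (ceiling division)."""
    return -(-total_bytes // chunk)


import_chunks = min_chunks(50 * GIB)   # import step: 4795 files, 50 GiB
ingest_chunks = min_chunks(44 * GIB)   # ingest step: 107 files, 44 GiB
expected_reads = import_chunks + ingest_chunks

# Inefficient packing adds at most one partial chunk per file.
upper_bound = expected_reads + 4795 + 107

observed = 160_000_000  # rounded figure from the profile
ratio = observed / expected_reads
mean_bytes_per_read = (50 + 44) * GIB / observed

print(f"expected: {expected_reads:,} (upper bound {upper_bound:,})")
print(f"observed / expected: {ratio:.1f}")
print(f"implied mean bytes per read: {mean_bytes_per_read:.0f}")
```

If the observed count is accurate, the average read returned roughly 630 bytes rather than 8192, which is consistent with the short-read explanation.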
I also don't understand why, if summing each chunk dominates the computation cost, the running time for import (45 minutes) is much larger than for ingest (2 minutes) – scaling closer to number of files than to number of chunks. The timings seem to include I/O overhead already, so that's not it.
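To make the scaling comparison concrete, the sketch below normalizes the two quoted running times by file count and by data volume. It uses only the numbers from this ticket (45 minutes for 4795 files and 50 GiB, 2 minutes for 107 files and 44 GiB); the names are illustrative.

```python
steps = {
    # name: (seconds, number of files, size in GiB)
    "import": (45 * 60, 4795, 50),
    "ingest": (2 * 60, 107, 44),
}

per_file = {name: secs / files for name, (secs, files, _) in steps.items()}
per_gib = {name: secs / gib for name, (secs, _, gib) in steps.items()}

for name in steps:
    print(f"{name}: {per_file[name]:.2f} s/file, {per_gib[name]:.1f} s/GiB")

# How far apart the two steps are under each normalization.
file_spread = per_file["ingest"] / per_file["import"]
gib_spread = per_gib["import"] / per_gib["ingest"]
print(f"per-file costs differ by {file_spread:.1f}x, per-GiB by {gib_spread:.1f}x")
```

The per-file costs differ by about a factor of 2, while the per-GiB costs differ by about a factor of 20, so the timings track the number of files much more closely than the number of chunks.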
In any case, if we're leaving the checksum code as-is, then I would like to enforce no checksums in the ap_verify code. Otherwise, if the default changes again it will cripple us on large runs.