Data Management / DM-26662

ap_verify import slow in Gen 3


    Details

    • Story Points:
      6
    • Sprint:
      AP F20-4 (September), AP F20-5 (October)
    • Team:
      Alert Production
    • Urgent?:
      No

      Description

      Ingesting HiTS2015 takes a very long time in Gen 3; after 30 minutes on lsst-devl01, it had not yet finished the preliminary import used to create an independent repository. This is unexpected, since of the affected datasets (calibs, crosstalk, defects, refcats, templates), only templates should scale up in size with the overall dataset size, and only defects and refcats should require file copies. For comparison, the entire import of CI-HiTS2015 takes 30 seconds.

      Run tests comparing the current Gen 2 and Gen 3 behavior to confirm whether the latter is much slower. If so, track down the cause.

        Attachments

          Issue Links

            Activity

            Krzysztof Findeisen added a comment (edited)

            For the import step, we have 4795 files with a total of 50 GiB. Assuming efficient packing, this is 6.5 million 8-KiB chunks; inefficiency adds at most 1 chunk per file, which is negligible on this scale.
            For the ingest step, we have 107 files with a total of 44 GiB. The same calculation gives 5.8 million 8-KiB chunks.
            Thus, we expect 4902 calls to computeChecksum (matching the observed count) and 12.3 million calls to read and update, which is an order of magnitude fewer than the 160 million observed.
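
            For reference, the arithmetic above as a quick Python check (file counts and sizes taken from the numbers quoted in this comment):

                GiB = 1024 ** 3
                CHUNK = 8 * 1024  # 8 KiB read size used by the checksum loop

                # (number of files, total size) for each step, from the figures above
                import_step = (4795, 50 * GiB)
                ingest_step = (107, 44 * GiB)

                total_files = import_step[0] + ingest_step[0]
                total_chunks = (import_step[1] + ingest_step[1]) // CHUNK

                print(f"expected computeChecksum calls: {total_files}")     # 4902
                print(f"expected read/update calls: {total_chunks:,}")      # ~12.3 million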

            I also don't understand why, if checksumming each chunk dominates the cost, the running time for import (45 minutes) is so much larger than for ingest (2 minutes) – scaling closer to the number of files than to the number of chunks. The timings already include I/O overhead, so that's not it.

            In any case, if we're leaving the checksum code as-is, then I would like to enforce no checksums in the ap_verify code. Otherwise, if the default changes again it will cripple us on large runs.

            Tim Jenness added a comment (edited)

            The 160 million calls to the hash updater, rather than the ~12 million expected, is very concerning to me. It suggests that in some cases this line goes completely haywire: https://github.com/lsst/daf_butler/blob/master/python/lsst/daf/butler/datastores/fileLikeDatastore.py#L1481. Maybe it should be reorganized (although this is a very common approach). Maybe weird things happen over mounted file systems, where you ask for 8 KiB of data but get a lot less?

            Forcing checksum to be false in a datastore config seems fine.
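
            The line linked above follows the standard chunked-hash idiom. A minimal sketch of that pattern (standard library only, not the exact daf_butler implementation; the 8 KiB block size comes from the discussion above):

                import hashlib

                def compute_checksum(path, algorithm="blake2b", block_size=8192):
                    """Hash a file in fixed-size chunks so large files never sit in memory."""
                    hasher = hashlib.new(algorithm)
                    with open(path, "rb") as f:
                        # iter() calls f.read(block_size) until it returns b"" (end of file).
                        # Each read returns *at most* block_size bytes; a short read simply
                        # means a smaller chunk and an extra hasher.update() call.
                        for chunk in iter(lambda: f.read(block_size), b""):
                            hasher.update(chunk)
                    return hasher.hexdigest()

            If read() is returning far fewer than block_size bytes per call (for example over a network file system), the number of update() calls grows accordingly even though the total number of bytes hashed stays the same.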

            Tim Jenness added a comment

            Fundamentally, read(N) only promises to return at most N bytes. In practice I've never actually seen it return fewer, apart from the final part of a file, but in this case it seems to mostly be returning far fewer bytes per call, so it seems like we can't do anything about that.
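
            If short reads are indeed the issue, one conceivable mitigation (a sketch only, not something the current datastore code does) would be to coalesce short reads into full blocks before handing them to the hasher, so that update() is called once per block rather than once per read:

                def read_full_block(f, block_size=8192):
                    """Keep calling f.read() until block_size bytes (or EOF) are accumulated."""
                    buf = bytearray()
                    while len(buf) < block_size:
                        piece = f.read(block_size - len(buf))
                        if not piece:  # EOF: return whatever has been accumulated
                            break
                        buf.extend(piece)
                    return bytes(buf)

                # Usage in a hashing loop:
                #   for chunk in iter(lambda: read_full_block(f), b""):
                #       hasher.update(chunk)

            Whether this actually helps depends on whether the cost is dominated by the extra update() calls or by the underlying small I/O requests themselves.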

            Krzysztof Findeisen added a comment

            Can you take a look at the workaround, Tim Jenness? It's a 4-line change.

            Tim Jenness added a comment

            Looks ok.


              People

              Assignee:
              Krzysztof Findeisen
              Reporter:
              Krzysztof Findeisen
              Reviewers:
              Tim Jenness
              Watchers:
              Eric Bellm, Krzysztof Findeisen, Meredith Rawls, Tim Jenness

                Dates

                Created:
                Updated:
                Resolved:
