  1. Data Management
  2. DM-26937

Profile raw data ingest


    Details

    • Story Points:
      1
    • Team:
      Architecture
    • Urgent?:
      No

      Description

      We are going to want to ingest a lot of raw files so we should profile raw ingest to ensure there are no obvious slowdowns.

      testdata_lsst has 450 files and I have a copy locally so first try with that.

      Checksum calculation should be disabled (and possibly I should make that the default).
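      As a rough illustration of the kind of profiling meant here, the following is a minimal sketch using the standard-library cProfile and pstats modules. The fake_ingest function is a hypothetical stand-in for the real ingest call, which is not shown in this ticket; only the profiling wrapper is the point.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func under cProfile; return (result, text report).

    The report lists the `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def fake_ingest(n_files):
    """Hypothetical stand-in for the real ingest call (assumption)."""
    return sum(len(f"raw_{i:04d}.fits") for i in range(n_files))


if __name__ == "__main__":
    _, report = profile_call(fake_ingest, 450)
    print(report)
```

      Sorting by cumulative time makes it easier to see which high-level phase (imports, metadata extraction, registry inserts) dominates, rather than which low-level function is called most often.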

            Activity

            tjenness Tim Jenness added a comment -

            In my first test I unzipped all the .gz files and replaced them with fpacked files so that we do not get a slow down from gunzipping to read metadata.

            With profiling enabled, single-threaded ingest of the 450 files (from imsim, latiss, phosim, ts8, and ts3) takes about 20 seconds (remember that checksum calculation is disabled).

            About 6 seconds of that is importing Python code, so the ingest phase (that is, how long run takes) is about 14 seconds.

            Calculating all the ingest metadata takes nearly 10 seconds (much of that seems to be down to calculating AzEl/Airmass for the on-sky data).
            3 seconds of that is reading the YAML camera (for detector mapping) and nearly 4 seconds is spent in astropy coordinates.

            Butler ingest only takes about 1 second.

            i.e. Butler ingest itself accounts for only about 7% of the ingest phase (about 1 of 14 seconds), while metadata extraction accounts for roughly 70% (nearly 10 of 14 seconds). If this holds generally and if we intend to reingest multiple times, it seems like by far the best approach is to have a mode where we can read pre-calculated ObservationInfo from a separate file. This would also help ingest over S3 (for example, writing a .yaml file of the same name as the .fits file and in the same place). The downside is that a separate piece of code has to run this metadata extraction and persist the result somewhere for later use by ingest.
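            The sidecar idea above could look something like the following minimal sketch. It is only an illustration: the function names are hypothetical, the extract callable stands in for the slow header translation, and JSON is swapped in for the .yaml format mentioned above purely so that the example needs nothing beyond the standard library.

```python
import json
from pathlib import Path


def sidecar_path(raw_path):
    """Sidecar lives next to the raw file: last suffix replaced by .json."""
    return Path(raw_path).with_suffix(".json")


def write_sidecar(raw_path, extract):
    """Run the slow extraction once and persist it beside the raw file.

    `extract` is a hypothetical callable returning a JSON-serialisable dict.
    """
    metadata = extract(raw_path)
    with open(sidecar_path(raw_path), "w") as handle:
        json.dump(metadata, handle)
    return metadata


def load_metadata(raw_path, extract):
    """Prefer pre-calculated metadata; fall back to extracting it."""
    sidecar = sidecar_path(raw_path)
    if sidecar.exists():
        with open(sidecar) as handle:
            return json.load(handle)
    return extract(raw_path)
```

            With this shape, ingest only pays the extraction cost for files that have no sidecar, and the separate extraction step can be run once, ahead of time, wherever the raw files live.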

            For the record, checksum calculation adds an additional 15 seconds.

            This is all locally on my iMac with sqlite and an attached SSD over Thunderbolt.

            tjenness Tim Jenness added a comment -

            For reference, with no profiler the whole thing takes 13 seconds (and the ingest phase is almost instantaneous after all the files have been scanned).

            Funnily enough, enabling multiprocessing (8 cores) takes significantly longer (24s real), but I assume it will make a big difference for thousands of files (and especially where there aren't multiple instruments being ingested at once).

            neilsen Eric Neilsen added a comment - - edited

            In the profiling I've done so far on ConvertRepoTask as part of PREOPS-102, a conversion of a subset of DESC DC2 (the profile is now about a week out of date) showed that airmass calculation was taking about 20% of the time, much of which was spent in astropy coordinates (converting equatorial to horizon coordinates). My past experience with astropy coordinates suggests that there may be better ways of doing this. (Also, the airmass is already in the header of the DC2 raws.)

            tjenness Tim Jenness added a comment -

            Thanks. Yes, a third of the metadata time was in airmass, a third in detector name-to-number mappings, and a third in azel calculation. There are some things to note here:

            • I have no idea what the airmass header is called for imsim in the files that I was testing with. If you can patch https://github.com/lsst/obs_lsst/blob/master/python/lsst/obs/lsst/translators/imsim.py#L88 that would be great.
            • AltAz begin is derived from the RADEC because I couldn't find the elevation/zenithAngle in imsim file headers either (and because imsim data are from the future this causes much angst to astropy).
            • Butler doesn't use airmass (it does use zenith angle) so one option is to modify ObservationInfo such that it only calculates the information butler actually needs. That should be a relatively trivial change to make to astro_metadata_translator.
            • If imSim data don't include any elevation effects then storing the zenith angle in the butler registry is presumably a waste of time and we could always write a fixed number there. Maybe James Chiang can comment on that side of things. If imSim returns fixed azel/airmass in the translator then ingest would be significantly sped up.

            Allowing the ObservationInfo to restrict calculations to only things we want calculated seems like a good compromise in the short term.
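            To make the "only calculate what is needed" idea concrete, here is a minimal sketch. RestrictedInfo, its calculators mapping, and the "ALT" header key are all hypothetical names invented for illustration; this is not the astro_metadata_translator API, just one possible shape for restricting work to a requested subset of properties.

```python
class RestrictedInfo:
    """Hypothetical sketch: evaluate only the requested properties.

    `calculators` maps a property name to a function of the header.
    `subset` names the properties to compute; None means all of them.
    """

    def __init__(self, header, calculators, subset=None):
        names = set(calculators) if subset is None else set(subset)
        unknown = names - set(calculators)
        if unknown:
            raise KeyError(f"Unknown properties: {sorted(unknown)}")
        # Only the selected calculators are ever called.
        self._values = {n: calculators[n](header) for n in sorted(names)}

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        values = self.__dict__.get("_values", {})
        if name in values:
            return values[name]
        raise AttributeError(name)


if __name__ == "__main__":
    calculators = {
        "zenith_angle": lambda hdr: 90.0 - hdr["ALT"],  # "ALT" is made up
        "airmass": lambda hdr: 1.0,  # placeholder for an expensive step
    }
    info = RestrictedInfo({"ALT": 60.0}, calculators, subset={"zenith_angle"})
    print(info.zenith_angle)
```

            Properties outside the subset are simply absent, so a caller that asks for something it did not request fails loudly with AttributeError rather than silently triggering an expensive calculation.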

            jchiang James Chiang added a comment -

            imSim does use the hour angle and the observatory latitude to model DCR. The airmass is also used directly in the PSF model.
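            For reference, the standard spherical-astronomy relation between hour angle H, declination δ, observatory latitude φ and zenith angle z is cos z = sin φ sin δ + cos φ cos δ cos H. The sketch below evaluates it and the plane-parallel sec z airmass approximation. It is an illustration only: it ignores refraction and the corrections used near the horizon, and it is not a description of how imSim or astropy actually compute these quantities. The function names are hypothetical.

```python
import math


def zenith_angle_deg(hour_angle_deg, dec_deg, latitude_deg):
    """Zenith angle from hour angle, declination and latitude (degrees)."""
    h, d, p = (math.radians(x) for x in (hour_angle_deg, dec_deg, latitude_deg))
    cos_z = math.sin(p) * math.sin(d) + math.cos(p) * math.cos(d) * math.cos(h)
    # Clamp against floating-point rounding before taking the arccosine.
    cos_z = max(-1.0, min(1.0, cos_z))
    return math.degrees(math.acos(cos_z))


def secz_airmass(zenith_deg):
    """Plane-parallel airmass, sec(z); only sensible well above the horizon."""
    if not 0.0 <= zenith_deg < 90.0:
        raise ValueError("zenith angle must be in [0, 90) degrees")
    return 1.0 / math.cos(math.radians(zenith_deg))


if __name__ == "__main__":
    z = zenith_angle_deg(hour_angle_deg=0.0, dec_deg=0.0, latitude_deg=-30.0)
    print(z, secz_airmass(z))
```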


              People

              Assignee:
              tjenness Tim Jenness
              Reporter:
              tjenness Tim Jenness
              Watchers:
              Eric Neilsen, Hsin-Fang Chiang, James Chiang, Jim Bosch, Michelle Gower, Robert Gruendl [X] (Inactive), Tim Jenness
