In my first test I unzipped all the .gz files and replaced them with fpacked files so that we do not get a slowdown from gunzipping to read metadata.
With profiling enabled, a single-threaded ingest of the 450 files (from imsim, latiss, phosim, ts8, and ts3) takes about 20 seconds (remember that checksum calculation is disabled).
About 6 seconds of that is importing Python code, so the ingest phase itself (how long run takes) is about 14 seconds.
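For anyone wanting to reproduce this kind of split, here is a minimal sketch of timing a single call under cProfile, using only the standard library. The function fake_ingest is a hypothetical stand-in and not the real ingest code; import time is not captured by this and would need to be measured separately (for example with python -X importtime).

```python
import cProfile
import io
import pstats
import time


def fake_ingest(n_files):
    """Hypothetical stand-in for the ingest run; does a little work per file."""
    total = 0
    for _ in range(n_files):
        total += sum(range(1000))
    return total


def profile_call(func, *args):
    """Run func under cProfile and return (result, wall_seconds, report_text)."""
    profiler = cProfile.Profile()
    start = time.perf_counter()
    result = profiler.runcall(func, *args)
    elapsed = time.perf_counter() - start

    # Render the ten most expensive entries, sorted by cumulative time.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(10)
    return result, elapsed, buffer.getvalue()


result, elapsed, report = profile_call(fake_ingest, 450)
print(f"ingest phase: {elapsed:.3f}s")
print(report)
```

Sorting by cumulative time is what makes it easy to see which high-level step (metadata calculation, camera loading, coordinate work) dominates.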
Calculating all the ingest metadata takes nearly 10 seconds (much of that seems to be down to calculating AzEl/Airmass for the on sky data).
3 seconds of that is reading the YAML camera (for detector mapping) and nearly 4 seconds is spent in astropy coordinates.
Butler ingest only takes about 1 second.
That is, metadata extraction accounts for about 70% of the ingest phase (roughly 10 of the 14 seconds), whereas the butler ingest itself is only about 7%. If this holds generally and if we intend to reingest multiple times, by far the best approach seems to be a mode where we can read pre-calculated ObservationInfo from a separate file. This would also help ingest over S3 (for example, writing a .yaml file with the same name as the .fits file and in the same place). The downside is that a separate piece of code has to run to do this metadata extraction and persist it somewhere for later use by ingest.
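A rough sketch of what the sidecar lookup could look like, assuming a file of the same name next to the data file. Everything here is hypothetical: extract_metadata stands in for the expensive header translation, the returned dict stands in for ObservationInfo, and JSON is used instead of YAML purely to keep the example within the standard library.

```python
import json
import tempfile
from pathlib import Path


def extract_metadata(path):
    """Hypothetical stand-in for the expensive metadata extraction step."""
    return {"instrument": "LATISS", "source": Path(path).name}


def sidecar_path(path):
    """The sidecar sits next to the data file, same name, different suffix."""
    return Path(path).with_suffix(".json")


def write_sidecar(path):
    """Run the extraction once and persist the result for later ingests."""
    info = extract_metadata(path)
    sidecar_path(path).write_text(json.dumps(info))
    return info


def read_metadata(path):
    """Return (metadata, from_sidecar), preferring the pre-calculated file."""
    sidecar = sidecar_path(path)
    if sidecar.exists():
        return json.loads(sidecar.read_text()), True
    # No sidecar: fall back to calculating the metadata from the file itself.
    return extract_metadata(path), False


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "exposure_0001.fits"
    raw.write_bytes(b"")  # empty placeholder, not a real FITS file
    first = read_metadata(raw)   # no sidecar yet, so extraction runs
    write_sidecar(raw)
    second = read_metadata(raw)  # sidecar found, extraction skipped
```

Falling back to extraction when the sidecar is missing means the ingest still works for files that have not been pre-processed.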
For the record, checksum calculation adds an additional 15 seconds.
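To illustrate why the checksum is costly (every byte of every file has to be read), here is a minimal sketch of a chunked file checksum using hashlib. The choice of blake2b and the block size are assumptions for illustration; this comment does not say which algorithm the ingest uses.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path, algorithm="blake2b", block_size=8192):
    """Hash a file in fixed-size blocks so it is never held fully in memory."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(block_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    data = b"raw pixel data" * 10000
    target = Path(tmp) / "exposure_0001.fits"
    target.write_bytes(data)
    digest = file_checksum(target)
    expected = hashlib.blake2b(data).hexdigest()
```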
This is all locally on my iMac with sqlite and an attached SSD over Thunderbolt.