  1. Data Management
  2. DM-25965

Refactor gen3 raw ingest to support remote files

    Details

    • Story Points:
      10
    • Team:
      Ops Middleware
    • Urgent?:
      No

      Description

      As we move to a cloud-based IDF, we cannot assume that gen3 repositories will always be seeded with raw data from local disk. Eventually raw files will live in cloud storage, and we do not want to copy them all down to local files simply to read the metadata.

      Metadata extraction is the key phase that needs to be modified to support cloud storage. S3Datastore, for example, can already copy from one S3 bucket to another (although I understand that this is itself incredibly inefficient and involves downloading the file).

      To support metadata extraction in RawIngestTask we need to consider the following options:

      1. Requesting, say, the first 10,000 bytes and parsing the header from those (this may include forcing a gunzip of the partial byte stream).
      2. Requiring that the cloud storage has a corresponding .txt file containing the header for every .fits file to be ingested.

      The second option is more explicit and could handle DECam data, where a single file contains multiple datasets and so we can't simply read the first N bytes.
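
As a rough illustration of option 1, here is a minimal sketch of parsing the primary FITS header out of a leading chunk of bytes. It assumes the bytes have already been fetched (for example with a ranged read); `parse_primary_header` and `maybe_gunzip` are hypothetical helper names, not part of RawIngestTask. The parser is deliberately simplified: it ignores CONTINUE cards and escaped quotes, and it only sees the primary HDU, which is exactly the DECam limitation noted above.

```python
import zlib
from typing import Dict, Optional

CARD_LENGTH = 80  # a FITS header card is 80 ASCII characters


def maybe_gunzip(prefix: bytes) -> bytes:
    """Decompress a (possibly truncated) gzip stream; pass other bytes through."""
    if prefix[:2] == b"\x1f\x8b":  # gzip magic number
        # wbits = MAX_WBITS | 16 selects gzip framing. A decompress object
        # returns whatever it can decode from a truncated stream.
        return zlib.decompressobj(wbits=zlib.MAX_WBITS | 16).decompress(prefix)
    return prefix


def parse_primary_header(prefix: bytes) -> Optional[Dict[str, str]]:
    """Return keyword -> raw value string, or None if END was not reached.

    A None result means the caller should request more bytes and retry.
    """
    header: Dict[str, str] = {}
    for offset in range(0, len(prefix) - CARD_LENGTH + 1, CARD_LENGTH):
        card = prefix[offset:offset + CARD_LENGTH].decode("ascii", errors="replace")
        keyword = card[:8].strip()
        if keyword == "END":
            return header
        if card[8:10] != "= ":
            continue  # COMMENT, HISTORY, or blank card
        raw = card[10:].lstrip()
        if raw.startswith("'"):
            # String value: take everything up to the closing quote.
            value = raw[1:].split("'", 1)[0].rstrip()
        else:
            # Anything after "/" is the card comment.
            value = raw.split("/", 1)[0].strip()
        header[keyword] = value
    return None
```

A caller would start with a guess such as the first 10,000 bytes and ask for a larger range whenever the function returns None.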

      This ticket is to explore options for reimplementing the extractMetadata method in rawIngestTask.

            Activity

            tjenness Tim Jenness added a comment -

            The AWS CLI does seem to support S3-to-S3 copies without a download: https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html and there are hints that awscli is far faster than a boto3 copy.

            tjenness Tim Jenness added a comment -

            This ticket is going to require the following changes:

            • Support for index or sidecar JSON files in ingest DM-27476
            • Adding os.walk functionality to ButlerURI
            • Switching RawIngestTask to use ButlerURI
            • For a remote URI (and no index/sidecar file) downloading either the entire file or first N bytes for local metadata extraction
            • For DECam, modifying the JSON format to allow HDUs to be specified.
            • Allowing ingest to process N files at a time (we do not want to be in a situation where we walk the tree and have a million hits) or else find all the leaf subdirectories and process them separately.
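
The last item in the list above (processing N files at a time rather than collecting a million hits) could look roughly like the sketch below. It uses the local `os.walk` as a stand-in, since the ButlerURI equivalent is only proposed here; `walk_files` and `batched` are hypothetical names chosen for illustration.

```python
import os
from itertools import islice
from typing import Iterable, Iterator, List


def walk_files(root: str, suffix: str = ".fits") -> Iterator[str]:
    """Lazily yield matching files under root in a stable order."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # in-place sort fixes the order os.walk descends in
        for name in sorted(filenames):
            if name.endswith(suffix):
                yield os.path.join(dirpath, name)


def batched(paths: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield lists of at most `size` paths without materialising the full walk."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(paths)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

Because both functions are generators, only one batch of paths is held in memory at a time, however large the tree is.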
            tjenness Tim Jenness added a comment -

            Following DM-29011, it makes sense to always process files per-directory and the new `ButlerURI.findFileResources` method does exactly that. It can group by directory if a directory is searched (if explicit files are given they return in a single group).

            For object store ingest Kian-Tat Lim suggests that we do not store index files or sidecar files in the same tree as the data files but instead store them in an identical but parallel tree. This has a number of advantages in that we can scan the tree faster (we won't be retrieving the FITS file keys) and we can easily regenerate a tree in a different place.

            Currently index files list their contents relative to the location of the index file, but with a parallel tree we'd probably want to change that so a prefix can be given to override the paths (the alternative is to store a full relative path to the file location, which is possible but may be undesirable). Also, sidecar files are currently discovered by being alongside the data file. Having them found as proxies for the data file, and convincing raw ingest not to ingest the JSON files themselves, will also need work. Finally, the astrometadata command would need to be modified to make it possible to create this parallel tree of either sidecar or index files.
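
To make the parallel-tree idea concrete, here is a small sketch of the two path mappings discussed above. Everything in it is an assumption for illustration: the function names, the choice of a `.json` suffix, and the treatment of paths as plain object keys within a bucket (not full URIs).

```python
from pathlib import PurePosixPath
from typing import Optional


def sidecar_in_parallel_tree(data_path: str, data_root: str, meta_root: str,
                             suffix: str = ".json") -> str:
    """Map a data file to the sidecar at the same relative path in meta_root."""
    relative = PurePosixPath(data_path).relative_to(data_root)
    # with_suffix only swaps the final extension, so "x.fits.gz" would become
    # "x.fits.json"; real code would need to decide how to treat that case.
    return str(PurePosixPath(meta_root) / relative.with_suffix(suffix))


def resolve_index_entry(entry: str, index_dir: str,
                        prefix: Optional[str] = None) -> str:
    """Resolve a file name listed in an index file.

    Without a prefix the entry is relative to the index file's own directory
    (the current behaviour described above); with a prefix it is relative to
    the overriding data location instead.
    """
    base = PurePosixPath(prefix if prefix is not None else index_dir)
    return str(base / entry)
```

For example, a data key of `raw/2021/a/img_001.fits` with roots `raw` and `meta` maps to `meta/2021/a/img_001.json`.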

            We also need to decide how these files are being created in operations. A sidecar file can be made as part of the file transfer process. An index file requires some knowledge of when all the files for that directory have been written (or else parallelized updating of the index file).

            We have previously suggested that we might want to use S3 metadata. The ObservationInfo JSON files are about 1 kB and would fit in the 2 kB limit. ButlerURI could be modified to have a getMetadata-type method that returns the metadata as a dict, and for a local file it could look for a .json sidecar file automatically. This won't work for the raw FITS header metadata that we are now advocating be used for these JSON files, since those headers exceed the 2 kB limit.
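
A minimal sketch of what the local-file side of such a getMetadata-type method might do, together with a size check against the 2 kB limit. The function names are hypothetical and this is not ButlerURI code. The size check is also only an approximation: it measures the compact JSON encoding of the dict, whereas the real limit applies to how the object store encodes the metadata keys and values.

```python
import json
from pathlib import Path
from typing import Optional

OBJECT_METADATA_LIMIT = 2048  # bytes; the 2 kB limit quoted above


def get_local_metadata(data_file: str) -> Optional[dict]:
    """Return the contents of a .json sidecar next to data_file, if present."""
    sidecar = Path(data_file).with_suffix(".json")
    if not sidecar.is_file():
        return None
    with sidecar.open() as handle:
        return json.load(handle)


def fits_in_object_metadata(metadata: dict,
                            limit: int = OBJECT_METADATA_LIMIT) -> bool:
    """Approximate check that metadata is small enough to store on the object."""
    encoded = json.dumps(metadata, separators=(",", ":")).encode("utf-8")
    return len(encoded) <= limit
```

Under this approximation a roughly 1 kB ObservationInfo summary passes the check, while a full raw header would fail it and would have to stay in a sidecar or index file.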

            The next step is to convert RawIngestTask to ButlerURI assuming JSON files are in-place with the FITS files.

            tjenness Tim Jenness added a comment -

            Jim Bosch would you please take a look at this? It's not very large although I had to fix some other packages.

            • Switch to ButlerURI in obs_base ingest
            • Ingest now groups files by directory (but not if you specify an explicit list of files – those are all done together). I am not sure if you want a switch for going back to the old "do them all at once" mode.
            • Add .pop method to PropertySet/List to match dict.pop
            • Add abspath() method to ButlerURI.
            • Fix error message from butler ingest if two FileDataset clash.
            • Fix extension handling in astro_metadata_translator sidecar code.
            tjenness Tim Jenness added a comment -

            I also had to fix obs_decam and obs_cfht since I had forgotten they had their own implementations of metadata extraction.

            jbosch Jim Bosch added a comment -

            Looks good! I left a few minor comments on various PRs, and one discussion piece that's probably out of scope for actually acting on here.


              People

              Assignee:
              tjenness Tim Jenness
              Reporter:
              tjenness Tim Jenness
              Reviewers:
              Jim Bosch
              Watchers:
              Frossie Economou, Hsin-Fang Chiang, Jim Bosch, Kian-Tat Lim, Michelle Gower, Mikolaj Kowalik, Tim Jenness