Details
- Type: Story
- Status: Done
- Resolution: Done
- Fix Version/s: None
- Component/s: obs_base
- Labels: None
- Story Points: 10
- Team: Ops Middleware
- Urgent?: No
Description
As we move to a cloud-based IDF we have to consider that we will not always want to seed gen3 repositories from raw data on local disk. Eventually raw files will live in cloud storage, and we do not want to copy them all down to local files simply to read their metadata.
Metadata extraction is the key phase that needs to be modified to support cloud storage. S3Datastore, for example, can already copy from one S3 bucket to another (although I understand that this is currently quite inefficient because it involves downloading the file first).
To support metadata extraction in RawIngestTask we need to consider the following options:
- Requesting, say, the first 10,000 bytes and parsing the header from those (this may require gunzipping the partial byte stream); a rough sketch follows this list.
- Requiring that the cloud storage has a corresponding .txt file containing the header for every .fits file to be ingested.
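For the first option, a rough sketch of what a ranged GET with boto3 plus astropy header parsing might look like; the helper name, bucket/key arguments, and the 10,000-byte default are illustrative assumptions, not existing obs_base code:
{code:python}
import io
import zlib

import boto3
from astropy.io import fits


def read_primary_header(bucket, key, nbytes=10_000):
    # NOTE: this helper and its arguments are placeholders for illustration.
    s3 = boto3.client("s3")
    # Ranged GET: only the first nbytes come over the network.
    response = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes=0-{nbytes - 1}")
    data = response["Body"].read()

    if key.endswith(".gz"):
        # A streaming decompressor tolerates a truncated gzip stream,
        # unlike gzip.decompress(), which requires the end-of-stream marker.
        data = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16).decompress(data)

    # Parse FITS cards up to the END card; padding=False tolerates a
    # partial final 2880-byte block.
    return fits.Header.fromfile(io.BytesIO(data), endcard=True, padding=False)
{code}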
The second option is more explicit and could handle DECam data, where multiple datasets live in a single file and so we cannot simply read the first N bytes (a sketch follows below).
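For the second option, something along these lines could work, assuming a convention where the sidecar header text file sits next to the FITS file with one card per line (the helper name and the ".txt" naming rule are assumptions):
{code:python}
import boto3
from astropy.io import fits


def read_sidecar_header(bucket, fits_key):
    # NOTE: helper name and the ".txt next to .fits" convention are assumptions.
    s3 = boto3.client("s3")
    sidecar_key = fits_key.rsplit(".fits", 1)[0] + ".txt"
    body = s3.get_object(Bucket=bucket, Key=sidecar_key)["Body"].read()
    # Assume one FITS card per line in the sidecar text file.
    return fits.Header.fromstring(body.decode("ascii"), sep="\n")
{code}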
This ticket is to explore options for reimplementing the extractMetadata method in RawIngestTask.
AWS does seem to support S3-to-S3 copies without a download: https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html and there are hints that awscli is far faster than a boto3 copy.
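For reference, a minimal boto3 sketch of a server-side copy using the CopyObject API (bucket and key names are placeholders); a single CopyObject request only handles objects up to 5 GB, above which a multipart copy is required:
{code:python}
import boto3

# NOTE: bucket and key names are placeholders for illustration.
s3 = boto3.client("s3")

# CopyObject performs the copy inside S3; the bytes do not pass through
# the client. Objects larger than 5 GB need a multipart copy instead.
s3.copy_object(
    Bucket="destination-bucket",
    Key="raw/exposure.fits",
    CopySource={"Bucket": "source-bucket", "Key": "raw/exposure.fits"},
)
{code}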