Data Management / DM-33086

Have plan for dealing with post-ingest file compression

    Details

    • Story Points:
      2
    • Team:
      Architecture
    • Urgent?:
      No

      Description

      Yousuke Utsumi reported that at SLAC they can no longer read any raw data using the butler. The problem turns out to be that the files were compressed with fpack after ingestion and so the file sizes are, of course, now different to what they were originally. This breaks the very simple file integrity check implemented in the butler.

      Some options for us:

      • Have a flag in butler ingest-files and butler ingest-raws indicating that the file sizes of the ingested files should not be tracked at all. This should only be supported for link or direct ingest modes; a file that was copied into the datastore should never be modified by an external entity.
      • Provide a butler admin command that will force datastore to recalculate the file size for files in the given collection and dataset type.

      It sounds like both of these might be needed; a rough command-line sketch of each follows.
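
      For concreteness, here is a hedged sketch of what the two options might look like on the command line. The flag and command spellings below are hypothetical placeholders for illustration, not a designed interface:

          # Option 1 (hypothetical flag name): ingest without tracking file sizes.
          # Only sensible for link or direct transfer modes.
          butler ingest-raws REPO /path/to/raws --transfer direct --no-track-file-sizes

          # Option 2 (hypothetical admin command): recompute the recorded file sizes
          # for a given collection and dataset type after the fact.
          butler admin refresh-file-info REPO --collections RUN_COLLECTION --dataset-type raw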

            Activity

            Tim Jenness added a comment:

            If we want something today, the best approach may be to call butler prune-datasets on the files in question and then run butler ingest-raws again. The prune won't delete the files, and they will be re-ingested with the correct file size.
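
            A rough sketch of that workaround, with placeholder repo, collection, and path names; the exact prune options vary between daf_butler versions, so check butler prune-datasets --help before running:

                # Purge the stale dataset records for the run; per the comment above,
                # this should not delete the underlying raw files themselves.
                butler prune-datasets REPO RUN_COLLECTION --datasets raw --purge RUN_COLLECTION

                # Re-ingest so datastore records the new, post-fpack file sizes.
                butler ingest-raws REPO /path/to/compressed/raws --transfer direct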

            Robert Lupton added a comment:

            Depending on how we run things in Chile, we may need to consider putting more FITS-file smarts into the registry. It's not a clean software design, but it may be how Rubin needs to work.

            Tim Jenness added a comment:

            Not registry. This is datastore, which is a completely distinct butler subsystem. Datastore itself doesn't know anything about FITS files or any other file type; it just understands files. The part of the system that actually understands file formats is the Formatter class associated with each file. Having every butler.get call a data validation method in the formatter is very easy to do, but when we have tried similar things in the past the biggest complaints have been about the overhead that entails. Checking the file size was deemed to be a very simple compromise because the overhead of a single stat() call (or even an API call to the object store) is barely noticed.

            Allowing a Formatter to do checksum verification as part of the read it is already performing can be left up to the formatter and doesn't need any change to butler itself. The formatter could support a "check" parameter that allows the caller to enable or disable the checksum test. (Does the afw FITS reader support checksum verification?)
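
            As an aside, FITS CHECKSUM/DATASUM verification can already be done out of band with astropy's fitscheck command-line utility; this is the kind of test a Formatter-level "check" parameter could run as part of the read:

                # Verify the CHECKSUM/DATASUM keywords in every HDU of the file;
                # a non-zero exit status indicates a failed verification.
                fitscheck /path/to/compressed/raw.fits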

            Robert Lupton added a comment:

            OK, fair point about performance.

            James Chiang added a comment:

            Of the two options presented in the description of this issue, I'd say the first one, an option not to have the file sizes tracked, would be preferred in the near term. We did try pruning and re-ingesting, but that takes as long as the original ingest and so isn't very practical given the weeks of data we've ingested already. Similarly, we tried setting the file_size column in the file_datastore_records table to -1 so that the test in the fileDatastore.py code would be skipped, but given the number of records in that table (>3M), the update command hadn't finished after ~1 hour, at which point the db connection itself was lost. In order to carry on, we've opted to clone a local copy of daf_butler and modify the code to disable the file size tracking.

            Tim Jenness added a comment:

            James Chiang I've implemented option 1. Please take a look. There are two pull requests, one for obs_base and one for daf_butler; it's mostly a matter of passing one flag all the way through to the right datastore method.

            butler ingest-raws --no-track-files-attrs is the new option. I ended up with different parameter names for ingest-raws and daf_butler because ingest-raws knows for sure that there are files involved, but in theory a future butler ingest might not.
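
            For example, with a placeholder repo and path (check butler ingest-raws --help for the exact option spelling in your release):

                # Ingest raws without recording file attributes such as size, so a
                # later fpack compression won't trip the datastore integrity check.
                butler ingest-raws REPO /path/to/raws --transfer direct --no-track-files-attrs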

            James Chiang added a comment:

            Thanks! LGTM.


              People

              Assignee:
              Tim Jenness
              Reporter:
              Tim Jenness
              Reviewers:
              James Chiang
              Watchers:
              James Chiang, John Parejko, Robert Lupton, Steve Pietrowicz, Tim Jenness, Yousuke Utsumi

