  1. Data Management
  2. DM-32061

export-calibs has problems with datasets using direct ingest


    Details

    • Story Points:
      3
    • Team:
      Architecture
    • Urgent?:
      No

      Description

      Quentin Le Boulc'h reported the following problem when exporting calibrations from the DC2 repo:

      $ butler export-calibs /repo/dc2 2.2i_calib_export  2.2i/calib/gen2
      ...
      FileExistsError: Destination path 'file:///home/quentinleboulch/2.2i_calib_export/2.2i/calib/gen2/20220101T000000Z/bias/bias_LSSTCam-imSim_2_2i_calib_gen2_20220101T000000Z.fits' already exists. Transfer from file:///datasets/DC2/DR6/Run2.2i/patched/2021-02-10/CALIB/bias/2022-01-01/bias-R01-S01-det001_2022-01-01.fits cannot be completed.
      

      There are a few problems, but the fundamental one is that the DatasetRefs returned from queryDatasetAssociations are not fully expanded. This might not be a problem in itself, but the way the export is set up triggers unexpected failures.

      • As a debugging aid it would be helpful if export-calibs allowed the user to specify the dataset type they wanted in addition to the calibration collection.
      • There is no way to set the transfer mode in this command so you always end up with auto.
      • If you want to skip the file export and just get the XML with paths relative to the datastore, there is no way to do it, since an output directory is always created.
      • "ingest" has logic to handle a "split" ingest transfer where some files are relative paths and some are absolute paths. Export does not know anything about this.
      • In the absence of a "split" transfer option, when the exporter encounters absolute URIs in direct mode it decides its only course of action is to create local copies in the output directory. To do that it needs to build a file name from the dataset ref.
      • The template wants to use the full detector name in the filename and therefore needs the detector record. Since the DatasetRefs are not fully expanded, it can't do that and silently leaves the detector out of the filename.
      • Without a detector in the filename the files are not unique and so everything breaks.
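
      The failure chain in the list above can be reproduced in a small, self-contained sketch. It does not use the real FileTemplate class: format_filename, the field names, and the sample data IDs are hypothetical stand-ins, chosen only to show how silently dropping a field that needs a dimension record makes two different detectors map to the same file name.

```python
from __future__ import annotations


def format_filename(
    data_id: dict[str, object],
    records: dict[str, dict[str, str]] | None,
) -> str:
    """Build a file name, silently skipping fields that need a missing record.

    Hypothetical stand-in for a file template; not the real butler API.
    """
    parts = ["bias", str(data_id["instrument"])]
    if records is not None:
        # Only an expanded data ID carries the detector record.
        parts.append(records["detector"]["full_name"])
    return "_".join(parts) + ".fits"


# Two datasets that differ only by detector (illustrative values).
unexpanded = [
    {"instrument": "LSSTCam-imSim", "detector": 0},
    {"instrument": "LSSTCam-imSim", "detector": 1},
]

# Without records the detector is dropped, so both names are identical.
names = [format_filename(d, records=None) for d in unexpanded]
collision = len(set(names)) < len(names)

# With the detector record attached, the names become unique.
expanded_names = [
    format_filename(d, records={"detector": {"full_name": f"R01_S0{d['detector']}"}})
    for d in unexpanded
]
```

      In this sketch the second copy would hit an existing destination, which is the same kind of condition that surfaces as the FileExistsError in the report.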

      This suggests many fixes should be made to deal with this mess:

      • Add a transfer option to the command.
      • Add a datasets option to the command.
      • Add support for "split" transfer mode to exportDatasets which would transfer files in the datastore but leave absolute URIs to files outside unchanged.
      • Expand the DatasetRefs retrieved by queryDatasetAssociations in case they will be used to copy a file from an absolute URI to a new location.
      • Consider making FileTemplate complain if a dimension record is needed but none is available.
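
      As a rough illustration of the "split" transfer and stricter-template items, the sketch below separates files that live under a datastore root from files referenced by absolute location elsewhere, and raises instead of silently dropping a template field. plan_split_export, strict_field, and the example paths are assumptions made for this sketch; the actual exportDatasets and FileTemplate code may be structured quite differently.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def plan_split_export(paths: list[str], datastore_root: str) -> dict[str, str]:
    """Map each source path to the path that would be recorded on export.

    Files inside the datastore root are recorded relative to it (and would be
    transferred); files outside keep their absolute path unchanged.
    Hypothetical helper; not part of the butler API.
    """
    root = PurePosixPath(datastore_root)
    plan: dict[str, str] = {}
    for p in paths:
        try:
            plan[p] = str(PurePosixPath(p).relative_to(root))
        except ValueError:
            # Not under the datastore root: leave the absolute location as-is.
            plan[p] = p
    return plan


def strict_field(records: dict[str, dict[str, str]], dimension: str, field: str) -> str:
    """Return a record field, raising if the dimension record is unavailable."""
    if dimension not in records:
        raise RuntimeError(
            f"Template needs {dimension}.{field} but no {dimension} record is available"
        )
    return records[dimension][field]


plan = plan_split_export(
    ["/repo/example/bias/a.fits", "/external/calib/b.fits"],
    datastore_root="/repo/example",
)
```

      The point of raising in strict_field is that a missing record becomes an immediate, descriptive error rather than a later file name collision.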

        Attachments

          Activity

          tjenness Tim Jenness added a comment -

          Krzysztof Findeisen would you be able to review these changes and see if they fix your problem?

          • Added transfer option.
          • Added ability to filter dataset types.
          • Added default collection.
          • Now DatasetRefs are expanded so that templates work if copy is used.
          • The template system now complains if a template needs an expanded ref but does not have one.

          Controversially, I also made the "direct" transfer mode synonymous with "None" (which is not an explicit option). This means that the files are not copied anywhere on export, but on import they are copied (or the split transfer mode can retain the original direct location for files that were not inside a datastore).

          tjenness Tim Jenness added a comment -

          Christopher Waters would you be able to review?

          czw Christopher Waters added a comment -

          Minor point on the PR.


            People

            Assignee:
            tjenness Tim Jenness
            Reporter:
            tjenness Tim Jenness
            Reviewers:
            Christopher Waters
            Watchers:
            Christopher Waters, James Chiang, Jim Bosch, Krzysztof Findeisen, Quentin Le Boulc'h, Tim Jenness
            Votes:
            0

