DM-29857

Create pure Gen 3 dataset management scripts for ap_verify datasets

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: ap_verify
    • Labels:

      Description

      Currently, ap_verify datasets are primarily based on uningested files, which are processed in Gen 2 and converted to Gen 3 as needed. This workflow will no longer work once Gen 2 is removed; we need a pure Gen 3 workflow to maintain and create datasets.

      Exact work TBD, but the following constraints apply:

      • raws are still held in uningested form; no changes needed
      • curated calibs are currently stored in a Gen 3 repository as a necessary artifact of the Gen 2 conversion process. It may be worth removing them from the repository and instead handling them from scratch during dataset ingestion; this would reduce dataset churn and download size (at the cost of longer ingestion). However, check whether this works in Jenkins; the obs_*_data package might not be available.
      • non-curated calibs are now stored in a Gen 3 repository, as there is no such thing as "calib ingestion". Presumably we would need to run the CP pipeline when making or updating these; should they be certified into dataset/preloaded, or export-imported?
      • templates would need to be created in a repository, as in Gen 2. Our Gen 2 approach of directory copying won't work; export-import should be fine, given our past experience with running the hybrid dataset.
      • skymaps are produced by template generation, but potentially conflict with curated calib handling. It may be worth creating a script for getting the skymap right just to avoid any pitfalls.
      • Refcats are not yet supported as of April 2021.

            Activity

            tjenness Tim Jenness added a comment -

            Why wouldn't the obs_*_data packages be available? They are part of lsst_distrib.

            tjenness Tim Jenness added a comment -

            Also, can you explain how curated calibrations and skymaps interfere?

            I think given the discussion on DM-29543, if you have the specific refcat already then your side would be to create the CSV ingest file and run butler ingest-files and butler register-dataset-type.
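As a rough illustration of the "create the CSV ingest file" step, the sketch below writes a table that pairs each refcat shard file with its HTM index. It assumes, hypothetically, that the shards sit in one directory and are named `<index>.fits`; the function name, the column names `filename` and `htm7`, and the file layout are illustrative assumptions, not something this ticket specifies.

```python
import csv
from pathlib import Path


def write_refcat_ingest_table(shard_dir, table_path, dimension="htm7"):
    """Write a CSV table mapping each refcat shard file to its HTM index.

    Assumption (hypothetical layout): every shard is named ``<index>.fits``.
    Files whose stem is not an integer are skipped.
    Returns the number of data rows written.
    """
    shard_dir = Path(shard_dir)
    rows = []
    for shard in shard_dir.glob("*.fits"):
        if shard.stem.isdigit():
            # First column: absolute file path; second column: data ID value.
            rows.append((str(shard.resolve()), int(shard.stem)))
    rows.sort(key=lambda row: row[1])  # deterministic order by HTM index

    with open(table_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", dimension])
        writer.writerows(rows)
    return len(rows)
```

The resulting table would then be handed to butler ingest-files, after the dataset type has been registered with butler register-dataset-type. The dataset type name, storage class, and run collection depend on the specific refcat and are not given here.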

            krzys Krzysztof Findeisen added a comment -

            When we were developing the current convert+import system, there were some problems from the (gen 2) skymap associated with templates having a name collision with the (gen 3) skymap produced during repository initialization. I don't remember the details (perhaps the skymaps weren't quite identical?).

            sullivan Ian Sullivan added a comment -

            Following DM-31281, this ticket is to implement #3 and #4 from the list of steps needed:

            1) Talk to cp_pipe folks (Chris W?) to learn best practices for running a default CP Pipeline on a new pile of raw calibs + science images for any given instrument
            2) Write a "definitive" version of ApTemplate.yaml for building good-seeing templates from a pile of raw science images for any given instrument, and double check the default skymap choices will work for us
            3) Talk to refcat experts (John P?) to learn how to import an appropriate refcats collection to a new repository
            4) Assemble the above into a script that turns a new pile of raw calibs + science images into an "ap_verify data set," presumably in the form of a single chained collection in a new repo, ready for ApVerify.yaml

            krzys Krzysztof Findeisen added a comment -

            In an attempt to break this up into more manageable pieces, I've moved #3 to its own issue, DM-32389. The revised scope of this issue is to combine the pieces of DM-32388, DM-31681, and DM-32389 to create a script that:

            1. Creates a Gen 3 repository (the current preloaded) containing curated calibs. (As noted in the OP, we may want to remove these from the ap_verify dataset format, but that's best left for after we no longer need to sync with Gen 2.)
            2. Creates and certifies master calibs for science data (in /repo/main:u/*?).
            3. Imports these master calibs into preloaded.
            4. Creates and certifies master calibs for template generation (in /repo/main:u/*?).
            5. Creates templates (in /repo/main:u/*?).
            6. Imports these templates into preloaded.
            7. Imports refcats into preloaded.

            The issue is considered complete once we have created and tested the script. We should not actually commit the results to ap_verify datasets while we are still supporting Gen 2, to ensure that the Gen 2 and Gen 3 repositories are exactly equivalent – reprocessing the calibs and templates with current pipelines will almost certainly change the results!
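The seven steps listed above could be organized as an ordered plan with a dry-run mode, which lets the sequence be checked before committing to a long run. This is only a minimal sketch: the `Step` class, the step names, and the `run` function are hypothetical, and the actual work of each step (running pipelines, certifying calibs, importing into the repository) is left to caller-supplied actions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str         # hypothetical identifier for the step
    location: str     # where the step writes its outputs
    description: str


SCRATCH = "/repo/main:u/*"  # user collections, as suggested in the ticket
PRELOADED = "preloaded"     # the dataset's Gen 3 repository

STEPS = [
    Step("create_repo", PRELOADED, "Create Gen 3 repository with curated calibs"),
    Step("science_calibs", SCRATCH, "Create and certify master calibs for science data"),
    Step("import_science_calibs", PRELOADED, "Import science master calibs"),
    Step("template_calibs", SCRATCH, "Create and certify master calibs for templates"),
    Step("templates", SCRATCH, "Create templates"),
    Step("import_templates", PRELOADED, "Import templates"),
    Step("import_refcats", PRELOADED, "Import refcats"),
]


def run(steps, actions=None, dry_run=True):
    """Walk the steps in order and return the names of the completed ones.

    In dry-run mode the plan is only printed. Otherwise each step's callable
    is looked up in ``actions``; a missing entry raises KeyError, and any
    exception from an action stops the run before later steps execute.
    """
    actions = actions or {}
    done = []
    for number, step in enumerate(steps, start=1):
        print(f"[{number}/{len(steps)}] {step.description} -> {step.location}")
        if not dry_run:
            if step.name not in actions:
                raise KeyError(f"no action registered for step {step.name!r}")
            actions[step.name]()
        done.append(step.name)
    return done
```

Calling `run(STEPS)` prints the plan without touching any repository. Passing `dry_run=False` with a dictionary of callables would execute the steps in order.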

            krzys Krzysztof Findeisen added a comment -

            Thank you for agreeing to review this, Eric Bellm. The primary development was done in ap_verify_ci_hits2015; the scripts in ap_verify_hits2015 are nearly identical copies.

            ebellm Eric Bellm added a comment -

            I've still got the CI generate_all_gen3.sh script running, but it seems to be executing fine. I'm going to go ahead and mark this as approved for now so it is unblocked, and will check back in on the off chance something goes astray.

            I did wonder if there are any changes to the installation instructions needed at https://pipelines.lsst.io/modules/lsst.ap.verify/datasets-install.html to handle the gen3-only case?

            krzys Krzysztof Findeisen added a comment - - edited

            Note that, as written in the comments, the CI generate_all_gen3.sh takes 10 hours to run.

            Dataset installation is unaffected by this transition, as the basic format of the datasets (i.e., as EUPS packages) hasn't changed. Other documentation changes, if any, will be covered on DM-33150.

              People

              Assignee:
              krzys Krzysztof Findeisen
              Reporter:
              krzys Krzysztof Findeisen
              Reviewers:
              Eric Bellm
              Watchers:
              Eric Bellm, Ian Sullivan, Krzysztof Findeisen, Meredith Rawls, Tim Jenness