Data Management / DM-9292

validate_drp processing metadata should include basic provenance

    Details

    • Type: Story
    • Status: To Do
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: QA
    • Labels: None

      Description

The validateDrp.py per-filter output files do not include basic metadata such as the camera name, telescope, or test dataset descriptor/name. This means that an external "source of truth" must be consulted to put the DRP run into context. Such metadata should be included either in the per-filter output files or in a new "summary" metadata file that describes the entire test run.
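
Such a summary file could, for example, be a small JSON document written alongside the per-filter outputs. A minimal sketch of what it might contain (all keys and values here are illustrative assumptions, not an existing validateDrp.py format):

    import json

    # Hypothetical provenance summary for one validate_drp run; these
    # fields do not exist in validateDrp.py today -- they illustrate the
    # kind of metadata this ticket asks for.
    run_metadata = {
        "camera": "megacam",                # camera name
        "telescope": "CFHT",                # telescope
        "dataset": "validation_data_cfht",  # test dataset descriptor/name
        "filters": ["r"],                   # filters processed in this run
    }

    with open("validate_drp_summary.json", "w") as f:
        json.dump(run_metadata, f, indent=2)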

Activity

Michael Wood-Vasey added a comment -

That should be good for now. Let's solve the problem for cases where the input data repo is git versioned. That will work for validation_data_cfht, validation_data_decam, and validation_data_hsc, which will be the datasets we run daily.
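
For those git-versioned datasets, the dataset version could be read straight from the repository at run time. A minimal sketch, assuming the path to the dataset repo is known (the function and output keys are illustrative, not existing validate_drp code):

    import subprocess

    def dataset_git_provenance(repo_path):
        """Capture the exact git state of a versioned dataset repo,
        e.g. validation_data_cfht."""
        def git(*args):
            return subprocess.check_output(
                ("git", "-C", repo_path) + args, text=True).strip()
        return {
            "git_sha": git("rev-parse", "HEAD"),
            "git_describe": git("describe", "--always", "--dirty"),
        }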

Joshua Hoblitt added a comment -

I think this may raise another issue. The qadb schema doesn't have a concept of dataset versions.

http://sqr-008.lsst.io/en/latest/#the-qa-database

Jonathan Sick added a comment -

To preserve the portability of validate_drp, I think all of this provenance tracking could usefully happen outside of validate_drp in post-qa. post-qa is where we currently insert lsstsw package provenance, for example.

This means we can put logic about the dataset into the validate_drp Jenkins job, which directly controls both validate_drp and post-qa.

This still works nicely in the future when post-qa might be replaced by a SuperTask.
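
Since the Jenkins job drives both steps, it could collect the dataset provenance itself and hand it to post-qa, e.g. through environment variables. A minimal sketch of that shim (the variable names and the shape of the job document are assumptions, not the actual post-qa interface):

    import json
    import os

    def attach_dataset_provenance(job_json_path):
        """Merge dataset provenance exported by the Jenkins job into a
        validate_drp job document before it is uploaded to SQUASH."""
        with open(job_json_path) as f:
            job = json.load(f)
        job.setdefault("meta", {})["dataset"] = {
            "name": os.environ.get("DATASET_NAME"),        # e.g. validation_data_cfht
            "git_sha": os.environ.get("DATASET_GIT_SHA"),  # recorded by the Jenkins job
        }
        with open(job_json_path, "w") as f:
            json.dump(job, f, indent=2)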

Michael Wood-Vasey added a comment -

Yes, I like this very much.

Jonathan Sick added a comment -

One difficulty is that I think we'd like each metric measurement to say exactly what data it was produced from. In validate_drp, we're subsetting the input butler repository by selecting datarefs for a specific filter; thus a given measurement isn't simply coming from data repo "X" (which is itself git-versioned), it's coming from a specific subset of that repo.

I think we could address this issue by decomposing validate_drp into sub-tasks. At a basic level, there could be a task that runs all measurements for a single filter. That way the validate_drp Jenkins harness would know exactly which datarefs are going into the runOneFilter-type computations. validate_drp would still ship with a wrapper task that iterates over all available filters in a Butler repo, running this "runOneFilter" sub-task for each. A rough sketch of that decomposition follows below.

tl;dr: a post-qa + Jenkins job shim should work for now to fulfill this ticket and get an MVP with multiple datasets into SQUASH, but I think in this epic we'll want a design session where we think more holistically about data provenance tracking and look at how we can adopt SuperTasks to make it happen.
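
The sketch below surfaces the exact datarefs to the caller so the harness can record them as provenance. The task interface and measure_metrics are placeholders, not the current validate_drp API; the Butler calls follow the Gen2 interface:

    def measure_metrics(datarefs):
        """Placeholder for the per-filter metric computations (PA1, AM1, ...)."""
        raise NotImplementedError

    def run_one_filter(butler, filter_name):
        """Hypothetical sub-task: run all measurements for one filter and
        return them together with the exact datarefs used."""
        datarefs = [ref for ref in butler.subset("src")
                    if ref.dataId.get("filter") == filter_name]
        measurements = measure_metrics(datarefs)
        return {"filter": filter_name,
                "datarefs": [dict(ref.dataId) for ref in datarefs],
                "measurements": measurements}

    def run_all_filters(butler, filters):
        """Wrapper task shipped with validate_drp: iterate over the
        available filters, running the sub-task for each."""
        return [run_one_filter(butler, f) for f in filters]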


People

• Assignee: Unassigned
• Reporter: Joshua Hoblitt
• Watchers: Angelo Fausti, Jonathan Sick, Joshua Hoblitt, Michael Wood-Vasey
• Votes: 0
