Data Management / DM-8339

Make a Pegasus workflow based on ci_hsc and Fall2016 interface

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Follow the pipeline steps in ci_hsc to make an abstract workflow in Pegasus in a shared-many-thing scheme, using the existing task/butler interface. This ticket ignores orchestration details, and the executor will not be used.

        Attachments

          Activity

          hchiang2 Hsin-Fang Chiang added a comment -

          Following the current ci_hsc, this workflow will include 10 tasks (processCcd, makeSkyMap, makeCoaddTempExp, assembleCoadd, detectCoaddSources, mergeCoaddDetections, measureCoaddSources, mergeCoaddMeasurements, forcedPhotCoadd, forcedPhotCcd), process the 33 input raw files and consider one patch (tract=0, patch="5,4").
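A rough sketch of how the 10 tasks could be ordered as a dependency graph. The task names come from the list above; the edges are only my assumed reading of the usual single-frame to coadd to forced-photometry ordering and are not taken from ci_hsc/SConstruct, so treat them as illustrative.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Task -> set of tasks it is ASSUMED to depend on (illustrative edges only).
ASSUMED_DEPS = {
    "processCcd": set(),
    "makeSkyMap": set(),
    "makeCoaddTempExp": {"processCcd", "makeSkyMap"},
    "assembleCoadd": {"makeCoaddTempExp"},
    "detectCoaddSources": {"assembleCoadd"},
    "mergeCoaddDetections": {"detectCoaddSources"},
    "measureCoaddSources": {"mergeCoaddDetections"},
    "mergeCoaddMeasurements": {"measureCoaddSources"},
    "forcedPhotCoadd": {"mergeCoaddMeasurements"},
    "forcedPhotCcd": {"mergeCoaddMeasurements", "processCcd"},
}


def run_order(deps):
    """Return one valid execution order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(ASSUMED_DEPS)
print(order)
```

Any order returned this way is valid for the assumed edges; the real workflow applies the same idea per job (per CCD, per filter) rather than per task.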

          hchiang2 Hsin-Fang Chiang added a comment -

          For some dataset types, a "schema" is created the first time the task is run, and is read as input by subsequent runs. I'm adding "pre-runs" that are run with no input science data; the schemas they write are tracked as their outputs via Pegasus (and used as input for the "real" runs with input science data). I assume all schemas can be pre-generated in pre-runs and do not depend on science data.

          ci_hsc does something similar: to avoid race conditions, pre-runs are done to create the schema/config/versions files.

          Uploading plots of the abstract workflow (dax). Also including a dax without the pre-runs, because the task sequence is clearer there.
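A minimal sketch of the pre-run idea, assuming job dependencies are derived purely from which job writes and which job reads each tracked file. The `Job` class, the job names and the file names are all hypothetical; they only illustrate that a single pre-run is the sole writer of a schema that every "real" run reads.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    inputs: set = field(default_factory=set)
    outputs: set = field(default_factory=set)


def derive_edges(jobs):
    """Return (producer, consumer) pairs implied by shared files."""
    producer = {}
    for job in jobs:
        for f in job.outputs:
            if f in producer:
                # Two writers for one file is exactly the race the pre-run avoids.
                raise ValueError(f"{f} written by {producer[f]} and {job.name}")
            producer[f] = job.name
    edges = set()
    for job in jobs:
        for f in job.inputs:
            if f in producer:  # files with no producer are external inputs (e.g. raw data)
                edges.add((producer[f], job.name))
    return edges


# Hypothetical file names; the pre-run takes no science data and writes only the schema.
jobs = [
    Job("processCcd_prerun", outputs={"schema/src_schema.fits"}),
    Job("processCcd_a", inputs={"raw/a.fits", "schema/src_schema.fits"},
        outputs={"calexp/a.fits"}),
    Job("processCcd_b", inputs={"raw/b.fits", "schema/src_schema.fits"},
        outputs={"calexp/b.fits"}),
]
edges = derive_edges(jobs)
print(sorted(edges))
```

Without the pre-run, both science jobs would have to list the schema as an output, and `derive_edges` would raise on the duplicate writer.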

          hchiang2 Hsin-Fang Chiang added a comment -

          The 33 raw data in ci_hsc were carefully selected to cover one patch in the sky, and tasks beyond ProcessCcd consider only one patch here. In this ticket I followed ci_hsc, and made no effort to generalize the script to handle multiple patches (--> future work).

          I also treat schema files the same way as other science data files, and ignore config and package version files, except for two config override files that I treat as input files.

          In total there are 95 jobs (see ), including:
          33 processCcd, 1 makeSkyMap, 11 makeCoaddTempExp, 2 assembleCoadd, 2 detectCoaddSources, 1 mergeCoaddDetections, 2 measureCoaddSources, 1 mergeCoaddMeasurements, 2 forcedPhotCoadd, 33 forcedPhotCcd, and 7 preruns (processCcd, detectCoaddSources, mergeCoaddDetections, measureCoaddSources, mergeCoaddMeasurements, forcedPhotCoadd, forcedPhotCcd). In total 324 files are tracked via Pegasus in this workflow (the yellow rectangles in dax_ciHsc_files_simplified.pdf).
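As a quick arithmetic check of the job tally above, the per-task counts can be summed directly. The numbers are copied from the comment; the dictionary layout is just for illustration.

```python
# Science jobs per task, as listed in the comment above.
science_jobs = {
    "processCcd": 33,
    "makeSkyMap": 1,
    "makeCoaddTempExp": 11,
    "assembleCoadd": 2,
    "detectCoaddSources": 2,
    "mergeCoaddDetections": 1,
    "measureCoaddSources": 2,
    "mergeCoaddMeasurements": 1,
    "forcedPhotCoadd": 2,
    "forcedPhotCcd": 33,
}

# One pre-run for each of these seven tasks.
prerun_tasks = [
    "processCcd", "detectCoaddSources", "mergeCoaddDetections",
    "measureCoaddSources", "mergeCoaddMeasurements",
    "forcedPhotCoadd", "forcedPhotCcd",
]

n_science = sum(science_jobs.values())
n_prerun = len(prerun_tasks)
total = n_science + n_prerun
print(n_science, n_prerun, total)  # 88 7 95
```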

          p.s. All 3 attached plots are simplified (i.e. redundant edges are removed).

          hchiang2 Hsin-Fang Chiang added a comment -

          The workflow was made by translating the task sequence in ci_hsc/SConstruct into a Pegasus dax, directly using HscMapper and the individual CmdLineTasks. Notes on the input/output of individual Pipeline tasks are here. The dax-generating code is here. So far I have run this workflow on a Nebula instance where HTCondor, Pegasus and the Stack are installed.
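For illustration only, here is a sketch of what writing such an abstract workflow as XML could look like, using just the Python standard library. It is not the dax-generating code linked above and does not use the Pegasus API; the element and attribute names (`adag`, `job`, `uses`, `child`, `parent`) are a simplified, DAX-like rendition, and the job IDs and file names are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_dax_like(name, jobs, edges):
    """Build a simplified DAX-like XML tree from job specs and (parent, child) edges."""
    adag = ET.Element("adag", name=name)
    for job_id, spec in jobs.items():
        job = ET.SubElement(adag, "job", id=job_id, name=spec["task"])
        for f in spec.get("inputs", []):
            ET.SubElement(job, "uses", name=f, link="input")
        for f in spec.get("outputs", []):
            ET.SubElement(job, "uses", name=f, link="output")
    # Group parents under each child, which is how the dependencies are listed.
    parents_of = {}
    for parent, child in edges:
        parents_of.setdefault(child, []).append(parent)
    for child, parents in sorted(parents_of.items()):
        node = ET.SubElement(adag, "child", ref=child)
        for parent in sorted(parents):
            ET.SubElement(node, "parent", ref=parent)
    return adag


# Hypothetical jobs: one pre-run writing a schema, one science run reading it.
jobs = {
    "ID0000001": {"task": "processCcd", "outputs": ["src_schema.fits"]},
    "ID0000002": {"task": "processCcd",
                  "inputs": ["raw_a.fits", "src_schema.fits"],
                  "outputs": ["calexp_a.fits"]},
}
edges = [("ID0000001", "ID0000002")]

dax = build_dax_like("ciHsc_sketch", jobs, edges)
xml_text = ET.tostring(dax, encoding="unicode")
print(xml_text)
```

A real dax carries more information (arguments, namespaces, executables), so this shows only the job/file/dependency skeleton.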

          hchiang2 Hsin-Fang Chiang added a comment -

          I added a new story to expand this dax to go beyond ci_hsc and include more than one patch: DM-8603.

          hchiang2 Hsin-Fang Chiang added a comment -

          Mikolaj Kowalik, from our meeting yesterday, I understand that you want a much simplified workflow based on HSC data, probably only including ProcessCcd. I think that can be extracted from this ciHsc workflow (by deleting all other jobs you don't want). Let me know if you want me to do that.

          Right now I don't have a workflow that splits the 3 sub-tasks of ProcessCcd into separate jobs. I can make one if you need it.
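The extraction mentioned above (deleting all jobs except ProcessCcd) could be sketched as a filter over the job list and its edges. This is only a sketch under assumed data structures: the `job_tasks` mapping, the job names and the edges are hypothetical and do not come from the actual ciHsc dax.

```python
def extract_subworkflow(job_tasks, edges, keep_tasks):
    """Keep only jobs whose task is in keep_tasks, and edges between kept jobs."""
    kept_jobs = {job for job, task in job_tasks.items() if task in keep_tasks}
    kept_edges = {(p, c) for (p, c) in edges if p in kept_jobs and c in kept_jobs}
    return kept_jobs, kept_edges


# Hypothetical slice of a workflow: job name -> task name.
job_tasks = {
    "processCcd_prerun": "processCcd",
    "processCcd_0": "processCcd",
    "processCcd_1": "processCcd",
    "makeSkyMap_0": "makeSkyMap",
    "makeCoaddTempExp_0": "makeCoaddTempExp",
}
edges = {
    ("processCcd_prerun", "processCcd_0"),
    ("processCcd_prerun", "processCcd_1"),
    ("processCcd_0", "makeCoaddTempExp_0"),
    ("processCcd_1", "makeCoaddTempExp_0"),
    ("makeSkyMap_0", "makeCoaddTempExp_0"),
}

kept_jobs, kept_edges = extract_subworkflow(job_tasks, edges, {"processCcd"})
print(sorted(kept_jobs))
print(sorted(kept_edges))
```

Because the pre-run belongs to the same task, it is kept along with the science jobs, so the schema dependency survives the extraction.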


            People

            • Assignee: hchiang2 Hsin-Fang Chiang
            • Reporter: hchiang2 Hsin-Fang Chiang
            • Watchers: Hsin-Fang Chiang, Mikolaj Kowalik, Rob Kooper, Steve Pietrowicz
