Data Management / DM-4141

cmdLineTasks should provide proper unix return codes

    Details

    • Type: Improvement
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: pipe_base, pipe_drivers
    • Labels: None
    • Story Points: 1
    • Team: DM Science

      Description

      When a cmdLineTask fails, it doesn't appear to return a non-zero exit code to the shell, which makes it hard to write shell scripts that chain commands together.

      Please fix this.

      E.g.

      $ bin/assembleCoadd.py /lustre/Subaru/SSP --rerun yasuda/SSP3.8.5_20150810_cosmos:rhl/brightObjectMasks --id tract=9813 patch=5,5 filter=HSC-I --selectId ccd=0..103 visit=1238..1246:2 -c doMaskBrightObjects=True && echo "Success" 
      ...
      2015-10-22T02:44:13: assembleCoadd FATAL: Failed in task initialization: 
       
      Your Eups versions have changed.  The difference is: 
      --- 
      +++ 
      @@ -48 +48 @@
      -obs_subaru                     HSC-3.11.0a_hsc  /data1a/ana/products2014/Linux64/obs_subaru/HSC-3.11.0a_hsc
      +obs_subaru                     LOCAL:/home/rhl/LSST/obs/subaru rev:ef3c892f clean-working-copy
      @@ -55 +55 @@
      -pipe_tasks                     LOCAL:/home/rhl/LSST/pipe/tasks-HSC-1342 rev:84b0f3c4 2 files changed, 5 insertions(+), 6 deletions(-)
      +pipe_tasks                     LOCAL:/home/rhl/LSST/pipe/tasks-HSC-1342 rev:9e8ed18b 2 files changed, 47 insertions(+), 42 deletions(-)
      @@ -60 +59,0 @@
      -pyflakes                       git              /home/rhl/Src/pyflakes
      Please run with --clobber-config to override
       
      Success
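
      For context, this is the kind of chaining a correct exit status would make reliable. A minimal driver-script sketch, reusing the arguments from the example above and assuming the task exits non-zero on failure:

      # Gate the next step on the task's exit status:
      bin/assembleCoadd.py /lustre/Subaru/SSP --rerun yasuda/SSP3.8.5_20150810_cosmos:rhl/brightObjectMasks \
          --id tract=9813 patch=5,5 filter=HSC-I --selectId ccd=0..103 visit=1238..1246:2 \
          -c doMaskBrightObjects=True || {
          echo "assembleCoadd.py failed with status $?" >&2
          exit 1
      }
      echo "Success"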
      


            Activity

            Robert Lupton added a comment -

            I was just bitten by this again so I fixed it.

            Processing each dataRef now returns a member exitStatus in its return struct, and if any dataRefs return a non-zero value, the number of failed dataRefs is passed to sys.exit in parseAndRun. This behaviour may be overridden by passing --noExit to the command-line task.

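            As an illustration of the behaviour described above, using a representative command-line task and placeholder data IDs (the exact status value simply reflects how many dataRefs failed):

                $ processCcd.py REPO --id visit=1234 ccd=0..9
                ...
                $ echo $?
                2    # two dataRefs failed, so parseAndRun passed 2 to sys.exit
                $ processCcd.py REPO --id visit=1234 ccd=0..9 --noExit
                ...
                $ echo $?
                0    # --noExit skips the sys.exit call, so the process exits normally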
            Robert Lupton added a comment -

            I don't think this will be a problem for MPI, but you'd know.

            Paul Price added a comment -

            This is a very useful improvement.
            I made some nitpick comments on the GitHub PRs.

            Robert Lupton added a comment -

            Merged and pushed both products.

            Tom Glanzman added a comment -

            Jim Bosch recommended I report a return-code issue here. I am running the makeBrighterFatterKernel.py script at NERSC using Parsl. The problem I see is compound. First, when running multiple instances of this script in parallel against the same repo, some fraction crash due to contention over the contents of the .../runinfo/<label>/config directory. A typical failure looks like this:

            makeBrighterFatterKernel FATAL: Failed in task initialization: [Errno 2] No such file or directory: '/global/cscratch1/sd/descdm/tomTest/bf_repoA/rerun/20190627/config/makeBrighterFatterKernel.py'

            This may not be unexpected, given comments in the code about the possible consequences of running multiple instances in parallel. However, when this failure occurs, the return code is zero. This is clearly a bug from my perspective.

            Code being run: dm stack w_2019_19 along with cp_pipe-DM-18683-w_2019_19

            Command:

            /global/common/software/lsst/cori-haswell-gcc/DC2/bf_kernel/software/cp_pipe-DM-18683-w_2019_19/cp_pipe/bin/makeBrighterFatterKernel.py /global/cscratch1/sd/descdm/tomTest/bf_repoA --rerun 20190627 --id detector=25..49 --visit-pairs 5000510,5000525 5000530,5000540 5000550,5000560 5000570,5000580 5000410,5000420 5000430,5000440 5000450,5000460 5000470,5000480 5000310,5000320 5000330,5000340 5000350,5000360 5000370,5000380 5000210,5000220 5000230,5000240 5000250,5000260 5000270,5000280 5000110,5000120 5000130,5000140 5000150,5000160 5000170,5000180 -c xcorrCheckRejectLevel=2 doCalcGains=True isr.doDark=True isr.doBias=True isr.doCrosstalk=True isr.doDefect=False isr.doLinearize=False forceZeroSum=True correlationModelRadius=3 correlationQuadraticFit=True level=AMP --clobber-config -j 25

            • The fraction of instances that crash with this or a similar error varies from ~20% when running a full 189-sensor focal plane with 189 separate script instances, down to <1% when employing the "-j" option and reducing the number of script instances to ~10.

              Thanks for any help,

                - Tom
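
            Until the task itself exits non-zero for this failure mode, a wrapper along the following lines can fall back on the log text. This is only a workaround sketch: the shortened visit-pair list and the log handling are assumptions, not part of the original run.

                log=$(mktemp)
                makeBrighterFatterKernel.py /global/cscratch1/sd/descdm/tomTest/bf_repoA --rerun 20190627 \
                    --id detector=25..49 --visit-pairs 5000510,5000525 -j 25 2>&1 | tee "$log"
                status=${PIPESTATUS[0]}
                # Treat either a non-zero exit status or a FATAL line in the log as a failure.
                if [ "$status" -ne 0 ] || grep -q "FATAL" "$log"; then
                    echo "makeBrighterFatterKernel.py failed (exit status $status)" >&2
                    exit 1
                fi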


              People

              • Assignee: Robert Lupton
              • Reporter: Robert Lupton
              • Reviewers: Paul Price
              • Watchers: Hsin-Fang Chiang, John Swinbank, Lauren MacArthur, Paul Price, Robert Lupton, Russell Owen, Tom Glanzman
              • Votes: 0
