  Data Management / DM-20376

makeBrighterFatterKernel.py fails initialization with RC=0

    Details

    • Type: Bug
    • Status: To Do
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: cp_pipe
    • Labels:
      None
    • Team:
      Architecture

      Description

      (This report is basically copied over from DM-4141, as it does not quite fit there.)

      I am running the makeBrighterFatterKernel.py script at NERSC using Parsl.  The problem I see is compound.  First, when running multiple instances of this script in parallel against the same repo, some fraction[*] crash due to contention over the contents of the .../runinfo/<label>/config directory.  A typical failure looks like this:

      makeBrighterFatterKernel FATAL: Failed in task initialization: [Errno 2] No such file or directory: '/global/cscratch1/sd/descdm/tomTest/bf_repoA/rerun/20190627/config/makeBrighterFatterKernel.py'

      This may not be unexpected, given comments in the code about the possible consequences of running multiple instances in parallel.  However, when this failure occurs, the return code is zero.  That is clearly a bug from my perspective.
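As a rough sketch of the return-code complaint (this is not the actual pipe_base code; the function names here are made up for illustration), the fix amounts to having the task driver propagate a nonzero exit status when initialization fails, rather than logging FATAL and still exiting 0:

```python
import sys


def run_task(initialize, run):
    # Minimal illustration (not the pipe_base implementation): a driver
    # that returns a nonzero status when task initialization fails,
    # instead of swallowing the error and returning success.
    try:
        initialize()
    except OSError as exc:
        print(f"FATAL: Failed in task initialization: {exc}", file=sys.stderr)
        return 1  # nonzero exit code lets batch/workflow systems see the failure
    run()
    return 0


# Typical use at the bottom of a command-line script:
# sys.exit(run_task(init_fn, run_fn))
```

With this shape, a FileNotFoundError during initialization (a subclass of OSError) surfaces as exit status 1, which Parsl or any batch system can then detect.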

      Code being run: 

        dm stack w_2019_19 along with 

        cp_pipe-DM-18683-w_2019_19

      Command:

      /global/common/software/lsst/cori-haswell-gcc/DC2/bf_kernel/software/cp_pipe-DM-18683-w_2019_19/cp_pipe/bin/makeBrighterFatterKernel.py /global/cscratch1/sd/descdm/tomTest/bf_repoA --rerun 20190627 --id detector=25..49 --visit-pairs 5000510,5000525 5000530,5000540 5000550,5000560 5000570,5000580 5000410,5000420 5000430,5000440 5000450,5000460 5000470,5000480 5000310,5000320 5000330,5000340 5000350,5000360 5000370,5000380 5000210,5000220 5000230,5000240 5000250,5000260 5000270,5000280 5000110,5000120 5000130,5000140 5000150,5000160 5000170,5000180 -c xcorrCheckRejectLevel=2 doCalcGains=True isr.doDark=True isr.doBias=True isr.doCrosstalk=True isr.doDefect=False isr.doLinearize=False forceZeroSum=True correlationModelRadius=3 correlationQuadraticFit=True level=AMP --clobber-config -j 25

      • The fraction of instances that crash with this or a similar error varies from ~20% when running a full 189-sensor focal plane with 189 separate script instances, down to <1% when employing the "-j" option and reducing the number of script instances to ~10.

       Let me know if I can provide any other evidence/clues.

      Thanks for any help,

          - Tom

        Attachments

          Activity

          ktl Kian-Tat Lim added a comment -

          It's relatively easy to fix the return code problem. For the "No such file or directory" problem, what kind of shared filesystem is being used? Is a traceback printed?

          glanzman Tom Glanzman added a comment - edited

          Kian-Tat Lim The file system is at NERSC and consists of several combinations:

          1. login nodes (Haswell-like) directly mount GPFS (home and project space), and Lustre (scratch)
          2. batch nodes (both Haswell and KNL) mount GPFS r/w (or r/o in some cases) and Lustre

          However, in case 2, I do not know if clever Cray tricks are employed, such as their DVS (Data Virtualization Service).  NERSC recommends using Lustre on batch nodes for any I/O intensive work.  The rerun directory in question is in the Lustre (scratch) area.

           

          Yes, there is a traceback.  Here is one example:

          Traceback (most recent call last):
            File "/cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/w_2019_19/stack/miniconda3-4.5.12-1172c30/Linux64/pipe_base/17.0.1-2-g3e5d191+31/python/lsst/pipe/base/cmdLineTask.py", line 333, in precall
              self._precallImpl(task, parsedCmd)
            File "/cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/w_2019_19/stack/miniconda3-4.5.12-1172c30/Linux64/pipe_base/17.0.1-2-g3e5d191+31/python/lsst/pipe/base/cmdLineTask.py", line 309, in _precallImpl
              task.writeConfig(parsedCmd.butler, clobber=self.clobberConfig, doBackup=self.doBackup)
            File "/cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/w_2019_19/stack/miniconda3-4.5.12-1172c30/Linux64/pipe_base/17.0.1-2-g3e5d191+31/python/lsst/pipe/base/cmdLineTask.py", line 669, in writeConfig
              butler.put(self.config, configName, doBackup=doBackup)
            File "/cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/w_2019_19/stack/miniconda3-4.5.12-1172c30/Linux64/daf_persistence/17.0.1-1-gfc6fb1f+14/python/lsst/daf/persistence/butler.py", line 1434, in put
              location.getRepository().write(location, obj)
            File "/cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/w_2019_19/stack/miniconda3-4.5.12-1172c30/Linux64/daf_persistence/17.0.1-1-gfc6fb1f+14/python/lsst/daf/persistence/repository.py", line 185, in write
              return butlerLocationStorage.write(butlerLocation, obj)
            File "/cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/w_2019_19/stack/miniconda3-4.5.12-1172c30/Linux64/daf_persistence/17.0.1-1-gfc6fb1f+14/python/lsst/daf/persistence/posixStorage.py", line 258, in write
              writeFormatter(butlerLocation, obj)
            File "/cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/w_2019_19/stack/miniconda3-4.5.12-1172c30/Linux64/daf_persistence/17.0.1-1-gfc6fb1f+14/python/lsst/daf/persistence/posixStorage.py", line 553, in writeConfigStorage
              obj.save(logLoc.locString())
            File "/cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/w_2019_19/python/miniconda3-4.5.12/lib/python3.7/contextlib.py", line 119, in __exit__
              next(self.gen)
            File "/cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/w_2019_19/stack/miniconda3-4.5.12-1172c30/Linux64/daf_persistence/17.0.1-1-gfc6fb1f+14/python/lsst/daf/persistence/safeFileIo.py", line 144, in SafeFilename
              setFileMode(name)
            File "/cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/w_2019_19/stack/miniconda3-4.5.12-1172c30/Linux64/daf_persistence/17.0.1-1-gfc6fb1f+14/python/lsst/daf/persistence/safeFileIo.py", line 57, in setFileMode
              os.chmod(filename, (~umask & 0o666))
          FileNotFoundError: [Errno 2] No such file or directory: '/global/cscratch1/sd/descdm/tomTest/bf_repoA/rerun/20190627/config/makeBrighterFatterKernel.py'
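The traceback shows os.chmod raising after SafeFilename has already written the file, which is consistent with a concurrent --clobber-config instance removing or replacing the config file in the window between the write/rename and the chmod. A minimal sketch of a chmod that tolerates the file vanishing (a hypothetical variant of safeFileIo.setFileMode, not the actual daf_persistence fix):

```python
import contextlib
import os


def set_file_mode_tolerant(filename, umask=0o022):
    # Hypothetical tolerant variant (assumption, not the real fix):
    # ignore the file disappearing between its creation and the chmod,
    # e.g. when a concurrent --clobber-config run replaces the config
    # file inside that window.
    with contextlib.suppress(FileNotFoundError):
        os.chmod(filename, ~umask & 0o666)
```

This only papers over one symptom of the race; the last writer's config still wins, which is why the serial pre-write workaround below in the thread is the more robust option in Gen2.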
          
          

          ktl Kian-Tat Lim added a comment -

          Paul Price gave a workaround: run the task first (once) with no --id parameter. I believe Gen3 middleware will formalize a step like this. Solving the underlying problem efficiently is difficult and probably not worth fixing in Gen2. I'll PR the return code change, though.
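The workaround can be scripted as a serial pre-step before fanning out: one invocation with no --id selects no data ids but (in Gen2) still writes the task config into the rerun directory, so the later parallel instances no longer race on it. A rough sketch, with the script path and repo/rerun arguments purely illustrative (taken from this report):

```python
import subprocess


def prewrite_config(script, repo, rerun):
    # Sketch of the workaround (assumed Gen2 behavior): a single serial
    # run with no --id data selection still writes the config files,
    # leaving nothing for the subsequent parallel runs to clobber.
    # `script`, `repo`, and `rerun` are whatever your site uses.
    subprocess.run([script, repo, "--rerun", rerun], check=True)


# After this returns, launch the parallel per-detector instances as before.
```

check=True makes the pre-step itself fail loudly if the config write goes wrong, rather than letting the parallel stage start against a broken rerun directory.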


            People

            • Assignee:
              ktl Kian-Tat Lim
              Reporter:
              glanzman Tom Glanzman
              Watchers:
              John Swinbank, Kian-Tat Lim, Tom Glanzman
            • Votes:
              0
            • Watchers:
              3

              Dates

              • Created:
                Updated:
