Data Management / DM-11337

Failure in image access or ingest in HSC and Decam examples

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: obs_base, validate_drp
    • Labels:
      None

      Description

      The last timed job that runs examples from CFHT, HSC, Decam failed on ingesting images from HSC or Decam. My first suspicion is disk+data access issues on the worker node.

      https://ci.lsst.codes/job/sqre/job/validate_drp/dataset=hsc,label=centos-7,python=py2/lastBuild/consoleFull
      https://ci.lsst.codes/job/sqre/job/validate_drp/dataset=decam,label=centos-7,python=py2/lastBuild/console

      HSC:

      [py2] $ /bin/bash -e /tmp/hudson2290594987658120023.sh
      notice: lsstsw tools have been set up.
      WARNING: Git LFS is using a deprecated API, which will be removed in v2.0.
               Consider enabling the latest API by running: `git config lfs.batch true`.
       
      Git LFS: (0 of 9736 files) 0 B / 307.07 GB                                     
      root INFO: Loading config overrride file u'/home/jenkins-slave/workspace/sqre/validate_drp/dataset/hsc/label/centos-7/python/py2/lsstsw/stack/Linux64/obs_subaru/13.0-39-gf25a3b0+3/config/ingest.py'
      CameraMapper INFO: Loading Posix exposure registry from /home/jenkins-slave/workspace/sqre/validate_drp/dataset/hsc/label/centos-7/python/py2/validate_drp/data_hsc
      CameraMapper INFO: Loading calib registry from /home/jenkins-slave/workspace/sqre/validate_drp/dataset/hsc/label/centos-7/python/py2/validate_drp/data_hsc/CALIB/calibRegistry.sqlite3
      ingest WARN: Failed to ingest file /home/jenkins-slave/workspace/sqre/validate_drp/dataset/hsc/label/centos-7/python/py2/validation_data_hsc/raw/HSCA90333200.fits: 
        File "src/fits.cc", line 970, in lsst::afw::fits::Fits::Fits(const string&, const string&, int)
          cfitsio error: error reading from FITS file (108) : Opening file '/home/jenkins-slave/workspace/sqre/validate_drp/dataset/hsc/label/centos-7/python/py2/validation_data_hsc/raw/HSCA90333200.fits' with mode 'r' {0}
      lsst::afw::fits::FitsError: 'cfitsio error: error reading from FITS file (108) : Opening file '/home/jenkins-slave/workspace/sqre/validate_drp/dataset/hsc/label/centos-7/python/py2/validation_data_hsc/raw/HSCA90333200.fits' with mode 'r''
       
      ingest WARN: Failed to ingest file /home/jenkins-slave/workspace/sqre/validate_drp/dataset/hsc/label/centos-7/python/py2/validation_data_hsc/raw/HSCA90333201.fits: 
        File "src/fits.cc", line 970, in lsst::afw::fits::Fits::Fits(const string&, const string&, int)
          cfitsio error: error reading from FITS file (108) : Opening file '/home/jenkins-slave/workspace/sqre/validate_drp/dataset/hsc/label/centos-7/python/py2/validation_data_hsc/raw/HSCA90333201.fits' with mode 'r' {0}
      lsst::afw::fits::FitsError: 'cfitsio error: error reading from FITS file (108) : Opening file '/home/jenkins-slave/workspace/sqre/validate_drp/dataset/hsc/label/centos-7/python/py2/validation_data_hsc/raw/HSCA90333201.fits' with mode 'r''
      [...]
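
      A cfitsio read error (108) on every raw file suggests the files themselves are incomplete or unreadable rather than a problem in the ingest code. As a rough diagnostic sketch (not part of validate_drp; the function name and return labels are hypothetical), each file can be classified before ingest by checking whether it is an un-fetched Git LFS pointer, lacks the FITS `SIMPLE  =` signature, or is not a whole number of 2880-byte FITS blocks:

```python
import os

FITS_MAGIC = b"SIMPLE  ="  # first card of a FITS primary header
LFS_POINTER_MAGIC = b"version https://git-lfs.github.com/spec/v1"
FITS_BLOCK = 2880  # FITS files are padded to multiples of 2880 bytes


def classify_raw_file(path):
    """Return a short label describing why a raw file may fail to open.

    Hypothetical helper for illustration; the labels are arbitrary.
    """
    try:
        size = os.path.getsize(path)
        with open(path, "rb") as handle:
            head = handle.read(len(LFS_POINTER_MAGIC))
    except OSError as exc:
        return "unreadable: {}".format(exc)
    if head.startswith(LFS_POINTER_MAGIC):
        return "lfs-pointer"  # Git LFS content was never downloaded
    if not head.startswith(FITS_MAGIC):
        return "not-fits"
    if size % FITS_BLOCK != 0:
        return "truncated"  # e.g. the disk filled up mid-download
    return "ok"
```

      Running this over `validation_data_hsc/raw/*.fits` would separate a failed LFS fetch from a partially written file; it only inspects the first bytes and the size, so it cannot detect corruption deeper in the file.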
      

      Decam:

      [py2] $ /bin/bash -e /tmp/hudson4019348510730166432.sh
      notice: lsstsw tools have been set up.
      Ingesting Raw data
      root INFO: Loading config overrride file u'/home/jenkins-slave/workspace/sqre/validate_drp/dataset/decam/label/centos-7/python/py2/lsstsw/stack/Linux64/obs_decam/13.0-19-ga4adf72+3/config/ingest.py'
      CameraMapper INFO: Loading Posix exposure registry from /home/jenkins-slave/workspace/sqre/validate_drp/dataset/decam/label/centos-7/python/py2/validate_drp/Decam/input
      CameraMapper INFO: Loading Posix calib registry from /home/jenkins-slave/workspace/sqre/validate_drp/dataset/decam/label/centos-7/python/py2/validate_drp/Decam/input
      ingest.parse WARN: Unable to find value for ccdnum (derived from CCDNUM)
      ingest.parse WARN: Unable to find value for ccd (derived from CCDNUM)
      ingest INFO: /home/jenkins-slave/workspace/sqre/validate_drp/dataset/decam/label/centos-7/python/py2/lsstsw/stack/Linux64/validation_data_decam/master-g52ac2b0d78/instcal/instcal0176837.fits.fz --<link>--> /home/jenkins-slave/workspace/sqre/validate_drp/dataset/decam/label/centos-7/python/py2/validate_drp/Decam/input/0176837/instcal0176837.fits.fz
      ingest INFO: /home/jenkins-slave/workspace/sqre/validate_drp/dataset/decam/label/centos-7/python/py2/lsstsw/stack/Linux64/validation_data_decam/master-g52ac2b0d78/dqmask/dqmask0176837.fits.fz --<link>--> /home/jenkins-slave/workspace/sqre/validate_drp/dataset/decam/label/centos-7/python/py2/validate_drp/Decam/input/0176837/dqmask0176837.fits.fz
      ingest INFO: /home/jenkins-slave/workspace/sqre/validate_drp/dataset/decam/label/centos-7/python/py2/lsstsw/stack/Linux64/validation_data_decam/master-g52ac2b0d78/wtmap/wtmap0176837.fits.fz --<link>--> /home/jenkins-slave/workspace/sqre/validate_drp/dataset/decam/label/centos-7/python/py2/validate_drp/Decam/input/0176837/wtmap0176837.fits.fz
      ingest.parse WARN: Unable to find value for ccdnum (derived from CCDNUM)
      ingest.parse WARN: Unable to find value for ccd (derived from CCDNUM)
      ingest INFO: /home/jenkins-slave/workspace/sqre/validate_drp/dataset/decam/label/centos-7/python/py2/lsstsw/stack/Linux64/validation_data_decam/master-g52ac2b0d78/instcal/instcal0176846.fits.fz --<link>--> /home/jenkins-slave/workspace/sqre/validate_drp/dataset/decam/label/centos-7/python/py2/validate_drp/Decam/input/0176846/instcal0176846.fits.fz
      ingest INFO: /home/jenkins-slave/workspace/sqre/validate_drp/dataset/decam/label/centos-7/python/py2/lsstsw/stack/Linux64/validation_data_decam/master-g52ac2b0d78/dqmask/dqmask0176846.fits.fz --<link>--> /home/jenkins-slave/workspace/sqre/validate_drp/dataset/decam/label/centos-7/python/py2/validate_drp/Decam/input/0176846/dqmask0176846.fits.fz
      ingest INFO: /home/jenkins-slave/workspace/sqre/validate_drp/dataset/decam/label/centos-7/python/py2/lsstsw/stack/Linux64/validation_data_decam/master-g52ac2b0d78/wtmap/wtmap0176846.fits.fz --<link>--> /home/jenkins-slave/workspace/sqre/validate_drp/dataset/decam/label/centos-7/python/py2/validate_drp/Decam/input/0176846/wtmap0176846.fits.fz
      running processCcd
      cp: cannot stat ‘/home/jenkins-slave/workspace/sqre/validate_drp/dataset/decam/label/centos-7/python/py2/validate_drp/Decam_output_z.json’: No such file or directory
      

            Activity

            jhoblitt Joshua Hoblitt added a comment - - edited

            The issue seems to be that LoadAstrometryNetObjectsTask is not in meas_astrom: https://github.com/lsst/validate_drp/blob/72a0388c96f003426934867e8c9c11405ba6edf0/config/hscConfig.py#L9

            I believe the import line should probably be from lsst.meas.extensions.astrometryNet import LoadAstrometryNetObjectsTask, based on code in other repos, but I'm at a bit of a loss as to how the data set was able to be run by hand.
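
            One way to keep a config file working on either side of such a move is to try the candidate modules in order. The helper below is a hypothetical sketch, not code from validate_drp; only the two module paths come from the comment above, and whether both expose the task in any given stack version is an assumption.

```python
import importlib


def import_first_available(name, module_paths):
    """Return attribute `name` from the first module in `module_paths` that has it.

    Hypothetical helper: raises ImportError listing every failed attempt.
    """
    errors = []
    for path in module_paths:
        try:
            module = importlib.import_module(path)
            return getattr(module, name)
        except (ImportError, AttributeError) as exc:
            errors.append("{}: {}".format(path, exc))
    raise ImportError(
        "could not import {} from any of: {}".format(name, "; ".join(errors))
    )


# Assumed usage inside a config file such as hscConfig.py:
# LoadAstrometryNetObjectsTask = import_first_available(
#     "LoadAstrometryNetObjectsTask",
#     ["lsst.meas.extensions.astrometryNet", "lsst.meas.astrom"],
# )
```

            If only one stack version needs to be supported, the plain single-line import is simpler and fails just as loudly.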

            jhoblitt Joshua Hoblitt added a comment -

            Per discussion on Slack in #dm-squash, it appears that the cfht failure is due to the drp output filename being _r.json rather than the expected Cfht_r.json. This appears to be a regression caused by DM-11300.

            The latest run of HSC failed because the Jenkins agent's disk filled up. This has been fixed and HSC is running again.
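
            A name like _r.json is what results when the dataset prefix is empty at the point the output filename is assembled. How validate_drp actually builds the name is not shown in this ticket, so the function below is purely an illustrative sketch with a hypothetical name and signature; it shows a guard that would turn a silently wrong filename into an immediate error.

```python
def output_json_name(dataset_prefix, filter_name):
    """Build '<prefix>_<filter>.json', refusing an empty prefix.

    Hypothetical helper for illustration only.
    """
    if not dataset_prefix:
        raise ValueError(
            "dataset prefix is empty; refusing to produce '_{}.json'".format(filter_name)
        )
    return "{}_{}.json".format(dataset_prefix, filter_name)
```

            With this guard, `output_json_name("Cfht", "r")` gives `Cfht_r.json`, while an empty prefix fails where the name is built instead of later in a `cp` step.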

            The decam failure mode appears to be different from the other datasets.

            jhoblitt Joshua Hoblitt added a comment -

            The `HSC` build, which presumably would have failed due to DM-11410, ran out of disk space again. This happens when a `stack-os-matrix` workspace has grown rather large and there is less than 800GiB of free space left, but `validate_drp/hsc` does not have an existing checkout. Simply requiring ~800GiB of free space before starting a build won't solve the problem, as that requirement would not account for a pre-existing workspace.

            I've done some manual node workspace cleanup and sent an email to the jenkins-user group asking if anyone has solved this issue. If anything actionable comes of it, I'll split that off into a new issue.
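
            The distinction above (free space needed versus space the workspace already occupies) can be sketched as a pre-build check. This is a minimal illustration, not an existing Jenkins feature; the function names are hypothetical, the ~800GiB figure is taken from the comment, and `shutil.disk_usage` requires Python 3.

```python
import os
import shutil

GIB = 1024 ** 3


def directory_size(path):
    """Sum the sizes of regular files under `path` (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def has_room_for_build(workspace, required_bytes, volume):
    """Return True if `volume` can hold the part of the build not yet on disk.

    `required_bytes` is the expected final workspace size; whatever an
    existing checkout already occupies is subtracted before comparing
    against the free space on `volume`.
    """
    existing = directory_size(workspace) if os.path.isdir(workspace) else 0
    still_needed = max(required_bytes - existing, 0)
    return shutil.disk_usage(volume).free >= still_needed


# Assumed usage, with hypothetical paths:
# has_room_for_build("/home/jenkins-slave/workspace/sqre/validate_drp",
#                    800 * GIB, "/home/jenkins-slave")
```

            Walking a workspace of several hundred GiB is slow, and the check is racy if other jobs write to the same volume at the same time, so it is a guard rather than a guarantee.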

            jhoblitt Joshua Hoblitt added a comment -

            (I'm adding 1sp to this ticket for additional time spent debugging, including resolving the test issue in DM-11410.)

            jhoblitt Joshua Hoblitt added a comment -

            The cfht dataset has been passing for a couple of days now. hsc seems to have a new failure mode, while decam is still exiting with status code 6. I'm going to declare the original (several) problems under the scope of this ticket resolved and open new issues.


              People

              • Assignee:
                jhoblitt Joshua Hoblitt
                Reporter:
                wmwood-vasey Michael Wood-Vasey
                Watchers:
                Joshua Hoblitt, Michael Wood-Vasey
