  1. Data Management
  2. DM-26700

Duplicate DiaObjects breaking diaPipe


    Details

    • Type: Bug
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: ap_association
    • Labels:
      None
    • Story Points:
      8
    • Sprint:
      AP F20-5 (October), AP F20-6 (November), AP S21-1 (December), AP S21-2 (January)
    • Team:
      Alert Production
    • Urgent?:
      No

      Description

      Per our discussion today. For example,

      apPipe.diaPipe INFO: Running DiaPipeline...
      numexpr.utils INFO: Note: NumExpr detected 24 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
      numexpr.utils INFO: NumExpr defaulting to 8 threads.
      apPipe.diaPipe.associator.diaCalculation WARN: Error in ap_HTMIndex.calculate: __init__(): incompatible constructor arguments. The following argument types are supported:
          1. lsst.geom.SpherePoint()
          2. lsst.geom.SpherePoint(longitude: lsst.geom.Angle, latitude: lsst.geom.Angle)
          3. lsst.geom.SpherePoint(longitude: float, latitude: float, units: lsst.geom.AngleUnit)
          4. lsst.geom.SpherePoint(vector: lsst.sphgeom._sphgeom.Vector3d)
          5. lsst.geom.SpherePoint(unitVector: lsst.sphgeom._sphgeom.UnitVector3d)
          6. lsst.geom.SpherePoint(lonLat: lsst.sphgeom._sphgeom.LonLat)
          7. lsst.geom.SpherePoint(other: lsst.geom.SpherePoint)
       
       
      Invoked with: array([2.61635 rad, 2.61635 rad], dtype=object), array([0.0325585 rad, 0.0325585 rad], dtype=object)
      apPipe.diaPipe.associator.diaCalculation WARN: Error in ap_chi2Flux.calculate: operands could not be broadcast together with shapes (15,) (2,) 
      apPipe.diaPipe.associator.diaCalculation WARN: Error in ap_stetsonJ.calculate: operands could not be broadcast together with shapes (15,) (2,) 
      apPipe.diaPipe.diaForcedSource.forcedMeasurement INFO: Performing forced measurement on 8217 sources
      apPipe.diaPipe.diaForcedSource.forcedMeasurement INFO: Performing forced measurement on 8217 sources
      apPipe FATAL: Failed on dataId={'filter': 'HSC-G', 'ccd': 19, 'visit': 101450, 'field': 'SSP_UDEEP_COSMOS', 'dateObs': '2017-02-01', 'pointing': 1858, 'taiObs': '2017-02-01', 'expTime': 300.0}: IntegrityError: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "PK_DiaObject"
      DETAIL:  Key ("diaObjectId", "validityStart")=(50809544716059144, 2017-02-01 09:08:31.3485) already exists. 
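
The constraint violation above names a composite key, ("diaObjectId", "validityStart"). As a minimal sketch of how a catalog could be checked for repeated keys before it is written, the snippet below counts key occurrences with the standard library. The function name `find_duplicate_keys`, the row layout, and the second diaObjectId are hypothetical and chosen only for illustration; this is not the ap_association code.

```python
from collections import Counter


def find_duplicate_keys(rows):
    """Return the (diaObjectId, validityStart) keys that occur more than once."""
    counts = Counter((row["diaObjectId"], row["validityStart"]) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# Illustrative rows. The first key is the one reported in the log above;
# the last row uses a made-up diaObjectId.
rows = [
    {"diaObjectId": 50809544716059144, "validityStart": "2017-02-01 09:08:31.3485"},
    {"diaObjectId": 50809544716059144, "validityStart": "2017-02-01 09:08:31.3485"},
    {"diaObjectId": 50809544716059145, "validityStart": "2017-02-01 09:08:31.3485"},
]

duplicates = find_duplicate_keys(rows)
print(duplicates)
```

A repeated diaObjectId would also be consistent with the two-element arrays that appear in the ap_HTMIndex, ap_chi2Flux, and ap_stetsonJ warnings, although the log alone does not confirm that.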

        Attachments

          Activity

          mrawls Meredith Rawls added a comment -

          Agreed, and thanks for looking into this. I think the investigation here is done, and we will need a new ticket to make sure future batch processing doesn't run into this issue. I also passed along to Benjamin Racine the suggestion to try adding --no-requeue to the sbatch call.

          cmorrison Chris Morrison [X] (Inactive) added a comment -

          I looked at the final results of the processing and, as I mentioned above, there were no failed DiaObject inserts and no failures when setting previous DiaObjects to invalid. All of the errors were on DiaSource insert, which makes sense: the data is being re-run from the start, encounters DiaSources that have already been inserted, and thus runs into an insertion error.

          If we want to test the specific failures again, I can turn down the DB timeout and then submit a long-running job with a lot of visits to get things completely replicated. That might be a waste of time at this point.

          My suggestion to close out this ticket is to create a new ticket that improves the error catching in ap_pipe/DiaPipe to better target when these errors happen and send more information to the logs.

          cmorrison Chris Morrison [X] (Inactive) added a comment -

          Recapping what I said in the meeting and on this ticket: from running Ian Sullivan's suggested test and from my previous tests with a larger volume of data, the IntegrityErrors and OperationalErrors thrown by the DB and pipeline are likely caused by a combination of two things: pipeline jobs being re-run after failing in a previous processing run, and writes to the DB completing only partially after a timeout error while running a previous dataId.

          For instance: the DB write to update previous DiaObjects to have a ValidityEnd value succeeds but the DiaObject write fails.
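
To make that failure state concrete, here is a minimal sketch that uses Python's built-in sqlite3 as a stand-in for the PostgreSQL database in the log. The simplified DiaObject schema and the helper `store_new_version` are assumptions for illustration, not the pipeline's actual database code. It shows the two writes grouped in a single transaction, so that a failed insert also undoes the validityEnd update instead of leaving the table half-updated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, hypothetical schema with the same composite primary key as in the log.
conn.execute(
    "CREATE TABLE DiaObject ("
    " diaObjectId INTEGER NOT NULL,"
    " validityStart TEXT NOT NULL,"
    " validityEnd TEXT,"
    " PRIMARY KEY (diaObjectId, validityStart))"
)
conn.execute("INSERT INTO DiaObject VALUES (1, '2017-02-01T09:00:00', NULL)")
conn.commit()


def store_new_version(conn, obj_id, visit_time):
    """Close the current version and insert the new one in one transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "UPDATE DiaObject SET validityEnd = ? "
            "WHERE diaObjectId = ? AND validityEnd IS NULL",
            (visit_time, obj_id),
        )
        conn.execute(
            "INSERT INTO DiaObject VALUES (?, ?, NULL)",
            (obj_id, visit_time),
        )


# Re-running the same visit time collides with the existing primary key.
try:
    store_new_version(conn, 1, "2017-02-01T09:00:00")
except sqlite3.IntegrityError as exc:
    print("insert failed, update rolled back:", exc)

rows = conn.execute(
    "SELECT diaObjectId, validityStart, validityEnd FROM DiaObject"
).fetchall()
print(rows)  # the original row still has validityEnd = NULL
```

If the two statements were committed separately, the same collision would leave the earlier DiaObject marked as ended with no successor, which is the partial state described above.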

          Considering the length of time and processing required to properly fully recreate this failure state, I would suggest two things:

          • Make sure long Slurm runs of future AP jobs are set up with --no-requeue.
          • Spawn a ticket off of this one to make the error catching for DB failures more robust and precise, possibly including a debug-mode config option for DiaPipe that writes the catalogs it failed on to disk.
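
The second suggestion could look roughly like the sketch below, again with sqlite3 standing in for the real database. The function `write_with_diagnostics`, its `debug` flag, the logger name, and the CSV dump format are all hypothetical; nothing here reflects an existing DiaPipe config option.

```python
import csv
import logging
import sqlite3
import tempfile
from pathlib import Path

log = logging.getLogger("diaPipe.sketch")  # hypothetical logger name


def write_with_diagnostics(conn, table, rows, dump_dir, debug=False):
    """Insert rows; on a DB error, log the table and optionally dump the rows."""
    placeholders = ", ".join("?" for _ in rows[0])
    try:
        with conn:  # roll back the whole batch if any row fails
            conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    except (sqlite3.IntegrityError, sqlite3.OperationalError) as exc:
        log.error("Failed writing %d rows to %s: %s", len(rows), table, exc)
        if debug:
            dump_path = Path(dump_dir) / f"{table}_failed.csv"
            with open(dump_path, "w", newline="") as f:
                csv.writer(f).writerows(rows)
            log.error("Wrote failing catalog to %s", dump_path)
        raise  # let the caller decide how to handle the failure


# Usage: a DiaSource that was already inserted by an earlier, interrupted run.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DiaSource (diaSourceId INTEGER PRIMARY KEY, diaObjectId INTEGER)"
)
conn.execute("INSERT INTO DiaSource VALUES (10, 1)")
conn.commit()

dump_dir = tempfile.mkdtemp()
try:
    write_with_diagnostics(conn, "DiaSource", [(10, 1), (11, 1)], dump_dir, debug=True)
except sqlite3.IntegrityError:
    pass

dumped = (Path(dump_dir) / "DiaSource_failed.csv").read_text().splitlines()
print(dumped)
```

Logging the table name alongside the exception makes it easier to tell a DiaSource collision from a re-run apart from a DiaObject write failure, which is the distinction the investigation above had to reconstruct by hand.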
          sullivan Ian Sullivan added a comment -

          Thanks Chris for your extensive investigations into this error. Please go ahead and create the ticket to improve the database error reporting.

          cmorrison Chris Morrison [X] (Inactive) added a comment -

          Created ticket DM-28555


            People

            Assignee:
            cmorrison Chris Morrison [X] (Inactive)
            Reporter:
            mrawls Meredith Rawls
            Reviewers:
            Ian Sullivan, Meredith Rawls
            Watchers:
            Chris Morrison [X] (Inactive), Ian Sullivan, Krzysztof Findeisen, Meredith Rawls
