Data Management / DM-30730

DC2 Reprocessing with w_2021_24 (gen2)


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Story Points: 8
    • Epic Link:
    • Team: Data Release Production
    • Urgent?: No

      Description

      Process the DC2 dataset with the w_2021_24 stack.  Will first attempt to follow the procedure described in the lsst-dm/gen2gen repo here.

        Attachments

          Issue Links

            Activity

            Lauren MacArthur added a comment (edited)

            Hit a snag with brighterFatter right out of the gate:

            AttributeError: 'ImsimMapper' object has no attribute 'map_bfKernel'
            ...
            Traceback (most recent call last):
              File "/software/lsstsw/stack_20210520/stack/miniconda3-py38_4.9.2-0.6.0/Linux64/ip_isr/21.0.0-18-g4012351+bafe306c76/python/lsst/ip/isr/isrTask.py", line 1143, in readIsrData
                brighterFatterKernel = dataRef.get("brighterFatterKernel")
              File "/software/lsstsw/stack_20210520/stack/miniconda3-py38_4.9.2-0.6.0/Linux64/daf_persistence/21.0.0-8-g5674e7b+744a5e8720/python/lsst/daf/persistence/butlerSubset.py", line 203, in get
                return self.butlerSubset.butler.get(datasetType, self.dataId, **rest)
              File "/software/lsstsw/stack_20210520/stack/miniconda3-py38_4.9.2-0.6.0/Linux64/daf_persistence/21.0.0-8-g5674e7b+744a5e8720/python/lsst/daf/persistence/butler.py", line 1401, in get
                raise NoResults("No locations for get:", datasetType, dataId)
            lsst.daf.persistence.butlerExceptions.NoResults: No locations for get: datasetType:brighterFatterKernel dataId:DataId(initialdata={'visit': 159471, 'run': '159471', 'raftName': 'R02', 'expId': 159471, 'detectorName': 'S02', 'detector': 11}, tag=set())
            

            Lauren MacArthur added a comment (edited)

            I tried adding the following to policy/imsim/imsimMapper.yaml in obs_lsst:

            calibrations:
              bfKernel:
                level: None
                persistable: ignored
                python: numpy.ndarray
                storage: PickleStorage
                template: bfkernels/bfKernel-%(raftName)s-%(detectorName)s-det%(detector)03d.pkl
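
            As I understand the gen2 mapper, the `template` field is expanded with Python %-style dict formatting against the dataId keys, so for the failing dataId above it should resolve as follows (a quick sketch to check the expected on-disk path):

            ```python
            # Gen2-style template expansion: %-formatting with a dataId dict.
            template = "bfkernels/bfKernel-%(raftName)s-%(detectorName)s-det%(detector)03d.pkl"
            data_id = {"raftName": "R02", "detectorName": "S02", "detector": 11}

            path = template % data_id
            print(path)  # bfkernels/bfKernel-R02-S02-det011.pkl
            ```

            If no file exists at that path relative to the calibration root, the butler raises the NoResults error seen above.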
            

            but then I get:

            processCcd.isr INFO: Old style brighter-fatter kernel (np.array) loaded
            

            but

            print(type(brighterFatterKernel))
            <class 'lsst.ip.isr.brighterFatterKernel.BrighterFatterKernel'>
            print(dir(brighterFatterKernel))
            ['_OBSTYPE', '_SCHEMA', '_VERSION', '__abstractmethods__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', 'apply', 'calibInfoFromDict', 'fromDetector', 'fromDict', 'fromTable', 'getLengths', 'getMetadata', 'initFromCamera', 'kernel', 'level', 'makeDetectorKernelFromAmpwiseKernels', 'readFits', 'readText', 'replaceDetectorKernelWithAmpKernel', 'requiredAttributes', 'setMetadata', 'toDict', 'toTable', 'updateMetadata', 'validate', 'writeFits', 'writeText']
            ...
              File "/home/lauren/LSST/ip_isr/python/lsst/ip/isr/isrTask.py", line 1157, in readIsrData
                if brighterFatterKernel.detectorKernel:
            AttributeError: 'BrighterFatterKernel' object has no attribute 'detectorKernel'
            

            (i.e. this is not being loaded as an np.array and does not have the needed attributes)
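
            For reference, the shape of the check that seems to be failing can be sketched as a hypothetical helper (this is an illustration of the old-array vs. new-object mismatch, not the actual ip_isr code; the `detectorKernel` attribute name is taken from the traceback above):

            ```python
            import numpy as np

            def unpack_bf_kernel(obj):
                """Return a usable kernel from either BF kernel flavor.

                Hypothetical sketch: old-style kernels are plain numpy arrays,
                while new-style BrighterFatterKernel objects are expected (per
                the isrTask.py line 1157 traceback) to carry a detectorKernel
                attribute. The object loaded here has neither behavior.
                """
                if isinstance(obj, np.ndarray):
                    return obj  # old-style pickled numpy kernel
                if hasattr(obj, "detectorKernel"):
                    return obj.detectorKernel  # new-style per-detector kernels
                raise TypeError(f"Unrecognized BF kernel type: {type(obj)}")
            ```

            The object above falls through both branches, which is consistent with it being deserialized as a bare `BrighterFatterKernel` that was never populated.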

            Lauren MacArthur added a comment

            It looks like the gen2 repo does not have the “new” cpp-style BF kernels, and the “old” ones are no longer functional?  Paging Christopher Waters...

            Lauren MacArthur added a comment

            I set up the branches of DM-30738 and reset the processing, so I am past the BF hurdle.  The next hitch is that the fix for the shapeHSM issue of DM-30426 did not make it into the w_2021_24 stack.  This is an issue for both middleware runs, but it is particularly detrimental to the gen2 run since, as soon as a single failure occurs, it brings down the entire singleFrameDriver job.  I have nevertheless continued with the remainder of the processing steps, but there will be many visit/detector combos missing from this run (so coadd depth/coverage will be...interesting!).  I plan to rerun at least singleFrameDriver on the w_2021_25 branch in order to better assess the comparison on DM-30812 (and can continue with the full processing if we find it is desired for further gen2/gen3 comparisons before the next processing run when w_2021_28 lands).

            Lauren MacArthur added a comment

            This run has now completed except for the two r-band matchedVisitMetrics.py jobs, which fail with:

            slurmstepd: error: Detected 1 oom-kill event(s) in StepId=53764.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
            

            sacct tells me:

            $ sacct -u lauren --units=G --format=jobid,jobname%35,CPUTime,Elapsed,Start,MaxRSS,State%20
            53763.batch batch 9-12:27:36 09:31:09 2021-07-01T16:45:19 112.98G OUT_OF_MEMORY 
            53764.batch batch 9-14:12:24 09:35:31 2021-07-01T16:45:19 108.13G OUT_OF_MEMORY
            

            I don't think there are any Slurm nodes available with more memory (sinfo -Nl tells me they are all the same at 128 GB).

            Do we care about this? Will it cause problems with gen2 -> gen3 conversion with those outputs missing?
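
            For what it's worth, the reported MaxRSS values can be compared against the 128 GB node limit with a quick sketch (the suffix handling below mirrors the `--units=G` output; note MaxRSS is sampled, so the true peak that tripped the cgroup limit was likely higher than what sacct shows):

            ```python
            def rss_gib(s):
                """Parse a sacct MaxRSS string like '112.98G' into GiB."""
                units = {"K": 1.0 / (1024 * 1024), "M": 1.0 / 1024, "G": 1.0, "T": 1024.0}
                return float(s[:-1]) * units[s[-1].upper()]

            node_mem_gib = 128.0
            for job, rss in [("53763.batch", "112.98G"), ("53764.batch", "108.13G")]:
                print(f"{job}: peaked at {rss_gib(rss):.2f} GiB of {node_mem_gib} GiB")
            ```

            Both jobs were sampled within ~15 GB of the node total, so there is little headroom to recover by tuning; a larger-memory node (or splitting the work) would be needed.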

            Lauren MacArthur added a comment

            Ok, I bit the bullet and just ran those last two jobs on the head node of lsst-condorprod-sub01.ncsa.illinois.edu  (no one else was on it, so hopefully it didn't interfere with anyone else's work!), so we now have the full suite of usual outputs. All logs can be found in:

            /datasets/DC2/repoRun2.2i/rerun/w_2021_24/DM-30730/logs
            

            I also ran dispatch_verify.py on all the JSON files in the validateDrp subdirectory, and they all seemed to run fine. Is there an "ingest into SQuaSH" step that now needs to be done? However, I will note that I happened to notice some puzzling entries in the JSON files, along the lines of:

                    "instrument": "HSC"
                    "filter_name": "HSC-I",
                    "instrument": "HSC"
                    "filter_name": "HSC-U",
            

            and only one with

            "instrument": "LSST-ImSim"
            

            which is perhaps actually the only one getting populated with metrics? That is what is being passed in as a config parameter, but I'm thinking it should really be LSSTCam-imSim. I can see the instrument being hard-coded to "HSC" here rather than assigning the config parameter (the filter_name was less obvious, but may be a trickle-down effect?). Should this be fixed (i.e., do we even care about validate_drp anymore?), and should my original dispatch perhaps be wiped clean before ingesting to SQuaSH (I'm not sure how all that works...)?
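
            To quantify how widespread the mis-labeled entries are, a scan along these lines could be run over the outputs (the glob path is a placeholder for the actual validateDrp directory; the regex just matches the literal `"instrument"` entries seen above rather than assuming the full JSON schema):

            ```python
            import collections
            import glob
            import re

            # Hypothetical location; point this at the rerun's validateDrp directory.
            JSON_GLOB = "validateDrp/**/*.json"
            INSTRUMENT_RE = re.compile(r'"instrument":\s*"([^"]+)"')

            counts = collections.Counter()
            for path in glob.glob(JSON_GLOB, recursive=True):
                with open(path) as f:
                    counts.update(INSTRUMENT_RE.findall(f.read()))
            print(counts)
            ```

            A tally dominated by "HSC" with a single "LSST-ImSim" would confirm the hard-coding suspicion.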

            Lauren MacArthur added a comment

            The output repo is linked here, where one can access the logs and view the validate_drp plots, e.g. PA1.

            Lauren MacArthur added a comment (edited)

            A slightly overdue update on this ticket.

            The metrics were re-dispatched using the following environment variable settings:

            export DATASET="DC2_test-med-1"
            export DATASET_REPO_URL="https://jira.lsstcorp.org/browse/DM-30730"
            export RUN_ID="DM-30730"
            export RUN_ID_URL="https://jira.lsstcorp.org/browse/DM-30730"
            export VERSION_TAG="w_2021_24"
            

            I've just set up an initial attempt at a Chronograf dashboard (an edited clone of the existing "DRP metrics monthly DC2 test-med-1") called:
            DRP monthly metrics for DC2 (Gen2)
            I'm not sure why the PF1 measurements are not showing up (I may have messed up the query, but can't see where),
            nor why I seem to be picking up extra TE1 values (it seems they are present for two previous runs, whereas none of the others are), nor why no i-band data seem to be showing up (despite the JSON files being present and dispatched). But otherwise all looks OK.

            Side note: should we be concerned that the matchedVisit_band.json files are of order 6.1 GB?

            I have been updating the pipe_analysis scripts so they all run on DC2 data (some plots can be seen here). I will finish that up and include the PRs on a separate ticket.

            Lauren MacArthur added a comment

            Please let me know if you'd like to see anything else on this ticket. Otherwise, I think it's ready to be closed out.

            Dan Taranu added a comment

            The PF1 query looks fine; it looks like the measurements just didn't succeed.

            grep -C 2 'validate_drp.PA1"' /datasets/DC2/repoRun2.2i/rerun/w_2021_24/DM-30730/validateDrp/matchedVisitMetrics/3828/g/matchedVisit_g.json

            ... outputs:

                  "identifier": "784f5c4e871c457eab61af8c2b0985db",
                  "metric": "validate_drp.PA1",
                  "unit": "mmag",
                  "value": 9.469147892027074
            

            Meanwhile, the same grep for validate_drp.PF1_design_gri gets nothing. So it's probably a validate_drp bug that nobody is ever going to fix.

            I think the jsons are giant because validate_drp saves blobs with entire columns worth of data in ASCII.
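
            The same check can be done without grep by loading the Job JSON directly (a sketch assuming a top-level "measurements" list with "metric"/"value" entries, as the grep output above suggests; the filename is illustrative):

            ```python
            import json

            def metric_values(job_path, metric):
                """Return all measured values for `metric` from a verify Job JSON file.

                Assumes the top-level 'measurements' list structure seen in the
                grep output; returns [] when the metric was never measured.
                """
                with open(job_path) as f:
                    job = json.load(f)
                return [m["value"] for m in job.get("measurements", [])
                        if m.get("metric") == metric]

            # e.g. metric_values("matchedVisit_g.json", "validate_drp.PF1_design_gri")
            # returning an empty list would match the empty grep above.
            ```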

            Lauren MacArthur added a comment (edited)

            Ah, right...thanks Dan.  I just checked some older logs for the PF1_design_gri entries...missing there too, so probably a long-standing (and, indeed, never-to-be-fixed) bug in validate_drp.


              People

              Assignee:
              Lauren MacArthur
              Reporter:
              Lauren MacArthur
              Reviewers:
              Yusra AlSayyad
              Watchers:
              Dan Taranu, Jim Bosch, Lauren MacArthur, Yusra AlSayyad

                Dates

                Created:
                Updated:
                Resolved:

                  Jenkins

                  No builds found.