Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-15857

jointcal of w_2018_38 fails to run

    XMLWordPrintable

    Details

    • Story Points:
      2
    • Sprint:
      AP F18-4, AP F18-5
    • Team:
      Alert Production

      Description

      Running w_2018_38 jointcal with HSC-RC2 data failed, as follows:

      INFO  2018-09-25T01:02:15.015 jointcal.Associations ()(src/Associations.cc:141)- Unmatched objects: 95
      FATAL 2018-09-25T01:02:29.029 jointcal ()(jointcal.py:103)- Failed processing tract 9615, TypeError: _basicget(): incompatible function arguments. The following argument types are supported:
          1. (self: lsst.afw.table.baseColumnView.baseColumnView._BaseColumnViewBase, arg0: lsst.afw.table.schema.schema.KeyB) -> ndarray::Array<unsigned char, 1, 0>
          2. (self: lsst.afw.table.baseColumnView.baseColumnView._BaseColumnViewBase, arg0: lsst.afw.table.schema.schema.KeyU) -> ndarray::Array<unsigned short, 1, 0>
          3. (self: lsst.afw.table.baseColumnView.baseColumnView._BaseColumnViewBase, arg0: lsst.afw.table.schema.schema.KeyI) -> ndarray::Array<int, 1, 0>
          4. (self: lsst.afw.table.baseColumnView.baseColumnView._BaseColumnViewBase, arg0: lsst.afw.table.schema.schema.KeyL) -> ndarray::Array<long, 1, 0>
          5. (self: lsst.afw.table.baseColumnView.baseColumnView._BaseColumnViewBase, arg0: lsst.afw.table.schema.schema.KeyF) -> ndarray::Array<float, 1, 0>
          6. (self: lsst.afw.table.baseColumnView.baseColumnView._BaseColumnViewBase, arg0: lsst.afw.table.schema.schema.KeyD) -> ndarray::Array<double, 1, 0>
          7. (self: lsst.afw.table.baseColumnView.baseColumnView._BaseColumnViewBase, arg0: lsst.afw.table.schema.schema.KeyFlag) -> ndarray::Array<bool const, 1, 1>
          8. (self: lsst.afw.table.baseColumnView.baseColumnView._BaseColumnViewBase, arg0: lsst.afw.table.schema.schema.KeyArrayB) -> ndarray::Array<unsigned char, 2, 1>
          9. (self: lsst.afw.table.baseColumnView.baseColumnView._BaseColumnViewBase, arg0: lsst.afw.table.schema.schema.KeyArrayU) -> ndarray::Array<unsigned short, 2, 1>
          10. (self: lsst.afw.table.baseColumnView.baseColumnView._BaseColumnViewBase, arg0: lsst.afw.table.schema.schema.KeyArrayI) -> ndarray::Array<int, 2, 1>
          11. (self: lsst.afw.table.baseColumnView.baseColumnView._BaseColumnViewBase, arg0: lsst.afw.table.schema.schema.KeyArrayF) -> ndarray::Array<float, 2, 1>
          12. (self: lsst.afw.table.baseColumnView.baseColumnView._BaseColumnViewBase, arg0: lsst.afw.table.schema.schema.KeyArrayD) -> ndarray::Array<double, 2, 1>
          13. (self: lsst.afw.table.baseColumnView.baseColumnView._BaseColumnViewBase, arg0: lsst.afw.table.schema.schema.KeyAngle) -> ndarray::Array<double, 1, 0>
       
      Invoked with: <lsst.afw.table.simple.simple.SimpleColumnView object at 0x7f41807f77d8>, None
      

      The command I ran to get the above error was:

      jointcal.py /datasets/hsc/repo --calib /datasets/hsc/repo/CALIB --rerun RC/w_2018_38/DM-15690:private/your-user-name/w38jointcal --id ccd=0..8^10..103 visit=380^384^388^404^408^424^426^436^440 filter=HSC-Y tract=9615 -C=/home/hchiang2/weeklies/w38/w38-jointcal-config.py 
      

      The config override file has:

      config.astrometryRefObjLoader.ref_dataset_name='ps1_pv3_3pi_20170110'
      config.photometryRefObjLoader.ref_dataset_name='ps1_pv3_3pi_20170110'
      config.astrometryRefObjLoader.filterMap={'B': 'g', 'r2': 'r', 'N1010': 'z', 'N816': 'i', 'I': 'i', 'N387': 'g', 'i2': 'i', 'R': 'r', 'N921': 'z', 'N515': 'g', 'V': 'r'}
      config.photometryRefObjLoader.filterMap={'B': 'g', 'r2': 'r', 'N1010': 'z', 'N816': 'i', 'I': 'i', 'N387': 'g', 'i2': 'i', 'R': 'r', 'N921': 'z', 'N515': 'g', 'V': 'r'}
      

        Attachments

          Issue Links

            Activity

            Hide
            Parejkoj John Parejko added a comment -

            Russell Owen: it looks like this is due to the change to SimpleCatalog in DM-8828 not also happening to a.net catalogs.

            Show
            Parejkoj John Parejko added a comment - Russell Owen : it looks like this is due to the change to SimpleCatalog in DM-8828 not also happening to a.net catalogs.
            Hide
            Parejkoj John Parejko added a comment -

            The error comes in the call to refCat.get(filtKeys[1]) at line jointcal.py:544. Somehow the second lsst.afw.table.Key returned by getRefFluxKeys is not being accepted by refcat.get.

            Russell Owen: do you have an idea of how to fix this in the a.net refcat loader?

            Show
            Parejkoj John Parejko added a comment - The error comes in the call to refCat.get(filtKeys [1] ) at line jointcal.py:544. Somehow the second lsst.afw.table.Key returned by getRefFluxKeys is not being accepted by refcat.get. Russell Owen : do you have an idea of how to fix this in the a.net refcat loader?
            Hide
            Parejkoj John Parejko added a comment -

            And here's a command to get to the failure in a handful of seconds:

            jointcal.py /datasets/hsc/repo --calib /datasets/hsc/repo/CALIB --rerun RC/w_2018_38/DM-15690:private/parejkoj/w38jointcal --id ccd=0..8 visit=380^384 filter=HSC-Y tract=9615 -C=/home/hchiang2/weeklies/w38/w38-jointcal-config.py --doraise -c doAstrometry=False

            Show
            Parejkoj John Parejko added a comment - And here's a command to get to the failure in a handful of seconds: jointcal.py /datasets/hsc/repo --calib /datasets/hsc/repo/CALIB --rerun RC/w_2018_38/ DM-15690 :private/parejkoj/w38jointcal --id ccd=0..8 visit=380^384 filter=HSC-Y tract=9615 -C=/home/hchiang2/weeklies/w38/w38-jointcal-config.py --doraise -c doAstrometry=False
            Hide
            rowen Russell Owen added a comment - - edited

            For the record: all reference catalogs created before DM-8828 are SourceCatalogs, but are read as if they are SimpleCatalogs. I'm not aware of any problems caused by this as long as code does not try to access functions provided by SourceCatalog. Certainly all unit tests and ci_hsc passed. Also the format of astrometry.net reference catalogs did not change at all.

            I suggest you put a debugger at that point and do the following:

            • execute the code that fails by typing it into the terminal to make sure it fails (sometimes it's a later iteration that needs to be caught, if so, put a try/except block around the code temporarily and put the debugger breakpoint in the "except')
            • see if the catalog is contiguous
            • try to get each key of the catalog – at least one should fail
            • try to get the key of a record
            • look at the schema to see if there's an obvious mismatch between your expectation and the key
            Show
            rowen Russell Owen added a comment - - edited For the record: all reference catalogs created before DM-8828 are SourceCatalogs, but are read as if they are SimpleCatalogs. I'm not aware of any problems caused by this as long as code does not try to access functions provided by SourceCatalog. Certainly all unit tests and ci_hsc passed. Also the format of astrometry.net reference catalogs did not change at all. I suggest you put a debugger at that point and do the following: execute the code that fails by typing it into the terminal to make sure it fails (sometimes it's a later iteration that needs to be caught, if so, put a try/except block around the code temporarily and put the debugger breakpoint in the "except') see if the catalog is contiguous try to get each key of the catalog – at least one should fail try to get the key of a record look at the schema to see if there's an obvious mismatch between your expectation and the key
            Hide
            Parejkoj John Parejko added a comment -

            Ok, I've figured out the problem here: the refcat has a v1 schema, which means it is getting aliases like this:

            'g_instFlux'->'g_flux'
            'g_instFluxErr'->'g_fluxSigma'

            which are incorrect aliases for refcats (refcats don't have instFlux!), and which cause getRefFluxKeys() to return None for the fluxErrKey. Part of the bug lies in the above incorrect alias setting, and part in how getRefFluxKeys() determines the Err field name. I'll explore a few possible solutions.

            Show
            Parejkoj John Parejko added a comment - Ok, I've figured out the problem here: the refcat has a v1 schema, which means it is getting aliases like this: 'g_instFlux'->'g_flux' 'g_instFluxErr'->'g_fluxSigma' which are incorrect aliases for refcats (refcats don't have instFlux!), and which cause getRefFluxKeys() to return None for the fluxErrKey . Part of the bug lies in the above incorrect alias setting, and part in how getRefFluxKeys() determines the Err field name. I'll explore a few possible solutions.
            Hide
            Parejkoj John Parejko added a comment -

            Progress report, since this is blocking Hsin-Fang Chiang from running jointcal on hsc: I've got a working unittest that demonstrates the problem (but I haven't implemented a fix yet), and a script written to update our old refcats to be SimpleCatalogs instead of SourceCatalogs (which will be necessary for the planned fix to work).

            Show
            Parejkoj John Parejko added a comment - Progress report, since this is blocking Hsin-Fang Chiang from running jointcal on hsc: I've got a working unittest that demonstrates the problem (but I haven't implemented a fix yet), and a script written to update our old refcats to be SimpleCatalogs instead of SourceCatalogs (which will be necessary for the planned fix to work).
            Hide
            Parejkoj John Parejko added a comment -

            Jim Bosch: do you mind reviewing this? It's fairly short, and I figured you'd be best placed to decide if it's a good approach, given how its mucking around in the guts of afw.table.

            The new bin.src/ script needs to be run on existing refcats to update their header AFW_TYPE key.

            PR since it's not showing up for me: https://github.com/lsst/afw/pull/397

            Jenkins run: https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/28712/pipeline

            Show
            Parejkoj John Parejko added a comment - Jim Bosch : do you mind reviewing this? It's fairly short, and I figured you'd be best placed to decide if it's a good approach, given how its mucking around in the guts of afw.table. The new bin.src/ script needs to be run on existing refcats to update their header AFW_TYPE key. PR since it's not showing up for me: https://github.com/lsst/afw/pull/397 Jenkins run: https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/28712/pipeline
            Hide
            erykoff Eli Rykoff added a comment - - edited

            So I'm not keen on having to update the reference catalogs everywhere they exist, if we can avoid it.  And I think that sort of change does require an RFC, as I don't think we can modify any of the files in the common repos (or even add to the common repos) without an RFC.

            However, I was looking at the issue at hand (if affects FGCM as well, also uncaught by the tests using the a.net catalogs in testdata_jointcal), and it seems to me that there's an unfortunate mix of the Sigma -> Err change with the flux -> instFlux change (which is indeed getting inappropriately applied here), combined with the fact that getRefFluxField (https://github.com/lsst/meas_algorithms/blob/6a7bfb3b1a1aa5a3c7ece46c8b3bedbd1fa687c3/python/lsst/meas/algorithms/loadReferenceObjects.py#L40 ) is ignorant of the change. Therefore, when calling (e.g.) loadSkyCircle it returns a flux field (g_flux) that is the correct field, but now g_fluxErr no longer works: there was a brief and harmonious window of time when g_fluxErr -> g_fluxSigma but now it thinks that g_instFluxErr -> g_fluxSigma which breaks everything.

            A fix that I think would work (but would be a bit ugly, and I could see people yelping) would be to update getRefFluxField to use instFlux if it finds it. If this was first on the list, it would grab this (erroneously named, yes, but workable and completely internal-facing) instFlux from the aliases and everything would continue as normal. With future, updated catalogs, then the version/type would be updated, and it would never find instFlux in the schema.

            Another possible fix, though I don't know how the aliasing works, is to do "two-stage" aliasing for each of the Sigma -> Err and flux -> instFlux updates. Then for each band the schema would look like 'g_instFlux'->'g_flux', 'g_instFluxErr' -> 'g_fluxSigma', 'g_fluxErr'->'g_fluxSigma'.

            Show
            erykoff Eli Rykoff added a comment - - edited So I'm not keen on having to update the reference catalogs everywhere they exist, if we can avoid it.  And I think that sort of change does require an RFC, as I don't think we can modify any of the files in the common repos (or even add to the common repos) without an RFC. However, I was looking at the issue at hand (if affects FGCM as well, also uncaught by the tests using the a.net catalogs in testdata_jointcal ), and it seems to me that there's an unfortunate mix of the Sigma -> Err change with the flux -> instFlux change (which is indeed getting inappropriately applied here), combined with the fact that getRefFluxField ( https://github.com/lsst/meas_algorithms/blob/6a7bfb3b1a1aa5a3c7ece46c8b3bedbd1fa687c3/python/lsst/meas/algorithms/loadReferenceObjects.py#L40 ) is ignorant of the change. Therefore, when calling (e.g.) loadSkyCircle it returns a flux field ( g_flux ) that is the correct field, but now g_fluxErr no longer works: there was a brief and harmonious window of time when g_fluxErr -> g_fluxSigma but now it thinks that g_instFluxErr -> g_fluxSigma which breaks everything. A fix that I think would work (but would be a bit ugly, and I could see people yelping) would be to update getRefFluxField to use instFlux if it finds it. If this was first on the list, it would grab this (erroneously named, yes, but workable and completely internal-facing) instFlux from the aliases and everything would continue as normal. With future, updated catalogs, then the version/type would be updated, and it would never find instFlux in the schema. Another possible fix, though I don't know how the aliasing works, is to do "two-stage" aliasing for each of the Sigma -> Err and flux -> instFlux updates. Then for each band the schema would look like 'g_instFlux'->'g_flux' , 'g_instFluxErr' -> 'g_fluxSigma' , 'g_fluxErr'->'g_fluxSigma' .
            Hide
            jbosch Jim Bosch added a comment -

            John Parejko and I are now in the process of testing a very promising new idea: only add the instFlux aliases when the old field has units of "count" (or "dn", or "adu", just in case).  Happily we seem to be quite good about doing that where we should.

            Show
            jbosch Jim Bosch added a comment - John Parejko and I are now in the process of testing a very promising new idea: only add the instFlux aliases when the old field has units of "count" (or "dn", or "adu", just in case).  Happily we seem to be quite good about doing that where we should.
            Hide
            erykoff Eli Rykoff added a comment -

            Oh, that sounds much better than my idea.

            Show
            erykoff Eli Rykoff added a comment - Oh, that sounds much better than my idea.
            Hide
            jbosch Jim Bosch added a comment -

            John Parejko, I think this is all working.  If you look over my commits, we can call it reviewed, squash some commits (I have not done any of that so far), and kick off a Jenkins run.  My only comment looking at your code is that if you want to keep the script to convert refcats from SOURCE to SIMPLE (fine with me, though moving it to another ticket to do the full conversion as we discussed would also make sense), you should add a .gitignore entry for the bin/ version.

            Show
            jbosch Jim Bosch added a comment - John Parejko , I think this is all working.  If you look over my commits, we can call it reviewed, squash some commits (I have not done any of that so far), and kick off a Jenkins run.  My only comment looking at your code is that if you want to keep the script to convert refcats from SOURCE to SIMPLE (fine with me, though moving it to another ticket to do the full conversion as we discussed would also make sense), you should add a .gitignore entry for the bin/ version.
            Hide
            Parejkoj John Parejko added a comment -

            Since the relevant changes are now from you, I switched the reviewer and assignee. See my comments on github.

            Thanks again for coming up with a better solution.

            Show
            Parejkoj John Parejko added a comment - Since the relevant changes are now from you, I switched the reviewer and assignee. See my comments on github. Thanks again for coming up with a better solution.
            Hide
            jbosch Jim Bosch added a comment -

            So, I mentioned on GitHub that there were some downstream failures in daf_butler and meas_astrom due to test data that didn't have its units set the way real processing outputs do.  I was going to fix that with more workaround code in afw, but that wasn't sufficient for meas_astrom, because the SourceSelectors apparently duplicate some of the slot-alias logic (not a dig at them; they may not have a choice).

            Anyhow, I've instead just gone and fixed the units in the test data in daf_butler and meas_astrom, which gets me further - but I'm now seeing "unable to match" test failures in meas_astrom, which I don't understand at all, because all of the files being read in look okay in memory.  It could be that some code isn't using aliases the way it should, or maybe something totally different hidden by an overzealous exception specification that leads to a bad error message.  Debugging will probably require having master and ticket versions of meas_astrom running side-by-side in pdb, which I'm afraid I don't have time for today (and perhaps not this week).  John Parejko, please feel free to pick that up if you have time; if not, I'll return to this when I can.

            Show
            jbosch Jim Bosch added a comment - So, I mentioned on GitHub that there were some downstream failures in daf_butler and meas_astrom due to test data that didn't have its units set the way real processing outputs do.  I was going to fix that with more workaround code in afw, but that wasn't sufficient for meas_astrom, because the SourceSelectors apparently duplicate some of the slot-alias logic (not a dig at them; they may not have a choice). Anyhow, I've instead just gone and fixed the units in the test data in daf_butler and meas_astrom, which gets me further - but I'm now seeing "unable to match" test failures in meas_astrom , which I don't understand at all, because all of the files being read in look okay in memory.  It could be that some code isn't using aliases the way it should, or maybe something totally different hidden by an overzealous exception specification that leads to a bad error message.  Debugging will probably require having master and ticket versions of meas_astrom running side-by-side in pdb, which I'm afraid I don't have time for today (and perhaps not this week).  John Parejko , please feel free to pick that up if you have time; if not, I'll return to this when I can.
            Hide
            Parejkoj John Parejko added a comment -

            Thanks to astropy for the `io.fits.setval()` method that lets you change headers directly, I was able to update some test catalogs (actually, the same two test catalogs that live in meas_astrom, pipe_tasks, and meas_extensions_anet) get this to build through on my machine. Fixing those also showed up some incorrect flux->instFlux changes I'd made in other unittests, which are now fixed.

            New Jenkins run: https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/28782/pipeline

            Show
            Parejkoj John Parejko added a comment - Thanks to astropy for the `io.fits.setval()` method that lets you change headers directly, I was able to update some test catalogs (actually, the same two test catalogs that live in meas_astrom, pipe_tasks, and meas_extensions_anet) get this to build through on my machine. Fixing those also showed up some incorrect flux->instFlux changes I'd made in other unittests, which are now fixed. New Jenkins run: https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/28782/pipeline
            Hide
            jbosch Jim Bosch added a comment -

            Great!  And you can consider those latest changes reviewed.  Good to merge whenever you're ready.

            Show
            jbosch Jim Bosch added a comment - Great!  And you can consider those latest changes reviewed.  Good to merge whenever you're ready.
            Hide
            Parejkoj John Parejko added a comment -

            Thanks Jim Bosch. Jenkins passed: I hope we got everything this time!

            Merged and done.

            Show
            Parejkoj John Parejko added a comment - Thanks Jim Bosch . Jenkins passed: I hope we got everything this time! Merged and done.
            Hide
            lauren Lauren MacArthur added a comment -

            John Parejko and I are now in the process of testing a very promising new idea: only add the {{instFlux}}aliases when the old field has units of "count" (or "dn", or "adu", just in case).  Happily we seem to be quite good about doing that where we should.

            I fear I may have just uncovered an important exception to the above. No units for:

            (Field['D'](name="modelfit_CModel_flux", doc="flux from the final cmodel fit"), Key<D>(offset=984, nElements=1))
            (Field['D'](name="modelfit_CModel_fluxErr", doc="flux uncertainty from the final cmodel fit"), Key<D>(offset=992, nElements=1))
            

            So I'm getting

            Field or subfield with name 'modelfit_CModel_instFlux' not found with type 'D'. {0}
            lsst::pex::exceptions::NotFoundError: 'Field or subfield with name 'modelfit_CModel_instFlux' not found with type 'D'.'
            

            with a simple

            butler.get("deepCoadd_forced_src", dataIdI, immediate=True)
            

            Note that this is for w_2018_40 (I'm using Eli's stack at /project/erykoff/stack/)

            Show
            lauren Lauren MacArthur added a comment - John Parejko and I are now in the process of testing a very promising new idea: only add the {{instFlux}}aliases when the old field has units of "count" (or "dn", or "adu", just in case).  Happily we seem to be quite good about doing that where we should. I fear I may have just uncovered an important exception to the above. No units for: (Field[ 'D' ](name= "modelfit_CModel_flux" , doc= "flux from the final cmodel fit" ), Key<D>(offset= 984 , nElements= 1 )) (Field[ 'D' ](name= "modelfit_CModel_fluxErr" , doc= "flux uncertainty from the final cmodel fit" ), Key<D>(offset= 992 , nElements= 1 )) So I'm getting Field or subfield with name 'modelfit_CModel_instFlux' not found with type 'D' . { 0 } lsst::pex::exceptions::NotFoundError: 'Field or subfield with name ' modelfit_CModel_instFlux ' not found with type ' D '.' with a simple butler.get( "deepCoadd_forced_src" , dataIdI, immediate=True) Note that this is for w_2018_40 (I'm using Eli's stack at /project/erykoff/stack/ )
            Hide
            lauren Lauren MacArthur added a comment -

            A few more cases (but not a thorough search):

            (Field['D'](name="modelfit_CModel_initial_flux_inner", doc="flux within the fit region, with no extrapolation"), Key<D>(offset=928, nElements=1))
            (Field['D'](name="modelfit_CModel_dev_flux_inner", doc="flux within the fit region, with no extrapolation"), Key<D>(offset=976, nElements=1))
            (Field['D'](name="modelfit_CModel_flux_inner", doc="flux within the fit region, with no extrapolation"), Key<D>(offset=1000, nElements=1))
            

             

            Show
            lauren Lauren MacArthur added a comment - A few more cases (but not a thorough search): (Field[ 'D' ](name= "modelfit_CModel_initial_flux_inner" , doc= "flux within the fit region, with no extrapolation" ), Key<D>(offset= 928 , nElements= 1 )) (Field[ 'D' ](name= "modelfit_CModel_dev_flux_inner" , doc= "flux within the fit region, with no extrapolation" ), Key<D>(offset= 976 , nElements= 1 )) (Field[ 'D' ](name= "modelfit_CModel_flux_inner" , doc= "flux within the fit region, with no extrapolation" ), Key<D>(offset= 1000 , nElements= 1 ))  
            Hide
            Parejkoj John Parejko added a comment -

            Please file a new ticket for those. I'm not sure how to handle that, but it does strongly suggest that we need to fix modelfit to output units!

            Show
            Parejkoj John Parejko added a comment - Please file a new ticket for those. I'm not sure how to handle that, but it does strongly suggest that we need to fix modelfit to output units!
            Hide
            lauren Lauren MacArthur added a comment -

            See DM-16068.

            Show
            lauren Lauren MacArthur added a comment - See DM-16068 .

              People

              Assignee:
              Parejkoj John Parejko
              Reporter:
              hchiang2 Hsin-Fang Chiang
              Reviewers:
              John Parejko
              Watchers:
              Eli Rykoff, Hsin-Fang Chiang, Jim Bosch, John Parejko, Lauren MacArthur, Russell Owen, Yusra AlSayyad
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  Jenkins Builds

                  No builds found.