# jointcal of w_2018_38 fails to run

XMLWordPrintable

#### Details

• Type: Bug
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s:
• Labels:
• Story Points:
2
• Sprint:
AP F18-4, AP F18-5
• Team:

#### Description

Running w_2018_38 jointcal with HSC-RC2 data failed, as follows:

 INFO 2018-09-25T01:02:15.015 jointcal.Associations ()(src/Associations.cc:141)- Unmatched objects: 95 FATAL 2018-09-25T01:02:29.029 jointcal ()(jointcal.py:103)- Failed processing tract 9615, TypeError: _basicget(): incompatible function arguments. The following argument types are supported:  1. (self: lsst.afw.table.baseColumnView.baseColumnView._BaseColumnViewBase, arg0: lsst.afw.table.schema.schema.KeyB) -> ndarray::Array  2. (self: lsst.afw.table.baseColumnView.baseColumnView._BaseColumnViewBase, arg0: lsst.afw.table.schema.schema.KeyU) -> ndarray::Array  3. (self: lsst.afw.table.baseColumnView.baseColumnView._BaseColumnViewBase, arg0: lsst.afw.table.schema.schema.KeyI) -> ndarray::Array  4. (self: lsst.afw.table.baseColumnView.baseColumnView._BaseColumnViewBase, arg0: lsst.afw.table.schema.schema.KeyL) -> ndarray::Array  5. (self: lsst.afw.table.baseColumnView.baseColumnView._BaseColumnViewBase, arg0: lsst.afw.table.schema.schema.KeyF) -> ndarray::Array  6. (self: lsst.afw.table.baseColumnView.baseColumnView._BaseColumnViewBase, arg0: lsst.afw.table.schema.schema.KeyD) -> ndarray::Array  7. (self: lsst.afw.table.baseColumnView.baseColumnView._BaseColumnViewBase, arg0: lsst.afw.table.schema.schema.KeyFlag) -> ndarray::Array  8. (self: lsst.afw.table.baseColumnView.baseColumnView._BaseColumnViewBase, arg0: lsst.afw.table.schema.schema.KeyArrayB) -> ndarray::Array  9. (self: lsst.afw.table.baseColumnView.baseColumnView._BaseColumnViewBase, arg0: lsst.afw.table.schema.schema.KeyArrayU) -> ndarray::Array  10. (self: lsst.afw.table.baseColumnView.baseColumnView._BaseColumnViewBase, arg0: lsst.afw.table.schema.schema.KeyArrayI) -> ndarray::Array  11. (self: lsst.afw.table.baseColumnView.baseColumnView._BaseColumnViewBase, arg0: lsst.afw.table.schema.schema.KeyArrayF) -> ndarray::Array  12. (self: lsst.afw.table.baseColumnView.baseColumnView._BaseColumnViewBase, arg0: lsst.afw.table.schema.schema.KeyArrayD) -> ndarray::Array  13. (self: lsst.afw.table.baseColumnView.baseColumnView._BaseColumnViewBase, arg0: lsst.afw.table.schema.schema.KeyAngle) -> ndarray::Array   Invoked with: , None 

The command I ran to get the above error was:

 jointcal.py /datasets/hsc/repo --calib /datasets/hsc/repo/CALIB --rerun RC/w_2018_38/DM-15690:private/your-user-name/w38jointcal --id ccd=0..8^10..103 visit=380^384^388^404^408^424^426^436^440 filter=HSC-Y tract=9615 -C=/home/hchiang2/weeklies/w38/w38-jointcal-config.py 

The config override file has:

 config.astrometryRefObjLoader.ref_dataset_name='ps1_pv3_3pi_20170110' config.photometryRefObjLoader.ref_dataset_name='ps1_pv3_3pi_20170110' config.astrometryRefObjLoader.filterMap={'B': 'g', 'r2': 'r', 'N1010': 'z', 'N816': 'i', 'I': 'i', 'N387': 'g', 'i2': 'i', 'R': 'r', 'N921': 'z', 'N515': 'g', 'V': 'r'} config.photometryRefObjLoader.filterMap={'B': 'g', 'r2': 'r', 'N1010': 'z', 'N816': 'i', 'I': 'i', 'N387': 'g', 'i2': 'i', 'R': 'r', 'N921': 'z', 'N515': 'g', 'V': 'r'} 

#### Activity

Hide
John Parejko added a comment -

Russell Owen: it looks like this is due to the change to SimpleCatalog in DM-8828 not also happening to a.net catalogs.

Show
John Parejko added a comment - Russell Owen : it looks like this is due to the change to SimpleCatalog in DM-8828 not also happening to a.net catalogs.
Hide
John Parejko added a comment -

The error comes in the call to refCat.get(filtKeys[1]) at line jointcal.py:544. Somehow the second lsst.afw.table.Key returned by getRefFluxKeys is not being accepted by refcat.get.

Russell Owen: do you have an idea of how to fix this in the a.net refcat loader?

Show
John Parejko added a comment - The error comes in the call to refCat.get(filtKeys [1] ) at line jointcal.py:544. Somehow the second lsst.afw.table.Key returned by getRefFluxKeys is not being accepted by refcat.get. Russell Owen : do you have an idea of how to fix this in the a.net refcat loader?
Hide
John Parejko added a comment -

And here's a command to get to the failure in a handful of seconds:

jointcal.py /datasets/hsc/repo --calib /datasets/hsc/repo/CALIB --rerun RC/w_2018_38/DM-15690:private/parejkoj/w38jointcal --id ccd=0..8 visit=380^384 filter=HSC-Y tract=9615 -C=/home/hchiang2/weeklies/w38/w38-jointcal-config.py --doraise -c doAstrometry=False

Show
John Parejko added a comment - And here's a command to get to the failure in a handful of seconds: jointcal.py /datasets/hsc/repo --calib /datasets/hsc/repo/CALIB --rerun RC/w_2018_38/ DM-15690 :private/parejkoj/w38jointcal --id ccd=0..8 visit=380^384 filter=HSC-Y tract=9615 -C=/home/hchiang2/weeklies/w38/w38-jointcal-config.py --doraise -c doAstrometry=False
Hide
Russell Owen added a comment - - edited

For the record: all reference catalogs created before DM-8828 are SourceCatalogs, but are read as if they are SimpleCatalogs. I'm not aware of any problems caused by this as long as code does not try to access functions provided by SourceCatalog. Certainly all unit tests and ci_hsc passed. Also the format of astrometry.net reference catalogs did not change at all.

I suggest you put a debugger at that point and do the following:

• execute the code that fails by typing it into the terminal to make sure it fails (sometimes it's a later iteration that needs to be caught, if so, put a try/except block around the code temporarily and put the debugger breakpoint in the "except')
• see if the catalog is contiguous
• try to get each key of the catalog – at least one should fail
• try to get the key of a record
• look at the schema to see if there's an obvious mismatch between your expectation and the key
Show
Russell Owen added a comment - - edited For the record: all reference catalogs created before DM-8828 are SourceCatalogs, but are read as if they are SimpleCatalogs. I'm not aware of any problems caused by this as long as code does not try to access functions provided by SourceCatalog. Certainly all unit tests and ci_hsc passed. Also the format of astrometry.net reference catalogs did not change at all. I suggest you put a debugger at that point and do the following: execute the code that fails by typing it into the terminal to make sure it fails (sometimes it's a later iteration that needs to be caught, if so, put a try/except block around the code temporarily and put the debugger breakpoint in the "except') see if the catalog is contiguous try to get each key of the catalog – at least one should fail try to get the key of a record look at the schema to see if there's an obvious mismatch between your expectation and the key
Hide
John Parejko added a comment -

Ok, I've figured out the problem here: the refcat has a v1 schema, which means it is getting aliases like this:

'g_instFlux'->'g_flux'
'g_instFluxErr'->'g_fluxSigma'

which are incorrect aliases for refcats (refcats don't have instFlux!), and which cause getRefFluxKeys() to return None for the fluxErrKey. Part of the bug lies in the above incorrect alias setting, and part in how getRefFluxKeys() determines the Err field name. I'll explore a few possible solutions.

Show
John Parejko added a comment - Ok, I've figured out the problem here: the refcat has a v1 schema, which means it is getting aliases like this: 'g_instFlux'->'g_flux' 'g_instFluxErr'->'g_fluxSigma' which are incorrect aliases for refcats (refcats don't have instFlux!), and which cause getRefFluxKeys() to return None for the fluxErrKey . Part of the bug lies in the above incorrect alias setting, and part in how getRefFluxKeys() determines the Err field name. I'll explore a few possible solutions.
Hide
John Parejko added a comment -

Progress report, since this is blocking Hsin-Fang Chiang from running jointcal on hsc: I've got a working unittest that demonstrates the problem (but I haven't implemented a fix yet), and a script written to update our old refcats to be SimpleCatalogs instead of SourceCatalogs (which will be necessary for the planned fix to work).

Show
John Parejko added a comment - Progress report, since this is blocking Hsin-Fang Chiang from running jointcal on hsc: I've got a working unittest that demonstrates the problem (but I haven't implemented a fix yet), and a script written to update our old refcats to be SimpleCatalogs instead of SourceCatalogs (which will be necessary for the planned fix to work).
Hide
John Parejko added a comment -

Jim Bosch: do you mind reviewing this? It's fairly short, and I figured you'd be best placed to decide if it's a good approach, given how its mucking around in the guts of afw.table.

The new bin.src/ script needs to be run on existing refcats to update their header AFW_TYPE key.

PR since it's not showing up for me: https://github.com/lsst/afw/pull/397

Show
John Parejko added a comment - Jim Bosch : do you mind reviewing this? It's fairly short, and I figured you'd be best placed to decide if it's a good approach, given how its mucking around in the guts of afw.table. The new bin.src/ script needs to be run on existing refcats to update their header AFW_TYPE key. PR since it's not showing up for me: https://github.com/lsst/afw/pull/397 Jenkins run: https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/28712/pipeline
Hide
Eli Rykoff added a comment - - edited

So I'm not keen on having to update the reference catalogs everywhere they exist, if we can avoid it.  And I think that sort of change does require an RFC, as I don't think we can modify any of the files in the common repos (or even add to the common repos) without an RFC.

However, I was looking at the issue at hand (if affects FGCM as well, also uncaught by the tests using the a.net catalogs in testdata_jointcal), and it seems to me that there's an unfortunate mix of the Sigma -> Err change with the flux -> instFlux change (which is indeed getting inappropriately applied here), combined with the fact that getRefFluxField (https://github.com/lsst/meas_algorithms/blob/6a7bfb3b1a1aa5a3c7ece46c8b3bedbd1fa687c3/python/lsst/meas/algorithms/loadReferenceObjects.py#L40 ) is ignorant of the change. Therefore, when calling (e.g.) loadSkyCircle it returns a flux field (g_flux) that is the correct field, but now g_fluxErr no longer works: there was a brief and harmonious window of time when g_fluxErr -> g_fluxSigma but now it thinks that g_instFluxErr -> g_fluxSigma which breaks everything.

A fix that I think would work (but would be a bit ugly, and I could see people yelping) would be to update getRefFluxField to use instFlux if it finds it. If this was first on the list, it would grab this (erroneously named, yes, but workable and completely internal-facing) instFlux from the aliases and everything would continue as normal. With future, updated catalogs, then the version/type would be updated, and it would never find instFlux in the schema.

Another possible fix, though I don't know how the aliasing works, is to do "two-stage" aliasing for each of the Sigma -> Err and flux -> instFlux updates. Then for each band the schema would look like 'g_instFlux'->'g_flux', 'g_instFluxErr' -> 'g_fluxSigma', 'g_fluxErr'->'g_fluxSigma'.

Show
Eli Rykoff added a comment - - edited So I'm not keen on having to update the reference catalogs everywhere they exist, if we can avoid it.  And I think that sort of change does require an RFC, as I don't think we can modify any of the files in the common repos (or even add to the common repos) without an RFC. However, I was looking at the issue at hand (if affects FGCM as well, also uncaught by the tests using the a.net catalogs in testdata_jointcal ), and it seems to me that there's an unfortunate mix of the Sigma -> Err change with the flux -> instFlux change (which is indeed getting inappropriately applied here), combined with the fact that getRefFluxField ( https://github.com/lsst/meas_algorithms/blob/6a7bfb3b1a1aa5a3c7ece46c8b3bedbd1fa687c3/python/lsst/meas/algorithms/loadReferenceObjects.py#L40 ) is ignorant of the change. Therefore, when calling (e.g.) loadSkyCircle it returns a flux field ( g_flux ) that is the correct field, but now g_fluxErr no longer works: there was a brief and harmonious window of time when g_fluxErr -> g_fluxSigma but now it thinks that g_instFluxErr -> g_fluxSigma which breaks everything. A fix that I think would work (but would be a bit ugly, and I could see people yelping) would be to update getRefFluxField to use instFlux if it finds it. If this was first on the list, it would grab this (erroneously named, yes, but workable and completely internal-facing) instFlux from the aliases and everything would continue as normal. With future, updated catalogs, then the version/type would be updated, and it would never find instFlux in the schema. Another possible fix, though I don't know how the aliasing works, is to do "two-stage" aliasing for each of the Sigma -> Err and flux -> instFlux updates. Then for each band the schema would look like 'g_instFlux'->'g_flux' , 'g_instFluxErr' -> 'g_fluxSigma' , 'g_fluxErr'->'g_fluxSigma' .
Hide
Jim Bosch added a comment -

John Parejko and I are now in the process of testing a very promising new idea: only add the instFlux aliases when the old field has units of "count" (or "dn", or "adu", just in case).  Happily we seem to be quite good about doing that where we should.

Show
Jim Bosch added a comment - John Parejko and I are now in the process of testing a very promising new idea: only add the instFlux aliases when the old field has units of "count" (or "dn", or "adu", just in case).  Happily we seem to be quite good about doing that where we should.
Hide
Eli Rykoff added a comment -

Oh, that sounds much better than my idea.

Show
Eli Rykoff added a comment - Oh, that sounds much better than my idea.
Hide
Jim Bosch added a comment -

John Parejko, I think this is all working.  If you look over my commits, we can call it reviewed, squash some commits (I have not done any of that so far), and kick off a Jenkins run.  My only comment looking at your code is that if you want to keep the script to convert refcats from SOURCE to SIMPLE (fine with me, though moving it to another ticket to do the full conversion as we discussed would also make sense), you should add a .gitignore entry for the bin/ version.

Show
Jim Bosch added a comment - John Parejko , I think this is all working.  If you look over my commits, we can call it reviewed, squash some commits (I have not done any of that so far), and kick off a Jenkins run.  My only comment looking at your code is that if you want to keep the script to convert refcats from SOURCE to SIMPLE (fine with me, though moving it to another ticket to do the full conversion as we discussed would also make sense), you should add a .gitignore entry for the bin/ version.
Hide
John Parejko added a comment -

Since the relevant changes are now from you, I switched the reviewer and assignee. See my comments on github.

Thanks again for coming up with a better solution.

Show
John Parejko added a comment - Since the relevant changes are now from you, I switched the reviewer and assignee. See my comments on github. Thanks again for coming up with a better solution.
Hide
Jim Bosch added a comment -

So, I mentioned on GitHub that there were some downstream failures in daf_butler and meas_astrom due to test data that didn't have its units set the way real processing outputs do.  I was going to fix that with more workaround code in afw, but that wasn't sufficient for meas_astrom, because the SourceSelectors apparently duplicate some of the slot-alias logic (not a dig at them; they may not have a choice).

Anyhow, I've instead just gone and fixed the units in the test data in daf_butler and meas_astrom, which gets me further - but I'm now seeing "unable to match" test failures in meas_astrom, which I don't understand at all, because all of the files being read in look okay in memory.  It could be that some code isn't using aliases the way it should, or maybe something totally different hidden by an overzealous exception specification that leads to a bad error message.  Debugging will probably require having master and ticket versions of meas_astrom running side-by-side in pdb, which I'm afraid I don't have time for today (and perhaps not this week).  John Parejko, please feel free to pick that up if you have time; if not, I'll return to this when I can.

Show
Jim Bosch added a comment - So, I mentioned on GitHub that there were some downstream failures in daf_butler and meas_astrom due to test data that didn't have its units set the way real processing outputs do.  I was going to fix that with more workaround code in afw, but that wasn't sufficient for meas_astrom, because the SourceSelectors apparently duplicate some of the slot-alias logic (not a dig at them; they may not have a choice). Anyhow, I've instead just gone and fixed the units in the test data in daf_butler and meas_astrom, which gets me further - but I'm now seeing "unable to match" test failures in meas_astrom , which I don't understand at all, because all of the files being read in look okay in memory.  It could be that some code isn't using aliases the way it should, or maybe something totally different hidden by an overzealous exception specification that leads to a bad error message.  Debugging will probably require having master and ticket versions of meas_astrom running side-by-side in pdb, which I'm afraid I don't have time for today (and perhaps not this week).  John Parejko , please feel free to pick that up if you have time; if not, I'll return to this when I can.
Hide
John Parejko added a comment -

Thanks to astropy for the io.fits.setval() method that lets you change headers directly, I was able to update some test catalogs (actually, the same two test catalogs that live in meas_astrom, pipe_tasks, and meas_extensions_anet) get this to build through on my machine. Fixing those also showed up some incorrect flux->instFlux changes I'd made in other unittests, which are now fixed.

Show
John Parejko added a comment - Thanks to astropy for the io.fits.setval() method that lets you change headers directly, I was able to update some test catalogs (actually, the same two test catalogs that live in meas_astrom, pipe_tasks, and meas_extensions_anet) get this to build through on my machine. Fixing those also showed up some incorrect flux->instFlux changes I'd made in other unittests, which are now fixed. New Jenkins run: https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/28782/pipeline
Hide
Jim Bosch added a comment -

Great!  And you can consider those latest changes reviewed.  Good to merge whenever you're ready.

Show
Jim Bosch added a comment - Great!  And you can consider those latest changes reviewed.  Good to merge whenever you're ready.
Hide
John Parejko added a comment -

Thanks Jim Bosch. Jenkins passed: I hope we got everything this time!

Merged and done.

Show
John Parejko added a comment - Thanks Jim Bosch . Jenkins passed: I hope we got everything this time! Merged and done.
Hide
Lauren MacArthur added a comment -

John Parejko and I are now in the process of testing a very promising new idea: only add the {{instFlux}}aliases when the old field has units of "count" (or "dn", or "adu", just in case).  Happily we seem to be quite good about doing that where we should.

I fear I may have just uncovered an important exception to the above. No units for:

 (Field['D'](name="modelfit_CModel_flux", doc="flux from the final cmodel fit"), Key(offset=984, nElements=1)) (Field['D'](name="modelfit_CModel_fluxErr", doc="flux uncertainty from the final cmodel fit"), Key(offset=992, nElements=1)) 

So I'm getting

 Field or subfield with name 'modelfit_CModel_instFlux' not found with type 'D'. {0} lsst::pex::exceptions::NotFoundError: 'Field or subfield with name 'modelfit_CModel_instFlux' not found with type 'D'.' 

with a simple

 butler.get("deepCoadd_forced_src", dataIdI, immediate=True) 

Note that this is for w_2018_40 (I'm using Eli's stack at /project/erykoff/stack/)

Show
Lauren MacArthur added a comment - John Parejko and I are now in the process of testing a very promising new idea: only add the {{instFlux}}aliases when the old field has units of "count" (or "dn", or "adu", just in case).  Happily we seem to be quite good about doing that where we should. I fear I may have just uncovered an important exception to the above. No units for: (Field[ 'D' ](name= "modelfit_CModel_flux" , doc= "flux from the final cmodel fit" ), Key<D>(offset= 984 , nElements= 1 )) (Field[ 'D' ](name= "modelfit_CModel_fluxErr" , doc= "flux uncertainty from the final cmodel fit" ), Key<D>(offset= 992 , nElements= 1 )) So I'm getting Field or subfield with name 'modelfit_CModel_instFlux' not found with type 'D' . { 0 } lsst::pex::exceptions::NotFoundError: 'Field or subfield with name ' modelfit_CModel_instFlux ' not found with type ' D '.' with a simple butler.get( "deepCoadd_forced_src" , dataIdI, immediate=True) Note that this is for w_2018_40 (I'm using Eli's stack at /project/erykoff/stack/ )
Hide
Lauren MacArthur added a comment -

A few more cases (but not a thorough search):

 (Field['D'](name="modelfit_CModel_initial_flux_inner", doc="flux within the fit region, with no extrapolation"), Key(offset=928, nElements=1)) (Field['D'](name="modelfit_CModel_dev_flux_inner", doc="flux within the fit region, with no extrapolation"), Key(offset=976, nElements=1)) (Field['D'](name="modelfit_CModel_flux_inner", doc="flux within the fit region, with no extrapolation"), Key(offset=1000, nElements=1)) 

Show
Lauren MacArthur added a comment - A few more cases (but not a thorough search): (Field[ 'D' ](name= "modelfit_CModel_initial_flux_inner" , doc= "flux within the fit region, with no extrapolation" ), Key<D>(offset= 928 , nElements= 1 )) (Field[ 'D' ](name= "modelfit_CModel_dev_flux_inner" , doc= "flux within the fit region, with no extrapolation" ), Key<D>(offset= 976 , nElements= 1 )) (Field[ 'D' ](name= "modelfit_CModel_flux_inner" , doc= "flux within the fit region, with no extrapolation" ), Key<D>(offset= 1000 , nElements= 1 ))
Hide
John Parejko added a comment -

Please file a new ticket for those. I'm not sure how to handle that, but it does strongly suggest that we need to fix modelfit to output units!

Show
John Parejko added a comment - Please file a new ticket for those. I'm not sure how to handle that, but it does strongly suggest that we need to fix modelfit to output units!
Hide
Lauren MacArthur added a comment -

See DM-16068.

Show
Lauren MacArthur added a comment - See DM-16068 .

#### People

Assignee:
John Parejko
Reporter:
Hsin-Fang Chiang
Reviewers:
John Parejko
Watchers:
Eli Rykoff, Hsin-Fang Chiang, Jim Bosch, John Parejko, Lauren MacArthur, Russell Owen, Yusra AlSayyad
0 Vote for this issue
Watchers:
7 Start watching this issue

#### Dates

Created:
Updated:
Resolved:

#### Jenkins Builds

No builds found.