Details
- Type: RFC
- Status: Implemented
- Resolution: Done
- Component/s: DM
- Labels: None
Description
In TransformObjectCatalogTask we combine object tables from different bands into a single table, with each column carrying the band prefix decided in RFC-807 (e.g. g_cModelFlux). For some tracts (particularly at the edges of a survey) a given processing run may be missing one or more bands. In this case we fill the missing values in the transformed table with NaN using the following code:
dfDict[filt] = pd.DataFrame().reindex_like(templateDf)
While this line has the intended effect of creating a new dataframe filled with NaNs and sharing the index of the reference band, it has the unfortunate side effect of changing the datatype of every column to float64 (including float32 columns, all integer columns, and all flag columns). This creates a problem when combining object tables from different tracts, because the schema then differs from tract to tract.
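The side effect described above can be reproduced with a small stand-in table. This is only an illustrative sketch: the column names and values are hypothetical and are not taken from the real object schema.

```python
import numpy as np
import pandas as pd

# Hypothetical template table standing in for the reference band.
templateDf = pd.DataFrame(
    {
        "cModelFlux": np.array([1.0, 2.0], dtype=np.float32),
        "inputCount": np.array([3, 4], dtype=np.int32),
        "cModel_flag": np.array([False, True]),
    },
    index=pd.Index([10, 11], name="objectId"),
)

# The line quoted above: an empty frame reindexed to match the template.
blankDf = pd.DataFrame().reindex_like(templateDf)

# The index is preserved and every value is NaN, but the original
# float32 / int32 / bool dtypes are not carried over.
print(templateDf.dtypes)
print(blankDf.dtypes)
```

Comparing the two printed dtype listings shows why tables from tracts with and without missing bands end up with different schemas.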
In this RFC I propose a relatively simple modification to solve this problem for object tables while maintaining schema consistency as well as full numpy/astropy compatibility.
- A copy of the template dataframe will be made to maintain indexes.
- All floating-point columns will be filled with a configurable value, defaulting to NaN.
- All signed integer columns will be filled with a configurable value, defaulting to -1.
- All unsigned integer columns would be filled with a configurable value, defaulting to 0; however, we do not allow unsigned integers in our Parquet tables, so this case does not arise.
- Most flag columns will be filled with True, as most flags are "bad" flags that signify a failure of some measurement (and these blank columns are all essentially failures).
- A specific list of "good" flag columns (goodFlags) will be specified that will default to False. Currently, in our output schema this list is ['calib_astrometry_used', 'calib_photometry_reserved', 'calib_photometry_used', 'calib_psf_candidate', 'calib_psf_used']. Not all of these columns must be present in the object table. In this way, users will never accidentally select these blank objects when looking at which objects were used for psf estimation, for example.
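The rules in the list above could be sketched roughly as follows. This is not the linked implementation: the function name make_blank_band_table and the parameter names floatFillValue and integerFillValue are hypothetical, chosen only for illustration, and the example table is made up.

```python
import numpy as np
import pandas as pd
from pandas.api import types as pdtypes


def make_blank_band_table(templateDf, goodFlags=(), floatFillValue=np.nan,
                          integerFillValue=-1):
    """Hypothetical helper: blank copy of templateDf that keeps index and dtypes."""
    blankDf = templateDf.copy()  # the copy preserves the index
    for col in blankDf.columns:
        dtype = blankDf[col].dtype
        if pdtypes.is_bool_dtype(dtype):
            # "Bad" flags default to True; listed "good" flags default to False.
            fill = col not in goodFlags
        elif pdtypes.is_float_dtype(dtype):
            fill = floatFillValue
        elif pdtypes.is_signed_integer_dtype(dtype):
            fill = integerFillValue
        else:
            continue  # other column types are left untouched in this sketch
        # Building the array with the original dtype keeps the schema fixed.
        blankDf[col] = np.full(len(blankDf), fill, dtype=dtype)
    return blankDf


# Made-up template table for demonstration only.
templateDf = pd.DataFrame(
    {
        "cModelFlux": np.array([1.0, 2.0], dtype=np.float32),
        "inputCount": np.array([3, 4], dtype=np.int32),
        "cModel_flag": np.array([False, False]),
        "calib_psf_used": np.array([True, False]),
    },
    index=pd.Index([10, 11], name="objectId"),
)
blankDf = make_blank_band_table(templateDf, goodFlags=["calib_psf_used"])
print(blankDf.dtypes)
```

Because each replacement array is created with the column's original dtype, the blank table has the same schema as a fully populated one.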
An implementation of this straw-person proposal is here: https://github.com/lsst/pipe_tasks/commit/d560b2deaa52671537cb80230c648da02ad0b24b
Although it would in principle be possible to use something like pandas.NA to fill the missing values, this has some significant drawbacks. In particular, it requires changing the datatype of our columns from, e.g., np.int32 to pandas.Int32Dtype and bool to pandas.BooleanDtype. This would therefore require us to transform the datatypes of many of our columns, even in the case that we have complete coverage. Furthermore, these datatypes no longer round-trip to and from numpy/astropy/afw columns, thus significantly complicating many analyses.
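As a small illustration of the round-trip concern, the sketch below converts an np.int32 column to pandas.Int32Dtype and inserts a missing value. The data are made up, and the exact NumPy dtype returned by to_numpy() depends on the pandas version, so no specific output is claimed here.

```python
import numpy as np
import pandas as pd

# Made-up integer column with a plain NumPy dtype.
counts = pd.Series(np.array([3, 4], dtype=np.int32))

# Using pandas.NA requires switching to the nullable extension dtype.
nullable = counts.astype(pd.Int32Dtype())
nullable.iloc[0] = pd.NA

print(nullable.dtype)             # a pandas extension dtype, not a NumPy dtype
print(nullable.to_numpy().dtype)  # version dependent, but no longer int32
```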
Here is a bit of follow-up that I should have checked originally. There is currently only one integer column in the per-band part of the object tables, {band}_inputCount. Presumably this could actually be set to zero rather than a sentinel value (though it would have to be special-cased). However, any negative value would work equally well for anybody quickly checking whether an object has {band}_inputCount > 0, since blank objects would correctly fail that obvious check.
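A tiny sketch of that check, using made-up values for a single band:

```python
import numpy as np

# Made-up g_inputCount values: observed, zero inputs, blank-band sentinel.
g_inputCount = np.array([5, 0, -1], dtype=np.int32)

# Both a zero and a negative sentinel are excluded by the obvious test.
hasInputs = g_inputCount > 0
print(hasInputs)
```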
The main question is what to do with the flag columns, and specifically for the multi-band Object tables. I would prefer that we not break compatibility with numpy and afw as I have stressed above, which would rule out tri-state True/False/NA (pandas) or True/False/None (arrow) booleans. Changing to an integer type also breaks typical True/False flag checks and requires explicit value tests (and would also not translate back to afw schemas).
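To illustrate the point about integer-typed flags, the sketch below uses made-up arrays to compare a typical flag selection on a bool column with the same expression on an integer column.

```python
import numpy as np

values = np.array([10.0, 20.0])
boolFlag = np.array([False, True])   # usual boolean "bad" flag
intFlag = boolFlag.astype(np.int8)   # same flag stored as an integer

# With bool, ~ is a logical NOT and gives a mask of unflagged rows.
unflaggedBool = values[~boolFlag]

# With an integer, ~ is a bitwise NOT (0 -> -1, 1 -> -2), so the result
# is treated as a list of positions rather than a mask.
unflaggedInt = values[~intFlag]

# An integer flag needs an explicit value test instead.
unflaggedExplicit = values[intFlag == 0]

print(unflaggedBool, unflaggedInt, unflaggedExplicit)
```

The bool and explicit-test selections agree, while the bitwise form on the integer column silently returns different rows.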