Uploaded image for project: 'Request For Comments'
  1. Request For Comments
  2. RFC-808

Set sentinel values for non-floating point columns for missing bands in Object tables

    XMLWordPrintable

    Details

    • Type: RFC
    • Status: Implemented
    • Resolution: Done
    • Component/s: DM
    • Labels:
      None

      Description

      In TransformObjectCatalogTask we put together object tables from different bands into the same table, with the band prefix as decided in RFC-807 (e.g. g_cModelFlux, etc.). For some tracts (particularly at the edges of a survey) there may be missing bands in a given processing. In this case we fill the missing values in the transformed table with NaN values with the following code:

      dfDict[filt] = pd.DataFrame().reindex_like(templateDf)
      

      (from https://github.com/lsst/pipe_tasks/blob/c9d3f519c031c50fc06a9b7c8a75bea172e393c7/python/lsst/pipe/tasks/postprocess.py#L863 )

      While this line has the intended effect of creating a new dataframe filled with NaN s and the same index as the reference band, it has the unfortunate side-effect of changing the datatype of all columns to float64 (including float32, all integer columns, and all flag columns). This then creates a problem when trying to put together object tables from different tracts, because the schema is different.

      In this RFC I propose a relatively simple modification to solve this problem for object tables while maintaining schema consistency as well as full numpy/astropy compatibility.

      • A copy of the template dataframe will be made to maintain indexes.
      • All floating-point columns will be filled with a configurable value, default to NaN.
      • All signed integer columns will be filled with a configurable value, default to -1.
      • All unsigned integer columns will be filled with a configurable value, default to 0. We do not allow unsigned integers in our Parquet tables.
      • Most flag columns will be filled with True, as most flags are "bad" flags that signify a failure of some measurement (and these blank columns are all essentially failures).
      • A specific list of "good" flag columns (goodFlags) will be specified that will default to False. Currently, in our output schema this list is ['calib_astrometry_used', 'calib_photometry_reserved', 'calib_photometry_used', 'calib_psf_candidate', 'calib_psf_used']. Not all of these columns must be present in the object table. In this way, users will never accidentally select these blank objects when looking at which objects were used for psf estimation, for example.

      An implementation of this straw-person proposal is here: https://github.com/lsst/pipe_tasks/commit/d560b2deaa52671537cb80230c648da02ad0b24b

      Although it would in principle be possible to use something like pandas.NA to fill the missing values, this has some significant drawbacks. In particular, it requires changing the datatype of our columns from, e.g., np.int32 to pandas.Int32Dtype and bool to pandas.BooleanDtype. This would therefore require us to transform the datatypes of many of our columns, even in the case that we have complete coverage. Furthermore, these datatypes no longer round-trip to and from numpy/astropy/afw columns, thus significantly complicating many analyses.

        Attachments

          Issue Links

            Activity

            Hide
            erykoff Eli Rykoff added a comment -

            A bit of follow-up that I should have checked originally. There is currently only 1 integer column in the per-band part of the object tables, {band}_inputCount. Presumably this could actually be set to zero and not a sentinel value (though would have to be special cased). However, anything negative would also be useful for anybody quickly checking if an object has {band}_inputCount > 0 so it would pass this obvious check.

            The main question is what to do with the flag columns, and specifically for the multi-band Object tables. I would prefer that we not break compatibility with numpy and afw as I have stressed above, which would rule out tri-state True/False/NA (pandas) or True/False/None (arrow) booleans. Changing to an integer type also breaks typical True/False flag checks and requires explicit value tests (and would also not translate back to afw schemas).

            Show
            erykoff Eli Rykoff added a comment - A bit of follow-up that I should have checked originally. There is currently only 1 integer column in the per-band part of the object tables, { band}_inputCount . Presumably this could actually be set to zero and not a sentinel value (though would have to be special cased). However, anything negative would also be useful for anybody quickly checking if an object has { band}_inputCount > 0 so it would pass this obvious check. The main question is what to do with the flag columns, and specifically for the multi-band Object tables. I would prefer that we not break compatibility with numpy and afw as I have stressed above, which would rule out tri-state True/False/NA (pandas) or True/False/None (arrow) booleans. Changing to an integer type also breaks typical True/False flag checks and requires explicit value tests (and would also not translate back to afw schemas).
            Hide
            lauren Lauren MacArthur added a comment -

            As one of the early “got bit by this” users, I should probably weigh in here (but note that I have very little intuition for any/all of the potential unintended consequences down the line…)  I first posted the issues I encountered on Slack, which I think is just one example of many cases where this is a serious issue that we must deal with as best we can ASAP (read: I feel like this is a clear case where we can’t let a — perhaps ellisuve — goal of identifying a “perfect” solution get in the way of a “much less horrible” one). It seems the only real sticking point to Eli’s proposal at this stage is how to handle the “goodFlags” and any potential for confusion therein (perhaps a lesson learned here is to never allow for a mixture of bad & good flags in a given context!) While I totally agree with Jim about the fragility of maintaining a list of good flags, I can’t think of any other solution, so I am inclined to suggest we do it and, as part of this RFC, add an explicit “(further) addition of “good” flags is explicitly forbidden” clause to our developer guides. (I realize this isn’t exactly foolproof…)

            There was some follow-up to the above thread here and, in particular, I wanted to post a reminder of a proposal already in the works from Colin Slater on DM-25926: https://github.com/lsst/LDM-153/pull/4/files (if just to make sure these propasals jive with each other).

            Show
            lauren Lauren MacArthur added a comment - As one of the early “got bit by this” users, I should probably weigh in here (but note that I have very little intuition for any/all of the potential unintended consequences down the line…)  I first posted the issues I encountered on Slack , which I think is just one example of many cases where this is a serious issue that we must deal with as best we can ASAP (read: I feel like this is a clear case where we can’t let a — perhaps ellisuve — goal of identifying a “perfect” solution get in the way of a “much less horrible” one). It seems the only real sticking point to Eli’s proposal at this stage is how to handle the “goodFlags” and any potential for confusion therein (perhaps a lesson learned here is to never allow for a mixture of bad & good flags in a given context!) While I totally agree with Jim about the fragility of maintaining a list of good flags, I can’t think of any other solution, so I am inclined to suggest we do it and, as part of this RFC, add an explicit “(further) addition of “good” flags is explicitly forbidden” clause to our developer guides. (I realize this isn’t exactly foolproof…) There was some follow-up to the above thread here and, in particular, I wanted to post a reminder of a proposal already in the works from Colin Slater on DM-25926 : https://github.com/lsst/LDM-153/pull/4/files (if just to make sure these propasals jive with each other).
            Hide
            erykoff Eli Rykoff added a comment -

            Thank you Lauren MacArthur for pointing out that proposal in the works! One key thing that stands out is that we are planning to explicitly disallow unsigned integers from our Parquet files. (And we don't currently have any). So I can trivially pull that from this RFC.

            Other than that, I don't know if we have consensus on what to do about the flags. I think Jim Bosch is in the "preserve nulls in the Parquet tables" camp, but that would be a non-trivial change.

            Show
            erykoff Eli Rykoff added a comment - Thank you Lauren MacArthur for pointing out that proposal in the works! One key thing that stands out is that we are planning to explicitly disallow unsigned integers from our Parquet files. (And we don't currently have any). So I can trivially pull that from this RFC. Other than that, I don't know if we have consensus on what to do about the flags. I think Jim Bosch is in the "preserve nulls in the Parquet tables" camp, but that would be a non-trivial change.
            Hide
            jbosch Jim Bosch added a comment -

            I would indeed prefer to preserve null is the parquet, but I don't want that to be the enemy of the good enough for now. Please proceed with the "good flags" approach, and we can revisit this on DM-25926.

            Show
            jbosch Jim Bosch added a comment - I would indeed prefer to preserve null is the parquet, but I don't want that to be the enemy of the good enough for now. Please proceed with the "good flags" approach, and we can revisit this on DM-25926 .
            Hide
            erykoff Eli Rykoff added a comment -

            Adopting this for signed integer/flag columns, to be implemented on DM-32198. More comprehensive null value handling will be determined as part of DM-25926.

            Show
            erykoff Eli Rykoff added a comment - Adopting this for signed integer/flag columns, to be implemented on DM-32198 . More comprehensive null value handling will be determined as part of DM-25926 .

              People

              Assignee:
              erykoff Eli Rykoff
              Reporter:
              erykoff Eli Rykoff
              Watchers:
              Colin Slater, Eli Rykoff, Jim Bosch, John Parejko, Kian-Tat Lim, Krzysztof Findeisen, Lauren MacArthur
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:
                Planned End:

                  Jenkins

                  No builds found.