Details
-
Type:
RFC
-
Status: Implemented
-
Resolution: Done
-
Component/s: DM
-
Labels:None
Description
We are making a column name consistency push for w_2021_40 with an eye on what users will see in the Parquet Tables with DP0.2. There is a prototype implementation on ticket DM-31825.
Two major users:
- DAX for ingest into qserv and
- science users loading columns into notebooks for analysis.
Tables in use currently are Source and Object which require an RFC.
sourceTable_visit in particular is input to jointcal and FGCM which will require modifications.
In your comments, save your fingers from typing: "I think everything should be consistent." We all agree on that! What is under debate here is, where names are not
consistent, which should yield. What is on offer here is an iteration of of edits to our Parquet specification yaml files,
transform* tasks, any tasks in pipe_tasks/postprocess.py. I am not offering: to edit all our
afw_table source catalogs and plugin names and fields, to change implementation on how fields (such as ccdVisitId) are computed.
Proposal:
Source:
- Start all columns lowercase
- ixxPsf -> ixxPSF (and friends) to match DIASourceTable specification.
Object:
- Galaxy fluxes:
- remove ugStd, grStd, riStd, izStd, zyStd. We've already argued that nanojansky fluxes are the way to go. Magdiff cols have redundant and less information than the model fluxes.
- direct users to use `g_CModelFlux`/`g_GaapFlux_1_0` or use a signifier alias that folks will look for: `g_ModelFlux`.
- Add an underscore between band and column name, and if the column is in Source or ForcedSource, make it start lower case.
e.g. (g_ra, g_psfFlux) instead of current (gRa, gPsFlux). Benefit is that you can reuse your code snippets between ForcedSource/Source and Object tables by just prepending `g_` instead of messing around with capitals. - Remove all apCorr columns.
- Add more GaaP apertures (Arun responsible for ticket)
- Add more aperture flux apertures
- There are a lot of other changes that are not consistency/naming issues, but content issues and are not enumerated here
All:
- psFlux --> psfFlux. Everyone is in the habit of calling it psfFlux. Multifit is dead. Let's not fight expectation here.
- One boolean column per flag in all Parquet tables (Leave bitpacked flags in the APDB).
- Set the DataFrame index to the primary key which is not replicated as a column.
e.g. to get objectId from the Object table, you object.index rather than object['objectId'] unless you do object.reset_index() first - Replace filterName with band and physicalFilter.
- (visit, detector, tract, patch`) instead of ({{visitId, detectorId, tractId, patchId)
- Do not include base_PixelFlags_flag. No one knows what this is, and I've even seen it trick savvy pipelines folks.
Spoiler alert: it is True if the plugin that sets the PixelFlags fails. Too meta
Major changes not in the basic proposal because they would be tougher to get done in a week, but I'd push for if I hear overwhelming enthusiasm:
- Proposal to rename the CcdVisit table to DetectorVisit with a primary key of detectorVisit.
- Proposal to change all column names to snake_case.
For more info on the basic proposal (without changing all cols to snake_case the CcdVisit Table to DetectorVisit Table)
I strongly support most of these changes!
However, there are a couple modifications I would like to propose.
1) The bullet "direct users to use g_CModelFlux/g_GaapFlux_1_0 or use a signifier alias that folks will look for: g_ModelFlux" I hope is supposed to be "g_cModelFlux/g_gaapFlux_1_0 and g_modelFlux" to match the leading lower-case style proposed?
2) Unless the proposal "and if the column is in Source or ForcedSource, make it start lower case" means that some of the columns start lower-case (if in Source) and some upper-case (if not in Source)? I don't support this. I think that they should all be band_lowerCase for improved consistency (because a priori how do we know which are in Source and ForceSource without doing a separate lookup?). Ah! Looking at the before/after column list I think that the proposal is correct, but the wording in the RFC is confusing.
3) "Set the DataFrame index to the primary key which is not replicated as a column" which I read to remove the objectId column. I do not love this, as it ties us to a particular implementation of reading parquet files (via pandas). However, the unique id is still persisted as the id column if you read via another method, so these are still usable. On the other hand, I would prefer something clearly named like objectId even at the cost of redundancy.