Request For Comments
RFC-807

Modify column names in Source/Object DRP Parquet Tables


Details

    • RFC
    • Status: Implemented
    • Resolution: Done
    • DM

    Description

      We are making a column name consistency push for w_2021_40 with an eye on what users will see in the Parquet Tables with DP0.2. There is a prototype implementation on ticket DM-31825.

      Two major users:

      • DAX for ingest into qserv and
      • science users loading columns into notebooks for analysis.

      The tables currently in use are Source and Object, so changing them requires an RFC.
      sourceTable_visit in particular is input to jointcal and FGCM, which will require modifications.

      In your comments, save your fingers from typing "I think everything should be consistent." We all agree on that! What is under debate here is, where names are not
      consistent, which should yield. What is on offer here is an iteration of edits to our Parquet specification yaml files,
      the transform* tasks, and the tasks in pipe_tasks/postprocess.py. I am not offering to edit all our
      afw_table source catalogs, plugin names, and fields, or to change the implementation of how fields (such as ccdVisitId) are computed.

      Proposal:

      Source:

      • Start all columns lowercase
      • ixxPsf -> ixxPSF (and friends) to match DIASourceTable specification.

      Object:

      • Galaxy fluxes:
        • remove ugStd, grStd, riStd, izStd, zyStd. We've already argued that nanojansky fluxes are the way to go, and magdiff columns are redundant with, and carry less information than, the model fluxes.
        • direct users to use `g_CModelFlux`/`g_GaapFlux_1_0` or use a signifier alias that folks will look for: `g_ModelFlux`.
      • Add an underscore between the band and the column name, and start the column name lower case to match its Source/ForcedSource counterpart,
        e.g. (g_ra, g_psfFlux) instead of the current (gRa, gPsFlux). The benefit is that you can reuse your code snippets between the ForcedSource/Source and Object tables by just prepending `g_` instead of messing around with capitals.
      • Remove all apCorr columns.
      • Add more GaaP apertures (Arun responsible for ticket)
      • Add more aperture flux apertures
      • There are a lot of other changes that are not consistency/naming issues, but content issues and are not enumerated here
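The band-prefix convention proposed above can be sketched in pandas. This is an illustrative example only: the table contents are made up, and `select` is a hypothetical helper, not a pipeline API.

```python
import pandas as pd

# Hypothetical mini-tables (made-up values) illustrating the proposed
# convention: each per-band Object column is the corresponding Source
# column name with a "<band>_" prefix.
source = pd.DataFrame({"ra": [56.64], "psfFlux": [1234.5]})
obj = pd.DataFrame({"g_ra": [56.64], "g_psfFlux": [1200.0],
                    "r_ra": [56.64], "r_psfFlux": [1500.0]})

def select(table, names, band=None):
    """Reuse one list of column names across Source and Object tables."""
    prefix = f"{band}_" if band is not None else ""
    return table[[prefix + n for n in names]]

cols = ["ra", "psfFlux"]
per_visit = select(source, cols)          # Source: bare names
per_object = select(obj, cols, band="g")  # Object: same names, "g_" prepended
```

The point of the scheme is exactly this: one shared list of column names works for both tables, with only the prefix changing.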

      All:

      • psFlux --> psfFlux. Everyone is in the habit of calling it psfFlux. Multifit is dead. Let's not fight expectation here.
      • One boolean column per flag in all Parquet tables (Leave bitpacked flags in the APDB). 
      • Set the DataFrame index to the primary key, which is not replicated as a column,
        e.g. to get objectId from the Object table, you use `object.index` rather than `object['objectId']`, unless you do `object.reset_index()` first.
      • Replace filterName with band and physicalFilter.
      • (visit, detector, tract, patch) instead of (visitId, detectorId, tractId, patchId)
      • Do not include base_PixelFlags_flag. No one knows what this is, and I've even seen it trick savvy pipelines folks.
        Spoiler alert: it is True if the plugin that sets the PixelFlags fails. Too meta.
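The index-only primary key in the list above behaves like this in pandas (a sketch with made-up objectIds and coordinates):

```python
import pandas as pd

# Sketch of the index-only primary key (made-up objectIds and coordinates).
obj = pd.DataFrame({"objectId": [101, 102],
                    "ra": [56.1, 56.2]}).set_index("objectId")

obj.index                        # primary key lives in the index
# obj["objectId"]                # would raise KeyError: it is not a column
with_column = obj.reset_index()  # restore objectId as a regular column
```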

       

      Major changes not in the basic proposal because they would be tougher to get done in a week, but I'd push for if I hear overwhelming enthusiasm:

      • Proposal to rename the CcdVisit table to DetectorVisit with a primary key of detectorVisit.
      • Proposal to change all column names to snake_case.

      For more info on the basic proposal (without changing all columns to snake_case or renaming the CcdVisit table to DetectorVisit), see the attached notebook.


          Activity

            erykoff Eli Rykoff added a comment -

            I strongly support most of these changes!

            However, there are a couple of modifications I would like to propose.

            1) The bullet "direct users to use g_CModelFlux/g_GaapFlux_1_0 or use a signifier alias that folks will look for: g_ModelFlux" I hope is supposed to be "g_cModelFlux/g_gaapFlux_1_0 and g_modelFlux" to match the leading lower-case style proposed?
            2) Unless the proposal "and if the column is in Source or ForcedSource, make it start lower case" means that some of the columns start lower-case (if in Source) and some upper-case (if not in Source)? I don't support this. I think that they should all be band_lowerCase for improved consistency (because a priori how do we know which are in Source and ForcedSource without doing a separate lookup?). Ah! Looking at the before/after column list I think that the proposal is correct, but the wording in the RFC is confusing.
            3) "Set the DataFrame index to the primary key which is not replicated as a column" which I read to remove the objectId column. I do not love this, as it ties us to a particular implementation of reading parquet files (via pandas). However, the unique id is still persisted as the id column if you read via another method, so these are still usable. On the other hand, I would prefer something clearly named like objectId even at the cost of redundancy.


            krzys Krzysztof Findeisen added a comment -

            ap_association also uses DiaSource Parquet tables (goodSeeingDiff_diaSrcTable) as input. Are these covered by this RFC? At the very least I know some of the proposed "All" changes could apply.
            yusra Yusra AlSayyad added a comment - - edited

            Yes, "all" applies to the Visit, CcdVisit, DIASource, DIAObject, ForcedSource, and ForcedSourceOnDiaObjects tables. See the attached notebook (that we looked at for team meeting last week) for more info.

            But this RFC only covers the DIASource and DIAObject tables that are generated during data release processing, not alert production. This means that in order for the DRP DIASource table to be consistent with the other tables output during DRP, its schema would be inconsistent with the APDB in terms of boolean vs. integer bit-packed flags.

             

            yusra Yusra AlSayyad added a comment - - edited

             3) "Set the DataFrame index to the primary key which is not replicated as a column" which I read to remove the objectId column. I do not love this, as it ties us to a particular implementation of reading parquet files (via pandas).  However, the unique id is still persisted as the id column if you read via another method, so these are still usable. On the other hand, I would prefer something clearly named like objectId even at the cost of redundancy. 

            Look at the very end of the attached notebook for the 3 options. I'm proposing the first option. Continuing the same example: if you read the table (stored with the primary key as index-only) with pyarrow instead of the butler, the diaObjectIds are still there:

            >>> import pyarrow.parquet as pq 
            >>> pq.read_table("file:///repo/dc2/u/yusra/dc2/DM-31825/20210919T200251Z/goodSeeingDiff_diaObjTable/3828/3/goodSeeingDiff_diaObjTable_3828_3_DC2_u_yusra_dc2_DM-31825_20210919T200251Z.parq")
            pyarrow.Table
            ra: double
            decl: double
            nDiaSources: int64
            diaObjectId: int64
             

            Edit: and to the question of what df.to_csv() says, diaObjectId is there too:

            >>> df.to_csv()
            'diaObjectId,ra,decl,nDiaSources\n3299854297281331201,56.645500777573766,-37.01269473596373,53\n [snip]
            

            erykoff Eli Rykoff added a comment -

            Oh, I misunderstood. I read this as saying that you're dropping objectId in favor of id as the table index (which I don't like). Instead, what you're saying is that objectId should be made the index column. Is this correct?

            I'm not sure I understand the formatting of the notebook in this respect, though. I guess I didn't know that pandas wrote primary key columns on a different line in the header, but now that I've figured that out it makes sense.


            krzys Krzysztof Findeisen added a comment -

            But this RFC only covers the DIASource and DIAObject tables that are generated during data release processing and not alert production. This means that in order for the DRP DIASource table to be consistent with the other tables output during DRP, its schema would be inconsistent with the APDB in terms of boolean vs integer bit-packed flags

            Sorry, I'm not sure I follow. Assuming I've understood the notebook correctly, it cites ap_association as the source for the schema. Are you suggesting we split this, and have two different DiaSource table datasets that follow different schemas?
            tjenness Tim Jenness added a comment -

            This proposal looks good to me and I strongly support a ccd -> detector migration because of the obvious issue that butler calls them detectors.

            gpdf Gregory Dubois-Felsmann added a comment -

            • Set the DataFrame index to the primary key which is not replicated as a column.
              e.g. to get objectId from the Object table, you use object.index rather than object['objectId'] unless you do object.reset_index() first

            Just to make sure I'm understanding: we'll still have objectId (or whichever ID column is relevant to a particular table) in the database (as opposed to Parquet) view of these tables?

            yusra Yusra AlSayyad added a comment -

            Yes, it is a column in the database, a column in the Parquet, and a column when you write to csv (see the snippet from the comment above).

            >>> df.to_csv()
            'diaObjectId,ra,decl,nDiaSources\n3299854297281331201,56.645500777573766,-37.01269473596373,53\n [snip]
            yusra Yusra AlSayyad added a comment - - edited

            Assuming I've understood the notebook correctly, it cites ap_association as the source for the schema. Are you suggesting we split this, and have two different DiaSource table datasets that follow different schemas?

            krzys, for flags, I put an example implementation on ap_association which adds a config parameter that switches the flag style. The answer is really that I'm suggesting whatever you want me to do.

            I would especially appreciate advice on what to do with 'psFlux' -> 'psfFlux' in the Parquet DIASource Table. Should we rename it 'psfFlux' in the APDB to match? Should I leave it 'psFlux' in this parquet table for now?
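The two flag styles under discussion can be sketched like this. The flag names and bit assignments below are made up for illustration; the actual APDB bit mapping lives in the AP schemas.

```python
import pandas as pd

# Hypothetical sketch of the two flag styles: one boolean column per flag
# (proposed for the DRP Parquet tables) vs. a single bit-packed integer
# column (as in the APDB). Flag names and bit positions are invented.
flag_names = ["pixelInterpolated", "pixelSaturated", "pixelCr"]

packed = pd.Series([0b101, 0b010])  # APDB style: one int column, one bit per flag

# DRP style: unpack into one boolean column per flag.
unpacked = pd.DataFrame({name: (packed & (1 << bit)).astype(bool)
                         for bit, name in enumerate(flag_names)})
```

A config parameter switching between the two styles would essentially choose whether this unpacking step runs before the table is written.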

            yusra Yusra AlSayyad added a comment - - edited

            Because half the comments have been confusion about how DataFrame indexes get translated to csv and parquet, here's a snippet with an example. It's the skinniest table we have, to amplify any storage differences.

            To recap, I'm proposing consistency in whether the primary key column is index-only, column-only, or both in all our DataFrames. Let's cross off column-only now. Whether it's "both" or "index-only" I do not care.

            To summarize the questions I've heard so far:
            1. If we do "both", does it make the Parquet table bigger on disk? Yes, it does. See the attached example.
            2. If we do index-only, does the key column appear in the csv and Parquet tables as a column? Yes, it does. Some background: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_parquet.html — the index parameter defaults to None, which compresses the index to a range if it is the boring default 0,1,2,3..N and stores it as a column otherwise.
            3. Can I do

             parameters={"columns": ['diaObjectId', 'ra', 'decl']} 

            if it’s index-only? Yes, you can. And it gives you back the key as an index, consistent with reading in the whole parquet.

            krzys Krzysztof Findeisen added a comment - - edited

            I would especially appreciate advice on what to do with 'psFlux' -> 'psfFlux' in the Parquet DIASource Table. Should we rename it 'psfFlux' in the APDB to match? Should I leave it 'psFlux' in this parquet table for now?

            I suggest asking cmorrison, since he's the one who created both AP schemas.


            krzys Krzysztof Findeisen added a comment -

            cmorrison explained things to me, and I have no further objections. Sorry for the disruption!

            yusra Yusra AlSayyad added a comment -

            I did talk to Chris (like you suggested) about psfFlux. If I'm transcribing right, the conclusion was that we would ultimately like the APDB to migrate to psfFlux as well, but that's not going to happen this week. In the meantime, I'm going to leave the DIASource table with psFlux. It'll be inconsistent with the other Parquet DRP tables for now.

            yusra Yusra AlSayyad added a comment -

            See notebook with the outcome of DM-31825:

            https://github.com/yalsayyad/dm_notebooks/blob/master/object-table/ci_imsim_check_for_DRP_Table_and_Column.ipynb

            Outcome on the DIASource table:
            There are arguments for keeping the DIASource table similar to that in the APDB, and the assumption that diaSourceId was a column was throughout code I was unfamiliar with. Instead of risking breaking ap_association, I left this table as the only one non-conformant with respect to:

            • primary key as index-only
            • psfFlux instead of psFlux

            People

              yusra Yusra AlSayyad
              Colin Slater, Eli Rykoff, Gregory Dubois-Felsmann, John Parejko, Krzysztof Findeisen, Tim Jenness, Yusra AlSayyad
              Votes: 0
              Watchers: 7

