RFC-672: Include cat (as sdm_schemas) in lsst_distrib

    Details

    • Type: RFC
    • Status: Implemented
    • Resolution: Done
    • Component/s: DM
    • Labels:
      None

      Description

      cat (https://github.com/lsst/cat) is a very light package that contains the schema definitions of the catalog data products. It specifies (or will specify) the Science Data Model (SDM) of the LSST data release catalogs. The definitions are in the felis (https://felis.lsst.io) format. The idea is that other LSST components, such as database systems (Qserv), TAP services, and validation tools, can consult the cat package to learn the expected schemas of the catalogs.

      We expect more schemas to be added to cat, and eventually it will cover all DPDD catalogs. As of today, there is only one table schema that the Science Pipelines team might care about: cat/yml/hsc.yaml. This YAML file describes the Object table that is generated by the Science Pipelines (writeObjectTable.py/transformObjectCatalog.py/consolidateObjectTable.py) and that will be used in Qserv catalog loading. Currently, ci_hsc_gen2 performs a minimal check that the schema column names in the pipeline outputs match the cat definitions. Anyone who wants to merge a pipeline ticket that changes the column names will also need to update cat/yml/hsc.yaml accordingly.
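
As a rough illustration of the kind of column-name check described above (this is a sketch, not the actual ci_hsc_gen2 code), the comparison could look something like the following. It assumes the YAML file has already been parsed into a dictionary with a top-level `tables` list whose entries carry a `name` and a list of `columns`; the function names and the example column names are hypothetical.

```python
def expected_columns(schema, table_name):
    """Return the set of column names declared for one table in a parsed schema."""
    for table in schema.get("tables", []):
        if table.get("name") == table_name:
            return {column["name"] for column in table.get("columns", [])}
    raise KeyError(f"table {table_name!r} not found in schema")


def compare_columns(schema, table_name, actual_columns):
    """Report columns missing from, or unexpected in, the pipeline output."""
    expected = expected_columns(schema, table_name)
    actual = set(actual_columns)
    return {
        "missing": sorted(expected - actual),
        "unexpected": sorted(actual - expected),
    }


# Hypothetical stand-in for the parsed contents of cat/yml/hsc.yaml.
example_schema = {
    "name": "hsc",
    "tables": [
        {
            "name": "Object",
            "columns": [{"name": "objectId"}, {"name": "ra"}, {"name": "decl"}],
        }
    ],
}

# Hypothetical column names read back from a pipeline-generated Object table.
result = compare_columns(example_schema, "Object", ["objectId", "ra", "dec"])
print(result)
```

In this made-up example the pipeline output has renamed `decl` to `dec`, so the check reports one missing and one unexpected column; a CI test would fail until the schema file is updated to match.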

      Because the schema definition is tied to the upstream pipelines, it would be useful for cat to be versioned and to receive weekly tags like other Science Pipelines packages. This would also make ci_hsc_gen2 easier to use, because ci_hsc_gen2 is versioned but its dependency, cat, is not. Including cat in lsst_distrib ensures they are tagged consistently.

      (added) Following discussion in other fora, it is proposed that as part of this RFC the repository be renamed to sdm_schemas, with appropriate documentation in the repository and elsewhere that ensures that users understand what "SDM" signifies (Science Data Model) and what the role of this repo is in that regard.

            Activity

            hchiang2 Hsin-Fang Chiang created issue -
            hchiang2 Hsin-Fang Chiang made changes -
            Link: This issue is triggering DM-23529
            tjenness Tim Jenness made changes -
            Description: "s/he" replaced with "they"; text otherwise unchanged.
            Parejkoj John Parejko added a comment -

            Why is it `hsc.yml` and not `objectTable.yml`? Is it only used for HSC?

            swinbank John Swinbank added a comment -

            It seems like a good idea to version this along with Science Pipelines code. It's not obvious that means it has to go in lsst_distrib, though — we have other repositories that receive tags but aren't part of the distribution (https://github.com/lsst/lsst, for example).

            If this does belong in lsst_distrib (or even if it doesn't, but it should be a prerequisite for inclusion), we do need a bit more documentation about what we're looking at. John already asked about the filename, but beyond that some explanation of the process or references to the ingestion procedure would be very helpful.

            hchiang2 Hsin-Fang Chiang added a comment -

            Right now it's only used for HSC, and we don't yet know whether it would be useful for other cameras as-is. It's not objectTable.yml because we want to add other table definitions to it in the future.
            However, these names can change as the pipelines, the design, and our understanding evolve. If we later realize that all Stack-generated catalogs can follow the same schema, we will name it differently. Hopefully tagging it doesn't mean it cannot change.

            hchiang2 Hsin-Fang Chiang added a comment -

            If it can be tagged without being added to lsst_distrib, that's fine with me too.

            I can add more documentation to the package README or the dev guide. I think we want a single source of truth where different teams can look up the schemas. Most of the processes have yet to be formed, but we wanted to start having a tagged repo while the details are taking shape.

            John Swinbank, since "ingestion" is such an overloaded term, which "ingestion" are you thinking of in this case? Not Qserv ingest, right?

            hchiang2 Hsin-Fang Chiang made changes -
            Description: added the sentence "This specifies (or will specify) the Science Data Model (SDM) of the LSST data release catalogs."; text otherwise unchanged.
            hchiang2 Hsin-Fang Chiang added a comment -

            Discussions on Slack raised some doubts about the package name "cat".

            We'd like to rename "cat" to "sdm_schemas", as this package contains the catalog schema part of the LSST Science Data Model (SDM).

            hchiang2 Hsin-Fang Chiang made changes -
            Summary: "Include cat in lsst_distrib" → "Include cat (sdm_schemas) in lsst_distrib"
            swinbank John Swinbank added a comment -

            Quoting Hsin-Fang Chiang: John Swinbank, since "ingestion" is such an overloaded term, which "ingestion" are you thinking of in this case? Not Qserv ingest, right?

            Qserv ingest was on my mind, but don't make this request too specific.

            The uneducated user — certainly including me — looking at this package sees a bunch of YAML with no very clear idea of what it is or what it's for. Before this gets added to lsst_distrib, there needs to be a clear description of what it is we're actually looking at and why we care.

            gpdf Gregory Dubois-Felsmann made changes -
            Description: appended the paragraph beginning "(added) Following discussion in other fora"; text otherwise unchanged.
            gpdf Gregory Dubois-Felsmann made changes -
            Summary: "Include cat (sdm_schemas) in lsst_distrib" → "Include cat (as sdm_schemas) in lsst_distrib"
            hchiang2 Hsin-Fang Chiang added a comment -

            I've created DM-23614 for adding more documentation to the cat package and it blocks DM-23529. The proposal here is still to add cat, as its new name sdm_schemas, to lsst_distrib (not just tagging it).

            I'm delaying the RFC end date to give us more time to discuss this.

            hchiang2 Hsin-Fang Chiang made changes -
            Link: This issue is triggering DM-23614
            hchiang2 Hsin-Fang Chiang made changes -
            Planned End: 26/Feb/20 7:00 PM → 28/Feb/20 7:00 PM
            tjenness Tim Jenness added a comment -

            Not entirely sure why this is the case but at the moment ci_hsc_gen2 depends on the cat package. Is that a real dependency?

            hchiang2 Hsin-Fang Chiang added a comment -

            The pipeline writes the object table and it's not trivial to know its schema without looking at the outputs. In ci_hsc_gen2 we want to check if the schema of the pipeline-generated object table matches the one we think it has, so we read cat/yml/hsc.yaml (the expected schema) and verify that.

            hchiang2 Hsin-Fang Chiang added a comment -

            As there have been no new comments or objections, I'm adopting this.

            We will add more documentation to the repository (DM-23614), and afterwards we will add the package to lsst_distrib as sdm_schemas (DM-23529).

            hchiang2 Hsin-Fang Chiang made changes -
            Status: Proposed → Adopted
            gcomoretto Gabriele Comoretto [X] (Inactive) made changes -
            Remote Link This issue links to "Page (Confluence)" [ 23395 ]
            hchiang2 Hsin-Fang Chiang made changes -
            Resolution: Done
            Status: Adopted → Implemented
            gcomoretto Gabriele Comoretto [X] (Inactive) made changes -
            Remote Link This issue links to "Page (Confluence)" [ 25155 ]

              People

              Assignee:
              hchiang2 Hsin-Fang Chiang
              Reporter:
              hchiang2 Hsin-Fang Chiang
              Watchers:
              Colin Slater, Gregory Dubois-Felsmann, Hsin-Fang Chiang, John Parejko, John Swinbank, Tim Jenness
              Votes:
              1