Data Management / DM-24939

Remember which dataset types are in various collections


    Details

    • Story Points:
      1
    • Epic Link:
    • Team:
      Data Release Production
    • Urgent?:
      No

      Description

      We may be able to optimize dataset lookups by remembering which dataset types are present in various collections.

      This could happen down in daf_butler, for every collection, via some kind of materialized summary view of the dataset_collection_ tables (which could be managed either by Python code or by database support for materialized views). It could also happen by using CHAINED collections as the "user-visible" collections created by the ingest and conversion scripts, because CHAINED collections already permit the dataset types looked up in each child collection to be restricted. The former is more work but seems cleaner long-term (there is some potential for surprising behavior if we rely only on the latter), and we may want to do the latter as well anyway.
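      The materialized-summary idea can be illustrated with a minimal sketch. The table and function names below are hypothetical stand-ins, not the actual daf_butler schema: a small join table records which dataset types are in which collection, a summary table is kept up to date on insert, and dataset-type lookups consult the (much smaller) summary to prune the collections they actually have to query.

      ```python
      import sqlite3

      # Hypothetical, much-simplified stand-in for the daf_butler tables;
      # real table names and schemas differ.
      conn = sqlite3.connect(":memory:")
      conn.executescript("""
          -- which datasets (of which type) belong to which collection
          CREATE TABLE dataset_collection (
              dataset_id   INTEGER,
              dataset_type TEXT,
              collection   TEXT
          );
          -- materialized summary: dataset types known to appear in a collection
          CREATE TABLE collection_summary (
              collection   TEXT,
              dataset_type TEXT,
              UNIQUE (collection, dataset_type)
          );
      """)

      def insert_dataset(dataset_id, dataset_type, collection):
          """Insert a dataset and keep the summary table up to date."""
          conn.execute(
              "INSERT INTO dataset_collection VALUES (?, ?, ?)",
              (dataset_id, dataset_type, collection),
          )
          conn.execute(
              "INSERT OR IGNORE INTO collection_summary VALUES (?, ?)",
              (collection, dataset_type),
          )

      def collections_to_search(dataset_type, collections):
          """Prune the search path using the summary instead of probing
          the full dataset_collection table for every collection."""
          rows = conn.execute(
              "SELECT collection FROM collection_summary WHERE dataset_type = ?",
              (dataset_type,),
          ).fetchall()
          present = {r[0] for r in rows}
          return [c for c in collections if c in present]

      insert_dataset(1, "raw", "ingest/run1")
      insert_dataset(2, "calexp", "processing/run1")
      # only "processing/run1" needs to be queried for a calexp lookup
      print(collections_to_search("calexp", ["ingest/run1", "processing/run1"]))
      ```

      The same pruning could instead be expressed via CHAINED collections that restrict dataset types per child, but a summary table keeps the optimization independent of how users structure their collection chains.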

      This would probably require schema changes, but probably only additions, and it may be doable in a way that allows the old schema and new schema to both be supported via configuration changes.

        Attachments

          Issue Links

            Activity

            jbosch Jim Bosch added a comment -

            I am taking a big step towards this on DM-27251, and I think I have a clear plan. I hope to make the schema changes that would support this functionality (at least) on the DM-27033 integration branch prior to the middleware stable release, but I need to finish the real goal of DM-26692 first.

            Because I'm now also planning to remember the "governor" dimensions of the data IDs in each collection, doing this ticket will then be a step towards DM-27153.

            jbosch Jim Bosch added a comment -

            I'm using this ticket's branch for just one (big) piece of the system it describes: I've added tables that track the dataset types and governor dimension values (e.g. instruments) present in each RUN, TAGGED, or CALIBRATION collection, as well as code to populate those tables on insert.

            I'm deferring the code to actually use those summaries to inform query generation until after other schema-change tickets are done, and the same is true of updating the summaries when datasets are deleted.  In fact, rigorously keeping the summary tables updated in the presence of deletes looks really hard, and I don't plan to try, because I don't think we need it - we'll define the summary tables as representing what may be in a collection, and because it'll be rare to delete datasets in such a way that changes the summary, we won't lose much in terms of writing intelligent queries.  I'm planning to add an interface later for explicitly recalculating the summary for a given collection.
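            The "what may be in a collection" semantics can be sketched as follows. This is a hypothetical illustration (the function and table names are not daf_butler's): deletes deliberately leave the summary untouched, so it remains an upper bound that is always safe for query pruning, and an explicit refresh rebuilds it from the ground truth when exactness matters.

            ```python
            import sqlite3

            # Hypothetical sketch of the "may be in" semantics; names are
            # illustrative, not the actual daf_butler schema or API.
            conn = sqlite3.connect(":memory:")
            conn.executescript("""
                CREATE TABLE dataset_collection (
                    dataset_id INTEGER, dataset_type TEXT, collection TEXT);
                CREATE TABLE collection_summary (
                    collection TEXT, dataset_type TEXT,
                    UNIQUE (collection, dataset_type));
            """)

            def insert_dataset(dataset_id, dataset_type, collection):
                conn.execute("INSERT INTO dataset_collection VALUES (?, ?, ?)",
                             (dataset_id, dataset_type, collection))
                conn.execute("INSERT OR IGNORE INTO collection_summary VALUES (?, ?)",
                             (collection, dataset_type))

            def delete_dataset(dataset_id):
                # Deliberately does NOT touch collection_summary: the summary is
                # an upper bound on the collection's contents, so a stale entry
                # is harmless for pruning (it only costs one wasted lookup).
                conn.execute("DELETE FROM dataset_collection WHERE dataset_id = ?",
                             (dataset_id,))

            def refresh_summary(collection):
                # The explicit recalculation interface: rebuild one collection's
                # summary from the ground-truth membership table.
                conn.execute("DELETE FROM collection_summary WHERE collection = ?",
                             (collection,))
                conn.execute("""
                    INSERT OR IGNORE INTO collection_summary
                    SELECT DISTINCT collection, dataset_type
                    FROM dataset_collection WHERE collection = ?""", (collection,))

            def summary(collection):
                return {r[0] for r in conn.execute(
                    "SELECT dataset_type FROM collection_summary WHERE collection = ?",
                    (collection,))}

            insert_dataset(1, "raw", "run1")
            insert_dataset(2, "calexp", "run1")
            delete_dataset(2)
            print(summary("run1"))   # still lists "calexp": a superset, not exact
            refresh_summary("run1")
            print(summary("run1"))   # exact again after the explicit refresh
            ```

            Keeping the summary as a superset avoids the hard problem of transactionally deciding, on every delete, whether the last dataset of a given type just left the collection.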

            Right now all changes are in daf_butler, and in one medium-size commit.  I think it's likely to stay that way, but Jenkins will tell.

            jbosch Jim Bosch added a comment -

            Mostly a note to self: Jenkins has passed with just the daf_butler branch, with this sitting on top of DM-27390 but not DM-27397 or DM-24660.

            tjenness Tim Jenness added a comment -

            Looks ok. One comment on PR.


              People

              Assignee:
              jbosch Jim Bosch
              Reporter:
              jbosch Jim Bosch
              Reviewers:
              Tim Jenness
              Watchers:
              Jim Bosch, Tim Jenness
              Votes:
              0

                Dates

                Created:
                Updated:
                Resolved:

                  CI Builds

                  No builds found.