Data Management / DM-25180

Add ingest time to registry


    Details

    • Story Points:
      1
    • Team:
      Architecture
    • Urgent?:
      No

      Description

It would be very helpful for the OODS if butler recorded ingest time as well as observation time. Storing ingest time and allowing queries such as "give me all datasets in this collection that were stored more than N days ago" would significantly streamline OODS data expiry: the returned refs could be passed straight to pruneDatasets. Currently the OODS works by going behind the Gen2 butler's back and deleting files directly from the datastore without involving butler.
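
For concreteness, a minimal sketch of the expiry flow this would enable. The repo path, dataset type, and collection name are placeholders; ref.ingest_date assumes query results would expose the new column (the comments below defer where-expression support to a later ticket, so filtering is done client-side here), and the pruneDatasets flags are illustrative:

```python
from datetime import datetime, timedelta, timezone

from lsst.daf.butler import Butler

# Placeholder repo path and collection names.
butler = Butler("/repo/oods")
cutoff = datetime.now(timezone.utc) - timedelta(days=30)

# Find everything ingested more than 30 days ago. Filtering client-side
# because where-expression support for ingest_date is future work;
# ref.ingest_date is an assumed attribute, not an existing API.
expired = [
    ref
    for ref in butler.registry.queryDatasets("raw", collections=["OODS/raw"])
    if ref.ingest_date < cutoff
]

# Hand the expired refs straight to pruneDatasets instead of deleting
# files from the datastore behind the butler's back.
butler.pruneDatasets(expired, unstore=True, purge=True)
```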


            Activity

Tim Jenness added a comment:

Jim Bosch this seems like a low-hanging-fruit schema change that I should try to do.

I think we decided that this should go in registry, to allow querying. It could also go in datastore, but that would only be useful if there were some behind-the-scenes datastore job that could delete artifacts without telling registry.

What I'm not sure about is which table the date should be attached to. Should it be a new table, or should an existing table get the new column? Is it the "dataset" table (https://github.com/lsst/daf_butler/blob/master/python/lsst/daf/butler/registry/datasets/byDimensions/tables.py#L200)? Should we let the database automatically fill in the timestamp?
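
As a point of reference, a minimal sketch of a database-filled timestamp column in plain SQLAlchemy, not the actual daf_butler table-definition helpers; the column name and type are assumptions, not the final schema:

```python
import sqlalchemy

# Hedged sketch: a timestamp column whose value the database fills in
# automatically at INSERT time, so clients never have to supply it.
ingest_date = sqlalchemy.Column(
    "ingest_date",
    sqlalchemy.TIMESTAMP,
    nullable=False,
    server_default=sqlalchemy.sql.func.now(),
)
```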

Jim Bosch added a comment:

            Agree this fits best in Registry. I think the "dataset" table itself is definitely the best place for the new column. I haven't really thought about the pros/cons of having the database fill in the timestamp, and don't really have an opinion on that.

Tim Jenness added a comment:

Thinking further, I don't think we need to worry about TAI vs. UTC or nanosecond precision, so an automated timestamp added by the database should be more than adequate here, without needing an Astropy Time. The OODS and the like don't need to worry about missing a couple of files because of a TAI offset confusion.

Jim Bosch added a comment:

There may just be three kinds of times in Registry rather than two, but the begin/end timestamps associated with RUN-type collections (and someday Quanta) seem more like the ingest times you're adding now than the observation-oriented TAI times we use in dimensions and calibration collections. Nothing populates those right now, but maybe we should change their type to whatever you use here as well.

Tim Jenness added a comment:

Following discussion on Slack just now, we have decided that I will add a single automatically created timestamp field to the datasets table: ingest_date. There will not be an additional "date of first ingest" field that would propagate through database merging, since we don't want to duplicate provenance handling.

Updating queryDatasets to use this time will be for a later ticket.

Tim Jenness added a comment:

I think this trivial change does everything we need. I could also add the querying side of things, but I'd need some pointers on where in the code we could handle it. I think the main use case is in a queryDatasets WHERE expression, as sketched below.
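
For illustration, a hypothetical sketch of such an expression. This support did not exist at the time, so the ingest_date comparison and the time-literal syntax are guesses at a possible form, not the implemented grammar:

```python
from lsst.daf.butler import Butler

butler = Butler("/repo/oods")  # placeholder repo path

# Hypothetical future usage: ingest_date in a queryDatasets WHERE
# expression is exactly the feature being deferred to a later ticket.
refs = butler.registry.queryDatasets(
    "raw",
    collections=["OODS/raw"],
    where="ingest_date < T'2020-10-01 00:00:00'",
)
```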

Tim Jenness added a comment (edited):

I assume my PR should go against DM-27033.

Jim Bosch added a comment:

            Yes. I think we may have lost the previous merge commits in the last rebase, so it may be worth fixing that before you merge if we want to bother fixing it at all. Linking all of the tickets that get merged on Jira may be sufficient.

Tim Jenness added a comment:

            Jim Bosch I've rebased this ticket onto the DM-27033 integration branch (and added back the other merge commits).

Jim Bosch added a comment:

            Looks good! Sorry it took me so long to get to reviewing such a tiny change.


              People

              Assignee:
Tim Jenness
              Reporter:
Tim Jenness
              Reviewers:
              Jim Bosch
              Watchers:
              Andy Salnikov, Jim Bosch, Steve Pietrowicz, Tim Jenness

                Dates

                Created:
                Updated:
                Resolved:

                  Jenkins

                  No builds found.