Fix Version/s: None
It would be very helpful for the OODS if butler recorded ingest time as well as observation time. Storing ingest time and allowing queries such as "give me all datasets in this collection that were stored more than N days ago" would significantly streamline OODS data expiry. The returned refs could immediately be passed to pruneDatasets. Currently OODS works by going behind gen2 butler's back and deleting the files directly from datastore without involving butler.
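A minimal sketch of the expiry flow this would enable. The butler calls in the comment are hypothetical (the exact query API for ingest times does not exist yet, which is the point of this ticket); the cutoff logic itself is shown with plain datetimes:

```python
from datetime import datetime, timedelta, timezone

def expired_refs(refs_with_ingest, expiry_days):
    """Return refs whose ingest time is more than expiry_days in the past.

    refs_with_ingest: iterable of (ref, ingest_datetime) pairs, with
    timezone-aware datetimes.
    """
    cutoff = datetime.now(timezone.utc) - timedelta(days=expiry_days)
    return [ref for ref, ingested in refs_with_ingest if ingested < cutoff]

# With an ingest-time query in Registry, the OODS expiry loop would reduce
# to something like (hypothetical API, shown for illustration only):
#   refs = expired_refs(query_datasets_with_ingest_times(collection), 30)
#   butler.pruneDatasets(refs, purge=True, unstore=True)
```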
|Field||Original Value||New Value|
Agree this fits best in Registry. I think the "dataset" table itself is definitely the best place for the new column. I haven't really thought about the pros/cons of having the database fill in the timestamp, and don't really have an opinion on that.
Thinking further, I don't think we need to worry about TAI vs UTC or nanosecond precision, so an automated timestamp added by the database should be more than adequate here without needing an Astropy Time. OODS and the like don't need to worry about missing a couple of files because of TAI offset confusion.
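To put a number on why the offset doesn't matter (illustration only, not from the ticket): the current TAI-UTC offset is 37 seconds, which is tiny relative to a day-scale expiry window.

```python
# TAI-UTC offset in seconds (value since the 2017 leap second).
TAI_MINUS_UTC_S = 37
# A typical OODS expiry window of a week, in seconds.
EXPIRY_WINDOW_S = 7 * 24 * 3600
# The offset is a few parts in 100,000 of the window, so confusing the
# two timescales cannot meaningfully change which files get expired.
fraction = TAI_MINUS_UTC_S / EXPIRY_WINDOW_S
print(f"{fraction:.2e}")
```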
|Assignee||Tim Jenness [ tjenness ]|
|Team||Data Access and Database [ 10204 ]||Architecture [ 10304 ]|
There may just be three kinds of times in Registry rather than two, but the begin/end timestamps associated with RUN-type collections (and someday Quanta) seem more like the ingest times you're adding now than the observation-oriented TAI times we use in dimensions and calibration collections. Nothing populates those right now, but maybe we should change the type to whatever you use here as well.
|Status||To Do [ 10001 ]||In Progress [ 3 ]|
Following discussion on slack just now we have decided that I will add a single automatically created timestamp field to the datasets table: ingest_date. There will not be an additional "date of first ingest" that would propagate through database merging since we don't want to duplicate provenance handling.
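As a sketch of the "database fills in the timestamp automatically" behaviour (using plain sqlite3 here for illustration; the real `dataset` table is defined via SQLAlchemy specs in daf_butler, and the column name `ingest_date` is the one agreed above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE dataset (
           id INTEGER PRIMARY KEY,
           -- Populated by the database at insert time; no client-side
           -- Astropy Time object is needed.
           ingest_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP
       )"""
)
conn.execute("INSERT INTO dataset DEFAULT VALUES")
row = conn.execute("SELECT ingest_date FROM dataset").fetchone()
print(row[0])  # a UTC 'YYYY-MM-DD HH:MM:SS' string filled in by SQLite
```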
Updating queryDatasets to use this time will be for a later ticket.
I think this trivial change does everything we need. I could also add the querying side of things but I'd need some pointers as to where in the code we could possibly handle it. I think the main use case is in a queryDatasets WHERE expression.
|Reviewers||Jim Bosch [ jbosch ]|
|Status||In Progress [ 3 ]||In Review [ 10004 ]|
Yes. I think we may have lost the previous merge commits in the last rebase, so it may be worth fixing that before you merge if we want to bother fixing it at all. Linking all of the tickets that get merged on Jira may be sufficient.
Looks good! Sorry it took me so long to get to reviewing such a tiny change.
|Status||In Review [ 10004 ]||Reviewed [ 10101 ]|
|Resolution||Done [ 10000 ]|
|Status||Reviewed [ 10101 ]||Done [ 10002 ]|
Jim Bosch this seems like a low-hanging-fruit schema change that I should try to do.
I think we decided that this should go in Registry to allow querying. It could also go in Datastore, but that would only be useful if there were some behind-the-scenes datastore job that could delete artifacts without telling Registry.
What I'm not sure about is which table the date should be attached to. Should it be a new table, or should an existing table get the new column? Is the "dataset" table (https://github.com/lsst/daf_butler/blob/master/python/lsst/daf/butler/registry/datasets/byDimensions/tables.py#L200) the right place? And should we let the database automatically fill in the timestamp?