You talk about limiting keys for the DataUnits. Is there a system for validating acceptable keys?
At what point? The set of keys is decided at design time (it's effectively RFC-484). We'll certainly validate keys (and values) when they're passed to get and put. I could imagine it being useful for programs to be able to check that a set of keys is valid for a particular DatasetType at other points, but I haven't come across any concrete use cases for that, so I'm not sure what the API ought to be. I imagine anything along those lines would be easy to do.
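As a rough illustration of what such a check could look like, here is a minimal sketch. Everything in it is hypothetical: the DatasetType names, their key sets, and the function names are made up for the example and are not part of any actual Butler API.

```python
# Hypothetical design-time table: DatasetType name -> {key: expected value type}.
# The entries are illustrative only, not the real DataUnit schema.
DATASET_TYPE_KEYS = {
    "raw": {"camera": str, "exposure": int, "sensor": int},
    "deepCoadd": {"skymap": str, "tract": int, "patch": int, "filter": str},
}


def check_data_id(dataset_type, data_id):
    """Return a list of problems with data_id; an empty list means it is valid."""
    if dataset_type not in DATASET_TYPE_KEYS:
        return [f"unknown DatasetType {dataset_type!r}"]
    schema = DATASET_TYPE_KEYS[dataset_type]
    problems = []
    # Keys that this DatasetType does not accept.
    for key in sorted(set(data_id) - set(schema)):
        problems.append(f"unexpected key {key!r}")
    # Keys that this DatasetType requires but were not given.
    for key in sorted(set(schema) - set(data_id)):
        problems.append(f"missing key {key!r}")
    # Values must match the type declared for their key.
    for key, expected in schema.items():
        if key in data_id and not isinstance(data_id[key], expected):
            actual = type(data_id[key]).__name__
            problems.append(f"{key!r} should be {expected.__name__}, got {actual}")
    return problems


def require_valid(dataset_type, data_id):
    """What get and put might call first: raise if the data ID is not valid."""
    problems = check_data_id(dataset_type, data_id)
    if problems:
        raise ValueError("; ".join(problems))
```

Returning a list of problems rather than raising immediately is one way to let other code ask "is this set of keys valid?" without committing to a get or put.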
Is it true that every DataUnit key can have a tuple of values like patch, or will the keys that accept tuples need to be configured somehow?
Actually, none of them will be allowed to have tuple values. Patches will be switched to a single sequential integer (à la RFC-365). In fact, the value for a particular key will at some level be strongly typed, because it will appear in a SQL schema. But we will also have separate cell_x and cell_y integer fields in the Patch table that would allow something very much like the old indexes to be used in expressions.
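To make the idea concrete, here is a small sketch using SQLite. Only the Patch table name and the cell_x and cell_y fields come from the description above; the patch column name, the row-major numbering, and the omission of tract and skymap columns are assumptions made to keep the example short.

```python
import sqlite3


def build_patch_table(nx, ny):
    """Create an in-memory Patch table for one nx-by-ny grid of patches."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Patch ("
        " patch INTEGER PRIMARY KEY,"   # single sequential integer ID
        " cell_x INTEGER NOT NULL,"     # old-style x index
        " cell_y INTEGER NOT NULL)"     # old-style y index
    )
    # Assumed numbering: row-major, so patch = cell_y * nx + cell_x.
    rows = [(y * nx + x, x, y) for y in range(ny) for x in range(nx)]
    conn.executemany("INSERT INTO Patch VALUES (?, ?, ?)", rows)
    return conn


def patches_matching(conn, where):
    """Return sequential patch IDs selected by an expression on cell_x/cell_y.

    The expression is interpolated directly into the SQL, so this is only
    suitable for trusted input in a sketch like this one.
    """
    query = f"SELECT patch FROM Patch WHERE {where} ORDER BY patch"
    return [row[0] for row in conn.execute(query)]


conn = build_patch_table(3, 3)
selected = patches_matching(conn, "cell_x = 1 AND cell_y >= 1")
```

The data ID itself only ever carries the integer patch value; the two index columns exist purely so that expressions can still be written in terms of the grid position.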
I'm a little worried about doing away with templates completely since different datasets look different by construction, but I don't fully comprehend the implementation you are suggesting, so I'm ok with seeing how things shape up.
We aren't getting rid of them completely - the main (POSIX) Datastore will still use templates in put; we'll just then put the full filename in the database and use a query to look it up from the data ID when reading, so there will be no re-insertion of the data ID values into the template in get.
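A minimal sketch of that put/get split, under stated assumptions: the class name, the table layout, the template strings, and the use of SQLite and JSON-serialized data IDs are all invented for illustration and do not describe the actual Datastore.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


class SketchDatastore:
    """Toy POSIX datastore: templates are used in put only, never in get."""

    def __init__(self, root, templates):
        self.root = Path(root)
        self.templates = templates  # DatasetType name -> filename template
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE dataset (dataset_type TEXT, data_id TEXT, path TEXT,"
            " PRIMARY KEY (dataset_type, data_id))"
        )

    @staticmethod
    def _key(data_id):
        # Canonical string form so key order in the dict does not matter.
        return json.dumps(data_id, sort_keys=True)

    def put(self, payload, dataset_type, data_id):
        # The template is expanded exactly once, when writing.
        path = self.root / self.templates[dataset_type].format(**data_id)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(payload)
        # The full filename is recorded in the database.
        self.db.execute(
            "INSERT INTO dataset VALUES (?, ?, ?)",
            (dataset_type, self._key(data_id), str(path)),
        )
        return path

    def get(self, dataset_type, data_id):
        # Reading is a query on the data ID; no template is involved.
        row = self.db.execute(
            "SELECT path FROM dataset WHERE dataset_type = ? AND data_id = ?",
            (dataset_type, self._key(data_id)),
        ).fetchone()
        if row is None:
            raise LookupError(f"no {dataset_type} dataset for {data_id}")
        return Path(row[0]).read_bytes()


with tempfile.TemporaryDirectory() as root:
    store = SketchDatastore(root, {"calexp": "calexp/{visit:07d}/{sensor:03d}.fits"})
    store.put(b"pixels", "calexp", {"visit": 42, "sensor": 7})
    round_trip = store.get("calexp", {"sensor": 7, "visit": 42})
```

One consequence of this design is that changing a template later does not break reads of datasets that were already written, since get never re-derives the filename.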
I note the tasks for the obs WG. I do think it would be good to convene enough of a WG to, at least, look through these documents.
I do think it would be good for more than just me to think about the tasks called out in that page. Maybe we could take a couple of hours at the all hands to get the obs WG together to consider those issues and then disband the WG in favor of a more focused implementation group.
Agreed, though I need to get moving on at least some of these tasks before then, so it's likely there will be at least placeholder interfaces/objects in the works by the time the WG has a chance to take a look. And the various "review DMTN-073" tasks may be better done under the auspices of the more structured review of that document that I believe Fritz Mueller will be organizing, though of course the Obs WG very much includes many of the most important people to look at that.
I'm a little worried that it sounds like you are suggesting adding camera info to the raw data repository. I'd hoped any camera information other than the camera name would go in a calibration repository.
The camera information would definitely go into the Gen3 equivalent of a calibration repository, but there won't be such a strong boundary between calibration repositories and raw data repositories in Gen3; instead there will be one master repository that includes raw data, multiple sets of calibrations, and many sets of processing outputs. Essentially, because Gen3 has a more general system for expressing many-to-many relationships between datasets, the relationship between calibrations and raw data is no longer as special as it once was, so declaring the set of calibrations you want to use in a processing run will be very similar to declaring a set of intermediate main-pipeline outputs you want to start from.
Re: monotonically increasing exposure ids, I know this would be nice, but it seems like a really easy way to introduce bugs. I.e., how do you check that this requirement is met by a particular dataset? A better requirement, it seems to me, would be that exposures must supply the observation date and that we have utilities to convert that to a monotonically increasing id.
We will definitely have the observation date, too, but I'd very much like to minimize having multiple integer IDs for the same exposure (i.e. by inventing our own in addition to having an externally meaningful one). So I think the alternative would be to just use timestamps directly for the raw->calibration lookup. That feels less "clean" than using monotonic integer ranges, but I don't have a really concrete argument for why it'd be worse - though Robert Lupton might; this was originally his idea.
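For what it's worth, both the range lookup and the consistency check asked about above are short to write down. The sketch below is illustrative only: the validity-range table, the calibration labels, the function names, and the use of ISO 8601 strings for observation dates are all assumptions, not a description of how the lookup will actually be implemented.

```python
from bisect import bisect_right

# Hypothetical validity ranges: (first exposure ID, last exposure ID, label),
# sorted by first ID and non-overlapping.
VALIDITY = [
    (1000, 1999, "bias_A"),
    (2000, 2999, "bias_B"),
]


def find_calibration(exposure_id, ranges=VALIDITY):
    """Return the label whose inclusive ID range contains exposure_id, or None."""
    starts = [first for first, _, _ in ranges]
    # Index of the last range that starts at or before exposure_id.
    i = bisect_right(starts, exposure_id) - 1
    if i >= 0 and ranges[i][0] <= exposure_id <= ranges[i][1]:
        return ranges[i][2]
    return None


def ids_follow_observation_order(exposures):
    """Check that exposure IDs strictly increase with observation date.

    exposures is an iterable of (obs_date, exposure_id) pairs, where obs_date
    is assumed to be an ISO 8601 string in a single fixed format, so that
    string order matches time order.
    """
    ids = [exposure_id for _, exposure_id in sorted(exposures)]
    return all(a < b for a, b in zip(ids, ids[1:]))
```

The same find_calibration logic would work unchanged with timestamps in place of integer IDs, which is essentially the alternative mentioned above; the integer version just avoids a second notion of "when" for each exposure.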
I'm a little worried that the concept of a visit is special solely because of the current LSST survey design. Are there other similar concepts used by other surveys that should be called out as special concepts?
Visit very much is special because of the current LSST survey design, but happily it just represents a generalization of what other similar surveys are doing. The only other surveys I can think of at the moment that involve core concepts that I think our data model doesn't capture well are SDSS and Gaia, but I think those really are rather unusual features (drift-scan in very long stripes, simultaneous observations with a rigid angle between them), and that it's not in our interest to try to include them in our data model. The same goes for spectrographic data - I don't think PFS will be able to get away with using our set of DataUnits, for example. But it should be possible to use the rest of the system with our own DataUnit schema, and that's what I'd expect any effort to apply our code to fundamentally different observations to do.
I made some typographic changes to the page. Hopefully that's ok.
I'm closing this ticket now, but I'm certainly happy to continue the conversation (either at LSST2018 or earlier).