Status: To Do
Fix Version/s: None
Component/s: Design Documents
Recommendation 9A from the LSP design review report states
The Project should consider including a priority-2 requirement for a mechanism to allow the random sampling of database tables in a reproducible way.
This would indeed be helpful, and is commonly implemented by adding to each table a column containing a random number in some format. For example, the Gaia data model includes random_index, which "contains a random permutation of the numbers from 0 to N-1, where N is the number of sources in the table." I think one could also do random floats between 0 and 1 to get the same effect. I suspect floats would be easier to implement since individual tracts could generate random numbers in parallel, and we wouldn't have to institute a table-scale process of assigning each new random ID.
We should add a similar column to the DPDD and the Object table schema. I don't think there's as much reason to add this ForcedSource. Source is a borderline case but I'd lean towards not needing it there either.
Following discussion with Colin, yes we are adding this to the baseline. It is one column, it has not been high priority to date. Implementation details still to be worked out.
I would suggest that even if this is materialized it should still be a hash of one or more row values rather than a true or pseudo-random number. Using a hash should make testing much easier.
Questions still to be answered:
- Is this going to be implemented as a UDF or as a materialized column?
- If a column, is it limited to Object (and perhaps DIAObject), as more or less suggested above? If it is a UDF, is it applicable to the unique-ID of every one of the major catalog tables – (DIA)Object, (DIA)Source, ForcedSource, etc.?
- What is the datatype of the column or UDF value? Float/int? 32/64-bit?
- If it is a column, how is it generated? As a hash on the row contents? (This seems preferable, as it yields reproducible results.)
- If it is a column, at what stage of processing is it generated? In the pipelines themselves, i.e., included in the afw.table outputs of the pipelines? Or is it generated/added as part of the SDM Standardization stage, appearing for the first time only in the Parquet outputs of that stage?
Pinging this ticket