Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-20230

Add random index column to DPDD/DB Schema

    XMLWordPrintable

    Details

    • Team:
      DM Science

      Description

      Recommendation 9A from the LSP design review report states

      The Project should consider including a priority-2 requirement for a mechanism to allow the random sampling of database tables in a reproducible way.

      This would indeed be helpful, and is commonly implemented by adding to each table a column containing a random number in some format. For example, the Gaia data model includes random_index, which "contains a random permutation of the numbers from 0 to N-1, where N is the number of sources in the table." I think one could also do random floats between 0 and 1 to get the same effect. I suspect floats would be easier to implement since individual tracts could generate random numbers in parallel, and we wouldn't have to institute a table-scale process of assigning each new random ID.

      We should add a similar column to the DPDD and the Object table schema. I don't think there's as much reason to add this ForcedSource. Source is a borderline case but I'd lean towards not needing it there either.

        Attachments

          Activity

          Hide
          gpdf Gregory Dubois-Felsmann added a comment -

          Pinging this ticket

          Show
          gpdf Gregory Dubois-Felsmann added a comment - Pinging this ticket
          Hide
          lguy Leanne Guy added a comment -

          Following discussion with Colin, yes we are adding this to the baseline. It is one column, it has not been high priority to date. Implementation details still to be worked out.

          Show
          lguy Leanne Guy added a comment - Following discussion with Colin, yes we are adding this to the baseline. It is one column, it has not been high priority to date. Implementation details still to be worked out.
          Hide
          gpdf Gregory Dubois-Felsmann added a comment -

          Are we doing to do this for DP0.2?

          Show
          gpdf Gregory Dubois-Felsmann added a comment - Are we doing to do this for DP0.2?
          Hide
          ktl Kian-Tat Lim added a comment -

          I would suggest that even if this is materialized it should still be a hash of one or more row values rather than a true or pseudo-random number. Using a hash should make testing much easier.

          Show
          ktl Kian-Tat Lim added a comment - I would suggest that even if this is materialized it should still be a hash of one or more row values rather than a true or pseudo-random number. Using a hash should make testing much easier.
          Hide
          gpdf Gregory Dubois-Felsmann added a comment -

          Questions still to be answered:

          • Is this going to be implemented as a UDF or as a materialized column?
          • If a column, is it limited to Object (and perhaps DIAObject), as more or less suggested above? If it is a UDF, is it applicable to the unique-ID of every one of the major catalog tables – (DIA)Object, (DIA)Source, ForcedSource, etc.?
          • What is the datatype of the column or UDF value? Float/int? 32/64-bit?
          • If it is a column, how is it generated? As a hash on the row contents? (This seems preferable, as it yields reproducible results.)
          • If it is a column, at what stage of processing is it generated? In the pipelines themselves, i.e., included in the afw.table outputs of the pipelines? Or is it generated/added as part of the SDM Standardization stage, appearing for the first time only in the Parquet outputs of that stage?
          Show
          gpdf Gregory Dubois-Felsmann added a comment - Questions still to be answered: Is this going to be implemented as a UDF or as a materialized column? If a column, is it limited to Object (and perhaps DIAObject), as more or less suggested above? If it is a UDF, is it applicable to the unique-ID of every one of the major catalog tables – (DIA)Object, (DIA)Source, ForcedSource, etc.? What is the datatype of the column or UDF value? Float/int? 32/64-bit? If it is a column, how is it generated? As a hash on the row contents? (This seems preferable, as it yields reproducible results.) If it is a column, at what stage of processing is it generated? In the pipelines themselves, i.e., included in the afw.table outputs of the pipelines? Or is it generated/added as part of the SDM Standardization stage, appearing for the first time only in the Parquet outputs of that stage?

            People

            Assignee:
            ctslater Colin Slater
            Reporter:
            ctslater Colin Slater
            Watchers:
            Andy Salnikov, Colin Slater, Eric Bellm, Fritz Mueller, Gregory Dubois-Felsmann, Kian-Tat Lim, Leanne Guy
            Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

              Dates

              Created:
              Updated:

                Jenkins

                No builds found.