Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-20230

Add random index column to DPDD/DB Schema

    Details

    • Team:
      DM Science

      Description

      Recommendation 9A from the LSP design review report states

      The Project should consider including a priority-2 requirement for a mechanism to allow the random sampling of database tables in a reproducible way.

      This would indeed be helpful, and is commonly implemented by adding to each table a column containing a random number in some format. For example, the Gaia data model includes random_index, which "contains a random permutation of the numbers from 0 to N-1, where N is the number of sources in the table." I think one could also do random floats between 0 and 1 to get the same effect. I suspect floats would be easier to implement since individual tracts could generate random numbers in parallel, and we wouldn't have to institute a table-scale process of assigning each new random ID.

      We should add a similar column to the DPDD and the Object table schema. I don't think there's as much reason to add this ForcedSource. Source is a borderline case but I'd lean towards not needing it there either.

        Attachments

          Activity

          Hide
          fritzm Fritz Mueller added a comment - - edited

          Brian Van Klaveren has in the past suggested an alternative approach: that we provide a hashing (permuting) UDF that could be applied to ObjectIDs. This would have the advantage of no additional storage overhead and no generation/assignment complications.

          Show
          fritzm Fritz Mueller added a comment - - edited Brian Van Klaveren has in the past suggested an alternative approach: that we provide a hashing (permuting) UDF that could be applied to ObjectIDs. This would have the advantage of no additional storage overhead and no generation/assignment complications.
          Hide
          ctslater Colin Slater added a comment -

          Ah that's a cool idea. Trying to think through the implications of either choice, the hash function means you couldn't index on such a column, but I'm not sure if we would have made such an index anyways? Otherwise it sounds like a time-space tradeoff; not sure how costly such a function would be to run.

          Show
          ctslater Colin Slater added a comment - Ah that's a cool idea. Trying to think through the implications of either choice, the hash function means you couldn't index on such a column, but I'm not sure if we would have made such an index anyways? Otherwise it sounds like a time-space tradeoff; not sure how costly such a function would be to run.
          Hide
          fritzm Fritz Mueller added a comment -

          Right. For Qserv, we would not maintain such an index in any case, to date we find we are primarily I/O bound and not CPU bound, and if we went ahead and added a single such UDF it would "just work" with what we have today (and could also be used on Sources, FWIW).

          For non-Qserv, having an actual indexed column(s) would probably be quite beneficial... So what to do wrt. the DPDD, which is ostensibly Qserv-agnostic? I guess we'd need to scope the storage reqs (not hard) and DRP implementation effort?

          Show
          fritzm Fritz Mueller added a comment - Right. For Qserv, we would not maintain such an index in any case, to date we find we are primarily I/O bound and not CPU bound, and if we went ahead and added a single such UDF it would "just work" with what we have today (and could also be used on Sources, FWIW). For non-Qserv, having an actual indexed column(s) would probably be quite beneficial... So what to do wrt. the DPDD, which is ostensibly Qserv-agnostic? I guess we'd need to scope the storage reqs (not hard) and DRP implementation effort?
          Hide
          fritzm Fritz Mueller added a comment -

          Another possibility, I suppose, would be to go ahead and scope it in the DPDD, generate it during DRP, then if we really wanted/needed the space savings for Qserv we could strip it off at ingest and provide the id-permute UDF as an alternative? That way it would still be there ready if somebody wanted to pick up or bulk download the DRP outputs for use with some non-Qserv RDBMS?

          Show
          fritzm Fritz Mueller added a comment - Another possibility, I suppose, would be to go ahead and scope it in the DPDD, generate it during DRP, then if we really wanted/needed the space savings for Qserv we could strip it off at ingest and provide the id-permute UDF as an alternative? That way it would still be there ready if somebody wanted to pick up or bulk download the DRP outputs for use with some non-Qserv RDBMS?
          Hide
          ctslater Colin Slater added a comment -

          Ah, that might make sense to put it in the DPDD as a column, with a footnote that it might in practice be a UDF. That would suitably put us "on the hook" for it without having to commit to an implementation right now.

          Show
          ctslater Colin Slater added a comment - Ah, that might make sense to put it in the DPDD as a column, with a footnote that it might in practice be a UDF. That would suitably put us "on the hook" for it without having to commit to an implementation right now.
          Hide
          gpdf Gregory Dubois-Felsmann added a comment -

          I think we need some followup on this to decide whether we are adding this to the baseline.

          Show
          gpdf Gregory Dubois-Felsmann added a comment - I think we need some followup on this to decide whether we are adding this to the baseline.

            People

            • Assignee:
              ctslater Colin Slater
              Reporter:
              ctslater Colin Slater
              Watchers:
              Andy Salnikov, Colin Slater, Fritz Mueller, Gregory Dubois-Felsmann, Kian-Tat Lim, Leanne Guy
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:

                Summary Panel