Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-5316

Research alert production database design

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Attachments

        Issue Links

          Activity

          Hide
          smonkewitz Serge Monkewitz added a comment -
          • I'm not sure that DiaForcedSource will have the same schema as DiaSource; in the baseline schema it has significantly skinnier rows (~5x), which would decrease total size by a lot.
          • CCDs will be processed in parallel to the extent possible; off the top of my head AP needs to process with 189*2 16MPixel CCDs every 39 seconds (*2 since a visit usually consists of a pair of exposures), although the pipeline should have some advance notice of where the telescope will take an exposure pair, which can be used to prefetch the historical data that will be required by AP ahead of time. So that relaxes the latency (but not throughput) requirements on the spatial queries somewhat.
          • It's not at all clear to me that the number of validity intervals per object will be high enough to make an index on validity-end time worthwhile, and furthermore, using it in keys will slow down AP updates (since the validity end time of superseded records changes), which is potentially bad. My thinking when I wrote this part of the doc was that the number of intervals would be relatively small and that there will always be another index (ID or spatial bounds) that can usefully cut down the number of records examined by the queries we care about.
          • That said, if such an index ends up being necessary, then we can try indexing validity intervals by (ab)using existing MySQL spatial indexing machinery (if nothing better is available in the production timeframe). Right now there is support for r-tree indexing of axis-aligned 2d cartesian boxes, so the idea would be to use the MJD validity interval in one dimension and fix the interval for the other dimension to [0,0] to get indexable boxes.
          Show
          smonkewitz Serge Monkewitz added a comment - I'm not sure that DiaForcedSource will have the same schema as DiaSource; in the baseline schema it has significantly skinnier rows (~5x), which would decrease total size by a lot. CCDs will be processed in parallel to the extent possible; off the top of my head AP needs to process with 189*2 16MPixel CCDs every 39 seconds (*2 since a visit usually consists of a pair of exposures), although the pipeline should have some advance notice of where the telescope will take an exposure pair, which can be used to prefetch the historical data that will be required by AP ahead of time. So that relaxes the latency (but not throughput) requirements on the spatial queries somewhat. It's not at all clear to me that the number of validity intervals per object will be high enough to make an index on validity-end time worthwhile, and furthermore, using it in keys will slow down AP updates (since the validity end time of superseded records changes), which is potentially bad. My thinking when I wrote this part of the doc was that the number of intervals would be relatively small and that there will always be another index (ID or spatial bounds) that can usefully cut down the number of records examined by the queries we care about. That said, if such an index ends up being necessary, then we can try indexing validity intervals by (ab)using existing MySQL spatial indexing machinery (if nothing better is available in the production timeframe). Right now there is support for r-tree indexing of axis-aligned 2d cartesian boxes, so the idea would be to use the MJD validity interval in one dimension and fix the interval for the other dimension to [0,0] to get indexable boxes.
          Hide
          smonkewitz Serge Monkewitz added a comment -
          • I'm not sure wether all DiaSource records associated with a DiaObject/SSObject will get packaged up with an event - there might be a limit along the lines of "at most N most recent associations" or "associated DiaSources younger than X".
          • Keep in mind that AP has the option of not doing a specific computation in the database. For example, finding all DiaSource records associated with a DiaObject can be done outside of the database by having the pipeline fetch all DiaObjects and all DiaSources in the rough vicinity of a CCD and then doing the required join with optimized C++ code that operates on in-memory data structures. SSObjects are trickier, because those move around significantly. Similarly, AP could find "current" objects in C++. Anyway, the point is just that the executor of some of these AP queries can be adjusted to achieve the desired performance goals.
          Show
          smonkewitz Serge Monkewitz added a comment - I'm not sure wether all DiaSource records associated with a DiaObject/SSObject will get packaged up with an event - there might be a limit along the lines of "at most N most recent associations" or "associated DiaSources younger than X". Keep in mind that AP has the option of not doing a specific computation in the database. For example, finding all DiaSource records associated with a DiaObject can be done outside of the database by having the pipeline fetch all DiaObjects and all DiaSources in the rough vicinity of a CCD and then doing the required join with optimized C++ code that operates on in-memory data structures. SSObjects are trickier, because those move around significantly. Similarly, AP could find "current" objects in C++. Anyway, the point is just that the executor of some of these AP queries can be adjusted to achieve the desired performance goals.
          Hide
          smonkewitz Serge Monkewitz added a comment -

          Regarding IDs, my understanding is that the pipelines are responsible for generating them (the likely mechanism to support the DPDD requirement you cite is to reserve some number of higher order bits in each ID for a DR number), and that we should not ever need to generate them in the database.

          Finally, while I haven't thought very hard about it, I suspect that supporting compound primary/foreign keys (for director/child tables) might cause non-trivial complications in the Qserv query analysis and secondary index code paths, so it may be worth pushing back if things seem to be leaning in this direction.

          Show
          smonkewitz Serge Monkewitz added a comment - Regarding IDs, my understanding is that the pipelines are responsible for generating them (the likely mechanism to support the DPDD requirement you cite is to reserve some number of higher order bits in each ID for a DR number), and that we should not ever need to generate them in the database. Finally, while I haven't thought very hard about it, I suspect that supporting compound primary/foreign keys (for director/child tables) might cause non-trivial complications in the Qserv query analysis and secondary index code paths, so it may be worth pushing back if things seem to be leaning in this direction.
          Hide
          salnikov Andy Salnikov added a comment -

          I wanted to get some input from Jacek regarding baseline schema and forced sources but I'm still waiting. I'll marking this as done and will update table above with more realistic numbers once I manage to get them.

          Show
          salnikov Andy Salnikov added a comment - I wanted to get some input from Jacek regarding baseline schema and forced sources but I'm still waiting. I'll marking this as done and will update table above with more realistic numbers once I manage to get them.
          Hide
          salnikov Andy Salnikov added a comment -

          I have updated size estimated based in DiaForcedSource schema from Schema Browser (which still may change).

          Show
          salnikov Andy Salnikov added a comment - I have updated size estimated based in DiaForcedSource schema from Schema Browser (which still may change).

            People

            • Assignee:
              salnikov Andy Salnikov
              Reporter:
              fritzm Fritz Mueller
              Reviewers:
              Serge Monkewitz
              Watchers:
              Andy Salnikov, Fritz Mueller, Serge Monkewitz
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Summary Panel