RFC-134: Table scans for older releases

    Details

    • Type: RFC
    • Status: Withdrawn
    • Resolution: Done
    • Component/s: DM
    • Labels:
      None

      Description

      The current baseline is modeled to support the following shared scans:

      • Object for the latest DR only
      • Object_extra for the latest and latest-1 DRs
      • Source for the latest DR only
      • ForcedSource for the latest DR only

      Note that, with the exception of Object_extra, we are planning to support efficient full-table-scan access to the latest data release only. Short non-scan queries will be supported for any data release available on disk (per baseline, that'd be latest and latest-1).
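
As a rough illustration, the baseline above can be written down as a small lookup. This is only a sketch: the names `SHARED_SCAN_DEPTH`, `ON_DISK_DEPTH`, and `access_mode` are hypothetical and are not taken from Qserv or the sizing model. DR offsets count back from the latest release (0 = latest, 1 = latest-1).

```python
# Hypothetical encoding of the baseline described above; all names are
# illustrative and do not come from Qserv or the sizing model.

# Number of most recent data releases for which shared scans are supported.
SHARED_SCAN_DEPTH = {
    "Object": 1,        # latest DR only
    "Object_extra": 2,  # latest and latest-1 DRs
    "Source": 1,        # latest DR only
    "ForcedSource": 1,  # latest DR only
}

# Per baseline, the latest and latest-1 DRs are available on disk.
ON_DISK_DEPTH = 2


def access_mode(table: str, dr_offset: int) -> str:
    """Return the kind of access the baseline would offer for a table and DR.

    dr_offset is 0 for the latest DR, 1 for latest-1, and so on.
    """
    if dr_offset < 0:
        raise ValueError("dr_offset must be >= 0")
    if dr_offset < SHARED_SCAN_DEPTH[table]:
        return "shared-scan"
    if dr_offset < ON_DISK_DEPTH:
        return "short-queries-only"
    return "not-on-disk"


if __name__ == "__main__":
    for table in SHARED_SCAN_DEPTH:
        print(table, [access_mode(table, offset) for offset in range(3)])
```

Under these assumptions, Object on latest-1 comes out as "short-queries-only", which is the case this RFC asks for feedback on.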

      I'm soliciting comments and reactions. Changing the model and providing access to older data releases is not a technical problem; it is only a question of $ (how much hardware to purchase).

            Activity

            jbecla Jacek Becla created issue -
            jbosch Jim Bosch added a comment -

            I do think full scans on Object for latest-1 would be useful for some groups - some particularly involved science analyses on the full dataset may take longer than a full year to complete, and they may not want to switch to the latest release in order to maintain any characterization they've already done of our processing.

            jbecla Jacek Becla added a comment -

            I discussed this in detail with Mario Juric; here is the summary.

            Notation used:

            • "DRx" - the latest DR
            • "DRx-1" - the DR released before the latest DR

            • need to support shared scans for DRx and DRx-1
            • the scan for DRx-1 can be slower; even one scan every 24h would be OK
            • note that most users will continue using DRx-1 for a while right after we publish DRx. So in practice, right after releasing DRx, DRx-1 should be configured as high priority (faster than DRx)
            • having a scan for DRx-1 is more important than a scan for Object_Extra (true for any DR: DRx, DRx-1). If needed, lower the speed/priority of the Object_Extra scan
            • actions: model it, estimate cost, propose change to the baseline (Jacek)
            • Science Council is planning to request that we keep ALL data releases on disk
            • request will likely happen in the next 6 months
            • SC should define preferred access patterns for older releases. Slow access, similar to DRx-1, will likely be enough.
            • current model is not appealing: the DR that becomes public gets removed from disks right away
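
For the "model it, estimate cost" action item, a useful first-order number is the sustained read rate a shared scan needs, which is roughly the table size divided by the scan period. The sketch below is only illustrative: the thread gives no table sizes, so `HYPOTHETICAL_TABLE_TB` is a made-up value, and the function name is an assumption rather than anything in the sizing model.

```python
def scan_read_rate_gb_per_s(table_size_tb: float, scan_period_hours: float) -> float:
    """Sustained aggregate read rate (GB/s) to read a table once per period.

    Uses decimal units (1 TB = 1000 GB) and ignores caching, compression,
    and any overlap between concurrent scans.
    """
    if scan_period_hours <= 0:
        raise ValueError("scan_period_hours must be positive")
    gigabytes = table_size_tb * 1000.0
    seconds = scan_period_hours * 3600.0
    return gigabytes / seconds


# Made-up table size, for illustration only (not from the baseline).
HYPOTHETICAL_TABLE_TB = 100.0

slow = scan_read_rate_gb_per_s(HYPOTHETICAL_TABLE_TB, 24.0)  # one scan per 24h
fast = scan_read_rate_gb_per_s(HYPOTHETICAL_TABLE_TB, 12.0)  # one scan per 12h

if __name__ == "__main__":
    print(f"24h scan: {slow:.3f} GB/s, 12h scan: {fast:.3f} GB/s")
```

The point of the sketch is only that halving the scan period doubles the required read rate, so a slow DRx-1 scan adds comparatively little bandwidth demand.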
            mjuric Mario Juric added a comment -

            Two minor comments:

            • I think DRx should still be faster than DRx-1 (there will be a slew of people just waiting for the release, who will hit the database immediately), but access to DRx-1 shouldn't immediately become substantially slower upon release (there should be some transition). I wonder if this can be adjusted dynamically (even at the level of someone manually turning the knobs about once a week)?
            • The issue with keeping only DRx and DRx-1 on disk has primarily come up in the context of long and complex papers that may take a while to write and publish. By the time the paper is truly out and in use, if we only keep two years' worth of DRs spinning, there will be no way to replicate its results (or queries) w/o downloading the data in bulk and re-instantiating a copy of Qserv locally. We're being nudged by the SAC to strongly consider keeping everything on disk, and I think an official CR will occur soon.
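
The gradual transition suggested in the first comment could be as simple as a weekly ramp of the share of scan bandwidth given to DRx-1. The following is a minimal sketch under assumed numbers: the starting share, final share, and transition length are hypothetical, and `drx1_scan_share` is not an existing Qserv setting.

```python
def drx1_scan_share(
    weeks_since_release: float,
    start_share: float = 0.45,      # hypothetical: just under half, so DRx stays faster
    end_share: float = 0.10,        # hypothetical long-term share for DRx-1
    transition_weeks: float = 12.0  # hypothetical length of the transition
) -> float:
    """Fraction of shared-scan bandwidth given to DRx-1 after DRx is released.

    Ramps linearly from start_share to end_share over transition_weeks,
    then stays at end_share. The remainder goes to DRx.
    """
    if transition_weeks <= 0:
        raise ValueError("transition_weeks must be positive")
    # Clamp progress to [0, 1] so values before release or after the
    # transition stay at the endpoints.
    progress = min(max(weeks_since_release / transition_weeks, 0.0), 1.0)
    return start_share + (end_share - start_share) * progress


if __name__ == "__main__":
    # An operator "turning the knobs" once a week could just read off this table.
    for week in range(0, 14, 2):
        share = drx1_scan_share(week)
        print(f"week {week:2d}: DRx-1 {share:.2f}, DRx {1.0 - share:.2f}")
```

Keeping the starting share below 0.5 reflects the view that DRx should be faster from day one, while DRx-1 only slows down gradually.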
            tjenness Tim Jenness added a comment -

            I was under the impression that older DRs are still queryable even if they are on tape. It just takes a lot longer. We really do need to be able to archive queries and associate them with DOIs (which should include the specific data release of course) but that's slightly off topic.

            jbecla Jacek Becla added a comment -

            I primarily worry here about disk I/O. Interactive queries don't mess up disk I/O, so we were planning to let them through on any DR (and yes, if it triggers tape access it will be slow). So technically all DRs are queryable even if we don't allow scans on some DRs.

            jbecla Jacek Becla made changes -
            Field Original Value New Value
            Link This issue is triggering DM-5103 [ DM-5103 ]
            jbecla Jacek Becla added a comment -

            Based on the input received, it looks like we should tweak the baseline and support shared scan access to at least the 2 most recent data releases. The next step will be to run it through the spreadsheets and come up with the $ number.

            jbecla Jacek Becla made changes -
            Resolution Done [ 10000 ]
            Status Proposed [ 10805 ] Adopted [ 10806 ]
            tjenness Tim Jenness added a comment -

            John Swinbank, Wil O'Mullane: given that the work associated with this RFC was invalidated, should we withdraw this RFC?

            swinbank John Swinbank added a comment -

            I think we should solicit opinions from Fritz Mueller & Colin Slater first.

            ctslater Colin Slater added a comment -

            I haven't seen any places where these choices actually drive implementation choices, so I have understood this as "suggestions" for some future policy closer to ops.

            (On the technical side, it seems crazy to have databases that don't allow a table scan at all. I have never fully understood what Jacek was proposing here.)

            gcomoretto Gabriele Comoretto [X] (Inactive) made changes -
            Remote Link This issue links to "Page (Confluence)" [ 19578 ]
            gcomoretto Gabriele Comoretto [X] (Inactive) made changes -
            Remote Link This issue links to "Page (Confluence)" [ 19609 ]
            ktl Kian-Tat Lim made changes -
            Assignee Jacek Becla [ jbecla ] Kian-Tat Lim [ ktl ]
            ktl Kian-Tat Lim added a comment -

            I don't think that the supposed baseline was actually implemented in the sizing model, and in any case the disk drive count in the sizing model is constrained primarily by storage, not bandwidth.

            So I'd suggest that we withdraw this RFC.

            (Sorry about the reassignment; it was due to an accidental typo and fixing it will cause more unnecessary spam.)
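
The storage-versus-bandwidth point in this comment can be illustrated with a toy calculation: the drive count is the larger of what capacity requires and what scan bandwidth requires. This is a sketch with made-up inputs, not figures from the sizing model; `drives_needed` and all of its parameters are hypothetical.

```python
import math


def drives_needed(
    capacity_tb: int,
    scan_bandwidth_mb_s: int,
    drive_capacity_tb: int,
    drive_bandwidth_mb_s: int,
) -> tuple[int, str]:
    """Return (drive count, binding constraint) for a simple two-constraint model.

    The count is the maximum of the drives needed to hold the data and the
    drives needed to sustain the requested aggregate scan bandwidth.
    """
    for_storage = math.ceil(capacity_tb / drive_capacity_tb)
    for_bandwidth = math.ceil(scan_bandwidth_mb_s / drive_bandwidth_mb_s)
    if for_storage >= for_bandwidth:
        return for_storage, "storage"
    return for_bandwidth, "bandwidth"


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    count, constraint = drives_needed(
        capacity_tb=10_000,
        scan_bandwidth_mb_s=50_000,
        drive_capacity_tb=10,
        drive_bandwidth_mb_s=100,
    )
    print(count, constraint)
```

In a storage-bound regime like this hypothetical one, adding scans on an older release that is already on disk would not change the drive count until the bandwidth requirement overtakes the storage requirement.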

            ktl Kian-Tat Lim made changes -
            Resolution Done [ 10000 ]
            Status Adopted [ 10806 ] Flagged [ 10606 ]
            ktl Kian-Tat Lim made changes -
            Resolution Done [ 10000 ]
            Status Flagged [ 10606 ] Withdrawn [ 10605 ]
            gcomoretto Gabriele Comoretto [X] (Inactive) made changes -
            Remote Link This issue links to "Page (Confluence)" [ 19643 ]

              People

              Assignee:
              ktl Kian-Tat Lim
              Reporter:
              jbecla Jacek Becla
              Watchers:
              Colin Slater, Jacek Becla, Jim Bosch, John Swinbank, Kian-Tat Lim, Tim Jenness, Xiuqin Wu [X] (Inactive)
              Votes:
              0

                Dates

                Created:
                Updated:
                Resolved:
                Planned End:
