RFC-538

Update alert sizing in LDM-151

    Details

    • Type: RFC
    • Status: Implemented
    • Resolution: Done
    • Component/s: DM
    • Labels:
      None

      Description

      New estimates of the alert stream sizing have been developed. Jeff Kantor feels that LDM-151, which already has an estimate of the total alert data size per night, is an appropriate place to record the updated values. The edited text is available at https://github.com/lsst/LDM-151/pull/56 for now, soon to be merged to master and then uploaded to DocuShare. (I'm holding off a bit in case there are other LDM-151 updates that could be batched with this.)

      Some of the expansion (from 600 to 800 GB/night) comes from changes to the DPDD, especially the 12 months of history, including non-detections, and larger postage stamps (up to 50x50). The variable part of the expansion (from 800 to 1200 GB/night) comes from potential Avro overhead. While I wasn't able to generate a complete alert packet in the short time I've devoted to this, I came up with up to 50% overhead based on figures that Maria had provided and a few small test cases. It is possible that we may find a way to shrink this, but it will be quite difficult to reduce it below the 800 GB figure.
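The range quoted above follows from simple arithmetic. The sketch below assumes that the 800 GB/night figure is the payload before serialization overhead and that the "up to 50%" Avro overhead applies multiplicatively to it. The function name is hypothetical and is not part of LDM-151.

```python
def nightly_volume_gb(payload_gb: float, overhead_fraction: float) -> float:
    """Return the nightly alert volume after a fractional serialization overhead."""
    if overhead_fraction < 0:
        raise ValueError("overhead_fraction must be non-negative")
    return payload_gb * (1.0 + overhead_fraction)


PAYLOAD_GB = 800.0  # figure from the RFC description

# Assumption: overhead ranges from none (0.0) to the estimated maximum (0.5).
low = nightly_volume_gb(PAYLOAD_GB, 0.0)
high = nightly_volume_gb(PAYLOAD_GB, 0.5)
print(f"{low:.0f} to {high:.0f} GB/night")
```

Any reduction in Avro overhead moves the upper bound toward 800 GB but, under this model, never below it.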

            Activity

            ktl Kian-Tat Lim created issue -
            ktl Kian-Tat Lim made changes -
            Field Original Value New Value
            Risk Score 0
            spietrowicz Steve Pietrowicz added a comment - - edited

            In section 3.3.4, on page 24, the old number of 600GB is still used.

            Also, what's the approximate size of each message?

            ktl Kian-Tat Lim added a comment - - edited

            Thanks for finding that other reference. I've updated it.

            The minimum Alert size is 82 KB, from adding up the entries in the DPDD and determining the number of historical DIASource records per Alert based on LSE-81. Maria reportedly had 135 KB per Alert without history, but I'm not certain that was measured correctly, and I think it involved a very inefficient encoding of some metadata. So I think we can get substantially smaller.
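To relate these per-alert sizes to a nightly total, multiply by an alert count. The count is not stated in this thread. The 10 million alerts per night used below is an assumption for illustration only, as are the decimal units (1 KB = 1000 bytes) and the helper name.

```python
KB = 1_000          # assumption: decimal kilobytes
GB = 1_000_000_000  # assumption: decimal gigabytes

# Assumption for illustration; this thread does not state an alert count.
ASSUMED_ALERTS_PER_NIGHT = 10_000_000


def nightly_gb(alert_size_kb: float, alerts_per_night: int) -> float:
    """Return the nightly volume in GB for a given per-alert size in KB."""
    return alert_size_kb * KB * alerts_per_night / GB


sizes_kb = {
    "minimum, from DPDD field counts": 82,
    "reported, without history": 135,
}
for label, size_kb in sizes_kb.items():
    total = nightly_gb(size_kb, ASSUMED_ALERTS_PER_NIGHT)
    print(f"{label}: {total:.0f} GB/night")
```

Under that assumed count, the 82 KB minimum lands near the 800 GB/night figure in the description.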

            jkantor Jeff Kantor added a comment -

            I have already provided my input to this, so no further comments from me.

            ebellm Eric Bellm made changes -
            Watchers Jeff Kantor, John Swinbank, Kian-Tat Lim, Steve Pietrowicz [ Jeff Kantor, John Swinbank, Kian-Tat Lim, Steve Pietrowicz ] Jeff Kantor, John Swinbank, Kian-Tat Lim, Leanne Guy, Steve Pietrowicz [ Jeff Kantor, John Swinbank, Kian-Tat Lim, Leanne Guy, Steve Pietrowicz ]
            ebellm Eric Bellm added a comment -

            I'm not comfortable updating the alert sizing to these values without a demonstrated serialization. The Avro encoding is our implementation baseline and undergirded the completion of the LDM-503-5 milestone. At this point we can make estimates at higher fidelity than simply counting fields and I think it's worth taking the time to do so (and identify where inefficiencies exist and fix them). I'm happy to take this on.

            I also think we need to present more nuance about when the clock stops on OTT1. Summarizing our email discussion, we are in agreement that "packet arrival at the broker" is not workable because of potential network and broker latencies. However I am not convinced that "making the alert available in the queue at NCSA" is satisfactory as an endpoint because it does not account for even ideal transit time out of NCSA.

            ktl Kian-Tat Lim made changes -
            Planned End 20/Oct/18 4:43 AM 27/Oct/18 4:43 AM
            ktl Kian-Tat Lim added a comment -

            I haven't had time to work on a measurement of Avro sizing. If Eric Bellm can do so, that would be great.

            I'm not sure there is such a thing as "ideal transit time" out of NCSA unless we also define an "ideal broker" at some physical and network location. Availability at the NCSA network endpoint would seem to be the best that can be measured and managed. If we want to reserve some additional budget for (N_bytes / nominal_bandwidth), we could, but that's not really different from subtracting a few seconds from OTT1 with the network-endpoint definition.

            ebellm Eric Bellm added a comment -

            Kian-Tat Lim: I'll work on an improved estimate of the Avro sizing.

            I take your point that we can only effectively measure (and control) packet availability at the NCSA endpoint, so I agree we should do so. But I do think we have to allocate (N_bytes / nominal_bandwidth) in OTT1; otherwise we could claim success even if there was no pipe out of NCSA at all.
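The (N_bytes / nominal_bandwidth) term discussed above can be sketched as follows. Only the 82 KB alert size comes from this thread. The alerts-per-visit count and the bandwidth value are placeholders chosen for illustration, and the function name is hypothetical.

```python
import math


def transit_seconds(n_bytes: int, nominal_bandwidth_bps: float) -> float:
    """Ideal transfer time, N_bytes / nominal_bandwidth, with bandwidth in bits/s."""
    if nominal_bandwidth_bps <= 0:
        # No pipe out of the data center: the alerts never arrive.
        return math.inf
    return n_bytes * 8 / nominal_bandwidth_bps


ALERT_BYTES = 82_000               # minimum alert size quoted earlier in the thread
ASSUMED_ALERTS_PER_VISIT = 10_000  # assumption, not stated in this thread
ASSUMED_BANDWIDTH_BPS = 10e9       # placeholder nominal bandwidth (10 Gbps)

burst_bytes = ALERT_BYTES * ASSUMED_ALERTS_PER_VISIT
print(f"{transit_seconds(burst_bytes, ASSUMED_BANDWIDTH_BPS):.3f} s")
print(transit_seconds(burst_bytes, 0))
```

The zero-bandwidth case makes the concern concrete: if the budget omits this term, a deployment with no outbound capacity would still appear to meet OTT1 at the endpoint.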

            ebellm Eric Bellm added a comment -

            Created https://jira.lsstcorp.org/browse/DM-16280 for the estimation.

            tjenness Tim Jenness made changes -
            Planned End 27/Oct/18 4:43 AM 06/Nov/18 4:43 AM
            swinbank John Swinbank made changes -
            Link This issue relates to DM-15512 [ DM-15512 ]
            swinbank John Swinbank made changes -
            Link This issue relates to DM-16280 [ DM-16280 ]
            swinbank John Swinbank added a comment -

            This RFC arose from a DMLT action of 2018-10-08 on Jeff Kantor & Kian-Tat Lim to (among other things) “... to document the available bandwidth for ... alert streams”.

            Understanding the size of an individual alert stream may be a necessary prerequisite for understanding how many streams some given bandwidth can support — and, arguably, may be useful in reasoning about how much total bandwidth is necessary — but it's not obvious that this RFC as written actually addresses the action requested. What is the plan for baselining a total alert stream bandwidth?

            jkantor Jeff Kantor added a comment -

            As I understood it, two tasks are needed to complete this:

            • Eric and K-T need to agree on a sizing for each stream.
            • NCSA needs to provide an analysis of ALL LDF outbound traffic rates, including alerts.

            In the interim, we decided that we could a priori allocate 10 Gbps to alert streams, irrespective of the above.

             

            tjenness Tim Jenness added a comment -

            What's the status of this RFC?

            ebellm Eric Bellm added a comment -

            Quantitative estimates are complete (DM-16280), so I think it's just up to Kian-Tat Lim to update the documents?

            tjenness Tim Jenness added a comment -

            I think that this RFC has run its course: it's not requesting approval for an update of LDM-151; it seems to be requesting that we at some point update LDM-151. Can we mark it as implemented based on DM-16280 being done, and then, when Kian-Tat Lim does the update to LDM-151, make that another RFC?

            swinbank John Swinbank added a comment -

            Per my comment above, we still haven't adequately addressed the todo list item which this RFC is (supposedly) in response to. Please don't close this RFC until either the full todo list item has been addressed or some other ticket has been filed to capture the action of documenting available bandwidth.

            ktl Kian-Tat Lim added a comment -

            My plan is to modify LDM-151 with the per-stream bandwidth and LDM-148 with the total available baseline outbound alert network bandwidth (in text) and per-stream bandwidth (in figure). This RFC will then morph into proposing those two new document versions for CCB acceptance.

            I'm hoping to complete those modifications and publish new versions in DocuShare later this afternoon.

            ktl Kian-Tat Lim added a comment -

            New non-preferred draft versions of LDM-148 and LDM-151 have been uploaded to DocuShare.

            ktl Kian-Tat Lim added a comment -

            The DM-CCB is requested to approve these changes to LDM-148 and LDM-151.

            ktl Kian-Tat Lim made changes -
            Status Proposed [ 10805 ] Flagged [ 10606 ]
            gcomoretto Gabriele Comoretto [X] (Inactive) made changes -
            Remote Link This issue links to "Page (Confluence)" [ 19580 ]
            gcomoretto Gabriele Comoretto [X] (Inactive) made changes -
            Remote Link This issue links to "Page (Confluence)" [ 19610 ]
            ktl Kian-Tat Lim made changes -
            Status Flagged [ 10606 ] Board Recommended [ 11405 ]
            ktl Kian-Tat Lim made changes -
            Status Board Recommended [ 11405 ] Adopted [ 10806 ]
            ktl Kian-Tat Lim made changes -
            Resolution Done [ 10000 ]
            Status Adopted [ 10806 ] Implemented [ 11105 ]
            gcomoretto Gabriele Comoretto [X] (Inactive) made changes -
            Remote Link This issue links to "Page (Confluence)" [ 19646 ]

              People

              Assignee:
              ktl Kian-Tat Lim
              Reporter:
              ktl Kian-Tat Lim
              Watchers:
              Eric Bellm, Jeff Kantor, John Swinbank, Kian-Tat Lim, Leanne Guy, Steve Pietrowicz, Tim Jenness
