Data Management / DM-25012

Implement the compute() method in the Aggregator class

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Compute summary statistics like count, min, mean, stddev, median, and max for the source fields in a list of incoming messages.

      These statistics are hard-coded for now. I think it would not be difficult to define these operations in the configuration, but that is left for future work.

      Evaluate whether to use the Python statistics module, NumPy, or pandas for this.
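To make the task concrete, here is a minimal sketch of computing the six statistics for one source field with the Python statistics module. The message layout (plain dicts), the field names, the `summarize` helper, the use of the sample standard deviation, and the 0.0 fallback for a single value are all assumptions for illustration, not the aggregator's actual code.

```python
import statistics

# Hypothetical messages: each one maps a source field name to a value.
messages = [
    {"temperature": 20.0, "pressure": 1.0},
    {"temperature": 22.0, "pressure": 1.5},
    {"temperature": 21.0, "pressure": 2.0},
    {"temperature": 25.0, "pressure": 2.5},
]


def summarize(messages, field_name):
    """Return the six summary statistics for one field across all messages."""
    values = [m[field_name] for m in messages]
    return {
        "count": len(values),
        "min": min(values),
        "mean": statistics.mean(values),
        # Sample standard deviation needs at least two values;
        # returning 0.0 otherwise is an assumption of this sketch.
        "stddev": statistics.stdev(values) if len(values) > 1 else 0.0,
        "median": statistics.median(values),
        "max": max(values),
    }


summary = summarize(messages, "temperature")
print(summary)
```

Calling `summarize` once per source field would produce one aggregated record per window.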

            Activity

            afausti Angelo Fausti added a comment (edited)

            We have at least three options for computing summary statistics in Python that we could use for the Kafka aggregator: the Python statistics module, NumPy, and pandas.

            While NumPy is built for speed, Python's statistics module is built for precision, and pandas has convenience methods for aggregating time-series data, such as pandas.Series.resample().
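The precision point can be illustrated with a small sketch that compares statistics.mean() against a plain left-to-right float accumulation. The `naive_mean` helper and the data values are made up for this example; it says nothing about how NumPy or pandas sum internally.

```python
import statistics

# Values chosen so that the small terms are lost when added to a huge float.
data = [1e30, 1.0, 3.0, -1e30]


def naive_mean(values):
    """Left-to-right float accumulation with no error compensation."""
    total = 0.0
    for v in values:
        total += v
    return total / len(values)


naive = naive_mean(data)       # 1.0 and 3.0 are absorbed by 1e30
exact = statistics.mean(data)  # sums exactly before dividing

print(naive, exact)
```

The true mean of these four values is 1.0, which statistics.mean() recovers, while the naive accumulation returns 0.0.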

            For the Kafka aggregator, it is not obvious that we need the speed of NumPy or the convenience of pandas. With Faust, we aggregate data in real time over lists of a few hundred to a thousand data points, using aggregation windows of a few seconds, and we have to wait for the window to accumulate the data before we aggregate anyway.

            As long as we don't fall behind, we can prioritize accuracy over speed.

            In particular, statistics.mean() is known to be slow, but a simple test shows it is fast enough for the Kafka aggregator (see attachment). Note that if we use NumPy or pandas, we also have to account for the time to convert lists to NumPy arrays or pandas Series, which is not negligible.
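The attached test is not reproduced here. As a rough stand-in, a timing check along these lines could be used. The window size of 1000 points, the one-second budget, and the variable names are assumptions; the NumPy part runs only if NumPy happens to be installed.

```python
import random
import statistics
import timeit

# Hypothetical window of 1000 data points, the upper end of the sizes mentioned.
random.seed(0)
values = [random.random() for _ in range(1000)]

repeats = 100
per_call = timeit.timeit(lambda: statistics.mean(values), number=repeats) / repeats

# Hypothetical budget: one second, shorter than the windows described.
window_seconds = 1.0
fits_in_window = per_call < window_seconds
print(f"statistics.mean: {per_call * 1e3:.3f} ms per call")

try:
    import numpy as np
except ImportError:
    np = None

if np is not None:
    # Time the list-to-array conversion separately from the mean itself.
    convert = timeit.timeit(lambda: np.asarray(values), number=repeats) / repeats
    array = np.asarray(values)
    mean_only = timeit.timeit(lambda: array.mean(), number=repeats) / repeats
    print(f"numpy convert: {convert * 1e3:.3f} ms, mean: {mean_only * 1e3:.3f} ms")
```

Absolute numbers depend on the machine; what matters is whether the per-call time stays well below the aggregation window.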

            afausti Angelo Fausti added a comment -

            We use the Field class to keep information about the field name to be aggregated and the operation that must be applied.
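One way such a class could look is sketched below. The attribute names, the `OPERATIONS` table, the `apply` method, and the output key format are hypothetical and may differ from the Field class in the PR.

```python
import statistics
from dataclasses import dataclass

# Hypothetical mapping from an operation name to a function over a list of values.
OPERATIONS = {
    "count": len,
    "min": min,
    "mean": statistics.mean,
    "stddev": statistics.stdev,
    "median": statistics.median,
    "max": max,
}


@dataclass(frozen=True)
class Field:
    """Pairs a source field name with the operation to apply to it."""

    name: str
    operation: str

    def apply(self, messages):
        # Collect this field's values from every message, then reduce them.
        values = [m[self.name] for m in messages]
        return OPERATIONS[self.operation](values)


messages = [{"temperature": 20.0}, {"temperature": 22.0}, {"temperature": 24.0}]
fields = [Field("temperature", op) for op in ("min", "mean", "max")]
result = {f"{f.operation}_{f.name}": f.apply(messages) for f in fields}
print(result)
```

Keeping the operation as data on the field would also make it straightforward to move the list of operations into the configuration later.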

            afausti Angelo Fausti added a comment -

            Jonathan Sick, note that PR #5 builds on PR #4 to implement the compute() method.

            If you prefer, you can review PR #5 directly.

            afausti Angelo Fausti added a comment -

            I would very much appreciate your review here.

            jsick Jonathan Sick added a comment -

            Looks good. Some comments in the PR.

            afausti Angelo Fausti added a comment -

            Thanks for the review!


              People

              • Assignee:
                afausti Angelo Fausti
                Reporter:
                afausti Angelo Fausti
                Reviewers:
                Jonathan Sick
                Watchers:
                Angelo Fausti, Jonathan Sick