Details

Type: Story

Status: Done

Resolution: Done

Fix Version/s: None

Component/s: None

Labels: None

Story Points: 4.2

Epic Link:

Team: SQuaRE

Urgent?: No
Description
Compute summary statistics like count, min, mean, stddev, median, and max for the source fields in a list of incoming messages.
These statistics are hard-coded for now. It should not be difficult to make the set of operations configurable, but that is left for the future.
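A minimal sketch of that fixed set of statistics, using the Python statistics module for illustration (the function and field names here are made up, not the aggregator's actual API):

```python
import statistics


def summarize(values):
    """Compute the fixed set of summary statistics for one source field,
    given the list of values accumulated in one aggregation window."""
    return {
        "count": len(values),
        "min": min(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),
        "median": statistics.median(values),
        "max": max(values),
    }


# Example: one field's values over a window of incoming messages
window = [1.0, 2.0, 3.0, 4.0]
print(summarize(window))
```

Making this configurable would amount to mapping operation names in the configuration to callables like these.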
Evaluate whether to use the Python statistics module, NumPy, or Pandas for this.
We have at least three options for computing summary statistics in Python that we could use for the Kafka aggregator: the Python statistics module, NumPy, and Pandas.
While NumPy is built for speed, Python's statistics module is built for precision, and Pandas offers convenience methods for aggregating time-series data, such as pandas.Series.resample().
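To illustrate the Pandas convenience, pandas.Series.resample() can compute several statistics per time window in one call (the timestamps and window size below are invented for the example):

```python
import pandas as pd

# Hypothetical timeseries: one value every 100 ms over one second
idx = pd.date_range("2021-01-01", periods=10, freq="100ms")
s = pd.Series(range(10), index=idx, dtype=float)

# Aggregate into 500 ms windows, computing several statistics at once
agg = s.resample("500ms").agg(["count", "min", "mean", "median", "max"])
print(agg)
```

This is close in spirit to what the aggregator does, except that Faust manages the windowing for us rather than Pandas.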
For the Kafka aggregator, it is not obvious that we need the speed of NumPy or the convenience of Pandas. With Faust we aggregate data in real time, operating over lists of a few hundred to a thousand data points on aggregation windows of a few seconds, and we have to wait for the window to accumulate the data before we aggregate anyway.
As long as we don't fall behind, we can prioritize accuracy over speed.
In particular, statistics.mean() is known to be slow, but a simple test shows it is fine for the purposes of the Kafka aggregator (see attachment). Note that if we use NumPy or Pandas, we also have to account for the time to convert lists to NumPy arrays or Pandas Series, which is not negligible.
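A sketch of the kind of simple test meant here, on a window-sized list (this does not reproduce the attached measurement; absolute numbers will vary by machine):

```python
import statistics
import timeit

import numpy as np

# A window of ~1,000 points, roughly what one aggregation window holds
values = [float(i) for i in range(1000)]

# Time statistics.mean on the raw list
t_stats = timeit.timeit(lambda: statistics.mean(values), number=100)

# Time NumPy, including the list-to-array conversion the aggregator
# would have to pay on every window
t_numpy = timeit.timeit(lambda: np.asarray(values).mean(), number=100)

print(f"statistics.mean over list:  {t_stats:.4f} s / 100 calls")
print(f"np.asarray(...).mean():     {t_numpy:.4f} s / 100 calls")
```

Even if statistics.mean is slower per call, both are far below the window length of a few seconds, which is the budget that matters here.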