# Investigate change in processing time when using pandas in ap_association


#### Details

• Type: Story
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s:
• Labels:
• Story Points:
4
• Sprint:
AP F19-2
• Team:

#### Description

The time needed for AssociationTask increased by a factor of a few after the switch from afw to pandas. This ticket will investigate where the majority of the time is being spent.

#### Attachments

1. output.png
1.49 MB
2. outputSimpleFit.png
1.10 MB
3. outputSimpleFitAllHits.png
1.17 MB
4. outputStack.png
1.23 MB

#### Activity

Chris Morrison added a comment - - edited

Profiled the code (weekly 30) over the full HiTS2015 dataset and found that the majority of the time in ap_association was spent fitting the time series slope/intercept. I will attempt to simplify the fitting to speed up the code.
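For readers who want to reproduce this kind of measurement, here is a minimal sketch using Python's standard-library `cProfile`. The ticket does not say which profiler produced the attached diagrams, so this is an assumption for illustration only; `profile_call` is a hypothetical helper, not part of ap_association.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, text report).

    Hypothetical helper for illustration; not part of the LSST stack.
    """
    profiler = cProfile.Profile()
    # runcall profiles a single function call and returns its result.
    result = profiler.runcall(func, *args, **kwargs)

    # Render the ten most expensive entries by cumulative time into a string.
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


if __name__ == "__main__":
    # Stand-in workload; in practice this would be the task's run method.
    total, report = profile_call(sum, range(1_000_000))
    print(report)
```

Sorting by cumulative time surfaces the call paths that dominate the run, which is the kind of view that points at a single hot function such as a fitter.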

Chris Morrison added a comment - - edited

Ran ap_verify on the full CI-HiTS2015 dataset (6 CCDs' worth of data) and A/B tested the current fitter against a simple matrix solver. The two profiles are attached as outputStack.png (the current weekly version, 30) and outputSimpleFit.png (the simpler flux fitting model).

Looking at the percentages, the new fitter reduces the time spent in _set_flux_stats from 7.96% of processing time to 4.53% of processing time. This reduces the time spent in association.run from 9.77% to 6.40%.

Weekly version:

ticket version:

I've added this quick fix to a branch associated with this ticket.
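As an illustration of what a "simple matrix solver" for a light-curve slope/intercept can look like, here is a minimal sketch that solves the 2x2 weighted normal equations in closed form. It is not the code on the ticket branch: the function name `fit_line`, the inverse-variance weighting, and the argument names are all assumptions made for this example.

```python
def fit_line(times, fluxes, flux_errs):
    """Weighted least-squares fit of flux = slope * time + intercept.

    Illustrative only: solves the 2x2 normal equations directly instead of
    calling a general-purpose fitter. Names and weighting are assumptions.
    """
    # Accumulate the weighted sums that make up the normal equations.
    s_w = s_t = s_f = s_tt = s_tf = 0.0
    for t, f, e in zip(times, fluxes, flux_errs):
        w = 1.0 / (e * e)  # inverse-variance weight
        s_w += w
        s_t += w * t
        s_f += w * f
        s_tt += w * t * t
        s_tf += w * t * f

    # Determinant of [[s_tt, s_t], [s_t, s_w]]; zero when all epochs coincide.
    det = s_w * s_tt - s_t * s_t
    if det == 0.0:
        raise ValueError("need at least two distinct epochs to fit a line")

    # Cramer's rule for the two unknowns.
    slope = (s_w * s_tf - s_t * s_f) / det
    intercept = (s_tt * s_f - s_t * s_tf) / det
    return slope, intercept


if __name__ == "__main__":
    # Points lying exactly on flux = 2 * time + 1.
    print(fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0], [1.0] * 4))
```

A direct solve like this costs one pass over the light curve per object and avoids the setup overhead of a general fitting framework, which is the kind of saving the percentages above point to.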

Chris Morrison added a comment - - edited

Ran the simple fitter (aka matrix solver) over the full HiTS2015 dataset. The time spent in update_dia_objects is reduced by half, from 17.10% to 8.30% of the total run time. The estimation of a linear slope/intercept for a light curve no longer shows up on the timing diagram. _set_flux_stats still takes the majority of the run time in AssociationTask, accounting for 7.22 of the 9.16 percentage points of total run time spent in association.run.

Going forward, calculation of the stetson_J statistic accounts for 4.09% of total run time and is a prime candidate to port to C++. retrieve_dia_objects only accounts for 0.63% of total run time currently and loading DiaSources does not seem to be a concern yet for this dataset.

Chris Morrison added a comment -

Jenkins run: https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/30205/pipeline

Chris Morrison added a comment -

Summary: The change to pandas seems to be a red herring, as much of the processing time was spent in this fitter. That's not to say that optimizations can't be made to how we iterate over pandas data (one suggestion would be to sort on filter at DiaSource load); however, this fix gets the processing time back close to where it was before the switch to pandas.

Eric Bellm added a comment -

Great work!


#### People

Assignee:
Chris Morrison
Reporter:
Chris Morrison
Reviewers:
Eric Bellm
Watchers:
Chris Morrison, Eric Bellm, John Swinbank