# Profile and optimize NMF deblender code

XMLWordPrintable

## Details

• Type: Story
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s: None
• Labels:
None
• Story Points:
8
• Sprint:
DRP S18-4
• Team:
Data Release Production

## Description

The NMF deblender is now working but slow; profile it and come up with a plan for optimization.

## Attachments

1. snakeviz_profile_v2.png
610 kB
2. snakeviz_profile.png
555 kB

## Activity

Hide
Fred Moolekamp added a comment -

I've gone through a few iterations of profiling and optimization now. Most of the speedup came from DM-13912, which improved the runtime by ~25-30% by switching from scipy.sparse matrices to analytic LinearFilter objects in scarlet. There was also a noticeable (~5-10%) speedup using numpy.zeros in place of numpy.zeros_like, which has an extra copy method to handle initializing arrays of strings.

The above image shows the current profile when running snakeviz. The function that takes the most time is apply_filters, which is the function that performs the fractional translations in x and y. This is not due to the speed of the function (which is about twice as fast as prox_weighted_monotonic, for example) but the number of times it is called. prox_weighted_monotonic is called once per source per iteration while apply_filters is called ~7 times per source per iteration.

Peter Melchior and I have discussed optimizing the proxmin package to improve the algorithm we use to fit the solution now that we are no longer using the slower prox_g proximal operators. Once he implements those changes we have to decide how much time we want to spend trying to optimize scarlet.

What is clear from the profile above is that there aren't any individual functions in the code that will give us large gains in optimization. At this point there are just a lot of small functions that cumulatively add up to a large processing time over many iterations (although a single iteration of scarlet is only a few milliseconds per blend). On the other hand, there might be a few places that we can be more clever about the way we calculate certain quantities.

Show
Fred Moolekamp added a comment - I've gone through a few iterations of profiling and optimization now. Most of the speedup came from DM-13912 , which improved the runtime by ~25-30% by switching from scipy.sparse matrices to analytic LinearFilter objects in scarlet. There was also a noticeable (~5-10%) speedup using numpy.zeros in place of numpy.zeros_like , which has an extra copy method to handle initializing arrays of strings. The above image shows the current profile when running snakeviz. The function that takes the most time is apply_filters , which is the function that performs the fractional translations in x and y. This is not due to the speed of the function (which is about twice as fast as prox_weighted_monotonic , for example) but the number of times it is called. prox_weighted_monotonic is called once per source per iteration while apply_filters is called ~7 times per source per iteration. Peter Melchior and I have discussed optimizing the proxmin package to improve the algorithm we use to fit the solution now that we are no longer using the slower prox_g proximal operators. Once he implements those changes we have to decide how much time we want to spend trying to optimize scarlet. What is clear from the profile above is that there aren't any individual functions in the code that will give us large gains in optimization. At this point there are just a lot of small functions that cumulatively add up to a large processing time over many iterations (although a single iteration of scarlet is only a few milliseconds per blend). On the other hand, there might be a few places that we can be more clever about the way we calculate certain quantities.
Hide
Peter Melchior added a comment -

Proxmin has been updated (v0.5.0) to use streamlined block version of the PGM algorithm instead of ADMM. Besides lower overhead, it also allows for Nesterov acceleration with convergence guarantees.

There's a minor change needed in scarlet, which is in PR 42 and should not interfere with other ongoing optimizations.

Show
Peter Melchior added a comment - Proxmin has been updated (v0.5.0) to use streamlined block version of the PGM algorithm instead of ADMM. Besides lower overhead, it also allows for Nesterov acceleration with convergence guarantees. There's a minor change needed in scarlet, which is in PR 42 and should not interfere with other ongoing optimizations.
Hide
Fred Moolekamp added a comment -

With the added optimization by Peter Melchior of the proxmin pacakge, we have chopped down most of the tall trees. There may be a few more clever improvements to shave another few percent off the runtime, so we are likely to revisit a new profile and optimization ticket after DM-11909 is finished and we have runtime data for an entire patch of HSC data.

Below is the latest snakeviz profile:

Show
Fred Moolekamp added a comment - With the added optimization by Peter Melchior of the proxmin pacakge, we have chopped down most of the tall trees. There may be a few more clever improvements to shave another few percent off the runtime, so we are likely to revisit a new profile and optimization ticket after DM-11909 is finished and we have runtime data for an entire patch of HSC data. Below is the latest snakeviz profile:
Hide
Fred Moolekamp added a comment -

We have done the main optimizations this ticket set out to improve. Once we've finished running scarlet on a larger dataset and we have a better idea about how the newest iteration performs we can open another optimization ticket, if necessary.

Show
Fred Moolekamp added a comment - We have done the main optimizations this ticket set out to improve. Once we've finished running scarlet on a larger dataset and we have a better idea about how the newest iteration performs we can open another optimization ticket, if necessary.
Hide
Peter Melchior added a comment -

Agreed. The main load is done by the translation/PSF convolution operator, which is now implemented by our custom code, but still in python. A number of options exist to speed up the for loop and the operation itself.

Show
Peter Melchior added a comment - Agreed. The main load is done by the translation/PSF convolution operator, which is now implemented by our custom code, but still in python. A number of options exist to speed up the for loop and the operation itself.

## People

• Assignee:
Fred Moolekamp
Reporter:
John Swinbank
Reviewers:
Peter Melchior
Watchers:
Fred Moolekamp, John Swinbank, Peter Melchior