When I first implemented the code to skip patches that contain no data, I had to fetch two more things from the repository: the WCS and the bounding box of the source catalog. Fetching that extra data made assembling the inputs take three times as long.
This matters more than I expected, because I had not profiled carefully enough: data access turns out to take about as long as the matching itself.
Tripling the data access time meant that it dominated the run time of the matching task, so skipping patches no longer made any real difference.
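To make the trade-off concrete, here is a minimal sketch of a toy cost model. It assumes (my assumptions, not measurements) that the extra data access cost is paid whether or not a patch is skipped, and that matching time scales linearly with the fraction of patches still matched. The function names and the unit values are hypothetical.

```python
def total_time(access, matching, access_factor=1.0, matched_fraction=1.0):
    """Run time when the access cost is scaled by access_factor and only
    matched_fraction of the original matching work remains."""
    return access * access_factor + matching * matched_fraction


def break_even_fraction(access, matching, access_factor):
    """Largest fraction of the matching work that may remain for patch
    skipping to pay for its extra data access. A value <= 0 means that
    skipping can never win under this model."""
    extra_access = access * (access_factor - 1.0)
    return 1.0 - extra_access / matching


# Illustrative units only: access and matching assumed equal ("comparable").
print(break_even_fraction(access=1.0, matching=1.0, access_factor=3.0))
print(break_even_fraction(access=1.0, matching=1.0, access_factor=1.7))
```

Under these assumptions, a 3x access cost gives a negative break-even fraction, so no amount of skipping helps, while a 1.7x cost breaks even once less than about 30% of the matching work remains.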
If I make some assumptions, I can get the data access time down to only ~70% larger than the original. At that point, skipping patches is a win. For one data point, the same task went from 3m18.166s to 2m48.650s, a reduction of about 15%.
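For reference, the size of that improvement can be worked out from the two timings quoted above. This is a small sketch; `parse_time` is a hypothetical helper that assumes the `XmY.YYYs` format printed by the shell `time` command.

```python
import re


def parse_time(text):
    """Convert a string such as '3m18.166s' into seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", text)
    if match is None:
        raise ValueError(f"unrecognized time format: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


before = parse_time("3m18.166s")  # without skipping patches
after = parse_time("2m48.650s")   # with skipping patches

saved = before - after
print(f"saved {saved:.3f} s")
print(f"run time reduced by {saved / before:.1%}")
print(f"speedup factor {before / after:.3f}")
```

That works out to about 29.5 seconds saved, or a speedup factor of roughly 1.18.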
That is not the factor of two or so I was hoping for, but it’s something.
The takeaway is that data access is not trivial, and we will have to worry about it.
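One lightweight way to keep an eye on this is to time the data access and the matching stages separately. The sketch below uses only the Python standard library; `load_inputs` and `match` are hypothetical stand-ins for the real steps, not the actual task code.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per named stage.
timings = defaultdict(float)


@contextmanager
def timed(stage):
    """Add the wall-clock time spent inside the block to timings[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] += time.perf_counter() - start


def load_inputs():
    """Hypothetical stand-in for fetching inputs from the repository."""
    time.sleep(0.02)
    return list(range(1000))


def match(data):
    """Hypothetical stand-in for the matching step."""
    return sum(data)


with timed("data access"):
    data = load_inputs()
with timed("matching"):
    result = match(data)

total = sum(timings.values())
for stage, seconds in timings.items():
    print(f"{stage}: {seconds:.3f} s ({seconds / total:.0%} of total)")
```

Because the timings accumulate per stage, the same context manager can wrap calls inside a loop over patches and still report one total for each stage.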