Details
- Type: Story
- Status: Done
- Resolution: Done
- Fix Version/s: None
- Component/s: pipe_analysis
- Labels: None
- Story Points: 2
- Epic Link:
- Sprint: DRP F17-4
- Team: Data Release Production
Description
The QA scripts in pipe_analysis do a lot of column accessing and assigning of afw tables. A recent community post highlighted the inefficiency of accessing catalog elements with catalog[fieldName] vs catalog.get(fieldName) (or, better still, catalog.get(fieldNameKey)). Please update the code for speed efficiency according to the findings and recommendations of that community post.
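As a rough illustration of the access patterns being compared (a pure-Python analogue with hypothetical ToySchema/ToyCatalog classes, not the actual lsst.afw.table API), resolving a field name to a key once and reusing it amortizes the per-call schema lookup:

```python
# Pure-Python analogue of the access patterns discussed in this ticket;
# the real lsst.afw.table Catalog/Schema API differs.

class ToySchema:
    def __init__(self, fields):
        # map field name -> integer key, mimicking schema[name].asKey()
        self._keys = {name: i for i, name in enumerate(fields)}

    def key_of(self, name):
        return self._keys[name]  # one-time name -> key resolution

class ToyCatalog:
    def __init__(self, schema, rows):
        self.schema = schema
        self._columns = list(map(list, zip(*rows)))  # column-major storage

    def get(self, name_or_key):
        # accept either a resolved key (int) or a field name (str)
        if isinstance(name_or_key, str):
            name_or_key = self.schema.key_of(name_or_key)  # per-call lookup
        return self._columns[name_or_key]

    __getitem__ = get  # catalog[x] and catalog.get(x) behave identically here

schema = ToySchema(["ra", "dec"])
cat = ToyCatalog(schema, [(1.0, 2.0), (3.0, 4.0)])

ra_key = schema.key_of("ra")  # resolve once...
print(cat[ra_key])            # ...reuse many times: no string lookup
print(cat["ra"])              # name access re-resolves on every call
```

The question the ticket explores is how much, if anything, the key-based variant saves in the real afw implementation.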
Activity
Yeah, I saw that. I've already done most of the work, and I think it's still worthwhile, since I've switched to using catalog.get(fieldNameKey) everywhere (which proved to be faster than catalog.get(fieldName), so it may still be the best bet even with the catalog[fieldName] fix). I'm also figuring that going with keys is more foolproof against potential (however unlikely) future regressions in fieldName getting. If I may, I'll bother you (Eli Rykoff, that is) for the review, and you can let me know if I'm mistaken in any of my conjectures!
Happy to take a look for the review. I'm curious how much faster you find the key access to be? And if you want to be a bit more terse, then catalog[fieldNameKey] should be the same as catalog.get(fieldNameKey) (after the fix!), and looks more "pythonic" to my eye.
A simple timing test on a real catalog does not bode well for your "pythonic" version (the last one; oh, but does DM-11871 fix that too?), and the key access (3rd) is ~2x faster than get(fieldName) (2nd). Here self.col2 is the field name as a string and col2Key = catalog.schema[self.col2].asKey():

catalog[self.col2]:     0.256013154984
catalog.get(self.col2): 0.000240087509155
catalog.get(col2Key):   0.000104188919067  # 2.3x faster than catalog.get(self.col2)
catalog[col2Key]:       0.255188941956
Would you mind having a look? The before and after time python profiles for one of my command lines (on just two ccds) are as follows:

pipe_analysis: master
real  5m4.139s
user  4m3.749s
sys   0m5.680s

pipe_analysis: tickets/DM-11873
real  2m33.943s
user  1m48.800s
sys   0m5.631s
As for your timing tests above, that makes sense. The slow versions are only fixed on afw master as of this afternoon and will be in the next weekly. Then the .get() and [] versions will be the same, and I think that [key] will be as fast as .get(key); I haven't tested that explicitly yet, but from the code there's no reason there should be any difference. One of the mysteries that led me to look into this was that people had said "oh, if you look up the key first it's much faster," but I wasn't seeing that speedup. It turns out the speedup coincidentally appears if you also change to .get().
You also show that there is still a factor-of-two speed-up when you pre-look-up the key. But at a certain point other parts of the code/io become the bottleneck. Of course, an easy factor of two won't hurt!
Will get to the review first thing tomorrow morning.
Indeed. Why don't I do that explicit test with master afw and, assuming you are indeed correct, I'll go with the more pythonic version before you officially review?
Sounds like a good plan. I'll wait for your notice.
Alright, I've done the test using afw:master. Timings are now as follows:

catalog[self.col2]:     0.000131845474243
catalog.get(self.col2): 5.10215759277e-05
catalog.get(col2Key):   2.38418579102e-05
catalog[col2Key]:       1.59740447998e-05

So it seems the more pythonic key access wins (cat[key] is ~1.5x faster than cat.get(key), and it is still ~8x faster than cat[fieldName]). I have updated accordingly, and this is now ready for review.
That's interesting about the relative timing of .get() and the brackets; I'm surprised it's that different, but glad that pythonic wins (because it's more natural).
Though in the current PR I don't know if you do get the advantage of this speedup: I would guess that `key = schema['field'].asKey()` directly followed by `value = cat[key]` would be the same as `value = cat['field']`. The advantage of the pre-lookup of the key is if you're using it multiple times and thus only incurring that overhead once. Can you check this?
Good point... I had looked at this, but was thinking on pre-fix timescales, so a re-look is certainly in order. It seems the key = schema[field].asKey() lookup takes on the order of 5e-05 s. I've added the asKey key-fetching to the timing, and the numbers now look like:

catalog[self.col2]:           0.000203847885132
catalog.get(self.col2):       5.48362731934e-05
asKey + catalog.get(col2Key): 5.10215759277e-05
asKey + catalog[col2Key]:     4.50611114502e-05

So it is still a factor of ~4 faster for the key fetch + [key] access.
Just as a disclaimer, the following comment is far in the weeds, but hopefully not too far. I think we're very far down in bits that probably don't matter much, and the code updates as they are plus (especially) the afw update will get you a huge performance boost and that's a big win. But I would like to understand what's going on here with these timing differences.
That said, where is the code you are using for the timing tests? With my simple tables, I can't seem to replicate them, in that it looks pretty similar no matter what I do. I wonder if there's some sort of caching going on behind the scenes?
As for profiling the different code-paths for asKey + catalog[colKey] vs catalog[field] the difference seems to be that the first uses key = schema[field].asKey() (as you say), and the second the equivalent of key = schema.find(field).key. But testing these two ways of accessing keys on their own I find the same performance (but with the ipython complaint about possible caching).
Not too far at all! I think I've just confirmed your caching suspicion. I was running the timing, simply doing:

from timeit import default_timer as timer

start = timer()
catalog[fieldName]
end = timer()
time1 = end - start
print 'catalog[fieldName]: ', time1

within my analysis script and trying the different methods in sequence. I just tried doing them one at a time (editing and then doing a new run) to eliminate any possibility of caching. Indeed, they all seem to have about the same performance when run independently. If I run one and then another in sequence, the second (and any subsequent) access, regardless of order and even if the same method is simply repeated, gets the speedup of ~4x.
So, for code simplicity, I'm thinking it would be best to move to all calls being just catalog[fieldName]. Let me know if you disagree.
I am also now wondering if there is any use case where looking up the key provides an advantage. Re-use doesn't seem to apply given that, after a first access call (with any key and/or fieldName fetching method), all subsequent calls (whether identical to or different from the first) get the speedup.
Okay, that's interesting on the follow-up speed-up of whichever you do next. So the system is working appropriately, and any of these methods is equally good, and behind-the-scenes no matter what you do the caching seems to work. So that's all excellent news: there shouldn't be any significant performance difference based on your exact choice of incantation. All good!
So I 100% agree that for simplicity catalog[fieldName] is the way to go. It's easy to read, pythonic (or at least numpythonic) and performs well.
As to your question on whether pre-fetching the key helps, I came up with the following toy example:
import lsst.afw.table as afwTable
import numpy as np

schema = afwTable.Schema()
nCol = 10
for i in xrange(nCol):
    schema.addField('col%05d' % (i), type=np.float32, doc='column')

testCat = afwTable.BaseCatalog(schema)
nRow = 1000
testCat.table.preallocate(nRow)

for i in xrange(nRow):
    rec = testCat.addNew()
    for j in xrange(nCol):
        rec['col%05d' % (j)] = j

def test1():
    # look up each column by name inside the inner loop
    for i in xrange(nRow):
        for j in xrange(nCol):
            test = testCat['col%05d' % (j)][i]

def test2():
    # prefetch the keys once, then reuse them in the loop
    keys = []
    for j in xrange(nCol):
        keys.append(testCat.schema['col%05d' % (j)].asKey())

    for i in xrange(nRow):
        for j in xrange(nCol):
            test = testCat[keys[j]][i]
The point here is that we're doing row-wise and column-wise access of a long table, so you save a bunch of time if you prefetch the keys:
In [14]: %timeit test1()
1 loop, best of 3: 296 ms per loop

In [15]: %timeit test2()
10 loops, best of 3: 113 ms per loop
But if you're doing column-wise access of the whole table, it definitely won't make any difference.
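The same row-wise vs column-wise distinction can be illustrated with a NumPy structured array as a stand-in for an afw table (a hypothetical analogue, not the actual afw API): for row-wise access the thing to hoist out of the inner loop is the per-name column lookup, while column-wise access touches each name only once, so there is nothing to amortize.

```python
import numpy as np

# Build a stand-in "catalog": a structured array with nCol float32 columns.
n_row, n_col = 1000, 10
names = ["col%05d" % j for j in range(n_col)]
cat = np.zeros(n_row, dtype=[(name, np.float32) for name in names])
for j, name in enumerate(names):
    cat[name] = j  # fill column j with the value j

# Row-wise access: prefetch the columns once so the name lookup does
# not happen n_row * n_col times inside the loop.
cols = [cat[name] for name in names]
total = 0.0
for i in range(n_row):
    for j in range(n_col):
        total += cols[j][i]

# Column-wise access: one name lookup per column either way, so
# prefetching buys nothing here.
col_sums = [cat[name].sum() for name in names]
print(total, col_sums[3])
```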
Ah, right...and I did have an instance of row access where I'm now doing key prefetching.
Ok, I think this is finally good to go. It's a lot less churn than previous iterations, but I did find other time-wasting issues that I've fixed along the way, so this was still worthwhile (and a learning experience!). Can you give the official review?
I have to go pick up the kids, but it's the top of my list.
Approved by Eli Rykoff on the PR
A full visit now runs in ~20min (compared to 100min+ previously).
With the quick fix from Jim Bosch for DM-11871, this might be unnecessary come the next weekly.