Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-11873

Optimize QA code for speed in accessing/assigning afw table columns

    XMLWordPrintable

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: pipe_analysis
    • Labels:
      None
    • Story Points:
      2
    • Epic Link:
    • Sprint:
      DRP F17-4
    • Team:
      Data Release Production

      Description

      The QA scripts in pipe_analysis do a lot of column accessing & assigning of afw tables. A recent community post highlighted the inefficient nature of accessing catalog elements with catalog[fieldName] vs catalog.get(fieldName) (or, even better, catalog.get(filedNameKey)). Please update the code for speed efficiency according to the findings and recommendations of this community post.

        Attachments

          Issue Links

            Activity

            Hide
            erykoff Eli Rykoff added a comment - - edited

            Okay, that's interesting on the follow-up speed-up of whichever you do next. So the system is working appropriately, and any of these methods is equally good, and behind-the-scenes no matter what you do the caching seems to work. So that's all excellent news: there shouldn't be any significant performance difference based on your exact choice of incantation. All good!

            So I 100% agree that for simplicity catalog[fieldName] is the way to go. It's easy to read, pythonic (or at least numpythonic) and performs well.

            As to your question on whether pre-fetching the key helps, I came up with the following toy example:

            import lsst.afw.table as afwTable
            import numpy as np
            schema = afwTable.Schema()
            nCol = 10
            for i in xrange(10):
                schema.addField('col%05d' % (i), type=np.float32,doc='column')
             
            testCat = afwTable.BaseCatalog(schema)
            nRow = 1000
            testCat.table.preallocate(nRow)
            for i in xrange(nRow):
                rec=testCat.addNew()
                for j in xrange(nCol):
                    rec['col%05d' % (j)] = j
             
            def test1():
                for i in xrange(nRow):
                    for j in xrange(nCol):
                        test = testCat['col%05d' % (j)][i]
             
            def test2():
                keys = []
                for j in xrange(nCol):
                    keys.append(testCat.schema['col%05d' % (j)].asKey())
             
                for i in xrange(nRow):
                    for j in xrange(nCol):
                        test = testCat[keys[j]][i]
            
            

            The point here is that we're doing row-wise and column-wise access of a long table, so you save a bunch of time if you prefetch the keys:

            In [14]: %timeit test1()
            1 loop, best of 3: 296 ms per loop
             
            In [15]: %timeit test2()
            10 loops, best of 3: 113 ms per loop
            

            But if you're doing column-wise access of the whole table, it definitely won't make any difference.

            Show
            erykoff Eli Rykoff added a comment - - edited Okay, that's interesting on the follow-up speed-up of whichever you do next. So the system is working appropriately, and any of these methods is equally good, and behind-the-scenes no matter what you do the caching seems to work. So that's all excellent news: there shouldn't be any significant performance difference based on your exact choice of incantation. All good! So I 100% agree that for simplicity catalog [fieldName] is the way to go. It's easy to read, pythonic (or at least numpythonic) and performs well. As to your question on whether pre-fetching the key helps, I came up with the following toy example: import lsst.afw.table as afwTable import numpy as np schema = afwTable.Schema() nCol = 10 for i in xrange ( 10 ): schema.addField( 'col%05d' % (i), type = np.float32,doc = 'column' )   testCat = afwTable.BaseCatalog(schema) nRow = 1000 testCat.table.preallocate(nRow) for i in xrange (nRow): rec = testCat.addNew() for j in xrange (nCol): rec[ 'col%05d' % (j)] = j   def test1(): for i in xrange (nRow): for j in xrange (nCol): test = testCat[ 'col%05d' % (j)][i]   def test2(): keys = [] for j in xrange (nCol): keys.append(testCat.schema[ 'col%05d' % (j)].asKey())   for i in xrange (nRow): for j in xrange (nCol): test = testCat[keys[j]][i] The point here is that we're doing row-wise and column-wise access of a long table, so you save a bunch of time if you prefetch the keys: In [ 14 ]: % timeit test1() 1 loop, best of 3 : 296 ms per loop   In [ 15 ]: % timeit test2() 10 loops, best of 3 : 113 ms per loop But if you're doing column-wise access of the whole table, it definitely won't make any difference.
            Hide
            lauren Lauren MacArthur added a comment -

            Ah, right...and I did have an instance of row access where I'm now doing key prefetching.

            Ok, I think this is finally good-to-go. It's a lot less churn than previous iterations, but I did find other time-wasting issues that I've fixed along the way, so this was still worthwhile (and a learning experience!) Can you give the official review?

            Show
            lauren Lauren MacArthur added a comment - Ah, right...and I did have an instance of row access where I'm now doing key prefetching. Ok, I think this is finally good-to-go. It's a lot less churn than previous iterations, but I did find other time-wasting issues that I've fixed along the way, so this was still worthwhile (and a learning experience!) Can you give the official review?
            Hide
            erykoff Eli Rykoff added a comment -

            I have to go pick up the kids, but it's the top of my list.

            Show
            erykoff Eli Rykoff added a comment - I have to go pick up the kids, but it's the top of my list.
            Hide
            lauren Lauren MacArthur added a comment -

            Approved by Eli Rykoff on the PR

            Show
            lauren Lauren MacArthur added a comment - Approved by Eli Rykoff on the PR
            Hide
            lauren Lauren MacArthur added a comment -

            A full visit now runs in ~20min (compared to 100min+ previously).

            Show
            lauren Lauren MacArthur added a comment - A full visit now runs in ~20min (compared to 100min+ previously).

              People

              Assignee:
              lauren Lauren MacArthur
              Reporter:
              lauren Lauren MacArthur
              Reviewers:
              Eli Rykoff
              Watchers:
              Eli Rykoff, John Swinbank, Lauren MacArthur
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  Jenkins

                  No builds found.