Data Management / DM-19461

Investigate and possibly fix afw::table Record allocation performance

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: afw
    • Labels:
      None

      Description

      A recent discussion with John Parejko relayed a comment from Pierre Astier to the effect that at least some of the performance problems in reading afw.table files are due to a lot of time spent in malloc.  I have a guess at where that is happening - in the creation of shared_ptrs for all of the records - and, if that's correct, an idea for how to possibly fix it without breaking anything (by allocating extra space for the Record objects in the larger block the table allocates to hold the actual data, and then using std::allocate_shared to avoid a separate heap allocation when constructing each shared pointer).

      That still uses the same amount of total memory, so unless it reduces fragmentation in a subtle way, it's unlikely to address DM-15578 - and the existence of that issue makes me think there must be something else quite wasteful going on, even if I'm right that we could avoid some malloc calls with the proposal above.

      John Parejko, Paul Price, Eli Rykoff: if any of your recent profiling produced data that could confirm or reject this hypothesis, please let me know.  Otherwise I'll try to profile this myself and put together a fix if I'm right, but I'm afraid I can't put this at the top of my priority list right now.

       

            Activity

            jbosch Jim Bosch added a comment -

            Total time to read the catalog, which always contained 10k rows.  I was only interested in how the minimum time for each record size shifted around, so it was actually nice not to normalize them so the lines didn't lie on top of each other.

            I'm rerunning now with 50k records to make sure I'm not hitting small numbers of chunks for the larger nRowsToPrep values.

            jbosch Jim Bosch added a comment -

            I think I'm done with both new functionality and addressing review comments.

            I did one more test with the C++ snippet reading Eli Rykoff's file, and it's happily the fastest we've seen:

            real    0m44.052s
            user    0m30.734s
            sys     0m12.929s

            Now I'll go build a stack on lsst-dev (all of my tests thus far were on tiger) and make extra certain I haven't pessimized reading src catalogs.  I plan to merge after both that and the Jenkins run I just kicked off finish, assuming all goes well.

            jbosch Jim Bosch added a comment -

            Using Eli Rykoff's script to read src on lsst-dev03 yields 0.29s the first time, and then a repeatable 0.021s after that (all on the ticket branch) - apparently disk caching is quite important for this test.  So I went back to the last weekly and tried that, and get a repeatable 0.044s (unsurprisingly, I can't reproduce the cold-cache slow first read now that the cache is hot).  Anyhow, that's not quite consistent with what's reported above, but the trend is positive no matter how you look at it, and that's good enough for me.

            erykoff Eli Rykoff added a comment -

            I have also built the latest ticket branch on lsst-dev01 and confirm that I get the best performance I've seen on both my giant table (40.5 s, second try after heating up the cache), and on the small source catalog (0.042 s after heating up the cache). And for the small catalog it's faster than astropy, so that's cool.

            For reference, the small source catalog I'm testing with reads 2x faster than on the last weekly, and the giant catalog is a whopping 6.5x faster with a lot less memory overhead.

            Because of caching, I'm not sure if we'll actually realize the performance improvement for small catalogs in practice (because where it matters is the cumulative time of reading a lot of source catalogs, which are probably not going to be cached), but I for one will see the improvement in large catalogs immediately, both in time and memory.

            jbosch Jim Bosch added a comment -

            Merged to master.


              People

              • Assignee:
                jbosch Jim Bosch
                Reporter:
                jbosch Jim Bosch
                Reviewers:
                Paul Price
                Watchers:
                Eli Rykoff, Jim Bosch, Michael Wood-Vasey, Paul Price, Tim Jenness
              • Votes:
                0

                Dates

                • Created:
                  Updated:
                  Resolved:
