  Data Management
  DM-19388

Investigate timing of Pandas dax_ppdb interface

    Details

      Description

      DM-18735 created code paths for reading and writing Pandas DataFrames into the Ppdb using the standard Pandas SQL interfaces. It is not currently known how efficient these interfaces are for reads and writes. This ticket will investigate the difference in run time between the two methods using Andy Salnikov's test code in dax_ppdb and ap_pipe.

        Attachments

          Issue Links

            Activity

            cmorrison Chris Morrison added a comment - - edited

            Jenkins: https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/29747/pipeline

            We compare the speed of the `pandas` wrappers for SQLAlchemy with those already in place using `afw`. Here we check only that the speed is not significantly degraded when using `pandas`, rather than trying to improve the read/write performance of the Ppdb interface itself.

            We use two methods. First, we use two ccds (24, 25) across all visits (~60) in the `ap_verify_hits2015` dataset. Second, we use the program from the `l1dbproto` package to simulate reads and writes for DiaObjects, DiaSources, and DiaForcedSources. Unless explicitly stated, times include both conversion of the catalog object into the SQL query and processing within the DB. The DB backend used is `sqlite3` with the "baseline" configuration for the Ppdb. Currently, the `afw` option loops over every individual catalog row and column to translate the data to and from the Ppdb. Inserts using `pandas` are not done much differently, though the code appears to do something more clever for selects. One final point: for selects, both code paths (`afw` and `pandas`) use the same SQL statements. For inserts, some of the Ppdb interface's optimizations are not currently used when inserting from `pandas` due to the API of the `DataFrame.to_sql` method.
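As a rough illustration (not the actual dax_ppdb code; table and column names are made up), the two insert styles being compared look roughly like this against a toy `sqlite3` table:

```python
import sqlite3

import pandas as pd

# Toy table standing in for a Ppdb table; schema is illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DiaObject (diaObjectId INTEGER, ra REAL, decl REAL)")

df = pd.DataFrame({"diaObjectId": [1, 2], "ra": [10.0, 10.1], "decl": [-5.0, -5.1]})

# afw-style path: loop over every row, binding values one INSERT at a time.
for row in df.itertuples(index=False):
    conn.execute("INSERT INTO DiaObject VALUES (?, ?, ?)", tuple(row))

# pandas path: hand the whole DataFrame to to_sql in a single call.
df.to_sql("DiaObject", conn, if_exists="append", index=False)

n = conn.execute("SELECT COUNT(*) FROM DiaObject").fetchone()[0]
print(n)  # 4: two rows from each path
```

The timing difference measured on this ticket is dominated by how much of the per-row, per-column work happens in Python in each path.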

            For the HiTS dataset, the inserts (`storeDiaObjects`, `storeDiaSources`) for DiaObjects and DiaSources are comparable between `pandas` and `afw`. For select statements (`getDiaObjects`, `getDiaSources`), the `pandas` statements are noticeably faster than those using `afw`. The jump in time for the different reads is due to the ccd/visit moving to a different, non-overlapping portion of the sky.

            For the `l1dbproto` simulations we run 2000 visits (roughly 2 days) and plot the average read/write time as a function of visit, as a rolling 100-visit average. In this case, the `pandas` layer performs more quickly for both inserts and selects. For selects, `pandas` is 6 times faster on average over the 2000-visit run. Note that the reads here include time for both DiaSources and DiaForcedSources. For inserts, `pandas` is on average 1.7 times faster. One limitation compared to the previous test is that only roughly 10 or fewer columns are used when inserting, versus the previous test, which writes upwards of 100 columns/values into the DB per object/source.
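For reference, the rolling 100-visit average used in these plots is the standard pandas rolling mean; a minimal sketch with synthetic timings (the real per-visit times come from the `l1dbproto` runs):

```python
import numpy as np
import pandas as pd

# Synthetic per-visit timings standing in for the measured read/write times.
rng = np.random.default_rng(42)
times = pd.Series(rng.normal(1.0, 0.1, size=2000))

# Rolling 100-visit mean, as plotted against visit number.
smoothed = times.rolling(window=100).mean()
print(int(smoothed.notna().sum()))  # 1901: the first 99 visits lack a full window
```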

            Both speed-ups are likely due to `pandas` being more efficient at converting the `DataFrame` object to SQL. In the select case, both code paths use the same SQL statements and can be compared directly. For inserts, the two use slightly different statements, with the `afw` path being slightly more optimized in its insert statement. Further investigations could examine whether these speed-ups hold for larger (~50k visit) runs of `l1dbproto` and better database backends (e.g. PostgreSQL). No multi-threading, as is available in `l1dbproto`, is used.

            cmorrison Chris Morrison added a comment -

            Over the course of this ticket, I found two bugs in the `pandas` implementation. First, `DataFrame.append` does not append to the `DataFrame` in place and instead returns a new `DataFrame`. This means that the loop over multiple DiaObject ids when selecting DiaSources was limited to a maximum of 1000 ids instead of all inserted ids. Also, the "multi" config in `DataFrame.to_sql` that was previously used has been removed. "multi" has limits on the number of rows that can be inserted, leading to crashes, and, as the `pandas` and `SQLAlchemy` docs mention, is slower than a standard insert. I've also added a `with Timer` statement around each of the `pandas` insert statements that was not there previously.
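The first bug is the classic ignored-return-value pattern. A minimal sketch with made-up ids (using `pd.concat`, which replaces the since-removed `DataFrame.append` but has the same not-in-place behavior):

```python
import pandas as pd

ids = pd.DataFrame({"diaObjectId": [1, 2]})
more = pd.DataFrame({"diaObjectId": [3, 4]})

# Buggy pattern: concat/append return a NEW frame; discarding the return
# value leaves `ids` unchanged, which is how the id list ended up capped.
pd.concat([ids, more])  # result silently dropped
assert len(ids) == 2    # still only the original rows

# Fixed pattern: rebind the name to the returned frame.
ids = pd.concat([ids, more], ignore_index=True)
print(len(ids))  # 4
```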

            tjenness Tim Jenness added a comment -

            Is it possible that the conclusions here will be affected by the massive speed ups in afw table that Jim Bosch is delivering in DM-19461?

            cmorrison Chris Morrison added a comment -

            Doubt it. Much of the problem here is having to loop over each record, and then each value, in a Python for loop. Unless the speed-ups with afw are API changes that make it easier to slice columns/rows, this likely won't change.

            cmorrison Chris Morrison added a comment -

            Summary of the ratios between the 100-visit averages for afw vs pandas.

            afw time to pandas time [afw/pandas]:
            DiaObject select: 6.42
            DiaSource and DiaForcedSource select: 6.04
            DiaObject/Source/ForcedSource insert: 1.74

            salnikov Andy Salnikov added a comment - - edited

            Looks OK in general, though I have to say that the numbers quoted don't make a lot of sense. There is clearly an overhead in conversion to/from afwtable, but this test exaggerates it too much. Running ap_proto on the whole FOV does not match normal per-CCD mode; with individual CCDs we'll have a factor of 200 less data per individual process and a smaller volume of data to convert in a single query. Second, the "baseline" indexing mode is not the one that should be used for any serious work, as it has terrible performance for other reasons too. "last_object_table" is the most reasonable one for now, and it reads much less data from DiaObject with significantly lower overhead too. Having said that, I'm happy that pandas has better performance in converting data in memory (it may also be that my implementation of that conversion is inefficient); hopefully I won't have to work with afwtable any more. But I believe that Pandas will have its own set of issues we'll have to look at. You have removed the method="multi" option in this commit, but without that option non-sqlite databases will likely have very non-optimal performance (per-query latency is much larger for network compared to disk). Unfortunately sqlalchemy in some cases performs terribly with the "multi" option, which is why I had to implement query generation in Ppdb myself, and very likely we'll need to do that for pandas too.
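For context, the method="multi" behavior discussed here can be sketched against a toy `sqlite3` table (names are illustrative; the bound-parameter limit that caused the crashes is backend-dependent):

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
df = pd.DataFrame({"diaSourceId": list(range(5)), "flux": [1.0] * 5})

# Default method: an executemany-style INSERT per chunk (what the fix reverted to).
df.to_sql("DiaSource", conn, index=False)

# method="multi" packs many rows into one multi-VALUES statement. Fewer
# round-trips help on a networked backend, but large frames can exceed the
# backend's bound-parameter limit, so chunksize must be kept small.
df.to_sql("DiaSource", conn, index=False, if_exists="append",
          method="multi", chunksize=2)

n_rows = conn.execute("SELECT COUNT(*) FROM DiaSource").fetchone()[0]
print(n_rows)  # 10: five rows from each call
```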

            cmorrison Chris Morrison added a comment -

            Thanks Andy. To be clear, I never thought of this as a comparison to the numbers in your DMTN but, as you point out, this really only shows the relative performance in memory. One thing I will point out, though, is that the line plots (with x axis marked "ccdVisit #") are running over the smaller, per-ccd amount of data you mention, and the conversion ratio on selects between the two output data types (afw vs pandas) still holds there.

            I looked over the Pandas code a bit last night and think I, in part, understand how and where we could port your specific Ppdb insert code. To be fair, Pandas isn't really doing anything much smarter than `for` looping in python (I've checked), but it's at least a lot easier for me to use, given that it does column naming and typing automatically versus me having to make sure the afw schemas are precisely and painfully aligned. Going forward it looks like Pandas will be the choice, as that is currently what is being produced by DRP in their Science Data Translator (or whatever it's called), which I hope to re-use for producing catalogs with calibrated data and correctly named columns.

            I'll merge the fixes I've found.


              People

              • Assignee:
                cmorrison Chris Morrison
                Reporter:
                cmorrison Chris Morrison
                Reviewers:
                Andy Salnikov
                Watchers:
                Andy Salnikov, Chris Morrison, Colin Slater, Eric Bellm, John Swinbank, Tim Jenness
              • Votes:
                0
                Watchers:
                6

                Dates

                • Created:
                  Updated:
                  Resolved:
