# Investigate timing of Pandas dax_ppdb interface


## Details

• Type: Story
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s:
• Labels:
• Story Points: 4
• Sprint: AP S19-6
• Team:

## Description

DM-18735 created code paths for reading and writing Pandas DataFrames into the Ppdb using the standard Pandas SQL interfaces. It is not currently known how efficient these interfaces are for reads and writes. This ticket will investigate that question using Andy Salnikov's test code in dax_ppdb and ap_pipe, measuring the difference in run time between the two methods.

## Attachments

1. image-2019-04-29-11-45-59-782.png
114 kB
2. image-2019-04-29-15-07-37-358.png
112 kB
3. image-2019-04-29-15-08-06-578.png
91 kB
4. image-2019-04-29-15-08-14-310.png
74 kB
5. image-2019-04-29-15-12-34-883.png
72 kB
6. image-2019-04-29-15-12-41-207.png
78 kB
7. image-2019-04-29-15-12-52-522.png
82 kB

## Activity

Chris Morrison added a comment - edited

We compare the speed of the pandas wrappers for SQLAlchemy with those already in place using afw. Here we only check that the speed is not significantly degraded when using pandas, rather than trying to improve the read/write performance of the Ppdb interface itself.

We use two methods. First, we use two ccds (24, 25) across all visits (~60) in the ap_verify_hits2015 dataset. Second, we use the program from the l1dbproto package to simulate reads and writes for DiaObjects, DiaSources, and DiaForcedSources. Unless explicitly stated, times include both conversion of the catalog object into the SQL query and processing within the DB. The DB backend used is sqlite3 with the "baseline" configuration for the Ppdb. Currently, the afw option loops over every individual catalog row and column to translate the data to and from the Ppdb. Inserts using pandas are not done much differently, though the code appears to do something more clever for selects. One final point: for the selects, both code paths (afw and pandas) use the same SQL statements. For inserts, some of the Ppdb interface's optimizations are not currently used when inserting from pandas, due to the API of the DataFrame.to_sql method.
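The two insert paths described above can be sketched roughly as follows. This is a minimal illustration against an in-memory sqlite3 database; the table name and columns are invented placeholders, not the actual Ppdb schema:

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DiaObject (diaObjectId INTEGER, ra REAL, decl REAL)")

df = pd.DataFrame({"diaObjectId": [1, 2],
                   "ra": [155.0, 155.1],
                   "decl": [-5.0, -4.9]})

# afw-style path (schematic): iterate over every record and value in
# Python, building one parameter tuple per row.
# (.values.tolist() also converts numpy scalars to native Python types.)
rows = df.values.tolist()
conn.executemany("INSERT INTO DiaObject VALUES (?, ?, ?)", rows)

# pandas path: hand the whole DataFrame to to_sql in a single call.
df.to_sql("DiaObject", conn, if_exists="append", index=False)

count = conn.execute("SELECT COUNT(*) FROM DiaObject").fetchone()[0]
print(count)  # 4 — each path inserted 2 rows
```

The real afw path additionally translates each value through the afw schema, which is the per-row, per-column overhead being measured here.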

For the HiTS dataset, the inserts (storeDiaObjects, storeDiaSources) for DiaObjects and Sources are comparable between pandas and afw. For select statements (getDiaObjects, getDiaSources), the pandas statements are noticeably faster compared to those using afw. The jump in time for the different reads is due to the ccd/visit moving to a different, non-overlapping portion of the sky.

For the l1dbproto simulations we run 2000 visits (roughly 2 days of observing) and plot the average read/write time as a function of visit, using a rolling 100-visit average. In this case, the pandas layer is faster for both inserts and selects. For selects, pandas is 6 times faster on average over the 2000-visit run. Note that the reads here include time for both DiaSources and DiaForcedSources. For inserts, pandas is on average 1.7 times faster. One limitation compared to the previous test is that only roughly ~10 or fewer columns are used when inserting, whereas the previous test writes upwards of 100 columns/values into the DB per object/source.
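The rolling 100-visit average used for the plots is straightforward to reproduce in pandas. The timings below are synthetic stand-ins for the real l1dbproto measurements:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic per-visit select times (seconds); the values are made up,
# only the smoothing procedure matches what was plotted.
times = pd.Series(rng.gamma(shape=2.0, scale=0.05, size=2000),
                  name="select_time")

# 100-visit rolling mean, as a function of visit number.
smoothed = times.rolling(window=100).mean()

print(len(smoothed))               # 2000
print(smoothed[:99].isna().all())  # True: first 99 visits have no full window
```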

Both speed-ups are likely due to pandas being more efficient at converting the DataFrame object to SQL. In the select case, both code paths use the same SQL statements and can be compared directly. For the inserts, the two use slightly different statements, with the afw path being slightly more optimized in its insert statement. Further investigations could check whether these speed-ups hold for larger, ~50k-visit runs of l1dbproto and for other database backends (e.g. PostgreSQL). No multi-threading, as is available in l1dbproto, is used.

Jenkins: https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/29747/pipeline
Chris Morrison added a comment -

Over the course of this ticket, I found two bugs in the pandas implementation. First, DataFrame.append does not append to the DataFrame in place and instead returns a new DataFrame. This means that the loop over multiple DiaObject ids when selecting DiaSources was limited to a maximum of 1000 ids instead of all inserted ids. Also, the "multi" option in the DataFrame.to_sql call that was previously used has been removed. "multi" has limits on the number of rows that can be inserted, leading to crashes, and, as the pandas and SQLAlchemy docs mention, is slower than a standard insert. I've also added a with Timer statement around each of the pandas insert statements that was not there previously.
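The first bug comes down to DataFrame.append returning a copy rather than mutating in place (append itself was later removed in pandas 2.0 in favor of pd.concat, which behaves the same way). A minimal illustration of the pitfall:

```python
import pandas as pd

ids = pd.DataFrame({"diaObjectId": [1, 2]})
more = pd.DataFrame({"diaObjectId": [3, 4]})

# Buggy pattern: calling append/concat and discarding the result leaves
# the original frame untouched, so a later select sees too few ids.
pd.concat([ids, more], ignore_index=True)
print(len(ids))  # still 2

# Fix: rebind the returned frame.
ids = pd.concat([ids, more], ignore_index=True)
print(len(ids))  # 4
```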

Tim Jenness added a comment -

Is it possible that the conclusions here will be affected by the massive speed ups in afw table that Jim Bosch is delivering in DM-19461?

Chris Morrison added a comment -

Doubt it. Much of the problem here is having to loop over each record in a Python for loop and then over each value. Unless the afw speed-ups are API changes that make it easier to slice columns/rows, this likely won't change.

Chris Morrison added a comment -

Summary of the ratio between the 100-visit averages for afw vs pandas.

afw time to pandas time: [afw/pandas]
DiaObject select: 6.42
DiaSource and DiaForcedSource select: 6.04
DiaObject/Source/ForcedSource insert: 1.74

Andy Salnikov added a comment - edited

Looks OK in general, though I have to say the quoted numbers don't make a lot of sense. There is clearly an overhead in conversion to/from afw.table, but this test exaggerates it too much. Running ap_proto on the whole FOV does not match normal per-CCD mode; with individual CCDs we'll have a factor of 200 less data per individual process and a smaller volume of data to convert in a single query. Second, the "baseline" indexing mode is not the one that should be used for any serious work, as it has terrible performance for other reasons too. "last_object_table" is the most reasonable mode for now, and it reads much less data from DiaObject with significantly lower overhead. Having said that, I'm happy that pandas has better performance in converting data in memory (it may also be that my implementation of that conversion is inefficient); hopefully I won't have to work with afw.table any more. But I believe that Pandas will have its own set of issues we'll have to look at. You removed the method="multi" option in this commit, but without that option a non-sqlite database will likely have very non-optimal performance (per-query latency is much larger over a network than to disk). Unfortunately, sqlalchemy in some cases performs terribly with the "multi" option, which is why I had to implement query generation in Ppdb myself, and very likely we'll need to do that for pandas too.
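For reference, the option in question is passed to DataFrame.to_sql as below. This is a sketch against in-memory sqlite with an invented table name; with a network backend, the trade-off is between per-row round trips (default) and the backend's limit on bound parameters per statement ("multi"):

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
df = pd.DataFrame({"id": range(5), "flux": [0.1, 0.2, 0.3, 0.4, 0.5]})

# Default method=None: rows are inserted via executemany, one
# parameter set per row.
df.to_sql("dia_source", conn, if_exists="replace", index=False)

# method="multi": a single INSERT with a multi-row VALUES clause per
# chunk; fewer round trips, but each statement carries
# rows x columns bound parameters, which can exceed backend limits.
df.to_sql("dia_source", conn, if_exists="append", index=False,
          method="multi", chunksize=500)

n = conn.execute("SELECT COUNT(*) FROM dia_source").fetchone()[0]
print(n)  # 10
```

Tuning chunksize caps how many rows end up in one multi-row statement, which is one way to stay under a backend's parameter limit.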

Chris Morrison added a comment -

Thanks Andy. To be clear, I never thought of this as a comparison to the numbers in your DMTN; as you point out, this really only shows the relative in-memory conversion performance. One thing I will point out, though, is that the line plots (with x axis marked "ccdVisit #") are running over the smaller, per-ccd amount of data you mention, and the select->conversion ratio between the two output data types (afw vs pandas) still holds there.

I looked over the Pandas code a bit last night and think I, in part, understand how and where we could port your specific Ppdb insert code. To be fair, Pandas isn't really doing anything much smarter than for-looping in Python (I've checked), but it's at least a lot easier for me to use, given that it handles column naming and typing automatically versus my having to make sure the afw schemas are precisely and painfully aligned. Going forward it looks like Pandas will be the choice, as that is currently what is produced by DRP in their Science Data Translator (or whatever it's called), which I hope to re-use for producing catalogs with calibrated data and correctly named columns.

I'll merge the fixes I've found.


## People

• Assignee: Chris Morrison
• Reporter: Chris Morrison
• Reviewers: Andy Salnikov
• Watchers: Andy Salnikov, Chris Morrison, Colin Slater, Eric Bellm, John Swinbank, Tim Jenness