We compare the the speed of the `pandas` wrappers for SQLAlchemy with those already in place using `afw`. We look here only that the speed is not significantly degraded when using `pandas` rather than try to improve the ability of the Ppdb interface to read and write.
We use 2 methods: First, we use 2 ccds (24, 25) across all visits (~60) in the `ap_verify_hits2015` dataset. Second, we use the program from the `l1dbproto` package to simulate reads and writes for DiaObjects, DiaSources, and DiaForcedSources. Unless explicitly stated, times include both conversion of the catalog object into the SQL query and processes within the DB. The DB backend used is `sqlite3` with the "baseline" configuration for the Ppdb. Currently, the `afw` option loops over every individual catalog row and column to translate the data too and from the Ppdb. This isn't done much differently for inserts using pandas, though the code seems to be doing something more clever for selects. One final thing to point out is that for the selects, both code paths (`afw` and `pandas`) are using the same SQL statements. For inserts, some of the Ppdb interface's optimizations are not currently used when inserting from `pandas` due to the API of the `DataFrame.to_sql` method.
For the HiTS dataset, the inserts (`storeDiaObjects`, `storeDiaSources`) for DiaObjects and Sources are comparable between `pandas` and `afw`. For select statements (`getDiaObjects`, `getDiaSources`), the `pandas` statements are noticeably faster compared to those using `afw`. The jump in time for the different reads is due to the ccd/visit moving to a different, non-overlapping portion of the sky.
For the `l1dbproto` simulations we run 2000 visits (roughly 2 days) and plot the average read/write time as a function of visit as a rolling 100 visit average. In this case, the `pandas` layer performs more quickly for both inserts and selects. In the case of selects, `pandas` is 6 times faster on average for the 2000 visits run. Note that the reads here includes both time for DiaSources and DiaFourcedSources. For inserts `pandas` is on average 1.7 times faster. One limitation here compared to the previous test is that only roughly ~10 or less columns are used when inserting vs the previous test which writes upwards of 100 columns/values into the DB per object/source.
Both speed ups are likely due to `pandas` being more efficient in converting the `DataFrame` object to SQL. In the select case, both code paths are using the same SQL statements and can be compared directly. For the inserts, the two are using slightly different statements with the `afw` path being slightly more optimized in its insert statement. Further investigations could be carried out to investigate if these speed ups hold larger, ~50k visit, runs of the `l1dbproto` and better database backends (e.g. PostgreSQL). No multi-threading, as is available in `l1dbproto` is used.