Fix Version/s: None
Team:Data Access and Database
Does butler.put not end up creating the sqlite registry for you? I really don't know, but I think it would make sense. I need to understand this part better.
Butler puts do not currently involve registry updates (unless I'm totally misreading the code), and I don't have any good ideas for making put support registry updates in a generic way. That's why I suggested having the spatial index be an output dataset that the selection task can use. By the way - the butler code I'm supposed to be looking at is in daf_persistence, not daf_butler, right?
Lastly, yes I will cover
Yes: you should use daf_persistence and daf_butlerUtils, not daf_butler.
Last week we ended up deciding that there would be a script, a la ingestProcessed.py, that produces an sqlite3 database containing the spatial index, but that for now the index would live outside the purview of the butler. The selection task will have to be configured with the location of such a database to accelerate spatial selects.
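To make that decision concrete: since the index database lives outside the butler for now, the selection task would need an explicit path to it. A minimal sketch of how the task might be pointed at the database (the environment-variable name and helper function are hypothetical, not part of any existing package):

```python
import os
import sqlite3

# Hypothetical configuration knob: the selection task is handed the
# location of the sqlite3 spatial-index database produced by the
# ingest script, since the butler does not manage it yet.
SPATIAL_INDEX_DB = os.environ.get("SPATIAL_INDEX_DB", "spatial_index.sqlite3")

def open_spatial_index(path=SPATIAL_INDEX_DB):
    # Fail fast if the configured database is missing, rather than
    # letting sqlite3 silently create an empty one at that path.
    if not os.path.exists(path):
        raise FileNotFoundError(f"spatial index database not found: {path}")
    return sqlite3.connect(path)
```

The fail-fast check matters because sqlite3.connect happily creates an empty file, which would turn a misconfigured path into a silent "no exposures overlap" result.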
Kian-Tat Lim, Nate Pease [X],
Here's a summary of what I'm thinking of doing.
1. Enhance sphgeom package:
2. Add an ingestion task which, given an input dataset that contains exposures, either adds a spatial index for those exposures to the sqlite3 repository registry, or produces an output repository with an sqlite3 index as an output dataset. I'm not sure which is preferable, and I'll likely need some guidance implementing either one. A somewhat more detailed explanation of how this would work:
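As a rough sketch of the ingestion side, assuming a simple exposure-to-index-cell mapping table (the schema and the grid-cell function below are hypothetical stand-ins; a real implementation would compute HTM or similar pixel IDs via sphgeom rather than an integer-degree grid):

```python
import sqlite3

# Hypothetical stand-in for a sphgeom pixelization: assign every
# integer-degree (ra, dec) cell covered by an exposure's bounding
# box a single integer cell id.
def grid_cells(ra_min, ra_max, dec_min, dec_max):
    for ra in range(int(ra_min), int(ra_max) + 1):
        for dec in range(int(dec_min), int(dec_max) + 1):
            yield ra * 360 + (dec + 90)

def ingest_exposures(conn, exposures):
    """exposures: iterable of (exposure_id, ra_min, ra_max, dec_min, dec_max)."""
    conn.execute("CREATE TABLE IF NOT EXISTS exposure_cell ("
                 "exposure_id INTEGER, cell_id INTEGER)")
    # Index on cell_id so spatial selects become indexed lookups.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_cell "
                 "ON exposure_cell (cell_id)")
    for exp_id, *bbox in exposures:
        conn.executemany("INSERT INTO exposure_cell VALUES (?, ?)",
                         [(exp_id, c) for c in grid_cells(*bbox)])
    conn.commit()

conn = sqlite3.connect(":memory:")
ingest_exposures(conn, [(1, 10, 11, -5, -4), (2, 50, 51, 0, 1)])
```

Whether this table ends up inside the repository registry or as its own output dataset, the ingest logic itself would look the same; only where the connection points differs.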
The last point motivates the Python wrapper. I could theoretically save myself a bunch of work by making the Python interface a single function that takes an sqlite3.Connection object and registers a bunch of UDFs. But that requires extracting the sqlite3 * C pointer from a PyObject subclass defined in a pysqlite C header that I almost certainly do not have access to. I don't see how to do this in a reasonable way.
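For contrast, registering a pure-Python UDF needs no access to the underlying sqlite3 * pointer at all, because sqlite3.Connection.create_function handles it; the limitation is that the function body must be Python, not the C++ sphgeom code. A sketch (the UDF name and its range-overlap semantics are illustrative, not an existing sphgeom API):

```python
import sqlite3

# Illustrative pure-Python UDF: treats spatial regions as half-open
# (begin, end) index ranges and reports whether two ranges intersect.
# A real sphgeom-backed UDF would do this test in C++.
def ranges_overlap(a_begin, a_end, b_begin, b_end):
    return int(a_begin < b_end and b_begin < a_end)

conn = sqlite3.connect(":memory:")
# create_function(name, narg, func) is part of the standard sqlite3
# module API; no C-level pointer extraction is needed at this level.
conn.create_function("ranges_overlap", 4, ranges_overlap)

result = conn.execute("SELECT ranges_overlap(0, 10, 5, 20)").fetchone()[0]
print(result)  # 1
```

This is the trade-off behind the wrapper: create_function avoids the pointer problem but pays Python-call overhead per row, whereas C++ UDFs need the raw sqlite3 * that pysqlite does not expose.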
3. Add one or more selection tasks (like selectSdssImages.py) which, given a spatial region specification (an explicit sky polygon, or more indirectly, a coadd patch id), return a list of data IDs of overlapping exposures. This task could have multiple backends - e.g. one that runs queries against MySQL exposure tables that have been ingested with ingestProcessed.py from datarel, and another that queries a repo registry/sqlite3 output dataset.
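The sqlite3 backend of such a selection task could reduce to an indexed IN query once the region has been converted to index cells (the table schema and cell ids here are hypothetical, matching no existing repo layout):

```python
import sqlite3

# Toy index table: (exposure_id, cell_id) pairs as an ingest task
# might have produced them.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE exposure_cell (exposure_id INTEGER, cell_id INTEGER)")
conn.executemany("INSERT INTO exposure_cell VALUES (?, ?)",
                 [(1, 100), (1, 101), (2, 101), (3, 200)])

def select_exposures(conn, region_cells):
    # Return distinct exposure ids whose index cells intersect the
    # cells covering the requested region.
    qmarks = ",".join("?" * len(region_cells))
    rows = conn.execute(
        f"SELECT DISTINCT exposure_id FROM exposure_cell "
        f"WHERE cell_id IN ({qmarks}) ORDER BY exposure_id",
        region_cells).fetchall()
    return [r[0] for r in rows]

print(select_exposures(conn, [101, 200]))  # [1, 2, 3]
```

A MySQL backend would run essentially the same query against the ingestProcessed.py tables, so the backend split is mostly about connection handling and schema names, not query logic.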
Does this sound reasonable to you? Any thoughts are very welcome,