  Data Management / DM-3504

Improve spatial image search for butler


Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Story Points: 11
    • Sprints: DB_W16_10, DB_W16_11
    • Team: Data Access and Database


      Activity

        smonkewitz Serge Monkewitz added a comment (edited)

        ktl, npease,

        Here's a summary of what I'm thinking of doing.

        1. Enhance sphgeom package:

        • Implement convex polygon intersection testing
        • Add binary and/or textual I/O for the geometric primitives
        • Remove ellipse support
        • Add a Cartesian 3-D bounding box type and methods to compute it for the various geometric primitives
        • Turn it into a standard stack package: ditch the custom build system, use sconsUtils, adopt the standard stack package layout, and wrap the C++ with SWIG
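        To make the bounding-box item concrete, here is a minimal sketch (the function names are hypothetical, not sphgeom's API): the box for a spherical polygon can be seeded from the 3-D unit vectors of its vertices. Note that a real implementation must also account for great-circle edges bulging outside the hull of the vertices, so this vertex-only box is an approximation.

```python
import math

def unit_vector(ra_deg, dec_deg):
    """Convert sky coordinates (degrees) to a unit 3-vector."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra),
            math.cos(dec) * math.sin(ra),
            math.sin(dec))

def bbox3d(vertices):
    """Axis-aligned 3-D box containing the given unit vectors.

    Approximation only: great-circle edges of a spherical polygon
    can extend outside the hull of its vertices, so a production
    implementation must pad or recompute the box accordingly.
    """
    xs, ys, zs = zip(*vertices)
    return (min(xs), max(xs), min(ys), max(ys), min(zs), max(zs))

# A small quadrilateral straddling the equator.
poly = [unit_vector(ra, dec) for ra, dec in
        [(10.0, -1.0), (12.0, -1.0), (12.0, 1.0), (10.0, 1.0)]]
box = bbox3d(poly)
```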

        2. Add an ingestion task which, given an input dataset that contains exposures, either adds a spatial index for those exposures to the sqlite3 repository registry, or produces an output repository with a sqlite3 index as an output dataset. I'm not sure which is preferable, and I'll likely need some guidance implementing either one. A somewhat more detailed explanation of how this would work:

        • one table would contain the exposure ID, complete data ID columns, and some representation of the boundary polygon for each exposure.
        • another would contain the exposure ID and the 3-D bounding box, stored in an sqlite3 R*-tree index.
        • we would either register polygon-overlap test functions via the Python sqlite3 API and post-filter the R*-tree search results in SQL, or post-filter in Python.
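        A minimal sqlite3 sketch of the two-table layout above (table and column names are illustrative, not the actual schema; it requires an SQLite built with the R*-tree module, which most distributions enable):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Exposure metadata: full data ID columns plus some serialized
# representation of the boundary polygon (a BLOB here).
conn.execute("""
    CREATE TABLE exposure (
        exposure_id INTEGER PRIMARY KEY,
        run INTEGER, camcol INTEGER, field INTEGER, filter TEXT,
        boundary BLOB
    )""")

# R*-tree over the 3-D Cartesian bounding boxes of the exposures.
conn.execute("""
    CREATE VIRTUAL TABLE exposure_bbox USING rtree(
        exposure_id, x_min, x_max, y_min, y_max, z_min, z_max
    )""")

conn.execute("INSERT INTO exposure VALUES (1, 94, 1, 12, 'r', NULL)")
conn.execute("INSERT INTO exposure_bbox VALUES "
             "(1, 0.9, 1.0, -0.1, 0.1, -0.1, 0.1)")

# Candidate exposures whose boxes overlap a query box; exact
# polygon-level filtering would happen afterwards (via a UDF in
# SQL, or in Python).
rows = conn.execute("""
    SELECT e.exposure_id FROM exposure e
    JOIN exposure_bbox b USING (exposure_id)
    WHERE b.x_max >= ? AND b.x_min <= ?
      AND b.y_max >= ? AND b.y_min <= ?
      AND b.z_max >= ? AND b.z_min <= ?
""", (0.95, 1.0, -0.05, 0.05, -0.05, 0.05)).fetchall()
```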

        The last point motivates the Python wrapper. I could theoretically save myself a bunch of work by making the Python interface a single function that takes an sqlite3.Connection object and registers a bunch of UDFs. But that requires extracting the sqlite3* C pointer from a PyObject subclass defined in a C header from pysqlite that I almost certainly do not have access to. I don't see how to do this in a reasonable way.
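        For illustration, this is the UDF-registration mechanism in question, via the Python sqlite3 API. A simple interval-overlap test stands in for the real sphgeom polygon-intersection test:

```python
import sqlite3

# Stand-in for the real polygon-overlap test from sphgeom; any
# Python callable can be registered as a SQL function, which is
# what motivates wrapping the C++ geometry code for Python.
def intervals_overlap(lo1, hi1, lo2, hi2):
    return 1 if (hi1 >= lo2 and lo1 <= hi2) else 0

conn = sqlite3.connect(":memory:")
conn.create_function("overlaps", 4, intervals_overlap)

conn.execute("CREATE TABLE seg (id INTEGER PRIMARY KEY, lo REAL, hi REAL)")
conn.executemany("INSERT INTO seg VALUES (?, ?, ?)",
                 [(1, 0.0, 1.0), (2, 2.0, 3.0), (3, 5.0, 6.0)])

# Post-filter directly in SQL using the registered UDF.
ids = [r[0] for r in conn.execute(
    "SELECT id FROM seg WHERE overlaps(lo, hi, ?, ?)", (0.5, 2.5))]
```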

        3. Add one or more selection tasks (like selectSdssImages.py) which, given a spatial region specification (an explicit sky polygon, or more indirectly, a coadd patch id), returns a list of data IDs of overlapping exposures. This task could have multiple backends - e.g. one that runs queries against MySQL exposure tables that have been ingested with ingestProcessed.py from datarel, and another that queries a repo registry/sqlite3 output dataset.
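        The multiple-backend idea could be sketched along these lines (class and method names are hypothetical, not the actual task interface):

```python
from abc import ABC, abstractmethod

class SelectImagesBackend(ABC):
    """Maps a spatial region to data IDs of overlapping exposures."""

    @abstractmethod
    def select(self, region):
        """Return a list of data IDs for exposures overlapping `region`."""

class SqliteIndexBackend(SelectImagesBackend):
    """Would query the sqlite3 R*-tree index, then post-filter."""

    def __init__(self, db_path):
        self.db_path = db_path

    def select(self, region):
        raise NotImplementedError("sketch only")

class MySqlBackend(SelectImagesBackend):
    """Would query MySQL exposure tables ingested via ingestProcessed.py."""

    def __init__(self, conn_params):
        self.conn_params = conn_params

    def select(self, region):
        raise NotImplementedError("sketch only")
```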

        Does this sound reasonable to you? Any thoughts are very welcome,
        Serge


        npease Nate Pease added a comment

        Does butler.put not end up creating the sqlite registry for you? I really don't know, but it makes sense, I think? I need to understand this part better.

        jbecla Jacek Becla added a comment

        Sounds like you will cover DM-2262.

        smonkewitz Serge Monkewitz added a comment

        Butler puts do not currently involve registry updates (unless I'm totally misreading the code), and I don't have any good ideas for making put support registry updates in a generic way. That's why I suggested having the spatial index be an output dataset that the selection task can use. By the way, the butler code I'm supposed to be looking at is in daf_persistence, not daf_butler, right?

        Lastly, yes I will cover DM-2262.


        npease Nate Pease added a comment

        Yes: you should use daf_persistence and daf_butlerUtils, not daf_butler.


        smonkewitz Serge Monkewitz added a comment

        Last week we ended up deciding that there would be a script à la ingestProcessed.py that produces an sqlite3 database containing the spatial index, but that for now, the index would live outside the purview of the butler. The selection task will have to be configured with the location of such a database for accelerated spatial selects.


        smonkewitz Serge Monkewitz added a comment

        DM-3472 will serve as the implementation issue.


        People

          Assignee: smonkewitz Serge Monkewitz
          Reporter: fritzm Fritz Mueller
          Watchers (3): Jacek Becla, Nate Pease, Serge Monkewitz
          Votes: 0
