  1. Data Management
  2. DM-2351

Clarify ability to recreate LSST (DM) Python objects from the data archive

    Details

    • Team:
      Architecture

      Description

      The request (from the Level 3 perspective) is to clarify our expectations for which LSST Python object types can be readily recreated from data retrieved from the LSST Archive, and how explicitly the stack will support this recreation. In other words, roughly, for what types do we support something like an object-relational mapping?

      This is a key point in trying to understand what the Level 3 environment will be like for users.

      I want to be sure that I understand any differences in what is possible (other than issues related solely to I/O performance/throughput) on resources at a DAC vs. on computers outside the LSST project-provided facilities.

      For images, it's clear that a user can use the documented ability to retrieve LSST image data in FITS format to read that FITS data into an LSST DM Python image class. Ideally this would all be through the Butler, and presumably this is possible for all classes of image data, including calibration images, that are available from the Archive (per the DPDD specification of the official data products).

      It is much less clear to me whether any form of object-recreation, short of a full recomputation, is envisioned for the Python objects that lie behind our catalog entries (Object, Source, etc.) - or for derived metadata. For instance, can the computed PSF model for an image be retrieved from the database and readily recreated in Python in a form satisfying our PSF API? A WCS?

      If recreation is supported, what will those interfaces look like?

            Activity

            ktl Kian-Tat Lim added a comment -

            All datasets stored in the database are intended to be retrievable via the Butler as Python objects. This includes a PSF (if complete PSFs are indeed stored in the database, which is not clear), a WCS, a Source, and an Object. The Butler data id for items in the database is expected to include key/value pairs that contribute to an appropriate SQL query.

            Furthermore, I would expect that the Butler would also be able to retrieve (lowercase) objects via the REST API and return them as Python objects.

            As I see it, all interfaces used by normal stack code at a DAC should also work away from the DAC. The exceptions would be "alongside DRP" (a.k.a. Level 2.5) processing and "alongside database" processing that inherently live at the DAC.
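
As a rough illustration of how data id key/value pairs could contribute to an SQL query, the following sketch turns a data id dictionary into a parameterized SELECT. It is not Butler code: the helper `data_id_to_query`, the toy `Source` table layout, and the column names are all hypothetical, and an in-memory SQLite database stands in for the real one.

```python
import sqlite3


def data_id_to_query(table, data_id, allowed_columns):
    """Build a parameterized SELECT from data id key/value pairs.

    Hypothetical helper for illustration only. `table` and
    `allowed_columns` must come from trusted configuration, never
    from user input, because the table name is interpolated directly.
    """
    unknown = set(data_id) - set(allowed_columns)
    if unknown:
        raise KeyError(f"data id keys not in {table}: {sorted(unknown)}")
    keys = sorted(data_id)  # stable ordering keeps the SQL deterministic
    sql = f"SELECT * FROM {table}"
    if keys:
        sql += " WHERE " + " AND ".join(f"{key} = ?" for key in keys)
    return sql, [data_id[key] for key in keys]


# Toy stand-in for a catalog table (assumed schema, not the real one).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Source (sourceId INTEGER, visit INTEGER, ccd INTEGER, flux REAL)"
)
conn.executemany(
    "INSERT INTO Source VALUES (?, ?, ?, ?)",
    [(1, 100, 5, 12.5), (2, 100, 6, 7.25), (3, 101, 5, 3.0)],
)

sql, params = data_id_to_query(
    "Source", {"visit": 100, "ccd": 5}, ["sourceId", "visit", "ccd", "flux"]
)
rows = conn.execute(sql, params).fetchall()
```

Values are passed as bound parameters rather than formatted into the SQL string, so only the keys (checked against a known column list) shape the query text.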

            gpdf Gregory Dubois-Felsmann added a comment -

            Re: "... all interfaces used by normal stack code at a DAC should also work away from the DAC."

            Agreed. That appears to be a requirement. Given what we have said about the ability for users to recreate the entire processing chain, I think the statement could be even stronger, perhaps:

            "All interfaces used by normal stack code should function and return the same results whether run on LSST resources (at the Base and Archive centers, or at DACs) or offsite. The performance achieved may of course vary, and the behavior of the interface implementations onsite vs. offsite, and in production vs. for users, may be different (e.g., onsite, production processes may use more direct connection methods and/or access different sources for the same data in order to improve performance)."

            gpdf Gregory Dubois-Felsmann added a comment -

            Re: "All datasets stored in the database are intended to be retrievable via the Butler as Python objects."

            That is very good to hear. Is this written down somewhere?

            From my perspective, it may be OK if the returned objects are different implementations of the same interfaces; it would be good to know where that approach is likely to be needed.
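
The idea of different implementations behind one interface can be sketched with a structural type. Everything here is hypothetical: `PsfLike`, `PipelinePsf`, `ArchivePsf`, and the method `fwhm_at` are made-up names for illustration and are not the LSST PSF API.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class PsfLike(Protocol):
    """Hypothetical minimal PSF interface (not the LSST PSF API)."""

    def fwhm_at(self, x: float, y: float) -> float: ...


class PipelinePsf:
    """Stands in for the object built during pipeline processing."""

    def __init__(self, base: float, slope: float) -> None:
        self.base = base
        self.slope = slope

    def fwhm_at(self, x: float, y: float) -> float:
        return self.base + self.slope * x


class ArchivePsf:
    """Stands in for an object rebuilt from stored parameters."""

    def __init__(self, record: dict) -> None:
        self._record = record

    def fwhm_at(self, x: float, y: float) -> float:
        return self._record["base"] + self._record["slope"] * x


def report_fwhm(psf: PsfLike, x: float, y: float) -> float:
    """Caller code depends only on the interface, not the concrete class."""
    return psf.fwhm_at(x, y)


original = PipelinePsf(base=0.7, slope=0.001)
recreated = ArchivePsf({"base": 0.7, "slope": 0.001})
```

Code written against `PsfLike` accepts either class, which is the sense in which a recreated object could differ in implementation while still satisfying the same interface.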

            frossie Frossie Economou added a comment -

            Kian-Tat Lim It's not even clear to me that 2.5 interfaces are an exception...

            gpdf Gregory Dubois-Felsmann added a comment -

            I think they are at least an exception in that, if a production is not in progress, the data objects that might be requested through those interfaces will simply not exist.

            ktl Kian-Tat Lim added a comment -

            Re: retrievability of database datasets. Butler Redesign for Winter2014 and the New Data Butler design document mention this, but perhaps not as explicitly.

            Re: Level 2.5. I've seen this as basically our DAC DMCS (cooperating with the Archive DMCS) submitting a user-provided Task to a DAC-resident queue once an intermediate data product is available. I don't think we want a polling interface, nor one where someone will queue up millions of blocking requests at once. I take no position on the political ramifications of making DRP intermediates widely available before the DRP itself is.

            ktl Kian-Tat Lim added a comment -

            I will attempt to summarize the state of what is currently expected based on our requirements and designs:

            For images, a Butler instantiated at a DAC will be able to retrieve afw.Exposure objects.  Butler export/import will enable images to be transported outside the DAC while still being similarly retrievable.  When retrieving images from an RSP API Aspect VO service, methods will be provided to read a retrieved FITS file into an afw.Exposure.

            For catalogs, a Butler instantiated at a DAC will be able to retrieve the Science Data Model catalog contents from Parquet files. These will not be in the original afw.Catalog Python type but instead in a Python type thought to be more generally useful, such as pandas.DataFrame. Again, Butler export/import can be used to transport these outside the DAC.

            Original pipeline Python objects (such as Source or DiaSource) as well as intermediate data present in those objects and afw.Catalog FITS files (such as Footprint objects) are not expected to be available or retrievable.  It is not currently planned to enable the Butler to retrieve catalog entries from a database; that would have to be done using an RSP API Aspect VO service (i.e. TAP), which would also not return Python objects.
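
Since catalog entries would come from a TAP service rather than the Butler, a client would build an ADQL query and parse the tabular response itself. The sketch below only constructs a synchronous TAP request using the IVOA TAP parameters (`REQUEST`, `LANG`, `QUERY`, `FORMAT`) and parses a CSV payload into plain dictionaries. The base URL, table name, column names, and sample response are assumptions, CSV output is assumed to be offered by the service, and no network call is made.

```python
import csv
import io
from urllib.parse import urlencode
from urllib.request import Request


def build_tap_sync_request(base_url, adql, fmt="csv"):
    """Build (but do not send) a synchronous TAP query request."""
    body = urlencode(
        {"REQUEST": "doQuery", "LANG": "ADQL", "QUERY": adql, "FORMAT": fmt}
    ).encode("ascii")
    return Request(
        base_url.rstrip("/") + "/sync",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def parse_csv_rows(text):
    """Turn a CSV response into a list of dicts; all values stay strings."""
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical endpoint and table; nothing is sent over the network.
request = build_tap_sync_request(
    "https://tap.example.org/tap",
    "SELECT TOP 2 objectId, ra, decl FROM Object",
)

# Made-up payload standing in for what a service might return.
sample_response = "objectId,ra,decl\n1,10.5,-30.25\n2,10.6,-30.20\n"
rows = parse_csv_rows(sample_response)
```

The parsed rows are untyped column values, not pipeline objects such as `Source`; any conversion to richer Python types would be up to the caller.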

            ktl Kian-Tat Lim added a comment - - edited

            Gregory Dubois-Felsmann is adding this to LDM-542 on DM-26661, but first we're asking for review from Leanne Guy.


              People

              Assignee:
              ktl Kian-Tat Lim
              Reporter:
              gpdf Gregory Dubois-Felsmann
              Reviewers:
              Leanne Guy
              Watchers:
              Frossie Economou, Gregory Dubois-Felsmann, John Swinbank, Kian-Tat Lim, Leanne Guy, Robert Lupton, Tim Jenness
