Data Management / DM-6913

Please document the semantics of object identifiers

    Details

    • Team:
      Architecture
    • Urgent?:
      No

      Description

      On clo, Robert Lupton mentions that object identifiers in source (object, etc) tables contain useful information beyond merely being a unique ID. Please capture this in the stack documentation.

      It may also be appropriate to describe this in the DPDD. I leave that to those wiser than myself (i.e. Zeljko Ivezic) to opine on.

    Activity

            jbosch Jim Bosch added a comment -

            Let's call these functions "packers" for now - they're reversible mappings between some group of identifiers and a single integer that represents all of them.

            We have packers for tract+patch, visit+detector, exposure+detector, and probably more. Some of these are implemented in astro_metadata_translator; some are implemented in CameraMapper (Gen2) and daf_butler DataIdPacker classes (Gen3). There is also completely distinct code in afw.table.IdFactory that mangles any integer representing an image (often generated by one of the above) with an autoincrement integer coming from the detection code into a Source or Object ID.
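To make the "packer" idea concrete, here is a minimal sketch of a reversible mapping from (visit, detector) to a single integer. This is illustrative only: the function names, the field width, and the use of simple multiplication are assumptions for this sketch, not the actual DataIdPacker or astro_metadata_translator implementation (real packers derive their field widths from the camera or skymap definitions).

```python
# Hypothetical packer sketch: a reversible mapping from a tuple of
# identifiers (visit, detector) to one integer. The detector field
# width below is invented for illustration; a real packer would take
# it from the camera definition.

MAX_DETECTORS = 256  # assumed upper bound on detectors per visit


def pack_visit_detector(visit: int, detector: int) -> int:
    """Pack (visit, detector) into a single integer."""
    if not 0 <= detector < MAX_DETECTORS:
        raise ValueError(f"detector {detector} out of range")
    return visit * MAX_DETECTORS + detector


def unpack_visit_detector(packed: int) -> tuple[int, int]:
    """Invert pack_visit_detector, recovering (visit, detector)."""
    return divmod(packed, MAX_DETECTORS)


packed = pack_visit_detector(903334, 42)
assert unpack_visit_detector(packed) == (903334, 42)
```

The essential property is reversibility: given only the packed integer, the original identifiers can be recovered, which is exactly what gets lost if the mapping is never documented or exposed as an API.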

            I don't really see much of this as a middleware concern - while there is code in both Butlers to define some of them, the packers are really obs-package or skymap-defined mappings, and the middleware's role is just to deliver them from obs-package or skymap to science code, which is then responsible for mangling them into Source or Object IDs. It also uses the outputs of the packers in astro_metadata_translator at ingest time, but doesn't care that it does - it just cares that it is given unique integers.

            Maybe this is semantics, because I think Tim Jenness probably owns a good chunk of the problem regardless of the hat he's wearing, but I think it's most helpful to just think of middleware's role in this as that of a middleman between obs-package, skymap, and the science pipelines.

            rhl Robert Lupton added a comment -

            I think this is more than documentation.  I provided functionality in the mappers (or tied to the mappers) to allow code to unpack the information;  I think we need to provide those APIs.  I think that's what Tim and Jim are talking about (where's Pim??)

            swinbank John Swinbank added a comment -

            This whole discussion is about source IDs, though, isn't it, even though the ticket says object IDs...

            The ticket is confusing (and I can say that without fear of upsetting anybody, since I wrote it). The key wording is “object identifiers in source (object, etc) tables”; a shameful overloading of the term “object”.

            The key point is: any time we publish a table of anything which provides an “ID” field that carries more meaning than simply being a unique identifier, we should:

            • document its significance, and
            • provide code to interpret it (where appropriate).

            Given the above discussion, though, I'm a little confused about how we actually get there; there seem to be many moving parts, and nobody who really owns all of them.

            I'm also not sure if it really matters. Is being able to unpack these IDs important? What would we lose if they were simply unique identifiers? I suggest that most people productively using our catalogs don't see them as anything more than that, and it has not been an obvious blocker to productivity.

            rhl Robert Lupton added a comment -

            I think it does matter as soon as you look at data from more than one chip or coadd – that was why I wrote the code that I referred to.  You find an outlier and ask, "which visit/CCD is this?"

            jbosch Jim Bosch added a comment - edited

            Should we consider having explicit compound primary keys (or logical equivalents elsewhere) instead of mangling multiple integers into another integer? There's a tradeoff between having to provide code to mangle/demangle IDs on one side and the pain of passing around tuples instead of integers on the other - we've just assumed that the former is less painful than the latter so far. Recent (mostly unrelated) discussions have made me realize that "providing a function" in the future may be a much more involved task than it seems today, if it involves numpy-friendly vectorization, SQL UDFs for multiple database engines, and maybe even things the Dask/Parquet ecosystem can partition on.
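As a sketch of what the "numpy-friendly vectorization" requirement might look like, here is a hypothetical vectorized unpacker for a column of source IDs. The bit layout (image ID in the high bits, a per-image counter in the low bits) is loosely in the spirit of afw.table.IdFactory, but the constant and function names here are invented for illustration.

```python
# Hypothetical vectorized unpacker for an ID column, assuming source IDs
# were built as (image_id << RESERVED_BITS) | counter. The number of
# reserved bits is an assumption for this sketch, not the real layout.
import numpy as np

RESERVED_BITS = 26  # assumed low bits reserved for the per-image counter


def unpack_source_ids(source_ids: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Split an array of packed source IDs into (image_id, counter) arrays."""
    image_ids = source_ids >> RESERVED_BITS
    counters = source_ids & ((1 << RESERVED_BITS) - 1)
    return image_ids, counters


ids = np.array([(7 << RESERVED_BITS) | 3, (9 << RESERVED_BITS) | 1])
image_ids, counters = unpack_source_ids(ids)
# image_ids → [7, 9], counters → [3, 1]
```

The same arithmetic would need to be re-expressed as SQL for each database engine (and possibly as a Dask/Parquet partitioning expression), which is the maintenance burden the comment above is pointing at.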


  People

  • Assignee:
    Unassigned
  • Reporter:
    swinbank John Swinbank
  • Watchers:
    Colin Slater, Gregory Dubois-Felsmann, Jim Bosch, John Swinbank, Kian-Tat Lim, Robert Lupton, Tim Jenness
  • Votes:
    1