Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-25447

Support read-only components for butler get

    XMLWordPrintable

    Details

    • Story Points:
      10
    • Team:
      Architecture
    • Urgent?:
      No

      Description

      Currently we define components for StorageClasses to be the components that a dataset can be disassembled into or reassembled from. This is fine for things like masks and wcs but gen2 supports read-only components such as BBox that allow you to efficiently query information about a specific dataset without having to fetch the entire dataset.

      Getting this to work is fairly straight forward if there is no disassembly. If we do disassemble then I need to think about how we forward the BBox request to the correct component.

        Attachments

          Issue Links

            Activity

            Hide
            tjenness Tim Jenness added a comment -

            I have something simple that works now for Exposure that is not disassembled. I list readComponents separately, register those dataset types and datastore naively passes through the component request to the formatter.

            For disassembled datasets things are a bit trickier since the only portable solution would be to pass the component name and a dict mapping disassembled components to their own formatters to the composite Formatter and have that composite formatter work out which components it needs to work out filter or bbox etc. The bit that is a problem is that currently datastore doesn't know what the composite Formatter is because it never bothers to store it (since it disassembled things) – I think this means that my only option is to always add a row for the composite to posix_datastore_records with a blank filename (or I guess write such a row if the storage class defines read components – but it probably doesn't hurt to always write).

            I don't even know if Image has a fast equivalent of FitsExposureReader for getting the bounding box.

            Show
            tjenness Tim Jenness added a comment - I have something simple that works now for Exposure that is not disassembled. I list readComponents separately, register those dataset types and datastore naively passes through the component request to the formatter. For disassembled datasets things are a bit trickier since the only portable solution would be to pass the component name and a dict mapping disassembled components to their own formatters to the composite Formatter and have that composite formatter work out which components it needs to work out filter or bbox etc. The bit that is a problem is that currently datastore doesn't know what the composite Formatter is because it never bothers to store it (since it disassembled things) – I think this means that my only option is to always add a row for the composite to posix_datastore_records with a blank filename (or I guess write such a row if the storage class defines read components – but it probably doesn't hurt to always write). I don't even know if Image has a fast equivalent of FitsExposureReader for getting the bounding box.
            Hide
            jbosch Jim Bosch added a comment -

            Yup, there's an FitsImageReader, a FitsMaskReader, and a MaskdImageFitsReader, too.

            I had actually forgotten that we are still registering dataset types for components in Registry even though Registry doesn't every know about those datasets (that is what you're referring to, right?). I didn't follow all of the above, so maybe I'm just confused, but we could avoid assuming that in Datastore logic it might be nice to remove someday.

            Would it help at all to think of read-only components as actually being more like read parameters that change the type of the thing returned?

            Show
            jbosch Jim Bosch added a comment - Yup, there's an FitsImageReader , a FitsMaskReader , and a MaskdImageFitsReader , too. I had actually forgotten that we are still registering dataset types for components in Registry even though Registry doesn't every know about those datasets (that is what you're referring to, right?). I didn't follow all of the above, so maybe I'm just confused, but we could avoid assuming that in Datastore logic it might be nice to remove someday. Would it help at all to think of read-only components as actually being more like read parameters that change the type of the thing returned?
            Hide
            tjenness Tim Jenness added a comment -

            I did have to tweak the code to add the new dataset types to the registry. It was a one line change so not really a problem. Components really do get their own dataset types at the moment. The whole system assumes you can make a DatasetRef with the component name but of course that ref uses the dataId for the composite. I can quite easily regenerate the error message so you can see what part of registry got upset.

            I think read-only components is easier to deal with than using read parameters because components already have StorageClasses defined for them and so the infrastructure knows what python type is coming back. Doing a get with a read parameter of "bbox=True" means that the datastore has to know that somehow that parameter is going to chagne the returned python type. This surely would involve asking the composite formatter what type is expected or else removing all attempts at doing type checking in the datastore. As I say, it does work and the changes are small. The main work now is to store the composite formatter even when disassembled.

            Show
            tjenness Tim Jenness added a comment - I did have to tweak the code to add the new dataset types to the registry. It was a one line change so not really a problem. Components really do get their own dataset types at the moment. The whole system assumes you can make a DatasetRef with the component name but of course that ref uses the dataId for the composite. I can quite easily regenerate the error message so you can see what part of registry got upset. I think read-only components is easier to deal with than using read parameters because components already have StorageClasses defined for them and so the infrastructure knows what python type is coming back. Doing a get with a read parameter of "bbox=True" means that the datastore has to know that somehow that parameter is going to chagne the returned python type. This surely would involve asking the composite formatter what type is expected or else removing all attempts at doing type checking in the datastore. As I say, it does work and the changes are small. The main work now is to store the composite formatter even when disassembled.
            Hide
            tjenness Tim Jenness added a comment -

            Jim Bosch I think this is finally ready for review.

            • Now always store the composite formatter in the datastore records (but with a null location)
            • Add special case for a read-only component for a disassembled composite. This is necessary because we ask the formatter which component can service the read-only component request.
            • Added new tests specifically for datastore disassembly (because when we moved disassembly to datastore I only added tests at butler level).
            • This required a new formatter to be written for ExampleMetrics so that we can demonstrate disassembly with read-only component.
            • Read parameters are problematic for read-only components since you can't tell the difference between a parameter that is meant to modify the read-only component value, or a parameter that is meant to modify the component being used to derive that read-only component. This needs more thought and for now I assume that the parameter should be passed to the component before deriving the read-only component.
            • I've updated FitsExposure formatter to support bbox/dimensions/xy0.
            • It seems that support for filter is fundamentally not possible at the moment which also means that disassembly of exposures is lossy. It seems that part of this is that Exposure removes items from metadata and I can't find a method that will put the information back in metadata when I disassemble. The easiest fix maybe to add Filter as a proper component (which I could serialize to YAML – there is no reason we have to use writeFits since it's just the filter name)
            • I did learn that using fits.readMetadata and ExposureFitsReader.readMetadata gives very different results. I use fits.readMetadata when reading the exposure but this means things don't round trip because when I later use Exposure.getMetadata I get the cut down version.
            • Your idea of using an entirely new table that just maps the ref.id to composite formatter is interesting. It probably would removed a lot of the code that checks for the special case. On the other hand that's an entirely new table that needs managing and support in trash collection. After you look at the code you may think it's worth it (I'll try not to think of the phrase "sunk cost").
            Show
            tjenness Tim Jenness added a comment - Jim Bosch I think this is finally ready for review. Now always store the composite formatter in the datastore records (but with a null location) Add special case for a read-only component for a disassembled composite. This is necessary because we ask the formatter which component can service the read-only component request. Added new tests specifically for datastore disassembly (because when we moved disassembly to datastore I only added tests at butler level). This required a new formatter to be written for ExampleMetrics so that we can demonstrate disassembly with read-only component. Read parameters are problematic for read-only components since you can't tell the difference between a parameter that is meant to modify the read-only component value, or a parameter that is meant to modify the component being used to derive that read-only component. This needs more thought and for now I assume that the parameter should be passed to the component before deriving the read-only component. I've updated FitsExposure formatter to support bbox/dimensions/xy0. It seems that support for filter is fundamentally not possible at the moment which also means that disassembly of exposures is lossy. It seems that part of this is that Exposure removes items from metadata and I can't find a method that will put the information back in metadata when I disassemble. The easiest fix maybe to add Filter as a proper component (which I could serialize to YAML – there is no reason we have to use writeFits since it's just the filter name) I did learn that using fits.readMetadata and ExposureFitsReader.readMetadata gives very different results. I use fits.readMetadata when reading the exposure but this means things don't round trip because when I later use Exposure.getMetadata I get the cut down version. Your idea of using an entirely new table that just maps the ref.id to composite formatter is interesting. It probably would removed a lot of the code that checks for the special case. On the other hand that's an entirely new table that needs managing and support in trash collection. After you look at the code you may think it's worth it (I'll try not to think of the phrase "sunk cost").
            Hide
            tjenness Tim Jenness added a comment -

            Jim Bosch I've put this ticket back into review since I changed the entire approach. Following your review I removed the code that added the extra row to posix_datastore_records for the non-file composite. Instead I changed DatasetType such that it has a parent storage class. Now when dealing with read-only components I ask the parent storage class for the assembler and ask the assembler for the responsible component. This does look much neater.

            One side effect of this is that I no longer write component dataset type definitions to registry but dynamically create them on demand. This also led to queryDatasetTypes being rewritten to create the component dataset types as part of the query.

            Please take a second look.

            Show
            tjenness Tim Jenness added a comment - Jim Bosch I've put this ticket back into review since I changed the entire approach. Following your review I removed the code that added the extra row to posix_datastore_records for the non-file composite. Instead I changed DatasetType such that it has a parent storage class. Now when dealing with read-only components I ask the parent storage class for the assembler and ask the assembler for the responsible component. This does look much neater. One side effect of this is that I no longer write component dataset type definitions to registry but dynamically create them on demand. This also led to queryDatasetTypes being rewritten to create the component dataset types as part of the query. Please take a second look.
            Hide
            jbosch Jim Bosch added a comment -

            Re-reviewed!

            Show
            jbosch Jim Bosch added a comment - Re-reviewed!

              People

              Assignee:
              tjenness Tim Jenness
              Reporter:
              tjenness Tim Jenness
              Reviewers:
              Jim Bosch
              Watchers:
              Jim Bosch, Simon Krughoff, Tim Jenness
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  CI Builds

                  No builds found.