Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-24225

Explore performance of I/O of LUT linearizer in yaml format

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Team:
      Architecture
    • Urgent?:
      No

      Description

      As per DM-23044, ptc.py from cp_pipe is now able to persist different types of linearizers. One of them is LookupTable (see linearizer.py in ip_isr), with, for an LSSTCam-type device, will have a shape of (#nAmps, #DN) --->(16, 2**18). This table, along with necessary metadata, is saved as a yaml file. However, reading this file (38Mb for the file in the example below) again takes a very long time ( 3 mins), which is not ideal:

      import yaml
      import timeit 
      file = "/project/shared/auxTel/rerun/plazas/PTC-auxTel2020MAR25-doLinearize-true/calibrations/linearizers/linearize-lut-det000_2020-03-27.yaml"
       
      def read_yaml ():
          with open(file, 'r') as f:
              dataLU = yaml.load(f, Loader=yaml.Loader)
          return 
      

      %%timeit -n2 -r3
      read_yaml ()
       
      3min 18s ± 4.11 s per loop (mean ± std. dev. of 3 runs, 2 loops each)
      

        Attachments

          Issue Links

            Activity

            Hide
            mfisherlevine Merlin Fisher-Levine added a comment -

            The one line change, while obviously low-hanging fruit, is barely worth doing IMO. Maybe for some very early testing a 30s load is tolerable, but that's functionally useless really, so I'd almost lean away from merging that, lest people think that's OK. (i.e. obviously so that they can work for now there's no reason for Andrés/Chris not to have that on user branches).

            Show
            mfisherlevine Merlin Fisher-Levine added a comment - The one line change, while obviously low-hanging fruit, is barely worth doing IMO. Maybe for some very early testing a 30s load is tolerable, but that's functionally useless really, so I'd almost lean away from merging that, lest people think that's OK. ( i.e. obviously so that they can work for now there's no reason for Andrés/Chris not to have that on user branches).
            Hide
            swinbank John Swinbank added a comment - - edited

            I think what I'd like to see is guidance from Middleware/Architecture about what we're “allowed” to do here. I have the impression that there's some expectation that data “has” to be in YAML format, and that pickles (for example) are not allowed. Is that right? Is it documented? (That's a genuine question; I'm not trying to be snarky or naive... even if I am managing it).

            If we can establish that, I feel like we'd be in a more educated position here. If we can just dump in a pickle and turn that 3+ minutes into milliseconds, then problem solved, job's a good 'un, and we can all knock off early. If there is a requirement to use YAML, or some other process we have to go through in order to select an appropriate serialization mechanism... then we have to do some more thinking.

            Show
            swinbank John Swinbank added a comment - - edited I think what I'd like to see is guidance from Middleware/Architecture about what we're “allowed” to do here. I have the impression that there's some expectation that data “has” to be in YAML format, and that pickles (for example) are not allowed. Is that right? Is it documented? (That's a genuine question; I'm not trying to be snarky or naive... even if I am managing it). If we can establish that, I feel like we'd be in a more educated position here. If we can just dump in a pickle and turn that 3+ minutes into milliseconds, then problem solved, job's a good 'un, and we can all knock off early. If there is a requirement to use YAML, or some other process we have to go through in order to select an appropriate serialization mechanism... then we have to do some more thinking.
            Hide
            tjenness Tim Jenness added a comment -

            Pickle is a good hack but can't be the end game because it's almost impossible to document as a data format.

            There is no requirement to use YAML for this particular butler.put. The YAML form is meeting the requirement that the data can be stored as text form in the obs_lsst_data package – I do wonder if ecsv would be more efficient for that if we could store the lookup table in the YAML header but I don't know the relative sizes of the two tables.

            What's the performance of the FITS writer for linearizer? For curated calibrations we have been storing as text in git but putting them as fits files into butler. I was worrying mostly about performance for reading the text form because that slows down the curated calibration ingest. That's not a disaster though since you don't do it very often. For all I know the writeFits method takes ~ 1 sec and so we would be fine with that. Has anyone tried changing the storage form in gen2 butler so that it uses the Fits Catalog storage?

            Show
            tjenness Tim Jenness added a comment - Pickle is a good hack but can't be the end game because it's almost impossible to document as a data format. There is no requirement to use YAML for this particular butler.put. The YAML form is meeting the requirement that the data can be stored as text form in the obs_lsst_data package – I do wonder if ecsv would be more efficient for that if we could store the lookup table in the YAML header but I don't know the relative sizes of the two tables. What's the performance of the FITS writer for linearizer? For curated calibrations we have been storing as text in git but putting them as fits files into butler. I was worrying mostly about performance for reading the text form because that slows down the curated calibration ingest. That's not a disaster though since you don't do it very often. For all I know the writeFits method takes ~ 1 sec and so we would be fine with that. Has anyone tried changing the storage form in gen2 butler so that it uses the Fits Catalog storage?
            Hide
            swinbank John Swinbank added a comment -

            Thanks Tim. If I've understood the above correctly, there is no restriction on somebody in Pipelines just choosing an appropriate binary representation (and I agree with you that Pickle is not appropriate) and using that instead of YAML. Given that, I think we can claim this ticket as done (hooray), and file another ticket for the Pipelines group to change the format.

            Show
            swinbank John Swinbank added a comment - Thanks Tim. If I've understood the above correctly, there is no restriction on somebody in Pipelines just choosing an appropriate binary representation (and I agree with you that Pickle is not appropriate) and using that instead of YAML. Given that, I think we can claim this ticket as done (hooray), and file another ticket for the Pipelines group to change the format.
            Hide
            tjenness Tim Jenness added a comment -

            To be even clearer, Linearizer class already has a writeFits and readFits method so if in DM-23044 the butler was configured so that it treated Linearizer as a FitsCatalog then you are probably done with everything (assuming fits reading is reasonably fast). I still think it's worth investigating the text format on a separate ticket since I'm not convinced that YAML is the right answer and it might be that ecsv table is better. Jim Bosch would also prefer it if we switched from afwTable to astropy Table in Linearizer but that's also a separate ticket at this point.

            Show
            tjenness Tim Jenness added a comment - To be even clearer, Linearizer class already has a writeFits and readFits method so if in DM-23044 the butler was configured so that it treated Linearizer as a FitsCatalog then you are probably done with everything (assuming fits reading is reasonably fast). I still think it's worth investigating the text format on a separate ticket since I'm not convinced that YAML is the right answer and it might be that ecsv table is better. Jim Bosch would also prefer it if we switched from afwTable to astropy Table in Linearizer but that's also a separate ticket at this point.

              People

              • Assignee:
                Unassigned
                Reporter:
                plazas Andrés Alejandro Plazas Malagón
                Watchers:
                Andrés Alejandro Plazas Malagón, Christopher Waters, John Swinbank, Merlin Fisher-Levine, Tim Jenness
              • Votes:
                0 Vote for this issue
                Watchers:
                5 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Summary Panel