Data Management / DM-22655

Genericize gen2to3.py to be usable with any gen2 repo


Details

    • Story Points: 4
    • Sprints: AP S20-1 (December), AP S20-2 (January), AP S20-3 (February), AP S20-4 (March)
    • Team: Alert Production

    Description

      To deal with the various obs_package gen3 support tickets, I'm going to try to make a more generic version of the gen2to3.py conversion script. This will serve as a useful stopgap for using test datasets, while the official "create a gen3 repo from scratch" code is sorted out.

Attachments

Issue Links

Activity

            Parejkoj John Parejko added a comment -

            I think jbosch and I got this working for DECam calibrations, but I haven't been able to figure out how to load calibrations in gen2 to compare against. ktl: you said you had the right gen2 command to load the bias/dark/flat associated with a given raw?

            tjenness Tim Jenness added a comment -

            I am trying to piece together what this all means, but these are the open questions I have:

            • Is the complication here solely for DECam because community pipeline calibrations are also single files with multiple extensions?
            • Can we use a sledgehammer and compare the file system contents for gen3 and gen2 to see that they have the same number of files? Or is bypass magic being used?
            • Can we use sqlalchemy to query the gen2 calibration registry directly to see what is meant to be in there?
            • Are we okay with DECam complications blocking obs_lsst from using gen3?
            • Should we move forward with LSST gen3 support so that we can start using the new cp_pipe with LATISS data and defer DECam?
            • How much of this is DECam and how much of this is uncertainty in gen3 calibration handling?
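            The sqlalchemy option above might look like the following sketch, which simply counts rows per table in the gen2 calibration registry. The registry path and schema are assumptions here: a typical gen2 CALIB repo keeps a calibRegistry.sqlite3 with one table per calibration type, but the details vary per obs_ package.

```python
# Hypothetical sketch: summarize what a gen2 calibRegistry.sqlite3 claims
# to contain, without going through the gen2 butler at all.
import sqlalchemy


def summarize_calib_registry(engine):
    """Return {table_name: row_count} for every table in the registry."""
    inspector = sqlalchemy.inspect(engine)
    counts = {}
    with engine.connect() as connection:
        for table in inspector.get_table_names():
            result = connection.execute(
                sqlalchemy.text(f'SELECT COUNT(*) FROM "{table}"')
            )
            counts[table] = result.scalar()
    return counts


# Point this at the real registry, e.g.:
# engine = sqlalchemy.create_engine("sqlite:///CALIB/calibRegistry.sqlite3")
```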
            Parejkoj John Parejko added a comment -

            The code on this ticket should be generic for all cameras. DECam is the first non-HSC gen3 camera I'm using to test it, because I know the DECam gen3 Formatter/Instrument work and because AP needs DECam gen3 for their work. Once this is merged, we should be able to write tests against converting a gen2 obs_lsst repo; assuming obs_lsst's gen3 support is good, that should be fairly easy.

            tjenness Tim Jenness added a comment -

            Ok, is the summary here that you think you have completed all the work but you have one final verification issue automating the comparison of gen2 calibrations with gen3 calibrations? You are not able to list all the calibration datasets in a gen2 repository?

            jbosch Jim Bosch added a comment -

            If we're just talking about test code, I'd favor just hard-coding the lists of expected datasets into the tests. They aren't long lists (AFAIK), and forcing devs to think about changes to expected test outputs when changing the test data strikes me as a good thing, overall.

            Parejkoj John Parejko added a comment -

            I don't think that hard-coding lists of datasets would work, since I still can't load them from the gen2 butler.

            tjenness Tim Jenness added a comment -

            My interpretation of what was being suggested was that the test would list all the calibrations that you expect to exist in the gen3 butler after conversion. Then you don't need to query gen2 at all, you go through the expected list and check they are all there.
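            A minimal sketch of that approach (the dataset types and data IDs below are illustrative, and ``dataset_exists`` stands in for the gen3 butler's existence check, ``Butler.datasetExists`` at the time of this ticket):

```python
# Hypothetical expected contents of the converted gen3 repo; the real
# test would hard-code the actual calibrations in the test dataset.
EXPECTED_CALIBS = [
    ("bias", {"instrument": "DECam", "detector": 25}),
    ("flat", {"instrument": "DECam", "detector": 25, "physical_filter": "g"}),
]


def find_missing(dataset_exists, expected=EXPECTED_CALIBS):
    """Return the (datasetType, dataId) pairs the butler does not have.

    ``dataset_exists(name, data_id)`` should wrap the real butler call.
    """
    return [
        (name, data_id)
        for name, data_id in expected
        if not dataset_exists(name, data_id)
    ]
```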


            Parejkoj John Parejko added a comment -

            My problem with the above is that I still don't know how to correctly load the gen2 data to compare with. I'd like to be able to test more than just "does gen3 think it can load dataset X?" but also "is dataset X the same as it was in gen2?"
            tjenness Tim Jenness added a comment -

            Does gen2to3 use ingest() to store the gen2 data in the gen3 repository or does it use put()? If it is using ingest then the data files themselves are known to be identical.

            For ingest this means that your test is testing that get in gen3 gives you the same in-memory dataset as get in gen2. This is testing the formatter read. It's good to be testing this, but we can get a long way towards where you want to be by explicitly testing formatters on some representative dataset types. I am worried that this ticket is going to stay open for an indeterminate amount of time while we wrestle with gen2 issues, blocking further progress on the gen3 migration.

            If data are put then formatter write and formatter read are in play and things are much more interesting when trying to determine whether the gen3 and gen2 butlers have the same content.

            jbosch Jim Bosch added a comment -

            If you hard-code a list of tuples of (Gen2 data ID, Gen3 data ID, DatasetType), I think you can just call get on both butlers and do whatever you'd do on the in-memory object to compare them. If you want to be clever, you could add a comparison callable to the tuple, and have a little library of those. I actually think that if you don't at least hard-code both the Gen2 and Gen3 data IDs, you'll have to use some of the conversion logic you'd otherwise want to test to do the conversion.
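            That suggestion might be sketched like this. The data IDs are made up, ``gen2_get``/``gen3_get`` stand in for the two butlers' get calls, and ``operator.eq`` stands in for an entry from the little library of comparison callables Jim describes:

```python
import operator

# (datasetType, gen2 dataId, gen3 dataId, comparator) -- all illustrative.
COMPARISON_CASES = [
    ("bias",
     {"visit": 229388, "ccdnum": 25},
     {"instrument": "DECam", "detector": 25},
     operator.eq),
]


def compare_repos(gen2_get, gen3_get, cases=COMPARISON_CASES):
    """Yield (datasetType, matches) for each hard-coded comparison case."""
    for name, gen2_id, gen3_id, same in cases:
        yield name, same(gen2_get(name, gen2_id), gen3_get(name, gen3_id))
```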

            tjenness Tim Jenness added a comment -

            I think the initial point was that even if you know the dataId of a bias/dark/flat you can't actually get that dataId from a gen2 butler.

            Parejkoj John Parejko added a comment -

            tjenness: you have a good point about testing the Formatters separately. I'm already doing that, and I think I was concerned about the various other magic that happens in the gen2 Mapper. But we don't have to worry about that in the gen3 world, since there aren't overrides for everything and all the "magic" happens in the Formatter itself. The conversion uses `ingest`.

            Further update: I've got conversion of some DECam data mostly working, except for a slew of warnings on read and not getting the correct types for things (e.g. Defects being Catalog not DefectList). So, I think we're pretty close!

            tjenness Tim Jenness added a comment -

            Regarding the defects, the defects datasetType has to be storage class DefectList and it can use the standard FitsCatalog Formatter. Something in gen2to3 has to do that mapping.
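            In configuration terms that mapping might look roughly like the fragment below. This is a sketch only: the exact config file, keys, and formatter import path are assumptions that depend on the daf_butler/obs_base version.

```yaml
# Hypothetical dataset-type override for the gen2-to-gen3 conversion:
# force "defects" onto the DefectList storage class, read with the
# standard FITS catalog formatter (the class path is an assumption).
datasetTypes:
  defects:
    storageClass: DefectList
    formatter: lsst.daf.butler.formatters.fitsCatalogFormatter.FitsCatalogFormatter
```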

            Parejkoj John Parejko added a comment (edited) -

            Jenkins run without ci_hsc (still under construction): https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/31314/pipeline
            Parejkoj John Parejko added a comment -

            tjenness: can you please review this, as we discussed?

            A few notes:

            • DM-23616 testing is still ongoing, but I think any changes as part of that work should go on that ticket: I believe I've demonstrated with this ticket that I can convert DECam gen2 repositories, but whether we can run the full pipeline on them requires work outside the scope of this ticket.
            • This ticket does not change ci_hsc_gen2, for which I created DM-23728. It shouldn't break ci_hsc, but we do want to update it to use this new script.
            • tjenness: do you want to create the ticket for the "cleanup Formatter to remove cpBias/cpFlat from daf_butler" idea we discussed? You can probably describe your plan for that better than I can.
            tjenness Tim Jenness added a comment -

            I have some major comments and questions on the obs_base changes. See GitHub.

            Parejkoj John Parejko added a comment (edited) -

            Jenkins run with ci_hsc: https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/31337/pipeline
            Parejkoj John Parejko added a comment -

            I believe I have dealt with most of your comments. I've fired off a new Jenkins run, but we should sort out a few of the remaining questions (in particular what to do about ci for DECam given file sizes: see my comment on that PR). It may be worth a short telecon to discuss our options for that.

            Parejkoj John Parejko added a comment -

            Final rebase Jenkins run: https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/31343/pipeline
            Parejkoj John Parejko added a comment -

            Jenkins was clean. Huge thank you to tjenness for helping me work through the final questions (several of which spawned extra tickets).


            People

              Assignee: John Parejko
              Reporter: John Parejko
              Reviewers: Tim Jenness
              Watchers (10): Dino Bektesevic, Eric Bellm, Heather Kelly, James Chiang, Jim Bosch, John Parejko, John Swinbank, Kian-Tat Lim, Meredith Rawls, Tim Jenness
              Votes: 0
