Data Management / DM-30316

Write UUID migration script for sqlite


    Details

    • Story Points:
      6
    • Team:
      Architecture
    • Urgent?:
      No

      Description

      In DM-30186 a migration script is being written to convert Postgres registries from integer IDs to UUIDs.

      That work indicated that sqlite wouldn't support the same approach. Given that sqlite is file-based, a much easier scheme is to migrate by creating an entirely new cloned registry with the UUIDs.

      This can be implemented as something like:

      • Retrieve the dimensions config.
      • Export dataset types / collections / dimension records.
      • Create a new repo with the same configuration and the same dimensions config.
      • Import the definitions.
      • Run Butler.transfer_from to migrate all the datasets.
      • Deal with calibration and tagged collection membership.
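The clone-and-reassign idea behind the steps above can be sketched in miniature. This is a toy model, not the daf_butler implementation: it uses plain sqlite3 with a hypothetical two-column `dataset` table to show the core move of copying every row into a fresh registry while replacing integer IDs with UUIDs, and keeping the old-to-new ID map so that membership tables can be remapped afterwards.

```python
import sqlite3
import uuid


def clone_with_uuids(src: sqlite3.Connection, dst: sqlite3.Connection) -> dict:
    """Copy a toy `dataset` table into dst, replacing integer IDs with UUIDs.

    Returns the old-id -> new-uuid mapping, which is what later steps
    (tagged/calibration collection membership) would need to remap rows
    that reference dataset_id.
    """
    dst.execute("CREATE TABLE dataset (id TEXT PRIMARY KEY, name TEXT)")
    id_map = {}
    for old_id, name in src.execute("SELECT id, name FROM dataset ORDER BY id"):
        new_id = str(uuid.uuid4())
        id_map[old_id] = new_id
        dst.execute("INSERT INTO dataset VALUES (?, ?)", (new_id, name))
    dst.commit()
    return id_map


# Build a toy "old" registry with integer IDs.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE dataset (id INTEGER PRIMARY KEY, name TEXT)")
src.executemany("INSERT INTO dataset VALUES (?, ?)", [(1, "raw"), (2, "calexp")])

# Clone it into a fresh registry keyed by UUIDs.
dst = sqlite3.connect(":memory:")
mapping = clone_with_uuids(src, dst)
```

In the real migration the heavy lifting (datastore records, dimension records, collection chains) is done by export/import plus Butler.transfer_from rather than raw SQL; the sketch only illustrates why a full clone is simpler for a file-based registry than an in-place schema change.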


            Activity

            Tim Jenness added a comment:

            Andy Salnikov, Jim Bosch, I've made an initial attempt at this in a draft pull request. It seems to work, although there are some questions. Can you give it a quick pre-review to see if I've missed anything obvious? I'm not sure how to easily check that the two sqlite registries are the same (ignoring the dataset_id) since some tables have different names.

            We should probably consider whether there is code here, and in the execution butler code in pipe_base, that we should move to daf_butler.

            I assume this subcommand should be moved to daf_butler_migrate.

            Tim Jenness added a comment:

            I have moved the command to daf_butler_migrate and created a pull request. It seems to work: I can convert the pipelines_check repo, and with some shell shenanigans to strip the dataset_id I showed that the datasets are the same in each.
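The "shell shenanigans" are not shown in the ticket, but the idea is presumably to drop the ID column from each dump before diffing. A minimal sketch of that post-processing, assuming a whitespace-separated dump whose first field is the dataset_id (the actual butler query-datasets layout may differ):

```python
def strip_first_column(dump: str) -> str:
    """Drop the leading dataset_id column from each row of a
    whitespace-separated dump, so two dumps that differ only in
    dataset IDs compare equal."""
    rows = []
    for line in dump.splitlines():
        parts = line.split(None, 1)
        # Keep the line as-is if there is nothing after the ID column.
        rows.append(parts[1] if len(parts) == 2 else line)
    return "\n".join(rows)


# Toy dumps: same datasets, different IDs (integer vs UUID-like).
old = "1 raw visit=903334\n2 calexp visit=903334"
new = "a3f1 raw visit=903334\nb712 calexp visit=903334"
```

With this, `strip_first_column(old) == strip_first_column(new)` holds even though the raw dumps differ, which is the property the comparison relies on.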

            I copy over the dimension universe, collections, dimension records, all the datasets, and fill the TAGGED and CALIBRATION collections. It's hard to test that the sqlite registries are identical because the dynamic tables have different names and the dataset_id is different.

            Andy Salnikov, did you have anything for postgres that could compare before and after when you did the UUID change? (In your case the migration was in place, so the dynamic tables would not have been recreated.)

            Andy Salnikov added a comment:

            I did compare before/after results, but only for a small subset of datasets, using butler query-datasets to dump them. The migrate_uuid.sh script has the code which dumps, strips dataset IDs, and compares. If you want to do a table-level comparison you need to know how to remap the table names; those names are in the dataset_type table.
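The table-level approach suggested here can be sketched as follows. This is a toy schema, not the real registry: it assumes a `dataset_type` table that records, per dataset type, the name of its dynamic tags table (the real schema stores similar bookkeeping, with different column names), and it compares the dynamic tables by content while ignoring the dataset_id column:

```python
import sqlite3


def dynamic_tables(conn):
    """Map dataset type name -> dynamic table name, as recorded in a
    toy dataset_type table."""
    return dict(conn.execute("SELECT name, tag_association_table FROM dataset_type"))


def compare_registries(a, b, value_columns):
    """Compare the dynamic tables of two registries by content,
    remapping table names and ignoring dataset_id."""
    ta, tb = dynamic_tables(a), dynamic_tables(b)
    if set(ta) != set(tb):
        return False
    cols = ", ".join(value_columns)
    for dstype in ta:
        rows_a = sorted(a.execute(f"SELECT {cols} FROM {ta[dstype]}"))
        rows_b = sorted(b.execute(f"SELECT {cols} FROM {tb[dstype]}"))
        if rows_a != rows_b:
            return False
    return True


def make_registry(table_name, rows):
    """Build a toy registry whose dynamic table name is arbitrary,
    as it is after a clone."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dataset_type (name TEXT, tag_association_table TEXT)")
    conn.execute("INSERT INTO dataset_type VALUES ('raw', ?)", (table_name,))
    conn.execute(f"CREATE TABLE {table_name} (dataset_id TEXT, visit INTEGER)")
    conn.executemany(f"INSERT INTO {table_name} VALUES (?, ?)", rows)
    return conn


# Same content, different dynamic table names and different dataset IDs.
old = make_registry("dataset_tags_00001", [("1", 903334), ("2", 903336)])
new = make_registry("dataset_tags_00007", [("uuid-a", 903334), ("uuid-b", 903336)])
```

Here `compare_registries(old, new, ["visit"])` is true despite the renamed table and rewritten IDs, which is exactly the equivalence the migration needs to verify.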

            Tim Jenness added a comment:

            Jim Bosch, not to worry if you drop out unexpectedly, but I'd like you to take a quick look at the code in daf_butler_migrate in case you think I've missed something obvious.

            I have converted a couple of repos and run Andy Salnikov's comparison script (it just checks that all the datasets are there) and everything works.

            I'm wondering if I need to write a more comprehensive comparison tool in Python that compares datasets and checks all collections (chained and calibration) and dimension records. The main problem is that if I've missed something in the migration, I would probably miss the same thing in my comparison tool.

            Dumping the sqlite contents with .dump is tricky to compare because the dynamic tables change their names (converting a UUID registry to a UUID registry should preserve UUIDs).


              People

              Assignee:
              Tim Jenness
              Reporter:
              Tim Jenness
              Reviewers:
              Jim Bosch
              Watchers:
              Andy Salnikov, Jim Bosch, Tim Jenness

