  Data Management / DM-12604

Migrate SQuaSH data from the current production database to the new database schema

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Once we have the new service deployed, we need to migrate the data from the current production database (squash-db) to the new QC Tier 0 database schema implemented in DM-12193.

      The task consists of grabbing the jobs from the current production API at https://squash-api.lsst.codes/jobs/<id>, converting them to the new verification job format, and using the new API at https://squash-restful-api-demo.lsst.codes/job to restore these jobs.

      The result should be a script that helps us automate that task, so that we can repeat it easily and restore the current production database.

      Assuming that the current production API provides all the data we need, i.e. if no change in the API is required, the real work is to write the piece of code that converts the jobs to the new verification job format.

      This notebook provides an example of how to use the new API and POST verification jobs.
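      As a rough outline, the migration script could follow the minimal grab-convert-restore sketch below. The endpoints come from this ticket, but the request/response details, authentication, and the list of job IDs to migrate are assumptions, and convert_job is a placeholder that the transformation steps below would fill in.

      import requests

      OLD_API = "https://squash-api.lsst.codes"
      NEW_API = "https://squash-restful-api-demo.lsst.codes"

      def convert_job(old_job):
          # Placeholder: the measurement, blob and meta transformations are
          # sketched under 1), 2) and 3) below.
          raise NotImplementedError

      def migrate_job(job_id):
          # Grab the job from the current production API.
          response = requests.get("{}/jobs/{}".format(OLD_API, job_id))
          response.raise_for_status()
          old_job = response.json()

          # Convert it to the new verification job format and restore it
          # through the new API.
          new_job = convert_job(old_job)
          response = requests.post("{}/job".format(NEW_API), json=new_job)
          response.raise_for_status()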

      Transformation from the old job format to the verification job format.

      You can check the new format of the verification jobs in the above notebook. It has:

      data.keys()
      dict_keys(['metrics', 'measurements', 'meta', 'blobs', 'specs'])
      

      You don't have to worry about 'metrics' and 'specs'; this info will not be restored. It is completely different in the new format, and the only thing we need is to get the metric name right, which is the key for the metric definitions and specifications.

      1) data['measurements']

      OLD FORMAT:

      { "metric": "PA1",
        "value": 14.9064428565398,
        "metadata": null}
      

      NEW FORMAT:

       
      {'blob_refs': ['42ddf00268f244748d33f6dbae219371',
         'e92c8be5a34c45d1aaad00d5913d27c9',
         'd34a77ba0b234f6b95dc88cfccacccc6',
         '39e2da347ae84d01aa369f5f4f23d9ef'],
        'identifier': 'd35daa0952294deda56d4adcac6ca06c',
        'metric': 'validate_drp.AF3_stretch',
        'unit': '%',
        'value': 1.35345435345}
      
      

      1.1) Metrics are now prefixed by the verification package. All metrics in production are for the validate_drp package, so you have to rename the metric name to validate_drp.<metric>.

      1.2) Include the unit field; its value can be discovered from https://squash-api-dev.lsst.codes/metrics/<metric name>.

      1.3) I don't use the identifier field, so you can include it and leave its value null.

      1.4) Include the blob_refs field; the list of blob identifiers can be obtained from

      data['measurements']['metadata']['blobs']
      

      or leave it null if there's no blob associated with the job. Note that blob_refs is just a list in the new format; a sketch that combines 1.1) to 1.4) follows the OLD FORMAT example below.

      OLD FORMAT:

      "measurements": [
              {
                  "metric": "AM1",
                  "value": 10.7817351689061,
                  "metadata": "{'blobs': {'astromModel': 'fb1a324139da49b99c8383c5837ae862', 'photomModel': '63e56a6c8a9a4ae398d4ee69caeec748', 'matchedDataset': '77cfbfdc61ad4d19a8fee7cf468edbae'}
      
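      Putting 1.1) to 1.4) together, here is a minimal sketch of the per-measurement conversion. It assumes the old metadata field is a string holding a Python-literal dict (as in the OLD FORMAT example above), hence the ast.literal_eval, that the metrics endpoint returns a JSON object with a unit key, and that it accepts the prefixed metric name; those details, and the caching, are assumptions.

      import ast
      import requests

      METRICS_API = "https://squash-api-dev.lsst.codes/metrics"
      _unit_cache = {}

      def get_unit(metric_name):
          # Look up the unit for a metric (assumes the endpoint returns a JSON
          # object with a 'unit' key); results are cached to avoid repeated calls.
          if metric_name not in _unit_cache:
              response = requests.get("{}/{}".format(METRICS_API, metric_name))
              response.raise_for_status()
              _unit_cache[metric_name] = response.json().get("unit")
          return _unit_cache[metric_name]

      def convert_measurement(old_measurement):
          # 1.1) Prefix the metric name with the verification package.
          metric = "validate_drp." + old_measurement["metric"]

          # 1.4) Collect the blob identifiers from the old metadata, if any.
          blob_refs = None
          metadata = old_measurement.get("metadata")
          if metadata:
              # The old metadata field is a string holding a Python-literal dict.
              metadata = ast.literal_eval(metadata)
              blobs = metadata.get("blobs") or {}
              blob_refs = list(blobs.values()) or None

          return {
              "metric": metric,
              "value": old_measurement["value"],
              "unit": get_unit(metric),   # 1.2) unit from the metrics endpoint
              "identifier": None,         # 1.3) not used, leave it null
              "blob_refs": blob_refs,
          }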

      2) data['blobs']

      The blob format has not changed; blobs always have the following keys:

      data['blobs'][0].keys()
      dict_keys(['identifier', 'data', 'name'])
      

      However, the extras and parameters fields in the old format:

      measurements['metadata']['extras']
      measurements['metadata']['parameters']
      

      became a single blob in the new format. So you have to merge extras and parameters into a single object that goes into the data key in

      data['blobs'][0].keys()
      dict_keys(['identifier', 'data', 'name'])
      

      You have to generate a new UUID-4 based identifier for the blob using

      import uuid
      identifier = uuid.uuid4().hex
      

      Finally, the blob name is the name of the associated metric (see 1.1 above).

      The new blob must be added to the data['blobs'] list and its identifier added to the blob_refs for this measurement (see 1.4 above).

      An example of data['blobs'] in the new format is available in the notebook mentioned above.
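      Here is a minimal sketch of the extras/parameters merge described in 2). make_extras_blob is a hypothetical helper; it assumes metadata is the parsed old-format measurement metadata dict from step 1) and metric_name is the prefixed name from 1.1).

      import uuid

      def make_extras_blob(metadata, metric_name):
          # Merge the old 'extras' and 'parameters' into the 'data' key of a
          # single new-format blob.
          data = {}
          data.update(metadata.get("extras") or {})
          data.update(metadata.get("parameters") or {})
          if not data:
              return None

          return {
              "identifier": uuid.uuid4().hex,  # new UUID-4 based identifier
              "name": metric_name,             # blob name = associated metric name
              "data": data,
          }

      The returned blob would then be appended to data['blobs'] and its identifier appended to that measurement's blob_refs list (see 1.4 above).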

      3) data['meta']

      These fields in the OLD FORMAT:

      "ci_id": "1085",
      "ci_name": "validate_drp",
      "ci_dataset": "hsc",
      "ci_label": "centos-7",
      "date": "2017-10-13T11:58:30Z",
      "ci_url": "https://ci.lsst.codes/job/sqre/job/validate_drp/dataset=hsc,label=centos-7,python=py3/1085/",
      "status": 0,
      

      now go to:

       data['meta']['env']
      

      You have to add a new field, though:

      data['meta']['env']['env_name'] = 'jenkins'
      
      

      with the fixed value jenkins for all jobs.

      Packages are now part of the metadata in the NEW FORMAT:

       data['meta']['packages']
      

      but the name field also became a key:

       
      { 'apr': {'eups_version': '1.5.2',
        'git_branch': None,
        'git_sha': '39b3212aa46217e4f485b02496381907da8b8d7a',
        'git_url': 'https://github.com/lsst/apr.git',
        'name': 'apr'},
        'apr_util': {'eups_version': '1.5.4',
        'git_branch': None,
        'git_sha': '45d74ccc249484be4fdba93fe2f66aafb5869335',
        'git_url': 'https://github.com/lsst/apr_util.git',
        'name': 'apr_util'}, ... }
       
      
      

      We should preserve the value of the 'date' field, which corresponds to the timestamp when the job was originally created.
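      Here is a minimal sketch of the data['meta'] transformation described in 3). The set of fields moved into env follows the OLD FORMAT list above, but where exactly those fields and the package list sit in the old job payload is an assumption and may need adjusting.

      ENV_FIELDS = ("ci_id", "ci_name", "ci_dataset", "ci_label",
                    "date", "ci_url", "status")

      def convert_meta(old_job):
          # The old CI fields (including the original 'date', which we preserve)
          # move under meta['env'], plus the fixed env_name for all Jenkins jobs.
          env = {field: old_job.get(field) for field in ENV_FIELDS}
          env["env_name"] = "jenkins"

          # Packages become a mapping keyed by package name (assumes the old
          # format provides a list of package dicts, each carrying a 'name').
          packages = {pkg["name"]: pkg for pkg in old_job.get("packages", [])}

          return {"env": env, "packages": packages}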

        Attachments

          Issue Links

            Activity

            athornton Adam Thornton added a comment -

            The ETL tool is basically done.  Now waiting for Angelo Fausti to do a couple more server side tweaks, then add a status-checker to the loader, then test that, then do a bulk run to actually migrate the data.

            athornton Adam Thornton added a comment -

            Bulk load in progress.  Will hopefully be done by morning.

            athornton Adam Thornton added a comment -

            Data imported.  Closing ticket.

             


              People

              • Assignee: athornton Adam Thornton
              • Reporter: afausti Angelo Fausti
              • Watchers (2): Adam Thornton, Angelo Fausti
              • Votes: 0

