Data Management / DM-14328

Publish verification jobs produced by the HSC reprocessing to SQuaSH


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: squash
    • Labels:
      None

      Description

      HSC weekly reprocessing is now producing verification jobs, located at

       /datasets/hsc/repo/rerun/RC/w_2018_17/DM-14055/validateDrp/matchedVisitMetrics/*/*/*json
      

      This ticket is to start a discussion of what is needed to send those results to SQuaSH.

      • dispatch_verify.py is used to send a verification job to SQuaSH. It has the --env option, which can be used to grab information from the environment and add it to the job metadata. That's used by SQuaSH to identify the dataset being processed, the ID of the run, URLs linking to additional information about the run, the stack version used, etc.

      For CI we defined a "jenkins" environment; here we need to create another environment option, named perhaps ldf. My initial suggestion for the environment variables is:

      • DATASET: name of the dataset being processed, e.g. HSC RC2
      • DATASET_REPO_URL: do we have a git lfs repo for the dataset?
      • RUN_ID: can we use the associated Jira ticket to identify this run? Following https://confluence.lsstcorp.org/display/DM/Reprocessing+of+the+HSC+RC+dataset it looks like a good idea. Is there a better identifier for the runs?
      • RUN_ID_URL: could be the corresponding Jira ticket URL
      • VERSION_TAG: the LSST stack version used, e.g. w_2017_14

      Once this new environment is created in dispatch_verify.py, an example command line to publish the results to SQuaSH would be:

      $ export DATASET="HSC RC2"
      $ export DATASET_REPO_URL=""
      $ export RUN_ID="DM-10084"
      $ export RUN_ID_URL="https://jira.lsstcorp.org/browse/DM-10084"
      $ export VERSION_TAG="w_2017_14"
       
      $ dispatch_verify.py --url https://squash-restful-api-demo.lsst.codes --user <squash user> --password <squash passwd> --env ldf --lsstsw lsstsw/ output/verify/job.json 
      
      

      Note that the above URL points to a demo instance of SQuaSH so we can test as needed without affecting the production instance.

      Here I am assuming we have an lsstsw stack installation and thus access to the manifest file with the versions of the stack packages. We can suppress the --lsstsw option, but it would be useful to carry this information forward so we can compare which stack packages changed from one weekly build to the next.

      • Simon Krughoff mentioned that we have one verification job per patch. I guess, as a start, we could publish results for one patch only? Later, one way to do that would be to add the patch ID as job metadata so that we can distinguish the jobs in SQuaSH (see the sketch below). We can also use dispatch_verify.py to combine multiple verification jobs into a single JSON if required (see https://sqr-019.lsst.io/#Post-processing-verification-jobs), but that wouldn't scale if we process many patches, which will be the case.
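
      A minimal sketch of that metadata-tagging step, using the Job API from lsst.verify; the file names and the patch value are illustrative, not an existing script:

      import json

      from lsst.verify import Job

      # Load a verification job produced by validate_drp.
      with open('matchedVisit_HSC-G.json') as f:
          job = Job.deserialize(**json.load(f))

      # Attach the patch identifier to the job metadata so per-patch
      # jobs can be told apart in SQuaSH (hypothetical patch id).
      job.meta.update({'patch': '5,5'})

      # Write the annotated job back for dispatch_verify.py to upload.
      job.write('matchedVisit_HSC-G-patched.json')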


            Activity

            hchiang2 Hsin-Fang Chiang added a comment -

            Thanks for starting this!

            Regarding the questions about the env info,

            • DATASET_REPO_URL: No, there is no git lfs repo corresponding to the dataset. Generally speaking, I don't envision future datasets we process having a git lfs repo either, because most are likely larger than we want to store in a git lfs repo. However, for the HSC-RC2 dataset we do have a Jira ticket, DM-11345, so that URL may be an option?

            • RUN_ID: In the Gen3 middleware there is the concept of a RUN_ID as well (see https://dmtn-056.lsst.io/grouping-and-provenance.html#run) and it looks like the concept maps to yours perfectly. That's great. Before that, using the JIRA ticket sounds like a good idea to me.

            About the last paragraph: HSC-RC2 covers 3 tracts, including 241 patches in total (in w_2018_17). I ran matchedVisitMetrics.py for each tract × filter combination, which summed to 16 matchedVisitMetrics.py jobs and thus 16 JSON files. Should I group them differently? We can treat each tract as a different sub-dataset of HSC-RC2, so I guess we want at least 3.
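
            For reference, a minimal sketch of how those per-tract, per-filter JSON files could be enumerated and grouped by tract, assuming the directory layout quoted in the description:

            import glob
            import os
            from collections import defaultdict

            # Layout: .../matchedVisitMetrics/<tract>/<filter>/matchedVisit_<filter>.json
            base = ('/datasets/hsc/repo/rerun/RC/w_2018_17/DM-14055/'
                    'validateDrp/matchedVisitMetrics')

            jobs_by_tract = defaultdict(list)
            for path in glob.glob(os.path.join(base, '*', '*', '*.json')):
                tract = path.split(os.sep)[-3]  # e.g. '9615'
                jobs_by_tract[tract].append(path)

            # One group per tract, e.g. three groups for HSC-RC2.
            for tract, paths in sorted(jobs_by_tract.items()):
                print(tract, len(paths), 'verification job files')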

            afausti Angelo Fausti added a comment - edited

            Great, let's use DM-11345 as the reference to the HSC-RC2 dataset and the JIRA ticket associated with the run as RUN_ID for now. Simon Krughoff I think, while SQuaSH does not handle the spatial information (i.e. multiple tracts), we could combine the filters for one tract into a single verification JSON and send that to SQuaSH (a rough sketch of such a combining step follows). I have no idea of the size of the resulting JSON at this point, but the SQuaSH API should be able to handle requests up to ~1GB https://github.com/lsst-sqre/squash-restful-api/blob/master/kubernetes/nginx/nginx.conf#L16
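
            As a rough illustration of that combining step, a sketch that merges per-filter job documents at the JSON level; it assumes the lsst.verify job schema has top-level 'measurements' and 'blobs' lists, and the file names are illustrative (dispatch_verify.py's own combining, described in SQR-019, would be the real mechanism):

            import json

            paths = ['matchedVisit_HSC-G.json', 'matchedVisit_HSC-R.json',
                     'matchedVisit_HSC-I.json']

            combined = None
            for path in paths:
                with open(path) as f:
                    doc = json.load(f)
                if combined is None:
                    combined = doc  # keep the first job's metrics, specs, and meta
                else:
                    # Concatenate the measurement and blob lists from later jobs.
                    combined['measurements'].extend(doc['measurements'])
                    combined['blobs'].extend(doc['blobs'])

            with open('matchedVisit_combined.json', 'w') as f:
                json.dump(combined, f)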

            afausti Angelo Fausti added a comment - edited

            So, before creating a new environment in dispatch_verify.py, I gave it a try using the "jenkins" environment variables:

             
            dispatch_verify.py --url https://squash-restful-api-demo.lsst.codes --user <squash user> --password <squash password> --env jenkins --ignore-lsstsw /datasets/hsc/repo/rerun/RC/w_2018_18/DM-14243/validateDrp/matchedVisitMetrics/9615/HSC-G/matchedVisit_HSC-G.json
             
            verify.bin.dispatchverify.main INFO: Loading /datasets/hsc/repo/rerun/RC/w_2018_18/DM-14243/validateDrp/matchedVisitMetrics/9615/HSC-G/matchedVisit_HSC-G.json
            verify.bin.dispatchverify.main INFO: Refreshing metric definitions from verify_metrics
            verify.bin.dispatchverify.main INFO: Inserting Jenkins CI environment metadata.
            verify.bin.dispatchverify.main INFO: Uploading Job JSON to https://squash-restful-api-demo.lsst.codes.
            verify.squash.get INFO: GET https://squash-restful-api-demo.lsst.codes status: 200
            verify.squash.post INFO: POST https://squash-restful-api-demo.lsst.codes/auth status: 200
            verify.squash.post INFO: POST https://squash-restful-api-demo.lsst.codes/job status: 202
            

            You can see the validate_drp results for the HSC-RC2 dataset at https://squash-demo.lsst.codes/dash/code_changes/

            Some findings:

            • I would not combine all the filters for a given tract as previously suggested; sending one tract and one filter each time seems OK, and we do the same in CI.
            • The drill-down plot for the AM1 metric takes a while to load, but it eventually does (one tract has enough data points to make the app unresponsive...).
            • lsst.verify relies on repos.yaml to get the information about the package versions. I ran it with --ignore-lsstsw because repos.yaml was not found in the shared stack installation. We have to figure out how to obtain that information so that the "Code changes" column gets filled.

            FileNotFoundError: [Errno 2] No such file or directory: '/ssd/lsstsw/stack/etc/repos.yaml'
            

            • It is important to keep in mind that the time displayed in SQuaSH corresponds to the timestamp of when the job was registered (not when it was executed). In CI this is not an issue since dispatch_verify runs right after the job finishes (a sketch of one possible workaround follows). Hsin-Fang Chiang do you think we can add the dispatch_verify step as part of the pipeline so that it is also automated?
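
            One possible workaround for the timestamp issue, sketched under the assumption that a custom metadata key is acceptable; the 'run_timestamp' key is hypothetical and SQuaSH would need a matching change to display it instead of the registration time:

            import datetime
            import json

            from lsst.verify import Job

            with open('matchedVisit_HSC-G.json') as f:
                job = Job.deserialize(**json.load(f))

            # Record when the job actually ran, not when it was uploaded.
            job.meta.update(
                {'run_timestamp': datetime.datetime.utcnow().isoformat() + 'Z'})
            job.write('matchedVisit_HSC-G.json')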
            hchiang2 Hsin-Fang Chiang added a comment -
            • The CmdLineTask framework stores a packages.pickle file with version information for the stack packages. It's the datasetType of "packages" in the Gen1/2 Butler. An example is at /datasets/hsc/repo/rerun/RC/w_2018_18/DM-14243/config/packages.pickle (a sketch for reading it follows this list). Is this useful for SQuaSH to gather the package version information?
            • Yes, I think I can add an additional dispatch_verify step after generating the JSON file.
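
            If useful, a minimal sketch for inspecting that file; it assumes the LSST stack is importable (the pickle holds an lsst.base Packages object) and that the object behaves like a mapping of package name to version:

            import pickle

            path = ('/datasets/hsc/repo/rerun/RC/w_2018_18/DM-14243/'
                    'config/packages.pickle')
            with open(path, 'rb') as f:
                packages = pickle.load(f)  # needs lsst.base on the Python path

            for name, version in sorted(packages.items()):
                print(name, version)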
            afausti Angelo Fausti added a comment -

            In DM-14357 we added the LDF execution environment option in dispatch_verify.py. Another change in progress is DM-14550, which implements a new option, --ignore-blobs, to skip the upload of data blobs to SQuaSH.

            afausti Angelo Fausti added a comment -

            Hsin-Fang Chiang we are ready to test the submission of validate_drp results from the HSC RC2 reprocessing to SQuaSH.

            I've added instructions on how to run dispatch_verify in the LDF environment here: https://confluence.lsstcorp.org/display/~hchiang2/validateDrp

            hchiang2 Hsin-Fang Chiang added a comment -

            Thanks a lot Angelo Fausti. I've submitted one tract of w_2018_22's validate_drp results from the HSC RC2 reprocessing to SQuaSH, and the 5 data points (for 5 filters) can be seen on the dashboard now. I'll add this to each biweekly run.

            afausti Angelo Fausti added a comment - edited

            I removed the child tasks related to visualization and interface changes in order to close this ticket, since we have data flowing from LDF to SQuaSH. See the results for the HSC RC2 dataset at https://squash.lsst.codes.


              People

              Assignee:
              afausti Angelo Fausti
              Reporter:
              afausti Angelo Fausti
              Watchers:
              Angelo Fausti, Hsin-Fang Chiang, John Parejko, John Swinbank, Jonathan Sick, Krzysztof Findeisen, Simon Krughoff

