Data Management / DM-15518

pipelines.lsst.io edition updates have been broken since 2018-07-22



      Description

      Since the d_2018_07_22 daily build, new editions for pipelines.lsst.io have not been published, even though the builds themselves are being uploaded. Edition resources are being created, but the build_url associated with these new /editions/ resources is null.

       

      This problem seems to be specific to pipelines.lsst.io, though additional diagnosis is needed to identify and fix the root cause.
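
      For reference, the symptom can be checked directly against the API. Here is a minimal sketch using the Python requests library; the edition ID is hypothetical:

      import requests

      # Fetch one of the affected edition resources (the ID is hypothetical).
      r = requests.get('https://keeper.lsst.codes/editions/1')
      r.raise_for_status()

      # For the affected editions, build_url is null (None) even though new
      # builds have been uploaded since d_2018_07_22.
      print(r.json()['build_url'])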


            Activity

            Jonathan Sick added a comment -

            In fact, the builds are not being registered as uploaded at all:

            » http get https://keeper.lsst.codes/builds/5744
            HTTP/1.1 200 OK
            Alt-Svc: clear
            Content-Length: 497
            Content-Type: application/json
            Date: Thu, 23 Aug 2018 19:01:24 GMT
            Server: nginx/1.9.12
            Via: 1.1 google
             
            {
                "bucket_name": "lsst-the-docs",
                "bucket_root_dir": "pipelines/builds/557",
                "date_created": "2018-07-22T15:41:08Z",
                "date_ended": null,
                "git_refs": [
                    "d_2018_07_22"
                ],
                "github_requester": null,
                "product_url": "https://keeper.lsst.codes/products/pipelines",
                "published_url": "https://pipelines.lsst.io/builds/557",
                "self_url": "https://keeper.lsst.codes/builds/5744",
                "slug": "557",
                "surrogate_key": "95044e405f1645fda14912f59e427a93",
                "uploaded": false
            }
            

            This explains why the editions are not being updated.

            Jonathan Sick added a comment -

            The build is in fact uploaded to S3, but manually marking the upload as complete is not working:

            » http --auth $TOKEN: patch https://keeper.lsst.codes/builds/5744 uploaded:=true
            HTTP/1.1 400 BAD REQUEST
            Alt-Svc: clear
            Content-Length: 143
            Content-Type: application/json
            Date: Thu, 23 Aug 2018 19:19:04 GMT
            Server: nginx/1.9.12
            Via: 1.1 google
             
            {
                "error": "bad request",
                "message": "This edition already has a pending rebuild, this request will not be accepted.",
                "status": 400
            }
            

            So there's an edition with a pending rebuild, and somehow that is blocking the build from being marked as uploaded.
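
            For illustration, the blocking logic presumably looks something like the sketch below. This is hypothetical code, not LTD Keeper's actual implementation; the Edition and Build models and the mark_uploaded function are made up, purely to show the failure mode, in which a pending_rebuild flag that is never cleared rejects every subsequent upload:

            from dataclasses import dataclass, field
            from typing import List


            @dataclass
            class Edition:
                slug: str
                pending_rebuild: bool = False


            @dataclass
            class Build:
                editions: List[Edition] = field(default_factory=list)
                uploaded: bool = False


            def mark_uploaded(build: Build) -> None:
                # If any tracking edition's pending_rebuild flag was never
                # cleared (e.g. because its rebuild task crashed), the PATCH
                # is rejected and the build can never be marked as uploaded.
                for edition in build.editions:
                    if edition.pending_rebuild:
                        raise RuntimeError(
                            'This edition already has a pending rebuild, '
                            'this request will not be accepted.')
                build.uploaded = True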

            This could be related to the work on weekly/nightly editions, which landed close to, but not exactly at, the same time:

            https://ltd-keeper.lsst.io/changelog.html

            Jonathan Sick added a comment -

            I reset pending_rebuild=False on the /weekly and /daily editions to see if that would clear the queue. In the worker logs I now see:

             
            2018-08-23 13:11:49.000 MST
            Traceback (most recent call last):
              File "/usr/local/lib/python3.6/site-packages/celery/app/trace.py", line 382, in trace_task
                R = retval = fun(*args, **kwargs)
              File "/usr/local/lib/python3.6/site-packages/keeper/celery.py", line 35, in __call__
                return TaskBase.__call__(self, *args, **kwargs)
              File "/usr/local/lib/python3.6/site-packages/celery/app/trace.py", line 641, in __protected_call__
                return self.run(*args, **kwargs)
              File "/usr/local/lib/python3.6/site-packages/keeper/tasks/editionrebuild.py", line 60, in rebuild_edition
                cache_control='no-cache')
              File "/usr/local/lib/python3.6/site-packages/keeper/s3.py", line 143, in copy_directory
                aws_access_key_id, aws_secret_access_key)
              File "/usr/local/lib/python3.6/site-packages/keeper/s3.py", line 62, in delete_directory
                Delete=delete_keys)
              File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 314, in _api_call
                return self._make_api_call(operation_name, kwargs)
              File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 612, in _make_api_call
                raise error_class(parsed_response, operation_name)
            botocore.exceptions.ClientError: An error occurred (MalformedXML) when calling the DeleteObjects operation: The XML you provided was not well-formed or did not validate against our published schema

            Jonathan Sick added a comment -

            Solution: the S3 DeleteObjects call that LTD Keeper was using to delete objects fails when given more than 1000 keys per request; S3 rejects larger requests with exactly the MalformedXML error seen above. The fix is to paginate, deleting at most 1000 objects per request.
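
            A minimal sketch of the paginated approach, assuming a boto3 S3 client; the delete_directory signature here is illustrative rather than LTD Keeper's actual keeper.s3 API:

            import boto3


            def delete_directory(bucket_name, root_path, aws_access_key_id,
                                 aws_secret_access_key):
                """Delete every object under root_path, issuing DeleteObjects
                requests of at most 1000 keys each (the S3 limit; exceeding
                it produces the MalformedXML error seen above).
                """
                s3 = boto3.client(
                    's3',
                    aws_access_key_id=aws_access_key_id,
                    aws_secret_access_key=aws_secret_access_key)

                # list_objects_v2 returns at most 1000 keys per page, so each
                # page maps directly onto one valid DeleteObjects request.
                paginator = s3.get_paginator('list_objects_v2')
                for page in paginator.paginate(Bucket=bucket_name,
                                               Prefix=root_path):
                    keys = [{'Key': obj['Key']}
                            for obj in page.get('Contents', [])]
                    if keys:
                        s3.delete_objects(Bucket=bucket_name,
                                          Delete={'Objects': keys})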


              People

              Assignee:
              Jonathan Sick
              Reporter:
              Jonathan Sick
              Watchers:
              Jonathan Sick
              Votes:
              0

