DM-30073 (Data Management)

Fix Portal retrieval of async queries

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: SUIT

      Description

      After SUIT 2.2.0 was deployed on data-int, Portal retrieval of async query results from TAP started failing with the error message "Failed to retrieve data".

          Activity

          rra Russ Allbery added a comment -

          When Portal tried to retrieve the results from the Google bucket, it got a redirect from Google and then an auth challenge that it couldn't reply to. This turned out to be because it was sending the Gafaelfawr bearer auth token to Google, which in turn was due to a change in 2.2.0 to add .lsst.codes to the auth domain. That in turn was because we have SUIT deployments on .lsst.codes environments where auth will be required.

          However, the async results bucket is named async-results.lsst.codes, so that meant SUIT started sending auth to that request, which Google didn't like. Confirmed this was the problem by overriding sso.req.auth.hosts on data-int.

          We don't use SUIT with TAP on .lsst.codes environments other than minikube and red-five (the rest are all T&S deployments), so I think the fix is to undo part of my change and remove .lsst.codes from the auth domain list again.
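          To illustrate the failure mode described above, here is a minimal sketch in Python. It is not Firefly's actual code. It assumes that an entry beginning with a dot in the auth host list is treated as a domain suffix and that any other entry must match the host exactly. The function names, the portal.example.org entry, and the result path are hypothetical. Under that assumption, adding .lsst.codes to the list makes a request to async-results.lsst.codes carry the bearer token, and removing it stops that.

```python
from typing import Iterable
from urllib.parse import urlsplit


def host_matches(host: str, pattern: str) -> bool:
    """Return True if host matches one auth-host pattern.

    Assumption for illustration: a pattern starting with "." is a domain
    suffix; any other pattern must equal the host exactly.
    """
    host = host.lower()
    pattern = pattern.lower()
    if pattern.startswith("."):
        return host.endswith(pattern)
    return host == pattern


def should_send_token(url: str, auth_hosts: Iterable[str]) -> bool:
    """Decide whether a bearer token would be attached to a request for url."""
    host = urlsplit(url).hostname or ""
    return any(host_matches(host, pattern) for pattern in auth_hosts)


# The bucket host name comes from the ticket; the path is hypothetical.
BUCKET_URL = "https://async-results.lsst.codes/results/result.xml"

# With the suffix entry, the token would also go to the results bucket.
leaks = should_send_token(BUCKET_URL, [".lsst.codes"])

# With only an exact (hypothetical) Portal host, the bucket request is unauthenticated.
safe = not should_send_token(BUCKET_URL, ["portal.example.org"])
```

          A suffix entry is convenient when many deployments share a domain, but it also covers every other host under that domain, which is what happened to the results bucket here.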

          rra Russ Allbery added a comment -

          https://github.com/lsst/suit/pull/27

          gpdf Gregory Dubois-Felsmann added a comment -

          I have a pending request on our side to improve messaging when queries fail. We have had so many breakdowns over auth problems that it would help to have some way to expose a little more information, so that when a user reports a problem during the Data Preview testing cycle this summer, we get a better clue right away.

          gpdf Gregory Dubois-Felsmann added a comment -

          I'm not sure I've parsed the double negative in "We don't use SUIT with TAP on .lsst.codes environments other than minikube and red-five" correctly.  Is there a Portal and/or TAP on those two?

          rra Russ Allbery added a comment -

          On minikube and red-five we install both SUIT and TAP, but they're just test/development environments, so it's not a serious problem if this one specific thing doesn't work there. All other .lsst.codes environments are T&S environments that do not run TAP.

          gpdf Gregory Dubois-Felsmann added a comment -

          For now this is OK, I guess.  But I'm uncomfortable with having the components there but not verified to be functional.  Other things could erode over time if we're not checking.

          rra Russ Allbery added a comment -

          We uncovered a new problem with retrieving errors from async queries. I fixed one misconfiguration of the authentication rules for the TAP ingress, but the problem still persists. It was introduced some time after SUIT 2.1.0.

          gpdf Gregory Dubois-Felsmann added a comment - - edited

          Russ Allbery has filed https://github.com/Caltech-IPAC/firefly/pull/1091 against the development branch of Firefly for a correction to the authorization flow for error documents.  It's received preliminary approval internally.

          I've created local ticket FIREFLY-780 to capture this and to ask that it be cherry-picked to the current release branch.

          gpdf Gregory Dubois-Felsmann added a comment -

          Fix is merged to Firefly release-2021.2.2. Testing locally at IPAC.

          rra Russ Allbery added a comment -

          We discovered my patch was not entirely correct, but we have come up with a new patch which is on a hotfix branch, and that appears to work correctly. This will be deployed at data-dev and data-int, and then will be rolled out once a formal release is made next week.

          rra Russ Allbery added a comment -

          The new point release has been deployed on all IDF environments. We were unable to reproduce some other instability problems we saw with TAP and are hoping that was an infrastructure problem. Next step is a real release and then to deploy that everywhere.

          gpdf Gregory Dubois-Felsmann added a comment -

          Revised fix is in Firefly release-2021.2.3, which is used in Portal/suit v2.3.3; test deployments are likely tomorrow. Tested against RSP and against unauthenticated (e.g., IRSA, CADC) TAP services.

          rra Russ Allbery added a comment -

          The debug build of the new release is now deployed at all of the IDF environments. Remaining work on this ticket is to update the charts and phalanx configuration for the final release when that's available.

          rra Russ Allbery added a comment -

          I'm going to track the deployment of the new Portal release in a separate story since it will be a while. The fixed development version is now deployed in both the IDF and NCSA, so closing this out.


            People

            Assignee:
            rra Russ Allbery
            Reporter:
            rra Russ Allbery
            Reviewers:
            Gregory Dubois-Felsmann
            Watchers:
            Gregory Dubois-Felsmann, Loi Ly, Russ Allbery
