Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-27008

Update Functor implementation to use gen3 butler/parquet access

    XMLWordPrintable

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Story Points:
      8
    • Epic Link:
    • Team:
      Data Release Production
    • Urgent?:
      No

      Description

      Gen3 parquet access is fundamentally different from gen2.  In gen2, a butler.get  call on a parquet dataset call returned a lazy ParquetTable object that doesn't load any data until requested by .getDataFrame at which point specific columns can be requested.  However, in gen3, a butler.get call returns data directly, and specific columns are requested at this time.  

      The Functor implementation currently in pipe_tasks is based on the gen2 ParquetTable object, as it takes one of those as the input to its _call_ method.  Updating for gen3 butler will need to change this.  

      A related concern is that one of the reasons for the gen2 ParquetTable implementation to be the way it was was to not have to read parquet metadata with every data load.  Part of this ticket should be to explore the degree to which this metadata loading affects practical use cases, and if anything needs to be changed in the gen3 parquet-interaction layer.

        Attachments

          Issue Links

            Activity

            Hide
            tjenness Tim Jenness added a comment -

            A fundamental difference between gen2 and gen3 is that gen2 is assuming the file is local but in gen3 it might be in an S3 object store and by the time you get the data from it the file itself has been deleted locally. There might be local caching of that file in a future datastore implementation but the caller code can't assume that will be there.

            Show
            tjenness Tim Jenness added a comment - A fundamental difference between gen2 and gen3 is that gen2 is assuming the file is local but in gen3 it might be in an S3 object store and by the time you get the data from it the file itself has been deleted locally. There might be local caching of that file in a future datastore implementation but the caller code can't assume that will be there.
            Hide
            jbosch Jim Bosch added a comment -

            I think for Tim Morton [X]'s purposes, he can assume that we will at least cache full parquet files on local disk after we fetch them from S3 - someday. We're obviously not doing that yet, and I hope that's okay because these use cases aren't S3 yet, but it's obviously doable (in the sense that it's a use case we considered when developing the architecture), and there's just no way we'll be able to make things performant in S3 without it.

            And I could see us having to go further - I know even on local (or at least shared FS) use Tim Morton [X] has to rewrite our "normal" parquet files to do better partitioning, and I think it's conceivable that we might want a custom formatter (or even datastore) that could do fetching and caching at the column level. That's well beyond the scope of this ticket, but it's the sort of thing we should keep in mind.

            Show
            jbosch Jim Bosch added a comment - I think for Tim Morton [X] 's purposes, he can assume that we will at least cache full parquet files on local disk after we fetch them from S3 - someday. We're obviously not doing that yet, and I hope that's okay because these use cases aren't S3 yet, but it's obviously doable (in the sense that it's a use case we considered when developing the architecture), and there's just no way we'll be able to make things performant in S3 without it. And I could see us having to go further - I know even on local (or at least shared FS) use Tim Morton [X] has to rewrite our "normal" parquet files to do better partitioning, and I think it's conceivable that we might want a custom formatter (or even datastore) that could do fetching and caching at the column level. That's well beyond the scope of this ticket, but it's the sort of thing we should keep in mind.
            Hide
            tmorton Tim Morton [X] (Inactive) added a comment -

            Jim Bosch I have this working on a small in-notebook demo (it is indeed slower than the gen2 version for the reasons mentioned above, though whether this is a problem or not can be determined later).  Can either you or Tim Jenness point me to some pattern for me to be able to add a gen3 test to test_functors.py?  Is there some pattern for me to generate a DeferredDatasetHandle of a fake parquet/DataFrame dataset to use in the testing here?

            Show
            tmorton Tim Morton [X] (Inactive) added a comment - Jim Bosch I have this working on a small in-notebook demo (it is indeed slower than the gen2 version for the reasons mentioned above, though whether this is a problem or not can be determined later).  Can either you or Tim Jenness point me to some pattern for me to be able to add a gen3 test to test_functors.py ?  Is there some pattern for me to generate a DeferredDatasetHandle of a fake parquet/DataFrame dataset to use in the testing here?
            Hide
            jbosch Jim Bosch added a comment -

            There's a test in daf_butler that sets up a repo with some DataFrame datasets, and gets and puts them:

            https://github.com/lsst/daf_butler/blob/master/tests/test_parquet.py

            If you steal some of that to set up the repo, you should be able to use butler.getDeferred to get a DeferredDatasetHandle.

             

            Show
            jbosch Jim Bosch added a comment - There's a test in daf_butler that sets up a repo with some DataFrame datasets, and gets and puts them: https://github.com/lsst/daf_butler/blob/master/tests/test_parquet.py If you steal some of that to set up the repo, you should be able to use butler.getDeferred to get a DeferredDatasetHandle .  
            Show
            tmorton Tim Morton [X] (Inactive) added a comment - Jenkins run started  https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/33055/pipeline
            Show
            tmorton Tim Morton [X] (Inactive) added a comment - New jenkins:  https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/33057/pipeline
            Hide
            tmorton Tim Morton [X] (Inactive) added a comment -

            New jenkins (previous still running, but this is one is off the current PR, which I re-did to separate the black formatting commit from the meaningful changes): https://ci.lsst.codes/job/stack-os-matrix/33058/display/redirect

            Show
            tmorton Tim Morton [X] (Inactive) added a comment - New jenkins (previous still running, but this is one is off the current PR, which I re-did to separate the black formatting commit from the meaningful changes):  https://ci.lsst.codes/job/stack-os-matrix/33058/display/redirect
            Hide
            tmorton Tim Morton [X] (Inactive) added a comment -

            Latest jenkins, after un-blacking and adding some additional documentation and one small code tweak:

            https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/33159/pipeline

            Show
            tmorton Tim Morton [X] (Inactive) added a comment - Latest jenkins, after un-blacking and adding some additional documentation and one small code tweak: https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/33159/pipeline
            Show
            tmorton Tim Morton [X] (Inactive) added a comment - final jenkins: https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/33269/pipeline
            Hide
            ktl Kian-Tat Lim added a comment -
            Show
            ktl Kian-Tat Lim added a comment - We are reverting the pipe_tasks PR due to problems caused in ci_hsc. See https://ci.lsst.codes/job/stack-os-matrix/33281/artifact/a138e20582/lsstsw/build/ci_hsc_gen2/_build.log and the conversation at https://lsstc.slack.com/archives/C2NCSTY3A/p1608264413121800
            Show
            ktl Kian-Tat Lim added a comment - New Jenkins run with ci_hsc included is https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/33286/pipeline/
            Hide
            ktl Kian-Tat Lim added a comment -

            And it passed, but I missed the nightly — which will therefore have the code in it.

            Show
            ktl Kian-Tat Lim added a comment - And it passed, but I missed the nightly — which will therefore have the code in it.
            Hide
            tmorton Tim Morton [X] (Inactive) added a comment -

            OK, sorry about this. I'm guessing this error comes from the fact that in the old gen2 parquet interface we silently ignored the request of columns that didn't exist, and something about my updates must have broken that. That is, I'm assuming the part of the ci_hsc run that broke was using gen2, right? My understanding is that in the gen3 parquet interface we are supposed to raise upon request of non-existent columns.

            Show
            tmorton Tim Morton [X] (Inactive) added a comment - OK, sorry about this. I'm guessing this error comes from the fact that in the old gen2 parquet interface we silently ignored the request of columns that didn't exist, and something about my updates must have broken that. That is, I'm assuming the part of the ci_hsc run that broke was using gen2, right? My understanding is that in the gen3 parquet interface we are supposed to raise upon request of non-existent columns.
            Hide
            ktl Kian-Tat Lim added a comment - - edited

            Yes, the failure was in the gen2 command transformObjectCatalog.py .../ci_hsc_gen2/DATA --rerun ci_hsc --id tract=0 patch=5,4 --doraise

            Show
            ktl Kian-Tat Lim added a comment - - edited Yes, the failure was in the gen2 command transformObjectCatalog.py .../ci_hsc_gen2/DATA --rerun ci_hsc --id tract=0 patch=5,4 --doraise
            Hide
            tmorton Tim Morton [X] (Inactive) added a comment -

            Thanks-- sounds like pipe_tasks should test for this in test_parquet.

            Show
            tmorton Tim Morton [X] (Inactive) added a comment - Thanks-- sounds like pipe_tasks should test for this in test_parquet.
            Hide
            lauren Lauren MacArthur added a comment -

            I was wondering if some test coverage could be added that would’ve caught this 

            Show
            lauren Lauren MacArthur added a comment - I was wondering if some test coverage could be added that would’ve caught this 
            Hide
            tmorton Tim Morton [X] (Inactive) added a comment -

            Hi all-- The problem was one offending line: a raise left over in a try/except block that I had put in to help debug and forgotten to take out. Successful jenkins run (including ci_hcs) here: https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/33300/pipeline.

            Show
            tmorton Tim Morton [X] (Inactive) added a comment - Hi all-- The problem was one offending line: a raise left over in a try/except block that I had put in to help debug and forgotten to take out. Successful jenkins run (including ci_hcs) here: https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/33300/pipeline .
            Hide
            tmorton Tim Morton [X] (Inactive) added a comment -

            Lauren MacArthur would you mind "reviewing" this? I don't think you need to look at any code (unless you would like to).

            Show
            tmorton Tim Morton [X] (Inactive) added a comment - Lauren MacArthur would you mind "reviewing" this? I don't think you need to look at any code (unless you would like to).
            Hide
            lauren Lauren MacArthur added a comment -

            We’re you able to find a way to add to the test coverage that would’ve caught this (or is it really that we need to rely on ci_hsc)?

            Show
            lauren Lauren MacArthur added a comment - We’re you able to find a way to add to the test coverage that would’ve caught this (or is it really that we need to rely on ci_hsc )?
            Hide
            tmorton Tim Morton [X] (Inactive) added a comment -

            Yes, good idea. I've added a test that catches the issue (and, incidentally, another bug that also got caught by the same test!) in the latest PR (jenkins running again at https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/33302/pipeline).

            Show
            tmorton Tim Morton [X] (Inactive) added a comment - Yes, good idea. I've added a test that catches the issue (and, incidentally, another bug that also got caught by the same test!) in the latest PR (jenkins running again at https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/33302/pipeline ).
            Hide
            yusra Yusra AlSayyad added a comment -

            That's great you added a unit test.

            To Lauren MacArthur's point, there is test coverage for commandline|pipelineTasks: ci_hsc. We are expected to run ci_hsc in Jenkins (esp for tickets that affect DRP) before merging.

            Show
            yusra Yusra AlSayyad added a comment - That's great you added a unit test. To Lauren MacArthur 's point, there is test coverage for commandline|pipelineTasks: ci_hsc . We are expected to run ci_hsc in Jenkins (esp for tickets that affect DRP) before merging.
            Hide
            tmorton Tim Morton [X] (Inactive) added a comment -

            Thanks, Yusra AlSayyad. I'll remember to add ci_hsc to Jenkins in the future. I don't know if Lauren MacArthur is available today or not, but if either of you is able to mark this reviewed, then I can merge today.

            Show
            tmorton Tim Morton [X] (Inactive) added a comment - Thanks, Yusra AlSayyad . I'll remember to add ci_hsc to Jenkins in the future. I don't know if Lauren MacArthur is available today or not, but if either of you is able to mark this reviewed, then I can merge today.
            Hide
            lauren Lauren MacArthur added a comment -

            I don't have time in the next few days to actually look over the code, so any review today would simply consist of me confirming that I saw the successful jenkins + ci_hsc.  I'm not entirely sure that's adequate, but might be given the previous review?

            Show
            lauren Lauren MacArthur added a comment - I don't have time in the next few days to actually look over the code, so any review today would simply consist of me confirming that I saw the successful jenkins + ci_hsc .  I'm not entirely sure that's adequate, but might be given the previous review?
            Hide
            tmorton Tim Morton [X] (Inactive) added a comment -

            I would guess so? Since Sophie Reed already reviewed the first time, and the re-review was just because ci_hsc was failing, I would guess that's ok?

            Show
            tmorton Tim Morton [X] (Inactive) added a comment - I would guess so? Since Sophie Reed already reviewed the first time, and the re-review was just because ci_hsc was failing, I would guess that's ok?

              People

              Assignee:
              tmorton Tim Morton [X] (Inactive)
              Reporter:
              tmorton Tim Morton [X] (Inactive)
              Reviewers:
              Lauren MacArthur
              Watchers:
              Jim Bosch, Kian-Tat Lim, Lauren MacArthur, Sophie Reed, Tim Jenness, Tim Morton [X] (Inactive), Yusra AlSayyad
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  Jenkins

                  No builds found.