Details
Type: Story
Status: Done
Resolution: Done
Fix Version/s: None
Component/s: None
Labels: None
Story Points: 8
Epic Link:
Team: Data Release Production
Urgent?: No
Description
Gen3 parquet access is fundamentally different from gen2. In gen2, a butler.get call on a parquet dataset returned a lazy ParquetTable object that did not load any data until it was requested via .getDataFrame, at which point specific columns could be requested. In gen3, however, a butler.get call returns the data directly, and specific columns must be requested at that time.
The Functor implementation currently in pipe_tasks is based on the gen2 ParquetTable object: its __call__ method takes one as its input. Updating to the gen3 butler will require changing this.
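One possible shape for that change, sketched here with stand-in classes rather than the real pipe_tasks or butler APIs: each functor declares the columns it needs, and a small helper turns either a lazy table (gen2 style) or a column-selecting loader (gen3 style) into in-memory data before the computation runs. LazyTable, load_columns, ColumnRatio, and the dict-of-lists data layout are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List, Sequence

Columns = Dict[str, List[float]]


class LazyTable:
    """Stand-in for a gen2-style lazy table: nothing is read until asked for."""

    def __init__(self, data: Columns):
        self._data = data
        self.reads = 0  # number of times data was actually materialized

    def to_columns(self, columns: Sequence[str]) -> Columns:
        self.reads += 1
        return {name: list(self._data[name]) for name in columns}


def load_columns(source, columns: Sequence[str]) -> Columns:
    """Resolve any supported source into in-memory columns.

    `source` may be a LazyTable (gen2 style), a callable that takes a list
    of column names and returns the data directly (gen3 style), or a dict
    that is already in memory.
    """
    if isinstance(source, LazyTable):
        return source.to_columns(columns)
    if callable(source):
        return source(columns)
    missing = [name for name in columns if name not in source]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    return {name: source[name] for name in columns}


class ColumnRatio:
    """Hypothetical functor: element-wise ratio of two columns."""

    def __init__(self, numerator: str, denominator: str):
        self.numerator = numerator
        self.denominator = denominator

    @property
    def columns(self) -> List[str]:
        # Declared up front so the caller can request only these columns.
        return [self.numerator, self.denominator]

    def __call__(self, source) -> List[float]:
        data = load_columns(source, self.columns)
        return [n / d for n, d in zip(data[self.numerator], data[self.denominator])]


raw = {"flux": [10.0, 20.0], "fluxErr": [2.0, 4.0], "unused": [0.0, 0.0]}
snr = ColumnRatio("flux", "fluxErr")

lazy = LazyTable(raw)
gen2_result = snr(lazy)  # the data is materialized inside the functor

gen3_loader: Callable[[Sequence[str]], Columns] = lambda cols: {c: raw[c] for c in cols}
gen3_result = snr(gen3_loader)  # the loader is handed the column list directly
```

Because the functor only ever sees in-memory columns, the same computation runs against either access pattern, and the column list it declares is what a gen3-style request would pass along at load time.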
A related concern: one reason the gen2 ParquetTable was implemented the way it was is to avoid reading the parquet metadata with every data load. Part of this ticket should be to explore how much this metadata loading affects practical use cases, and whether anything needs to change in the gen3 parquet-interaction layer.
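A minimal sketch of how that exploration could be set up, assuming the metadata read and the column read can each be wrapped in a zero-argument callable. The helper names and the stand-in workloads below are hypothetical; in practice the two callables would wrap the real metadata-only read and the real column load for a representative dataset.

```python
import statistics
import time
from typing import Callable, Dict


def median_seconds(fn: Callable[[], object], repeats: int = 5) -> float:
    """Median wall-clock time, in seconds, over `repeats` calls of fn()."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def metadata_overhead(
    read_metadata: Callable[[], object],
    read_columns: Callable[[], object],
    repeats: int = 5,
) -> Dict[str, float]:
    """Compare the cost of a metadata read against the cost of a column read."""
    meta = median_seconds(read_metadata, repeats)
    data = median_seconds(read_columns, repeats)
    total = meta + data
    return {
        "metadata_s": meta,
        "columns_s": data,
        "metadata_fraction": meta / total if total > 0 else 0.0,
    }


# Stand-in workloads only; replace with real reads to get meaningful numbers.
def fake_read_metadata() -> int:
    return sum(range(1_000))


def fake_read_columns() -> int:
    return sum(range(100_000))


report = metadata_overhead(fake_read_metadata, fake_read_columns)
print(report)
```

If the metadata fraction stays small for typical column requests, re-reading metadata on every load is probably acceptable; if it dominates for small requests, some form of caching in the parquet-interaction layer may be worth considering.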
Activity
Field | Original Value | New Value |
---|---|---|
Epic Link | | |
Status | To Do [ 10001 ] | In Progress [ 3 ] |
Comment | [ Jenkins run started https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/33054/pipeline ] |
Reviewers | Sophie Reed [ sophiereed ] | |
Status | In Progress [ 3 ] | In Review [ 10004 ] |
Status | In Review [ 10004 ] | Reviewed [ 10101 ] |
Resolution | Done [ 10000 ] | |
Status | Reviewed [ 10101 ] | Done [ 10002 ] |
Story Points | 8 |
Watchers | Jim Bosch, Kian-Tat Lim, Sophie Reed, Tim Jenness, Tim Morton, Yusra AlSayyad [ Jim Bosch, Kian-Tat Lim, Sophie Reed, Tim Jenness, Tim Morton, Yusra AlSayyad ] | Jim Bosch, Kian-Tat Lim, Lauren MacArthur, Sophie Reed, Tim Jenness, Tim Morton, Yusra AlSayyad [ Jim Bosch, Kian-Tat Lim, Lauren MacArthur, Sophie Reed, Tim Jenness, Tim Morton, Yusra AlSayyad ] |
Resolution | Done [ 10000 ] | |
Status | Done [ 10002 ] | To Do [ 10001 ] |
Status | To Do [ 10001 ] | In Progress [ 3 ] |
Reviewers | Sophie Reed [ sophiereed ] | Lauren MacArthur [ lauren ] |
Status | In Progress [ 3 ] | In Review [ 10004 ] |
Status | In Review [ 10004 ] | Reviewed [ 10101 ] |
Resolution | Done [ 10000 ] | |
Status | Reviewed [ 10101 ] | Done [ 10002 ] |
A fundamental difference between gen2 and gen3 is that gen2 assumes the file is local, whereas in gen3 it might be in an S3 object store, and by the time you get the data from it the local copy of the file has already been deleted. A future datastore implementation might cache that file locally, but calling code can't assume the cache will be there.