# Persist parquet tables from pipe_analysis scripts


#### Details

• Type: Story
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s: None
• Labels: None
• Story Points: 1
• Sprint: DRP F17-6
• Team: Data Release Production

#### Description

In order to make interactive QA plots such as DM-11682, we need to persist parquet files containing the compiled data that are used to make the static matplotlib plots. Admittedly, this should be a temporary step until all the source info is easily loadable from a column-store database, but for now it is necessary. At the visit level it might be less so, but I'll do it anyway. This is probably technically part of DM-10859, but I thought it might be useful to get it in its own ticket so it can be merged quickly.

#### Activity

Tim Morton [X] (Inactive) added a comment -

Just remembered that in order to write these tables, you need to have the fastparquet Python library. I did a `pip install --user fastparquet`, but perhaps this should be `conda install`ed into the stack before merging? Or should I hide the fastparquet import behind a warning? Thoughts, John Swinbank?

Paul Price added a comment -

It feels rather heavy to have to make all the plots (which has historically been slow; may be improved now) in order to write the parquet files. Can't you move the functionality into its own Task?

I don't like that you've just stuck `writeParquet` in utils and magically get it in other code with the illicit `from .utils import *`; but that naughtiness pre-dates your change.

`writeParquet` should be documented to the usual standards (preferably numpydoc style).

Since this is all experimental at the moment, I'm fine with leaving fastparquet as an undeclared dependency. When it comes time to push pipe_analysis into lsst_distrib, or we move into production, then we'll need to handle the packaging.
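As an illustration of the numpydoc style requested above, here is a hypothetical sketch of how a `writeParquet` helper might be documented. The signature, parameter names, and behavior are assumptions for illustration, not the actual pipe_analysis implementation:

```python
def writeParquet(filename, df):
    """Persist a table of compiled plot data as a parquet file.

    Hypothetical sketch for illustration only; not the actual
    pipe_analysis implementation.

    Parameters
    ----------
    filename : `str`
        Output path for the parquet file.
    df : `pandas.DataFrame`
        Table of compiled data used to make the static plots.

    Raises
    ------
    ImportError
        If the optional ``fastparquet`` dependency is not installed.
    """
    # Deferred import, since fastparquet is an undeclared dependency
    import fastparquet
    fastparquet.write(filename, df)
```

The deferred import keeps the module importable even when fastparquet is absent; only an actual write attempt fails.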

Tim Morton [X] (Inactive) added a comment -

My main reason for dropping these table writes into the make-the-plots tasks was to capture the same tables being used to generate the plots, with as little development as possible, certainly not to be economical. If we want to make this its own Task, it seems like that could (should?) be completely divorced from the current plotting scripts. This relates to what we discussed in yesterday's meeting: with the interactive plots, perhaps we shouldn't even do the pre-culling that the pipe_analysis scripts currently do (which I'm capturing with my table writes).

I do agree in principle that the table-preparation should eventually be an independent step (in which case, the catalog preparation within the plotting tasks should probably also be broken out into its own task), but before we get there we want to try to view the whole PDR1 with the interactive plots (DM-12043), and I think the quickest way to get there is with the writes just snuck in there.

I'm happy to make the import of `writeParquet` explicit, but as you point out, that's just a drop in the bucket, so I figured it probably wasn't worth it.

Tim Morton [X] (Inactive) added a comment -

I've now added `writeParquet` documentation to the PR and hidden the fastparquet import behind a try block.
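The try-block guard described here is commonly written along the lines below. This is a minimal sketch under assumed names (`haveFastparquet` is a hypothetical helper); the actual PR code may differ:

```python
import warnings

try:
    import fastparquet  # optional dependency: pip install --user fastparquet
except ImportError:
    fastparquet = None
    warnings.warn("fastparquet not available; parquet tables will not be written")


def haveFastparquet():
    """Return True if the optional fastparquet library was importable."""
    return fastparquet is not None
```

Callers can then check `haveFastparquet()` and skip the table writes instead of crashing when the library is missing.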

Simon Krughoff added a comment -

Unfortunately, this merge broke Jenkins builds. Jenkins needs to be run and pass before merging to master. Please revert or fix ASAP.

Paul Price added a comment -

I've pushed a fix on tickets/DM-12030 of obs_test, which passes Jenkins.

```
price@pap-laptop:~/LSST/obs_test (tickets/DM-12030=) $ git sub-patch
commit a6a55e6f20793b269e32b63109688d0882313987 (HEAD -> tickets/DM-12030, origin/tickets/DM-12030)
Author: Paul Price <price@astro.princeton.edu>
Date:   Mon Oct 2 16:01:16 2017 -0400

    adapt to addition of plotting datasets

    This introduces a new key, 'description', which is used in pipe_analysis.

diff --git a/tests/test_obs_test.py b/tests/test_obs_test.py
index 4976057..1c820c3 100644
--- a/tests/test_obs_test.py
+++ b/tests/test_obs_test.py
@@ -76,7 +76,7 @@ class TestObsTest(lsst.obs.base.tests.ObsTests, lsst.utils.tests.TestCase):
         )
         path_to_raw = os.path.join(data_dir, "raw", "raw_v1_fg.fits.gz")
-        keys = set(('filter', 'name', 'patch', 'tract', 'visit', 'pixel_id', 'subfilter'))
+        keys = set(('filter', 'name', 'patch', 'tract', 'visit', 'pixel_id', 'subfilter', 'description'))
         query_format = ["visit", "filter"]
         queryMetadata = (({'visit': 1}, [(1, 'g')]),
                          ({'visit': 2}, [(2, 'g')]),
```
Simon Krughoff added a comment -

Paul Price Looks good. Ship it.

Paul Price added a comment -

Thanks Simon Krughoff. Merged to master.


#### People

• Assignee: Tim Morton [X] (Inactive)
• Reporter: Tim Morton [X] (Inactive)
• Reviewers: Paul Price
• Watchers: Jim Bosch, John Swinbank, Lauren MacArthur, Paul Price, Simon Krughoff, Tim Morton [X] (Inactive)