Data Management
DM-12030

Persist parquet tables from pipe_analysis scripts

    Details

    • Story Points:
      1
    • Epic Link:
    • Sprint:
      DRP F17-6
    • Team:
      Data Release Production

      Description

      In order to make interactive QA plots such as DM-11682, we need to persist parquet files containing the compiled data that are used to make the static matplotlib plots. Admittedly, this should be a temporary step until all the source info is easily loadable from a column-store database, but for now it is necessary. At the visit level it might be less so, but I'll do it anyway. This is probably technically part of DM-10859, but I thought it might be useful to get it in its own ticket so it can be merged quickly.

            Activity

            tmorton Tim Morton added a comment -

            Just remembered that in order to write these tables, you need the fastparquet python library. I did a pip install --user fastparquet, but perhaps this should be conda install'ed into the stack before merging? Or should I hide the fastparquet import behind a warning? Thoughts, John Swinbank?

            price Paul Price added a comment -

            It feels rather heavy to have to make all the plots (which has historically been slow; may be improved now) in order to write the parquet files. Can't you move the functionality into its own Task?

            I don't like that you've just stuck writeParquet in utils and magically get it in other code with the illicit from .utils import *; but that naughtiness pre-dates your change.

            writeParquet should be documented with the usual standards (preferably numpydoc style).

            Since this is all experimental at the moment, I'm fine with leaving fastparquet as an undeclared dependency. When it comes time to push pipe_analysis into lsst_distrib, or we move into production, then we'll need to handle the packaging.
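The numpydoc documentation requested above might look something like the following minimal sketch. The signature, parameter names, and use of fastparquet.write here are illustrative assumptions, not taken from the actual pipe_analysis code:

```python
def writeParquet(table, path):
    """Persist a table to a parquet file.

    Parameters
    ----------
    table : `pandas.DataFrame`
        Table to persist (hypothetical input type for this sketch).
    path : `str`
        Output filename for the parquet file.

    Raises
    ------
    ImportError
        If the optional ``fastparquet`` dependency is not installed.
    """
    try:
        import fastparquet
    except ImportError as err:
        raise ImportError(
            "fastparquet is required to write parquet tables") from err
    # fastparquet.write(filename, data) writes a DataFrame to parquet
    fastparquet.write(path, table)
```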

            tmorton Tim Morton added a comment -

            My main reason for dropping these table writes into the make-the-plots tasks was to capture the same tables being used to generate the plots, with as little development as possible---certainly not to be economical. If we want to make this its own Task, it seems like it could (should?) be completely divorced from the current plotting scripts. This relates to what we talked about in the meeting yesterday: perhaps for the interactive plots we shouldn't even do the pre-culling that the pipe_analysis scripts are doing (which I'm currently capturing with my table writes).

            I do agree in principle that the table-preparation should eventually be an independent step (in which case, the catalog preparation within the plotting tasks should probably also be broken out into its own task), but before we get there we want to try to view the whole PDR1 with the interactive plots (DM-12043), and I think the quickest way to get there is with the writes just snuck in there.

            I'm happy to make the import of writeParquet explicit, but as you point out, that's just a drop in the bucket, so I figured it probably wasn't worth it.

            tmorton Tim Morton added a comment -

            I've now added to the PR writeParquet documentation and hiding of the fastparquet import in a try block.
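The try-block import guard mentioned above could look something like this minimal sketch. Names such as HAVE_FASTPARQUET and maybeWriteParquet are illustrative, not taken from the actual PR:

```python
import warnings

# Guard the optional fastparquet import so the module can still be
# imported when the library is absent; only the parquet-writing code
# path actually needs it.
try:
    import fastparquet
    HAVE_FASTPARQUET = True
except ImportError:
    fastparquet = None
    HAVE_FASTPARQUET = False
    warnings.warn("fastparquet not available; parquet tables will not be written")


def maybeWriteParquet(table, path):
    """Write `table` to `path` if fastparquet is available; otherwise skip."""
    if not HAVE_FASTPARQUET:
        return False
    fastparquet.write(path, table)
    return True
```

This keeps the dependency undeclared (as agreed above) without breaking imports for users who never write parquet tables.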

            krughoff Simon Krughoff added a comment -

            Unfortunately, this merge broke the Jenkins builds. Jenkins needs to be run and to pass before merging to master. Please revert or fix ASAP.

            price Paul Price added a comment -

            I've pushed a fix on tickets/DM-12030 of obs_test, which passes Jenkins.

            price@pap-laptop:~/LSST/obs_test (tickets/DM-12030=) $ git sub-patch
            commit a6a55e6f20793b269e32b63109688d0882313987 (HEAD -> tickets/DM-12030, origin/tickets/DM-12030)
            Author: Paul Price <price@astro.princeton.edu>
            Date:   Mon Oct 2 16:01:16 2017 -0400
             
                adapt to addition of plotting datasets
                
                This introduces a new key, 'description', which is used in pipe_analysis.
             
            diff --git a/tests/test_obs_test.py b/tests/test_obs_test.py
            index 4976057..1c820c3 100644
            --- a/tests/test_obs_test.py
            +++ b/tests/test_obs_test.py
            @@ -76,7 +76,7 @@ class TestObsTest(lsst.obs.base.tests.ObsTests, lsst.utils.tests.TestCase):
                                           )
             
                     path_to_raw = os.path.join(data_dir, "raw", "raw_v1_fg.fits.gz")
            -        keys = set(('filter', 'name', 'patch', 'tract', 'visit', 'pixel_id', 'subfilter'))
            +        keys = set(('filter', 'name', 'patch', 'tract', 'visit', 'pixel_id', 'subfilter', 'description'))
                     query_format = ["visit", "filter"]
                     queryMetadata = (({'visit': 1}, [(1, 'g')]),
                                      ({'visit': 2}, [(2, 'g')]),
            

            krughoff Simon Krughoff added a comment -

            Paul Price Looks good. Ship it.

            price Paul Price added a comment -

            Thanks Simon Krughoff. Merged to master.


              People

              • Assignee:
                tmorton Tim Morton
                Reporter:
                tmorton Tim Morton
                Reviewers:
                Paul Price
                Watchers:
                Jim Bosch, John Swinbank, Lauren MacArthur, Paul Price, Simon Krughoff, Tim Morton
              • Votes:
                0
                Watchers:
                6

                Dates

                • Created:
                  Updated:
                  Resolved: