Data Management / DM-12030

Persist parquet tables from pipe_analysis scripts


Details

    • 1
    • DRP F17-6
    • Data Release Production

    Description

      In order to make interactive QA plots such as DM-11682, we need to persist parquet files containing the compiled data that are used to make the static matplotlib plots. Admittedly, this should be a temporary step until all the source info is easily loadable from a column-store database, but for now it is necessary. At the visit level it might be less so, but I'll do it anyway. This is probably technically part of DM-10859, but I thought it might be useful to get it in its own ticket so it can be merged quickly.
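As a sketch of the kind of persistence the description has in mind (the table contents and column names below are invented for illustration, and both pandas and fastparquet are assumed to be available):

```python
import pandas as pd

# Hypothetical compiled table of per-source quantities that feed the
# static matplotlib plots; the column names are illustrative only.
table = pd.DataFrame({
    "id": [1, 2, 3],
    "base_PsfFlux_mag": [21.3, 22.1, 20.8],
    "coord_ra": [150.01, 150.02, 150.03],
})

# Persist the same table the plotting code consumes, so interactive
# QA tools can load it later without re-running the pipeline.
try:
    import fastparquet
    fastparquet.write("compiled_table.parq", table)
except ImportError:
    # fastparquet is an optional, undeclared dependency at this stage
    pass
```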

Attachments

Issue Links

Activity

tmorton Tim Morton [X] (Inactive) added a comment -

Just remembered that in order to write these tables, you need to have the fastparquet Python library. I did a pip install --user fastparquet, but perhaps this should be conda-installed into the stack before merging? Or should I hide the fastparquet import behind a warning? Thoughts, swinbank?
            price Paul Price added a comment -

            It feels rather heavy to have to make all the plots (which has historically been slow; may be improved now) in order to write the parquet files. Can't you move the functionality into its own Task?

            I don't like that you've just stuck writeParquet in utils and magically get it in other code with the illicit from .utils import *; but that naughtiness pre-dates your change.

            writeParquet should be documented with the usual standards (preferably numpydoc style).

            Since this is all experimental at the moment, I'm fine with leaving fastparquet as an undeclared dependency. When it comes time to push pipe_analysis into lsst_distrib, or we move into production, then we'll need to handle the packaging.

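A numpydoc-style docstring along the lines price asks for might look like the following; the signature, parameter types, and body here are hypothetical, not the actual pipe_analysis interface:

```python
def writeParquet(table, path):
    """Write a data table to a parquet file.

    Hypothetical sketch of the documentation standard being requested;
    the real signature in pipe_analysis may differ.

    Parameters
    ----------
    table : `pandas.DataFrame`
        Compiled per-source data used to generate the QA plots.
    path : `str`
        Filename for the output parquet file.

    Raises
    ------
    ImportError
        If the optional fastparquet dependency is not installed.
    """
    import fastparquet  # deferred import: optional dependency
    fastparquet.write(path, table)
```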

tmorton Tim Morton [X] (Inactive) added a comment -

My main reason for dropping these table writes into the make-the-plots tasks was to capture the same tables being used to generate the plots, with as little development as possible, certainly not to be economical. If we want to make this its own Task, it seems like it could (should?) be completely divorced from the current plotting scripts. This relates to what we discussed in yesterday's meeting: with the interactive plots, perhaps we shouldn't even do the pre-culling that the pipe_analysis scripts do (which I'm currently capturing with my table writes).

I agree in principle that the table preparation should eventually be an independent step (in which case the catalog preparation within the plotting tasks should probably also be broken out into its own task), but before we get there we want to try viewing all of PDR1 with the interactive plots (DM-12043), and I think the quickest way to get there is to sneak the writes in as they are.

I'm happy to make the import of writeParquet explicit, but as you point out, that's just a drop in the bucket, so I figured it probably wasn't worth it.

tmorton Tim Morton [X] (Inactive) added a comment -

I've now added writeParquet documentation to the PR and hidden the fastparquet import in a try block.
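The import guard described above might look something like this (a sketch, not the actual pipe_analysis code; the flag name and warning text are made up):

```python
import warnings

try:
    import fastparquet
    HAVE_FASTPARQUET = True
except ImportError:
    HAVE_FASTPARQUET = False
    warnings.warn("fastparquet is not installed; parquet tables will not be written")

def writeParquet(path, table):
    """Write table to path as parquet, or do nothing if fastparquet is missing."""
    if not HAVE_FASTPARQUET:
        return
    fastparquet.write(path, table)
```

This keeps fastparquet an optional dependency: users without it still get the plots, just not the persisted tables.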

krughoff Simon Krughoff (Inactive) added a comment -

Unfortunately, this merge broke the Jenkins builds. Jenkins needs to be run and passed before merging to master. Please revert or fix ASAP.
            price Paul Price added a comment -

            I've pushed a fix on tickets/DM-12030 of obs_test, which passes Jenkins.

            price@pap-laptop:~/LSST/obs_test (tickets/DM-12030=) $ git sub-patch
            commit a6a55e6f20793b269e32b63109688d0882313987 (HEAD -> tickets/DM-12030, origin/tickets/DM-12030)
            Author: Paul Price <price@astro.princeton.edu>
            Date:   Mon Oct 2 16:01:16 2017 -0400
             
                adapt to addition of plotting datasets
                
                This introduces a new key, 'description', which is used in pipe_analysis.
             
            diff --git a/tests/test_obs_test.py b/tests/test_obs_test.py
            index 4976057..1c820c3 100644
            --- a/tests/test_obs_test.py
            +++ b/tests/test_obs_test.py
            @@ -76,7 +76,7 @@ class TestObsTest(lsst.obs.base.tests.ObsTests, lsst.utils.tests.TestCase):
                                           )
             
                     path_to_raw = os.path.join(data_dir, "raw", "raw_v1_fg.fits.gz")
            -        keys = set(('filter', 'name', 'patch', 'tract', 'visit', 'pixel_id', 'subfilter'))
            +        keys = set(('filter', 'name', 'patch', 'tract', 'visit', 'pixel_id', 'subfilter', 'description'))
                     query_format = ["visit", "filter"]
                     queryMetadata = (({'visit': 1}, [(1, 'g')]),
                                      ({'visit': 2}, [(2, 'g')]),
            


krughoff Simon Krughoff (Inactive) added a comment -

price Looks good. Ship it.
            price Paul Price added a comment -

            Thanks krughoff. Merged to master.


            People

  Assignee: tmorton Tim Morton [X] (Inactive)
  Reporter: tmorton Tim Morton [X] (Inactive)
  Reviewers: Paul Price
  Watchers (6): Jim Bosch, John Swinbank, Lauren MacArthur, Paul Price, Simon Krughoff (Inactive), Tim Morton [X] (Inactive)
  Votes: 0


                Jenkins

                  No builds found.