  1. Data Management
  2. DM-12566

Generate statistics for 30% DR1 dataset at IN2P3


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      In accordance with documentation in DM-10360, new statistics for Object and Source tables need to be generated for this new dataset for KPM30 tests.

        Attachments

          Issue Links

            Activity

            vaikunth Vaikunth Thukral added a comment -

            Going to use this documentation: https://confluence.lsstcorp.org/pages/viewpage.action?pageId=58950786

            As an aside, it may be worth trying to copy the KPM20 statistics to see if that might expedite the test.

            vaikunth Vaikunth Thukral added a comment -

            The statistics have finished generating and were tested with an Object-Source JOIN, which completed in ~130 minutes. Previously, such JOIN queries timed out at the 8-hour hard limit in the qserv scheduler.

            Generation was sped up by splitting the ANALYZE TABLE step across 4 threads running in parallel, for a total run time of ~1 day at IN2P3 instead of the expected 4-5 days.
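
A minimal sketch of how the ANALYZE TABLE step could be split across 4 threads. This is not the script used for this ticket. The table names and the `run_sql` callable are hypothetical placeholders; a real run would presumably give each thread its own database connection.

```python
from concurrent.futures import ThreadPoolExecutor


def split_round_robin(tables, n_workers):
    """Deal tables into n_workers lists so each thread gets a similar share."""
    return [tables[i::n_workers] for i in range(n_workers)]


def analyze_batch(tables, run_sql):
    """Issue ANALYZE TABLE for each table in order; return the statements sent.

    run_sql is a hypothetical callable that executes one SQL statement.
    """
    issued = []
    for table in tables:
        stmt = f"ANALYZE TABLE `{table}`"
        run_sql(stmt)
        issued.append(stmt)
    return issued


def analyze_in_parallel(tables, run_sql, n_workers=4):
    """Run the batches concurrently, one batch per thread."""
    batches = split_round_robin(tables, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(lambda b: analyze_batch(b, run_sql), batches))
    # Flatten the per-thread results into a single list of statements.
    return [stmt for batch in results for stmt in batch]
```

Passing `print` as `run_sql` gives a dry run that only lists the statements each thread would issue.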

            We also looked into ways to speed up and plan this process for future KPMs and DRs. We found that the local mysql instances depend only on the mysql.column_stats table, which is populated by ANALYZE TABLE. It is not yet clear which columns in particular lead the optimizer to choose the correct JOIN order, but a short meta-analysis of these column values across all chunks on a worker gave some insight into the spread of the min/max/avg values of the columns. The hope is that a pattern can be found in how the distribution of these values scales for future, larger datasets, which would avoid the need to reanalyze tables for each new DR. For now, this work is tentatively planned for a future cycle, as it would need dedicated resources and time.
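
A rough sketch of the kind of per-column spread summary such a meta-analysis might compute. It is not the analysis actually performed. It assumes rows shaped like `(table_name, column_name, min_value, max_value)`, for example selected from mysql.column_stats on one worker, and assumes the columns are numeric so the stored values can be parsed as floats. The function name and output keys are illustrative only.

```python
from collections import defaultdict
from statistics import mean


def summarize_spread(rows):
    """Summarize how per-table min/max statistics vary for each column.

    rows: iterable of (table_name, column_name, min_value, max_value),
    where min_value and max_value are numbers or numeric strings.
    """
    per_column = defaultdict(list)
    for _table_name, column_name, lo, hi in rows:
        per_column[column_name].append((float(lo), float(hi)))

    summary = {}
    for column, bounds in per_column.items():
        lows = [lo for lo, _ in bounds]
        highs = [hi for _, hi in bounds]
        summary[column] = {
            "n_tables": len(bounds),     # chunk tables that reported this column
            "min_of_min": min(lows),     # smallest lower bound seen
            "max_of_max": max(highs),    # largest upper bound seen
            "mean_min": mean(lows),      # average lower bound across tables
            "mean_max": mean(highs),     # average upper bound across tables
        }
    return summary
```

Comparing these summaries between two datasets of different size would be one way to look for the scaling pattern described above.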


              People

              Assignee:
              vaikunth Vaikunth Thukral
              Reporter:
              vaikunth Vaikunth Thukral
              Reviewers:
              Fritz Mueller
              Watchers:
              Fritz Mueller, Igor Gaponenko, John Gates, Vaikunth Thukral
