  1. Data Management
  2. DM-3527

FY19 Query Result Caching

    Details

    • Type: Epic
    • Status: To Do
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: Qserv
    • Labels:
      None
    • Epic Name:
      FY19 Query Caching
    • Story Points:
      79
    • WBS:
      02C.06.02.03
    • Team:
      Data Access and Database

      Description

      Implement query caching. Note that we most likely cannot rely on MySQL caching alone and will need some custom tweaks, in particular for async queries.

      It'd be useful to allow users to "pin" results from interesting queries. Typical use case: a user runs a query (it can be interactive) and then decides to keep the results for a longer time.
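
      To make the "pin" idea concrete, here is a minimal sketch of a result cache in which unpinned entries are evicted least-recently-used first, while pinned entries survive eviction. This is an illustration only, not the Qserv design: the class name `ResultCache`, its methods, the in-memory storage, and the key scheme (data release plus whitespace-normalized query text) are all assumptions made for the example.

```python
import hashlib
from collections import OrderedDict


class ResultCache:
    """Toy in-memory query result cache with pinning (hypothetical API)."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._entries = OrderedDict()  # key -> result, least recently used first
        self._pinned = set()           # keys exempt from eviction

    @staticmethod
    def key(data_release, sql):
        # Naive normalization: collapse runs of whitespace. A real system
        # would need a proper SQL normalizer (this also alters string literals).
        normalized = " ".join(sql.split())
        return hashlib.sha256(f"{data_release}\n{normalized}".encode()).hexdigest()

    def put(self, key, result):
        self._entries[key] = result
        self._entries.move_to_end(key)
        self._evict()

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as recently used
        return self._entries[key]

    def pin(self, key):
        if key not in self._entries:
            raise KeyError(key)
        self._pinned.add(key)

    def unpin(self, key):
        self._pinned.discard(key)

    def _evict(self):
        # Drop the least recently used unpinned entries until the cache fits.
        # Pinned entries count toward the limit, so a cache full of pinned
        # results can exceed max_entries.
        for key in list(self._entries):
            if len(self._entries) <= self.max_entries:
                break
            if key not in self._pinned:
                del self._entries[key]
```

      In this sketch a user would call `pin()` on the key of a result worth keeping. How long a pinned result lives, and who pays for the space, is left open here.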

            Activity

            tjenness Tim Jenness added a comment -

            How long is a long time?

            fritzm Fritz Mueller added a comment -

            http://longnow.org/essays/time-10000-year-clock/

            tjenness Tim Jenness added a comment -

            Fritz Mueller Maybe I'll be less Zen in my comments next time.

            When we say "[astronomer] decides to keep the results for longer time." how long a time is that thought to be? I am interested in the concept of being able to issue a DOI for a specific query so that the query can be recreated by arbitrary users and included in science publications.

            jbecla Jacek Becla added a comment -

            Xiuqin Wu [X] brought up the issue of pinning results when we chatted last week, so she might have an opinion. And maybe Mario Juric? My understanding is that space for query results will not be used for keeping things forever.

            tjenness Tim Jenness added a comment -

            I don't necessarily need the results to be cached forever. For me the important thing is a reproducible query that knows what data release it is for. For the DOI, the resultant page could go directly to the cached results or trigger a whole new query on an archived data release (which might take a long time).

            It may be that we want the ability to turn a cache into permanent query result storage if a scientist indicates that they have published a paper from the query.

            jbecla Jacek Becla added a comment -

            Sounds doable. Should we (well, you/Mario/K-T) document this requirement then, and turn it over to the Data Access team? This is not mentioned in any baseline docs, so we are not promising anything now.

            ktl Kian-Tat Lim added a comment -

            Every query against the data release tables should know what data release it is for based on the database used and should always be reproducible.

            My initial bias is to have the query result cache be just that: a relatively simple optimization for repeated queries with no reliability guarantees. Users would have to copy results into their database or file workspace to retain them. But if we can send query results to reliable high-bandwidth shared storage (GPFS or an object store like S3/Swift) and if we can build out suitable management infrastructure (labeling by owning user or users, accounting for space used, listing and removal operations), more useful optimization possibilities do open up.

            I do worry a little about the conceptual difference, especially for future processing, between retained query results and a real table stemming from "CREATE TABLE AS SELECT". On the other hand, perhaps this argues for (eventually) dumping SQL and instead having results be like entries in a Jupyter notebook with something like Pandas as a query language...

            xiuqin Xiuqin Wu [X] (Inactive) added a comment -

            Jacek Becla, Kian-Tat Lim I don't want to influence how the DB group is going to cache the results to make subsequent searches fast. My concern is that when a user wants to save the search results (after waiting a long time for the search to finish, say more than several minutes) for later work (I believe the dataset becomes L3 data after saving), the DB should not have to redo the search. The saving would then be instant, and the user can start querying the saved table right away.


              People

              Assignee:
              Unassigned
              Reporter:
              fritzm Fritz Mueller
              Watchers:
              Fritz Mueller, Jacek Becla, John Gates, Kian-Tat Lim, Tatiana Goldina, Tim Jenness, Xiuqin Wu [X] (Inactive)