DM-13365: Minimal on-disk caching Datastore

    Details

    • Story Points:
      10
    • Team:
      Ops Middleware
    • Urgent?:
      No

      Description

      Should support caching some DatasetTypes but not others. Also perhaps needs some garbage collector. Can / should probably be implemented by keeping two Datastore instances as children (the PosixDatastore cache and another upstream Datastore) and forwarding / transferring between them as needed.
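
      A minimal sketch of the two-child arrangement described above, assuming a deliberately simplified interface. InMemoryDatastore, CachingDatastore, and the (ref, dataset_type) string arguments are hypothetical stand-ins for illustration, not the real Datastore API, and garbage collection is left out.

```python
from typing import Any


class InMemoryDatastore:
    """Hypothetical stand-in for a child datastore, backed by a dict."""

    def __init__(self) -> None:
        self._data = {}

    def exists(self, ref: str) -> bool:
        return ref in self._data

    def get(self, ref: str) -> Any:
        return self._data[ref]

    def put(self, obj: Any, ref: str) -> None:
        self._data[ref] = obj


class CachingDatastore:
    """Forward to an upstream datastore, caching only selected dataset types."""

    def __init__(self, cache, upstream, cacheable_types):
        self.cache = cache          # e.g. a local POSIX datastore
        self.upstream = upstream    # the authoritative datastore
        self.cacheable_types = set(cacheable_types)

    def put(self, obj: Any, ref: str, dataset_type: str) -> None:
        # Always write upstream; additionally cache if the type is cacheable.
        self.upstream.put(obj, ref)
        if dataset_type in self.cacheable_types:
            self.cache.put(obj, ref)

    def get(self, ref: str, dataset_type: str) -> Any:
        # Serve from the cache when possible, otherwise fetch and maybe cache.
        if self.cache.exists(ref):
            return self.cache.get(ref)
        obj = self.upstream.get(ref)
        if dataset_type in self.cacheable_types:
            self.cache.put(obj, ref)
        return obj
```

      Writing through to the upstream store on every put keeps the cache disposable, so any garbage collector can delete cached entries without risking data loss.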

            Activity

            pschella Pim Schellart [X] (Inactive) added a comment -

            What is the epic for this?

            tjenness Tim Jenness added a comment -

            We haven't assigned one. I assumed we were going to assign it once it was put in a sprint.

            tjenness Tim Jenness added a comment -

            I am going to repurpose this ticket to extend the existing DatastoreCacheManager functionality added for remote datastores in DM-29383.

            The issues to solve on this ticket mostly revolve around cache expiry. This boils down to:

            • When do we decide that a file can be removed from the cache?
            • How do we decide which file should be removed from the cache?
            • Where are we caching files?

            Within pipetask run, pipetask knows when all jobs have completed and can therefore trigger cache expiration explicitly. A notebook user making a couple of butler.get calls has no such opportunity, and it seems unfair to make people call butler.datastore.empty_cache themselves.

            In many cases where a file is transferred between PipeTasks on the same node, it would be feasible to offer a configuration option declaring that the file should be cached on put and then deleted immediately after get.

            For a temporary cache location, every butler process gets a brand new temp directory. Do we register a handler with atexit to clean that up? Presumably the size of that directory still needs to be capped.
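
            For reference, registering such a handler is short. This is only a sketch of the idea: make_temporary_cache_dir and the "butler-cache-" prefix are hypothetical names, not anything in the existing code.

```python
import atexit
import shutil
import tempfile
from pathlib import Path


def make_temporary_cache_dir(prefix: str = "butler-cache-") -> Path:
    """Create a per-process cache directory that is removed at interpreter exit."""
    cache_dir = Path(tempfile.mkdtemp(prefix=prefix))
    # ignore_errors avoids noise if the directory has already been removed.
    atexit.register(shutil.rmtree, cache_dir, ignore_errors=True)
    return cache_dir
```

            Note that atexit handlers do not run if the process is killed by an unhandled signal or exits via os._exit, so stale temp directories can still be left behind.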

            For a reusable cache directory, proper cache management is needed, with a cap on the cache size. Is it reasonable to run cache expiry code on every get (from a remote location) and every put? Should files be deleted in FIFO order, or largest first? Is it okay to readdir the cache on every get/put?
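
            To make the options concrete, here is a rough sketch of a size-capped expiry pass that scans the directory each time it is called. expire_cache and its parameters are hypothetical, and file modification time is used as an approximation of FIFO order.

```python
import os
from pathlib import Path


def expire_cache(cache_dir: Path, max_bytes: int, largest_first: bool = False) -> list:
    """Delete files until the cache fits in max_bytes; return the expired paths."""
    entries = []
    with os.scandir(cache_dir) as it:
        for entry in it:
            if entry.is_file(follow_symlinks=False):
                st = entry.stat(follow_symlinks=False)
                entries.append((Path(entry.path), st.st_size, st.st_mtime))

    total = sum(size for _, size, _ in entries)
    if largest_first:
        entries.sort(key=lambda e: e[1], reverse=True)  # biggest files first
    else:
        entries.sort(key=lambda e: e[2])  # oldest modification time first

    expired = []
    for path, size, _ in entries:
        if total <= max_bytes:
            break
        try:
            path.unlink()
        except FileNotFoundError:
            pass  # another process expired it first; still count it as gone
        total -= size
        expired.append(path)
    return expired
```

            The scan costs one stat per cached file, so whether running it on every get/put is acceptable depends on how many files the cache holds and how slow the file system is.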

            tjenness Tim Jenness added a comment -

            Nate Lust suggests adding an API to declare that specific refs should be cached. This would allow pipetask to look at the graph in advance, determine which files will be needed across different quanta, and cache those. pipetask can then clear the cache itself on completion (it may need to configure the execution butler to use a shared cache directory rather than a dynamic one).
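
            A sketch of how the "needed across different quanta" selection might look, assuming the graph can be reduced to a mapping from quantum name to its input refs. That mapping and the refs_worth_caching helper are hypothetical and do not reflect the real quantum graph classes.

```python
from collections import Counter
from collections.abc import Iterable, Mapping


def refs_worth_caching(quantum_inputs: Mapping[str, Iterable[str]]) -> set:
    """Return the refs that are inputs to more than one quantum."""
    # set(refs) ensures a ref repeated within one quantum is counted once.
    counts = Counter(ref for refs in quantum_inputs.values() for ref in set(refs))
    return {ref for ref, n in counts.items() if n > 1}
```

            The resulting set is what pipetask would pass to the proposed "please cache these refs" API before execution starts.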

            tjenness Tim Jenness added a comment -

            Kian-Tat Lim: sorry about the size, but this ticket ballooned a little beyond the initial plan:

            • Five types of cache expiration.
            • Allow an environment variable to override the config.
            • Change find_in_cache to be a context manager to allow other processes to run cache expiration.
            • Remove from the cache when a dataset is pruned.
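
            To illustrate why a context manager helps with concurrent expiration, here is one possible approach, not necessarily what the ticket implements. It assumes a file system that supports hard links, and the signature and ".inuse" suffix are invented for the example: the reader takes a private hard link for the duration of the block, so another process can expire the original name without breaking the read.

```python
import contextlib
import os
import uuid
from collections.abc import Iterator
from pathlib import Path
from typing import Optional


@contextlib.contextmanager
def find_in_cache(cache_dir: Path, name: str) -> Iterator[Optional[Path]]:
    """Yield a path that is safe to read for the duration of the block, or None."""
    cached = cache_dir / name
    private = cache_dir / f"{name}.{uuid.uuid4().hex}.inuse"
    try:
        # The hard link keeps the data alive even if `cached` is expired.
        os.link(cached, private)
    except FileNotFoundError:
        yield None  # cache miss, or the file was expired before we linked it
        return
    try:
        yield private
    finally:
        private.unlink(missing_ok=True)
```

            A caller would write "with find_in_cache(cache_dir, name) as path:" and fall back to the remote datastore when path is None.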
            ktl Kian-Tat Lim added a comment -

            A bunch of comments, but most are either "think about this" or small doc changes. Overall it looks OK.

              People

              Assignee:
              tjenness Tim Jenness
              Reporter:
              pschella Pim Schellart [X] (Inactive)
              Reviewers:
              Kian-Tat Lim
              Watchers:
              Jim Bosch, Kian-Tat Lim, Tim Jenness