I am going to repurpose this ticket to extend the existing DatastoreCacheManager functionality added for remote datastores in DM-29383.
The issues to solve on this ticket mostly revolve around cache expiry. This boils down to:
- When do we decide that a file can be removed from the cache?
- How do we decide which file should be removed from the cache?
- Where are we caching files?
Inside pipetask run, pipetask knows when all jobs have completed and can therefore trigger cache expiration explicitly. A notebook user doing a couple of butler.get calls does not have the same opportunity, and it seems unfair to make people call butler.datastore.empty_cache themselves.
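For concreteness, this is the interactive pattern we would like to spare users from having to finish with an explicit cleanup call (the repo path and dataId values below are placeholders, not real data):

```python
from lsst.daf.butler import Butler

butler = Butler("/path/to/repo")  # placeholder repo path
# Placeholder dataId; each get may pull a file into the local cache.
calexp = butler.get("calexp", instrument="HSC", visit=903334, detector=16)
# ...a couple more butler.get calls...
# Today the user would have to remember to clean up by hand:
butler.datastore.empty_cache()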
For many cases, when transferring a file between PipeTasks on the same node, a configuration option declaring that the file should be cached on put and then deleted immediately after get would suffice (see the sketch below).
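A minimal sketch of what the get side of that mode could look like; the function name, signature, and flag are hypothetical, not an existing butler API:

```python
import os

def read_cached_file(cache_path: str, delete_after_get: bool = True) -> bytes:
    """Read a file that an upstream task's put cached locally.

    Hypothetical sketch: with 'delete after get' enabled, the downstream
    task is assumed to be the only consumer, so the cached copy can be
    removed as soon as it has been read.
    """
    with open(cache_path, "rb") as fh:
        data = fh.read()
    if delete_after_get:
        os.remove(cache_path)
    return data
```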
For a temporary cache location, every butler process gets a brand-new temp directory. Do we register a handler with atexit to clean that up, as sketched below? Presumably its size still needs to be capped.
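A minimal sketch, assuming the per-process directory is created with tempfile and removed by an atexit handler:

```python
import atexit
import shutil
import tempfile

# Each butler process gets its own brand-new cache directory...
_cache_dir = tempfile.mkdtemp(prefix="butler-cache-")

# ...and an exit handler removes it when the interpreter shuts down,
# so interactive users never have to clean up explicitly.
@atexit.register
def _cleanup_cache_dir() -> None:
    shutil.rmtree(_cache_dir, ignore_errors=True)
```

Note that atexit only runs on normal interpreter exit; a killed process would still leak the directory, which is another argument for keeping a size cap as a backstop.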
For a reusable cache directory, proper cache management with a cap on the cache size is needed. Is it reasonable to run cache expiry code on every get (from a remote location) and every put? Delete in FIFO order? Largest files first? Is it okay to readdir the cache on every get/put? One possible expiry pass is sketched below.
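As a sketch only: scan the whole directory on each call and delete oldest-first (FIFO) until under the cap. Whether this full scan is cheap enough to run on every get/put is exactly the open question; largest-first would just change the sort key.

```python
import os

def expire_cache(cache_dir: str, max_bytes: int) -> None:
    """Delete cached files, oldest first, until the cache fits max_bytes.

    Sketch only: the FIFO policy and the full readdir on every call are
    the options being weighed above, not a settled design.
    """
    entries = []
    total = 0
    with os.scandir(cache_dir) as it:
        for entry in it:
            if entry.is_file():
                info = entry.stat()
                entries.append((info.st_mtime, info.st_size, entry.path))
                total += info.st_size
    # FIFO by modification time; sort by size descending for largest-first.
    for _mtime, size, path in sorted(entries):
        if total <= max_bytes:
            break
        try:
            os.remove(path)
        except FileNotFoundError:
            continue  # another process may have expired it already
        total -= size
```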
What is the epic for this?