Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-15887

Make Pipeline work with non-standard storage classes

    XMLWordPrintable

    Details

      Description

      Pipeline builder needs to instantiate DatasetTypes which depend on StorageClass (and they both are configured via task config). We want to keep pipeline builder independent of butler but that means that initialization of StorageClassFactory becomes an issue. DM-15850 adds support for loading all standard StorageClasses (which come from a standard YAML config) but any non-standard configuration will become an issue in this approach.

      I want to see how we can solve this problem by either pre-loading non-standard config for the factory or avoiding its use entirely. 

        Attachments

          Issue Links

            Activity

            Hide
            salnikov Andy Salnikov added a comment -

            Reading through the code I have an impression that things could be made significantly simpler if we delay the instantiation of StorageClass instances until they are actually used (i.e. somewhere in Butler). DatasetType instances can store StorageClass name instead of instance and that should make few things simpler:

            • we can use DatasetType in a context when Butler is not initialized (e.g. Pipeline construction)
            • serialization of DatasetType will become trivial
            • fewer dependencies on StorageClassFactory singleton (maybe we could even eliminate that singleton which would be a great win for everyone)

            Tim Jenness, Jim Bosch, what do think about this idea, am I being too naive?

            Show
            salnikov Andy Salnikov added a comment - Reading through the code I have an impression that things could be made significantly simpler if we delay the instantiation of StorageClass instances until they are actually used (i.e. somewhere in Butler). DatasetType instances can store StorageClass name instead of instance and that should make few things simpler: we can use DatasetType in a context when Butler is not initialized (e.g. Pipeline construction) serialization of DatasetType will become trivial fewer dependencies on StorageClassFactory singleton (maybe we could even eliminate that singleton which would be a great win for everyone) Tim Jenness , Jim Bosch , what do think about this idea, am I being too naive?
            Hide
            tjenness Tim Jenness added a comment -

            I don't understand how we can remove StorageClassFactory since that's the thing that does all the hard work. It is entirely feasible for DatasetType to have a storageClassName property and a storageClass property, with the latter defaulting to None and dynamically creating itself on in the @property getter from name and StorageClassFactory.

            Show
            tjenness Tim Jenness added a comment - I don't understand how we can remove StorageClassFactory since that's the thing that does all the hard work. It is entirely feasible for DatasetType to have a storageClassName property and a storageClass property, with the latter defaulting to None and dynamically creating itself on in the @property getter from name and StorageClassFactory.
            Hide
            salnikov Andy Salnikov added a comment -

            I do not propose removal of StorageClassFactory, I'm saying that maybe we can get rid of the singleton. StorageClassFactory instance could become a Butler attribute if Butler is the only place where we need to call the factory. And clearly I do not know about all use cases for the factory, maybe there is a reasonable use cases for having that singleton outside butler. I have encountered that when I needed to register new StorageClass, but that potentially can be done via Butler method.

            I could live with dynamic storageClass property in DatasetType, still using that approach means that Butler needs to be initialized/configured when accessing that property, which is hard to enforce. To me it feels cleaner/safer if we only try to instantiate StorageClass inside butler if possible.

             

            Show
            salnikov Andy Salnikov added a comment - I do not propose removal of StorageClassFactory, I'm saying that maybe we can get rid of the singleton. StorageClassFactory instance could become a Butler attribute if Butler is the only place where we need to call the factory. And clearly I do not know about all use cases for the factory, maybe there is a reasonable use cases for having that singleton outside butler. I have encountered that when I needed to register new StorageClass, but that potentially can be done via Butler method. I could live with dynamic  storageClass property in DatasetType, still using that approach means that Butler needs to be initialized/configured when accessing that property, which is hard to enforce. To me it feels cleaner/safer if we only try to instantiate StorageClass inside butler if possible.  
            Hide
            tjenness Tim Jenness added a comment -

            I had to use a singleton because there are bits inside datastores that weren't being passed in the configuration knowledge of how to construct a StorageClass. There was a long debate about this a long time ago. Why is there a problem with dynamic storage class creation? You'd only encounter it if you asked for the storageClass property and if you were asking for that then you really need the StorageClass. You were stating that the name was enough for you so you'd access storageClassName property.

            Show
            tjenness Tim Jenness added a comment - I had to use a singleton because there are bits inside datastores that weren't being passed in the configuration knowledge of how to construct a StorageClass. There was a long debate about this a long time ago. Why is there a problem with dynamic storage class creation? You'd only encounter it if you asked for the storageClass property and if you were asking for that then you really need the StorageClass. You were stating that the name was enough for you so you'd access storageClassName property.
            Hide
            salnikov Andy Salnikov added a comment -

            My problem with that is mostly philosophical in nature, I prefer fewer code dependencies and public API that works consistently in any environment. Still as I said I can survive with dynamic attribute for my purposes.

            Show
            salnikov Andy Salnikov added a comment - My problem with that is mostly philosophical in nature, I prefer fewer code dependencies and public API that works consistently in any environment. Still as I said I can survive with dynamic attribute for my purposes.
            Hide
            jbosch Jim Bosch added a comment -

            Could we make attachment of a StorageClass instance to a DatasetType instance something only a Butler/Registry/Datastore can do, in the same sense that a Registry is essentially necessary to add a dataset_id to a DatasetRef?

            I think in most user-facing contexts only requiring a string is clearly better, but I gather not having the StorageClass attached to the DatasetType would be a pretty big annoyance in the internals (which I think is confirmed by Tim Jenness's comments above).  We could consider having a variant/subclass of DatasetType that adds that information, and in more a strongly-typed or rigorously OO language I think that'd be the way to go.  In Python it's less clear.

            Show
            jbosch Jim Bosch added a comment - Could we make attachment of a StorageClass instance to a DatasetType instance something only a Butler/Registry/Datastore can do, in the same sense that a Registry is essentially necessary to add a dataset_id to a DatasetRef? I think in most user-facing contexts only requiring a string is clearly better, but I gather not having the StorageClass attached to the DatasetType would be a pretty big annoyance in the internals (which I think is confirmed by Tim Jenness 's comments above).  We could consider having a variant/subclass of DatasetType that adds that information, and in more a strongly-typed or rigorously OO language I think that'd be the way to go.  In Python it's less clear.
            Hide
            salnikov Andy Salnikov added a comment -

            I'm not really sure how to do that. We could make a class which "wraps" string-only-DatasetType with DatasetType-with-storageclass when DatasetType instances cross from a user land into a butler land, but that looks potentially fragile and high-maintenance. What Tim proposed (dynamic storageClass property) is probably better if you accept that it can behave unpredictably before you initialize Butler (which probably only matters for unit tests).

             

            Show
            salnikov Andy Salnikov added a comment - I'm not really sure how to do that. We could make a class which "wraps" string-only-DatasetType with DatasetType-with-storageclass when DatasetType instances cross from a user land into a butler land, but that looks potentially fragile and high-maintenance. What Tim proposed (dynamic storageClass property) is probably better if you accept that it can behave unpredictably before you initialize Butler (which probably only matters for unit tests).  
            Hide
            salnikov Andy Salnikov added a comment -

            Ready for review, the only non-trivial thing is that I had to re-implement _eq and hash_ methods for DatasetType. Jenkins has built is successfully for centos6/7, macos is still building.

            Show
            salnikov Andy Salnikov added a comment - Ready for review, the only non-trivial thing is that I had to re-implement _ eq and hash _ methods for DatasetType. Jenkins has built is successfully for centos6/7, macos is still building.
            Hide
            tjenness Tim Jenness added a comment -

            Looks good to me. Sorry for delay in reviewing.

            Show
            tjenness Tim Jenness added a comment - Looks good to me. Sorry for delay in reviewing.
            Hide
            salnikov Andy Salnikov added a comment -

            Thanks for review! Merged all three packages.

            Show
            salnikov Andy Salnikov added a comment - Thanks for review! Merged all three packages.

              People

              Assignee:
              salnikov Andy Salnikov
              Reporter:
              salnikov Andy Salnikov
              Reviewers:
              Tim Jenness
              Watchers:
              Andy Salnikov, Jim Bosch, Tim Jenness, Vaikunth Thukral
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  CI Builds

                  No builds found.