Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-15887

Make Pipeline work with non-standard storage classes

    XMLWordPrintable

Details

    Description

      Pipeline builder needs to instantiate DatasetTypes which depend on StorageClass (and they both are configured via task config). We want to keep pipeline builder independent of butler but that means that initialization of StorageClassFactory becomes an issue. DM-15850 adds support for loading all standard StorageClasses (which come from a standard YAML config) but any non-standard configuration will become an issue in this approach.

      I want to see how we can solve this problem by either pre-loading non-standard config for the factory or avoiding its use entirely. 

      Attachments

        Issue Links

          Activity

            Reading through the code I have an impression that things could be made significantly simpler if we delay the instantiation of StorageClass instances until they are actually used (i.e. somewhere in Butler). DatasetType instances can store StorageClass name instead of instance and that should make few things simpler:

            • we can use DatasetType in a context when Butler is not initialized (e.g. Pipeline construction)
            • serialization of DatasetType will become trivial
            • fewer dependencies on StorageClassFactory singleton (maybe we could even eliminate that singleton which would be a great win for everyone)

            tjenness, jbosch, what do think about this idea, am I being too naive?

            salnikov Andy Salnikov added a comment - Reading through the code I have an impression that things could be made significantly simpler if we delay the instantiation of StorageClass instances until they are actually used (i.e. somewhere in Butler). DatasetType instances can store StorageClass name instead of instance and that should make few things simpler: we can use DatasetType in a context when Butler is not initialized (e.g. Pipeline construction) serialization of DatasetType will become trivial fewer dependencies on StorageClassFactory singleton (maybe we could even eliminate that singleton which would be a great win for everyone) tjenness , jbosch , what do think about this idea, am I being too naive?
            tjenness Tim Jenness added a comment -

            I don't understand how we can remove StorageClassFactory since that's the thing that does all the hard work. It is entirely feasible for DatasetType to have a storageClassName property and a storageClass property, with the latter defaulting to None and dynamically creating itself on in the @property getter from name and StorageClassFactory.

            tjenness Tim Jenness added a comment - I don't understand how we can remove StorageClassFactory since that's the thing that does all the hard work. It is entirely feasible for DatasetType to have a storageClassName property and a storageClass property, with the latter defaulting to None and dynamically creating itself on in the @property getter from name and StorageClassFactory.

            I do not propose removal of StorageClassFactory, I'm saying that maybe we can get rid of the singleton. StorageClassFactory instance could become a Butler attribute if Butler is the only place where we need to call the factory. And clearly I do not know about all use cases for the factory, maybe there is a reasonable use cases for having that singleton outside butler. I have encountered that when I needed to register new StorageClass, but that potentially can be done via Butler method.

            I could live with dynamic storageClass property in DatasetType, still using that approach means that Butler needs to be initialized/configured when accessing that property, which is hard to enforce. To me it feels cleaner/safer if we only try to instantiate StorageClass inside butler if possible.

             

            salnikov Andy Salnikov added a comment - I do not propose removal of StorageClassFactory, I'm saying that maybe we can get rid of the singleton. StorageClassFactory instance could become a Butler attribute if Butler is the only place where we need to call the factory. And clearly I do not know about all use cases for the factory, maybe there is a reasonable use cases for having that singleton outside butler. I have encountered that when I needed to register new StorageClass, but that potentially can be done via Butler method. I could live with dynamic  storageClass property in DatasetType, still using that approach means that Butler needs to be initialized/configured when accessing that property, which is hard to enforce. To me it feels cleaner/safer if we only try to instantiate StorageClass inside butler if possible.  
            tjenness Tim Jenness added a comment -

            I had to use a singleton because there are bits inside datastores that weren't being passed in the configuration knowledge of how to construct a StorageClass. There was a long debate about this a long time ago. Why is there a problem with dynamic storage class creation? You'd only encounter it if you asked for the storageClass property and if you were asking for that then you really need the StorageClass. You were stating that the name was enough for you so you'd access storageClassName property.

            tjenness Tim Jenness added a comment - I had to use a singleton because there are bits inside datastores that weren't being passed in the configuration knowledge of how to construct a StorageClass. There was a long debate about this a long time ago. Why is there a problem with dynamic storage class creation? You'd only encounter it if you asked for the storageClass property and if you were asking for that then you really need the StorageClass. You were stating that the name was enough for you so you'd access storageClassName property.

            My problem with that is mostly philosophical in nature, I prefer fewer code dependencies and public API that works consistently in any environment. Still as I said I can survive with dynamic attribute for my purposes.

            salnikov Andy Salnikov added a comment - My problem with that is mostly philosophical in nature, I prefer fewer code dependencies and public API that works consistently in any environment. Still as I said I can survive with dynamic attribute for my purposes.
            jbosch Jim Bosch added a comment -

            Could we make attachment of a StorageClass instance to a DatasetType instance something only a Butler/Registry/Datastore can do, in the same sense that a Registry is essentially necessary to add a dataset_id to a DatasetRef?

            I think in most user-facing contexts only requiring a string is clearly better, but I gather not having the StorageClass attached to the DatasetType would be a pretty big annoyance in the internals (which I think is confirmed by tjenness's comments above).  We could consider having a variant/subclass of DatasetType that adds that information, and in more a strongly-typed or rigorously OO language I think that'd be the way to go.  In Python it's less clear.

            jbosch Jim Bosch added a comment - Could we make attachment of a StorageClass instance to a DatasetType instance something only a Butler/Registry/Datastore can do, in the same sense that a Registry is essentially necessary to add a dataset_id to a DatasetRef? I think in most user-facing contexts only requiring a string is clearly better, but I gather not having the StorageClass attached to the DatasetType would be a pretty big annoyance in the internals (which I think is confirmed by tjenness 's comments above).  We could consider having a variant/subclass of DatasetType that adds that information, and in more a strongly-typed or rigorously OO language I think that'd be the way to go.  In Python it's less clear.

            I'm not really sure how to do that. We could make a class which "wraps" string-only-DatasetType with DatasetType-with-storageclass when DatasetType instances cross from a user land into a butler land, but that looks potentially fragile and high-maintenance. What Tim proposed (dynamic storageClass property) is probably better if you accept that it can behave unpredictably before you initialize Butler (which probably only matters for unit tests).

             

            salnikov Andy Salnikov added a comment - I'm not really sure how to do that. We could make a class which "wraps" string-only-DatasetType with DatasetType-with-storageclass when DatasetType instances cross from a user land into a butler land, but that looks potentially fragile and high-maintenance. What Tim proposed (dynamic storageClass property) is probably better if you accept that it can behave unpredictably before you initialize Butler (which probably only matters for unit tests).  

            Ready for review, the only non-trivial thing is that I had to re-implement _eq and hash_ methods for DatasetType. Jenkins has built is successfully for centos6/7, macos is still building.

            salnikov Andy Salnikov added a comment - Ready for review, the only non-trivial thing is that I had to re-implement _ eq and hash _ methods for DatasetType. Jenkins has built is successfully for centos6/7, macos is still building.
            tjenness Tim Jenness added a comment -

            Looks good to me. Sorry for delay in reviewing.

            tjenness Tim Jenness added a comment - Looks good to me. Sorry for delay in reviewing.

            Thanks for review! Merged all three packages.

            salnikov Andy Salnikov added a comment - Thanks for review! Merged all three packages.

            People

              salnikov Andy Salnikov
              salnikov Andy Salnikov
              Tim Jenness
              Andy Salnikov, Jim Bosch, Tim Jenness, Vaikunth Thukral
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Jenkins

                  No builds found.