DM-26407: Change how dimensions are stored in a Butler repository

      Description

      dimensions.yaml should be locked in and made read-only as soon as the schema is created during the makeRepo phase. At the moment nothing stops a repository from breaking if dimensions.yaml is updated in daf_butler, and nothing stops a user from editing the config file after repository creation and breaking things.

      We propose to remove dimensions.yaml from ButlerConfig and have it as an explicit argument to makeRepo (defaulting to pkg://lsst.daf.butler/configs/dimensions.yaml), which will then ensure that the dimensions configuration is locked in for that repository.

      We will still require that it be possible to migrate a registry to a new dimensions definition, but that migration should happen only by explicit request.
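
      As a rough sketch of what the proposed call might look like (the dimensionConfig keyword is the change being proposed here, not an existing argument, and the final spelling may differ):

          from lsst.daf.butler import Butler

          # Dimensions are burned in at repo creation; the default below is the
          # packaged configuration named in the proposal.
          config = Butler.makeRepo(
              "/path/to/repo",
              dimensionConfig="pkg://lsst.daf.butler/configs/dimensions.yaml",
          )
          butler = Butler(config)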

            Activity
            tjenness Tim Jenness created issue -
            jbosch Jim Bosch added a comment -

            I have prototyped some of this on DM-26336 (the part about how to serialize the DimensionUniverse to database tables), and it seems quite doable. That got me thinking that we might also want to make DimensionUniverse mutable, and have Registry APIs for adding new dimensions and columns dynamically to a repo (analogous to DatasetType registration); the config file in daf_butler would then just represent the initial/default. I'm writing a Confluence page to document what that prototyping ticket came up with, and will have more details there.
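
            For concreteness, the analogy with DatasetType registration would look roughly like this (the dimension call is imagined; only registerDatasetType exists today):

                from lsst.daf.butler import Butler, DatasetType

                butler = Butler("/path/to/repo", writeable=True)
                # Existing pattern: dataset types are registered dynamically.
                datasetType = DatasetType("deepCoadd", {"tract", "patch", "band", "skymap"},
                                          "ExposureF", universe=butler.registry.dimensions)
                butler.registry.registerDatasetType(datasetType)
                # Imagined counterpart floated in this comment (does not exist):
                # butler.registry.registerDimensionElement(...)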

            tjenness Tim Jenness made changes -
            Remote Link This issue links to "Page (Confluence)" [ 25769 ]
            salnikov Andy Salnikov made changes -
            Assignee Andy Salnikov [ salnikov ]
            salnikov Andy Salnikov made changes -
            Status: To Do → In Progress
            jbosch Jim Bosch added a comment -

            Now that I've gotten my head out of CALIBRATION collections, I'm going to try to write down the thoughts I had on this topic while prototyping, in the hope that it will help Andy Salnikov and me coordinate as we both make rather big changes to the dimensions system (mine will be over on DM-26692, or subtask tickets thereof).  My thoughts are not very well organized, and I'd be happy to have a live chat of some kind about them, but hopefully just writing them down now will help somewhat:

            My vision of the end-state for dimensions (beyond just this ticket) is that DimensionUniverse would become mutable, and either merge with DimensionRecordStorageManager somehow (e.g. DimensionUniverse becomes an ABC, with DimensionRecordStorageManager an implementation) or have some kind of Observer pattern with it so DimensionRecordStorageManager could update the database when new dimensions are registered (or other kinds of "additive" changes are requested).  For this ticket, I think we only care about saving the dimensions in the database instead of the configs (but keeping them fixed after repo creation), but I'm hoping we could do that in a way that is compatible (i.e. with no future schema changes) with the longer-term vision.

            If we had the time, I would definitely start by making DimensionUniverse mutable (and I think we might want to anyway; read on).  That's because DimensionUniverse construction is currently very tightly coupled to the config representation, and I think that needs to change in order to also support a database representation (yes, we could serialize the config to a blob column, but right now we can't even write out the config representation from an in-memory DimensionUniverse, and I think a blob-config database representation is a poor fit for the longer-term vision anyway).  So a natural way to decouple DimensionUniverse construction from config would be to make DimensionUniverse default-constructible and mutable; we could then have separate from-config and to/from-database serialization code that would both operate on DimensionUniverse methods, delegating to them the harder work of making sure relationships make sense and invariants are preserved.

            The problem with that is that DimensionUniverse is very fragile right now; adding a new Dimension or DimensionElement affects many others that do not seem like they should be related at all.  That's basically because DimensionUniverse keeps big sorted collections of all elements, and indexes into those collections are used in the DimensionGraph.encode and DimensionGraph.decode methods that are used to save DimensionGraph objects into the database (e.g. in the dataset_type table).  So adding a new Dimension totally breaks all existing DimensionGraph objects (and hence all DatasetTypes, etc.) unless it is only appended to the very end of those collections (which is obviously incompatible with maintaining a deterministic+topological sort order).

            I think the right way to fix this is to remove DimensionGraph.encode and DimensionGraph.decode, and instead give DimensionRecordStorageManager methods to register persistent DimensionGraphs in the database and retrieve them, mapping each to some kind of opaque, unique, persistent key (an autoincrement ID or a secure hash of the names both seem like they would work, though a secure hash is better if we want it to be meaningful across repos/universes).  We'd use that key everywhere we use the encoding right now; DatasetRecordStorageManager would need to register a DimensionGraph every time it registers a new DatasetType, for example.  I think we could then make the indexes into those sorted collections of dimensions totally private; I think we want to keep the sorted collections themselves to implement DimensionUniverse.sorted.
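
            To illustrate the secure-hash option (names and hash choice here are illustrative, not a settled design), the key could be derived from the sorted element names, making it stable across repos and universes:

                import hashlib

                def dimension_graph_key(element_names):
                    """Illustrative persistent key: a secure hash over the sorted
                    element names, independent of any in-memory sort indexes."""
                    payload = ",".join(sorted(element_names)).encode("utf-8")
                    return hashlib.sha256(payload).hexdigest()

                # Any repo computing the key for the same set of dimensions gets
                # the same value, unlike the current index-based encoding.
                key = dimension_graph_key({"instrument", "visit", "detector"})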

            It's worth noting that DimensionGraph has its own sorted collections of dimensions whose indexes are used, but because adding a new Dimension to the universe doesn't affect those per-DimensionGraph indexes, this isn't a problem.  I think it's also nice to have a notion of "registered/in-use" DimensionGraphs from the perspective of knowing whether a requested change to a DimensionUniverse should be allowed or not, even though that's a future concern.

            So, all of the above suggests that one approach to this ticket (and the one I think I'd recommend) is:

            1. Implement registration/retrieval methods for DimensionGraphs on DimensionRecordStorageManager.
            2. Remove DimensionGraph.encode/decode in favor of those.
            3. Make DimensionUniverse default-constructible and mutable, and rewrite the DimensionUniverse-from-config code to use those methods.  We'd probably want a runtime flag on DimensionUniverse to freeze it after we're done constructing it, for now (see the sketch after this list).
            4. Add new code to use the same mutator methods to construct a DimensionUniverse from a database representation, as well as code to write the DimensionUniverse to that database representation.
            5. (On some other ticket, after Nov. 1), remove the runtime-freezing flag on DimensionUniverse and instead connect it to DimensionRecordStorageManager so new dimensions could be registered (etc.) after repo construction.
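
            A minimal sketch of the runtime-freeze idea in step (3); the class and method names are made up for illustration:

                class MutableUniverse:
                    """Toy stand-in for a default-constructible DimensionUniverse."""

                    def __init__(self):
                        self._frozen = False
                        self._elements = {}

                    def addElement(self, name, definition):
                        if self._frozen:
                            raise TypeError("Universe is frozen; no further mutation allowed.")
                        self._elements[name] = definition

                    def freeze(self):
                        # Called once from-config or from-database construction is done.
                        self._frozen = True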

            I like that order in part because I think steps (1) and (2) are less related to changes I want to make to the dimensions system for DM-26692, so I'm hoping I can get those done and we can talk again before you get to (3) and (4).

            Is that at all compatible with what you had in mind for this ticket?  Is there anything important that I've missed?  Is there some other order that would get us the important-for-schema-stability parts first?  At first I thought we might want to start by leaving DimensionUniverse immutable and adding a Builder class for both the config and database serialization code to use, and then not have to worry about DimensionGraph's fragile encoding until after Nov. 1.  But removing DimensionGraph encoding is a schema change, too, so if we're going to do that at all I don't think we want to put it off.

            salnikov Andy Salnikov added a comment -

            I was doing some reading and thinking over the weekend; I still don't have a complete picture in my head, but I want to share some thoughts of my own.

            I agree that we need a better config representation in the database than a YAML blob; having some structure should make it easier to work with. I think it should be easy to map the current config structure to a small number of new tables.

            I am not sure I like a mutable DimensionUniverse though; I think it brings a lot of complications. It probably implies some sort of synchronization of the in-memory universe with the database. One way to do it is to reflect every change to the database immediately; this means the universe is tightly coupled with the registry to support that (an observer can help, but it's not optional, so you'd have to set up that observer for every universe), and if a few updates need to be done together we also have to think about transactions and concurrency. Another way is to have an explicit synchronization after all in-memory updates; this helps with transactions but can become really complicated, because you need to compare the in-memory state with the database and decide which updates are needed (this may not be a big problem if we support append-only modifications, but if more complicated updates are needed it will be non-trivial). In both cases we need to be careful about concurrent updates.
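
            To make the first option concrete, the observer wiring would look roughly like this (entirely schematic; the point is that the observer is mandatory, not optional):

                class UniverseObserver:
                    """Schematic observer that mirrors every in-memory change to the
                    database immediately, making each mutation a database write."""

                    def __init__(self, db):
                        self.db = db

                    def on_element_added(self, name, definition):
                        # Every change now raises the transaction and concurrency
                        # questions described above.
                        self.db.insert("dimension_meta", name=name, definition=definition)

                # Forgetting to attach the observer to a universe, e.g.
                # universe.add_observer(UniverseObserver(db)), would let the
                # in-memory and database states silently diverge.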

            I think I would be more comfortable if universe updates were a separate management operation, not exposed as DimensionUniverse API but maybe as a set of registry methods or even a lower manager-level interface. We could keep the universe immutable then, though there are still issues with synchronization: if the database dimensions change, the in-memory state could become inconsistent, depending on what sort of changes we allow. Maybe the correct way to handle this is to avoid having an in-memory universe and always go to the database for the correct info (I know that's terrible for performance, but I only care about correctness now).

            One thing that may be useful to understand now is what people would do operationally to update the database. The initial storage would probably remain based on the YAML config; I think the options could be to make a universe instance out of that and use that instance to initialize the database, or to use the Config directly (but maybe have the universe validate the Config), and that will happen in the butler make-repo method. What would people need to update it later - a full new YAML config, a small YAML with the updated info only, or just a Python script that calls registry methods? I think the API we need to provide for updates will depend on which of these options we want to support.

            jbosch Jim Bosch added a comment -

            I think I would be more comfortable if universe updates were a separate management operation, not exposed as DimensionUniverse API but maybe as a set of registry methods or even a lower manager-level interface. We could keep the universe immutable then, though there are still issues with synchronization: if the database dimensions change, the in-memory state could become inconsistent, depending on what sort of changes we allow. Maybe the correct way to handle this is to avoid having an in-memory universe and always go to the database for the correct info (I know that's terrible for performance, but I only care about correctness now).

            I think with the changes I proposed to track "registered" DimensionGraphs in the database, it would probably become viable to always go to the database when constructing any DimensionGraph, and I think constructing (and caching) DimensionGraphs is really DimensionUniverse's main role right now (at least outside of unit tests).  There is a bit more that it does to support the query system, but this is what DM-26692 will be changing, and I think that state could just as easily move to DimensionRecordStorageManager. So maybe if the interface for registering/creating DimensionGraphs also moves to DimensionRecordStorageManager, DimensionUniverse could just go away entirely, and we wouldn't have to worry about whether it is mutable or not.  Removing it would be pretty disruptive, though, and very hard to do until after we have a system for registering DimensionGraphs in the database, so I'm not sure how to do the transition.  I do think it would be necessary to do some caching of DimensionGraphs in the manager class, but then that wouldn't be substantially different from the DatasetType and collection caching we do in other managers.

            What would people need to update it later - a full new YAML config, a small YAML with the updated info only, or just a Python script that calls registry methods? I think the API we need to provide for updates will depend on which of these options we want to support.

            I do think updates will be very rare, such that we can demand that they happen in their own special process and that no other Dimension objects be active when they happen. I don't think it makes sense to ask the user to provide a full new YAML config - that seems like it just puts burdens on them to reproduce the current DB state in config and us to verify that they are consistent (and then some kind of diff to learn what changed). I think we may have a couple of different command-line tool "verbs" for updating dimensions (at least adding new dimensions and adding new columns), and it may make sense for those to take YAML snippets as input to reduce how much structured information has to be passed through the CLI options. But I don't think that YAML needs to be strictly identical to what the same definitions might look like in the original config - that's obviously nice for users if it is possible, but I think for correctness it's most important for us to recognize that a DSL for defining a completely new, self-consistent set of dimensions is not automatically a good DSL for modifying a set of dimensions.

            jbosch Jim Bosch added a comment -

            I think we may have a couple of different command-line tool "verbs" for updating dimensions

            Thinking about it a bit more, this will be so rare that having dedicated command-line scripts that read YAML and translate that into Registry calls is probably overkill. Just writing a custom Python script that calls Registry or manager methods directly seems fine.

            salnikov Andy Salnikov added a comment -

            If we can guarantee that dimensions in the database are never updated during regular running, then I think it is OK to keep the universe as an immutable instance and only load it once from the database. If nothing changes in the database then all caches will stay consistent. I think with this assumption the changes should be relatively small: basically replace/extend construction of the universe from Config with similar code to build it from the database.

            tjenness Tim Jenness added a comment -

            I think if people are regularly tweaking dimensions in an active registry we have some big problems. The main use cases I had in mind were:

            • Make it simpler for non-Rubin people to change the dimensions globally for their use case.
            • Allow us to do dimension changes without breaking existing installations and deployments (effectively makeRepo burns in the dimensions and any changes to the dimensions in daf_butler require us to run an explicit migration script)
            salnikov Andy Salnikov added a comment -

            Tim, I understand and agree with what you describe. My concern is that even with these limited use cases we still do not want to cause problems for jobs running concurrently with the dimensions upgrade, either because there cannot be such jobs (by policy decision) or because we designed our system to handle that case completely transparently. The latter is hard to do and probably needs a lot of changes everywhere, so I hope we can decide on the former instead.

            tjenness Tim Jenness added a comment -

            I'm fine with project policy enforcing downtime rather than trying to do a live change to dimensions (with version increment).

            jbosch Jim Bosch added a comment -

            I'm on board with requiring downtime for dimensions changes. And I'm on board with keeping DimensionUniverse around as an immutable object; that definitely makes this ticket a lot easier, and if I'm right that DimensionUniverse can become less important or be removed in the future, that can be a completely separate change that isn't a schema change.

            jbosch Jim Bosch added a comment -

            (Sorry, hit enter too soon).

            I do still think we need to register DimensionGraph objects in the database instead of the super-fragile encoding system (i.e. steps 1 and 2 from my big post earlier). If we don't do that, then any actual dimensions migration is still going to be very painful, because we'll have to update all of the encoded values even for sets of dimensions that weren't otherwise affected. But those steps no longer have to be done before this ticket, or even with it; that can be another ticket entirely - but if it is, I think we still want it done before we declare schema stability.

            salnikov Andy Salnikov added a comment -

            Looking at the implementation side of things - there is one minor complication at registry initialization time, a sort of chicken-and-egg problem. When the registry initializes all managers in the declareStaticTables context, it already needs an instance of the universe to pass to a few managers. When creating a new registry this is not a problem because the universe will come from a Config, but when we make a registry for an existing database we'll have to read the universe from the database before we instantiate the managers. Looks like we need some special casing for this, and that may also interact with other things like version checking, etc. I need to dig around for more complications, probably.

            jbosch Jim Bosch added a comment -

            Andy Salnikov, I think this is now pretty well-defined and I don't have anything new to add in the way of design thoughts (or any particular wisdom on the latest complication you mentioned). But I wanted to warn you that I am (on DM-26692) going to be shuffling some code around in daf/butler/core/dimensions/*, and adding some new state to the dimension configuration and classes as well, so when you start coding in earnest in a way that would be affected by that, we may want to coordinate a bit so neither one of us ends up with a particularly painful rebase conflict later.

            salnikov Andy Salnikov added a comment -

            I had a bit of time to look at the issue I mentioned above; I think it should not be too difficult to make a reasonable workaround. The declareStaticTables context is used to build a sqlalchemy metadata instance for the database. For querying the universe configuration we do not need the complete schema, only the small set of tables that define that configuration, so we can do it in two steps (sketched after this list):

            • first, make a context and only include the dimensions manager (plus the attributes manager, which is needed to verify versions), call the version manager to check versions, and then use the dimensions manager to retrieve the universe;
            • second, make a context to add all managers and check versions for all of those. Currently declareStaticTables makes a new metadata instance on each call, so we'll have to repeat adding the dimensions and attributes managers; alternatively I can look at how to extend the metadata instead of replacing it.
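
            Schematically, the two-step startup could look like this (manager and method names only approximate daf_butler's; this is a sketch of the control flow, not the final code):

                def load_universe_then_schema(database, attributes_cls, dimensions_cls,
                                              initialize_all_managers):
                    # Step 1: declare only the tables needed to read the universe.
                    with database.declareStaticTables(create=False) as context:
                        attributes = attributes_cls.initialize(database, context)
                        dimensions = dimensions_cls.initialize(database, context)
                    attributes.checkVersions()            # verify versions first
                    universe = dimensions.loadUniverse()  # then read the universe back
                    # Step 2: declare the full schema now that the universe is known;
                    # dimensions/attributes tables are re-declared because each call
                    # currently builds a fresh sqlalchemy MetaData instance.
                    with database.declareStaticTables(create=False) as context:
                        return initialize_all_managers(database, context, universe)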

            Jim Bosch, I think you mentioned you prototyped a schema for storing the universe - do you have code I can look at?

            jbosch Jim Bosch added a comment -

            All I had were some dataclasses that I thought would map pretty well to tables:

            https://github.com/lsst/daf_butler/blob/u/jbosch/DM-26336/prototyping/python/lsst/daf/butler/core/dimensions3/_meta_rows.py

            I'm not certain they're quite right for either the dimensions definition state as it exists now or as it will exist after I'm done with my changes (I hope to have something ready for review early tomorrow, but it's a pretty big refactor), but it's close.
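
            For readers without that branch checked out, the general pattern is dataclasses whose fields line up one-to-one with columns of prospective tables, along these lines (a generic illustration, not the actual contents of _meta_rows.py):

                from dataclasses import dataclass

                @dataclass
                class DimensionElementRow:
                    # Each field would map directly onto a column of a
                    # hypothetical dimension-metadata table.
                    name: str
                    required: bool
                    cached: bool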

            salnikov Andy Salnikov added a comment -

            Thanks Jim Bosch, this is useful. I'm not quite sure yet what to make of the alias elements; we can probably talk about it tomorrow at the meeting.

            jbosch Jim Bosch added a comment -

            The most I think we want to do with alias elements now is try to leave room for them in the database representation if there are obvious ways to do so. I have no plans to get them in master soon.

            salnikov Andy Salnikov added a comment - edited

            Jim Bosch, I'm thinking about the schema and have a couple more questions related to that:

            • I did not see it in your prototype code, but I guess the "packers" section in dimensions.yaml will also need a representation in the database? Are a packer's fixed dimensions a subset of its dimensions?
            • some elements have two "keys" (typically id:int and name:str); my guess is that they are not a composite key but two separate unique keys instead? Is there any chance that we may need a composite key for any future elements?
            jbosch Jim Bosch made changes -
            Link This issue relates to DM-27033 [ DM-27033 ]
            tjenness Tim Jenness made changes -
            Labels: gen2-deprecation-blocker gen3-middleware → gen2-deprecation-blocker gen3-middleware gen3-registry-incompatibility
            salnikov Andy Salnikov added a comment -

            Re-thinking my approach to the schema in light of all the recent changes - I'm not sure anymore that I want to go with a complete decomposition of all configuration items into a relational schema. I'm not very comfortable with the complexity and stability: I think it will need about 10 tables to implement the dimensions.yaml relationships in the database, and it is likely that we'll need to migrate that schema ~soon based on our experience. I think I'm more comfortable, for now at least, with keeping that complexity on the client side and storing the configuration in the database in the most flexible way, meaning just a text (YAML) representation as a single blob for the whole dimensions.yaml contents. So the primary source of the configuration will be a Config instance, but we don't decompose that into a bunch of tables; we store it as a whole and retrieve it later. This has the drawback that the database can no longer be used to enforce constraints on the configuration, so it all has to be done on the client side, but I think this is acceptable given that the alternative has a similar issue for possible migrations.
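
            Schematically, the blob approach reduces persistence to a single key/value row; the table and key names below are illustrative only:

                import sqlite3
                import yaml

                conn = sqlite3.connect(":memory:")
                conn.execute("CREATE TABLE butler_attributes (name TEXT PRIMARY KEY, value TEXT)")

                def save_dimensions(conn, config):
                    # Store the whole dimensions config as one YAML text blob.
                    conn.execute(
                        "INSERT INTO butler_attributes (name, value) VALUES (?, ?)",
                        ("config:dimensions.yaml", yaml.safe_dump(config)),
                    )

                def load_dimensions(conn):
                    (value,) = conn.execute(
                        "SELECT value FROM butler_attributes WHERE name = ?",
                        ("config:dimensions.yaml",),
                    ).fetchone()
                    return yaml.safe_load(value)

                save_dimensions(conn, {"version": 0, "elements": {"instrument": {}}})
                assert load_dimensions(conn)["version"] == 0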

            What I'm not quite sure about is how this would interact with schema migration. I can imagine that some simple changes (like adding a new dimension) are totally fine without any changes in schema versions. What will probably be more complicated is a change in the schema (existing dimensions tables) that also needs an update to the configuration. That cannot be done entirely in SQL but will require additional Python code to read the existing config, modify it, and save it, which may or may not be trivial (but is probably the same complexity level as trying to figure out the migration in SQL).

            jbosch Jim Bosch added a comment -

            I'm on board with switching to just serializing YAML or some other blob.  I was coming to the same conclusion just from thinking about the additional changes I want to make this month, ideally after this ticket, and how seemingly small changes in Python could yield bigger changes to the database representation.  We should make sure that what we save is totally deterministic, though, and YAML/Config may not be as conducive to that (it's at least not safe to just save YAML/Config as written by some human without some deterministic sorting first).

            I'm not sure this addresses the exact problem you were thinking about in the second paragraph, but storing a semver version triple for the persisted dimensions definitions seems like it would bring them into the same system as the rest of the versions that are there, with patch-number increments then being a way to signal that a migration isn't needed for compatibility with a change.  We could add a user-defined string tag as well to represent some kind of branch name if we're worried about monotonicity.
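
            One cheap way to get that determinism is to round-trip the text through a parser with sorted keys before storing it, rather than persisting the human-written file verbatim (a sketch, not the adopted solution):

                import yaml

                def normalized(yaml_text):
                    # Canonicalize: parse, then re-dump with sorted keys so that
                    # equivalent inputs produce byte-identical stored blobs.
                    return yaml.safe_dump(yaml.safe_load(yaml_text), sort_keys=True)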

            salnikov Andy Salnikov added a comment -

            I think this is ready for initial review. Most of the changes are in daf_butler, but there are a few dependencies that needed an update too. The daf_butler github checks fail for two reasons:

            • The documentation build fails due to a recursion limit - not sure what to do about it.
            • mypy fails for NotImplemented - I think we'll have to rebase our development branch onto a later master; I will do it after this merge.

            Jenkins fails on macOS in ci_hsc_gen3 with a SEGV while running pipetask; not sure what to do about this one.

            The packages:

            • daf_butler
            • obs_base
            • pipe_base
            • ctrl_mpexec
            • ap_verify_testdata - I regenerated the sqlite file with add_gen3_repo, which also updated config/export.yaml, where it replaced bias with flat; I hope this is right but please check.

            Tim Jenness, I'm sure you are looking at this too - can you specifically check what I did to the DimensionConfig class?

            Krzysztof Findeisen, you are also welcome to check ap_verify_testdata; just note that this will be merged to the development branch (tickets/DM-27033), not master.

            salnikov Andy Salnikov made changes -
            Reviewers Jim Bosch [ jbosch ]
            Status: In Progress → In Review
            tjenness Tim Jenness added a comment -

            macOS failure is probably DM-27118.

            We know the docs build on master, so there must be some oddity with your branch in particular.

            Regarding the in-memory registry - I'm not sure if you are saying that this is now impossible in butler, or that it's something about our tests that means we can't use it in tests.

            salnikov Andy Salnikov added a comment -

            In-memory registry use is very limited now - it only works if you instantiate a Registry yourself and use it throughout the test. In most other cases, when you need a Butler instance, it will not work anymore, because we always create more than one butler and registry, and that needs multiple sqlite connections. So, indeed, if you need a butler, you cannot use in-memory sqlite anymore.
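
            The underlying limitation is that each sqlite :memory: connection is its own private, empty database, so a second Registry connection never sees the schema the first one created:

                import sqlite3

                a = sqlite3.connect(":memory:")
                b = sqlite3.connect(":memory:")  # a completely separate database
                a.execute("CREATE TABLE t (x)")
                try:
                    b.execute("SELECT * FROM t")
                except sqlite3.OperationalError as err:
                    print(err)  # no such table: t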

            salnikov Andy Salnikov added a comment -

            One possible way to use an in-memory registry with a butler is to make makeRepo return a Butler instance instead of a config; then this butler instance can use the Registry that was just created (the butler needs to be initialized specially) and can be reused without making a new database connection.

            tjenness Tim Jenness added a comment -

            A makeRepoAndButler method seems fine to me, given that this is quite a common paradigm for testing. It might have to return both the butler and the config so that makeRepo can be a thin wrapper around it for compatibility.

            salnikov Andy Salnikov added a comment -

            makeRepo does not create a butler today; it only makes the configuration and initializes the Registry. I am looking at how makeRepo is used today; in many cases it looks like

                newConfig = Butler.makeRepo(root, ...)
                butler = Butler(newConfig, ...)
            

            or

                Butler.makeRepo(self.root, ...)
                butler = Butler(self.root, ...)
            

            (and there are more advanced cases in daf_butler which test multiple butlers).

            I think for the cases above it does not make much sense to even have that config around; having a Butler returned is sufficient. And we could also expose the config instance as a Butler attribute for people who need access to it.

            So, maybe makeRepoButler(*makeRepoArgs, *ButlerArgs) -> Butler is sufficient? The number of arguments bothers me, though.

            (there is also a horrible, hackish way to return a Registry from makeRepo() via a "secret" key in the same Config instance and re-use that registry in the Butler constructor, but I cannot bring myself to suggest this in a public forum)

            tjenness Tim Jenness added a comment -

            Yes, those two schemes are very common (and I have been switching to the first solely because makeRepo might write the file somewhere else, and there is no reason to use the file on disk if you already have the config in Python).

            Should createRegistry ever be false in makeRepo? I think there's an outstanding usability issue with makeRepo as well, because the overwrite flag interacts badly with the createRegistry flag (surely makeRepo should either write a butler.yaml and init the registry, or fail to do either of those things; writing a new butler.yaml but allowing the registry to be incompatible seems bad).

            I'm willing to see if makeRepoButler API would work – seems like it's worth trying.

            salnikov Andy Salnikov added a comment -

            I do not know if there is a use case for createRegistry = False; especially with more involved registry initialization, it makes sense to keep it all together. OTOH, I have no idea whether BPS will want to do something smart to create a repo, e.g. making a copy of a Registry from some other location after makeRepo.

            I can play with makeRepoButler on my next ticket.

            salnikov Andy Salnikov added a comment -

            The documentation build failure is due to a too-recent documenteer version; on the master branch the problem was fixed by this commit. I do not want to apply it on this branch, to avoid a rebase conflict, but once we rebase our development branch both failures in GitHub Actions should disappear.

            salnikov Andy Salnikov added a comment -

            Jenkins is failing for this branch due to updates on master of a few obs_ packages. I do not want to rebase our development branch just yet, as it may mess up the PR, so I'll stop running Jenkins and rebase the development branch as soon as I merge this ticket.

            jbosch Jim Bosch added a comment -

            Looks good! I have not waded into the config discussions, but otherwise I have essentially nothing to comment on (except a typo).

            jbosch Jim Bosch made changes -
            Status In Review [ 10004 ] Reviewed [ 10101 ]
            salnikov Andy Salnikov made changes -
            Resolution Done [ 10000 ]
            Status Reviewed [ 10101 ] Done [ 10002 ]
            tjenness Tim Jenness added a comment -

            Looking at a dump of the database, I do wonder about storing dimensions.yaml as JSON in the database rather than YAML. It would be a fairly easy fix to make Config.fromJSON work (you only lose the special include support in the parser, and we don't need that for this anyhow).
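
            A sketch of roughly what that could look like, assuming Config can be constructed from a plain dict (the function name here is a guess; the real JSON support was added later on DM-27256):

                import json

                from lsst.daf.butler import Config


                def configFromJSON(text):
                    # Sketch only: assumes Config accepts a plain mapping; the
                    # eventual API on DM-27256 may differ.  JSON needs no special
                    # include handling, which is the point made above.
                    return Config(json.loads(text))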

            salnikov Andy Salnikov added a comment -

            Is JSON better than YAML? Are you worried about line breaks? With YAML you can at least dump that straight into dimensions.yaml without going through the Config class.

            tjenness Tim Jenness made changes -
            Watchers Andy Salnikov, Jim Bosch, Michelle Gower, Tim Jenness [ Andy Salnikov, Jim Bosch, Michelle Gower, Tim Jenness ] Andy Salnikov, Jim Bosch, Kian-Tat Lim, Michelle Gower, Tim Jenness [ Andy Salnikov, Jim Bosch, Kian-Tat Lim, Michelle Gower, Tim Jenness ]
            tjenness Tim Jenness added a comment -

            It's more of a general comment about storing all the \n and whitespace characters in the table, when avoiding exactly that is what JSON is designed for.

            Supporting JSON natively in Config is not difficult. Kian-Tat Lim, do you have an opinion on storing JSON vs YAML in the database?

            ktl Kian-Tat Lim added a comment -

            My general philosophy is: "YAML is better if humans are writing/reading it. JSON is better if machines are writing/reading it."

            salnikov Andy Salnikov added a comment -

            To make everyone (un)happy we should do XML then.

            tjenness Tim Jenness made changes -
            Link This issue relates to DM-27256 [ DM-27256 ]
            tjenness Tim Jenness added a comment -

            The JSON form of dimensions.yaml is 10% smaller than the YAML version. I think Kian-Tat Lim is on the side of switching to JSON. I will do that once DM-27256 is merged (which adds JSON support to Config).
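
            For reference, a quick way to reproduce that comparison, assuming a local copy of dimensions.yaml:

                import json

                import yaml

                # Load the dimensions configuration and re-serialize it both ways.
                with open("dimensions.yaml") as fh:
                    data = yaml.safe_load(fh)

                yaml_text = yaml.safe_dump(data)
                json_text = json.dumps(data)

                print(f"YAML: {len(yaml_text)} chars, JSON: {len(json_text)} chars")
                print(f"JSON/YAML ratio: {len(json_text) / len(yaml_text):.2f}")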

            tjenness Tim Jenness made changes -
            Link This issue relates to DM-27266 [ DM-27266 ]
            fritzm Fritz Mueller made changes -
            Epic Link DM-25244 [ 435560 ]

              People

              Assignee:
              salnikov Andy Salnikov
              Reporter:
              tjenness Tim Jenness
              Reviewers:
              Jim Bosch
              Watchers:
              Andy Salnikov, Jim Bosch, Kian-Tat Lim, Michelle Gower, Tim Jenness