Data Management / DM-15686

Re-implement task execution in laptop activator


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      Changes in the gen3 butler require a few updates to the execution framework. I think the piece responsible for running the task should still work (though it has not been tested recently), but the post-pre-flight part needs to do some additional work:

      • make sure that the output collection exists in the registry
      • copy/associate (if needed) all input datasets into the output collection; this probably needs to be done recursively
      • save provenance information (Quanta) in the registry

      I probably need more realistic examples of PipelineTask to properly test it.
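      A rough sketch of those steps, for illustration only. The registry object and associate() come from the discussion below; ensureCollection(), inputRefs(), and addQuantum() are hypothetical names, not the actual gen3 butler API:

          def prepareOutputCollection(registry, quantumGraph, outputCollection):
              # 1. make sure the output collection exists in the registry
              registry.ensureCollection(outputCollection)  # hypothetical method
              # 2. copy/associate all input datasets into the output collection;
              #    inputRefs() is assumed to walk the graph recursively
              registry.associate(outputCollection, list(quantumGraph.inputRefs()))
              # 3. save provenance information (Quanta) in the registry
              for quantum in quantumGraph:
                  registry.addQuantum(quantum)  # hypothetical method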

        Attachments

          Issue Links

            Activity

            Andy Salnikov created issue
            Andy Salnikov made changes - Epic Link: DM-14661
            Andy Salnikov made changes - Risk Score: 0
            Andy Salnikov made changes - Status: To Do → In Progress
            Andy Salnikov added a comment -

            Thoughts on filling the output collection:

            • the input to that process is a QuantumGraph, which has a bunch of input and output DatasetRefs (DatasetType + DataId)
            • we need to "copy" all input Datasets into the output collection, but only if they are not already there; a DatasetRef is non-unique in the registry and can appear in multiple collections, so we should make sure that the correct DatasetRef from the correct input collection is copied
              • the DatasetRef in the QuantumGraph already has a unique id for the Dataset (dataset_id) which uniquely identifies the DatasetRef that we need to copy
            • the current implementation of a collection in the SQL registry is a table with (collection_name, dataset_id) records
            • by convention a collection can only contain unique DatasetRefs, but the schema does not have corresponding constraints, so this has to be verified in client code

            So for each input DatasetRef we need to add its dataset_id to the output collection, but we have to make sure that the corresponding DatasetType+DataId is not already in the output collection (with a different dataset_id). The registry currently has a method to check for individual DatasetRefs in a collection, but this may not scale if we want to check many DatasetRefs for large QuantumGraphs (even setting aside concurrency issues for now, which we should not).

            Additionally it would be good to check that output DatasetRefs do not appear in the output collection, though this may be a moot point because in a concurrently-updated registry there is no guarantee that something checked right now will stay true for very long.
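            One way to make the check scale is to fetch the output collection's contents in a single query and compare in memory, instead of one find() per DatasetRef. A minimal sketch, assuming a DB-API connection and a simplified schema in which the Dataset table carries a serialized data_id column (the real registry spreads the DataId over per-DataUnit columns); ref.id and ref.datasetType.name are illustrative attribute names:

                def findConflicts(connection, collection, refs):
                    # one query instead of one find() call per DatasetRef
                    rows = connection.execute(
                        "SELECT d.dataset_type_name, d.data_id, d.dataset_id "
                        "FROM Dataset d JOIN DatasetCollection dc "
                        "ON dc.dataset_id = d.dataset_id "
                        "WHERE dc.collection = ?", (collection,))
                    existing = {(typeName, dataId): datasetId
                                for typeName, dataId, datasetId in rows}
                    # a conflict is the same DatasetType+DataId present under
                    # a different dataset_id
                    return [ref for ref in refs
                            if existing.get((ref.datasetType.name, str(ref.dataId)),
                                            ref.id) != ref.id]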

            Andy Salnikov added a comment -

            Adding DatasetRefs to a collection is implemented in SqlRegistry's associate() method, which in the current implementation just unconditionally adds all dataset_ids to the DatasetCollection table (the caller is supposed to verify that new dataset_ids have no conflicts with existing ones). This method is also used by the registry's addDataset() method. addDataset() is implemented as:

            • check that the DataId does not yet exist in the collection by calling find(collection, datasetType, dataId)
            • add the new Dataset to the Dataset table and get its dataset_id
            • call associate(collection, [DatasetRef(datasetType, dataId, dataset_id)])

            There is a concurrency issue with this too: if there is more than one client executing the above code, both of them may succeed.

            What we really need here is a unique constraint involving more than one relation, and that is not something that can be done in a portable way (some databases may support something like constraints on views, or triggers with checks). The reasonably portable way is to implement it as part of the associate() method, though that needs a lot of care to handle concurrency. addDataset() can then be re-implemented to move the checks from find() into associate() (we still need to insert into Dataset, but that will be rolled back if the check in associate() fails).
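            A sketch of what moving the check into associate() could look like, reusing findConflicts() from the previous sketch and assuming a DB-API connection whose context manager commits or rolls back the transaction (as sqlite3's does). Note the check-then-insert is still racy without locking, which is exactly the problem discussed in the next comment:

                def associate(connection, collection, refs):
                    with connection:  # one transaction: rolls back on failure
                        conflicts = findConflicts(connection, collection, refs)
                        if conflicts:
                            raise ValueError("conflicting DatasetRefs in collection "
                                             "{}: {}".format(collection, conflicts))
                        connection.executemany(
                            "INSERT INTO DatasetCollection (collection, dataset_id) "
                            "VALUES (?, ?)",
                            [(collection, ref.id) for ref in refs])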

            Andy Salnikov added a comment -

            Thinking about it more shows that this is a tough problem; it can probably be solved only with locks. The trouble is that row-level locking works differently across implementations (I think what we need is a "gap lock", which exists in InnoDB but not in other databases?) and table-level locking is not portable. Dunno what to do.
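            For the record, the non-portable options would look roughly like this; each statement is specific to one engine, which is the whole problem:

                # PostgreSQL: explicit table lock held until the transaction ends
                connection.execute(
                    "LOCK TABLE DatasetCollection IN SHARE ROW EXCLUSIVE MODE")
                # MySQL/InnoDB: a locking read takes gap locks under REPEATABLE READ
                connection.execute(
                    "SELECT dataset_id FROM DatasetCollection"
                    " WHERE collection = %s FOR UPDATE", ("out",))
                # SQLite: take the database write lock up front, not at first write
                connection.execute("BEGIN IMMEDIATE")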

            Kian-Tat Lim added a comment -

            I haven't looked at this closely, but it sounds like you need a canonicalize method to generate a consistent dataset_id for all possible DatasetRefs that refer to the same thing. Then a simple unique key on (collection_id, canonical_dataset_id) will deal with your concurrency issues.
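            A sketch of one possible (hypothetical) canonicalize implementation, hashing the DatasetType name together with the sorted DataId so every DatasetRef naming the same dataset maps to the same value; the unique key then lets the database arbitrate concurrent inserts:

                import hashlib

                def canonicalDatasetId(datasetTypeName, dataId):
                    # deterministic: same DatasetType+DataId always hashes the same
                    parts = [datasetTypeName] + ["%s=%s" % (k, dataId[k])
                                                 for k in sorted(dataId)]
                    return hashlib.sha256(";".join(parts).encode()).hexdigest()

                # schema side (names hypothetical):
                #   ALTER TABLE DatasetCollection ADD COLUMN canonical_dataset_id TEXT;
                #   CREATE UNIQUE INDEX uq_collection_canonical
                #       ON DatasetCollection (collection, canonical_dataset_id);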

            Andy Salnikov added a comment -

            Kian-Tat Lim, I was trying to imagine a way to make it work in the current schema. I agree that extending the schema with a "canonical id" would work, but there are non-trivial issues with that:

            • we need to upgrade the registry schema (though this is rather trivial at this point in time)
            • building a canonical ID in a completely backward- and forward-compatible way is problematic, and it does not fit very well in the schema that we have now
            • I think it will either be rather inefficient space-wise or will need to rely on some database locking mechanism (which I want to avoid as potentially non-portable)

            One implementation of a canonical id that I can imagine is just a string representation of a DatasetRef (the DatasetType and DataId parts of it, e.g. Patch(patch=42,skymap=MySkyMap,tract=100)). This has to be done very carefully to avoid ambiguities and to keep it compatible w.r.t. potential schema changes (which is hard when you cannot predict the future).

            Still, I agree with one thing - we need a table-level constraint check for this, otherwise things will get very ugly. I think implementing that kind of thing is beyond the scope of this ticket; what I want to do here is make some trivial check that works in a single-user environment, basically more or less the same thing that we have today in addDataset(), but done in a more efficient way.

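            A sketch of the string-valued canonical id mentioned in the comment above; sorting the keys and escaping the separator characters is the "very careful" part that avoids ambiguities (all names here are illustrative):

                def canonicalString(datasetTypeName, dataId):
                    def esc(value):
                        # escape separators so distinct DataIds can never collide
                        s = str(value)
                        return (s.replace("\\", "\\\\")
                                 .replace(",", "\\,")
                                 .replace("=", "\\="))
                    items = ",".join("%s=%s" % (esc(k), esc(dataId[k]))
                                     for k in sorted(dataId))
                    return "%s(%s)" % (datasetTypeName, items)

                # canonicalString("Patch", {"tract": 100, "patch": 42,
                #                           "skymap": "MySkyMap"})
                # returns "Patch(patch=42,skymap=MySkyMap,tract=100)"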
            Jim Bosch added a comment -

            re canonical IDs and collection integrity: this is something Pim Schellart [X] and I thought about and always planned on doing eventually but haven't yet faced up to trying to design.  This might be related to the problem of generating deterministic integer IDs for certain combinations of DataUnits (the successor to Gen2 pseudo-datasets like "ccdExposureId").  If we can extend that functionality to all combinations of DataUnits (which would probably require using something larger than an int64 in at least some cases), it would take a big step towards solving this problem.

            If the PipelineTask activator needs a solution to this for concurrency sooner, we could add a boolean option to various Registry methods that callers can use to disable collection integrity checks (and by doing so promise that they'll maintain collection integrity themselves).
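            For one fixed combination of DataUnits the deterministic integer is straightforward range packing, in the spirit of Gen2's ccdExposureId; a toy sketch with made-up ranges (the hard part, as noted above, is doing this for all combinations without overflowing an int64):

                def ccdExposureId(visit, detector, detectorsPerVisit=200):
                    # unique as long as 0 <= detector < detectorsPerVisit
                    return visit * detectorsPerVisit + detector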

            Vaikunth Thukral made changes - Sprint: BG3_F18_09 → BG3_F18_09, BG3_F18_10
            Andy Salnikov added a comment -

            Having a deterministic integer DataId would certainly help with implementing constraints on DatasetCollection, though I'm not sure how that integer thing can be implemented in practice for arbitrary DataIds. OTOH, constraints on DatasetCollection could be implemented using a different (e.g. database-specific) mechanism which is not required to be deterministic, only unique. This would likely need extending the registry schema with new table(s) in addition to new column(s) in DatasetCollection (which also raises the question of how we manage schema versioning).

            I don't particularly like the idea of disabling integrity checks (even optionally). For PipelineTask I think I will implement something that works for the single-user case, i.e. it will not be broken any more than it is today.
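            A sketch of the "unique but not deterministic" variant mentioned above: a new table hands out surrogate integers, and its unique constraint (rather than client code) arbitrates races. All table and column names are hypothetical, and the DDL is SQLite-flavoured:

                DDL = """
                CREATE TABLE DatasetIdentity (
                    identity_id        INTEGER PRIMARY KEY AUTOINCREMENT,
                    dataset_type_name  TEXT NOT NULL,
                    data_id_string     TEXT NOT NULL,
                    UNIQUE (dataset_type_name, data_id_string)
                );
                """
                # DatasetCollection would gain an identity_id column with
                # UNIQUE (collection, identity_id); a losing writer's INSERT then
                # fails atomically in the database instead of racing a
                # SELECT-then-INSERT in client code.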

             

            Andy Salnikov made changes - Story Points: 10 → 8
            Andy Salnikov made changes - Link: This issue is triggering DM-16077
            Andy Salnikov added a comment -

            OK, I think this is ready for review. We'll need a better implementation of constraint checking in the registry, but that would need a small schema redesign. The "activator" mostly works now (the -j option is broken though; I'll work on that in another ticket), and the example tasks were tested with the ci_hsc butler repo and work fine.

            Andy Salnikov made changes - Reviewers: Jim Bosch
            Andy Salnikov made changes - Status: In Progress → In Review
            Jim Bosch made changes - Status: In Review → Reviewed
            Andy Salnikov added a comment -

            Thanks for the review! Fixed my mistake and merged, done.

            Andy Salnikov made changes - Resolution: Done
            Andy Salnikov made changes - Status: Reviewed → Done
            Jim Bosch made changes - Link: This issue is triggering DM-16159
            Jim Bosch made changes - Link: This issue relates to DM-16221

              People

              Assignee:
              Andy Salnikov
              Reporter:
              Andy Salnikov
              Reviewers:
              Jim Bosch
              Watchers:
              Andy Salnikov, Jim Bosch, Kian-Tat Lim, Vaikunth Thukral

                Dates

                Created:
                Updated:
                Resolved:

                  CI Builds

                  No builds found.