Data Management / DM-15686

Re-implement task execution in laptop activator

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None

      Description

      Changes in the Gen3 butler require a few updates to the execution framework. I think the piece responsible for running the task should still work (though it has not been tested recently), but the post-pre-flight part needs to do some additional work:

      • make sure that the output collection exists in the registry
      • copy/associate (if needed) all input datasets into the output collection; this probably needs to be done recursively
      • save provenance information (Quanta) in the registry
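
      The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: ToyRegistry and all of its methods are hypothetical in-memory stand-ins, not the real butler Registry API, and "recursively" is assumed here to mean following each input back to the datasets it was produced from.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRegistry:
    """Hypothetical in-memory stand-in; NOT the real butler Registry API."""

    collections: dict = field(default_factory=dict)  # collection name -> set of dataset ids
    parents: dict = field(default_factory=dict)      # dataset id -> ids it was produced from
    quanta: list = field(default_factory=list)       # saved provenance records

    def ensure_collection(self, name):
        # Step 1: create the output collection if it does not exist yet.
        return self.collections.setdefault(name, set())

    def associate_recursive(self, name, dataset_ids):
        # Step 2: associate the inputs, then walk back through their parents.
        members = self.ensure_collection(name)
        stack = list(dataset_ids)
        while stack:
            dataset_id = stack.pop()
            if dataset_id in members:
                continue  # already associated, nothing to do
            members.add(dataset_id)
            stack.extend(self.parents.get(dataset_id, ()))

    def save_quantum(self, task, inputs, outputs):
        # Step 3: record which task consumed and produced which datasets.
        self.quanta.append(
            {"task": task, "inputs": sorted(inputs), "outputs": sorted(outputs)}
        )


# Made-up dataset names, for illustration only.
registry = ToyRegistry(parents={"calexp": ["raw", "bias"], "bias": ["raw_bias"]})
registry.associate_recursive("run1", ["calexp"])
registry.save_quantum("ExampleTask", ["calexp"], ["src"])
```

      After this runs, the "run1" collection holds "calexp" plus everything it was derived from, and one Quantum-like record has been saved.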

      I probably need more realistic examples of PipelineTask to properly test it.

            Activity

            salnikov Andy Salnikov added a comment -

            Thanks for the review! Fixed my mistake and merged; done.

            salnikov Andy Salnikov added a comment -

            OK, I think this is ready for review. We'll need a better implementation of constraint checking in the registry, but that would need a small schema redesign. The "activator" mostly works now (the -j option is broken, though; I'll work on that on another ticket). The example tasks were tested with the ci_hsc butler repo and work fine.

            salnikov Andy Salnikov added a comment -

            Having a deterministic integer DataId would certainly help with implementing constraints on DatasetCollection, though I'm not sure how that integer could be implemented in practice for arbitrary DataIds. OTOH, constraints on DatasetCollection could be implemented using a different (e.g. database-specific) mechanism which is not required to be deterministic, only unique. This would likely require extending the registry schema with new table(s) in addition to new column(s) in DatasetCollection (which also raises the question of how we manage schema versioning).

            I don't particularly like the idea of disabling integrity checks (even optionally). For PipelineTask I think I will implement something that works for the single-user case, i.e. it would not be any more broken than it is today.

             

            jbosch Jim Bosch added a comment -

            Re canonical IDs and collection integrity: this is something Pim Schellart and I thought about and always planned on doing eventually, but we haven't yet faced up to trying to design it. This might be related to the problem of generating deterministic integer IDs for certain combinations of DataUnits (the successor to Gen2 pseudo-datasets like "ccdExposureId"). If we can extend that functionality to all combinations of DataUnits (which would probably require using something larger than an int64 in at least some cases), it would take a big step towards solving this problem.
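
            One way to picture a deterministic integer ID is bit-packing the DataUnit values in a fixed key order. The sketch below is purely illustrative: pack_data_id, the key names, and the bit widths are all made up, and it only handles integer values. Python integers are arbitrary-precision, so the sketch also shows how wider combinations outgrow 64 bits.

```python
def pack_data_id(data_id, bit_widths):
    """Pack integer DataId values into one integer, in sorted key order.

    Hypothetical helper for illustration; not part of any butler API.
    """
    packed = 0
    for key in sorted(bit_widths):
        value = data_id[key]
        width = bit_widths[key]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{key}={value} does not fit in {width} bits")
        # Shift earlier fields up and append this one in the low bits.
        packed = (packed << width) | value
    return packed


# Made-up widths; real limits would have to come from the DataUnit definitions.
widths = {"visit": 32, "ccd": 8}
packed = pack_data_id({"visit": 903334, "ccd": 16}, widths)

# Two 40-bit fields already need 80 bits, i.e. more than an int64.
wide = pack_data_id({"a": (1 << 40) - 1, "b": (1 << 40) - 1}, {"a": 40, "b": 40})
```

            The same inputs always give the same integer, which is the deterministic property discussed above; the open design question is how to choose the widths and handle non-integer DataUnit values.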

            If the PipelineTask activator needs a solution to this for concurrency sooner, we could add a boolean option to various Registry methods that callers can use to disable collection integrity checks (and by doing so promise that they'll maintain collection integrity themselves).

            salnikov Andy Salnikov added a comment -

            Kian-Tat Lim, I was trying to imagine a way to make it work in the current schema. I agree that extending the schema with a "canonical id" would work, but there are non-trivial issues with that:

            • we need to upgrade the registry schema (though this is rather trivial at this point in time)
            • building a canonical ID in a completely backward- and forward-compatible way is problematic, and it does not fit very well into the schema that we have now
            • I think it will either be rather inefficient space-wise or will need to rely on some database locking mechanism (which I want to avoid as potentially non-portable).

            One implementation of a canonical ID that I can imagine is just a string representation of a DatasetRef (the DatasetType and DataId part of it, e.g. Patch(patch=42,skymap=MySkyMap,tract=100)). This should be done very carefully to avoid ambiguities and to keep it compatible w.r.t. potential schema changes (which is hard when you cannot predict the future).
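
            A minimal sketch of such a string form, assuming the DataId is a plain mapping and the keys are emitted in sorted order (canonical_id is a hypothetical helper, not an existing butler function):

```python
def canonical_id(dataset_type, data_id):
    """Build a canonical string from a DatasetType name and a DataId mapping.

    Hypothetical helper. Keys are sorted so that the same DataId always
    yields the same string regardless of insertion order.
    """
    parts = ",".join(f"{key}={data_id[key]}" for key in sorted(data_id))
    return f"{dataset_type}({parts})"


cid = canonical_id("Patch", {"tract": 100, "skymap": "MySkyMap", "patch": 42})
print(cid)  # Patch(patch=42,skymap=MySkyMap,tract=100)
```

            Note that this sketch does no escaping, so a value containing "," or "=" could collide with a different DataId; that is exactly the kind of ambiguity that would need careful handling.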

            Still, I agree with one thing: we need a table-level constraint check for this, otherwise things will get very ugly. I think implementing that kind of thing is beyond the scope of this ticket. What I want to do here is make a trivial check that works in a single-user environment, basically more or less the same thing that we have today in addDataset(), but done in a more efficient way.
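
            For illustration, a table-level check of this kind could look like a UNIQUE constraint over the collection name and a canonical ID column. The sketch below uses an in-memory SQLite database from the Python standard library; the table layout, column names, and the associate() helper are assumptions made for the example, not the actual registry schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: at most one dataset per (collection, canonical_id) pair.
conn.execute(
    "CREATE TABLE dataset_collection ("
    " collection TEXT NOT NULL,"
    " canonical_id TEXT NOT NULL,"
    " dataset_id INTEGER NOT NULL,"
    " UNIQUE (collection, canonical_id))"
)


def associate(conn, collection, canonical_id, dataset_id):
    """Try to add a dataset to a collection; return False on a conflict."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO dataset_collection VALUES (?, ?, ?)",
                (collection, canonical_id, dataset_id),
            )
        return True
    except sqlite3.IntegrityError:
        # The database itself rejected the duplicate.
        return False


key = "Patch(patch=42,skymap=MySkyMap,tract=100)"
first = associate(conn, "run1", key, 1)   # accepted
second = associate(conn, "run1", key, 2)  # rejected by the UNIQUE constraint
```

            Because the database enforces the constraint, no separate lookup is needed before the insert, which is what makes it safer than an application-level check once there is more than one writer.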

             


              People

              Assignee:
              salnikov Andy Salnikov
              Reporter:
              salnikov Andy Salnikov
              Reviewers:
              Jim Bosch
              Watchers:
              Andy Salnikov, Jim Bosch, Kian-Tat Lim, Vaikunth Thukral
