Request For Comments / RFC-169

Implement HSC-like stack provenance


    Details

    • Type: RFC
    • Status: Implemented
    • Resolution: Done
    • Component/s: DM
    • Labels:
      None
    • Location:
      Here

      Description

      We need a means of recording the version of the stack used in processing data, and of verifying that a consistent version is being used. HSC has code that does this in CmdLineTask, in much the same way that we currently handle configurations, and this provenance recording is an important part of our production runs.

      Following a discussion on CLO, it seems the best way forward is to port the HSC code and adapt it to pull the versions from the code itself rather than from EUPS. The intention is that this code will eventually move to the orchestration or execution packages, but this plan provides a near-term solution and the foundation for the longer-term solution. Barring objections, I propose to implement this plan in DM-3372.

      For details on the HSC implementation, see the discussion on CLO and the HSC code in pipe_base and daf_persistence (the code in daf_persistence should probably live next to the implementation in pipe_base, but let's not worry about such details here).

        Attachments

          Issue Links

            Activity

            Steve Pietrowicz added a comment -

            Sorry I didn't participate in the discussion on CLO... been buried in meetings.

            I read through the CLO posting and I'm still not clear on what exactly is going to be implemented. Is this entirely in the CmdLineTask code? How do you envision this would be used by orchestration, if at all?

            Is this recorded before an execution occurs?

            I'm concerned about introspecting python modules for their versions because 1) unless versioning is automated people will forget to bump the version number, 2) this doesn't take into account changes in underlying C++ code, 3) not all underlying libraries we depend on even have a Python component (apr), 4) this does not take into account information in the overall environment where the software is running (os, machine, etc).

            In the old-style ctrl_provenance, we recorded every package that was set up in the environment, along with all tagged version numbers. This was all captured as part of orchestration. We found this particularly valuable.
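
            [For illustration only: the environment capture that ctrl_provenance relied on amounts to listing every product eups has set up. A minimal sketch, assuming that "eups list --setup" prints one product per line with the version in the second column; the exact flags and output format are assumptions, not the actual ctrl_provenance code.]

            import subprocess

            def get_setup_products():
                """Return a {product: version} map of everything eups has set up.

                Assumes 'eups list --setup' prints one product per line with the
                version in the second column; the real output format may differ.
                """
                output = subprocess.check_output(["eups", "list", "--setup"], text=True)
                products = {}
                for line in output.splitlines():
                    fields = line.split()
                    if len(fields) >= 2:
                        products[fields[0]] = fields[1]
                return products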

            In at least one case I can remember, we encountered a problem where one package was interacting with another, completely unrelated package in an unexpected way. Someone had an object in a package it shouldn't have been in, and it was a duplicate of another object. If we hadn't captured all the environment information, it would have been a lot harder to track down.

            For orchestrated production runs, there is additional support software that isn't directly related to the CmdLineTask code but will still have provenance recorded.

            I'd like to handle this in a uniform way across all software in the stack, and not have one solution for provenance for CmdLineTask and another for orchestration.

            Paul Price added a comment -

            Steve Pietrowicz wrote:

            I read through the CLO posting and I'm still not clear on what exactly is going to be implemented. Is this entirely in the CmdLineTask code?

            I propose putting together a module that will be run by the CmdLineTask code. This module will determine the versions of the various components of the stack in several ways (a rough sketch follows the list):

            • For our python packages, using the version.py files we write as part of building (this gets set from git);
            • For external python packages, using __version__; and
            • For select C++ modules, using whatever mechanism they supply.
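
            [A minimal sketch of what the version-gathering part might look like; the function name, the generated <package>.version module, and the example package lists are illustrative assumptions rather than the actual implementation.]

            import importlib

            def get_stack_versions(lsst_packages, external_modules):
                """Collect a {name: version} map from the sources listed above.

                lsst_packages: LSST package names whose build writes a version.py
                    (exposed here as <package>.version; illustrative).
                external_modules: third-party module names exposing __version__.
                """
                versions = {}

                # Our packages: the generated version module (set from git at
                # build time) carries a __version__ string.
                for name in lsst_packages:
                    try:
                        mod = importlib.import_module(name + ".version")
                        versions[name] = mod.__version__
                    except ImportError:
                        versions[name] = "unknown"

                # External python packages: use the conventional __version__.
                for name in external_modules:
                    try:
                        mod = importlib.import_module(name)
                        versions[name] = getattr(mod, "__version__", "unknown")
                    except ImportError:
                        pass

                # Select C++ modules would be queried here via whatever mechanism
                # they supply (e.g. a version binding); omitted from this sketch.
                return versions

            # e.g. get_stack_versions(["lsst.afw"], ["numpy", "matplotlib"])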

            If the butler is aware of a persisted set of versions, we will retrieve that and compare, raising an exception if they differ (this behaviour may be disabled by the user); otherwise the set of versions will be persisted by the butler for future comparisons.
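
            [A sketch of that compare-or-persist step, assuming a butler-like object with datasetExists/get/put methods and a "packages" dataset name; both are placeholders, not the real butler API.]

            class VersionMismatchError(RuntimeError):
                """Current software versions differ from the persisted set."""

            def check_versions(butler, current, dataset="packages", permit_mismatch=False):
                """Compare current versions against any persisted set, or persist them.

                butler: placeholder object with datasetExists/get/put methods.
                current: dict mapping package name to version string.
                permit_mismatch: if True, report differences instead of raising.
                """
                if not butler.datasetExists(dataset):
                    # Nothing persisted yet: record the current versions for
                    # future comparisons.
                    butler.put(current, dataset)
                    return {}
                persisted = butler.get(dataset)
                diffs = {name: (persisted.get(name), current.get(name))
                         for name in set(persisted) | set(current)
                         if persisted.get(name) != current.get(name)}
                if diffs and not permit_mismatch:
                    raise VersionMismatchError(
                        "Version mismatch: " + ", ".join(
                            "%s: %s vs %s" % (name, old, new)
                            for name, (old, new) in diffs.items()))
                return diffs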

            How do you envision this would be used by orchestration, if at all?

            I hope that the proposed module might be useful when the time comes to pull the implementation into orchestration, either by providing code that can be used directly or by serving as an example of what did and didn't work.

            Is this recorded before an execution occurs?

            It would run after we start python and import a bunch of stuff, but before we call CmdLineTask.run.

            (The astute reader will notice that this is therefore subject to a race condition, just as when we validate the configuration. There's not much we can do about that, but the workaround is the same as for the configuration validation — first run a CmdLineTask with no input data, e.g., here.)

            I'm concerned about introspecting python modules for their versions because 1) unless versioning is automated people will forget to bump the version number, 2) this doesn't take into account changes in underlying C++ code, 3) not all underlying libraries we depend on even have a Python component (apr), 4) this does not take into account information in the overall environment where the software is running (os, machine, etc).

            I admit that we cannot capture the complete state of the entire system. Nevertheless, we can capture a lot more of the state of the system than we currently do, and I think we can capture enough information to be useful much of the time, at least enough to keep a user from polluting a production run with results from different versions of the stack. Upon implementation of this proposal we would have a module that captures at least some of the state of the system, and this could be expanded as it is deemed useful and effort becomes available.

            For orchestrated production runs, there is additional support software that isn't directly related to the CmdLineTask code but will still have provenance recorded.
            I'd like to handle this in a uniform way across all software in the stack, and not have one solution for provenance for CmdLineTask and another for orchestration.

            I would love to re-use existing code. Could you please point me to what you're using in orchestration?

            Kian-Tat Lim added a comment -

            The problem is that the existing code relies on using eups. I don't think this is useful in a CmdLineTask context, only in a wrapping workflow context, because we don't want to have dependencies on eups at that level. It also becomes less useful the less we use eups to manage dependencies (such as "system dependencies" or "*conda dependencies").

            Unfortunately, there seem to be no simple solutions to this problem. It looks like we either accept multiple provenance extraction and recording systems or we complicate our lives in other ways.

            I'm also a little concerned about adding yet another "write-once-compare-same" output that implies a globally-shared output and can potentially impact startup performance, but I think my worry is probably premature optimization, and there doesn't appear to be a good way around it, either.

            So while I would still prefer to delay a bit until we can see better if and how production workflow, development/testing runs, and single task execution can all work in a more integrated way, I do not object to moving forward with this code as long as its limitations are kept clearly in mind.

            Steve Pietrowicz added a comment -

            I would love to re-use existing code. Could you please point me to what you're using in orchestration?

            I wish it would be of help, but it relied completely on ctrl_provenance, which was made obsolete when we switched away from policy files (so it's been a while). When that happened, the code in orchestration that used ctrl_provenance was pulled, with the intent that it would go back in whenever the new provenance code was ready. Jacek was working on something a few months ago, but that probably had to be set aside because of recent changes in responsibilities.

            Paul Price added a comment -

            No objections, and explicit approval from the one person I was worried about — approved for implementation.


              People

              Assignee:
              Paul Price
              Reporter:
              Paul Price
              Watchers:
              John Parejko, John Swinbank, Kian-Tat Lim, Paul Price, Steve Pietrowicz
              Votes:
              0

                Dates

                Created:
                Updated:
                Resolved:
                Planned End:

                  Jenkins

                  No builds found.