RFC-68

Backport HSC parallelization framework

    Details

    • Type: RFC
    • Status: Implemented
    • Resolution: Done
    • Component/s: DM
    • Labels:
      None
    • Location:
      this issue page

      Description

      As part of porting pipeline code from the HSC fork of the stack, we'll need more parallelization features than are currently available in the LSST codebase. Our requirements and the HSC solutions are described here:

      https://confluence.lsstcorp.org/display/DM/S15+Short-Term+Parallelization+Middleware+Requirements

      I propose that we simply bring the HSC framework over with very little modification - essentially nomenclature and code-cleanup changes only, into a new LSST package. This new package would depend on mpi4py, which would be a new third-party package included in the stack. I propose we call the new package ctrl_pool, but I'd be very happy to hear other suggestions.

      New parallel driver scripts that rely on ctrl_pool would not go in pipe_tasks; we'd add a new package for these as well (pipe_drivers?). For the most part, these would delegate their work to conventional CmdLineTasks in pipe_tasks that could be run manually on smaller scales, but this may not be entirely possible for all pipelines.

      We do not anticipate this being the long-term solution for our parallel execution framework, but we believe the concepts are general enough and the interface abstract enough that it should be fairly easy to modify pipeline code to adapt to a new framework in the future.
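
      To illustrate the kind of abstraction meant here, the following is a minimal sketch, not the HSC or ctrl_pool API. All names (SerialPool, ThreadBackedPool, process_exposure, run_driver) are hypothetical, and standard-library threads stand in for MPI workers so the example runs without mpi4py. The point is only that a driver written against a small map-style interface does not need to change when the backend does.

```python
from concurrent.futures import ThreadPoolExecutor


class SerialPool:
    """Hypothetical backend: runs every item in the calling process."""

    def map(self, func, items):
        return [func(item) for item in items]


class ThreadBackedPool:
    """Hypothetical backend: threads stand in for MPI worker processes."""

    def __init__(self, num_workers=4):
        self.num_workers = num_workers

    def map(self, func, items):
        with ThreadPoolExecutor(max_workers=self.num_workers) as executor:
            # Executor.map yields results in the same order as the inputs.
            return list(executor.map(func, items))


def process_exposure(exposure_id):
    """Placeholder for the per-unit work a conventional task would do."""
    return exposure_id * 2


def run_driver(pool, exposure_ids):
    """The driver depends only on pool.map, so the backend can be swapped."""
    return pool.map(process_exposure, exposure_ids)


if __name__ == "__main__":
    ids = [1, 2, 3]
    print(run_driver(SerialPool(), ids))
    print(run_driver(ThreadBackedPool(num_workers=2), ids))
```

      Under these assumptions, moving to a different execution framework would mean supplying another object with a compatible map method, leaving run_driver and the per-unit function untouched.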

            Activity

            jbosch Jim Bosch added a comment -

            So far, this RFC is all crickets, and I'd like to get started on the implementation. But it's a pretty broad change, and while we've discussed it vaguely a lot, we haven't discussed the details at all. Kian-Tat Lim, should I go ahead and accept this, and assume we'll discuss changes after the HSC prototype is on master, or try to find some RFD time for it before Bremerton?

            ktl Kian-Tat Lim added a comment -

            Since this looks like it's a parallel universe (pun somewhat intended), I had no particular objection to the MPI, as long as we don't start building a lot of new pipelines using the functionality without thinking more carefully about it.

            As to the consistent batch system interface, this is one thing that ctrl_execute was supposed to provide. Have you looked at that at all?

            jbosch Jim Bosch added a comment -

            I'll have a look at ctrl_execute before I do the merge, but I don't anticipate making any changes to the HSC-side stuff to use stuff in ctrl_execute unless it turns out to be really easy.

            ktl Kian-Tat Lim added a comment -

            Unless the HSC-side stuff is all integrated together, my suggestion would be to look at using ctrl_execute in place of the HSC batch submission, rather than retargeting existing code.

            jbosch Jim Bosch added a comment -

            I believe it's pretty integrated, but I'll take a look and report back here before I do any work in either direction.

            jbosch Jim Bosch added a comment - - edited

            I've added a section on comparison with ctrl_execute to the Confluence page linked above. Summary is:

            • I'd have to do some significant refactoring of ctrl_execute to make use of anything there or the existing configuration in ctrl_platform_*, and that's something I'm not at all comfortable doing. It's just too focused on condor glide-in.
            • I think it makes sense to add completely distinct configuration to ctrl_platform_* to support the new batch submission stuff from the HSC side, and use that to set defaults for what are otherwise a lot of user-supplied platform-specific command-line arguments. Unfortunately, all of the ctrl_platform_* packages depend on ctrl_execute, which brings in a lot of other packages, and if we could avoid those dependencies when using the HSC-side tools, that'd be ideal. In fact, I suspect we'd probably prefer to continue to supply the configuration ourselves on the command line rather than bring in those dependencies when we use this for HSC processing needs.
            • I'm worried that we'll have no way of running at NCSA using the HSC tools if we follow this approach (since the HSC tools don't have any support for condor). But I'm also not sure if condor is what we'd want to target if we wanted to run on the OpenStack machines (rather than the systems I think we're planning to decommission soon).
            jbosch Jim Bosch added a comment -

            I'm accepting this with the resolution that we'll be porting over the HSC framework with only superficial changes for now, and the understanding that the long-term solution may work very differently from this (e.g. we won't have "super tasks" at all).

            For now, because we'd need to use Condor instead of PBS or Slurm to make use of these features on LSST machines at NCSA, we'll be limited to either working within a single node or running on other systems (e.g. those at Princeton). This port will still bring more functionality than we currently have for single-node parallelization, however, and it's not worth the effort to try to remove the code for the other batch systems (especially as it's probably best to just add Condor in the medium-term, unless a better long-term solution is coming soon).

            tjenness Tim Jenness added a comment -

            Jim Bosch, can you please mark this RFC as implemented if there is no further work required?


              People

              • Assignee:
                jbosch Jim Bosch
                Reporter:
                jbosch Jim Bosch
                Watchers:
                Jim Bosch, John Swinbank, Kian-Tat Lim, Paul Price, Tim Jenness