Details
-
Type:
RFC
-
Status: Withdrawn
-
Resolution: Done
-
Component/s: DM
-
Labels:None
Description
The goal of this RFC is to add a new package to `lsst_distrib` which will be a library used in the creation of ci_* packages. The proposed name for this package is `ci_builder`.
This is meant to solve the issues that arise from using `scons` to orchestrate running ci jobs, as well as some other pain points developers have run into.
Currently the plan is to use this system to implement a new `ci_imsim` package and overhaul `ci_hsc_gen3`, but it is expected to be generally useful for many other ci tools, and potentially other tools in the future.
Attachments
Issue Links
Activity
"Adding a New Package to the Build" requires escalation.
Sorry. I was thinking about DMCCB flagging when I said "no RFC required".
It is of course fine to have an RFC to discuss the new package and where it fits in.
Can you explain the motivation a bit better? I thought using scons to run CI jobs was discouraged, and the architecture of ci_hsc_gen3 was considered a mistake.
ci_hsc_gen3 barely uses scons at all (unlike ci_hsc_gen2) but should be simplified much more to use even less SCons. ci_hsc_gen2 uses scons as the workflow execution framework.
I thought we were heading to where pipelines_check is. That uses scons but just to run the shell script that does the actual pipeline execution and set up. This has the added advantage that it's obvious what is happening.
My understanding is that this does not use SCons at all; it is a replacement for our (awkward) usage of SCons to "build" integration tests, when "build" actually means "set up a repo and run them, but in a way that gives build-system like warm-starts when something goes wrong". Trying to do that with any kind of third-party software build system is awkward because the state doesn't map simply to file content (and different steps make different changes to the SQLite file, which is a real problem for failed/interrupted steps).
I can't speak to the latest version of the code itself, or whether it duplicates anything in e.g. pipelines_check, as the last time I looked at it was early in prototyping, but I can say that the goals of the project (which I gather have been met) very much address real pain points in how ci_hsc_gen2 and ci_hsc_gen3 are "built" today.
ci_hsc_gen2 is not a good example here because gen2 is relying critically on scons for managing execution of everything. In gen3 there is set up code and then there is pipetask itself doing the heavy lifting, so they aren't comparable.
I'd like to see the code that is being proposed and I'd like to understand why this is more complex than what pipelines_check does in a shell script.
Sorry I have not been able to get back to this until now, I will try to outline much more about this plan now, as well as push out the rfc for a while.
This orchestration tool is written in python and consists of two basic components, a CommandRunner, and BaseCommand.
Each package (or whatever sub-unit intended to do a workflow) has its own instance of a CommandRunner (or in fact could have more than one for different behaviors). This instance then has one or more commands (more on that below) registered to it. During the registration process each command is associated with a label which will be used to checkpoint the state of the filesystem once the command is run, and a float that will be used to determine the order the commands will be run.
When the CommandRunner executes it can be told to just run (in which the command target is the last command) or it can be told to run a specific command. In either case, the runner will examine the state of the filesystem to determine which commands have already been run, and generate a list of what commands must be run prior to the requested target command.
As each command is successfully run, the state of the filesystem is checkpointed with the label associated to the command in the registration process.
When a command executes the default behavior is to reset the file system to the state it was in for the last successfully dependent command. This means there is no need to clear and redo any previous stages. As a side note, there is a mode where you can run in a "dirty" state, if your command can make use of that, but that is more advanced than this description.
The runner accepts arguments either from the command line, or as a NameSpace. The arguments can be used to do things like reset to a specific command, investigate what has been run, etc.
Commands themselves are python classes that derive from the BaseCommand class. They must implement a run method, which is where they do whatever kind of work is required to complete their task. For instance this may be pure python code, or calls to subprocess. Commands also have the ability to add arguments to the command line parser to get anything specific from the user they may need.
The `ci_builder` package also provides several specific template commands for common things that a ci_package might do, such as constructing a butler, which specific packages can customize for their needs.
Below is an example of what ci_hsc_gen3 looks like using this system. This has been tested and runs (in fact this is how I run my copy as to not need to re-do steps during testing). The ci_builder code can be found here
#!/usr/bin/env python
|
|
import os
|
import subprocess
|
|
from lsst.ci.builder import CommandRunner, BaseCommand, BuildState, CommandError
|
from lsst.ci.builder.commands import (CreateButler, RegisterInstrument, WriteCuratedCalibrations,
|
RegisterSkyMap, IngestRaws, DefineVisits, ButlerImport,
|
TestRunner)
|
|
TESTDATA_DIR = os.environ['TESTDATA_CI_HSC_DIR']
|
INSTRUMENT_NAME = "HSC"
|
INPUTCOL = "HSC/defaults"
|
COLLECTION = "HSC/runs/ci_hsc"
|
QGRPAH_FILE = "qgraph_file.qgraph"
|
|
|
ciHscRunner = CommandRunner(os.environ["CI_HSC_GEN3_DIR"])
|
|
|
ciHscRunner.register("butler", 0)(CreateButler)
|
|
|
@ciHscRunner.register("instrument", 1)
|
class HscRegisterInstrument(RegisterInstrument):
|
instrumentName = "lsst.obs.subaru.HyperSuprimeCam"
|
|
|
@ciHscRunner.register("calibrations", 2)
|
class HscWriteCuratedCalibrations(WriteCuratedCalibrations):
|
instrumentName = INSTRUMENT_NAME
|
|
|
ciHscRunner.register("skymap", 3)(RegisterSkyMap)
|
|
|
@ciHscRunner.register("ingest", 4)
|
class HscIngestRaws(IngestRaws):
|
rawLocation = os.path.join(TESTDATA_DIR, "raw")
|
|
|
@ciHscRunner.register("defineVisits", 5)
|
class HscDefineVisits(DefineVisits):
|
instrumentName = INSTRUMENT_NAME
|
collectionsName = "HSC/raw/all"
|
|
|
@ciHscRunner.register("importExternalBase", 6)
|
class HscBaseButlerImport(ButlerImport):
|
dataLocation = TESTDATA_DIR
|
|
@property
|
def importFileLocation(self) -> str:
|
return os.path.join(self.runner.pkgRoot, "resources", "external.yaml")
|
|
|
@ciHscRunner.register("importExternalJointcal", 7)
|
class HscJointcalButlerImport(ButlerImport):
|
dataLocation = TESTDATA_DIR
|
|
@property
|
def importFileLocation(self) -> str:
|
return os.path.join(self.runner.pkgRoot, "resources", "external_jointcal.yaml")
|
|
|
@ciHscRunner.register("qgraph", 8)
|
class QgraphCommand(BaseCommand):
|
def run(self, currentState: BuildState):
|
args = ("qgraph",
|
"-d", '''skymap='discrete/ci_hsc' AND tract=0 AND patch=69''',
|
"-b", self.runner.RunDir,
|
"--input", INPUTCOL,
|
"--output", COLLECTION,
|
"-p", "$CI_HSC_GEN3_DIR/pipelines/DRP.yaml",
|
"--save-qgraph", os.path.join(self.runner.RunDir, QGRPAH_FILE))
|
pipetask = self.runner.getExecutableCmd("CTRL_MPEXEC_DIR", "pipetask", args)
|
result = subprocess.run(pipetask)
|
if result.returncode != 0:
|
raise CommandError("Issue Running QuntumGraph")
|
|
|
@ciHscRunner.register("processing", 9)
|
class ProcessingCommand(BaseCommand):
|
def run(self, currentState: BuildState):
|
args = ("run",
|
"-j", str(self.arguments.num_cores),
|
"-b", self.runner.RunDir,
|
"--input", INPUTCOL,
|
"--output", COLLECTION,
|
"--register-dataset-types",
|
"--qgraph", os.path.join(self.runner.RunDir, QGRPAH_FILE))
|
pipetask = self.runner.getExecutableCmd("CTRL_MPEXEC_DIR", "pipetask", args)
|
result = subprocess.run(pipetask)
|
if result.returncode != 0:
|
raise CommandError("Issue Running Pipeline")
|
|
|
ciHscRunner.register("tests", 10)(TestRunner)
|
|
ciHscRunner.run()
|
|
Some of this is reminiscent of Parsl (e.g. https://parsl.readthedocs.io/en/stable/userguide/checkpoints.html#app-caching) or other workflow systems. What are the key factors that require us to build our own?
Basically it is tailored exactly to our needs, so there is "less to it". Everything we are doing is also technically possible with scons too (though certainly harder than say Parsl). Using any other system would likely still involve writing wrappers or customization classes to stream line use, which would end up at the total scale of code proposed.
I do see your point, and it is always a double edged sword between leveraging upstream work and adapting it, or doing everything in house, but you do it all. In this case by having something tailored, it is just a small bit of code focused on our needs, with a low learning / maintenance overhead.
However this also does not lock us in. As I mentioned if we did decide we needed the functionality of Parsl, or another system, we would likely write some helper code anyway, and would be able to leverage that under the hood later, without loosing the work any developers put in on top of this code.
This system has already been useful in both ci_hsc_gen3 as well as a new ci_imsim package, so it seems to fit our needs well.
Nate Lust I'd still like you to explain why we need the complexity at all rather than the shell script approach of pipelines_check. Most of the complexity is in the "make the graph and execute the graph" isn't it? Which is already external. Does your system automatically support rerunning a partially executed graph from that point? Why can't that be a part of pipetask directly?
1) Re-usability, I find it easier to share functionality and remix it using something like python
2) Edit-ability, It is my opinion, and is shared by some others that I talked with, that most people are much more comfortable making changes (especially in something more than a basic script) and maintaining something that is in python vs a shell script
3) Abstraction around maintaining filesystem checkpoints resetting and such, is better in python than shell
This system supports arbitrary checkpoints between commands. This is useful when working on say the pipelines bit of something, and don't want the "setting up the buttler" stuff yelling at you that things already exist for instance. We of course could handle graph failure/rerun things within pipetasks, but we can also split it up into different commands if you want basically instant results without graph queries to run and determine existence etc.
Nate Lust what do you want to do with this RFC? I think if you tweak the phrasing a bit you can adopt it since ci_imsim is already released and using it and it does not affect lsst_ci or lsst_distrib.
While I still think this would be a useful addition to the stack in many ways, There is no current need to have it included, and so I don't want to spend the energy advocating for it at this time. I will introduce another RFC if a new need arises
ci packages aren't in lsst_distrib so I don't think this needs an RFC at all.