Paul Price Thank you for your kind words and suggestions, and sorry for the long-delayed response.
How much of your time did running the production take (e.g., X hours a day for Y days)? I seem to remember it taking me something like two weeks of 12 hour days (though of course there's a lot of time for twiddling thumbs in there followed by panic, and I think that includes some large-scale mistakes like having to re-run all the coadds) when I ran a similar volume.
The first job was May 8 and the last job was May 22 (not counting forcedPhotCcd), so it was ~2 weeks for the processing too. Then some more days for forcedPhotCcd.
(To the management: there was preparation time before processing starts too, so please keep that in mind when scheduling.)
Did the production require investments of time from NCSA people other than yourself?
Yes. Greg Daues has been helping tremendously with any hardware- or admin-related issues. Sometimes he solved them himself, sometimes he communicated with other NCSA staff to get them resolved, and sometimes we kept non-urgent issues in mind to investigate later.
What support did you need from pipelines people?
Most of the communication with the SciPi team happened before the production actually started: defining what to run, processing RC, deciding what software version to use, what bug fixes to include on top of that, and so on. The preparation was also a chance to get more familiar with the software version, especially where there were big changes, such as to the butler repo or new reference catalogs, that may require adjustments in operations.
During the run I've been asking pipeline questions that may impact the run. Special thanks to Paul Price and Jim Bosch for tirelessly helping me.
Were you forced to use different stack versions (e.g., due to some important bug being fixed halfway through production)?
This time everything was run with the same stack version and config; I did not reprocess anything with new software. The goal this time was not so much to get the best science ever. Depending on the goals this may change, and a new version with important bug fixes may be necessary.
Although not in this ticket, we might want to reprocess forcedPhotCcd when DM-10755 is fixed.
What tools did you have to put together to facilitate the production run? (I'd like to get them into the pipeline in some form or other so they don't have to be reinvented.)
What were the things that you had to do manually? (These are areas we can concentrate effort to make production easier.)
Things I needed to do manually include a lot of bookkeeping and data ID handling, checking whether jobs completed successfully or failed, taking action on the failed jobs, butler repo manipulation such as cleaning up partial results from bad executions or combining repos, and cleaning up repos/logs for the good and bad executions. I have a few quick-and-dirty scripts, but they are merely to get the job done for now.
When something failed, figuring out what really went wrong could take a while. For this ticket, if I could not understand the failure right away and the cluster had idle nodes, I tended to just retry, maybe with a longer time limit. This time I had plenty of computing and storage resources, so I sometimes knowingly took the easy route.
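To illustrate the kind of bookkeeping described above, here is a minimal sketch of sorting the expected data IDs into succeeded, failed, and never-run. It is not one of the scripts actually used; `JobRecord`, `summarize_jobs`, the exit-code convention, and the (visit, ccd) values are all hypothetical and only for illustration.

```python
from collections import namedtuple

# Hypothetical record of one finished job; the field names are illustrative.
JobRecord = namedtuple("JobRecord", ["data_id", "exit_code"])


def summarize_jobs(expected_ids, records):
    """Partition expected data IDs into succeeded, failed, and never-run.

    Assumes an exit code of 0 means success. When a data ID appears in
    several records, the last one wins, so a successful retry overrides
    an earlier failure.
    """
    latest = {}
    for rec in records:
        latest[rec.data_id] = rec.exit_code

    succeeded = [d for d in expected_ids if latest.get(d) == 0]
    failed = [d for d in expected_ids if d in latest and latest[d] != 0]
    missing = [d for d in expected_ids if d not in latest]
    return succeeded, failed, missing


if __name__ == "__main__":
    # Made-up (visit, ccd) data IDs, not taken from the real run.
    expected = [(1228, 0), (1228, 1), (1230, 0)]
    records = [JobRecord((1228, 0), 0), JobRecord((1228, 1), 1)]
    ok, bad, never_run = summarize_jobs(expected, records)
    print("succeeded:", ok)
    print("failed:", bad)
    print("never run:", never_run)
```

The failed and never-run lists are what would feed the next round of submissions or cleanup.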
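The "retry, maybe with a longer time limit" approach could be written down as a simple schedule of time limits. This is only a sketch of one possible policy: the doubling factor, the number of attempts, and the cap are assumptions, not what was actually used in this run, and `retry_time_limits` is a hypothetical helper.

```python
def retry_time_limits(initial_minutes, max_attempts=3, factor=2, cap_minutes=2880):
    """Return the time limit (in minutes) to request on each attempt.

    Each retry multiplies the previous limit by `factor`, never exceeding
    `cap_minutes` (standing in for an assumed queue maximum).
    """
    limits = []
    limit = initial_minutes
    for _ in range(max_attempts):
        limits.append(min(limit, cap_minutes))
        limit *= factor
    return limits


if __name__ == "__main__":
    # A job first submitted with a 60-minute limit, retried up to twice.
    print(retry_time_limits(60))
```

Blind retries like this only make sense when nodes are otherwise idle; with tighter resources it would be worth diagnosing the failure first.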
What documentation should we write/improve?
Specifically for running production, it would be helpful to have official, versioned documentation, with information somewhat like my cheat sheet here, that keeps track of the data products expected by the pipelines, either for each weekly or as they evolve with the master stack.
I would also appreciate clearer documentation of which configs operations is allowed to change (before the code/framework supports it programmatically).
In general, I'm very much looking forward to the end-user science pipeline docs that will appear at pipelines.lsst.io or wherever the new site will be. I have heard many great ideas that may happen there.
What was your expectation of the task of shepherding the production run beforehand, and how was the actual experience relative to that?
There were fewer pipeline issues than I expected! I expected more issues, probably because my experience of processing DECam data with the LSST stack taught me to expect them. ci_hsc is awesome. Also, most problems were resolved during the RC processing phase (thanks to the SciPi team!). This time I'm happy enough to see jobs succeed and am not trying to optimize anything (yet). Mixtures of panic and mistakes were expected and experienced.
Anything you'd do differently next time?
Do not freak out and submit jobs late at night. Manual mistakes could be expensive.
Probably should use --doraise everywhere. Would like to have some basic QC checks. Naming of the jobs and the log files could be more consistent. More thought is needed on the submission strategies. Ensure pre-runs are done.
If you could change three things in the pipeline, what would they be?
1. Allow changes of operational config, maybe by splitting operational and algorithmic config, or changing how CmdLineTask checks the config, or something else.
2. Improve integration tests, include meas_mosaic in lsst_distrib and CI, and give it a weekly tag.
3. Things about butler-CmdLineTask-repositories need changes, although I'm not sure what is best, and STWG is working on that.
How much compute power did the production run require? The number I have from running the production at Princeton is around 2000 core-weeks.
I understand your report to NCSA must go through Channels, but I hope I can see it when that's done.
I hope to have the answers in DM-10649. My current plan is to add any report to confluence, so anyone will be able to see it.
I'll incorporate some of the above into the confluence page.