# S17B  full HSC reprocessing with the SSP PDR1 data


#### Details

• Type: Story
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s: None
• Labels:
None
• Story Points:
25
• Team:
Data Facility

#### Description

Based on the steps/configs agreed in DM-9800, process the full set of the HSC PDR1 data. Monitor the execution progress and resolve non-stack issues (e.g. issues from bad execution).

#### Activity

Hsin-Fang Chiang added a comment -

Starting single frame processing with the w_2017_17 stack plus the master branches of meas_mosaic/obs_subaru/ctrl_pool as of today, built with w_2017_17; effectively this is w_2017_17 + DM-10315 + DM-10449 + DM-10430.

Hsin-Fang Chiang added a comment -

Greg Daues, could you please create /datasets/hsc/repo/rerun/DM-10404 and give me permission to write there? I'll transfer the processed data there. The first batch (from single frame processing) will be about 52 TB.

Greg Daues added a comment -

Hsin-Fang Chiang The directory /datasets/hsc/repo/rerun/DM-10404 has been created with ownership hchiang2:lsst_users and the immutable flag set to 'no'. I think it is ready for data transfer.

Hsin-Fang Chiang added a comment -

The data have been processed through the pipelines and the output repos are available at:

• /datasets/hsc/repo/rerun/DM-10404/UDEEP/
• /datasets/hsc/repo/rerun/DM-10404/DEEP/
• /datasets/hsc/repo/rerun/DM-10404/WIDE/

Runs that failed due to non-science-pipelines issues have been resolved, and these data are from the successful runs. Reproducible failures are reported on the Confluence page https://confluence.lsstcorp.org/display/DM/S17B+HSC+PDR1+reprocessing along with more information about the reprocessing.

I'm checking through the document from DM-9886 and I think everything is done for S17B, except that I did not file tickets for the processCcd failures, as they are already largely known. I did not include the two failed patches at the multiband stage; however, DM-10574 is done now, so I could reprocess them with a new meas_deblender if desired.

Internally, the NCSA team expects me and other team members to report more on this reprocessing effort in future tickets (for example, we plan to compile some Slurm statistics). I may also ask more pipeline questions for my own learning. Is there anything the DRP team wants to see in this cycle before closing this ticket?

Hsin-Fang Chiang added a comment -

Upon request, I've started forcedPhotCcd processing. It was not included in our RC processing before, and does not appear to have been included in the HSC PDR1 release at the NAOJ site either.

John Swinbank added a comment - - edited

Hsin-Fang Chiang — awesome job! Many thanks for this.

I am currently travelling & will be on vacation next week, and may be travelling again the week after that, so I can't promise to review this in a meaningful way any time soon. Here's what I suggest:

• Tim Morton, could you take a look at the output repos Hsin-Fang Chiang specified above and make sure that everything you expect is there. If so, mark this as "reviewed" and let Hsin-Fang close it out so that she can claim the work done.
• Paul Price, since I know you are concerned to make sure the process was well documented, perhaps you could take a quick look at the Confluence page noted above and make some notes on how you think it could be improved. If they're easy to implement, Hsin-Fang Chiang can probably do them right away; if they're harder, they can feed into the next point...
• Early in June let's convene a meeting of interested parties (I expect that includes at least Hsin-Fang Chiang, one of Tim Morton and Lauren MacArthur, one of Jim Bosch and Robert Lupton, perhaps Paul Price, and me) to take stock of how this process went, refine the process for next time, and discuss our plans for F17.

Given that, I'm going to change the reviewers on this ticket to Tim Morton & Paul Price. Please push back if you aren't happy!

John Swinbank added a comment -

And to be clear — once Tim & Paul are content on the first two points above, we should mark this as "reviewed" and close it out.

Paul Price added a comment -

This is terrific! You've done a wonderful job running and documenting this production run — congratulations, and thank you for your heroic efforts!

Thank you for adding those details to the Confluence page. I'd also be interested in knowing:

• How much of your time did running the production take (e.g., X hours a day for Y days)? I seem to remember it taking me something like two weeks of 12 hour days (though of course there's a lot of time for twiddling thumbs in there followed by panic, and I think that includes some large-scale mistakes like having to re-run all the coadds) when I ran a similar volume.
• Did the production require investments of time from NCSA people other than yourself?
• What support did you need from pipelines people?
• How much compute power did the production run require? The number I have from running the production at Princeton is around 2000 core-weeks.
• What tools did you have to put together to facilitate the production run? (I'd like to get them into the pipeline in some form or other so they don't have to be reinvented.)
• What were the things that you had to do manually? (These are areas we can concentrate effort to make production easier.)
• What documentation should we write/improve?
• Were you forced to use different stack versions (e.g., due to some important bug being fixed halfway through production)?
• What was your expectation of the task of shepherding the production run beforehand, and how was the actual experience relative to that?
• If you could change three things in the pipeline, what would they be?
• Anything you'd do differently next time?
• I understand your report to NCSA must go through Channels, but I hope I can see it when that's done.

My goal for all of this is to improve the experience for running the production, and hopefully make your job easier for the next time. We've got lots of compute power, but (wo)manpower is precious.

Again, well done, and thanks!

Tim Morton added a comment -

There seems to be a discrepancy between the tracts actually present in /datasets/hsc/repo/rerun/DM-10404/WIDE and those listed in the table on the Confluence page.

By my accounting, the following tracts have processing done (at least in HSC-Z, which is where I'm doing the counting), but do not seem to be documented in the confluence table:

 [8278, 8286, 8519, 8527, 8761, 8769, 9003, 9004, 9005, 9006, 9007, 9008, 9009, 9010, 9011, 9070, 9071, 9072, 9073, 9074, 9075, 9076, 9077, 9078, 9079, 9103, 9104, 9105, 9106, 9107, 9126, 9127, 9128, 9129, 9130, 9131, 9132, 9133, 9134, 9206, 9207, 9208, 9209, 9210, 9211, 9212, 9213, 9214, 9215, 9216, 9245, 9246, 9247, 9248, 9249, 9250, 9251, 9252, 9312, 9313, 9319, 9320, 9321, 9322, 9345, 9350, 9369, 9376, 9377, 9448, 9449, 9457, 9458, 9459, 9555, 9556, 9563, 9564, 9565, 9588, 9593, 9612, 9619, 9620, 9691, 9692, 9700, 9701, 9702, 9703, 9798, 9799, 9806, 9807, 9808, 9831, 9832, 9833, 9834, 9835, 9855, 9856, 9857, 9858, 9859, 9860, 9861, 9862, 9933, 9934, 9942, 9943, 9944, 10040, 10041, 10042, 10043, 10044, 10045, 10046, 10047, 10048, 10049, 10176, 10177, 10178, 10179, 10180, 10181, 10182, 10183, 10184, 10185, 10186, 15826, 15827, 15828, 15829, 16004, 16005, 16006, 16007, 16012, 16179, 16180, 16181, 16182, 16183, 16184, 16185, 16821, 16822, 16972, 16973] 

Tim Morton added a comment -

(I should mention, the DEEP- and UDEEP-layer tracts on disk and in the table do appear to match up.)

Tim Morton added a comment -

Also, for reference, the script I'm doing the comparison with is /home/tmorton/tickets/DM-10404/wide-tracts-compare.py.
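
For reference, a minimal sketch of what such a tract comparison might look like (the directory layout, filter name, helper names, and example lists here are illustrative assumptions, not the contents of the actual wide-tracts-compare.py script):

```python
# Hypothetical sketch of comparing tracts on disk against a documented list.
# Assumes deepCoadd outputs live under <rerun>/deepCoadd/<filter>/<tract>/,
# which may differ from the real repo layout.
import os


def tracts_on_disk(rerun_dir, filt="HSC-Z"):
    """Collect tract IDs present under a rerun's deepCoadd tree."""
    coadd_dir = os.path.join(rerun_dir, "deepCoadd", filt)
    return {int(name) for name in os.listdir(coadd_dir) if name.isdigit()}


def undocumented_tracts(on_disk, documented):
    """Return tracts present on disk but absent from the documented table."""
    return sorted(set(on_disk) - set(documented))


# Illustrative data in place of the real repo and Confluence table:
print(undocumented_tracts({8278, 9812, 9813}, {9812, 9813}))  # -> [8278]
```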

Hsin-Fang Chiang added a comment -

Tim Morton Yes, it's known that more tracts were processed and stored than necessary. On the Confluence page I copied the list from the official HSC PDR1 site, and I should only have needed to process those tracts. In Slack, I've confirmed that it's okay not to remove the additional tracts from the data repo. (Jim Bosch said "Yup, I don't think we care if there are some extra tracts.")

16821, 16822, 16972, 16973 are SSP_AEGIS and they do exist in the confluence table.

Currently on the Confluence page I noted: "While unnecessary, some edge tracts outside of the PDR1 coverage were attempted in the processing. Those data outputs are kept in the repos as well." I guess that alone is not clear enough; I'll rephrase and clarify further. Thanks for pointing that out.

Tim Morton added a comment -

Got it, sounds good to me. Thanks! (and yes, I see I missed the SSP_AEGIS in my comparison). Maybe move that comment/note to just below the table? And I'm not sure how else specifically I can contribute to reviewing; John requests that I "make sure that everything I expect is there," and aside from running QA plots on all the tracts (which probably should be its own ticket, not part of a review process), I'm not sure how else to check it out. I'm happy to mark it reviewed, unless someone has a suggestion of a more detailed check I should do.

Hsin-Fang Chiang added a comment -

Paul Price Thank you for your kind words and suggestions, and sorry for the long delayed response.

How much of your time did running the production take (e.g., X hours a day for Y days)? I seem to remember it taking me something like two weeks of 12 hour days (though of course there's a lot of time for twiddling thumbs in there followed by panic, and I think that includes some large-scale mistakes like having to re-run all the coadds) when I ran a similar volume.

The first job was May 8 and the last job was May 22 (not counting forcedPhotCcd), so the processing took about two weeks here as well, plus a few more days for forcedPhotCcd.

(To the management: there was preparation time before processing started too, so please keep that in mind when scheduling.)

Did the production require investments of time from NCSA people other than yourself?

Yes. Greg Daues has been helping tremendously with any hardware- or admin-related issues. Sometimes he solved them himself, sometimes he communicated with other NCSA staff to get them done, and sometimes we kept non-urgent issues in mind to investigate later.

What support did you need from pipelines people?

Most of the communication with the SciPi team happened before production actually started: defining what to run, processing the RC dataset, deciding which software version to use, which bug fixes to include on top of that, and so on. The preparation was also a chance to get more familiar with the software version, especially when there were big changes, such as to the butler repo or new reference catalogs, that might require adjustments in operations.

During the run I've been asking pipeline questions that may impact the run. Special thanks to Paul Price and Jim Bosch for tirelessly helping me.

Were you forced to use different stack versions (e.g., due to some important bug being fixed halfway through production)?

This time everything was run with the same stack version and config; I did not reprocess anything with newer software, as the goal this time was not so much to get the best science ever. Depending on the goals, this may change, and a new version with important bug fixes may be necessary.

Although not in this ticket, we might want to reprocess forcedPhotCcd when DM-10755 is fixed.

What tools did you have to put together to facilitate the production run? (I'd like to get them into the pipeline in some form or other so they don't have to be reinvented.)

What were the things that you had to do manually? (These are areas we can concentrate effort to make production easier.)

Things I needed to do manually included a lot of bookkeeping and data ID handling: checking whether jobs completed successfully or failed, acting on the failed jobs, butler repo manipulation such as cleaning up partial results from bad executions or combining repos, and cleaning up the repos/logs from both good and bad executions. I have a few quick-and-dirty scripts, but they merely get the job done for now.

When something failed, figuring out what really went wrong could take a while. For this ticket, if I could not understand the failure right away and the cluster had idle nodes, I tended to just retry, perhaps with a longer time limit. This time I had plenty of computing and storage resources, so I sometimes knowingly took the easy route.
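
As an illustration, the kind of quick-and-dirty completion check described above might look something like this (the log layout and the success-marker string are assumptions, not the actual scripts used for this run):

```python
# Hedged sketch: scan a directory of job logs and flag those that never
# reached a success marker, i.e. candidates to investigate or retry.
# The "*.log" layout and marker text are hypothetical.
import glob
import os


def failed_jobs(log_dir, marker="completed successfully"):
    """Return log files that lack the success marker."""
    bad = []
    for path in glob.glob(os.path.join(log_dir, "*.log")):
        with open(path) as f:
            if marker not in f.read():
                bad.append(path)
    return sorted(bad)
```

A check like this only catches jobs that died outright; jobs that "succeeded" but produced bad outputs still need the repo-level cleanup described above.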

What documentation should we write/improve?

Specifically for running production, it might be helpful to have official, versioned documentation with information somewhat like my cheat sheet here, to keep track of the data products expected by the pipelines for each weekly, or as they evolve with the master stack.
I would also appreciate clearer documentation of which configs operations are allowed to change (before the code/framework supports it programmatically).

In general, I'm very much looking forward to the end-user science pipelines documentation that will appear at pipelines.lsst.io or wherever the new site will be. I've heard many great ideas that may happen there.

What was your expectation of the task of shepherding the production run beforehand, and how was the actual experience relative to that?

There were fewer pipeline issues than I expected! I expected more, probably because my experience of processing DECam data with the LSST stack taught me to expect them. ci_hsc is awesome. Also, most problems were resolved during the RC processing phase (thanks to the SciPi team!). This time I was happy enough to see jobs succeed and was not trying to optimize anything (yet). Mixtures of panic and mistakes were expected, and experienced.

Anything you'd do differently next time?

Do not freak out and submit jobs late at night; manual mistakes can be expensive.

I should probably use --doraise everywhere. I would like to have some basic QC checks. Naming of the jobs and the log files could be more consistent. The submission strategies need more thought. Ensure pre-runs are done.

If you could change three things in the pipeline, what would they be?

1. Allow changes of operational configs, maybe by splitting operational and algorithmic configs, by changing how CmdLineTask checks the config, or by some other means.
2. Improve integration tests; include meas_mosaic in lsst_distrib and CI, and give it a weekly tag.
3. The butler-CmdLineTask-repository interactions need changes, although I'm not sure what's best; STWG is working on that.

How much compute power did the production run require? The number I have from running the production at Princeton is around 2000 core-weeks.

I understand your report to NCSA must go through Channels, but I hope I can see it when that's done.

I hope to have the answers in DM-10649. My current plan is to add any report to Confluence, so anyone will be able to see it.

I'll incorporate some of the above into the confluence page.

Hsin-Fang Chiang added a comment -

Tim Morton thanks for the suggestion. I added a note below the table and more details in the processing section about the extra tracts.
I'd be interested to hear how the QA goes, and I agree that's out of scope for this ticket too. If something does not look right (given the stack version, etc.), I will be happy to fix it later.

Hsin-Fang Chiang added a comment -

The forced_src data have been archived to the same output repos. When running forcedPhotCcd, fatal errors were seen in many cases where the patch's reference does not exist; as a result, not every CCD has a forced_src file for every tract it overlaps. DM-10755 has been filed for this.

I also added more details to the confluence page and reorganized a bit. I plan to announce the output repos in a CLO post.

John Swinbank Tim Morton Paul Price: is there anything I should do before closing this ticket?

Paul Price added a comment -

I am quite satisfied. Thanks!

Hsin-Fang Chiang added a comment -

Announced on CLO.


#### People

Assignee:
Hsin-Fang Chiang
Reporter:
Hsin-Fang Chiang
Reviewers:
Paul Price, Tim Morton
Watchers:
Greg Daues, Hsin-Fang Chiang, John Swinbank, Paul Price, Tim Morton