# Create Gen3 repo from testdata_jointcal

#### Details

• Type: Story
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s:
• Labels:
• Story Points: 6
• Team: Data Release Production
• Urgent?: No

#### Description

When planning PipelineTask conversion of FGCM, we recognized a Gen3-converted testdata_jointcal repo as a useful prerequisite (presumably the same is also true for a PipelineTask conversion of jointcal).

I'm not familiar enough with the structure of that package to have a plan in mind for how to organize things (or whether to try to convert the HSC data separately, before trying to convert the DECam data; I assume we'll defer CFHT regardless). Given that there is data from multiple instruments here, it could also be an interesting test of making multi-instrument Gen3 repos, but we shouldn't let that block getting something usable for PipelineTask testing. I'm hoping John Parejko (who now knows both about this data and gen2to3) might have some thoughts on those points.

Tim Jenness, if we can get this scheduled to be done by someone more familiar with the conversion code this month, that will open up Eli Rykoff to work on the actual FGCM conversion next month.  I don't think it should be a lot of work, but again, I'm less knowledgeable about this package than others.

#### Activity

Tim Jenness added a comment -

In theory converting those two instrument gen2 repos into one gen3 repo with softlinks should be a couple of commands and a bit of a wait. I have already created gen3 repos with multiple instruments using the conversion command. In practice it doesn't seem to work because the HSC repo doesn't have any raws in it but the conversion script tries to get raws and fails. DECam seems to fail for the same reason but with a different error message.

The CFHT conversion would require us to sort out gen3 support in obs_cfht. That should be fairly straightforward, but I've been told previously that it's not a priority for people.

Jim Bosch added a comment -

Ugh, I had no idea we had repos in the wild with no raws (or links to parent repos with raws). Probably best to just manually make a YAML file here with the visit and exposure dimension rows normally populated by raw ingest, and import that before the conversion as a one-off, though the conversion code may also need other minor changes.
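
For concreteness, such a hand-written file might look roughly like this (purely illustrative; the field names are my guess at the shape of an exported description file, and the record values are invented):

```yaml
# Hypothetical sketch of a dimension-record export file; the exact
# schema and field names are assumptions, and the values are made up.
description: Butler Data Repository Export
data:
  - type: dimension
    element: exposure
    records:
      - instrument: HSC
        id: 903334
        obs_id: "HSCA90333400"
        physical_filter: HSC-R
        # ...plus the remaining exposure metadata normally set by raw ingest
```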

I'm totally fine with continuing to defer CFHT.

Tim Jenness added a comment -

I think this means:

1. Find the relevant raw files.
2. Ingest them into a temporary gen3 repo.
3. Run butler.export.
4. Create a gen3 repo in testdata_jointcal and register the instrument (do not writeCuratedCalibrations).
5. Run butler.import of the visit/exposure tables (but ignore the data files).
6. Run the conversion command as expected.

Repeat for second instrument.

Jim Bosch added a comment -

Yes, that would work, and would probably be easier than writing the YAML for the visit/exposure tables, which would be the alternative.

Eli Rykoff added a comment -

So for HSC I've figured out how to do this simply in-place, with a small script that links in the relevant raw files. There's the additional complication that the default gen3 skymap covers the full sky and leads to a very large sqlite3 database. This can be pruned by hand, quite simply:

    delete from patch_htm7_overlap where tract != 9697;
    delete from patch where tract != 9697;
    delete from tract_htm7_overlap where tract != 9697;
    vacuum;

which shrinks the db from 900 MB to 5 MB; that's perfectly fine.
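
The pruning pattern can be exercised on a toy database to see the effect (a self-contained demo; the table name mirrors one of the registry tables above, but this schema is invented):

```shell
# Toy demonstration of the pruning above: build a fake "patch" table,
# delete every row outside tract 9697, and compact the file.
# The real registry tables have more columns; this schema is made up.
sqlite3 demo.sqlite3 <<'SQL'
CREATE TABLE patch (skymap TEXT, tract INTEGER, patch INTEGER);
INSERT INTO patch VALUES ('rings', 9697, 0), ('rings', 1234, 0), ('rings', 5678, 1);
DELETE FROM patch WHERE tract != 9697;
VACUUM;
SELECT COUNT(*) FROM patch;  -- prints 1
SQL
```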

I spent a few minutes trying to do the DECam conversion and failing miserably because there are no raws, only instcal files, and beyond that I'm totally out of my depth.

For the time being, the most important conversion for testdata_jointcal to get PipelineTask conversion underway for fgcmcal and jointcal is, I believe, the HSC conversion. I'm happy to take this ticket provided that it covers HSC only (or file a sub-ticket that is the HSC conversion?)

I also want to confirm that running the gen3 repo doesn't require that the raws are physically there, and that I can delete the raw links from my script.

Tim Jenness added a comment -

Is the gen3 repo itself being committed to the git repo? We aren't promising repository stability yet so be prepared to remake it every week for a while. Or are you using butler export/import?

Eli Rykoff added a comment -

Ah, interesting point about stability! So this would just entail merging in the scripts that will quickly regenerate the repo, I guess. Does that make sense?

John Parejko added a comment -

Yeah, if gen3 repo stability isn't settled yet, I'd suggest writing the scripts and committing those, but not committing the data so that there isn't churn on git-lfs. That way, we can start working on the PipelineTask conversion while we wait for repo stability.

Tim Jenness added a comment -

What you can do is do a butler export, store the yaml representation of the repo and then do a butler import of the yaml – that should be faster than running conversion again. I'm not entirely sure what happens regarding the raws in this scenario since I don't think we've ever tried to do that where the raws are missing.

Jim Bosch added a comment -

No raws after conversion is done is fine - we just need the exposure dimension rows that raw ingest also provides, and those will round-trip through export/import without the raws.

Eli Rykoff added a comment -

So I can link the raws, do the conversion, slim the db, export the butler yaml, delete the raws, and save the yaml to the repo? And then I can simply make a new current-version repo by importing the yaml? Would I have to slim the db again?

I am going to take this ticket and edit it so that it only covers HSC at the moment, and will involve scripts and a yaml but not a full gen3 repo (which will happen in the future).

Jim Bosch added a comment -

So I can link the raws, do the conversion, slim the db, export the butler yaml, delete the raws, and save the yaml to the repo? And then I can simply make a new current-version repo by importing the yaml? Would I have to slim the db again?

Yup, that should all work; no need to re-slim the DB. Note that there are two yaml files in play here - one is the butler.yaml configuration file, and the other is the exported description of the repository contents.

One quick minor point: it'd be good to modify the name of the skymap to reflect the fact that it's a slimmed down version of the big one. I think that's probably easiest done via a global find-replace on the export yaml file, changing a string that's probably something like "hsc_rings_v1" to that + some suffix.
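
As a concrete (if trivial) sketch of that find-replace, assuming the exported file is called export.yaml and the skymap name really is "hsc_rings_v1" (both guesses):

```shell
# Stand-in export file; the real one would be produced by butler export.
printf 'skymap: hsc_rings_v1\ntract: 9697\n' > export.yaml
# Global rename of the skymap string, writing a new file so the
# original is untouched (portable across GNU and BSD sed).
sed 's/hsc_rings_v1/hsc_rings_v1_slim/g' export.yaml > export_slim.yaml
head -n 1 export_slim.yaml  # prints: skymap: hsc_rings_v1_slim
```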

Eli Rykoff added a comment -

Upon further review, the import/export code does not currently support exporting the skymap, so we have to jettison that idea. I'll just make the scripts to generate a gen3 repo and slim it down, even if that takes a few minutes. See https://lsstc.slack.com/archives/C3UCAEW3D/p1592415109261700

Eli Rykoff added a comment -

PR is here: https://github.com/lsst/testdata_jointcal/pull/28

This adds stub raw files so that the new scripts/convert_gen2_to_gen3_hsc.sh can be run anywhere. The gen3 repo isn't being checked in because it will have to be remade semi-regularly (?).

At the moment, for development, I think this is adequate for fgcmcal, but for Jenkins testing we would need to have this run (not sure how).

Tim Jenness added a comment -

Looks okay. Minor comments only. Presumably you can run this as part of a setUpClass in a gen3 test in fgcmcal? Or else add it as a scons target there so that it gets converted before the tests run.

Eli Rykoff added a comment -

Thanks for the quick turnaround! I'm trying to think of ways to cache this usefully (in the future).  Is it possible to have a scons target in testdata_jointcal (separate ticket!) so that it can be "installed".  Would it then sit there on Jenkins until testdata_jointcal was updated (or a node was wiped, or something like that)?

Tim Jenness added a comment -

It is possible to build and install testdata_jointcal and run the conversion dynamically. We have to balance that with the annoyance of having to reinstall the entire dataset every single time a dependency changes (obs_base and daf_butler). An alternative to that is to have a package that depends on testdata_jointcal and does the installation itself – that could soft link to testdata_jointcal so wouldn't take a huge amount of space when a dependency changes.

Eli Rykoff added a comment -

Oh, good point, I wasn't thinking about the fact that daf_butler changes would trigger the rebuild (which kind of is the point, though, isn't it?).  Anyway, at the moment this will unblock a ton of work, and I'm not sure how we'll be using it in a month, so let's punt this.  Thanks!

Tim Jenness added a comment -

Yes, daf_butler triggering the build is the point, but if the testdata package is multi-GB you really don't want to be installing it each time daf_butler changes. You want to redo the conversion and install a soft-link tree – hence the suggestion for a second package that does that.


#### People

Assignee:
Eli Rykoff
Reporter:
Jim Bosch
Reviewers:
Tim Jenness
Watchers:
Eli Rykoff, Jim Bosch, John Parejko, Tim Jenness

#### Dates

Created:
Updated:
Resolved:

#### CI Builds

No builds found.