Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-21915

Support Gen 3 ingestion of ap_verify datasets

    XMLWordPrintable

    Details

    • Story Points:
      6
    • Sprint:
      AP S20-5 (April), AP S20-6 (May), AP F20-1 (June), AP F20-2 (July)
    • Team:
      Alert Production
    • Urgent?:
      No

      Description

      lsst.ap.verify.DatasetIngestTask is responsible for ingesting an ap_verify dataset into a Gen 2 repository. Create a second task that starts with the same inputs, but creates a Gen 3 repository. Provide a (command-line?) option in ingest_dataset.py that lets users choose between the two.

      Note that the current dataset format includes a stub Gen 2 repo (possibly redundant in Gen 3) and a Gen 2 templates repo (must be updated or replaced). Changes to this format are within scope of the ticket.

      Some overlap with DM-21862, which wants to use HiTS to run single-frame processing.

        Attachments

          Issue Links

            Activity

            Hide
            krzys Krzysztof Findeisen added a comment -

            I've split off Gen 3 processing in ap_verify.py into a separate ticket (DM-21919), for better separation of concerns.

            Show
            krzys Krzysztof Findeisen added a comment - I've split off Gen 3 processing in ap_verify.py into a separate ticket ( DM-21919 ), for better separation of concerns.
            Hide
            krzys Krzysztof Findeisen added a comment -

            I'm removing this ticket from active sprints, as it looks like we'll need to resolve and implement DM-22482 before ap_verify can properly convert dataset repos into valid Gen 3 collections. Even what the final collection(s) should look like is unclear at this point.

            Show
            krzys Krzysztof Findeisen added a comment - I'm removing this ticket from active sprints, as it looks like we'll need to resolve and implement DM-22482 before ap_verify can properly convert dataset repos into valid Gen 3 collections. Even what the final collection(s) should look like is unclear at this point.
            Hide
            krzys Krzysztof Findeisen added a comment - - edited

            Proposed design:

            • Keep old Git LFS repositories (i.e., ap_verify_hits2015 can be used by either Gen 2 or Gen 3 code). The existing calib/, templates/, and refcats/ directories are designated Gen 2-only
            • Provide a read-only Gen 3 repository in a new directory, preloaded/, that contains the collections calib/<instrument>, refcats, skymaps, and templates/<type> (plus <timestamp>?)
            • Export the Gen 3 repository to config/export.yaml
            • To keep the downloaded version clean, ap_verify and ingest_datasets first import export.yaml into a fresh repository, then (as in Gen 2) ingest the raws/ directory into the new repository.

            Unresolved questions:

            • It's almost possible to share a dataset's files between Gen 2 and Gen 3 to keep the download size small. Every conversion step (see below) can use a link, but this leads to links to links to links, with the Gen 3 links leading outside the LFS repository. I've filed DM-24314 to fix that.
            • If the ap_verify concept of a workspace is going away in Gen 3, then we need another way to handle the dataset-specific config files for pipeline tasks, as well as the association database. Or we could keep the workspace, which would contain only a config directory, a repository, and the database file.

            Steps to convert a dataset (need to be repeatable if e.g. registry schema changes):

            • Ingest the Gen 2 repository using ingestDataset.py (do not ingest raws, if possible)
            • Convert the Gen 2 repository to Gen 3 using convert_gen2_repo_to_gen3.py. Put it in its final location in the downloadable dataset.
            • Convert the dataset's Gen 2 template repository into the same Gen 3 repository; configure the script to use the desired name for template collections
            • Delete raws from the Gen 3 repository, if necessary
            Show
            krzys Krzysztof Findeisen added a comment - - edited Proposed design: Keep old Git LFS repositories (i.e., ap_verify_hits2015 can be used by either Gen 2 or Gen 3 code). The existing calib/ , templates/ , and refcats/ directories are designated Gen 2-only Provide a read-only Gen 3 repository in a new directory, preloaded/ , that contains the collections calib/<instrument> , refcats , skymaps , and templates/<type> (plus <timestamp> ?) Export the Gen 3 repository to config/export.yaml To keep the downloaded version clean, ap_verify and ingest_datasets first import export.yaml into a fresh repository, then (as in Gen 2) ingest the raws/ directory into the new repository. Unresolved questions: It's almost possible to share a dataset's files between Gen 2 and Gen 3 to keep the download size small. Every conversion step (see below) can use a link, but this leads to links to links to links, with the Gen 3 links leading outside the LFS repository. I've filed DM-24314 to fix that. If the ap_verify concept of a workspace is going away in Gen 3, then we need another way to handle the dataset-specific config files for pipeline tasks, as well as the association database. Or we could keep the workspace, which would contain only a config directory, a repository, and the database file. Steps to convert a dataset (need to be repeatable if e.g. registry schema changes): Ingest the Gen 2 repository using ingestDataset.py (do not ingest raws, if possible) Convert the Gen 2 repository to Gen 3 using convert_gen2_repo_to_gen3.py . Put it in its final location in the downloadable dataset. Convert the dataset's Gen 2 template repository into the same Gen 3 repository; configure the script to use the desired name for template collections Delete raws from the Gen 3 repository, if necessary
            Hide
            tjenness Tim Jenness added a comment -

            In place conversion of a gen2 repo could be made to work very simply so no softlinking would be needed at all (the problem at the moment is that the new convert script forces symlink transfer mode and that breaks in place). Also note that symlinks are absolute paths so if you are on the same file system you might find that hardlinks are better if you aren't in-place.

            Show
            tjenness Tim Jenness added a comment - In place conversion of a gen2 repo could be made to work very simply so no softlinking would be needed at all (the problem at the moment is that the new convert script forces symlink transfer mode and that breaks in place). Also note that symlinks are absolute paths so if you are on the same file system you might find that hardlinks are better if you aren't in-place.
            Hide
            krzys Krzysztof Findeisen added a comment -

            Thanks for the advice about how to get relative paths. Can you point me to documentation on in-place conversion? I suspect it would introduce Gen 2 stuff we don't want to keep, but it's an option that hadn't occurred to me.

            Show
            krzys Krzysztof Findeisen added a comment - Thanks for the advice about how to get relative paths. Can you point me to documentation on in-place conversion? I suspect it would introduce Gen 2 stuff we don't want to keep, but it's an option that hadn't occurred to me.
            Hide
            mrawls Meredith Rawls added a comment -

            One thing I like about the current datasets is they are straightforwardly human-navigable, and one can quickly figure out that calibration products are in calib, reference catalogs are in refcats, etc. The proposed design complicates this. I also envision some future time when gen2 was the distant past, yet this dataset has an odd pre-ingestion step with nonintuitive file organization.

            Would it be possible to keep all the existing raw files where they are, rename "repo" to "repo_gen2," and have a parallel "repo_gen3" where both the things you're calling "non-nightly" and the rest of the gen3-ingested data can land?

            I think something closer to this for the ap_verify datasets would give folks who are trying to make the switch from gen2 to gen3 a nice example of how both flavors of repo can exist side-by-side and use the same raws, calibs, and other inputs. It would also make it easy to ditch gen2 entirely in the future.

            Show
            mrawls Meredith Rawls added a comment - One thing I like about the current datasets is they are straightforwardly human-navigable, and one can quickly figure out that calibration products are in calib, reference catalogs are in refcats, etc. The proposed design complicates this. I also envision some future time when gen2 was the distant past, yet this dataset has an odd pre-ingestion step with nonintuitive file organization. Would it be possible to keep all the existing raw files where they are, rename "repo" to "repo_gen2," and have a parallel "repo_gen3" where both the things you're calling "non-nightly" and the rest of the gen3-ingested data can land? I think something closer to this for the ap_verify datasets would give folks who are trying to make the switch from gen2 to gen3 a nice example of how both flavors of repo can exist side-by-side and use the same raws, calibs, and other inputs. It would also make it easy to ditch gen2 entirely in the future.
            Hide
            tjenness Tim Jenness added a comment -

            I think in theory we could make the symlinks try to be relative by comparing paths but again I'm not sure if we want that to be a different transfer mode (relsymlink?). hardlinks would be better and inplace would be best (because inplace won't change any of the file names at all. I think it's generally considered to be difficult to generically determine if two directories are on the same physical disk (we could always try a hardlink and if that fails fall back to softlink). The conversion script doesn't support this yet. It would be quick to have it support in place by telling it to use a different transfer mode if the gen2root and gen3root are the same. These are all minor changes.

            Show
            tjenness Tim Jenness added a comment - I think in theory we could make the symlinks try to be relative by comparing paths but again I'm not sure if we want that to be a different transfer mode (relsymlink?). hardlinks would be better and inplace would be best (because inplace won't change any of the file names at all. I think it's generally considered to be difficult to generically determine if two directories are on the same physical disk (we could always try a hardlink and if that fails fall back to softlink). The conversion script doesn't support this yet. It would be quick to have it support in place by telling it to use a different transfer mode if the gen2root and gen3root are the same. These are all minor changes.
            Hide
            mrawls Meredith Rawls added a comment -

            Following a quick chat with Krzysztof, it seems the plan for gen3 is to do away with the concept of a "stub repo" entirely, which will be good for user simplicity. I misinterpreted what the "export" steps meant; it's very akin to what presently happens with mapper shenanigans.

            It sounds like at the end of this, one will be able to either

            1. run ap_verify in gen2 as it works today, by cloning a dataset and copying the gen2 stub repo to somewhere on disk and using that as the root repo

            or

            2. run ap_verify in gen3 by typing something akin to ap_verify.py /path/to/empty/dir/that/exists --dataset DatasetName and, provided DatasetName has been properly cloned and setup, it will just work (and turn said path into a gen3 repo along the way)

            Sounds wonderful, carry on, and good luck sorting out the best way to do all the linking!

            Show
            mrawls Meredith Rawls added a comment - Following a quick chat with Krzysztof, it seems the plan for gen3 is to do away with the concept of a "stub repo" entirely, which will be good for user simplicity. I misinterpreted what the "export" steps meant; it's very akin to what presently happens with mapper shenanigans. It sounds like at the end of this, one will be able to either 1. run ap_verify in gen2 as it works today, by cloning a dataset and copying the gen2 stub repo to somewhere on disk and using that as the root repo or 2. run ap_verify in gen3 by typing something akin to ap_verify.py /path/to/empty/dir/that/exists --dataset DatasetName  and, provided DatasetName has been properly cloned and setup, it will just work (and turn said path into a gen3 repo along the way) Sounds wonderful, carry on, and good luck sorting out the best way to do all the linking!
            Hide
            krzys Krzysztof Findeisen added a comment - - edited

            Following some discussion started by Meredith Rawls on #dm-naming-things, the Gen 3 repository for non-raws will be called preloaded/. Better to name the data products by what they are, not what they aren't.

            Show
            krzys Krzysztof Findeisen added a comment - - edited Following some discussion started by Meredith Rawls on #dm-naming-things , the Gen 3 repository for non-raws will be called preloaded/ . Better to name the data products by what they are, not what they aren't.
            Hide
            krzys Krzysztof Findeisen added a comment - - edited

            This ticket has suffered a bit from rabbitholeing (though not actual scope creep, I think). To mitigate this, I'm splitting it up as follows:

            • The conversion script that implements the Gen 3 format described above (and an obs_base bug that blocked it) will now go on DM-24260, along with the updated datasets. I'll put it in review now, even though there's a chance there are bugs that will only be detected on ingest.
            • To add Gen 3 support to ap_verify code I need a Gen 3-compatible test dataset; this is DM-24844.
            • This ticket will consist only of code changes in ap_verify related to processing the Gen 3 components of ap_verify datasets, plus any bug fixes in the conversion script discovered while developing that code.
            Show
            krzys Krzysztof Findeisen added a comment - - edited This ticket has suffered a bit from rabbitholeing (though not actual scope creep, I think). To mitigate this, I'm splitting it up as follows: The conversion script that implements the Gen 3 format described above (and an obs_base bug that blocked it) will now go on DM-24260 , along with the updated datasets. I'll put it in review now, even though there's a chance there are bugs that will only be detected on ingest. To add Gen 3 support to ap_verify code I need a Gen 3-compatible test dataset; this is DM-24844 . This ticket will consist only of code changes in ap_verify related to processing the Gen 3 components of ap_verify datasets, plus any bug fixes in the conversion script discovered while developing that code.
            Hide
            krzys Krzysztof Findeisen added a comment -

            Progress report: the key code has been written and tested, but final integration into ingest_dataset.py is blocked on DM-25786.

            Show
            krzys Krzysztof Findeisen added a comment - Progress report: the key code has been written and tested, but final integration into ingest_dataset.py is blocked on DM-25786 .
            Hide
            krzys Krzysztof Findeisen added a comment -

            I've asked Jim Bosch to review this as an expert on `RawIngestTask`. Changes:

            • ap_verify has the main work
            • obs_base has a bug fix related to transfer configuration
            • ap_verify_* are updates to the Gen 3 repositories in each dataset. This is essentially a re-conversion from Gen 2, and I don't think it requires review.
            Show
            krzys Krzysztof Findeisen added a comment - I've asked Jim Bosch to review this as an expert on `RawIngestTask`. Changes: ap_verify has the main work obs_base has a bug fix related to transfer configuration ap_verify_* are updates to the Gen 3 repositories in each dataset. This is essentially a re-conversion from Gen 2, and I don't think it requires review.
            Hide
            jbosch Jim Bosch added a comment -

            Looks good.  As suggested, I did not really look at the test data repos, but I left a few comments for ap_verify and did not have any for obs_base.

            Show
            jbosch Jim Bosch added a comment - Looks good.  As suggested, I did not really look at the test data repos, but I left a few comments for ap_verify and did not have any for obs_base.

              People

              Assignee:
              krzys Krzysztof Findeisen
              Reporter:
              krzys Krzysztof Findeisen
              Reviewers:
              Jim Bosch
              Watchers:
              Jim Bosch, John Parejko, Krzysztof Findeisen, Meredith Rawls, Tim Jenness
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  Jenkins

                  No builds found.