Data Management / DM-14017

False positives when testing for ingestion completion


    Details

    • Type: Bug
    • Status: Won't Fix
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: ap_verify
    • Labels:
      None
    • Story Points:
      6
    • Sprint:
      AP F18-1, AP F18-3, AP F18-4
    • Team:
      Alert Production

      Description

      ap_verify tests whether particular types of data have been ingested, and only attempts ingestion if they have not. However, the checks currently address whether ingestion began, not whether it finished successfully. This may cause problems if ingestion fails partway through, or if the order in which datasets are ingested is changed during future refactoring.

      Either the ingestion checks need to be rewritten to directly test ingestion completion, or IngestTask needs better handling of duplicate files.
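To make the distinction concrete, here is a minimal sketch in plain Python (the names `ingestion_began`, `ingestion_completed`, and the string data IDs are illustrative, not the real ap_verify or Butler API) of the difference between testing that ingestion began and testing that it completed:

```python
# Hypothetical sketch: "began" vs. "completed" ingestion checks.
# The real checks in ap_verify query repositories through the Butler;
# sets of data-ID strings stand in for repository contents here.

def ingestion_began(ingested_ids):
    """The current style of check: any output at all counts as 'done'.

    A partial ingest (interrupted or failed midway) passes this test,
    so re-ingestion would be skipped and processing would fail later.
    """
    return len(ingested_ids) > 0


def ingestion_completed(ingested_ids, expected_ids):
    """The stricter check this ticket asks for: every expected
    data ID must actually be present in the repository."""
    return set(expected_ids) <= set(ingested_ids)


expected = {"visit=1", "visit=2", "visit=3"}
partial = {"visit=1"}  # an interrupted ingest

assert ingestion_began(partial)                    # false positive
assert not ingestion_completed(partial, expected)  # caught
assert ingestion_completed(expected, expected)
```

The catch, as the comment below discusses, is that computing `expected_ids` requires knowing every dataset type the downstream pipeline needs, which couples the check to pipeline implementation details.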

        Attachments

          Issue Links

            Activity

Krzysztof Findeisen added a comment -

            After spending some time thinking about this, I'm closing the issue as Won't Fix. The circumstances in which the bug would actually appear are fairly unlikely (the user needs to interrupt or fail to ingest a dataset, then try to start over without clearing out the repositories), and every solution I can think of would cause more substantial problems:

            • Replacing DatasetIngestTask's current skipping behavior with calls to the ingester with --ignore-ingested will remove most of the speed benefits, assuming that reading the file headers to generate a data ID dominates the work of mode=link ingestion.
            • Modifying IngestTask to test for previously ingested files based on the filename rather than the data ID (I think this is what I meant by "better handling" in the issue description) will be both complex and brittle, since the relationships between input and ingested files are not obvious for mode=copy and mode=move ingestion. Worse, it is possible for uningested files to have the same name and be distinguished by directory (this seems to be NOAO's system for calibration products), but the directories used in a repository are completely different.
            • Adding an inventory manifest to the dataset format will make the process of creating datasets even more complex than it already is, essentially making datasets a second repository format instead of just a set of directories with files. I asked on #ap-prototype-pipeline and Meredith Rawls agreed that this is an undesirable change.
            • Reporting the repository contents, at the end of ingestion, for manual inspection will require frequent human intervention; in ap_verify runs, the inspection would have to interrupt processing. I'm also not sure the average user could tell if something were missing.
            • Automatically testing, at the end of ingestion, that all necessary files are available will require a test Task that strongly depends on implementation details of ProcessCcdTask and ImageDifferenceTask, since it requires determining, for example, whether there are calibration and template files for each science image. Failure to keep the implementations in sync will break processing without making it obvious that the test is out of date. This problem is made even worse by the fact that different instruments and different pipeline configurations require different dataset types.
            • Placing marker files in the workspace to indicate completion of ingestion steps cannot be done without permanently(?) assuming that the workspace and its repositories are directories (making DM-11482 impossible). Creating Butler dataset types for ap_verify's different markers will pollute obs_base with implementation details (in that the markers are not a "real", user-facing, pipeline product), and cannot be used to mark ingestion of templates (which may be read-only) or reference catalogs (which require out-of-repo handling).
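For reference, the marker-file option in the last bullet might look like the sketch below (plain Python; the names `marker_path`, `mark_done`, and `is_done` are hypothetical, not ap_verify code). Note that every function takes a filesystem path, which is exactly the objection raised: the approach hard-codes the assumption that the workspace is a directory.

```python
import os
import tempfile

# Hypothetical sketch of the rejected marker-file option.  Each ingestion
# step drops an empty marker file in the workspace once it finishes, and
# later runs skip the step only if the marker exists.  The whole scheme
# presumes the workspace is a plain directory (the DM-11482 conflict).

def marker_path(workspace_dir, step):
    return os.path.join(workspace_dir, ".ingested_" + step)


def mark_done(workspace_dir, step):
    # An empty file suffices; only its existence matters.
    with open(marker_path(workspace_dir, step), "w"):
        pass


def is_done(workspace_dir, step):
    return os.path.exists(marker_path(workspace_dir, step))


with tempfile.TemporaryDirectory() as workspace:
    assert not is_done(workspace, "calibs")
    mark_done(workspace, "calibs")  # written only after ingestion finishes
    assert is_done(workspace, "calibs")
```

Even setting the directory assumption aside, this sketch cannot cover templates (which may live in read-only repositories) or reference catalogs (which are handled outside the repository), per the last bullet above.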

              People

              Assignee:
              Krzysztof Findeisen
              Reporter:
              Krzysztof Findeisen
              Watchers:
              Krzysztof Findeisen

                Dates

                Created:
                Updated:
                Resolved: