Data Management / DM-14761

Fix problem causing only a portion of hits2015 to process

    Details

    • Story Points:
      4
    • Epic Link:
    • Sprint:
      AP F18-1
    • Team:
      Alert Production

      Description

      In DM-14260, only about one third of the full ap_verify_hits2015 dataset processed successfully. This ticket is to determine why that happened and fix it so that we have a complete processing run through ap_pipe.

            Activity

            mrawls Meredith Rawls added a comment -

            The problem turned out to be related to validity ranges set during ingestion. The rest of the dataset processed fine in /project/mrawls/hits2015/rerun/take2 and the full dataset (take1 + take2) was analyzed in the new notebook I pushed to ap_pipe/notebooks. It looks like many issues remain with the templates and diffim quality, but I now have code (in the notebook) to quickly plot light curves and cutouts which will help interpret what is happening once we fix the templates.

            cmorrison Chris Morrison added a comment -

            Looks fine; however, I should point out that the way you have combined the databases of associated sources will not correctly associate the data. You need to run from end to end with the same database, or your future DIASources will never be associated with the previous DIAObjects. In the future, if a run for this test data stops partway through, you'll want to point the database task's config to the location of the previous database. In AssociationDBSqliteTask you would specify the variable `db_name` as the full path to the SQLite file where the data are stored.
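
The effect of resuming with the same database file versus starting a new one can be sketched with Python's standard `sqlite3` module. This is a toy illustration, not the ap_association code: the table `dia_objects`, its columns, the helper `associate`, the matching tolerance, and the file names are all hypothetical, and how `db_name` is actually wired into the config is not shown here.

```python
import os
import sqlite3
import tempfile


def associate(db_path, source_positions, tol=1e-3):
    """Toy association step (hypothetical, not the ap_association API).

    Each source is matched to an existing object by position; if no
    object is within `tol`, a new object is created. Returns the object
    IDs assigned to the sources, in order.
    """
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS dia_objects ("
            "id INTEGER PRIMARY KEY, ra REAL, dec REAL)"
        )
        assigned = []
        for ra, dec in source_positions:
            row = conn.execute(
                "SELECT id FROM dia_objects "
                "WHERE ABS(ra - ?) < ? AND ABS(dec - ?) < ?",
                (ra, tol, dec, tol),
            ).fetchone()
            if row is None:
                cur = conn.execute(
                    "INSERT INTO dia_objects (ra, dec) VALUES (?, ?)",
                    (ra, dec),
                )
                assigned.append(cur.lastrowid)
            else:
                assigned.append(row[0])
        conn.commit()
        return assigned
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    shared_db = os.path.join(tmp, "association.db")
    fresh_db = os.path.join(tmp, "association_take2.db")

    first_run = associate(shared_db, [(150.0, 2.0), (150.5, 2.5)])
    # Resuming against the same file: the repeat detection at (150.0, 2.0)
    # is matched to the object created in the first run.
    resumed_run = associate(shared_db, [(150.0, 2.0), (151.0, 3.0)])
    # Starting over with a new file: nothing from the first run is visible,
    # so every source becomes a new object with its own numbering.
    separate_run = associate(fresh_db, [(150.0, 2.0), (151.0, 3.0)])

print(first_run, resumed_run, separate_run)
```

In the resumed case the repeat detection keeps its original object ID, while in the separate case the IDs restart and no longer refer to the same objects as in the first run.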

            krzys Krzysztof Findeisen added a comment - edited

            Hmm... currently ApPipeTask will reuse a database if and only if it's reusing an output repository. I hadn't thought of reruns.

            Keeping the database in the output repository may actually be a serious design flaw, since in "realistic" (not everything at once) use that means we're treating the output repo as an input...

            mrawls Meredith Rawls added a comment -

            Thanks for pointing this out! It sounds like there are a couple of issues here:

            (1) My naive pandas dataframe concatenation didn't work as I intended because the DIAObjects have no way of being associated between two independent reruns. In other words, the DIAObject IDs differ, and database operations alone can't solve that. I could run source association on the two resulting databases if I cared about doing a proper analysis with take1 + take2 together. But that's outside the scope of this ticket.

            (2) ap_pipe may not be able to handle the case where a user wants to point to an existing association database and have a new run continue associating sources to the stuff already in there. Chris Morrison says ap_association can do this via a config setting that points to the existing database, but Krzysztof Findeisen says ApPipeTask only reuses databases if it reuses an output repository. This needs to be addressed and may make a good CI test case. This functionality will be critical in the context of a survey where we regularly want to associate new DIASources to existing/known DIAObjects.

            I'm going to make a note in my jupyter notebook that the pandas concatenation isn't doing what I intended and close this ticket. Point (2) here will undoubtedly spawn new tickets (to be discussed at a future ap_pipe meeting).
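
Point (1) can be illustrated with a small, self-contained sketch in plain Python. The two tables below are invented stand-ins for the DIAObject tables of two independent reruns; the IDs and coordinates are hypothetical, and exact coordinate equality is used only to keep the example short (a real cross-match would use a positional tolerance).

```python
# Hypothetical DIAObject tables from two independent reruns: (id, ra, dec).
take1 = [(1, 150.0, 2.0), (2, 150.5, 2.5)]
take2 = [(1, 151.0, 3.0), (2, 150.0, 2.0)]

# Naive concatenation, analogous to stacking the two dataframes.
combined = take1 + take2

# Problem A: the same ID now labels different sky positions.
positions_by_id = {}
for obj_id, ra, dec in combined:
    positions_by_id.setdefault(obj_id, set()).add((ra, dec))
ambiguous_ids = sorted(
    obj_id for obj_id, pos in positions_by_id.items() if len(pos) > 1
)

# Problem B: the same sky position carries different IDs.
ids_by_position = {}
for obj_id, ra, dec in combined:
    ids_by_position.setdefault((ra, dec), set()).add(obj_id)
split_objects = {
    pos: sorted(ids) for pos, ids in ids_by_position.items() if len(ids) > 1
}

print(ambiguous_ids)
print(split_objects)
```

Because each rerun numbers its objects independently, neither the ID nor a simple row stack identifies which rows belong to the same object; that requires re-running association (a positional match) across both sets.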


              People

              • Assignee:
                mrawls Meredith Rawls
                Reporter:
                mrawls Meredith Rawls
                Reviewers:
                Chris Morrison
                Watchers:
                Chris Morrison, Krzysztof Findeisen, Meredith Rawls
