Data Management / DM-36199

Add optional Parquet outputs to diaPipe


Details

    • Story
    • Status: Done
    • Resolution: Done
    • AP F22-4 (September), AP F22-5 (October)
    • Alert Production

    Description

      Add optional outputs for diaForcedSources and diaObjectCatalog to diaPipe. These can use the same doWriteAssociatedSources config to control whether they are written.


          Activity

            Parejkoj John Parejko added a comment -

            mrawls: are you willing to be the reviewer for this ticket? The number of lines changed in the PR is small, but what I really need is someone to say that the content of those files is useful to us, and I think you're best placed to do that.


            mrawls Meredith Rawls added a comment -

            I definitely see the new persisted diaObject and diaForcedSource parquet tables. I see you chose to have them follow the naming convention with a fakes_deepDiff_ prefix, and that they are per visit/detector. This mirrors the fakes_deepDiff_diaSrc that was already being written, and is reasonable.

            Naming - consider diaForcedSrc instead of diaForcedSource (to mirror diaSrc, which already exists)?

            I suppose we will need an "afterburner" processing routine of some sort to consolidate these into a parquet analog of what lands in the APDB (over all visits and detectors). This is straightforward for diaSrc and diaForcedSource, but not for diaObject. I need to think about how best to turn these individual diaObject tables into the whole-run diaObject data I'm used to for making plots.
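The "afterburner" consolidation described above could start as simply as concatenating the per-visit/detector tables. A minimal sketch, assuming each table is already loaded as a pandas DataFrame (the `consolidate` function name and interface are illustrative, not part of the pipeline):

```python
import pandas as pd

def consolidate(tables):
    """Naively concatenate per-visit/detector DIA tables into one
    run-wide DataFrame.

    This is sufficient for diaSrc and diaForcedSource; diaObject
    additionally needs deduplication, because the same DiaObjectId
    can appear in several visit/detector tables.
    """
    return pd.concat(tables, ignore_index=True)
```

In practice the inputs would come from the butler rather than loose parquet files, so this only illustrates the shape of the operation.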

            Parejkoj John Parejko added a comment -

            I can rename the forced source table easily enough.

            Consolidating the tables is out of scope for this ticket. I just wanted to be sure that the output was what was expected.

            Can you please look at the PR and sign off if you're happy with it?

            mrawls Meredith Rawls added a comment -

            I'm trying to compare the parquet tables to the APDB tables to see if the content agrees, which requires consolidating info from multiple dataIds in some halfway consistent way. John and I spent some time working together on this today.

            Summary

            • In all comparisons, APDB tables use None to indicate no data, whereas parquet tables use some fun mashup of nan, NaN, and sometimes even NaT ("not a time") to indicate no data. This strikes me as annoying but not world-ending.
            • APDB DiaSource matches well with parquet fakes_deepDiff_assocDiaSrc. The only difference is the former has more columns (111 vs just 44). Reconciling this is out of scope for this ticket, and we may not care anyway.
            • APDB DiaObject matches mostly OK with parquet fakes_deepDiff_diaObject, with a caveat pertaining to validityStart. The columns are the same. However, the row counts differ, because a naive pandas concatenation of all the parquet tables yields some near-duplicate rows: validityStart has a timestamp in one "duplicate" row but not the other. The APDB rows all have a timestamp in validityStart; some correspond to the time of the first dataId and some to the time of the second. In contrast, the parquet tables have either NaT or the timestamp of the first dataId only. This difference affects (1) DiaObject rows that appear in both dataIds - thus generating duplicates - and (2) all rows from the last dataId processed. This needs to be addressed.
            • APDB DiaForcedSource matches mostly OK with parquet fakes_deepDiff_diaForcedSource, with a caveat pertaining to flags. Both have 12 columns, so that part is fine. The row counts differ in the same way as for DiaObject, which is mostly fine. However, the near-duplicate rows differ only in the flags column: one row has NaN and the other has 0.0. Interestingly, the first dataId processed does not have a flags column in the parquet table; the column only appears for the second dataId processed, and all entries in it are either NaN or 0.0 (both type np.float64). This needs to be resolved somehow. (The APDB DiaForcedSource table has a flags column and all entries are 0, type np.int64.)
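One hedged way to collapse the near-duplicate diaObject rows described above: prefer the row with a real validityStart timestamp where one exists, while still keeping objects whose only row has NaT (the new objects from the last dataId processed). Column names diaObjectId and validityStart follow the ticket; everything else is an illustrative sketch, not the eventual afterburner:

```python
import pandas as pd

def dedupe_dia_objects(df):
    """Collapse near-duplicate DiaObject rows from a naive concatenation.

    Sort so that rows with a real validityStart timestamp come before
    NaT rows, then keep the first row per diaObjectId.  Objects whose
    only row has NaT survive, so nothing is dropped outright.
    """
    ordered = df.sort_values("validityStart", na_position="last")
    return ordered.drop_duplicates("diaObjectId", keep="first")
```

Note this only fixes the duplication, not the missing validityStart values themselves, which would still need to be backfilled to match the APDB.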

            Some Fun Facts

            • This data set has 2 dataIds, and a total of 479 unique DiaObjectIds.
            • The first dataId processed has 289 DiaSources. The second has 222 DiaSources. So far so good.
            • The APDB DiaObject table has 479 rows. Great. All rows have a timestamp in the validityStart column corresponding to the time of either the first or second dataId processed.
            • A naive pandas concatenation of all the fakes_deepDiff_diaObject yields 691 rows. This difference of 212 makes sense because the extras are all near-duplicate (same DiaObjectId), but one has a timestamp in validityStart while the other does not. None of the new DiaObjects from the second dataId processed has a timestamp in validityStart. Overall, NaT is not an acceptable value for validityStart, but simply dropping these rows would have the side effect of dropping all the new DiaObjects from the last dataId processed.
            • The first dataId processed has 289 DiaForcedSources. The second has 481. This is probably OK because I think forced photometry happens at the location of all known DiaObjects and explicitly includes edge cases (chip edges), and 481 is just "pretty close" to 479.
            • The APDB DiaForcedSource table has 558 unique DiaForcedSourceIds.
            • A naive pandas concatenation of all the parquet fakes_deepDiff_diaForcedSource yields 770 rows. This difference is the same as the difference in DiaObjects - 212 more. The extras also correspond to near-duplicate rows, with the difference residing entirely in the flags column.

            mrawls Meredith Rawls added a comment -

            John and I worked some more on this, and reached the following conclusions:

            (1) The ticket is OK to merge as-is, but a followup ticket is needed to make the diaForcedSource and diaObject parquet tables useful for downstream analysis (i.e., metrics or plots with analysis_tools).

            (2) Naively concatenating parquet diaObject tables is not sufficient to reproduce an APDB DiaObject table due to a lack of validityStart info. Whatever ticket is made to handle this should clearly be noted as blocking any analysis_tools style plots of DiaObjects.

            (3) Parquet diaForcedSource tables have a flags column of floats, when this column exists; it does not exist for the first visit+detector processed, and it is errantly converted to float due to a pandas append operation in doPackageAlerts. It is not clear that this column is useful - even in a large HiTS run, all values in the APDB are 0. In a followup ticket, we need to either define what this column is for or delete it outright.
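Point (3) is easy to reproduce in pandas alone: concatenating a frame that lacks a column with one that carries an integer column fills the gap with NaN and silently upcasts the integers to float64. (pd.concat is shown here; the deprecated DataFrame.append behaved the same way. Column names mirror the ticket but the data is made up.)

```python
import pandas as pd

# First visit/detector table has no flags column; the second does.
first = pd.DataFrame({"diaForcedSourceId": [1, 2]})
second = pd.DataFrame({"diaForcedSourceId": [3], "flags": [0]})

combined = pd.concat([first, second], ignore_index=True)
# combined["flags"] is now float64 with NaN for the rows from `first`,
# even though the source column was integer.
```

Writing the column on every dataId (even if all zeros), or using pandas' nullable Int64 dtype, would avoid the coercion.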

            People

              Parejkoj John Parejko
              sullivan Ian Sullivan
              Meredith Rawls
              Eric Bellm, Ian Sullivan, John Parejko, Meredith Rawls
              Votes: 0
              Watchers: 4

