  1. Data Management
  2. DM-31813

Add diaObjectId's coords to DiaSource Parquet Table before ingest

    Details

    • Urgent?:
      No

      Description

      DAX needs a fiducial sky coord to use for the spatial partitioning. It's important that all the `diaSources` associated into a `diaObject` are on the same partition.

      Fritz says (in the context of ForcedSource, but it applies to DiaSource too):

      if you give us an ObjectId, for each one we need to look up the associated Object to get the Object's ra/dec to determine where to place it in order to ingest. So we need to have seen the Objects first to build a hash table or index for this, and that will be really big for the whole sky. If pipelines happen to "know" the fiducial ra/dec for the Object associated with a ForcedSource and generates those as separate columns in the ForcedSource output products, we can drive spatial sharding off that directly during ingest, which will be much more efficient and would allow us to take in ForcedSources before Objects.

      We can easily strip the "redundant" ra/dec partitioning columns during ingest (we'd want to, to cut down on storage; every column in the final ForcedSource tables hurts a lot because there are so many rows).
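
      To illustrate the point in the quote above, here is a minimal sketch of driving partition assignment directly from fiducial coordinates carried on each row. The equal-angle grid, the constants, and the function names are hypothetical and are not how DAX actually partitions; the only names taken from this ticket are `coord_ra` and `coord_dec`, which are assumed here to be in degrees.

```python
# Hypothetical grid: declination stripes, each split into equal RA bins.
NUM_STRIPES = 18
CHUNKS_PER_STRIPE = 36


def chunk_id(ra_deg, dec_deg):
    """Map a sky position (degrees) to a partition id on a simple grid."""
    stripe = min(int((dec_deg + 90.0) / 180.0 * NUM_STRIPES), NUM_STRIPES - 1)
    chunk = int((ra_deg % 360.0) / 360.0 * CHUNKS_PER_STRIPE) % CHUNKS_PER_STRIPE
    return stripe * CHUNKS_PER_STRIPE + chunk


def partition_rows(rows, ra_col="coord_ra", dec_col="coord_dec"):
    """Group rows by partition using the fiducial coordinates on each row.

    No lookup into an Object table or index is needed: every row already
    carries the position that decides its placement.
    """
    partitions = {}
    for row in rows:
        key = chunk_id(row[ra_col], row[dec_col])
        partitions.setdefault(key, []).append(row)
    return partitions
```

      In this sketch, two sources whose own positions fall on either side of a partition boundary still land together as long as they share the same `coord_ra`/`coord_dec`.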

      Currently drpAssociation writes out a nicely normalized `goodSeeingDiff_assocDiaSrcTable` and `goodSeeingDiff_diaObjTable`. We could either denormalize `goodSeeingDiff_assocDiaSrcTable` and add the diaObject's ra/decl (this is what we've been calling coord_ra and coord_dec in the parquet tables, and why you often see it in addition to ra/decl), or we could write another task that joins the two and writes out a new table specifically for ingest. Other ideas welcome! I don't know how DAX uses these Parquet files for ingest, but if you write them to text files for bulk loading speed, maybe we could even override that step to do the join before writing.
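
      The join described above can be sketched without any library as a left join keyed on the diaObject id. This is only an illustration of the logic: the function name is hypothetical, and the column names `diaObjectId`, `ra`, and `decl` are assumptions about the two tables. On DataFrames the same step would be a left merge on the id column (for example pandas `DataFrame.merge` with `how="left"`).

```python
def add_dia_object_coords(dia_sources, dia_objects):
    """Return copies of diaSource rows with the parent diaObject's ra/decl added.

    Rows are plain dicts. The added columns are named coord_ra and coord_dec;
    sources whose diaObjectId has no match get None in both (left join).
    """
    # Build the id -> (ra, decl) lookup once from the diaObject table.
    obj_coords = {
        obj["diaObjectId"]: (obj["ra"], obj["decl"]) for obj in dia_objects
    }
    joined = []
    for src in dia_sources:
        coord_ra, coord_dec = obj_coords.get(src["diaObjectId"], (None, None))
        # Copy the row so the normalized input table is left untouched.
        joined.append({**src, "coord_ra": coord_ra, "coord_dec": coord_dec})
    return joined
```

      Keeping the source's own ra/decl alongside the new columns matches the existing convention noted above, and the extra columns can be dropped again at ingest.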

            Activity

            yusra Yusra AlSayyad created issue -
            sullivan Ian Sullivan made changes -
            Field: Labels
            Original Value: SciencePipelines drp-gen3
            New Value: SciencePipelines ap-analysis drp-gen3
            yusra Yusra AlSayyad made changes -
            Field: Summary
            Original Value: Add diaObjectId's coords to DiaSource Table before ingest
            New Value: Add diaObjectId's coords to DiaSource Parquet Table before ingest
            ebellm Eric Bellm made changes -
            Field: Watchers
            Original Value: Chris Morrison, Eric Bellm, Fritz Mueller, Ian Sullivan, Yusra AlSayyad
            New Value: Chris Morrison, Colin Slater, Eric Bellm, Fritz Mueller, Ian Sullivan, Yusra AlSayyad
            yusra Yusra AlSayyad made changes -
            Field: Status
            Original Value: To Do [ 10001 ]
            New Value: In Progress [ 3 ]
            yusra Yusra AlSayyad made changes -
            Field: Link
            Original Value:
            New Value: This issue is duplicated by DM-31825 [ DM-31825 ]
            yusra Yusra AlSayyad made changes -
            Field: Status
            Original Value: In Progress [ 3 ]
            New Value: To Do [ 10001 ]
            yusra Yusra AlSayyad made changes -
            Field: Resolution
            Original Value:
            New Value: Done [ 10000 ]
            Field: Status
            Original Value: To Do [ 10001 ]
            New Value: Won't Fix [ 10405 ]

              People

              Assignee:
              Unassigned
              Reporter:
              yusra Yusra AlSayyad
              Watchers:
              Chris Morrison [X] (Inactive), Colin Slater, Eric Bellm, Fritz Mueller, Ian Sullivan, Yusra AlSayyad
              Votes:
              0

                Dates

                Created:
                Updated:
                Resolved:
