Data Management / DM-31813

Add diaObjectId's coords to DiaSource Parquet Table before ingest

    Details

    • Urgent?:
      No

      Description

      DAX needs a fiducial sky coord to use for the spatial partitioning. It's important that all the `diaSources` associated into a `diaObject` are on the same partition.

      Fritz says (in the context of ForcedSource, but it applies to DiaSource too):

      if you give us an ObjectId, for each one we need to look up the associated Object to get the Object's ra/dec to determine where to place it in order to ingest. So we need to have seen the Objects first to build a hash table or index for this, and that will be really big for the whole sky. If pipelines happen to "know" the fiducial ra/dec for the Object associated with a ForcedSource and generates those as separate columns in the ForcedSource output products, we can drive spatial sharding off that directly during ingest, which will be much more efficient and would allow us to take in ForcedSources before Objects.

      We can easily strip the "redundant" ra/dec partitioning columns during ingest (we'd want to, to cut down on storage; every column in the final ForcedSource tables hurts a lot because there are so many rows).

      Currently drpAssociation writes out a nicely normalized `goodSeeingDiff_assocDiaSrcTable` and `goodSeeingDiff_diaObjTable`. We could either denormalize `goodSeeingDiff_assocDiaSrcTable` by adding the diaObject's ra/decl (this is what we've been calling coord_ra and coord_dec in the Parquet tables, and why you often see it in addition to ra/decl), or we could write another task that joins the two and writes out a new table specifically for ingest. Other ideas welcome! I don't know how DAX uses these Parquet files for ingest, but if you write them to text files for bulk-loading speed, maybe we could even override that step to do the join before writing.
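The denormalization option can be sketched as a lookup join. This is a minimal illustration only, not the drpAssociation implementation: the function name `add_dia_object_coords` and the rows-as-dicts layout are assumptions, and only the column names (`diaObjectId`, `ra`, `decl`, `coord_ra`, `coord_dec`) come from the description above.

```python
def add_dia_object_coords(dia_sources, dia_objects):
    """Return copies of the DiaSource rows with the parent DiaObject's position attached.

    Hypothetical helper: rows are plain dicts here, whereas the real tables
    are Parquet files written by the pipeline.
    """
    # Index the DiaObject positions by ID so each source needs one lookup.
    object_coords = {
        obj["diaObjectId"]: (obj["ra"], obj["decl"]) for obj in dia_objects
    }

    denormalized = []
    for src in dia_sources:
        # Raises KeyError if a source points at an object that is not in the table.
        ra, decl = object_coords[src["diaObjectId"]]
        row = dict(src)  # copy, so the normalized input rows are left untouched
        row["coord_ra"] = ra
        row["coord_dec"] = decl
        denormalized.append(row)
    return denormalized


# Toy data: two sources belong to object 1, one source to object 2.
dia_objects = [
    {"diaObjectId": 1, "ra": 10.0, "decl": -5.0},
    {"diaObjectId": 2, "ra": 20.0, "decl": 15.0},
]
dia_sources = [
    {"diaSourceId": 101, "diaObjectId": 1, "ra": 10.0001, "decl": -5.0002},
    {"diaSourceId": 102, "diaObjectId": 1, "ra": 9.9999, "decl": -4.9999},
    {"diaSourceId": 103, "diaObjectId": 2, "ra": 20.0003, "decl": 15.0001},
]
ingest_rows = add_dia_object_coords(dia_sources, dia_objects)
```

Every row in `ingest_rows` keeps its own measured ra/decl and gains the object's position in coord_ra/coord_dec, so rows sharing a `diaObjectId` carry identical partitioning coordinates.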

            Activity

            yusra Yusra AlSayyad created issue -
            sullivan Ian Sullivan made changes -
            Labels: SciencePipelines drp-gen3 -> SciencePipelines ap-analysis drp-gen3
            cmorrison Chris Morrison [X] (Inactive) added a comment -

            Depending on when and how DAX calculates its spatial index, you could update the full history of DiaSources for a given DiaObject with the DiaObject's index to enforce this. That would save having to store extra information with the DiaSource.

            yusra Yusra AlSayyad made changes -
            Summary: Add diaObjectId's coords to DiaSource Table before ingest -> Add diaObjectId's coords to DiaSource Parquet Table before ingest
            ebellm Eric Bellm made changes -
            Watchers: Chris Morrison, Eric Bellm, Fritz Mueller, Ian Sullivan, Yusra AlSayyad -> Chris Morrison, Colin Slater, Eric Bellm, Fritz Mueller, Ian Sullivan, Yusra AlSayyad
            ebellm Eric Bellm added a comment -

            For ForcedSources this is clearly right: there is a 1-1 mapping between the Object ra,dec and the Forced Source. For DIASources and DIAObjects it's more complicated, as the DIASource ra, dec aren't identical to the DIAObject ra, dec. I wonder if there are edge cases here to worry about, where we have DIASources near the edge of the association radius that would naturally want to fall in another partition from the DIAObject, and would be missed in queries of the DIASource table in the "right" partition if assigned to the partition of the associated DIAObject.

            Does DAX plan to use overlap or boundary regions around partitions? (If not, spatial crossmatches will be more expensive.) That would resolve this issue. Colin Slater should weigh in here as to whether I'm overthinking this.
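To make the edge case concrete, here is a toy sketch. The flat 10-degree grid and the `chunk_id` function are hypothetical stand-ins, not Qserv's actual partitioning scheme; the point is only that a source and its associated object can land in different cells.

```python
import math

CHUNK_SIZE_DEG = 10.0  # hypothetical cell width, chosen only for illustration


def chunk_id(ra, dec, size=CHUNK_SIZE_DEG):
    """Map a position in degrees to an (ra_index, dec_index) cell on a flat grid."""
    return (math.floor(ra / size), math.floor((dec + 90.0) / size))


# A DIAObject just below a cell boundary in ra, and an associated DIASource
# just above it (well within a typical association radius).
dia_object = {"diaObjectId": 1, "ra": 9.9999, "decl": 0.5}
dia_source = {"diaSourceId": 101, "diaObjectId": 1, "ra": 10.0001, "decl": 0.5}

own_chunk = chunk_id(dia_source["ra"], dia_source["decl"])
object_chunk = chunk_id(dia_object["ra"], dia_object["decl"])

# Partition the DIASource table by the associated DIAObject's position.
table_by_object = {object_chunk: [dia_source]}

# A spatial query that only visits the cell containing the source's own
# position finds nothing, because the row was stored with its object.
hits = table_by_object.get(own_chunk, [])
```

Here `own_chunk` and `object_chunk` differ, so `hits` is empty. Searching overlap regions around each cell, or falling back to a full scan, would recover the row.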

            ctslater Colin Slater added a comment -

            The scenario that Eric described is indeed a valid concern. However, in the current state of Qserv a table can either be co-partitioned with a "director" table to enable fast access by a foreign key (e.g. ForcedSource), or spatially sharded on its own to enable fast spatial queries (e.g. Object). Eric's point is exactly why these are in tension. Overlap regions are used for the second category of tables ("director tables"), but hybrid spatial+foreign key tables are not implemented, and the storage cost of adding overlaps is nontrivial. (Fritz can correct me if I've misstated the tradeoffs here.)

            Because of this, I think it makes sense to partition DiaSource according to the corresponding DiaObject positions to enable fast light curve access, at the cost of leaving DiaSource spatial queries unoptimized (i.e. full table scans, so the results will still be correct). Spatially restricted joins between DiaObject and DiaSource may still give good performance, and that's one of the things we'll want to test with the real data. We can use that testing to decide how acceptable or unacceptable this arrangement is.

            fritzm Fritz Mueller added a comment -

            You've got the right of it, Colin, and I concur with your recommendation.

            I'm certain spatially restricted INNER joins between DiaObject and DiaSource would be optimized by Qserv (we've recently tested similar queries on other datasets). There are several strategies that could be pursued later to accelerate non-joined, spatially restricted queries on DiaSource if the unoptimized performance in these cases is deemed unacceptable.

            yusra Yusra AlSayyad made changes -
            Status: To Do -> In Progress
            yusra Yusra AlSayyad added a comment -

            I'm implementing this on DM-31825

            yusra Yusra AlSayyad made changes -
            Link: This issue is duplicated by DM-31825
            yusra Yusra AlSayyad made changes -
            Status: In Progress -> To Do
            yusra Yusra AlSayyad made changes -
            Resolution: Done
            Status: To Do -> Won't Fix

              People

              Assignee:
              Unassigned
              Reporter:
              yusra Yusra AlSayyad
              Watchers:
              Chris Morrison [X] (Inactive), Colin Slater, Eric Bellm, Fritz Mueller, Ian Sullivan, Yusra AlSayyad

