Data Management / DM-2020

Research how to support L3

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: Qserv
    • Labels:
      None

      Description

      Research implications of having to deal with updatable Level 3 data.

            Activity

            bvan Brian Van Klaveren added a comment -

            This should probably have been brought up in DM-3365, but maybe it's okay here (if not, feel free to ignore and we'll keep it in mind until later):

            For authorization, is there a plan to handle users in multiple roles/groups? For example, say a person is in the LSST project and is also an L3 user, or belongs to several L3 groups.

            fritzm Fritz Mueller added a comment - - edited

            We will want to be able to efficiently join L3 data with spatially-partitioned L2 catalogs in qserv. For large L3 datasets, it is probable that we will want to shard and distribute the data using the same chunking scheme used for the L2 catalogs, so L2+L3 queries can be efficiently distributed to workers in the same fashion as L2-only queries. As such, L3 rows which might be joined to L2 data within a chunk will want to be co-located and co-replicated with the L2 data for the chunk. "Smaller" L3 datasets could conceivably be replicated in their entirety to every worker node, or accessed by workers via a shared filesystem.
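            A minimal sketch of the co-location idea above, assuming a simple fixed grid in RA/Dec. The grid constants, the `chunk_id` and `shard` helpers, and the `ra`/`decl` row keys are hypothetical and for illustration only; this is not Qserv's actual partitioner. The point is just that running L2 and L3 rows through the same chunking function puts joinable rows under the same chunk ID.

```python
from collections import defaultdict

# Hypothetical grid parameters, chosen only for illustration.
STRIPE_HEIGHT_DEG = 10.0
NUM_STRIPES = 18          # 180 degrees of declination / 10 degrees per stripe
CHUNKS_PER_STRIPE = 36    # 360 degrees of right ascension / 10 degrees per chunk


def chunk_id(ra_deg, dec_deg):
    """Map a sky position to a chunk ID on a fixed RA/Dec grid."""
    # Clamp so that dec == +90 falls into the last stripe.
    stripe = min(int((dec_deg + 90.0) // STRIPE_HEIGHT_DEG), NUM_STRIPES - 1)
    # Wrap RA into [0, 360) before binning.
    col = int((ra_deg % 360.0) // (360.0 / CHUNKS_PER_STRIPE))
    return stripe * CHUNKS_PER_STRIPE + col


def shard(rows):
    """Group rows (dicts with 'ra' and 'decl' keys) by chunk ID."""
    by_chunk = defaultdict(list)
    for row in rows:
        by_chunk[chunk_id(row["ra"], row["decl"])].append(row)
    return dict(by_chunk)


# L2 and L3 rows go through the same function, so rows that are close on
# the sky end up under the same chunk ID and can be stored together.
l2_chunks = shard([{"id": 1, "ra": 10.0, "decl": 5.0}])
l3_chunks = shard([{"id": 100, "ra": 10.2, "decl": 5.1}])
```

            With both catalogs keyed by the same chunk ID, a worker holding a given L2 chunk can be handed the matching L3 rows for that chunk.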

            The main implications of this to the replication system in particular are as follows:

            • L3 data may come and go during any given data release, whereas the chunked L2 data is explicitly static. It had previously been proposed that the "atom" of replication be the set of files backing row-subsets from all tables that correspond to a given spatial chunk. If we want to keep this as the replication atom, then we need to be prepared to treat changes to L3 data as versions of the atom and support upgrades/updates at times other than the yearly L2 data releases. Alternatively, if we choose to unbundle the L3 data into separate replication atoms, then the replication system will need explicit support for atom colocation, and we will need to support at least atom deletion between data releases.
            • If smaller L3 data is to be replicated throughout the cluster, then the replication system probably needs to explicitly support "replicate-to-every-node" mode for some chunks, and not just "replicate-to-n-nodes". We were probably going to want/need this anyway for some of the smaller non-partitioned L2 tables.
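            The two requirements above (atom colocation and a "replicate-to-every-node" mode) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Atom` record, its fields, the `place` function, and the atom and worker names are all hypothetical and do not correspond to any existing replication system API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Atom:
    """Hypothetical replication atom (illustrative only)."""
    name: str
    replicas: Optional[int] = None       # None means "replicate to every node"
    colocate_with: Optional[str] = None  # name of an atom whose placement to follow


def place(atoms: List[Atom], nodes: List[str]) -> Dict[str, List[str]]:
    """Assign each atom to a list of nodes.

    Independent atoms get either every node or n nodes chosen round-robin.
    Colocated atoms copy the placement of the atom they follow, which is
    assumed to be an independent atom in the same list; their own
    `replicas` value is ignored.
    """
    placement: Dict[str, List[str]] = {}
    independent = [a for a in atoms if a.colocate_with is None]
    for i, atom in enumerate(independent):
        if atom.replicas is None:
            placement[atom.name] = list(nodes)
        else:
            n = min(atom.replicas, len(nodes))
            placement[atom.name] = [nodes[(i + k) % len(nodes)] for k in range(n)]
    for atom in atoms:
        if atom.colocate_with is not None:
            placement[atom.name] = list(placement[atom.colocate_with])
    return placement


nodes = ["w1", "w2", "w3", "w4"]
atoms = [
    Atom("l2_chunk_325", replicas=2),                    # replicate-to-n-nodes
    Atom("l3_user_chunk_325", colocate_with="l2_chunk_325"),
    Atom("l2_small_table"),                              # replicate-to-every-node
]
placement = place(atoms, nodes)
```

            In this sketch, dropping an L3 dataset between data releases amounts to removing its atoms from the list and recomputing, which leaves the static L2 placement untouched.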
            jbecla Jacek Becla added a comment -

            Sounds good, let's close it.

            Just a thought, we should open a Confluence page for capturing random thoughts and notes about L3 from the qserv perspective. I'll do it.


              People

              Assignee:
              fritzm Fritz Mueller
              Reporter:
              fritzm Fritz Mueller
              Reviewers:
              Jacek Becla
              Watchers:
              Brian Van Klaveren, Fritz Mueller, Jacek Becla
