Data Management / DM-4239

Identify Qserv areas affected by secondary index

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: Qserv
    • Labels:
      None

      Description

      Evaluate Qserv software for the Czars and workers to identify where an interface to the secondary index will be required for efficient operation.

          Activity

          kelsey Mike Kelsey [X] (Inactive) added a comment -

          Ran some simple greps to identify places where "get chunk" or "objectId" is referenced in the Qserv code. Will consult with John Gates and Fritz Mueller about anything I've missed.

          kelsey Mike Kelsey [X] (Inactive) added a comment -

          Qserv already has an implementation, qproc/SecondaryIndex. It assumes a three-column table holding objectId, chunkId, and subChunkId, where the latter two are both 32-bit. Could we store them as shorts (16-bit) instead, saving ~160 GB of space plus overhead?

          All queries go through this implementation, so deployment may require modifying it, for example to split the index across multiple director tables.
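
          As a rough check on the ~160 GB figure: narrowing two columns from 32 to 16 bits saves 4 bytes per row, so the estimate corresponds to roughly 40 billion index rows. The sketch below only reproduces that arithmetic. The row count is inferred from the figure rather than stated in this ticket, and storage-engine and index overhead are ignored.

```python
# Back-of-the-envelope check of the space saved by narrowing
# chunkId and subChunkId from 32-bit to 16-bit integers.
# ASSUMPTION: the row count is inferred from the ~160 GB estimate;
# it is not given in the ticket. Overhead is ignored.

BYTES_INT32 = 4
BYTES_INT16 = 2
NARROWED_COLUMNS = 2  # chunkId and subChunkId


def bytes_saved_per_row() -> int:
    """Bytes saved per index row when both chunk columns shrink to 16 bits."""
    return NARROWED_COLUMNS * (BYTES_INT32 - BYTES_INT16)


def rows_implied_by_saving(saved_bytes: float) -> float:
    """Number of rows that would produce the given total saving."""
    return saved_bytes / bytes_saved_per_row()


def total_saving_bytes(n_rows: int) -> int:
    """Total bytes saved for a table with n_rows rows."""
    return n_rows * bytes_saved_per_row()


if __name__ == "__main__":
    print(f"bytes saved per row: {bytes_saved_per_row()}")
    print(f"rows implied by ~160 GB: {rows_implied_by_saving(160e9):.2e}")
```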

          kelsey Mike Kelsey [X] (Inactive) added a comment - edited

          The index is loaded by the main qserv-data-loader.py script, using the "-i"/"--index-db" option. By default the table goes into the qservMeta database. The main script invokes admin/python/lsst/qserv/admin/dataLoader.py, where the indexDb parameter is used; no secondary index is generated if the oneTable flag is set.

          In dataLoader.py, _makeIndex does the work, creating a three-column table whose key (objectId) column name is taken from the "id" entry of the partition options file, with the chunk columns named chunkId and subChunkId. The latter two are currently INT; should we compress them to SMALLINT (or better, SMALLINT UNSIGNED)?
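
          To make the column-type question concrete, here is a minimal sketch of the three-column layout with a simple range check. In MySQL, SMALLINT holds -32768 to 32767 and SMALLINT UNSIGNED holds 0 to 65535, which is why the unsigned form is preferable for non-negative IDs. The table name, the BIGINT key type, the primary-key choice, and the helper names are all hypothetical, not taken from dataLoader.py. Whether real chunk and sub-chunk IDs fit in 16 bits is not established in this ticket.

```python
# Sketch of the three-column index layout and a 16-bit range check.
# ASSUMPTIONS: table name, BIGINT key type, PRIMARY KEY, and these helper
# functions are illustrative only; they are not the dataLoader.py code.

SMALLINT_SIGNED_MAX = 32767     # 2**15 - 1
SMALLINT_UNSIGNED_MAX = 65535   # 2**16 - 1


def index_ddl(table: str, id_column: str, chunk_type: str = "INT") -> str:
    """Build a CREATE TABLE statement for the index with a chosen chunk type."""
    return (
        f"CREATE TABLE {table} ("
        f"{id_column} BIGINT NOT NULL PRIMARY KEY, "
        f"chunkId {chunk_type} NOT NULL, "
        f"subChunkId {chunk_type} NOT NULL)"
    )


def fits(values, max_value: int) -> bool:
    """True if every value is non-negative and no larger than max_value."""
    return all(0 <= v <= max_value for v in values)


if __name__ == "__main__":
    print(index_ddl("index_sketch", "objectId", "SMALLINT UNSIGNED"))
    sample_chunk_ids = [0, 1234, 40000]  # made-up values for illustration
    print("fits signed:  ", fits(sample_chunk_ids, SMALLINT_SIGNED_MAX))
    print("fits unsigned:", fits(sample_chunk_ids, SMALLINT_UNSIGNED_MAX))
```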

          Finally, two functions actually fill the secondary index: _makeIndexMultiNode for a system with a czar and multiple workers, and _makeIndexSingleNode for a single-host test. Both call _loadChunkIndex, which fetches the three columns of data for the given chunk from the worker, packs that data into a tab-delimited in-memory temporary file, then loads it into the secondary index table.
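
          The pack-then-load step described above can be sketched as follows. This is not the _loadChunkIndex code: the function names are hypothetical, and an in-memory SQLite database stands in for the real index database so the example is self-contained.

```python
# Sketch of packing one chunk's (objectId, chunkId, subChunkId) rows into a
# tab-delimited in-memory buffer and loading that buffer into an index table.
# ASSUMPTIONS: pack_rows/load_buffer are hypothetical names, and SQLite is a
# stand-in for the actual index database.
import csv
import io
import sqlite3


def pack_rows(rows):
    """Write row tuples to an in-memory tab-delimited buffer and rewind it."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    buf.seek(0)
    return buf


def load_buffer(conn, buf, table="index_sketch"):
    """Parse the tab-delimited buffer and insert its rows; return row count."""
    reader = csv.reader(buf, delimiter="\t")
    rows = [tuple(int(field) for field in line) for line in reader]
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE index_sketch ("
    "objectId INTEGER PRIMARY KEY, chunkId INTEGER, subChunkId INTEGER)"
)

# Made-up rows for a single chunk.
chunk_rows = [(1001, 7, 0), (1002, 7, 3)]
n = load_buffer(conn, pack_rows(chunk_rows))
```

          Once loaded, a lookup by objectId returns the chunk and sub-chunk holding that object, for example `conn.execute("SELECT chunkId, subChunkId FROM index_sketch WHERE objectId = ?", (1002,))`.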

          kelsey Mike Kelsey [X] (Inactive) added a comment -

          It looks like the basic pieces are in place, both for creating and for using the secondary index. Refinements to data loading, updating, and query results will be needed.


            People

            • Assignee:
              Unassigned
              Reporter:
              fritzm Fritz Mueller
              Watchers:
              Mike Kelsey [X] (Inactive)
