DM-4239 (Data Management)

Identify Qserv areas affected by secondary index

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: Qserv
    • Labels:
      None

      Description

      Evaluate Qserv software for the Czars and workers to identify where an interface to the secondary index will be required for efficient operation.

          Activity

          kelsey Mike Kelsey [X] (Inactive) created issue -
          kelsey Mike Kelsey [X] (Inactive) made changes -
          Field Original Value New Value
          Epic Link DM-4222 [ 21105 ]
          kelsey Mike Kelsey [X] (Inactive) made changes -
          Assignee Jacek Becla [ jbecla ]
          jbecla Jacek Becla made changes -
          Rank Ranked higher
          jbecla Jacek Becla made changes -
          Rank Ranked higher
          jbecla Jacek Becla made changes -
          Rank Ranked higher
          jbecla Jacek Becla made changes -
          Assignee Mike Kelsey [ kelsey ]
          jbecla Jacek Becla made changes -
          Rank Ranked higher
          fritzm Fritz Mueller made changes -
          Sprint DB_W16_03 [ 199 ]
          kelsey Mike Kelsey [X] (Inactive) added a comment -

          Ran some simple greps to identify places where "get chunk" or "objectId" is referenced in the Qserv code. Will consult with John Gates and Fritz Mueller about what I've missed.

          kelsey Mike Kelsey [X] (Inactive) made changes -
          Status To Do [ 10001 ] In Progress [ 3 ]
          kelsey Mike Kelsey [X] (Inactive) added a comment -

          Qserv already has an implementation, qproc/SecondaryIndex. It assumes a three-column table holding objectId, chunkId, and subChunkId, where the latter two are both 32-bit integers. Can we store them as shorts (16-bit) instead, saving ~160 GB of space plus overhead?

          All queries go through this implementation, so deployment may involve modifying it, for example to split the index across multiple director tables.
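
As a rough check on the ~160 GB figure: narrowing two columns from 32 to 16 bits saves 4 bytes per row, so the estimate corresponds to roughly 40 billion index rows. That row count is inferred from the quoted figure, not stated in this ticket, and the sketch below ignores key and storage-engine overhead. The helper names are hypothetical.

```python
INT32_BYTES = 4
INT16_BYTES = 2
NARROWED_COLUMNS = 2  # chunkId and subChunkId


def bytes_saved(num_rows: int) -> int:
    """Raw column bytes saved by storing both chunk columns as 16-bit values."""
    return num_rows * NARROWED_COLUMNS * (INT32_BYTES - INT16_BYTES)


def fits_unsigned_16(value: int) -> bool:
    """True if the value can be stored in an unsigned 16-bit column."""
    return 0 <= value <= 0xFFFF


# Assumption: row count back-derived from the ~160 GB estimate above.
ASSUMED_ROWS = 40_000_000_000
saved_gb = bytes_saved(ASSUMED_ROWS) / 1e9
```

The narrowing is only safe if every chunkId and subChunkId produced by the partitioning configuration stays within 0..65535, which would need to be checked against the actual data.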

          kelsey Mike Kelsey [X] (Inactive) added a comment - - edited

          The index is loaded by the main qserv-data-loader.py script, using the "-i"/"--index-db" option flag. By default the table goes into the qservMeta database. The main script invokes admin/python/lsst/qserv/admin/dataLoader.py, where the indexDb parameter is used; no secondary index is generated if the oneTable flag is set.

          In dataLoader.py, _makeIndex does the work, creating a three-column table with the key (objectId) column name taken from the "id" entry of the partition options file, and the chunk columns named chunkId and subChunkId, respectively. The latter two are currently INT; should we narrow them to SMALLINT (or better, SMALLINT UNSIGNED)?
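
A minimal sketch of what the two table layouts could look like, written as a Python helper that builds the DDL string. This is not the code in _makeIndex: the function name, the table name, the BIGINT key type, and the PRIMARY KEY clause are all assumptions made for illustration. Only the three-column shape and the chunkId/subChunkId names come from the description above. Note that MySQL spells the unsigned type as SMALLINT UNSIGNED.

```python
def index_table_ddl(table: str, id_column: str, chunk_type: str = "INT") -> str:
    """Build a CREATE TABLE statement for a three-column secondary index.

    The key column type (BIGINT) and PRIMARY KEY are assumptions.
    """
    return (
        f"CREATE TABLE {table} ("
        f"{id_column} BIGINT NOT NULL PRIMARY KEY, "
        f"chunkId {chunk_type} NOT NULL, "
        f"subChunkId {chunk_type} NOT NULL)"
    )


# Layout with 32-bit chunk columns, as described above.
current = index_table_ddl("ObjectIndex", "objectId")
# Proposed layout with 16-bit unsigned chunk columns.
compact = index_table_ddl("ObjectIndex", "objectId", "SMALLINT UNSIGNED")
```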

          Finally, two functions actually fill the secondary index: _makeIndexMultiNode for a system with a czar and multiple workers, and _makeIndexSingleNode for a single-host test. Both call _loadChunkIndex, which gets the three columns of data for the given chunk on the worker, packs that data into a tab-delimited in-memory temporary file, then loads it into the secondary index table.
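
The pack-then-load pattern described for _loadChunkIndex can be sketched as follows. This is an illustration, not the dataLoader.py code: the function names and sample rows are hypothetical, and an in-memory SQLite database stands in for the MySQL index table so the example runs on its own.

```python
import csv
import io
import sqlite3


def pack_rows(rows):
    """Write (objectId, chunkId, subChunkId) tuples to an in-memory TSV file."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    buf.seek(0)  # rewind so the loader reads from the start
    return buf


def load_rows(conn, buf):
    """Read the TSV buffer back and insert its rows into the index table."""
    reader = csv.reader(buf, delimiter="\t")
    conn.executemany(
        "INSERT INTO ObjectIndex (objectId, chunkId, subChunkId) VALUES (?, ?, ?)",
        ((int(a), int(b), int(c)) for a, b, c in reader),
    )
    conn.commit()


# SQLite stand-in for the secondary index table (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ObjectIndex ("
    "objectId INTEGER PRIMARY KEY, chunkId INTEGER, subChunkId INTEGER)"
)

chunk_rows = [(1001, 42, 3), (1002, 42, 7)]  # made-up rows for one chunk
load_rows(conn, pack_rows(chunk_rows))
loaded = conn.execute(
    "SELECT objectId, chunkId, subChunkId FROM ObjectIndex ORDER BY objectId"
).fetchall()
```

Staging each chunk in a single buffer lets the rows go to the database in one bulk operation per chunk instead of one statement per row.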

          kelsey Mike Kelsey [X] (Inactive) added a comment -

          It looks like the basic pieces are in place, both for creating and for using the secondary index. Refinements to data loading, updating, and query results will be needed.

          kelsey Mike Kelsey [X] (Inactive) made changes -
          Resolution Done [ 10000 ]
          Status In Progress [ 3 ] Done [ 10002 ]
          fritzm Fritz Mueller made changes -
          Assignee Mike Kelsey [ kelsey ]
          Reporter Mike Kelsey [ kelsey ] Fritz Mueller [ fritzm ]

            People

            • Assignee:
              Unassigned
              Reporter:
              fritzm Fritz Mueller
              Watchers:
              Mike Kelsey [X] (Inactive)