Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-24612

Add indexes to dataset_collection tables

    XMLWordPrintable

    Details

    • Story Points:
      1
    • Epic Link:
    • Team:
      Data Release Production
    • Urgent?:
      No

      Description

      Adding the right indexes - once we determine what they are - should be easy after DM-21764. Stay tuned for a follow-up post on that here.

        Attachments

          Issue Links

            Activity

            Hide
            jbosch Jim Bosch added a comment - - edited

            Christopher Stephens [X], Kian-Tat Lim, Andy Salnikov, and other database pundits, I have some questions about what indexes I should be adding to the dataset tables now that (soon; post DM-21764) I'm in a position to do so.

            Here's an example of the kind of table we're working with (SQLite dialect):

            CREATE TABLE dataset_collection_24cc (
                dataset_type_id BIGINT NOT NULL,
                dataset_id BIGINT,
                collection_name VARCHAR(64),
                instrument VARCHAR(16) NOT NULL,
                detector INTEGER NOT NULL,
                visit INTEGER NOT NULL, 
                PRIMARY KEY (dataset_id, collection_name), 
                CONSTRAINT dataset_collection_24cc_unq_dataset_type_id_collection_name_instrument_detector_visit UNIQUE (dataset_type_id, collection_name, instrument, detector, visit), 
                CONSTRAINT fkey_dataset_collection_24cc_dataset_type_id_dataset_type_id FOREIGN KEY(dataset_type_id) REFERENCES dataset_type (id), 
                CONSTRAINT fkey_dataset_collection_24cc_dataset_id_dataset_id FOREIGN KEY(dataset_id) REFERENCES dataset (id) ON DELETE CASCADE, 
                CONSTRAINT fkey_dataset_collection_24cc_collection_name_collection_name FOREIGN KEY(collection_name) REFERENCES collection (name) ON DELETE CASCADE, 
                CONSTRAINT fkey_dataset_collection_24cc_instrument_name_instrument FOREIGN KEY(instrument) REFERENCES instrument (name), 
                CONSTRAINT fkey_dataset_collection_24cc_detector_instrument_id_instrument_detector FOREIGN KEY(instrument, detector) REFERENCES detector (instrument, id), 
                CONSTRAINT fkey_dataset_collection_24cc_visit_instrument_id_instrument_visit FOREIGN KEY(instrument, visit) REFERENCES visit (instrument, id)
            );
            

            The "instrument", "detector", and "visit" fields are together the data ID fields - different tables will have different combinations of these fields and other similar ones.

            The kinds of queries we want to optimize look like this:

            SELECT
                dataset_id
            FROM dataset_collection_24cc
            WHERE
                collection_name=? AND dataset_type_id=? AND instrument=? AND detector=? AND detector=?
            

            or they'll be a larger query in which this table appears in a subquery, with roughly this form:

            SELECT
                visit.instrument AS instrument,
                detector.id AS detector,
                visit.id AS visit,
                calexp.dataset_id AS calexp_dataset_id,
                ...
            FROM
                visit
                INNER JOIN detector ON (visit.instrument = detector.instrument)
                INNER JOIN (
                    SELECT dataset_id, instrument, detector, visit
                    FROM dataset_collection_24cc
                    WHERE collection_name=? AND dataset_type_id=?
                ) calexp ON (
                    calexp.instrument=visit.instrument AND calexp.visit=visit.id 
                    AND
                    calexp.instrument=detector.instrument AND calexp.detector=detector.id
                )
                ...
            

            I'm eliding some joins to other tables (on dataset_id) that would also appear unless we did some denormalization because I don't think they're relevant for my questions.

            Some notes on this, before I get to those questions:

            • It is not a coincidence that both of these queries specify the dataset_type_id and collection_name explicitly. At present I am planning to loop over the combinations of those that could actually contain datasets in Python when generating queries to try to keep the queries from getting too complex.
            • We will often want to search multiple collections. At present I'm doing that by doing a UNION inside the subquery rather than collection_name IN (...), but could change that necessary; this is just much easier to write to code to generate the way things are structured in Python. I think this is also orthogonal to my questions.

            So, the questions:

            • Do we want one index on all of (collection_name, dataset_type, <data-ID-fields), one index on each of them, or multiple indexes on different subsets of them?
            • Given that we already have a UNIQUE constraint on all of those fields together, should we expect an index to have been created automatically by the database to enforce that constraint, and that this index will automatically be used when it's useful in queries like this?
            • Given that we already have FOREIGN KEY constraints for each of the data ID field combinations that would appear in the JOIN ON expressions, should we expect a similar usable index to have been created automatically for those?
            Show
            jbosch Jim Bosch added a comment - - edited Christopher Stephens [X] , Kian-Tat Lim , Andy Salnikov , and other database pundits, I have some questions about what indexes I should be adding to the dataset tables now that (soon; post DM-21764 ) I'm in a position to do so. Here's an example of the kind of table we're working with (SQLite dialect): CREATE TABLE dataset_collection_24cc ( dataset_type_id BIGINT NOT NULL , dataset_id BIGINT , collection_name VARCHAR (64), instrument VARCHAR (16) NOT NULL , detector INTEGER NOT NULL , visit INTEGER NOT NULL , PRIMARY KEY (dataset_id, collection_name), CONSTRAINT dataset_collection_24cc_unq_dataset_type_id_collection_name_instrument_detector_visit UNIQUE (dataset_type_id, collection_name, instrument, detector, visit), CONSTRAINT fkey_dataset_collection_24cc_dataset_type_id_dataset_type_id FOREIGN KEY (dataset_type_id) REFERENCES dataset_type (id), CONSTRAINT fkey_dataset_collection_24cc_dataset_id_dataset_id FOREIGN KEY (dataset_id) REFERENCES dataset (id) ON DELETE CASCADE , CONSTRAINT fkey_dataset_collection_24cc_collection_name_collection_name FOREIGN KEY (collection_name) REFERENCES collection ( name ) ON DELETE CASCADE , CONSTRAINT fkey_dataset_collection_24cc_instrument_name_instrument FOREIGN KEY (instrument) REFERENCES instrument ( name ), CONSTRAINT fkey_dataset_collection_24cc_detector_instrument_id_instrument_detector FOREIGN KEY (instrument, detector) REFERENCES detector (instrument, id), CONSTRAINT fkey_dataset_collection_24cc_visit_instrument_id_instrument_visit FOREIGN KEY (instrument, visit) REFERENCES visit (instrument, id) ); The "instrument", "detector", and "visit" fields are together the data ID fields - different tables will have different combinations of these fields and other similar ones. The kinds of queries we want to optimize look like this: SELECT dataset_id FROM dataset_collection_24cc WHERE collection_name=? AND dataset_type_id=? AND instrument=? AND detector=? AND detector=? or they'll be a larger query in which this table appears in a subquery, with roughly this form: SELECT visit.instrument AS instrument, detector.id AS detector, visit.id AS visit, calexp.dataset_id AS calexp_dataset_id, ... FROM visit INNER JOIN detector ON (visit.instrument = detector.instrument) INNER JOIN ( SELECT dataset_id, instrument, detector, visit FROM dataset_collection_24cc WHERE collection_name=? AND dataset_type_id=? ) calexp ON ( calexp.instrument=visit.instrument AND calexp.visit=visit.id AND calexp.instrument=detector.instrument AND calexp.detector=detector.id ) ... I'm eliding some joins to other tables (on dataset_id) that would also appear unless we did some denormalization because I don't think they're relevant for my questions. Some notes on this, before I get to those questions: It is not a coincidence that both of these queries specify the dataset_type_id and collection_name explicitly. At present I am planning to loop over the combinations of those that could actually contain datasets in Python when generating queries to try to keep the queries from getting too complex. We will often want to search multiple collections. At present I'm doing that by doing a UNION inside the subquery rather than collection_name IN (...), but could change that necessary; this is just much easier to write to code to generate the way things are structured in Python. I think this is also orthogonal to my questions. So, the questions: Do we want one index on all of (collection_name, dataset_type, <data-ID-fields), one index on each of them, or multiple indexes on different subsets of them? Given that we already have a UNIQUE constraint on all of those fields together, should we expect an index to have been created automatically by the database to enforce that constraint, and that this index will automatically be used when it's useful in queries like this? Given that we already have FOREIGN KEY constraints for each of the data ID field combinations that would appear in the JOIN ON expressions, should we expect a similar usable index to have been created automatically for those?
            Hide
            salnikov Andy Salnikov added a comment - - edited

            I presume in the first query it should have

             collection_name=? AND dataset_type_id=? AND instrument=? AND detector=? AND visit=?
            

            instead of detector twice.

            If the first query is executed separately (not joined with anything) then the index (collection_name, dataset_type, <data-ID-fields>) should be optimal. Unique constraint for simple cases like ours is implemented as a unique index in all backends that we support now, I don't think we need to define separate index for it.

            For joins with multiple tables/subqueries it's hard to predict exactly which indices query optimizer will decide to use, and it also depends on the data in the database. If you define (collection_name, dataset_type, <data-ID-fields>) index then this index could already be used for second query if (collection_name, dataset_type) columns are the leading elements in the index. I'd probably not try anything fancier now, for performance we'll likely have to do separate study.

            Foreign key situation is more complicated, I think I remembered that some databases do create an implicit index on those but in general foreign key implementation does not require an index, so if you want to index it then it's better to create explicit index (and I think some people say always create an index on foreign key). Whether it helps join performance depends on query and optimizer.

            Show
            salnikov Andy Salnikov added a comment - - edited I presume in the first query it should have collection_name=? AND dataset_type_id=? AND instrument=? AND detector=? AND visit=? instead of detector twice. If the first query is executed separately (not joined with anything) then the index (collection_name, dataset_type, <data-ID-fields>) should be optimal. Unique constraint for simple cases like ours is implemented as a unique index in all backends that we support now, I don't think we need to define separate index for it. For joins with multiple tables/subqueries it's hard to predict exactly which indices query optimizer will decide to use, and it also depends on the data in the database. If you define (collection_name, dataset_type, <data-ID-fields>) index then this index could already be used for second query if (collection_name, dataset_type) columns are the leading elements in the index. I'd probably not try anything fancier now, for performance we'll likely have to do separate study. Foreign key situation is more complicated, I think I remembered that some databases do create an implicit index on those but in general foreign key implementation does not require an index, so if you want to index it then it's better to create explicit index (and I think some people say always create an index on foreign key). Whether it helps join performance depends on query and optimizer.
            Hide
            cs2018 Christopher Stephens [X] (Inactive) added a comment - - edited

            just to add to what Andy said:

            I think we can make some solid guesses on indexing but we won't know for sure until we see this functionality in action w/ data volumes and distributions that are representative of production.

            If the above two generalized SQLs account for most of the queries hitting collection tables, the UNIQUE constraint should be enough for good performance. You just have to make sure the index has the columns in the right order to get efficient plans for both SQLs (collection_name, dataset_type_id, instrument, detector, visit). That assumes join order of second SQL starts w/ collection table which may not be the case. We'll have to see.

            If you plan to return a significant number of rows from collection tables leading to lots of index -> table lookups to get additional dataset_id column, it might be worth adding dataset_id to keep from touching the table at all. I think that would mean an additional index since constraint couldn't be enforced with covering index which obviously slows DML so we'd definitely need to test. Regardless, it's premature to do that.

            You definitely don't want a bunch of single column indexes unless you have a lot of SQL with single column predicates or possibly varying subset of predicates against collection_name, dataset_type_id, instrument, detector, visit.

            I'd be very cautious of UNION if you are returning much data from UNION'd portion of SQL. That implies an additional sort or hashing to get distinct result set and since hashing is an option, you can't depend on any kind of implied order. If that isn't absolutely necessary, use UNION ALL.

            If foreign keys are indexed, that absolutely makes joins more efficient if join order is parent > child. If not, those indexes still might be necessary to prevent locking issues (row-exclusive vs table-exclusive) on DML.

             

            Show
            cs2018 Christopher Stephens [X] (Inactive) added a comment - - edited just to add to what Andy said: I think we can make some solid guesses on indexing but we won't know for sure until we see this functionality in action w/ data volumes and distributions that are representative of production. If the above two generalized SQLs account for most of the queries hitting collection tables, the UNIQUE constraint should be enough for good performance. You just have to make sure the index has the columns in the right order to get efficient plans for both SQLs (collection_name, dataset_type_id, instrument, detector, visit). That assumes join order of second SQL starts w/ collection table which may not be the case. We'll have to see. If you plan to return a significant number of rows from collection tables leading to lots of index -> table lookups to get additional dataset_id column, it might be worth adding dataset_id to keep from touching the table at all. I think that would mean an additional index since constraint couldn't be enforced with covering index which obviously slows DML so we'd definitely need to test. Regardless, it's premature to do that. You definitely don't want a bunch of single column indexes unless you have a lot of SQL with single column predicates or possibly varying subset of predicates against collection_name, dataset_type_id, instrument, detector, visit. I'd be very cautious of UNION if you are returning much data from UNION'd portion of SQL. That implies an additional sort or hashing to get distinct result set and since hashing is an option, you can't depend on any kind of implied order. If that isn't absolutely necessary, use UNION ALL. If foreign keys are indexed, that absolutely makes joins more efficient if join order is parent > child. If not, those indexes still might be necessary to prevent locking issues (row-exclusive vs table-exclusive) on DML.  
            Hide
            jbosch Jim Bosch added a comment -

            Thanks for the feedback. It sounds like all I should do for now on this ticket is add indexes to the foreign key definitions, since those aren't automatic, and to make sure the UNIQUE constraint is defined in the right order (I assume that order also sets the index order).

            I'll just clarify a more points raised above:

            assumes join order of second SQL

            Is there anything I should be doing in writing these queries to encourage certain join orders? The order in which I actually put tables in the FROM clause is irrelevant, right? (In any case, I assume it's premature to do anything like this, but I don't want to make incorrect assumptions about what can and cannot matter when writing the query.)

            If you plan to return a significant number of rows from collection tables leading to lots of index -> table lookups to get additional dataset_id column, it might be worth adding dataset_id to keep from touching the table at all.

            These would actually be joins on dataset_id (which is in this table) to get other columns from a different table where dataset_id is the primary key. I assume it's still true that any denormalization to eliminate those joins is premature.

            I'd be very cautious of UNION if you are returning much data from UNION'd portion of SQL.

            I believe the UNIQUE constraint here actually guarantees that UNION and UNION ALL would return the same results for most (probably all) of these queries, and it should have occurred to me that I should definitely use UNION ALL anyway.

            Show
            jbosch Jim Bosch added a comment - Thanks for the feedback. It sounds like all I should do for now on this ticket is add indexes to the foreign key definitions, since those aren't automatic, and to make sure the UNIQUE constraint is defined in the right order (I assume that order also sets the index order). I'll just clarify a more points raised above: assumes join order of second SQL Is there anything I should be doing in writing these queries to encourage certain join orders? The order in which I actually put tables in the FROM clause is irrelevant, right? (In any case, I assume it's premature to do anything like this, but I don't want to make incorrect assumptions about what can and cannot matter when writing the query.) If you plan to return a significant number of rows from collection tables leading to lots of index -> table lookups to get additional dataset_id column, it might be worth adding dataset_id to keep from touching the table at all. These would actually be joins on dataset_id (which is in this table) to get other columns from a different table where dataset_id is the primary key. I assume it's still true that any denormalization to eliminate those joins is premature. I'd be very cautious of UNION if you are returning much data from UNION'd portion of SQL. I believe the UNIQUE constraint here actually guarantees that UNION and UNION ALL would return the same results for most (probably all) of these queries, and it should have occurred to me that I should definitely use UNION ALL anyway.
            Hide
            cs2018 Christopher Stephens [X] (Inactive) added a comment -

            I'm unsure about sqlite but with Postgresql and Oracle, the optimizer determines join order based on statistics so that shouldn't matter.

            yeah, i wouldn't worry at all about denormalization yet. i think current actions are sufficient until we get to a point of testing representative workloads.

            Show
            cs2018 Christopher Stephens [X] (Inactive) added a comment - I'm unsure about sqlite but with Postgresql and Oracle, the optimizer determines join order based on statistics so that shouldn't matter. yeah, i wouldn't worry at all about denormalization yet. i think current actions are sufficient until we get to a point of testing representative workloads.
            Hide
            jbosch Jim Bosch added a comment -

            Andy Salnikov, yes, it's another review from me, but this one's really small and since you're already familiar with the context it should hopefully be quite easy.

            PR is https://github.com/lsst/daf_butler/pull/268.

            Show
            jbosch Jim Bosch added a comment - Andy Salnikov , yes, it's another review from me, but this one's really small and since you're already familiar with the context it should hopefully be quite easy. PR is https://github.com/lsst/daf_butler/pull/268 .
            Hide
            salnikov Andy Salnikov added a comment -

            Looks OK, though I'm not sure that we do need indices on all foreign keys. It should not hurt much (but it will some) and we'll need to look at performance later with realistic scale of data in the registry.

            Show
            salnikov Andy Salnikov added a comment - Looks OK, though I'm not sure that we do need indices on all foreign keys. It should not hurt much (but it will some) and we'll need to look at performance later with realistic scale of data in the registry.

              People

              Assignee:
              jbosch Jim Bosch
              Reporter:
              jbosch Jim Bosch
              Reviewers:
              Andy Salnikov
              Watchers:
              Andy Salnikov, Christopher Stephens [X] (Inactive), Jim Bosch, Tim Jenness
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  Jenkins

                  No builds found.