just to add to what Andy said:
I think we can make some solid guesses on indexing but we won't know for sure until we see this functionality in action w/ data volumes and distributions that are representative of production.
If the above two generalized SQLs account for most of the queries hitting collection tables, the UNIQUE constraint should be enough for good performance. You just have to make sure the index has the columns in the right order to get efficient plans for both SQLs (collection_name, dataset_type_id, instrument, detector, visit). That assumes join order of second SQL starts w/ collection table which may not be the case. We'll have to see.
If you plan to return a significant number of rows from collection tables leading to lots of index -> table lookups to get additional dataset_id column, it might be worth adding dataset_id to keep from touching the table at all. I think that would mean an additional index since constraint couldn't be enforced with covering index which obviously slows DML so we'd definitely need to test. Regardless, it's premature to do that.
You definitely don't want a bunch of single column indexes unless you have a lot of SQL with single column predicates or possibly varying subset of predicates against collection_name, dataset_type_id, instrument, detector, visit.
I'd be very cautious of UNION if you are returning much data from UNION'd portion of SQL. That implies an additional sort or hashing to get distinct result set and since hashing is an option, you can't depend on any kind of implied order. If that isn't absolutely necessary, use UNION ALL.
If foreign keys are indexed, that absolutely makes joins more efficient if join order is parent > child. If not, those indexes still might be necessary to prevent locking issues (row-exclusive vs table-exclusive) on DML.
Christopher Stephens [X], Kian-Tat Lim, Andy Salnikov, and other database pundits, I have some questions about what indexes I should be adding to the dataset tables now that (soon; post
DM-21764) I'm in a position to do so.Here's an example of the kind of table we're working with (SQLite dialect):
);
The "instrument", "detector", and "visit" fields are together the data ID fields - different tables will have different combinations of these fields and other similar ones.
The kinds of queries we want to optimize look like this:
SELECT
dataset_id
WHERE
or they'll be a larger query in which this table appears in a subquery, with roughly this form:
SELECT
...
FROM
visit
)
...
I'm eliding some joins to other tables (on dataset_id) that would also appear unless we did some denormalization because I don't think they're relevant for my questions.
Some notes on this, before I get to those questions:
So, the questions: