Fix Version/s: None
Adding the right indexes - once we determine what they are - should be easy after
DM-21764. Stay tuned for a follow-up post on that here.
I presume in the first query it should have
collection_name=? AND dataset_type_id=? AND instrument=? AND detector=? AND visit=?
instead of detector twice.
If the first query is executed separately (not joined with anything) then the index (collection_name, dataset_type, <data-ID-fields>) should be optimal. Unique constraint for simple cases like ours is implemented as a unique index in all backends that we support now, I don't think we need to define separate index for it.
For joins with multiple tables/subqueries it's hard to predict exactly which indices query optimizer will decide to use, and it also depends on the data in the database. If you define (collection_name, dataset_type, <data-ID-fields>) index then this index could already be used for second query if (collection_name, dataset_type) columns are the leading elements in the index. I'd probably not try anything fancier now, for performance we'll likely have to do separate study.
Foreign key situation is more complicated, I think I remembered that some databases do create an implicit index on those but in general foreign key implementation does not require an index, so if you want to index it then it's better to create explicit index (and I think some people say always create an index on foreign key). Whether it helps join performance depends on query and optimizer.
just to add to what Andy said:
I think we can make some solid guesses on indexing but we won't know for sure until we see this functionality in action w/ data volumes and distributions that are representative of production.
If the above two generalized SQLs account for most of the queries hitting collection tables, the UNIQUE constraint should be enough for good performance. You just have to make sure the index has the columns in the right order to get efficient plans for both SQLs (collection_name, dataset_type_id, instrument, detector, visit). That assumes join order of second SQL starts w/ collection table which may not be the case. We'll have to see.
If you plan to return a significant number of rows from collection tables leading to lots of index -> table lookups to get additional dataset_id column, it might be worth adding dataset_id to keep from touching the table at all. I think that would mean an additional index since constraint couldn't be enforced with covering index which obviously slows DML so we'd definitely need to test. Regardless, it's premature to do that.
You definitely don't want a bunch of single column indexes unless you have a lot of SQL with single column predicates or possibly varying subset of predicates against collection_name, dataset_type_id, instrument, detector, visit.
I'd be very cautious of UNION if you are returning much data from UNION'd portion of SQL. That implies an additional sort or hashing to get distinct result set and since hashing is an option, you can't depend on any kind of implied order. If that isn't absolutely necessary, use UNION ALL.
If foreign keys are indexed, that absolutely makes joins more efficient if join order is parent > child. If not, those indexes still might be necessary to prevent locking issues (row-exclusive vs table-exclusive) on DML.
Thanks for the feedback. It sounds like all I should do for now on this ticket is add indexes to the foreign key definitions, since those aren't automatic, and to make sure the UNIQUE constraint is defined in the right order (I assume that order also sets the index order).
I'll just clarify a more points raised above:
assumes join order of second SQL
Is there anything I should be doing in writing these queries to encourage certain join orders? The order in which I actually put tables in the FROM clause is irrelevant, right? (In any case, I assume it's premature to do anything like this, but I don't want to make incorrect assumptions about what can and cannot matter when writing the query.)
If you plan to return a significant number of rows from collection tables leading to lots of index -> table lookups to get additional dataset_id column, it might be worth adding dataset_id to keep from touching the table at all.
These would actually be joins on dataset_id (which is in this table) to get other columns from a different table where dataset_id is the primary key. I assume it's still true that any denormalization to eliminate those joins is premature.
I'd be very cautious of UNION if you are returning much data from UNION'd portion of SQL.
I believe the UNIQUE constraint here actually guarantees that UNION and UNION ALL would return the same results for most (probably all) of these queries, and it should have occurred to me that I should definitely use UNION ALL anyway.
I'm unsure about sqlite but with Postgresql and Oracle, the optimizer determines join order based on statistics so that shouldn't matter.
yeah, i wouldn't worry at all about denormalization yet. i think current actions are sufficient until we get to a point of testing representative workloads.
Andy Salnikov, yes, it's another review from me, but this one's really small and since you're already familiar with the context it should hopefully be quite easy.
Looks OK, though I'm not sure that we do need indices on all foreign keys. It should not hurt much (but it will some) and we'll need to look at performance later with realistic scale of data in the registry.
Christopher Stephens [X], Kian-Tat Lim, Andy Salnikov, and other database pundits, I have some questions about what indexes I should be adding to the dataset tables now that (soon; post
DM-21764) I'm in a position to do so.
Here's an example of the kind of table we're working with (SQLite dialect):
The "instrument", "detector", and "visit" fields are together the data ID fields - different tables will have different combinations of these fields and other similar ones.
The kinds of queries we want to optimize look like this:
or they'll be a larger query in which this table appears in a subquery, with roughly this form:
I'm eliding some joins to other tables (on dataset_id) that would also appear unless we did some denormalization because I don't think they're relevant for my questions.
Some notes on this, before I get to those questions:
So, the questions: