Partitioning of the tables.
There are just two major problems that need to be solved to make Cassandra performing efficiently - selecting good partitioning and good ordering of the data (as saved in sstables). This boils down to selecting columns for primary key (partitioning key is the first part of the PK, clustering key is remaining columns).
Selecting one or few partitions out of many is very efficient way to limit mount of data being processed. Cassandra only supports equality relation on partition key or IN operation on the last (or only) column of the partition key. This means one has to usually write multiple queries, e.g. one query par partition if remaining selection differs from partition to partition. This in turn can cause overhead if number of partitions is large.
Selection on clustering key has to be done so that one query selects a contiguous range of keys (from a given partition). This also means that some complex queries will have to be split into multiple queries.
Choosing right keys depends entirely on the query(-ies) that we run on the tables, so I'm going to summarize these.
DiaObject(Last) table
The query for this table is "select latest version of every DiaObject in a region covered by one CCD (or maybe arbitrary region)". If we have a separate DiaObjectLast table then it already contains latest versions only. With only DiaObject table I don't think we can achieve the same thing in a reasonable way - the "last" flag changes and obviously it cannot be a part of the partition or clustering key, we'd have to read a lot of records and filter them which will kill performance quickly.
So far we had spatial queries in PPDB based on HTM index, for Cassandra it makes sense to use the same approach or use other pixelization, e.g. Q3C. For partitioning purposes we wand relatively large "pixels", comparable in size to CCD (see above), so partitioning column could be HTM level=8 for example. The normal selection is still based on based on pixelId (htm20) using ranges of those. Because we select ranges of pixelIds it's natural to use pixelId as a clustering key.
Altogether it may look like
CREATE TABLE DiaObjectLast (
|
partId int, // htm8 of ra, dec
|
pixelId int, // htm20 of ra, dec
|
diaObjectId int,
|
...,
|
primary key (partId, pixelId, diaObjectId)
|
);
|
|
// series of queries for one visit
|
SELECT * from DiaObjectLast WHERE partId = 1 AND pixelId >= 10000 AND pixelId < 12300;
|
SELECT * from DiaObjectLast WHERE partId = 1 AND pixelId >= 14000 AND pixelId < 15200;
|
SELECT * from DiaObjectLast WHERE partId = 2 AND pixelId >= 21000 AND pixelId < 21700;
|
...
|
(with the caveat that HTM indices do not change after position updates, it should be ra/dec of the first version of DiaObject).
One potentially interesting idea is to make pixelId a part of the diaObjectId (higher bits, so that contiguous location is preserved), then we could guarantee that pixelId is stable and cannot change.
Dia(Forced)Source table
There is some freedom for choosing which query is used to retrieve data from these tables - if the set of diaObjectIds is known then one can select based on that (and for relational databases this seems to work better than other approach), otherwise same region-based select as for DiaObjectLast should be used. Additionally we only need to select last several months (12) of sources from these tables.
I think region-based selection should be preferred because it can run in parallel with DiaObjectLast select (if we have enough capacity). That would make partitioning and clustering to be based on htm similarly to DiaObjectLast. The difference is the additional time constraint.
I don't think it would be possible to use timestamp in a clustering key without blowing up the number of queries (2-D space range selection is inherently non-contiguous). I could imagine something like this though which would result in a factor 10 more queries than for above case
CREATE TABLE DiaSource (
|
partId int, // htm8 of ra, dec
|
pixelId int, // htm20 of ra, dec
|
month int, // year and month combined, e.g. 202010 for Oct 2020
|
diaSourceId int,
|
...,
|
primary key (partId, pixelId, month, diaSourceId)
|
);
|
|
// series of queries for one visit, just for two months
|
SELECT * from DiaSource WHERE partId = 1 AND pixelId >= 10000 AND pixelId < 12300 AND month = 202010;
|
SELECT * from DiaSource WHERE partId = 1 AND pixelId >= 10000 AND pixelId < 12300 AND month = 202011;
|
SELECT * from DiaSource WHERE partId = 1 AND pixelId >= 14000 AND pixelId < 15200 AND month = 202010;
|
SELECT * from DiaSource WHERE partId = 1 AND pixelId >= 14000 AND pixelId < 15200 AND month = 202011;
|
SELECT * from DiaSource WHERE partId = 2 AND pixelId >= 21000 AND pixelId < 21700 AND month = 202010;
|
SELECT * from DiaSource WHERE partId = 2 AND pixelId >= 21000 AND pixelId < 21700 AND month = 202011;
|
...
|
Another option is to use "month" as a second part of partition key, Cassandra allows IN operator for the last column of partition key so we can do something like:
CREATE TABLE DiaSource (
|
partId int, // htm8 of ra, dec
|
month int, // year and month combined, e.g. 202010 for Oct 2020
|
pixelId int, // htm20 of ra, dec
|
diaSourceId int,
|
...,
|
primary key ((partId, month), pixelId, diaSourceId)
|
);
|
|
// series of queries for one visit, just for two months
|
SELECT * from DiaSource WHERE partId = 1 AND month IN (202010, 202011, 202012) AND pixelId >= 10000 AND pixelId < 12300;
|
SELECT * from DiaSource WHERE partId = 1 AND month IN (202010, 202011, 202012) AND pixelId >= 14000 AND pixelId < 15200;
|
SELECT * from DiaSource WHERE partId = 2 AND month IN (202010, 202011, 202012) AND pixelId >= 21000 AND pixelId < 21700;
|
...
|
I want to try latter variant to see if that actually works.
Partitioning of the tables.
There are just two major problems that need to be solved to make Cassandra performing efficiently - selecting good partitioning and good ordering of the data (as saved in sstables). This boils down to selecting columns for primary key (partitioning key is the first part of the PK, clustering key is remaining columns).
Selecting one or few partitions out of many is very efficient way to limit mount of data being processed. Cassandra only supports equality relation on partition key or IN operation on the last (or only) column of the partition key. This means one has to usually write multiple queries, e.g. one query par partition if remaining selection differs from partition to partition. This in turn can cause overhead if number of partitions is large.
Selection on clustering key has to be done so that one query selects a contiguous range of keys (from a given partition). This also means that some complex queries will have to be split into multiple queries.
Choosing right keys depends entirely on the query(-ies) that we run on the tables, so I'm going to summarize these.
DiaObject(Last) table
The query for this table is "select latest version of every DiaObject in a region covered by one CCD (or maybe arbitrary region)". If we have a separate DiaObjectLast table then it already contains latest versions only. With only DiaObject table I don't think we can achieve the same thing in a reasonable way - the "last" flag changes and obviously it cannot be a part of the partition or clustering key, we'd have to read a lot of records and filter them which will kill performance quickly.
So far we had spatial queries in PPDB based on HTM index, for Cassandra it makes sense to use the same approach or use other pixelization, e.g. Q3C. For partitioning purposes we wand relatively large "pixels", comparable in size to CCD (see above), so partitioning column could be HTM level=8 for example. The normal selection is still based on based on pixelId (htm20) using ranges of those. Because we select ranges of pixelIds it's natural to use pixelId as a clustering key.
Altogether it may look like
...,
);
...
(with the caveat that HTM indices do not change after position updates, it should be ra/dec of the first version of DiaObject).
One potentially interesting idea is to make pixelId a part of the diaObjectId (higher bits, so that contiguous location is preserved), then we could guarantee that pixelId is stable and cannot change.
Dia(Forced)Source table
There is some freedom for choosing which query is used to retrieve data from these tables - if the set of diaObjectIds is known then one can select based on that (and for relational databases this seems to work better than other approach), otherwise same region-based select as for DiaObjectLast should be used. Additionally we only need to select last several months (12) of sources from these tables.
I think region-based selection should be preferred because it can run in parallel with DiaObjectLast select (if we have enough capacity). That would make partitioning and clustering to be based on htm similarly to DiaObjectLast. The difference is the additional time constraint.
I don't think it would be possible to use timestamp in a clustering key without blowing up the number of queries (2-D space range selection is inherently non-contiguous). I could imagine something like this though which would result in a factor 10 more queries than for above case
...,
);
...
Another option is to use "month" as a second part of partition key, Cassandra allows IN operator for the last column of partition key so we can do something like:
...,
);
...
I want to try latter variant to see if that actually works.