Details
- Type: Story
- Status: Done
- Resolution: Done
- Fix Version/s: None
- Component/s: None
- Labels:
- Story Points: 8
- Epic Link:
- Sprint: DB_S21_12
- Team: Data Access and Database
- Urgent?: No
Description
With a baseline established by DM-28136 it's time to move on and see how things scale with an extended cluster size. For this round of tests I want to use 6 nodes on the server side; the nodes are identical to the ones already defined. I don't want a very large data set yet; with the existing SSD config we should be OK to run ~120k visits.
Activity
Starting a post mortem for this test. It was a partial success, but there were hardware problems with one of the server machines (a bad disk) which caused some issues. Still, some results are usable and some conclusions can be drawn even so.
I planned to run this test with 100k visits, but at some point the client log started showing timeouts and exceptions and things started to run very slowly, so I had to stop the test to try to understand what was going on. Apparently there was a problem with one of the four SSDs on the apdb-server-6 machine, which caused all sorts of problems for Cassandra.
The first SSD errors appeared in kern.log at 2020-12-25 00:15:57Z:
Dec 25 00:15:57 apdb-server-6 kernel: [172184.641277] blk_update_request: critical medium error, dev nvme0n1, sector 400547320 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 0
Dec 25 00:15:57 apdb-server-6 kernel: [172184.702759] blk_update_request: critical medium error, dev nvme0n1, sector 400592392 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 0
Dec 25 00:16:18 apdb-server-6 kernel: [172206.246250] blk_update_request: critical medium error, dev nvme0n1, sector 400575952 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 0
Dec 25 00:16:19 apdb-server-6 kernel: [172206.287305] blk_update_request: critical medium error, dev nvme0n1, sector 400571912 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 0
I think the first errors appeared in the client log at 2020-12-25 00:16:31Z, during visit 75923:
2020-12-25 00:16:31,995 [ERROR] ap_proto: Exception in visit processing: errors={'10.128.0.30:9042': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=10.128.0.30:9042
Traceback (most recent call last):
  File "/net/10.65.206.122/apdb_vol1/apdb/apdb-gcloud/l1dbproto/python/lsst/l1dbproto/ap_proto.py", line 429, in run_mpi_tile
    self.visit(db, visit_id, dt, region, sources, indices, tile)
  File "/net/10.65.206.122/apdb_vol1/apdb/apdb-gcloud/l1dbproto/python/lsst/l1dbproto/ap_proto.py", line 521, in visit
    db.storeDiaObjects(objects, dt, pos_func.pos_func_diaobj)
  File "/net/10.65.206.122/apdb_vol1/apdb/apdb-gcloud/dax_apdb/python/lsst/dax/apdb/apdbCassandra.py", line 610, in storeDiaObjects
    self._storeObjectsAfw(objs, "DiaObjectLast", dt, pos_func, extra_columns=extra_columns)
  File "/net/10.65.206.122/apdb_vol1/apdb/apdb-gcloud/dax_apdb/python/lsst/dax/apdb/apdbCassandra.py", line 854, in _storeObjectsAfw
    self._session.execute(queries)
  File "cassandra/cluster.py", line 2618, in cassandra.cluster.Session.execute
  File "cassandra/cluster.py", line 4877, in cassandra.cluster.ResponseFuture.result
cassandra.OperationTimedOut: errors={'10.128.0.30:9042': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=10.128.0.30:9042
Cluster state after I stopped the test:
# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.128.0.116 784.44 GiB 256 ? 9dea29d2-81e1-47f7-b090-790273382b5f rack1
UN 10.128.0.114 907.19 GiB 256 ? dc3c9e53-b336-4cde-9a68-0c1ecc92aa18 rack1
UN 10.128.0.115 914.94 GiB 256 ? 36ddf99b-742a-48d4-a1b7-e82c8140d189 rack1
UN 10.128.0.110 870.05 GiB 256 ? ba56d50c-eb46-407b-9a9f-45452a11550b rack1
UN 10.128.0.34 817.03 GiB 256 ? 5b298a67-c59b-4b3c-8e9a-00263e14d0a6 rack1
UN 10.128.0.30 894.46 GiB 256 ? e0be23ed-2516-45c5-88c5-bbdf9ffe9481 rack1
I think the first 75k visits should be OK, and I will draw some results from that sample.
Summary of the results for this round of tests.
Setup:
- 6 server nodes, 32 cores, 64GB RAM each
- 4x375GiB local SSD storage on each node
- replication factor 3, QUORUM consistency for reads and writes (see the configuration sketch after this list)
- 6 client machines, 32 cores each
- 75k visits generated, storage errors after that
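For reference, here is a minimal sketch of how this replication and consistency setup maps onto the Python cassandra-driver API. This is not the actual dax_apdb configuration code; the contact point, the keyspace name "apdb", and the SimpleStrategy choice are assumptions.

# hypothetical configuration sketch; contact point, keyspace name and
# replication strategy are assumptions, not copied from the real setup
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT

# default execution profile with QUORUM consistency for both reads and writes
profile = ExecutionProfile(consistency_level=ConsistencyLevel.QUORUM)
cluster = Cluster(contact_points=["10.128.0.30"],
                  execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect()

# keyspace with replication factor 3 (single datacenter, so SimpleStrategy)
session.execute(
    "CREATE KEYSPACE IF NOT EXISTS apdb "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}"
)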
Select time as a function of visit number:
Store time:
Comparing read performance with the 3-node cluster; the numbers are per-CCD real time per query.
| | 3 nodes | 6 nodes |
|---|---|---|
| obj_select_real at 50k visits, sec | 0.52 | 0.32 |
| src_select_real at 50k visits, sec | 2.8 | 2.1 |
| fsrc_select_real at 50k visits, sec | 1.4 | 0.7 |
There is some improvement in the numbers, but it is not the expected 2x improvement. I think that is because there is significant overhead on the client side for converting query results into their final format. Here are the CPU times for the corresponding queries:
| | 3 nodes | 6 nodes |
|---|---|---|
| obj_select_cpu at 50k visits, sec | 0.17 | 0.16 |
| src_select_cpu at 50k visits, sec | 0.94 | 1.13 |
| fsrc_select_cpu at 50k visits, sec | 0.43 | 0.49 |
A better measure of server-side performance may be the difference between real and CPU times:
| | 3 nodes | 6 nodes |
|---|---|---|
| obj_select at 50k visits, sec | 0.35 | 0.16 |
| src_select at 50k visits, sec | 1.9 | 0.97 |
| fsrc_select at 50k visits, sec | 0.97 | 0.21 |
Here the improvement is more dramatic, a factor of 2 or better (though the numbers are not very precise, of course).
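For context, here is a minimal sketch of the kind of timing behind the real/CPU split above (this is not the actual ap_proto measurement code): wall-clock time includes waiting on the Cassandra servers, while process CPU time covers only client-side work such as result conversion, so their difference approximates server plus network time.

import time

def timed_query(session, query):
    # returns (result, real, cpu, real - cpu), all times in seconds
    real0, cpu0 = time.time(), time.process_time()
    result = session.execute(query)
    real = time.time() - real0
    cpu = time.process_time() - cpu0
    # real - cpu approximates time spent waiting on the server and the network
    return result, real, cpu, real - cpu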
I think I need to optimize the client side to make this overhead much smaller. I am currently converting results to afw.Table like I did in the older SQL code, but I think AP really wants to use pandas. I did not try to optimize the afw conversion at all, though there may be ways to make it faster, and for pandas I have found one recipe for optimal Cassandra performance: https://groups.google.com/a/lists.datastax.com/g/python-driver-user/c/1v-KHtyA0Zs
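Roughly, the recipe looks like the sketch below (my reading of that thread, not the code used in these tests); it assumes the Cython-compiled cassandra-driver with NumPy support, and the contact point, keyspace and query are placeholders.

import pandas as pd
from cassandra.cluster import Cluster
from cassandra.protocol import NumpyProtocolHandler
from cassandra.query import tuple_factory

cluster = Cluster(["10.128.0.30"])
session = cluster.connect("apdb")  # keyspace name is a placeholder

# decode result pages directly into NumPy arrays instead of per-row Python
# objects; tuple_factory is needed when NumpyProtocolHandler is used
session.row_factory = tuple_factory
session.client_protocol_handler = NumpyProtocolHandler

rs = session.execute('SELECT * FROM "DiaObjectLast"')  # placeholder query
# each fetched page comes back as a dict of column name -> NumPy array
df = pd.concat((pd.DataFrame(page) for page in rs), ignore_index=True)

The attraction is that per-row Python object construction is skipped, which is where most of the client-side CPU time seems to go.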
One more interesting observation: looking at JMX metrics from Cassandra I noticed that write latency behaves differently on different nodes, e.g. DiaSource write latency:
On this plot one node (apdb-server-1) has lower latency than all other nodes, and one node (apdb-server-4) has significantly higher latency. This looks odd, as my impression is that all nodes should look more or less similar. I am not sure what it means; it is just a note to possibly investigate later.
I implemented pandas support in the Cassandra APDB implementation and added a more efficient transform of the query results to pandas. With that I ran the same setup with 100k visits; results will be added below. Here is the cluster status right before I stopped the server nodes:
andy_salnikov_gmail_com@apdb-server-1:~$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.128.0.116 1008.36 GiB 256 ? 222e6f35-e667-4cba-a507-a45c7ed804cc rack1
UN 10.128.0.114 1.03 TiB 256 ? 5d4b4b7b-6d75-4cb6-a419-ec7c4a518b29 rack1
UN 10.128.0.115 987.94 GiB 256 ? 861dff58-1ce4-415e-94ac-3cf770180d90 rack1
UN 10.128.0.110 968.93 GiB 256 ? a35b0ec7-9940-4dc3-bced-0a842abb23f8 rack1
UN 10.128.0.34 947.9 GiB 256 ? 1dff0ff1-530c-4d3d-aee8-54cfcf9c7c0e rack1
UN 10.128.0.30 1017.47 GiB 256 ? 70e974f9-d135-42af-8f11-85066c705198 rack1
And data disk usage:
(lsst-scipipe-4f18ecb) [andy_salnikov_gmail_com@apdb-client-1 apdb-gcloud]$ shmux -c "df -h /data/apdb*" -- apdb-server-{1..6}
apdb-server-1: Filesystem Size Used Avail Use% Mounted on
apdb-server-1: /dev/nvme0n1p1 375G 263G 113G 71% /data/apdb1
apdb-server-1: /dev/nvme0n2p1 375G 264G 112G 71% /data/apdb2
apdb-server-1: /dev/nvme0n3p1 375G 264G 112G 71% /data/apdb3
apdb-server-1: /dev/nvme0n4p1 375G 262G 113G 70% /data/apdb4
apdb-server-5: Filesystem Size Used Avail Use% Mounted on
apdb-server-5: /dev/nvme0n1p1 375G 270G 106G 72% /data/apdb1
apdb-server-5: /dev/nvme0n2p1 375G 268G 108G 72% /data/apdb2
apdb-server-5: /dev/nvme0n3p1 375G 268G 107G 72% /data/apdb3
apdb-server-5: /dev/nvme0n4p1 375G 269G 107G 72% /data/apdb4
apdb-server-6: Filesystem Size Used Avail Use% Mounted on
apdb-server-6: /dev/nvme0n1p1 375G 281G 95G 75% /data/apdb1
apdb-server-6: /dev/nvme0n2p1 375G 273G 103G 73% /data/apdb2
apdb-server-6: /dev/nvme0n3p1 375G 273G 103G 73% /data/apdb3
apdb-server-6: /dev/nvme0n4p1 375G 274G 102G 73% /data/apdb4
apdb-server-2: Filesystem Size Used Avail Use% Mounted on
apdb-server-2: /dev/nvme0n1p1 375G 277G 99G 74% /data/apdb1
apdb-server-2: /dev/nvme0n2p1 375G 274G 102G 73% /data/apdb2
apdb-server-2: /dev/nvme0n3p1 375G 276G 100G 74% /data/apdb3
apdb-server-2: /dev/nvme0n4p1 375G 279G 97G 75% /data/apdb4
apdb-server-4: Filesystem Size Used Avail Use% Mounted on
apdb-server-4: /dev/nvme0n1p1 375G 288G 88G 77% /data/apdb1
apdb-server-4: /dev/nvme0n2p1 375G 286G 89G 77% /data/apdb2
apdb-server-4: /dev/nvme0n3p1 375G 287G 89G 77% /data/apdb3
apdb-server-4: /dev/nvme0n4p1 375G 287G 89G 77% /data/apdb4
apdb-server-3: Filesystem Size Used Avail Use% Mounted on
apdb-server-3: /dev/nvme0n1p1 375G 265G 111G 71% /data/apdb1
apdb-server-3: /dev/nvme0n2p1 375G 258G 118G 69% /data/apdb2
apdb-server-3: /dev/nvme0n3p1 375G 258G 118G 69% /data/apdb3
apdb-server-3: /dev/nvme0n4p1 375G 254G 121G 68% /data/apdb4
Summary of the results:
- setup is the same as above
- switched to pandas, client CPU time looks much better
- ran 100k visits, no issues with storage on any node
Select time as a function of visit:
and corresponding CPU time on client side for selects:
Store time looks slightly worse; it is not clear why yet, but the effect is not too big (it may be due to interaction between reads and writes):
Some tables:
Real time for selects:
| | 3 nodes | 6 nodes, afw | 6 nodes, pandas |
|---|---|---|---|
| obj_select_real at 50k visits, sec | 0.52 | 0.32 | 0.13 |
| src_select_real at 50k visits, sec | 2.8 | 2.1 | 1.03 |
| fsrc_select_real at 50k visits, sec | 1.4 | 0.7 | 0.51 |
CPU time:
| | 3 nodes | 6 nodes, afw | 6 nodes, pandas |
|---|---|---|---|
| obj_select_cpu at 50k visits, sec | 0.17 | 0.16 | 0.014 |
| src_select_cpu at 50k visits, sec | 0.94 | 1.13 | 0.217 |
| fsrc_select_cpu at 50k visits, sec | 0.43 | 0.49 | 0.105 |
Difference between real and CPU times:
| | 3 nodes | 6 nodes, afw | 6 nodes, pandas |
|---|---|---|---|
| obj_select at 50k visits, sec | 0.35 | 0.16 | 0.12 |
| src_select at 50k visits, sec | 1.9 | 0.97 | 1.01 |
| fsrc_select at 50k visits, sec | 0.97 | 0.21 | 0.4 |
With that sort of precision I think it's reasonable to say that read performance improves as expected with the cluster size. Time also grows linearly with the visit number, as expected. The next step would be to grow the cluster again and see how it scales, and to try a longer run (one year's worth of visits).
One more observation from JMX monitoring: write latency on one host (apdb-server-2) is again higher than on all other hosts:
System load is also higher on the same host:
It is not clear what is going on; I suspect that one of the nodes is elected as some sort of coordinator, and maybe it is related to that. I will look at the Cassandra logs to see if there is anything interesting there.
I looked at the logs and additional JMX metrics, but there is no smoking gun. I believe all clients are connected to every server, so there should not be a single coordinating server. I'll try to see what happens in the next round of tests; maybe there is something I can see on the live system.
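As a quick sanity check of that assumption, something along these lines could be run on a client node (a sketch, assuming an already connected cassandra-driver session object):

# list the hosts the driver knows about and whether it considers them up;
# with the default load-balancing policy each client keeps connection pools
# to all of them, so coordinator duty should be spread across the cluster
for host in session.cluster.metadata.all_hosts():
    print(host.address, host.is_up)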
Closing this ticket for now, moving to the next test.
OK, here it goes:
$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.128.0.116 553.12 KiB 256 ? 9dea29d2-81e1-47f7-b090-790273382b5f rack1
UN 10.128.0.114 437.55 KiB 256 ? dc3c9e53-b336-4cde-9a68-0c1ecc92aa18 rack1
UN 10.128.0.115 80.01 MiB 256 ? 36ddf99b-742a-48d4-a1b7-e82c8140d189 rack1
UN 10.128.0.110 304.19 KiB 256 ? ba56d50c-eb46-407b-9a9f-45452a11550b rack1
UN 10.128.0.34 382.9 KiB 256 ? 5b298a67-c59b-4b3c-8e9a-00263e14d0a6 rack1
UN 10.128.0.30 232.81 KiB 256 ? e0be23ed-2516-45c5-88c5-bbdf9ffe9481 rack1
Started short run with 1k visits, no problems so far after 50 visits.