# Test Cassandra APDB with Scylla server


#### Details

• Type: Story
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s: None
• Labels:
• Story Points: 8
• Sprint: DB_S20_02
• Team: Data Access and Database
• Urgent?: No

#### Description

Scylla is a drop-in replacement for Cassandra re-implemented in C++. We want to try it in the hope that its performance will be better and that it avoids the GC issues of Java.

#### Attachments

1. apdb-scylla1-nb-time-select-fit.png (89 kB)
2. apdb-scylla1-nb-time-select-fsrc.png (49 kB)
3. apdb-scylla1-nb-time-select-obj.png (58 kB)
4. apdb-scylla1-nb-time-select-src.png (56 kB)
5. apdb-scylla1-nb-time-store-fit.png (77 kB)
6. apdb-scylla1-time-select.png (65 kB)
7. apdb-scylla1-time-store.png (64 kB)
8. apdb-scylla2-nb-time-select-fit.png (95 kB)
9. apdb-scylla2-nb-time-select-fsrc.png (60 kB)
10. apdb-scylla2-nb-time-select-obj.png (52 kB)
11. apdb-scylla2-nb-time-select-src.png (59 kB)
12. apdb-scylla2-nb-time-store-fit.png (79 kB)
13. apdb-scylla2-nb-time-totals.png (87 kB)
14. apdb-scylla2-time-select.png (94 kB)
15. apdb-scylla2-time-store.png (113 kB)

#### Activity

Andy Salnikov added a comment -

Some notes on the testing setup.

Installation is relatively simple: after registration they sent me a link to installation instructions containing the URL of a yum repo config file (which needs to be copied to /etc/yum.repos.d/scylla.repo). After adding the repo, a simple sudo yum install scylla installs it. One issue with the install is that Scylla conflicts with Cassandra, so Cassandra needs to be removed first (yum remove cassandra cassandra-tools).
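Consolidated, the install sequence boiled down to this (a sketch; the scylla.repo file is the one from the e-mailed link):

```
# Install sketch; scylla.repo comes from the registration e-mail
sudo yum remove cassandra cassandra-tools        # Scylla conflicts with Cassandra
sudo cp scylla.repo /etc/yum.repos.d/scylla.repo
sudo yum install scylla
```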

Scylla claims to be a drop-in replacement for Cassandra, so some applications, such as nodetool, are the same. On the client side the same Python driver (from pip install cassandra-driver) works almost without issues.
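The driver connects exactly as it would to Cassandra; a minimal connectivity sketch (the host names are placeholders for the cluster nodes):

```python
# Minimal connectivity check with the stock Cassandra Python driver
# (pip install cassandra-driver); contact points are placeholders.
from cassandra.cluster import Cluster

cluster = Cluster(contact_points=["scylla-node1", "scylla-node2"], port=9042)
session = cluster.connect()
row = session.execute("SELECT release_version FROM system.local").one()
print(row.release_version)
cluster.shutdown()
```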

## Configuration

Scylla's main configuration file is /etc/scylla/scylla.yaml, but there is also /etc/sysconfig/scylla-server, which I had to modify.

scylla.yaml has practically the same content as cassandra.yaml, with all the same options, though I discovered that some options do not work at all and some work in unexpected ways. Documentation for the configuration files exists but is not as extensive as Cassandra's: https://docs.scylladb.com/operating-scylla/scylla-yaml/

Before starting the instance on a node one also has to run the sudo scylla_setup command, which collects some system info, updates the system setup (it disables selinux, among other things), and writes a few config files to the /etc/scylla.d/ folder.

## Problems with initial setup

I tried to configure it the same way as Cassandra, using separate SSD disks for storage by giving a list of directories to the data_file_directories option in the YAML file. Scylla accepted that option but failed to start with obscure errors like:

```
Apr 29 23:22:24 lsst-qserv-master03 scylla[841750]: [shard 0] database - Exception while populating keyspace 'system_schema' with column family 'computed_columns' from file '/local_data/apdb2/system_schema/computed_columns-cc7c7069374033c192a4c3de78dbd2c4': std::filesystem::__cxx11::filesystem_error (error system:2, filesystem error: open failed: No such file or directory [/local_data/apdb2/system_schema/computed_columns-cc7c7069374033c192a4c3de78dbd2c4/staging])
Apr 29 23:22:24 lsst-qserv-master03 scylla[841750]: [shard 0] database - Exception while populating keyspace 'system_schema' with column family 'view_virtual_columns' from file '/local_data/apdb2/system_schema/view_virtual_columns-08843b6345dc3be29798a0418295cfaa': std::filesystem::__cxx11::filesystem_error (error system:2, filesystem error: open failed: No such file or directory [/local_data/apdb2/system_schema/view_virtual_columns-08843b6345dc3be29798a0418295cfaa/staging])
```

Some googling revealed that data_file_directories does not actually support multiple data directories; all data has to be in a single directory. I found a ticket from 2016 asking for support for multiple disks/folders, but later comments on that ticket suggest this is not going to be implemented (I added a comment myself, but the response was unhelpful).

Because of that, my initial test uses only a single disk for data (plus one separate disk for the commit log). The first test is done with one replica, so I think there will be enough space for a reasonable comparison with the 6-month Cassandra run, which also used one replica. For later tests I'll probably need to make a RAID0 array out of all the disks (an mdadm array on master03/02).
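For the record, the storage layout in scylla.yaml ended up looking roughly like this (a sketch, not the full file; paths are the mounts used in this first test):

```
# scylla.yaml -- storage excerpt (a sketch; unlike Cassandra, Scylla
# effectively honors only a single data_file_directories entry)
data_file_directories:
    - /local_data/apdb2
commitlog_directory: /local_data/apdb1
```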

## Monitoring issues

Scylla is not Java, but they provide JMX access to control and monitor the whole thing (JMX is implemented as a separate service, which I believe is a Java app that translates JMX into some other protocol used by the main scylla server). With JMX I could use the same monitoring Jython script that I wrote for Cassandra. Except that it did not work: apparently some methods are not implemented 100% compatibly in that JMX translation layer. With a simple hack I managed to make it work, though. JMX is also used by the tools that Scylla did not bother to rewrite (e.g. nodetool), so there is a bunch of Java tools talking JMX that gets translated by the intermediate service; sort of ugly.

## Memory issues

My ap_proto ran OK with Scylla without any changes, using the cassandra driver module. After running for a while I noticed heavy swapping on the server machines; they were very unresponsive at that point. Apparently Scylla assumes it is the only service worth running on the system and is not good at sharing physical memory with other services. We run a few other services on those machines (GPFS, docker) which tend to consume quite a bit of memory too, so they start fighting and swapping. Fortunately there is a simple way to tell Scylla how much memory it is allowed to use: add the --memory NNNG option to the command line in the /etc/sysconfig/scylla-server file. I have given it 128GB; with that setting grafana shows memory utilization at 65%.
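A sketch of that change, assuming the SCYLLA_ARGS variable that the scylla-server service reads (the other default flags on that line are omitted here):

```
# /etc/sysconfig/scylla-server -- excerpt (a sketch; existing default
# flags on this line are omitted)
SCYLLA_ARGS="--memory 128G"
```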

## Connection issues

Most of the time clients manage to connect to the server without issues, but sometimes I see warnings of this type on the client side:

```
2020-05-05 09:09:35,156 [WARNING] cassandra.cluster: [control connection] Error connecting to 141.142.181.163:9042:
Traceback (most recent call last):
  File "cassandra/cluster.py", line 3407, in cassandra.cluster.ControlConnection._reconnect_internal
  File "cassandra/cluster.py", line 3464, in cassandra.cluster.ControlConnection._try_connect
  File "cassandra/cluster.py", line 3461, in cassandra.cluster.ControlConnection._try_connect
  File "cassandra/cluster.py", line 3563, in cassandra.cluster.ControlConnection._refresh_schema
  File "cassandra/metadata.py", line 142, in cassandra.metadata.Metadata.refresh
  File "cassandra/metadata.py", line 165, in cassandra.metadata.Metadata._rebuild_all
  File "cassandra/metadata.py", line 2339, in get_all_keyspaces
  File "cassandra/metadata.py", line 1850, in get_all_keyspaces
  File "cassandra/metadata.py", line 2536, in cassandra.metadata.SchemaParserV3._query_all
  File "cassandra/connection.py", line 981, in cassandra.connection.Connection.wait_for_responses
  File "cassandra/connection.py", line 979, in cassandra.connection.Connection.wait_for_responses
  File "cassandra/connection.py", line 1431, in cassandra.connection.ResponseWaiter.deliver
cassandra.OperationTimedOut: errors=None, last_host=None
```

When this warning appears it appears for many clients (maybe for each one of the 189 clients). On the server side this approximately matches messages of this kind:

```
May 05 09:09:37 lsst-qserv-master04 scylla[73986]: [shard 15] cql_server - exception while processing connection: std::system_error (error system:32, sendmsg: Broken pipe)
May 05 09:09:37 lsst-qserv-master04 scylla[73986]: [shard 14] cql_server - exception while processing connection: std::system_error (error system:32, sendmsg: Broken pipe)
May 05 09:09:37 lsst-qserv-master04 scylla[73986]: [shard 5] cql_server - exception while processing connection: std::system_error (error system:32, sendmsg: Broken pipe)
May 05 09:09:37 lsst-qserv-master04 scylla[73986]: [shard 2] cql_server - exception while processing connection: std::system_error (error system:32, sendmsg: Broken pipe)
May 05 09:09:37 lsst-qserv-master04 scylla[73986]: [shard 18] cql_server - exception while processing connection: std::system_error (error system:32, sendmsg: Broken pipe)
May 05 09:09:37 lsst-qserv-master04 scylla[73986]: [shard 16] cql_server - exception while processing connection: std::system_error (error system:32, sendmsg: Broken pipe)
May 05 09:09:37 lsst-qserv-master04 scylla[73986]: [shard 5] cql_server - exception while processing connection: std::system_error (error system:32, sendmsg: Broken pipe)
May 05 09:09:37 lsst-qserv-master04 scylla[73986]: [shard 6] cql_server - exception while processing connection: std::system_error (error system:32, sendmsg: Broken pipe)
```

or

```
May 05 09:09:43 lsst-qserv-master02 scylla[70807]: [shard 2] storage_proxy - Exception when communicating with 141.142.181.129: seastar::named_semaphore_timed_out (Semaphore timed out: _system_read_concurrency_sem)
May 05 09:09:43 lsst-qserv-master02 scylla[70807]: [shard 3] storage_proxy - Exception when communicating with 141.142.181.129: seastar::named_semaphore_timed_out (Semaphore timed out: _system_read_concurrency_sem)
May 05 09:09:43 lsst-qserv-master02 scylla[70807]: [shard 4] storage_proxy - Exception when communicating with 141.142.181.129: seastar::named_semaphore_timed_out (Semaphore timed out: _system_read_concurrency_sem)
```

The timestamps do not align exactly for the latter messages, though.

These warnings do not seem to be fatal, though; the clients appear to be able to either reconnect or perhaps use a different coordinator node, I am not sure.
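If these timeouts ever become a real problem, they can be relaxed on the client side; a sketch with the Python driver (the numeric values are illustrative, not tuned):

```python
# Raising driver timeouts to ride out slow control-connection queries;
# the timeout values here are illustrative guesses, not tuned settings.
from cassandra.cluster import Cluster

cluster = Cluster(
    contact_points=["141.142.181.162", "141.142.181.163", "141.142.181.129"],
    connect_timeout=30,              # seconds for the initial socket connect
    control_connection_timeout=30,   # seconds for control-connection requests
)
session = cluster.connect("apdb")
session.default_timeout = 120        # per-request timeout used by execute()
```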

Andy Salnikov added a comment -

Quick results from the first test with Scylla:

• code name scylla1
• one instance per machine
• about 150k visits
• no replication
• data goes to a single disk (commitlog goes to a separate disk)
• memory use is set to 128GB for each instance

Metrics from the servers (from JMX) look odd, and I am not sure how to interpret them, so I'll ignore them.

Data sizes:

```
$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load      Tokens  Owns  Host ID                               Rack
UN  141.142.181.162  1.06 TB   256     ?     c8797aed-03be-4cfc-87df-457a5b0006fe  rack1
UN  141.142.181.163  1 TB      256     ?     a7fcc51c-1d44-44bc-8529-bb079896e727  rack1
UN  141.142.181.129  971.7 GB  256     ?     9b35785d-8eee-41b7-ae2c-3316dbae8d90  rack1
```

Disk usage on master02 (141.142.181.129):

```
/dev/sdb1  1.4T  261G  1.1T  20%  /local_data/apdb1   <-- commitlog
/dev/sdd1  1.4T  972G  368G  73%  /local_data/apdb2   <-- data
/dev/sde1  1.4T   34M  1.4T   1%  /local_data/apdb3
/dev/sdf1  1.4T   34M  1.4T   1%  /local_data/apdb4
```

Disk usage on master03 (141.142.181.162):

```
/dev/nvme0n1p1  3.7T  262G  3.4T   8%  /local_data/apdb1   <-- commitlog
/dev/nvme1n1p1  3.7T  1.1T  2.6T  30%  /local_data/apdb2   <-- data
/dev/nvme2n1p1  3.7T   33M  3.7T   1%  /local_data/apdb3
/dev/nvme3n1p1  3.7T   33M  3.7T   1%  /local_data/apdb4
/dev/nvme4n1p1  3.7T   33M  3.7T   1%  /local_data/apdb5
```

Disk usage on master04 (141.142.181.163):

```
/dev/nvme0n1p1  3.7T  262G  3.4T   8%  /local_data/apdb1
/dev/nvme1n1p1  3.7T  1.1T  2.7T  28%  /local_data/apdb2
/dev/nvme2n1p1  3.7T   33M  3.7T   1%  /local_data/apdb3
/dev/nvme3n1p1  3.7T   33M  3.7T   1%  /local_data/apdb4
/dev/nvme4n1p1  3.7T   33M  3.7T   1%  /local_data/apdb5
```

Query timing measured on the client side as a function of wall clock time (as usual this is an average over 189 CCDs in one visit); the storing and reading plots are in the attached apdb-scylla1-time-store.png and apdb-scylla1-time-select.png. This is, I think, close to or slightly worse than Cassandra with one replica.

Notebook plots (as a function of visit number), i.e. the select time for each of the three tables and the combined fit, are in the attached apdb-scylla1-nb-* plots.
Andy Salnikov added a comment - edited

I want to test Scylla with three replicas; for that I need to increase usable disk space by merging all the SSD space together. I think the easiest way is to configure an LVM RAID0 logical volume. For master02, where we have hardware RAID, we could probably go back to RAID0 in hardware, but I don't want to spend much time on that, and I think the resulting performance would probably be identical to LVM.

To make an LVM raid0 (a.k.a. striping) volume (the full sequence is consolidated in the sketch after this list):

• unmount all existing volumes, comment/remove from fstab too
• make a PV out of each one (e.g. pvcreate /dev/nvme0n1p1)
• make a volume group out of all of them: vgcreate VGapdb /dev/nvme?n1p1
• make striping LV: lvcreate --type raid0 --stripes 5 -l 100%FREE VGapdb --name LVapdb, default stripe size is 64kB
• make filesystem: mkfs.xfs -f /dev/VGapdb/LVapdb, defaults should be ok, XFS is good at guessing best parameters
• mount it at /local_data/apdb1, change ownership
• update Scylla config to use /local_data/apdb1 for everything
• also needed to remove the now-stale mount-point entries from /etc/scylla.d/io_properties.yaml
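Consolidated, the sequence looks roughly like this (a sketch using the master03 device names; the scylla:scylla ownership is my assumption about the service user):

```
# Sketch of the striped-LV setup (device names from master03)
umount /local_data/apdb2                  # repeat for each data mount; drop from /etc/fstab too
pvcreate /dev/nvme1n1p1                   # repeat for each SSD partition
vgcreate VGapdb /dev/nvme?n1p1
lvcreate --type raid0 --stripes 5 -l 100%FREE VGapdb --name LVapdb
mkfs.xfs -f /dev/VGapdb/LVapdb            # XFS defaults pick sane stripe parameters
mount /dev/VGapdb/LVapdb /local_data/apdb1
chown scylla:scylla /local_data/apdb1     # ownership for the scylla user (assumed)
```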

Here is what scylla_io_setup reports for I/O performance in the new and old configurations (the old one was for a single disk):

All measurements are for mountpoint /local_data/apdb1:

| Machine | New write_iops | New write_bandwidth | Old write_iops | Old write_bandwidth |
|---|---|---|---|---|
| master02 | 12,041 | 4,479,416,320 | 6,663 | 1,345,401,728 |
| master03 | 1,131,395 | 8,546,787,840 | 665,117 | 2,792,875,264 |
| master04 | 1,105,646 | 8,617,294,848 | 668,418 | 2,813,575,168 |

And creating keyspace:

```
cqlsh> create KEYSPACE apdb WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '3'};
```

Andy Salnikov added a comment - edited

I stopped the second round of Scylla tests after 178k visits; here is the summary for that test.

Setup:

• code name scylla2
• replication factor 3
• all data disks configured in RAID0 as described above
• memory limit 128GB on each node

Data sizes after 178k visits:

```
$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load     Tokens  Owns  Host ID                               Rack
UN  141.142.181.162  3.56 TB  256     ?     a0bd6468-ed88-452f-b3e7-471274b9dd37  rack1
UN  141.142.181.163  3.56 TB  256     ?     e3783b28-4d34-4ca3-8c60-ada87c1532e4  rack1
UN  141.142.181.129  3.56 TB  256     ?     66ff1086-c01f-4b26-8c80-5440bf26fa6d  rack1
```

Disk usage on master02:

```
$ df -h /local_data/apdb1; du -sh /local_data/apdb1/{commitlog,apdb}
Filesystem                 Size  Used  Avail  Use%  Mounted on
/dev/mapper/VGapdb-LVapdb  5.3T  3.9T  1.5T   73%   /local_data/apdb1
259G  /local_data/apdb1/commitlog
3.6T  /local_data/apdb1/apdb
```

Disk usage on master03:

```
$ df -h /local_data/apdb1; du -sh /local_data/apdb1/{commitlog,apdb}
Filesystem                 Size  Used  Avail  Use%  Mounted on
/dev/mapper/VGapdb-LVapdb  19T   3.9T  15T    21%   /local_data/apdb1
259G  /local_data/apdb1/commitlog
3.6T  /local_data/apdb1/apdb
```

Disk usage on master04:

```
$ df -h /local_data/apdb1; du -sh /local_data/apdb1/{commitlog,apdb}
Filesystem                 Size  Used  Avail  Use%  Mounted on
/dev/mapper/VGapdb-LVapdb  19T   3.9T  15T    21%   /local_data/apdb1
259G  /local_data/apdb1/commitlog
3.6T  /local_data/apdb1/apdb
```

In this round of tests I tried to accelerate testing by skipping the reading of Dia(Forced)Source for a large fraction of visits (90%). This could potentially affect the measured performance, as the workload on the server side becomes somewhat different. This mode was enabled around visit 64k. I did not see any significant discontinuities in the plots due to it, but the effect was noticeable, especially on writing time, I think due to the higher load when reading is disabled.
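The sampling was equivalent to something like this on the client side (a hypothetical sketch; ap_proto's actual option names differ):

```python
# Hypothetical sketch of the read-sampling logic: sources are read only
# for the first 100 visits out of every 1000 (i.e. 90% of reads skipped).
def should_read_sources(visit_id: int) -> bool:
    return visit_id % 1000 < 100
```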

Here are the standard grafana plots for write and read times (as usual, averaged per CCD):

The top plot shows the effect of disabling the reading of sources, which causes many more writes per unit of time and apparently reduces performance for the duration (around 2020-05-19 15:30).

The second plot has a discontinuity around 2020-05-20 20:00 (visit ~90k); at that time I decided to stop and restart the cluster. There were GPFS issues, and the monitoring jobs that were writing logs to GPFS crashed (as did the client batch jobs). Scylla itself does not need GPFS, but I decided to flush the data and restart the cluster anyway (without compaction). Mysteriously, performance improved significantly after the restart. I thought the JMX monitoring could have a negative impact on performance, since I had not restarted monitoring after restarting the cluster, but when I later started monitoring again it showed no effect on performance. There are also significant fluctuations in read performance, something not observed in the single-replica plots above.

In general, read performance feels slightly worse than for a single replica, but not significantly so.

And the same set of plots from the notebook, with timing as a function of visit.

Select time for each of the three tables:

And the combined fits:

Raw numbers, averaged over CCDs, for total read and write times as a function of visit (only the first 100 out of every 1000 visits also read sources):

(Do not try to draw any conclusions from this plot.)

I think the performance looks reasonable even for the current, totally inadequate cluster setup, and scaling the cluster up should bring the numbers into a more reasonable range.

Andy Salnikov added a comment -

I think I have messed with Scylla enough. There is no clear conclusion yet, but it definitely has some issues. At this point I'm going to close this ticket and go back to Cassandra to reproduce the same setup - three replicas, without docker, with just three instances; it will be interesting to compare that with Scylla.


#### People

Assignee:
Andy Salnikov
Reporter:
Andy Salnikov
Watchers:
Andy Hanushevsky, Andy Salnikov, Colin Slater, Fritz Mueller