  Data Management / DM-24692

Test Cassandra APDB with Scylla server


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:
    • Story Points:
      8
    • Sprint:
      DB_S20_02
    • Team:
      Data Access and Database
    • Urgent?:
      No

      Description

      Scylla is a drop-in replacement for Cassandra, re-implemented in C++. We want to try it in the hope that its performance will be better and that it avoids the GC issues of Java.

      ScyllaDB page: https://www.scylladb.com/open-source/

        Attachments

          Issue Links

            Activity

            salnikov Andy Salnikov added a comment -

            Some notes on the testing setup.

            Installing ScyllaDB

            Installation is relatively simple: after registration they sent me a link with the installation instructions, which contained the URL for a yum repo config file (it needs to be copied to /etc/yum.repos.d/scylla.repo). After adding the repo, a simple sudo yum install scylla installs it. One issue with the install is that Scylla conflicts with Cassandra, so Cassandra needs to be removed first (yum remove cassandra cassandra-tools).
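
            For reference, the whole sequence roughly amounts to the following (the repo file comes from the link sent after registration, so its URL is not reproduced here):

            $ sudo yum remove -y cassandra cassandra-tools       # Scylla conflicts with the Cassandra packages
            $ sudo cp scylla.repo /etc/yum.repos.d/scylla.repo   # repo file downloaded from the registration link
            $ sudo yum install -y scylla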

            Scylla claims to be a Cassandra drop-in replacement, so some applications such as nodetool are the same. On the client side the same Python driver (from pip install cassandra-driver) works almost without issues.

            Configuration

            Scylla's main configuration file is /etc/scylla/scylla.yaml, but there is also /etc/sysconfig/scylla-server, which I had to modify.

            scylla.yaml has practically the same content as cassandra.yaml, with all the same options, though I discovered that some options do not work at all and some work in an unexpected way. Documentation for the configuration file exists but is not as extensive as Cassandra's: https://docs.scylladb.com/operating-scylla/scylla-yaml/

            Before starting the instance on a node one also has to run the sudo scylla_setup command, which collects some system info, updates the system setup (disables selinux :shhh:), and writes a few config files to the /etc/scylla.d/ folder.
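
            For reference, the generated fragments in /etc/scylla.d/ look roughly like this (exact file names may differ between versions; io_properties.yaml is the one that shows up again later):

            $ ls /etc/scylla.d/
            cpuset.conf  io.conf  io_properties.yaml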

            Problems with initial setup

            I tried to configure it the same way as Cassandra, using separate SSD disks for storage and giving a list of directories to the data_file_directories option in the YAML file. Scylla accepted that option but failed to start, with some obscure errors which looked like:

            Apr 29 23:22:24 lsst-qserv-master03 scylla[841750]:  [shard 0] database - Exception while populating keyspace 'system_schema' with column family 'computed_columns' from file '/local_data/apdb2/system_schema/computed_columns-cc7c7069374033c192a4c3de78dbd2c4': std::filesystem::__cxx11::filesystem_error (error system:2, filesystem error: open failed: No such file or directory [/local_data/apdb2/system_schema/computed_columns-cc7c7069374033c192a4c3de78dbd2c4/staging])
            Apr 29 23:22:24 lsst-qserv-master03 scylla[841750]:  [shard 0] database - Exception while populating keyspace 'system_schema' with column family 'view_virtual_columns' from file '/local_data/apdb2/system_schema/view_virtual_columns-08843b6345dc3be29798a0418295cfaa': std::filesystem::__cxx11::filesystem_error (error system:2, filesystem error: open failed: No such file or directory [/local_data/apdb2/system_schema/view_virtual_columns-08843b6345dc3be29798a0418295cfaa/staging])
            

            Some googling revealed that data_file_directories does not actually support multiple data directories; all data has to be in a single directory. I found a ticket from 2016 for supporting multiple disks/folders, but later comments on that ticket suggest that this is not going to be implemented (I added a comment myself, but the response was unhelpful).

            Because of that, my initial test uses only a single disk for data (plus one separate disk for the commit log). The first test is done with one replica, so I think there will be enough space to get a reasonable comparison with the 6-month Cassandra run, which also used one replica. For later tests I'll probably need to make a RAID0 array out of all the disks (mdadm array on master03/02).
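
            For the record, the working single-directory layout in scylla.yaml then looks roughly like this (path as used on master02 in this test, matching the error messages above):

            $ grep -A 1 data_file_directories /etc/scylla/scylla.yaml
            data_file_directories:
                - /local_data/apdb2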

            Monitoring issues

            Scylla is not Java, but they provide JMX access to control and monitor the whole thing (JMX is implemented as a separate service, which I believe is a Java app that translates JMX into some other protocol used by the main scylla server). With JMX I could use the same monitoring Jython script that I wrote for Cassandra. Except that it did not work - apparently some methods are not implemented 100% compatibly in that JMX translation layer. With a simple hack I managed to make it work, though. JMX is also used by the tools that Scylla did not bother to rewrite (e.g. nodetool), so there is a bunch of Java tools that talk JMX, which gets translated by the intermediate service - sort of ugly.

            Memory issues

            My ap_proto ran OK with Scylla without any changes, using the cassandra driver module. After running for a while I noticed heavy swapping on the server machines; they were very unresponsive at that point. Apparently Scylla thinks that it is the only service worth running on the system and is not good at sharing physical memory with other services. We are running a few other services on those machines (GPFS, docker) which tend to consume quite a bit of memory too, so they start fighting and swapping. Fortunately there is a simple way to tell Scylla how much memory it is allowed to use - the trick is to add a --memory NNNG option to the command line in the /etc/sysconfig/scylla-server file. I have given it 128GB; with that setting grafana shows that memory is utilized at 65%.
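
            For reference, the change is just an extra argument in /etc/sysconfig/scylla-server; assuming the usual SCYLLA_ARGS variable in that file, it looks something like:

            # /etc/sysconfig/scylla-server -- variable name assumed; keep the existing options and append --memory
            SCYLLA_ARGS="... --memory 128G"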

            Connection issues

            Most of the time clients manage to connect to the servers without issues, but sometimes I see warnings on the client side of this type:

            2020-05-05 09:09:35,156 [WARNING] cassandra.cluster: [control connection] Error connecting to 141.142.181.163:9042:
            Traceback (most recent call last):
              File "cassandra/cluster.py", line 3407, in cassandra.cluster.ControlConnection._reconnect_internal
              File "cassandra/cluster.py", line 3464, in cassandra.cluster.ControlConnection._try_connect
              File "cassandra/cluster.py", line 3461, in cassandra.cluster.ControlConnection._try_connect
              File "cassandra/cluster.py", line 3563, in cassandra.cluster.ControlConnection._refresh_schema
              File "cassandra/metadata.py", line 142, in cassandra.metadata.Metadata.refresh
              File "cassandra/metadata.py", line 165, in cassandra.metadata.Metadata._rebuild_all
              File "cassandra/metadata.py", line 2339, in get_all_keyspaces
              File "cassandra/metadata.py", line 1850, in get_all_keyspaces
              File "cassandra/metadata.py", line 2536, in cassandra.metadata.SchemaParserV3._query_all
              File "cassandra/connection.py", line 981, in cassandra.connection.Connection.wait_for_responses
              File "cassandra/connection.py", line 979, in cassandra.connection.Connection.wait_for_responses
              File "cassandra/connection.py", line 1431, in cassandra.connection.ResponseWaiter.deliver
            cassandra.OperationTimedOut: errors=None, last_host=None
            

            And when this warning appears, it appears for many clients (maybe for each of the 189 clients). On the server side this approximately matches messages of this kind:

            May 05 09:09:37 lsst-qserv-master04 scylla[73986]:  [shard 15] cql_server - exception while processing connection: std::system_error (error system:32, sendmsg: Broken pipe)
            May 05 09:09:37 lsst-qserv-master04 scylla[73986]:  [shard 14] cql_server - exception while processing connection: std::system_error (error system:32, sendmsg: Broken pipe)
            May 05 09:09:37 lsst-qserv-master04 scylla[73986]:  [shard 5] cql_server - exception while processing connection: std::system_error (error system:32, sendmsg: Broken pipe)
            May 05 09:09:37 lsst-qserv-master04 scylla[73986]:  [shard 2] cql_server - exception while processing connection: std::system_error (error system:32, sendmsg: Broken pipe)
            May 05 09:09:37 lsst-qserv-master04 scylla[73986]:  [shard 18] cql_server - exception while processing connection: std::system_error (error system:32, sendmsg: Broken pipe)
            May 05 09:09:37 lsst-qserv-master04 scylla[73986]:  [shard 16] cql_server - exception while processing connection: std::system_error (error system:32, sendmsg: Broken pipe)
            May 05 09:09:37 lsst-qserv-master04 scylla[73986]:  [shard 5] cql_server - exception while processing connection: std::system_error (error system:32, sendmsg: Broken pipe)
            May 05 09:09:37 lsst-qserv-master04 scylla[73986]:  [shard 6] cql_server - exception while processing connection: std::system_error (error system:32, sendmsg: Broken pipe)
             
            or
             
            May 05 09:09:43 lsst-qserv-master02 scylla[70807]:  [shard 2] storage_proxy - Exception when communicating with 141.142.181.129: seastar::named_semaphore_timed_out (Semaphore timed out: _system_read_concurrency_sem)
            May 05 09:09:43 lsst-qserv-master02 scylla[70807]:  [shard 3] storage_proxy - Exception when communicating with 141.142.181.129: seastar::named_semaphore_timed_out (Semaphore timed out: _system_read_concurrency_sem)
            May 05 09:09:43 lsst-qserv-master02 scylla[70807]:  [shard 4] storage_proxy - Exception when communicating with 141.142.181.129: seastar::named_semaphore_timed_out (Semaphore timed out: _system_read_concurrency_sem)
            

            though the timestamps do not align for later messages.

            These warnings do not seem to be fatal though; the clients seem to be able to either reconnect or maybe use a different coordinator node, not sure.

            salnikov Andy Salnikov added a comment -

            Quick results from the first test with Scylla:

            • code name scylla1
            • one instance per machine
            • about 150k visits
            • no replication
            • data goes to a single disk (commitlog goes to a separate disk)
            • memory use is set to 128GB for each instance

            Metrics from servers (from JMX) look odd, not sure how to interpret that, so I'll ignore them.

            Data sizes:

            $ nodetool status
            Datacenter: datacenter1
            =======================
            Status=Up/Down
            |/ State=Normal/Leaving/Joining/Moving
            --  Address          Load       Tokens       Owns    Host ID                               Rack
            UN  141.142.181.162  1.06 TB    256          ?       c8797aed-03be-4cfc-87df-457a5b0006fe  rack1
            UN  141.142.181.163  1 TB       256          ?       a7fcc51c-1d44-44bc-8529-bb079896e727  rack1
            UN  141.142.181.129  971.7 GB   256          ?       9b35785d-8eee-41b7-ae2c-3316dbae8d90  rack1
            

            Disk usage on master02 (141.142.181.129):

            /dev/sdb1       1.4T  261G  1.1T  20% /local_data/apdb1    <-- commitlog
            /dev/sdd1       1.4T  972G  368G  73% /local_data/apdb2    <-- data
            /dev/sde1       1.4T   34M  1.4T   1% /local_data/apdb3
            /dev/sdf1       1.4T   34M  1.4T   1% /local_data/apdb4
            

            Disk usage on master03 (141.142.181.162):

            /dev/nvme0n1p1  3.7T  262G  3.4T   8% /local_data/apdb1   <-- commitlog
            /dev/nvme1n1p1  3.7T  1.1T  2.6T  30% /local_data/apdb2   <--- data
            /dev/nvme2n1p1  3.7T   33M  3.7T   1% /local_data/apdb3
            /dev/nvme3n1p1  3.7T   33M  3.7T   1% /local_data/apdb4
            /dev/nvme4n1p1  3.7T   33M  3.7T   1% /local_data/apdb5
            

             
            Disk usage on master04 (141.142.181.163):

            /dev/nvme0n1p1  3.7T  262G  3.4T   8% /local_data/apdb1
            /dev/nvme1n1p1  3.7T  1.1T  2.7T  28% /local_data/apdb2
            /dev/nvme2n1p1  3.7T   33M  3.7T   1% /local_data/apdb3
            /dev/nvme3n1p1  3.7T   33M  3.7T   1% /local_data/apdb4
            /dev/nvme4n1p1  3.7T   33M  3.7T   1% /local_data/apdb5
            

            Query timing measured on the client side as a function of wall-clock time (as usual this is the average over the 189 CCDs in one visit).

            Storing data:

            Reading data:

            This, I think, is close to or slightly worse than Cassandra with one replica.

            Notebook plots (as a function of visit number):

            Select time for each of three tables:


            And combined fit

            salnikov Andy Salnikov added a comment - - edited

            I want to test Scylla with three replicas; for that I need to increase the usable disk space by merging all SSD space together. I think the easiest way is to configure an LVM RAID0 logical volume. For master02, where we have hardware RAID, we could probably go back to RAID0 in hardware, but I don't want to spend much time on that, and I think the resulting performance would probably be identical to LVM.

            To make an LVM raid0 (a.k.a. striping) volume (a consolidated command sketch follows this list):

            • unmount all existing volumes, and comment out/remove them from fstab too
            • make a PV out of each disk (e.g. pvcreate /dev/nvme0n1p1)
            • make a volume group out of all of them: vgcreate VGapdb /dev/nvme?n1p1
            • make the striped LV: lvcreate --type raid0 --stripes 5 -l 100%FREE VGapdb --name LVapdb (default stripe size is 64kB)
            • make the filesystem: mkfs.xfs -f /dev/VGapdb/LVapdb; the defaults should be OK, XFS is good at guessing the best parameters
            • mount it at /local_data/apdb1, change ownership
            • update the Scylla config to use /local_data/apdb1 for everything
            • also remove the entries for the now-missing mount points from /etc/scylla.d/io_properties.yaml
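
            A consolidated sketch of the above, assuming the five-NVMe layout of master03/04 (the device glob, stripe count, and scylla ownership are assumptions to adjust per node):

            $ sudo umount /local_data/apdb[1-5]                  # after commenting the entries out in /etc/fstab
            $ sudo pvcreate /dev/nvme?n1p1                       # one PV per SSD partition
            $ sudo vgcreate VGapdb /dev/nvme?n1p1                # volume group over all PVs
            $ sudo lvcreate --type raid0 --stripes 5 -l 100%FREE VGapdb --name LVapdb
            $ sudo mkfs.xfs -f /dev/VGapdb/LVapdb                # defaults are fine for XFS
            $ sudo mount /dev/VGapdb/LVapdb /local_data/apdb1
            $ sudo chown scylla:scylla /local_data/apdb1         # ownership assumed; use the account scylla runs as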

            Here is what scylla_io_setup reports for I/O performance in the new and old configuration (old one was for a single disk):

            • master02 (mountpoint: /local_data/apdb1)
              New: read_iops 15,414; read_bandwidth 2,322,135,808; write_iops 12,041; write_bandwidth 4,479,416,320
              Old: read_iops 2,817; read_bandwidth 758,585,088; write_iops 6,663; write_bandwidth 1,345,401,728
            • master03 (mountpoint: /local_data/apdb1)
              New: read_iops 1,567,277; read_bandwidth 8,592,672,768; write_iops 1,131,395; write_bandwidth 8,546,787,840
              Old: read_iops 667,931; read_bandwidth 2,853,252,608; write_iops 665,117; write_bandwidth 2,792,875,264
            • master04 (mountpoint: /local_data/apdb1)
              New: read_iops 1,644,360; read_bandwidth 8,490,080,768; write_iops 1,105,646; write_bandwidth 8,617,294,848
              Old: read_iops 667,188; read_bandwidth 2,830,111,744; write_iops 668,418; write_bandwidth 2,813,575,168

            And creating the keyspace:

            cqlsh> create KEYSPACE apdb WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '3'} ;
            
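            To double-check that the replication factor took effect, one can, for example, describe the keyspace:

            cqlsh> DESCRIBE KEYSPACE apdb;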

            salnikov Andy Salnikov added a comment - - edited

            I stopped the second round of Scylla tests after 178k visits; here is the summary for that test.

            Setup:

            • code name scylla2
            • replication factor 3
            • all data disks configured in RAID0 as described above
            • memory limit 128GB on each node

            Data sizes after 178k visits:

            $ nodetool status
            Datacenter: datacenter1
            =======================
            Status=Up/Down
            |/ State=Normal/Leaving/Joining/Moving
            --  Address          Load       Tokens       Owns    Host ID                               Rack
            UN  141.142.181.162  3.56 TB    256          ?       a0bd6468-ed88-452f-b3e7-471274b9dd37  rack1
            UN  141.142.181.163  3.56 TB    256          ?       e3783b28-4d34-4ca3-8c60-ada87c1532e4  rack1
            UN  141.142.181.129  3.56 TB    256          ?       66ff1086-c01f-4b26-8c80-5440bf26fa6d  rack1
            

            Disk usage on master02:

            $ df -h /local_data/apdb1; du -sh /local_data/apdb1/{commitlog,apdb}
            Filesystem                 Size  Used Avail Use% Mounted on
            /dev/mapper/VGapdb-LVapdb  5.3T  3.9T  1.5T  73% /local_data/apdb1
            259G    /local_data/apdb1/commitlog
            3.6T    /local_data/apdb1/apdb
            

            Disk usage on master03:

            $ df -h /local_data/apdb1; du -sh /local_data/apdb1/{commitlog,apdb}
            Filesystem                 Size  Used Avail Use% Mounted on
            /dev/mapper/VGapdb-LVapdb   19T  3.9T   15T  21% /local_data/apdb1
            259G    /local_data/apdb1/commitlog
            3.6T    /local_data/apdb1/apdb
            

            Disk usage on master04:

            $ df -h /local_data/apdb1; du -sh /local_data/apdb1/{commitlog,apdb}
            Filesystem                 Size  Used Avail Use% Mounted on
            /dev/mapper/VGapdb-LVapdb   19T  3.9T   15T  21% /local_data/apdb1
            259G    /local_data/apdb1/commitlog
            3.6T    /local_data/apdb1/apdb
            

            In this round of tests I tried to accelerate testing by skipping the reading of Dia(Forced)Source for a large fraction of events (90%). This could potentially affect measured performance, as the workload on the server side becomes somewhat different. This mode was enabled around visit 64k; I did not see any significant discontinuities in the plots due to that, but the effect was noticeable, especially on the writing time, I think due to the higher load when reading is disabled.

            Here are standard grafana plots for write and read times (as usual average per CCD):

            The top plot shows the effect of disabling the reading of sources, which causes a lot more writes per unit of time and apparently reduces performance for that period (around 2020-05-19 15:30).

            The second plot has a discontinuity around 2020-05-20 20:00 (visit ~90k); at that time I decided to stop and restart the cluster. There were GPFS issues, and the monitoring jobs that were writing logs to GPFS crashed (and also the client batch jobs). Scylla itself does not need GPFS, but I decided to flush the data and restart the cluster anyway (though I did not do a compaction). Mysteriously, after the restart performance had improved significantly. I thought that JMX monitoring could have a negative impact on performance, since I did not restart monitoring after restarting the cluster, but later I started monitoring again and it did not show any effect on performance. There are also significant fluctuations in read performance, something that was not observed on the plots from the single replica above.

            In general read performance feels slightly worse than for a single replica, but not significantly so.

            And the same set of plots from the notebook, with the timing as a function of visit number.

            Select time for each of three tables:

            And combined fits

             

            Raw numbers, but averaged over CCDs, for total read and write times as a function of visit (only the first 100 out of every 1000 visits also read sources):

            (do not try to draw any conclusions from this plot).

            I think that performance looks reasonable even for the current totally inadequate cluster setup, and scaling the cluster up should bring the numbers to a more reasonable range.

             

            salnikov Andy Salnikov added a comment -

            I think I have messed with Scylla enough. I don't think there is a clear conclusion yet, but it definitely has some issues. At this point I'm going to close this ticket and go back to Cassandra to reproduce the same setup - three replicas without docker, with just three instances; it will be interesting to compare that with Scylla.


              People

              Assignee:
              salnikov Andy Salnikov
              Reporter:
              salnikov Andy Salnikov
              Watchers:
              Andy Hanushevsky, Andy Salnikov, Colin Slater, Fritz Mueller
              Votes:
              0

                Dates

                Created:
                Updated:
                Resolved:

                  CI Builds

                  No builds found.