Alert production will need fully fault-tolerant solution for L1 database. This implies at least two replicas (with that volume of data it's unlikely we can afford more replicas) and transparent or fully-automated fail-over solution. To simplify fail-over operation it's preferred to have master-master type replication which allows transparent writing to any instance, fail-over in that case happens by just switching client to a different instance. For master-master replication we can have several options to chose from:
- using native mysql option for master-master replication, it's mostly a question of configuring two instances (one-time task)
- using third-party product like mariadb galera cluster
Transparent client switching in case when main server goes away is not completely trivial. Standard mysql/mariadb C/C++ connector does not provide transparent switch-over mechanism (I believe only mariadb Connector/Java supports switching). It can be implemented on client side by extending API, that would require some error-catching mechanism to identify disconnects and doing smart attempt to reconnect when possible. Support at more high-level code may be necessary to do something in cases when failure happens in the middle of transaction. Alternatively switching can be implemented by some external mechanism, e.g. transparent proxy sitting in front of the mysql instances (e.g. mysql-proxy or MaxScale). Question still remains what happens when proxy dies, some client support is still needed in this case. Third-party products (server-side) may have their own solution a la proxy which takes care of the server switching.
In addition to NCSA L1 instance(s) there will be read-only L1 instance at Base Site and it needs to be updated from NCSA instances with reasonably short delay. For that master-slave replication seems to be adequate. Three replicas together would make a sort of hybrid cluster, which may be easier managed with mariadb galera cluster.