High availability

Last modified by Simon Morlat on 2019/07/05 11:44

High availability setup

Flexisip can achieve high availability by combining three things:

  • Several Flexisip instances running on multiple host machines, configured to serve the same domains
  • SRV records to spread the traffic among the Flexisip instances, so that clients can try an alternate route if the host behind one SRV record is down
  • A Redis registrar set up in a master/slave configuration and monitored by sentinels

In the ideal, simplified setup, each host runs:

  1. A Redis instance configured with the same password for both requirepass and masterauth, and with slaveof pointing to the current Redis master;
  2. A Flexisip instance configured to connect to the current Redis instance;
  3. A redis-sentinel instance configured to monitor the current Redis master.

Flexisip-ha_conf.png

With that setup, each redis-sentinel connects to the master Redis database and starts monitoring its slaves (including the local instance). Each Redis instance starts replicating the master and waits for client connections. Each Flexisip instance connects to the Redis master and starts handling client interactions.

If a network error affects the master Redis database, the sentinels elect a new Redis master and reconfigure all the other Redis instances accordingly.

Flexisip behaves as follows:

  1. While the connection to the master is working, Flexisip periodically asks it for the list of its slaves.
  2. If the connection to the master is lost, Flexisip successively tries all known slaves and waits for a new master to be elected.
  3. When a new master is elected, Flexisip drops the slave connection and connects to the new master.
  4. At this point, the cluster is able to process registrations again.

The overall failover time depends on the sentinel configuration, in particular on down-after-milliseconds. We recommend a 10 s delay.
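The reconnection behaviour described above can be illustrated with a minimal Python sketch. This is not Flexisip's actual code: the RedisNode class, the pick_connection function and the 10.0.0.2 / 10.0.0.3 addresses are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class RedisNode:
    """Hypothetical stand-in for one Redis instance as seen by a client."""
    address: str
    reachable: bool = True
    role: str = "slave"  # either "master" or "slave"


def pick_connection(master, known_slaves):
    """Return (node, can_register) for the current state of the cluster.

    can_register is True only when the chosen node is a master, since
    registrations cannot be processed until a master is available.
    """
    # Step 1: the master connection works, keep using it.
    if master.reachable and master.role == "master":
        return master, True
    reachable = [node for node in known_slaves if node.reachable]
    # Step 3: a former slave has been promoted, switch to it.
    for node in reachable:
        if node.role == "master":
            return node, True
    # Step 2: stay on a reachable slave and wait for the election.
    if reachable:
        return reachable[0], False
    return None, False


host1 = RedisNode("10.0.0.1", role="master")
host2 = RedisNode("10.0.0.2")
host3 = RedisNode("10.0.0.3")

print(pick_connection(host1, [host2, host3]))  # normal operation
host1.reachable = False
print(pick_connection(host1, [host2, host3]))  # waiting on a slave
host2.role = "master"
print(pick_connection(host1, [host2, host3]))  # new master elected
```

The sketch only models the decision; the real implementation also has to refresh the list of slaves periodically, as described in step 1.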

High availability setup requirements

An NTP daemon MUST be installed on all machines running Redis and Flexisip instances. Flexisip asks Redis to remove expired registrations automatically, and this mechanism relies on universal time. If any node of the cluster has a wrong clock, the management of registrations breaks: clients then receive 404 Not Found responses from Flexisip for destinations that were correctly registered.

On a Debian system, this is done by installing the NTP daemon:

sudo apt-get install ntp

/etc/ntp.conf may be customized to set the hostname of your preferred NTP server (for example, the one of your hosting provider).
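To see why a wrong clock breaks registrations, consider the following simplified Python model. It is not the actual expiry code of Flexisip or Redis; the function names and the numeric values are assumptions chosen for illustration. It only shows that a registration stored with an absolute expiry time looks expired to a node whose clock runs ahead.

```python
def node_clock(true_time: float, skew_seconds: float) -> float:
    """Clock reading of a node whose clock is off by skew_seconds."""
    return true_time + skew_seconds


def is_registered(expires_at: float, clock_reading: float) -> bool:
    """A registration is kept while the node's clock is before its expiry."""
    return clock_reading < expires_at


# A client registers at t=1000 s for 600 s, so the record expires at t=1600 s.
expires_at = 1000 + 600
true_time = 1200  # 200 s after the registration

in_sync = node_clock(true_time, skew_seconds=0)
ahead = node_clock(true_time, skew_seconds=600)  # clock 10 minutes ahead

print(is_registered(expires_at, in_sync))  # True: the call can be routed
print(is_registered(expires_at, ahead))    # False: the record is dropped, hence a 404
```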

Sample configurations

Redis master

In file /etc/redis/redis.conf:

bind *
requirepass ComplicatedPassWord123456789
masterauth ComplicatedPassWord123456789

All other Redis instances

bind *
requirepass ComplicatedPassWord123456789
masterauth ComplicatedPassWord123456789
slaveof <master ip> <master port>

All Redis sentinels

In file /etc/redis/sentinel.conf:

# sentinel monitor <name> <ip master> <port> <quorum size>
sentinel monitor flexi1 10.0.0.1 6379 2
sentinel down-after-milliseconds flexi1 10000
sentinel failover-timeout flexi1 20000
sentinel auth-pass flexi1 ComplicatedPassWord123456789

# For Redis 3.2 and later
protected-mode no

The quorum size is the number of sentinels that must agree that the master is down before the election of a new master is triggered. For a cluster of 3 nodes, a quorum of 2 is a good value. If the quorum is equal to the size of the cluster, the election process will never be initiated: when a host fails, its sentinel goes down with it, so the remaining sentinels can never reach the quorum.
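The quorum rule can be checked with a few lines of Python. This is a sketch under stated assumptions, not Redis code: failover_possible is a hypothetical helper. It also encodes the rule from the Redis Sentinel documentation that, besides the quorum used to detect the failure, a majority of the sentinels must be reachable to authorize the failover.

```python
def failover_possible(total_sentinels: int, quorum: int, reachable_sentinels: int) -> bool:
    """Tell whether the sentinels still reachable can trigger a failover."""
    majority = total_sentinels // 2 + 1
    # The quorum is needed to declare the master down, and a majority
    # of all sentinels is needed to authorize the failover itself.
    return reachable_sentinels >= quorum and reachable_sentinels >= majority


# 3 hosts, one sentinel each; the host of the master fails, 2 sentinels remain.
print(failover_possible(3, quorum=2, reachable_sentinels=2))  # True
print(failover_possible(3, quorum=3, reachable_sentinels=2))  # False
```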

Protected mode must be disabled so that the sentinels can accept requests that do not come from the loopback interface, even when they listen on all interfaces. Note that disabling protected mode exposes your sentinels to the public network, while they are unable to authenticate each other. To mitigate this security issue, the firewall should only allow sentinel requests coming from a whitelist of IP addresses.

Alternatively, if all your sentinels are on a safe subnetwork or VPN, you should leave protected mode enabled and make your sentinels listen on the private network interface.

Flexisip configuration of each node

In file /etc/flexisip/flexisip.conf, in the [global] section, transports must be defined for each host. For example, for host1:

[global]
transports=sips:host1.example.org

In the [module::Registrar] section:

reg-domains=mydomain.com
db-implementation=redis
redis-server-domain=10.0.0.1
redis-server-port=6379
redis-auth-password=ComplicatedPassWord123456789

In the [cluster] section:

enabled=true

# List of IP addresses of all nodes present in the cluster
nodes=<IP host1> <IP host2> <IP host3>
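Since only the transports line differs from one host to another, the per-node settings can be generated from a single template. The following Python sketch uses the standard configparser module. The render_flexisip_conf function and the 10.0.0.2 / 10.0.0.3 addresses are assumptions made for this example, and the output only covers the keys shown above, not a complete flexisip.conf.

```python
import configparser
import io


def render_flexisip_conf(hostname, node_ips, redis_ip, password, sip_domain):
    """Render the HA-related sections of flexisip.conf for one node."""
    cfg = configparser.ConfigParser(interpolation=None)
    cfg.optionxform = str  # keep option names exactly as written
    cfg["global"] = {"transports": f"sips:{hostname}"}
    cfg["module::Registrar"] = {
        "reg-domains": sip_domain,
        "db-implementation": "redis",
        "redis-server-domain": redis_ip,
        "redis-server-port": "6379",
        "redis-auth-password": password,
    }
    cfg["cluster"] = {"enabled": "true", "nodes": " ".join(node_ips)}
    out = io.StringIO()
    # Write key=value without spaces, as in the samples above.
    cfg.write(out, space_around_delimiters=False)
    return out.getvalue()


print(render_flexisip_conf(
    hostname="host1.example.org",
    node_ips=["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    redis_ip="10.0.0.1",
    password="ComplicatedPassWord123456789",
    sip_domain="mydomain.com",
))
```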

TLS certificates

When SIP/TLS (sips) is used, the TLS server certificate MUST advertise both the hostname and the SRV domain name. To that end, the x509 SubjectAltName extension should be used, with one DNS field for each name. The rationale for this is:

  • SIP clients are required to verify that the names match the host part of the SIP URI originally targeted, per RFC 5922 (Domain Certificates in the Session Initiation Protocol (SIP)).
  • Flexisip nodes use their hostname in Record-Route headers, to ensure that requests belonging to the same dialog take the same path as the request that created the dialog. For this reason, clients may need to connect to SIP URIs pointing to the node hostnames.

For example, the TLS certificate used for node "host1.example.org" must have a SubjectAltName with two DNS fields with values "example.org" (the SIP domain resolved by SRV), and "host1.example.org" (the node's hostname resolved by A/AAAA).
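The requirement above can be verified with a short Python helper. This is a minimal sketch: missing_san_names is a hypothetical function, it expects the certificate as the dictionary returned by ssl.SSLSocket.getpeercert(), and it only performs an exact, case-insensitive comparison (wildcard entries are not handled).

```python
def missing_san_names(cert: dict, sip_domain: str, node_hostname: str) -> list:
    """Return the required names that are absent from the certificate.

    cert uses the layout of ssl.SSLSocket.getpeercert(), where
    "subjectAltName" is a sequence of (type, value) pairs.
    """
    dns_names = {
        value.lower()
        for kind, value in cert.get("subjectAltName", ())
        if kind == "DNS"
    }
    required = (sip_domain, node_hostname)
    return [name for name in required if name.lower() not in dns_names]


good = {"subjectAltName": (("DNS", "example.org"), ("DNS", "host1.example.org"))}
bad = {"subjectAltName": (("DNS", "host1.example.org"),)}

print(missing_san_names(good, "example.org", "host1.example.org"))  # []
print(missing_san_names(bad, "example.org", "host1.example.org"))   # ['example.org']
```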

Typical DNS SRV records configuration

Here is an example of an active/active configuration for sips with 2 nodes.

_sips._tcp.example.org 3600 IN SRV 0 100 5061 host1.example.org.
_sips._tcp.example.org 3600 IN SRV 0 100 5061 host2.example.org.
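The following Python sketch shows how these two records translate into an even split of the traffic. It follows the priority and weight semantics of RFC 2782 in a simplified way (the lowest priority value is used first, weights give the proportion within that group). The SrvRecord type and both functions are hypothetical helpers, not part of any client library.

```python
from collections import namedtuple

SrvRecord = namedtuple("SrvRecord", "priority weight port target")


def parse_srv(line: str) -> SrvRecord:
    """Parse 'owner ttl class SRV priority weight port target'."""
    fields = line.split()
    priority, weight, port = (int(value) for value in fields[4:7])
    return SrvRecord(priority, weight, port, fields[7])


def traffic_share(records):
    """Expected share of new connections per target, for the preferred group."""
    best = min(record.priority for record in records)
    group = [record for record in records if record.priority == best]
    total = sum(record.weight for record in group)
    if total == 0:
        # All weights are zero: treat the targets as equally likely.
        return {record.target: 1 / len(group) for record in group}
    return {record.target: record.weight / total for record in group}


zone = [
    "_sips._tcp.example.org 3600 IN SRV 0 100 5061 host1.example.org.",
    "_sips._tcp.example.org 3600 IN SRV 0 100 5061 host2.example.org.",
]
print(traffic_share([parse_srv(line) for line in zone]))
```

With equal priorities and weights, each node receives half of the new connections. Giving one record a higher priority value would turn its node into a backup that is only contacted when the other one is unreachable.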

Typical scenario in case of failure in a HA configuration

We have 3 hosts, and the current Redis master is Host 1.

  • Host 1 suffers a failure and becomes unreachable. The other Flexisip instances immediately detect the failure and start connecting to another Redis slave. They enter a wait mode, in which no new registration can be made, until a new master is elected.

Flexisip-ha_failure_step_1.png

  • After the delay configured in the sentinels (10 s is recommended), they start the election process to set a new Redis master. In this case, Host 2 is elected as the new master. The sentinels reconfigure the Redis instance on Host 3 to replicate the new master. The Flexisip instances on Host 2 and Host 3 notice the change and automatically migrate to the new Redis master database.

Flexisip-ha_failure_step_2.png

  • Once Host 1 comes back online, the sentinels detect that it is alive again and reconfigure it as a slave. The Flexisip instance on Host 1 automatically migrates to the newly elected Redis master (Host 2).

Flexisip-ha_failure_step_3.png

Created by Simon Morlat on 2017/02/14 11:53