To meet customers’ expectations regarding the quality of communication services, the service provider needs to introduce an extra degree of reliability within the network and its applications so that the service is not interrupted, even if some network components are not functioning. How can this demand be addressed?
The per-server redundancy (when there are two physical servers and each runs a copy of an application, such as PortaSIP) addresses the situation when a single server fails (e.g., hardware fault). But there is another class of “catastrophic” events that can render all servers installed in the same location (rack, hosting center, etc.) unavailable. Such events include natural disasters, power outages at the collocation provider, network routing errors, etc. The only way to overcome this and provide uninterrupted service is to have another set of servers in a different location that can continue operating during the outage at the “main” site.
It is important in this situation that the “secondary” site not only activates and begins providing service as soon as possible, but also that it automatically synchronizes the changes later on (updates balances, xDRs, etc.) to the “main” site.
All of the above is available as the PortaSwitch site redundancy solution, which allows service providers to:
- Protect themselves against hosting facility outages.
- Provide service to multiple geographic regions – even if network connectivity between those regions is lost.
- And finally, perform upgrades to new software versions with zero downtime! This last provision adds an essential benefit to the deployment of PortaSwitch across multiple sites since although one might hope that a hosting facility outage would never happen, one can be certain that sooner or later, there will be a need to perform a software upgrade. Refer to the Zero-downtime upgrade section for more information.
The PortaSwitch site redundancy flow is shown in the diagram below. The Dispatching SBC (DSBC) node distributes incoming traffic among the sites. It lets you manually switch traffic allocation from one site to another; for example, when PortaOne Support upgrades the main site, they configure the DSBC to send all incoming traffic to the secondary site. Also, if one of the sites is down (e.g., because of connectivity loss), the dispatching SBC automatically redirects all traffic to the other site.
Let’s say the main site is down. The secondary site detects that the main site has become unavailable and activates the “stand-alone” mode. The dispatching SBC now redirects new call and registration requests to the secondary site for processing, so the secondary site provides service to the end users using the latest available snapshot of the main site’s service configuration. The xDRs for consumed services and changes in balance are accumulated in a separate database (on the stand-by database server) and are taken into consideration when authorizing subsequent activities, so there is no risk of balance overdraft when the stand-alone mode is used.
Once the main site becomes available again, the dispatching SBC can resume sending requests to it. The secondary site synchronizes all of the accumulated changes to the main site and then switches back to its normal mode.
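The dispatching rules described above can be illustrated with a minimal Python sketch. The function and parameter names are hypothetical, and the sketch only models which sites are eligible to receive new requests; it says nothing about how the actual DSBC is configured or how it balances traffic among eligible sites.

```python
from typing import Dict, List, Optional


def eligible_sites(site_up: Dict[str, bool],
                   manual_target: Optional[str] = None) -> List[str]:
    """Return the sites that may receive new call and registration requests.

    site_up       -- hypothetical availability map, e.g. {"main": True, "secondary": True}
    manual_target -- hypothetical manual override (e.g., set during an upgrade)
    """
    # Sites that are down never receive traffic.
    up = [name for name, is_up in site_up.items() if is_up]
    # A manual override sends all traffic to one site, as long as it is reachable.
    if manual_target in up:
        return [manual_target]
    # Otherwise traffic is distributed among all reachable sites.
    return up
```

For instance, with both sites up and `manual_target="secondary"` the function returns only the secondary site, which corresponds to the upgrade scenario; with the main site down it returns only the secondary site regardless of the override.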
Refer to PortaSIP dispatching SBC (DSBC) to read more about the dispatching SBC functions in the PortaSwitch architecture.
Typical deployment scenario
Let’s consider an example of PortaSwitch deployment across multiple sites (fully cloud-based deployment is also possible).
The main and the secondary site are connected through a Local Area Network (LAN).
The main site hosts a standard clustered PortaSwitch (the Configuration server, main and replica database servers, a cluster of PortaBilling OCS and web servers, and the PortaSIP cluster).
In its “normal” mode of operation, the secondary site works as follows:
- The stand-by database server continually retrieves changes from the main site, so it always has an up-to-date snapshot of the database from the main site.
- The OCS servers are in “stand-by” mode, so they do not actively process any requests.
- The PortaSIP cluster provides service as usual (processing incoming calls, playing the IVR, etc.). It uses the OCS servers on the main site for authentication and writes any changes (e.g., updated SIP phone location) into the main database.
Another option is deploying one or more secondary sites in a different city or country using Wide Area Network (WAN) connectivity.
Whatever the choice, there is an essential requirement to provide proper interconnection between the sites. There are a lot of ways to organize the sites into a single corporate network; the selection of the technology depends on existing network infrastructure, equipment, or capabilities of your network provider.
Regardless of the technology you choose, all PortaSwitch servers must be connected via virtual (or physical) Layer 2 connection(s) and be configured as hosts in a single virtual (or physical) private network.
When disaster strikes
If there is an outage (for instance, a motherboard failure) on a single server (e.g., PortaBilling OCS server #1) at the primary site, the primary site continues to operate as usual. Another server within the cluster (PortaBilling OCS server #2 in our example) processes all the requests and there is no need to switch over to the secondary site.
The above statement is true for an outage on any server except the primary database, since an outage there would render all other servers on the primary site (billing engine, PortaSIP) unable to function normally.
Therefore, the activation of the stand-alone mode on the secondary site would only happen if:
- There is an outage on the primary database server.
- There is an outage on all servers at the primary site (e.g., power failure).
- There is a network outage that makes the primary site inaccessible from the secondary site.
In any of these cases, the stand-alone mode is activated on the secondary site. This is a special mode of operation that allows the site to keep providing as many services to end users as possible (e.g., placing outgoing calls, receiving incoming calls, accessing the IVR auto attendant, placing calls using the calling card IVR, etc.). At the same time, we assume that the outage at the main site is (most likely) temporary, so once the main site is restored, synchronization with it will need to be performed. For this reason, certain operations are disabled in stand-alone mode if they could cause a breach in data integrity between the sites; for instance, it is not possible to create new accounts, change service configurations, etc.
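The activation conditions listed above can be summarized in a short Python sketch. The class and field names are hypothetical and only illustrate the decision; they do not reflect how PortaSwitch actually detects an outage.

```python
from dataclasses import dataclass


@dataclass
class PrimarySiteStatus:
    # All three fields are illustrative assumptions, not real PortaSwitch settings.
    database_up: bool    # the primary database server is running
    any_server_up: bool  # at least one server on the primary site is running
    reachable: bool      # the primary site is accessible from the secondary site


def should_activate_standalone(status: PrimarySiteStatus) -> bool:
    """Return True if any of the three activation conditions holds."""
    return (not status.database_up
            or not status.any_server_up
            or not status.reachable)
```

A failure of a single non-database server (for example, one OCS server) leaves all three fields `True`, so the function returns `False` and no switchover takes place.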
When a service is provided on the secondary site, the billing engine continues to calculate applicable charges according to the product, tariff, and other billing parameters of the responsible party (e.g., the account that originated the call). Changes to the balance and new xDRs are written into a separate database (the “delta” database, which runs on the same physical server as the stand-by database). This allows the billing engine to keep track of already consumed services and avoid a balance overdraft, even if the secondary site has to operate in stand-alone mode for an extended period of time, and it produces a clear history of all charges. When the primary site becomes available again, these changes are automatically applied to the primary database and the secondary site is switched back to “normal” mode. All of this happens automatically, without any need for PortaSwitch administrator involvement, and an end user might not even notice that there were any problems at the main site.
Example scenario
Let’s detail what happens in case of a primary site outage using a single customer as an example. The customer “ABC” has account number 12345 provisioned on his IP phone. The customer has a current balance of $98.00, a credit limit of $100 and his rate for calls within the US is $0.10/minute. The primary and secondary sites are configured as previously described.
- A power outage makes the entire primary site unavailable.
- This event is detected by a watchdog script on the secondary site, so the site switches into “stand-alone” mode (in particular, this enables the OCS server on the secondary site and instructs the PortaSIP cluster on the secondary site to use it as the authorization source).
- If the user’s SIP phone was previously registered to the PortaSIP cluster on the primary site, during the next re-registration attempt the phone will detect that the cluster is no longer available and attempt to contact an alternative server (this list is either pre-programmed into the phone or obtained dynamically using DNS). When it reaches the PortaSIP cluster on the secondary site it registers there. (If the phone is already registered on the PortaSIP cluster on the secondary site, nothing changes.)
- When the user attempts to make an outgoing call, an authorization request is sent to the PortaBilling OCS server on the secondary site.
- The billing engine compares the currently available balance information ($98.00) with the credit limit ($100.00). With $2.00 of available funds at $0.10/minute, it authorizes the call for no more than 20 minutes.
- When, after 12 minutes of conversation, the user hangs up, PortaSIP sends an accounting request to PortaBilling so that charges are applied.
- When PortaBilling processes the request, it calculates the amount to be charged ($1.20) and stores the balance adjustment ($1.20) and the xDR for that call (with all call details such as CLI, CLD, call connect time, etc.) in the delta database.
- Then, when the user makes another call and PortaSIP sends an authorization request, the billing engine calculates the “effective” balance as the sum of the balance in the stand-by database ($98.00) and the balance adjustment stored in the delta database ($1.20). So the effective balance is $99.20 and the call will have a time limit of 8 minutes.
- The user hangs up after 5 minutes, so there is another xDR for that call with the charged amount of $0.50 written to the delta database and the balance adjustment is now $1.70.
- The next call will only be authorized for the remaining $0.30 of available funds (3 minutes at $0.10/minute) and can only run until the balance reaches the credit limit. This prevents balance overdraft, even though the site operates in stand-alone mode and the balances in the stand-by database are not changed.
- When the primary site comes back up, synchronization takes place.
- First, funds in the amount of the balance adjustment ($1.70) are locked in the primary database. This ensures that if the customer now tries to use the service on the main site, he will only be able to spend the $0.30 that he has available.
- Next, the secondary site is switched back to “normal” mode.
- Finally, individual xDRs are transferred to the primary database.
This two-step process (first the funds lock, then the actual xDR transfer) prevents balance overdraft on the main site while the xDR transfer is in progress. There can be a large number of xDRs (if the secondary site operated in stand-alone mode for an extended period of time), so it can take time to replicate all of them to the primary site.
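The arithmetic of this scenario can be reproduced with a small Python sketch. It is only an illustration under stated assumptions: the class and function names are hypothetical, a single flat rate is used, and the way the lock is released as each xDR is applied is an assumption, not a description of the actual PortaBilling implementation.

```python
from decimal import Decimal

RATE_PER_MINUTE = Decimal("0.10")  # rate from the example scenario
CREDIT_LIMIT = Decimal("100.00")   # credit limit from the example scenario


class StandaloneBilling:
    """Hypothetical model of billing on the secondary site in stand-alone mode."""

    def __init__(self, snapshot_balance):
        self.snapshot_balance = snapshot_balance  # stand-by database, never modified
        self.delta = Decimal("0.00")              # balance adjustment in the delta database
        self.xdrs = []                            # charged amounts, one per call

    def effective_balance(self):
        return self.snapshot_balance + self.delta

    def authorized_minutes(self):
        available = CREDIT_LIMIT - self.effective_balance()
        return int(available / RATE_PER_MINUTE)

    def charge_call(self, minutes):
        amount = RATE_PER_MINUTE * minutes
        self.delta += amount
        self.xdrs.append(amount)
        return amount


class PrimaryAccount:
    """Hypothetical model of the account in the primary database."""

    def __init__(self, balance):
        self.balance = balance
        self.locked = Decimal("0.00")

    def available(self):
        return CREDIT_LIMIT - self.balance - self.locked


def synchronize(primary, standalone):
    # Step 1: lock funds equal to the accumulated balance adjustment.
    primary.locked += standalone.delta
    # Step 2: transfer xDRs one by one (assumption: each applied xDR
    # moves its amount from the lock into the balance).
    for amount in standalone.xdrs:
        primary.balance += amount
        primary.locked -= amount


billing = StandaloneBilling(Decimal("98.00"))
first_limit = billing.authorized_minutes()   # 20 minutes
billing.charge_call(12)                      # charges $1.20
second_limit = billing.authorized_minutes()  # 8 minutes
billing.charge_call(5)                       # charges $0.50, adjustment is now $1.70
third_limit = billing.authorized_minutes()   # 3 minutes ($0.30 left)

primary = PrimaryAccount(Decimal("98.00"))
synchronize(primary, billing)                # balance becomes $99.70, lock is released
```

In this model the funds available on the primary site stay at $0.30 from the moment the lock is placed until the last xDR is applied, which is the point of locking the funds before the transfer starts.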
Site redundancy peculiarities for Internet access services
If you provide Internet access services and want to use the PortaSwitch site redundancy solution, your NAS must support two connections – active and fallback. Your engineers should configure the active connection to send requests to the RADIUS server on the main site and the fallback connection to send requests to the RADIUS server on the secondary site.
When the main site is down, e.g., during a zero-downtime upgrade (ZDU), the NAS can automatically switch from the active to the fallback connection and back.
Stand-alone mode restrictions
The secondary site does not differentiate between these two types of events:
- The primary site is down or has been destroyed (power failure, hurricane, earthquake, etc.).
- The primary site is still up and operational, but connectivity between the primary site and the secondary site is lost. For instance, the primary site is in city A and the secondary site is in city B. So while there is no connectivity between those two city sites, each one functions normally; in each city there are users using the service.
When the secondary site operates in stand-alone mode, it is essential that data integrity between the primary and secondary sites is protected at all times. This means that no operations should be allowed to run on the secondary site that could cause a data conflict when merging the data changes back to the primary site.
Let’s assume that during a connectivity outage between the sites the service configuration is changed as follows:
- The end user connected to the secondary site sets up call forwarding to phone number 123.
- On the primary site, the administrator also sets up call forwarding to phone number 456 for this user.
Once connectivity between the sites is restored and a data merge is performed, it may be unclear which configuration is regarded as valid (i.e., which number should be used as the forwarding number). This is called a “split brain” problem and must be prevented from happening.
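The conflict can be made concrete with a short Python sketch. The data layout (a dictionary keyed by a hypothetical user/attribute pair) and the function name are assumptions made purely for illustration.

```python
def find_conflicts(primary_changes, secondary_changes):
    """Return attributes that were changed to different values on both sites."""
    conflicts = {}
    for key, primary_value in primary_changes.items():
        if key in secondary_changes and secondary_changes[key] != primary_value:
            conflicts[key] = (primary_value, secondary_changes[key])
    return conflicts


# Changes made on each site while connectivity between the sites was lost.
primary_changes = {("user", "forward_to"): "456"}    # set by the administrator
secondary_changes = {("user", "forward_to"): "123"}  # set by the end user

conflicts = find_conflicts(primary_changes, secondary_changes)
```

Here `conflicts` contains one entry with both forwarding numbers, and nothing in the data itself indicates which of the two should win during the merge.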
So although the secondary site can detect that the primary site is not accessible, it has to treat the primary site as if it were still operating normally: users may be making calls, administrators may be making changes via the web interface, and data may be changing. Thus, the secondary site (when activated) does not perform all of the functions of the primary site; stand-alone mode requires that some functionality be disabled.
In short, in stand-alone mode, the only operations allowed are those that change the balance and produce xDRs. All other changes (e.g., changing service configuration attributes or creating new entities) are prohibited.
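This rule can be sketched as a simple allow-list check in Python. The operation names and the exception class are hypothetical; they are not PortaSwitch identifiers.

```python
# Hypothetical operation names: only operations that change the balance
# or produce xDRs are permitted in stand-alone mode.
ALLOWED_IN_STANDALONE = {"adjust_balance", "write_xdr"}


class StandaloneModeError(Exception):
    """Raised when an operation is prohibited in stand-alone mode."""


def check_operation(operation: str, standalone: bool) -> None:
    """Raise StandaloneModeError if the operation is not allowed right now."""
    if standalone and operation not in ALLOWED_IN_STANDALONE:
        raise StandaloneModeError(
            f"'{operation}' is not available in stand-alone mode"
        )
```

With `standalone=True`, a call such as `check_operation("create_account", True)` raises the error, while `check_operation("write_xdr", True)` passes silently; in normal mode every operation passes.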
The services are available with the following restrictions:
- Voice calls – users can make and receive all kinds of phone calls (using IP phones or calling card IVRs), including such complex scenarios as call pickup, call transfer, etc. However, the Presence/Busy Lamp Field and Shared line appearance features are unavailable.
- Voice applications
  - IVR applications that do not change the service or account configuration are fully available:
    - Account top-up via voucher.
    - Email callback.
    - Balance information.
    - One-stage calling.
    - Pass-Through IVR.
    - WEB callback.
    - Conferencing.
  - Some IVR application components/commands modify the service or account configuration; therefore, they are available with limitations:
    - Callback calling (account registration is disabled).
    - Screening IVR (fraud protection is disabled).
    - Prepaid card calling (account registration is disabled).
    - SMS Callback (account registration and change password commands are disabled).
    - Voicemail (messages are placed in the exim mail queue).
    - Call Queues in auto attendant (the first caller in the queue may become disconnected before the secondary site is switched to the stand-alone mode).
  - The following IVR applications change the service or account configuration and are unavailable:
    - Account self-care.
    - Account top-up via credit card.
    - Call forwarding management.
    - Access to one’s own voice mailbox.
    - Payment Remittance – Transfer To.
- Web interface – access to the web interface is unavailable.
- Call control API – access to the call control API is unavailable.
If at any point your main site is badly damaged (e.g., by fire or floodwaters) and is beyond repair, you can re-configure your secondary site to act as the main one and have a fully-functioning PortaSwitch.
The procedure consists of the following steps:
- Initialize the Configuration server. On the secondary site, select the server that is most suitable to serve as the Configuration server and run the Configurator install script on it.
- Restore the Configuration server database from the backup, which is made daily and stored on the secondary site’s database server(s). Use the cfgdb.sh script to restore the Configuration server database from its backup copy and start all the services required for it to function (such as MySQL, Apache, etc.).
- Adjust the restored configuration.
  - On the Configuration server web interface, move servers from the failed main site to the secondary site (change their Site name property).
  - Make the standby database the master database: create a master database, replicate all the data from the standby database to the new master database and then delete the standby database.
  - Add instances that weren’t initially configured on the secondary site (such as web servers, CDR importer, etc.).
- Apply the configuration.
From that point on, the secondary site acts as the main one and provides a fully-functioning PortaSwitch, without restrictions or limitations.