Deploying PortaSwitch across multiple sites


To meet customers’ expectations regarding the quality of communication services, the service provider needs to build an extra degree of reliability into the network and its applications, so that the service is not interrupted even if some network components are not functioning. How can this demand be addressed?

Per-server redundancy (two physical servers, each running a copy of an application such as PortaSIP) covers the failure of a single server (e.g., a hardware fault). But there is another class of “catastrophic” events that can render all servers installed in the same location (rack, hosting center, etc.) unavailable. Such events include natural disasters, power outages at the colocation provider, network routing errors, etc. The only way to overcome this and provide uninterrupted service is to have another set of servers in a different location that can continue operating during the outage at the “main” site.

It is important in this situation that the “secondary” site not only activates and begins providing service as soon as possible, but also that it automatically synchronizes the changes later on (updates balances, xDRs, etc.) to the “main” site.

All of the above is available as the PortaSwitch site redundancy solution, which allows service providers to:

  • Protect themselves against hosting facility outages.

  • Provide service to multiple geographic regions – even if network connectivity between those regions is lost.

  • Perform upgrades to new software versions with zero downtime. This last point is an essential benefit of deploying PortaSwitch across multiple sites: one might hope that a hosting facility outage never happens, but sooner or later a software upgrade will certainly be needed.

    Multi-site architecture

If the secondary site detects that the main site has become unavailable, it activates “stand-alone” mode and provides the service to end users using the latest available snapshot of the service configuration. The xDRs for consumed services and the changes in balance are accumulated in a separate database (on the stand-by database server) and are taken into account when authorizing subsequent activities, so there is no risk of balance overdraft while stand-alone mode is in use.

Main site is down

Once the main site becomes available again, the secondary site starts the process of synchronizing all of the accumulated changes to the main site and then the secondary site switches back to its normal (“stand-by”) mode.

Data merge

Typical deployment scenario


Let’s consider an example of a possible PortaSwitch deployment across multiple sites (a different deployment scenario might be fully cloud-based). The “primary” site hosts a standard PortaSwitch Clustered installation (the Configuration server, the main and replica database servers, a cluster of PortaBilling OCS and web servers, and the PortaSIP cluster).

PortaSwitch deployment across multiple sites

In the “normal” mode of operation, at the remote (secondary) site:

  • The stand-by database server continually retrieves changes from the main site, so it always has an up-to-date snapshot of the database from the main site.

  • The OCS servers are in “stand-by” mode, so they do not actively process any requests.

  • The PortaSIP cluster provides service as usual (processing incoming calls, playing the IVR, etc.). It uses the OCS servers on the main site for authentication and writes any changes (e.g., updated SIP phone location) into the primary database.

Another option is to deploy the secondary site (or sites) in a different city or country using WAN connectivity.

Geo-redundancy

Whatever the choice, proper interconnection between the sites is an essential requirement. There are many ways to organize the sites into a single corporate network; the choice of technology depends on the existing network infrastructure, the equipment, and the capabilities of your network provider.

Regardless of the technology you choose, all PortaSwitch servers must be connected via virtual (or physical) Layer 2 connection(s) and be configured as hosts in a single virtual (or physical) private network.

When disaster strikes


If there is an outage (for instance, a motherboard failure) on a single server (e.g., PortaBilling OCS server #1) at the primary site, the primary site continues to operate as usual. Another server within the cluster (PortaBilling OCS server #2 in our example) processes all the requests and there is no need to switch over to the secondary site.

The above statement is true for an outage on any server except the primary database, since an outage there would render all other servers on the primary site (billing engine, PortaSIP) unable to function normally.

Therefore, stand-alone mode is activated on the secondary site only if:

  • There is an outage on the primary database server.

  • There is an outage on all servers at the primary site (e.g., power failure).

  • There is a network outage that makes the primary site inaccessible from the secondary site.

In any of these cases, stand-alone mode is activated on the secondary site. This is a special mode of operation that allows the site to keep providing as many services to end users as possible (e.g., placing outgoing calls, receiving incoming calls, accessing the IVR auto attendant, placing calls using the calling card IVR, etc.). At the same time, we assume that the outage at the main site is (most likely) temporary, so once it is over, synchronization with the primary site will need to be performed. For that reason, operations that could cause a breach in data integrity between the sites are disabled in stand-alone mode: for instance, it is not possible to create new accounts, change service configurations, etc.
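
The three activation conditions above can be sketched as a simple decision function. This is only an illustration, not the actual PortaSwitch watchdog logic: the `PrimarySiteStatus` class, its fields, and `should_activate_stand_alone` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class PrimarySiteStatus:
    """Hypothetical health snapshot, as seen from the secondary site."""
    network_path_up: bool        # is the primary site reachable over the network?
    any_server_reachable: bool   # does at least one primary-site server respond?
    primary_db_reachable: bool   # does the primary database server respond?


def should_activate_stand_alone(status: PrimarySiteStatus) -> bool:
    """Return True if any of the three outage conditions holds."""
    if not status.network_path_up:
        return True   # primary site is inaccessible from the secondary site
    if not status.any_server_reachable:
        return True   # all servers at the primary site are down
    if not status.primary_db_reachable:
        return True   # outage on the primary database server
    # Anything else (e.g. one OCS server down) is absorbed by the cluster.
    return False
```

Note that the failure of a single non-database server never triggers a switchover in this sketch, which matches the behavior described for PortaBilling OCS server #1 above.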

When a service is provided on the secondary site, the billing engine continues to calculate applicable charges according to the product, tariff, and other billing parameters of the responsible party (e.g., the account that originated the call). Changes to the balance and new xDRs are written into a separate database (the “delta” database, which runs on the same physical server as the stand-by database). This allows the billing engine to keep track of already consumed services and avoid a balance overdraft, even if the secondary site has to operate in stand-alone mode for an extended period of time. It also produces a clear history of all charges. When the primary site becomes available again, these changes are automatically applied to the primary database and the secondary site is switched back to “normal” mode. All of this happens automatically, without any involvement of the PortaSwitch administrator, and an end user might not even notice that there were any problems at the main site.
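
As a rough illustration of how the delta database protects against overdraft, the sketch below computes the authorized call duration from the stand-by balance plus the accumulated balance adjustment. The function names are hypothetical, and the figures are those used in the example scenario in this chapter (balance $98.00, credit limit $100.00, rate $0.10/minute); the real billing engine takes many more parameters into account.

```python
from decimal import Decimal


def effective_balance(standby_balance: Decimal, delta_adjustment: Decimal) -> Decimal:
    """Balance from the stand-by database plus charges accumulated in the delta database."""
    return standby_balance + delta_adjustment


def max_call_minutes(standby_balance: Decimal, delta_adjustment: Decimal,
                     credit_limit: Decimal, rate_per_minute: Decimal) -> int:
    """Whole minutes that can be authorized before the credit limit is reached."""
    available = credit_limit - effective_balance(standby_balance, delta_adjustment)
    if available <= 0:
        return 0
    return int(available // rate_per_minute)


balance = Decimal("98.00")
limit = Decimal("100.00")
rate = Decimal("0.10")

print(max_call_minutes(balance, Decimal("0.00"), limit, rate))  # 20
print(max_call_minutes(balance, Decimal("1.20"), limit, rate))  # 8
print(max_call_minutes(balance, Decimal("1.70"), limit, rate))  # 3
```

`Decimal` is used instead of floating point so that monetary amounts such as $0.10 are represented exactly.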

Example scenario


Let’s walk through what happens during a primary site outage, using a single customer as an example. The customer “ABC” has account number 12345 provisioned on his IP phone. The customer has a current balance of $98.00, a credit limit of $100.00, and a rate of $0.10/minute for calls within the US. The primary and secondary sites are configured as previously described.

  • A power outage makes the entire primary site unavailable.

  • This event is detected by a watchdog script on the secondary site, which then switches the site into “stand-alone” mode (in particular, this enables the OCS server on the secondary site and instructs the PortaSIP cluster on the secondary site to use it as the authorization source).

  • If the user’s SIP phone was previously registered to the PortaSIP cluster on the primary site, during the next re-registration attempt the phone will detect that the cluster is no longer available and attempt to contact an alternative server (this list is either pre-programmed into the phone or obtained dynamically using DNS). When it reaches the PortaSIP cluster on the secondary site it registers there. (If the phone is already registered on the PortaSIP cluster on the secondary site, nothing changes.)

  • When the user attempts to make an outgoing call, an authorization request is sent to the PortaBilling OCS server on the secondary site.

  • The billing engine compares the currently available balance ($98.00) with the credit limit ($100.00) and authorizes the call for no more than 20 minutes ($2.00 of available funds at $0.10/minute).

  • When, after 12 minutes of conversation, the user hangs up, PortaSIP sends an accounting request to PortaBilling so that charges are applied.

  • When PortaBilling processes the request, it calculates the amount to be charged ($1.20) and stores the balance adjustment ($1.20) and the xDR for that call (with all call details such as CLI, CLD, call connect time, etc.) in the delta database.

  • Then, when the user makes another call and PortaSIP sends an authorization request, the billing engine calculates the “effective” balance as the sum of the balance in the stand-by database ($98.00) and the balance adjustment stored in the delta database ($1.20). The effective balance is therefore $99.20, which leaves $0.80 of available funds, so the call will have a time limit of 8 minutes.

  • The user hangs up after 5 minutes, so there is another xDR for that call with the charged amount of $0.50 written to the delta database and the balance adjustment is now $1.70.

  • The next call will only be authorized for the remaining $0.30 of available funds – and can only run until the balance reaches the credit limit. This prevents balance overdraft – even if the site operates in stand-alone mode and the balances in the stand-by database are not changed.

  • When the primary site comes back up, synchronization takes place.

  • First, funds in the amount of the balance adjustment ($1.70) are locked in the primary database. This ensures that if the customer now tries to use the service on the main site, he will only be able to spend the $0.30 that he has available.

  • Next, the secondary site is switched back to “normal” mode.

  • And then, individual xDRs are transferred to the primary database.

    This two-step process (first the funds lock, then the actual xDR transfer) prevents a balance overdraft on the main site while the xDR transfer is in progress. There can be a large number of xDRs (if the secondary site operated in stand-alone mode for an extended period of time), so it can take time to replicate all of them to the primary site.
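
The two-step synchronization can be sketched as follows. This is a simplified model under stated assumptions, not the actual PortaSwitch implementation: `PrimaryAccount`, `lock_funds`, and `transfer_xdrs` are hypothetical names, each xDR is reduced to its charged amount, and it is assumed here that every transferred xDR converts its share of the lock into a regular charge.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class PrimaryAccount:
    """Hypothetical view of an account in the primary database."""
    balance: Decimal
    credit_limit: Decimal
    locked: Decimal = Decimal("0.00")
    xdrs: list = field(default_factory=list)

    def available(self) -> Decimal:
        # Locked funds cannot be spent on the main site.
        return self.credit_limit - self.balance - self.locked


def lock_funds(account: PrimaryAccount, delta_charges: list) -> None:
    """Step 1: lock the total balance adjustment in one quick operation."""
    account.locked += sum(delta_charges, Decimal("0.00"))


def transfer_xdrs(account: PrimaryAccount, delta_charges: list) -> None:
    """Step 2: move xDRs one by one; this may take a long time."""
    for charge in delta_charges:
        account.xdrs.append(charge)
        account.balance += charge   # the charge becomes part of the balance
        account.locked -= charge    # assumption: its share of the lock is released


account = PrimaryAccount(balance=Decimal("98.00"), credit_limit=Decimal("100.00"))
delta = [Decimal("1.20"), Decimal("0.50")]

lock_funds(account, delta)
print(account.available())   # 0.30, even though no xDR has been transferred yet

transfer_xdrs(account, delta)
print(account.balance, account.available())   # 99.70 0.30
```

In this model the available funds stay at $0.30 before, during, and after the xDR transfer, which is the property the two-step process is meant to guarantee.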

Stand-alone mode restrictions


The secondary site does not differentiate between these two types of events:

  • The primary site is down or has been destroyed (power failure, hurricane, earthquake, etc.).

  • The primary site is still up and operational, but connectivity between the primary site and the secondary site is lost. For instance, the primary site is in city A and the secondary site is in city B. While there is no connectivity between the two sites, each one functions normally, and in each city there are users using the service.

When the secondary site operates in stand-alone mode, it is essential that data integrity between the primary and secondary sites is protected at all times. This means that no operation may be allowed to run on the secondary site if it could cause a data conflict when the data changes are merged back to the primary site.

Let’s assume that during a connectivity outage between the sites the service configuration is changed as follows:

  • The end user, connected to the secondary site, sets up call forwarding to phone number 123.

  • On the primary site, the administrator also sets up call forwarding for this user to phone number 456.

Once connectivity between the sites is restored and a data merge is performed, it would be unclear which configuration should be regarded as valid (i.e., which number should end up as the forwarding number). This is called a “split brain” problem and, of course, must be prevented from happening.
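
To make the conflict concrete, here is a minimal sketch that compares the configuration on both sites against the last common snapshot. The `find_conflicts` function and the `call_forwarding` attribute name are hypothetical and used only for illustration; PortaSwitch does not resolve such conflicts, it avoids them by restricting what can be changed in stand-alone mode.

```python
def find_conflicts(baseline: dict, primary: dict, secondary: dict) -> dict:
    """Return attributes that were changed to different values on both sites."""
    conflicts = {}
    for key, original in baseline.items():
        on_primary = primary.get(key)
        on_secondary = secondary.get(key)
        changed_on_primary = on_primary != original
        changed_on_secondary = on_secondary != original
        if changed_on_primary and changed_on_secondary and on_primary != on_secondary:
            # Neither value can be preferred automatically.
            conflicts[key] = (on_primary, on_secondary)
    return conflicts


snapshot = {"call_forwarding": None}       # state before the outage
primary = {"call_forwarding": "456"}       # set by the administrator
secondary = {"call_forwarding": "123"}     # set by the end user

print(find_conflicts(snapshot, primary, secondary))
# {'call_forwarding': ('456', '123')}
```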

Therefore, although the secondary site can detect that the primary site is not accessible, it has to treat the primary site as if it were still operating normally, with users making calls, administrators making changes via the web interface, and data being changed there. Thus, the secondary site (when activated) does not perform all of the functions of the primary site; stand-alone mode requires that some functionality be disabled.

In short, in stand-alone mode, the only operations allowed are those that change the balance and produce xDRs. All other changes (e.g., changing service configuration attributes or creating new entities) are prohibited. While the secondary site is in stand-alone mode, users