Hi All,
In this blog I wanted to share a few of my experiences when migrating from an on-premises datacentre environment to a private (metro) cloud platform.
A lot of enterprises and even small to medium businesses use multiple datacentres for Disaster Recovery (DR) purposes. In many cases these datacentres are local to the same metropolitan area. The distance between those datacentres is usually between 10 km and 100 km to ensure floods, power outages or other (small) disasters don't take out both datacentres, and therefore services, at once. (Here in Brisbane, after our 2011 flood experience, this has become pretty much the rule of thumb.)
Consider the design below when the following applies:
- you have low-latency/high-bandwidth connectivity between datacentres
- you don't have/want a stretched storage cluster
- you need VLAN to VXLAN bridging to migrate workloads (no cross-vCenter deployment)
A quick walk-through of the diagram.
The top bit (in blue) shows a new cloud platform with 3 datacentres (or co-locations). The cloud platform has a management, a primary and a secondary DC.
The primary and secondary datacentres are connected through a fibre connection. The management DC is connected to both the primary and secondary DC, preferably with fibre connections; however, a routed or L2 connection suffices.
The management DC hosts the core management infrastructure components, in our case vSphere vCenter (version 6.5 here) and NSX Manager (version 6.3 here).
The primary and secondary DCs are used to host all the virtual workloads. How you spread your workloads depends on the DR requirements.
I usually split the services or applications into two categories: services with built-in recovery capability and services without it.
Examples of services with built-in recovery capabilities are Active Directory, DFS, SQL clusters and so on, as those services run multiple nodes. Place the primary node in the primary datacentre and the secondary node in the secondary datacentre. If there is a requirement for a quorum or witness, you can use the management DC for that purpose.
Services without built-in recovery you place in the primary DC. To ensure those services are DR capable, a third-party product is required to protect those workloads. In the example above Zerto provides that functionality.
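To make the split concrete, here is a minimal sketch of how you might record the placement plan. All service names, node names and groupings below are hypothetical examples, not part of the original design.

```python
# Illustrative placement map for the two categories described above.
# Service and node names are hypothetical.
PLACEMENT = {
    "built_in_recovery": {
        # primary node in the primary DC, secondary node in the secondary DC,
        # quorum/witness (if required) in the management DC
        "active-directory": {"primary_dc": "dc01", "secondary_dc": "dc02"},
        "sql-cluster": {"primary_dc": "sql01", "secondary_dc": "sql02",
                        "management_dc": "sql-witness"},
    },
    "no_built_in_recovery": {
        # runs in the primary DC only and is replicated to the secondary DC
        # by a third-party tool (Zerto in this design)
        "app-server": {"primary_dc": "app01", "protected_by": "zerto"},
    },
}
```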
The bottom bit
(red part) shows our existing infrastructure. We assume that the sites shown
here contain virtual workloads that need to be migrated. We also assume that
all the sites are interconnected with each other on a Layer 2 (VLAN) or
physical (fibre) level.
We connect our existing infrastructure with the new cloud platform using redundant fibre links. Ultimately we are going to stretch VLANs across, so make sure you have some form of loop prevention in place.
Let's say you want to migrate all your existing infrastructure onto a new cloud platform and you want to minimise any changes to your existing application servers and/or services. The best way to achieve this is to ensure you can migrate your virtual workloads as-is, meaning you don't want to re-IP any of the servers.
Now if your infrastructure is larger than, let's say, 200 virtual workloads, it becomes hard to migrate everything in one go. In that case, during the migration phase you will have virtual workloads on both the new cloud platform and your existing infrastructure, and you will need bridging capabilities between VLAN and VXLAN at some point.
Knowing this, and knowing that bridging is not (yet) supported for NSX universal objects, a cross-vCenter architecture is not going to work. Therefore we are back to using one vCenter and one NSX Manager.
With vSphere 6.5, vCenter now supports an active/passive configuration. Here we place the active node in the management DC and the passive node in the secondary datacentre. NSX Manager unfortunately doesn't have this functionality, so we place it in the management datacentre for now.
So what are the DR scenarios in the new cloud platform?
A: the primary DC goes dark.
This is the most disruptive DR scenario. The primary DC hosts all our active workloads. Any application/service with built-in recovery should continue to run as the secondary nodes become active over in the secondary DC. All services without built-in recovery are protected using a third-party tool (in our case Zerto) and can be revived in the secondary DC. The main reason for using Zerto in this design and not SRM is that SRM requires a minimum of two vCenters.
Because our management DC is still active we can invoke the DR plan straight away, without the need to recover any of the management components first, saving valuable time.
Quick note: for now the focus is just infrastructure; we will address the network components later.
B: the secondary DC goes dark
This scenario should not have any production impact. All services/applications continue to run as normal, as all the primary/active nodes are still up and running in the primary DC. During this time workloads are no longer protected, as the failover (secondary) DC is dark. This is OK as we are only protecting against one DC failure. Likewise we lose access to the Zerto manager for as long as the datacentre is dark. Again, this is not an issue, as even if we had access to the manager there would be no datacentre to recover to.
C: The management DC goes dark
Not too much drama: all the production applications and services continue to operate; however, we lose some management functionality.
vCenter becomes
active in the secondary datacentre and continues to operate. Unfortunately, the
NSX manager does not have that capability. Luckily there are 2 “easy” options:
- We protect the NSX Manager with Zerto, fail it over to the secondary datacentre and then start the re-deployment of the controllers.
- We redeploy the NSX Manager and restore a previously taken backup. A properly automated and well-maintained backup process is key to the success of this approach (see the sketch below).
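For the second option, a regular and verified backup is the whole game. As a rough illustration, the snippet below triggers an on-demand NSX Manager backup over the REST API; the endpoint path is an assumption based on the NSX-v appliance-management API and should be verified against the API guide for your exact version, and the hostname and credentials are placeholders.

```python
# Minimal sketch: trigger an on-demand NSX Manager backup via the REST API.
# The endpoint path is an assumption for NSX-v; verify it against the API
# guide for your version before relying on it.
import requests

NSX_MANAGER = "https://nsx-manager.example.local"   # hypothetical hostname
AUTH = ("admin", "changeme")                         # use a vault in practice

def trigger_nsx_backup():
    # The FTP/SFTP destination and schedule are assumed to be configured
    # already (Appliance Management > Backups & Restore).
    resp = requests.post(
        f"{NSX_MANAGER}/api/1.0/appliance-management/backuprestore/backup",
        auth=AUTH,
        verify=False,   # lab only; use proper certificates in production
    )
    resp.raise_for_status()
    print("Backup triggered:", resp.status_code)

if __name__ == "__main__":
    trigger_nsx_backup()
```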
Caveats
A big issue to keep in mind when configuring disaster recovery with third-party tools is that, inherently, during a failover the VM object on the primary site gets destroyed and a new VM object is created on the secondary site.
This becomes a
problem when using security tags in the NSX environment. Let’s say you create a
DNS tag and add the DNS servers to the tag. Then you create a security group
based on that tag with some firewall rules attached to it.
In a DR event the third-party tool destroys the original VM objects and creates new ones at the secondary site. In our scenario the DNS security group will be empty unless the third-party tool captures the tags and re-associates them with the new VM objects.
As the security
group is used for firewall configuration the traffic behavior will change. Depending
on your firewall default rules the traffic will be blocked or allowed.
As part of your DR plan you should make sure you can quickly tag the newly created VM objects with the same tags as the ones the workloads originally had.
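As a rough illustration of that re-tagging step, the sketch below re-attaches a security tag to a recovered VM via the NSX REST API. The endpoint path is an assumption based on the NSX-v 6.x security tag API, and all names, IDs and credentials are placeholders; verify everything against the API guide for your version.

```python
# Minimal sketch: re-apply NSX security tags to the VM objects created by the
# DR tool after a failover. Endpoint paths assume the NSX-v 6.x security tag
# API; verify them against the API guide for your version.
import requests

NSX_MANAGER = "https://nsx-manager.example.local"   # hypothetical hostname
AUTH = ("admin", "changeme")

# Mapping captured before the DR event (tag objectId -> VM names); the new
# vCenter MoRef IDs have to be looked up after the failover.
TAG_PLAN = {
    "securitytag-10": ["dns01", "dns02"],            # e.g. the "DNS" tag
}

def attach_tag(tag_id: str, vm_moref: str) -> None:
    """Attach an existing security tag to a VM by its vCenter MoRef."""
    url = f"{NSX_MANAGER}/api/2.0/services/securitytags/tag/{tag_id}/vm/{vm_moref}"
    resp = requests.put(url, auth=AUTH, verify=False)   # lab only
    resp.raise_for_status()

# Example: after Zerto has recovered dns01 as a new VM object (vm-1234):
# attach_tag("securitytag-10", "vm-1234")
```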
Side note: ensure the MTU of your network can handle 1600 bytes or more to allow for VXLAN encapsulation.
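A quick way to sanity-check this is a don't-fragment ping at the VXLAN packet size. The sketch below assumes Linux iputils ping flags and a placeholder address; on an ESXi host you would use vmkping against the VXLAN netstack instead.

```python
# Minimal sketch: verify the transport network carries 1600-byte frames by
# sending a don't-fragment ping (Linux iputils flags assumed).
import subprocess

REMOTE_VTEP = "192.0.2.10"   # hypothetical far-side VTEP address
# 1572 bytes of ICMP payload + 8 (ICMP header) + 20 (IP header) = 1600 bytes
result = subprocess.run(
    ["ping", "-M", "do", "-s", "1572", "-c", "3", REMOTE_VTEP],
    capture_output=True, text=True,
)
print(result.stdout)
if result.returncode != 0:
    print("MTU test failed - check jumbo frame configuration end to end.")
```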
What about the network?
Glad you asked. It's important to keep in mind that in this new cloud platform all networks are stretched (L2 connectivity) between the primary and secondary datacentre. This applies to both VLAN and VXLAN.
(Note: the diagram is a subsection of the top diagram)
In the above diagram the border routers/firewalls can be configured in an active/passive setup. For any northbound traffic (to internet providers) the configuration depends heavily on the capabilities of your service provider, so it's best to check with them.
For the internal network, because all networks are stretched, the gateway IP for each subnet can float between the primary and secondary DC (active/passive). This means the router/firewall can fail over from the active node to the passive node without affecting production services (theoretically). The diagram highlights this with the red lines as the active connections and the dotted/striped black lines as the passive ones.
If you use NSX Edge Services Gateways, ensure you have HA enabled and ensure the active edge runs in the primary DC and the standby in the secondary DC. As per VMware best practice these should run on a dedicated Edge ESXi cluster. The same goes for your DLR control VMs: ensure HA is enabled and the VMs are placed correctly.
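Enabling HA on an edge can also be scripted if you prefer. The sketch below is an assumption-heavy example against the NSX-v 6.x edge high-availability endpoint; the edge ID, hostname and credentials are placeholders, and the payload should be checked against the API guide for your version (active/standby placement itself is handled with DRS rules, not shown here).

```python
# Minimal sketch: enable HA on an NSX Edge via the REST API. Path and XML
# body are assumptions for NSX-v 6.x; verify before use.
import requests

NSX_MANAGER = "https://nsx-manager.example.local"   # hypothetical hostname
AUTH = ("admin", "changeme")
EDGE_ID = "edge-1"                                   # hypothetical edge ID

HA_CONFIG = """
<highAvailability>
  <enabled>true</enabled>
  <declareDeadTime>15</declareDeadTime>
</highAvailability>
"""

resp = requests.put(
    f"{NSX_MANAGER}/api/4.0/edges/{EDGE_ID}/highavailability/config",
    data=HA_CONFIG,
    headers={"Content-Type": "application/xml"},
    auth=AUTH,
    verify=False,   # lab only
)
resp.raise_for_status()
print("HA enabled on", EDGE_ID)
```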
Quick note: in this design you are not bound to using only NSX components. For example, those border routers/firewalls can be physical appliances like Check Point or Fortinet. You do need to be aware that those appliances usually don't speak VMware's flavour of VXLAN and therefore need to be connected to a VLAN-backed network.
To make sure you can still use VXLAN for your virtual workloads, use an NSX Edge Services Gateway or DLR with a VLAN uplink to the appliance, as shown in the diagram below.
Lastly, the transition:
To better explain, let's use the following use case.
In the existing infrastructure at Site A there are 2 networks and 4 servers that are to be migrated to the new cloud platform. For now, let's assume we have to migrate server 2 and server 4 at the same time as the first step. Then we need to migrate server 1, and lastly, in again a separate migration, server 3.
First things first, we need to create the corresponding networks (VXLANs) in the cloud and connect (bridge) them with the on-premises VLANs. There are several ways of doing this; for small environments you can probably get away with deploying the standalone NSX Edge in your existing environment and mapping the VLAN to the corresponding VXLAN. Alternatively, there are hardware-based VTEPs compatible with VMware that specialise in just that functionality.
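Within NSX itself, bridging is configured on a DLR. The sketch below is only illustrative: the endpoint and XML element names are assumptions about the NSX-v 6.x bridging API and should be verified against the API guide, and the edge, logical switch and portgroup IDs are placeholders.

```python
# Minimal sketch: configure a VLAN-to-VXLAN bridge on a DLR via the NSX REST
# API. Endpoint and element names are approximate assumptions for NSX-v 6.x.
import requests

NSX_MANAGER = "https://nsx-manager.example.local"   # hypothetical hostname
AUTH = ("admin", "changeme")
DLR_ID = "edge-2"                                    # hypothetical DLR edge ID

BRIDGE_CONFIG = """
<bridges>
  <bridge>
    <name>vlan100-to-vxlan5001</name>
    <virtualWire>virtualwire-10</virtualWire>      <!-- VXLAN logical switch -->
    <dvportGroup>dvportgroup-55</dvportGroup>      <!-- VLAN-backed portgroup -->
  </bridge>
</bridges>
"""

resp = requests.put(
    f"{NSX_MANAGER}/api/4.0/edges/{DLR_ID}/bridging/config",
    data=BRIDGE_CONFIG,
    headers={"Content-Type": "application/xml"},
    auth=AUTH,
    verify=False,   # lab only
)
resp.raise_for_status()
print("Bridge configured on", DLR_ID)
```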
Secondly, we probably want to start thinking about moving the network services (gateways) to the new cloud platform. In preparation, we deploy an Edge Services Gateway (in HA mode as explained before) and create a VLAN-backed connection back to the on-premises site. This link will be used for a point-to-point transport network later.
Now we are ready to migrate servers 2 and 4. There are lots of migration tools out there (Zerto, Veeam, etc.) that can copy across virtual workloads with a reasonably short outage window.
As you can see in the diagram above, servers 2 and 4 are now on the new cloud platform. The important part to highlight here is that server 1 and server 2 are still communicating with each other on the same L2 segment (the same goes for server 3 and server 4). From the servers' perspective nothing has really changed; however, the underlying network is different.
Also important to note is that when server 2 wants to talk to server 4, the traffic gets routed via the Site A router. In our case this can be acceptable as we are using a "metro" deployment. However, the further the distance between the datacentres, the higher the round-trip time and therefore the latency. This can become a problem, and you should take it into account when planning your migration by grouping servers according to their communication flows with other servers.
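As a rough back-of-the-envelope check (my own illustration, not from the original design), fibre propagation alone adds roughly 1 ms of round-trip time per 100 km of separation, and hairpinned traffic crosses the inter-site link twice in each direction:

```python
# Back-of-the-envelope latency estimate for traffic hairpinning via the
# Site A router. Assumes ~200,000 km/s signal speed in fibre; real networks
# add switching and routing delay on top of this.
def rtt_ms(distance_km: float, crossings_per_direction: int = 1) -> float:
    """Propagation-only round-trip time in milliseconds."""
    speed_km_per_ms = 200.0
    return 2 * crossings_per_direction * distance_km / speed_km_per_ms

# 50 km between sites: direct ~0.5 ms RTT, hairpinned via Site A ~1.0 ms
print(rtt_ms(50, 1), rtt_ms(50, 2))
```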
You might ask yourself why not migrate the gateways to the cloud platform now. Yes, you could do this now, however keep security in mind. In the current traffic flow all traffic is routed via the Site A router, which most likely is also the firewall. By leaving the gateway where it is we ensure that all routed traffic is still inspected.
Most likely you are moving to an NSX platform with the intent to start using its micro-segmentation capabilities. Let's say at this stage of the migration we did move the gateway services to the new cloud platform, and we are using the NSX Distributed Firewall (DFW) for security and the Distributed Logical Router (DLR) for routing purposes. See the diagram below.
In the scenario above, traffic flowing from server 1 to server 3 is no longer inspected.
The distributed firewall works at the vNIC level of objects known to the vCenter connected to the NSX Manager. In the diagram it's shown as the green firewall on the server's connection to its network. As both Server 1 and Server 3 are NOT connected to the cloud vCenter, the DFW does not have a vNIC to inspect traffic on and in turn cannot block that traffic. So Server 1 and Server 3 can communicate freely and are no longer secured. For this reason it's probably best to migrate network services per subnet only when no more workloads reside on the originating site.
OK, so we are halfway through our migration; the next one on the list is Server 1. As in the previous step, use your migration tool of choice.
As you can see, there are no more servers at Site A in the VLAN X network. We can now disconnect the gateway service at Site A for VLAN X and connect it to the routing instance in the new cloud platform. We also must update the routing tables on both routers (Site A and the new cloud platform) to reflect these changes. The Site A router/firewall probably requires updates to its firewall rules as well.
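If the cloud-side gateway is an NSX Edge, the routing table update can be scripted. The sketch below is illustrative only: the endpoint and XML shape are assumptions about the NSX-v 6.x static routing API (note that a PUT typically replaces the whole static routing config, so merge with the existing one), and the subnet, next hop, edge ID and credentials are placeholders.

```python
# Minimal sketch: push an updated static route to an NSX Edge after moving a
# gateway. Path and payload are assumptions for NSX-v 6.x; verify before use.
import requests

NSX_MANAGER = "https://nsx-manager.example.local"   # hypothetical hostname
AUTH = ("admin", "changeme")
EDGE_ID = "edge-1"                                   # hypothetical edge ID

STATIC_ROUTES = """
<staticRouting>
  <staticRoutes>
    <route>
      <network>10.10.100.0/24</network>   <!-- example: VLAN X subnet -->
      <nextHop>10.10.0.1</nextHop>        <!-- example: new cloud gateway -->
    </route>
  </staticRoutes>
</staticRouting>
"""

resp = requests.put(
    f"{NSX_MANAGER}/api/4.0/edges/{EDGE_ID}/routing/config/static",
    data=STATIC_ROUTES,
    headers={"Content-Type": "application/xml"},
    auth=AUTH,
    verify=False,   # lab only
)
resp.raise_for_status()
print("Static routes updated on", EDGE_ID)
```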
In the new cloud platform, any server talking to Server 3 will be inspected by both the DFW and the router/firewall at Site A.
Lastly, we need to migrate server 3.
In the same fashion as previously, we migrate server 3. We also change over the gateway service from Site A to the new cloud platform. Now all our servers are natively connected and routed using VXLAN only. Make sure you update the routing tables on the routers where needed.
As there are no more workloads at Site A, we can disconnect the VLAN-to-VXLAN bridges and tidy up any other configuration. Congratulations, you just migrated your infrastructure to the cloud.
Hope this
article helps with your migration planning and cloud design. Please provide any
comments or insights and I will update the article accordingly.
Cheers
Frits
https://www.linkedin.com/in/fritsreusens/