Hi All,
In this blog I wanted to share a few of my experiences when migrating from an on-premises datacentre environment to a private (metro) cloud platform.
A lot of enterprises and even small to medium businesses use multiple datacentres for Disaster Recovery (DR) purposes. In many cases these datacentres are local to the same metropolitan area. The distance between those datacentres is usually between 10 km and 100 km to ensure floods, power outages or other (small) disasters don't take out both datacentres, and therefore services, at once. (Here in Brisbane, after our 2011 flood experience, this has become pretty much the rule of thumb.)
Consider the design below when the following applies:
- you have low-latency/high-bandwidth connectivity between datacentres
- you don't have/want a stretched storage cluster
- you need VLAN to VXLAN bridging to migrate workloads (no cross-vCenter deployment)
A quick walk-through of the diagram.
The top bit (in blue) shows a new cloud platform with 3 datacentres (or co-locations). The cloud platform has a management, a primary and a secondary DC.
The primary and secondary datacentres are connected through a fibre connection. The management DC is connected to both the primary and secondary DC, preferably with fibre connections; however, a routed or L2 connection suffices.
The management DC hosts the core management infrastructure components, in our case vSphere vCenter (version 6.5 here) and NSX Manager (version 6.3 here).
The primary and secondary DCs are used to host all the virtual workloads. How you spread your workloads depends on the DR requirements.
I usually split the services or applications into two categories: services with built-in recovery capability and services without it.
Examples of services with built-in recovery capabilities are Active Directory, DFS, SQL clusters and so on, as those services run multiple nodes. Place the primary node in the primary datacentre and the secondary node in the secondary datacentre. If there is a requirement for a quorum or witness, you can use the management DC for that purpose.
Services without built-in recovery you place in the primary DC. To ensure those services are DR capable, a third-party product is required to protect those workloads. In the example above Zerto provides that functionality.
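To make the split concrete, here is a minimal sketch of how you might record the placement plan. All service names, node names and groupings below are hypothetical examples, not part of the original design.

```python
# Illustrative placement map for the two categories described above.
# Service and node names are hypothetical.
PLACEMENT = {
    "built_in_recovery": {
        # primary node in the primary DC, secondary node in the secondary DC,
        # quorum/witness (if required) in the management DC
        "active-directory": {"primary_dc": "dc01", "secondary_dc": "dc02"},
        "sql-cluster": {"primary_dc": "sql01", "secondary_dc": "sql02",
                        "management_dc": "sql-witness"},
    },
    "no_built_in_recovery": {
        # runs in the primary DC only and is replicated to the secondary DC
        # by a third-party tool (Zerto in this design)
        "app-server": {"primary_dc": "app01", "protected_by": "zerto"},
    },
}
```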
The bottom bit
(red part) shows our existing infrastructure. We assume that the sites shown
here contain virtual workloads that need to be migrated. We also assume that
all the sites are interconnected with each other on a Layer 2 (VLAN) or
physical (fibre) level.
We connect our existing infrastructure with the new cloud platform using redundant fibre links. Ultimately we are going to stretch VLANs across, so make sure you have some form of loop prevention in place.
Let's say you want to migrate all your existing infrastructure onto a new cloud platform and you want to minimise any changes to your existing application servers and/or services. The best way to achieve this is to ensure you can migrate your virtual workloads as-is, meaning you don't want to re-IP any of the servers.
Now if your infrastructure is larger than, let's say, 200 virtual workloads, it becomes hard to migrate everything in one go. In that case, during the migration phase you will have virtual workloads on both the new cloud platform and your existing infrastructure, and you will need bridging capabilities between VLAN and VXLAN at some point.
Knowing this, and knowing that bridging is not (yet) supported for NSX universal objects, a cross-vCenter architecture is not going to work. Therefore we are back to using one vCenter and one NSX Manager.
With vSphere 6.5, vCenter now supports an active/passive configuration. Here we place the active node in the management DC and the passive node in the secondary datacentre. NSX Manager unfortunately doesn't have this functionality, so we place it in the management datacentre for now.
So what are the DR scenarios in the new cloud platform?
A: the primary DC goes dark.
This is the most disruptive DR scenario. The primary DC hosts all our active workloads. Any application/service with built-in recovery should continue to run as the secondary nodes become active over in the secondary DC. All services without built-in recovery are protected using a third-party tool (in our case Zerto) and can be revived in the secondary DC. The main reason for using Zerto in this design and not SRM is that SRM requires a minimum of two vCenters.
Because our management DC is still active we can invoke the DR plan straight away, without the need to recover any of the management components first, saving valuable time.
Quick note: for now the focus is just infrastructure; we will address the network components later.
B: the secondary DC goes dark
This scenario should not have any production impact. All services/applications continue to run as normal, as all the primary/active nodes are still up and running in the primary DC. During this time workloads are no longer protected, as the failover (secondary) DC is dark. This is OK as we are only protecting against one DC failure. Likewise we lose access to the Zerto manager for as long as the datacentre is dark. Again, this is not an issue, as even if we had access to the manager there would be no datacentre to recover to.
C: The management DC goes dark
Not too much drama: all the production applications and services continue to operate; however, we lose some management functionality.
vCenter becomes
active in the secondary datacentre and continues to operate. Unfortunately, the
NSX manager does not have that capability. Luckily there are 2 “easy” options:
- We protect the NSX Manager with Zerto, fail it over to the secondary datacentre and then start the re-deployment of the controllers.
- We redeploy the NSX Manager and restore a previously taken backup. A properly automated and well-maintained backup process is key to the success of this approach (see the sketch below).
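For the second option, a regular and verified backup is the whole game. As a rough illustration, the snippet below triggers an on-demand NSX Manager backup over the REST API; the endpoint path is an assumption based on the NSX-v appliance-management API and should be verified against the API guide for your exact version, and the hostname and credentials are placeholders.

```python
# Minimal sketch: trigger an on-demand NSX Manager backup via the REST API.
# The endpoint path is an assumption for NSX-v; verify it against the API
# guide for your version before relying on it.
import requests

NSX_MANAGER = "https://nsx-manager.example.local"   # hypothetical hostname
AUTH = ("admin", "changeme")                         # use a vault in practice

def trigger_nsx_backup():
    # The FTP/SFTP destination and schedule are assumed to be configured
    # already (Appliance Management > Backups & Restore).
    resp = requests.post(
        f"{NSX_MANAGER}/api/1.0/appliance-management/backuprestore/backup",
        auth=AUTH,
        verify=False,   # lab only; use proper certificates in production
    )
    resp.raise_for_status()
    print("Backup triggered:", resp.status_code)

if __name__ == "__main__":
    trigger_nsx_backup()
```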
Caveats
A big issue to keep in mind when configuring disaster recovery with third-party tools is that, inherently, during a failover the VM object on the primary site gets destroyed and a new VM object is created on the secondary site.
This becomes a
problem when using security tags in the NSX environment. Let’s say you create a
DNS tag and add the DNS servers to the tag. Then you create a security group
based on that tag with some firewall rules attached to it.
In a DR event the third-party tool destroys the original VM objects and creates new ones at the secondary site. In our scenario the DNS security group will be empty unless the third-party tool captures the tags and re-associates them with the new VM objects.
As the security
group is used for firewall configuration the traffic behavior will change. Depending
on your firewall default rules the traffic will be blocked or allowed.
As part of your DR plan you should make sure you can quickly tag the newly created VM objects with the same tags as the ones the workloads originally had.
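As a rough illustration of that re-tagging step, the sketch below re-attaches a security tag to a recovered VM via the NSX REST API. The endpoint path is an assumption based on the NSX-v 6.x security tag API, and all names, IDs and credentials are placeholders; verify everything against the API guide for your version.

```python
# Minimal sketch: re-apply NSX security tags to the VM objects created by the
# DR tool after a failover. Endpoint paths assume the NSX-v 6.x security tag
# API; verify them against the API guide for your version.
import requests

NSX_MANAGER = "https://nsx-manager.example.local"   # hypothetical hostname
AUTH = ("admin", "changeme")

# Mapping captured before the DR event (tag objectId -> VM names); the new
# vCenter MoRef IDs have to be looked up after the failover.
TAG_PLAN = {
    "securitytag-10": ["dns01", "dns02"],            # e.g. the "DNS" tag
}

def attach_tag(tag_id: str, vm_moref: str) -> None:
    """Attach an existing security tag to a VM by its vCenter MoRef."""
    url = f"{NSX_MANAGER}/api/2.0/services/securitytags/tag/{tag_id}/vm/{vm_moref}"
    resp = requests.put(url, auth=AUTH, verify=False)   # lab only
    resp.raise_for_status()

# Example: after Zerto has recovered dns01 as a new VM object (vm-1234):
# attach_tag("securitytag-10", "vm-1234")
```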
Side note: ensure the MTU of your network can handle 1600 bytes or more to allow for VXLAN encapsulation.
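A quick way to sanity-check this is a don't-fragment ping at the VXLAN packet size. The sketch below assumes Linux iputils ping flags and a placeholder address; on an ESXi host you would use vmkping against the VXLAN netstack instead.

```python
# Minimal sketch: verify the transport network carries 1600-byte frames by
# sending a don't-fragment ping (Linux iputils flags assumed).
import subprocess

REMOTE_VTEP = "192.0.2.10"   # hypothetical far-side VTEP address
# 1572 bytes of ICMP payload + 8 (ICMP header) + 20 (IP header) = 1600 bytes
result = subprocess.run(
    ["ping", "-M", "do", "-s", "1572", "-c", "3", REMOTE_VTEP],
    capture_output=True, text=True,
)
print(result.stdout)
if result.returncode != 0:
    print("MTU test failed - check jumbo frame configuration end to end.")
```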
What about the network?
Glad you asked. It's important to keep in mind that in this new cloud platform all networks are stretched (L2 connectivity) between the primary and secondary datacentre. This applies to both VLAN and VXLAN.
(Note: the diagram is a subsection of the top diagram)
In the above diagram the border routers/firewalls can be configured in an active/passive setup. For any northbound traffic (to internet providers) the configuration depends heavily on the capabilities of your service provider, so it's best to check with them.
For the internal network, because all networks are stretched, the gateway IP for each subnet can float between the primary and secondary DC (active/passive). This means the router/firewall can fail over from the active node to the passive node without affecting production services (theoretically). The diagram highlights this with the red lines as the active connections and the dotted/striped black lines as the passive ones.
If you use NSX Edge Services Gateways, ensure you have HA enabled and ensure the active edge runs in the primary DC and the standby in the secondary DC. As per VMware best practice these should run on a dedicated Edge ESXi cluster. The same goes for your DLR control VMs: ensure HA is enabled and the VMs are placed correctly.
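Enabling HA on an edge can also be scripted if you prefer. The sketch below is an assumption-heavy example against the NSX-v 6.x edge high-availability endpoint; the edge ID, hostname and credentials are placeholders, and the payload should be checked against the API guide for your version (active/standby placement itself is handled with DRS rules, not shown here).

```python
# Minimal sketch: enable HA on an NSX Edge via the REST API. Path and XML
# body are assumptions for NSX-v 6.x; verify before use.
import requests

NSX_MANAGER = "https://nsx-manager.example.local"   # hypothetical hostname
AUTH = ("admin", "changeme")
EDGE_ID = "edge-1"                                   # hypothetical edge ID

HA_CONFIG = """
<highAvailability>
  <enabled>true</enabled>
  <declareDeadTime>15</declareDeadTime>
</highAvailability>
"""

resp = requests.put(
    f"{NSX_MANAGER}/api/4.0/edges/{EDGE_ID}/highavailability/config",
    data=HA_CONFIG,
    headers={"Content-Type": "application/xml"},
    auth=AUTH,
    verify=False,   # lab only
)
resp.raise_for_status()
print("HA enabled on", EDGE_ID)
```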
Quick note: in this design you are not bound to using only NSX components. For example, those border routers/firewalls can be physical appliances like Check Point or Fortinet. You do need to be aware that those appliances usually don't speak VMware's flavour of VXLAN and therefore need to be connected to a VLAN-backed network.
To make sure you can still use VXLAN for your virtual workloads, use an NSX Edge Services Gateway or DLR with a VLAN uplink to the appliance, as shown in the diagram below.
Lastly, the transition:
To better explain, let's use the following use case.
In the existing infrastructure at Site A there are 2 networks and 4 servers that are to be migrated to the new cloud platform. For now, let's assume we have to migrate server 2 and server 4 at the same time as the first step. Then we need to migrate server 1, and lastly, in again a separate migration, server 3.
First things first, we need to create the corresponding networks (VXLANs) in the cloud and connect (bridge) them with the on-premises VLANs. There are several ways of doing this; for small environments you can probably get away with deploying the standalone NSX Edge in your existing environment and mapping the VLAN to the corresponding VXLAN. Alternatively, there are hardware-based VTEPs compatible with VMware that specialise in just that functionality.
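Within NSX itself, bridging is configured on a DLR. The sketch below is only illustrative: the endpoint and XML element names are assumptions about the NSX-v 6.x bridging API and should be verified against the API guide, and the edge, logical switch and portgroup IDs are placeholders.

```python
# Minimal sketch: configure a VLAN-to-VXLAN bridge on a DLR via the NSX REST
# API. Endpoint and element names are approximate assumptions for NSX-v 6.x.
import requests

NSX_MANAGER = "https://nsx-manager.example.local"   # hypothetical hostname
AUTH = ("admin", "changeme")
DLR_ID = "edge-2"                                    # hypothetical DLR edge ID

BRIDGE_CONFIG = """
<bridges>
  <bridge>
    <name>vlan100-to-vxlan5001</name>
    <virtualWire>virtualwire-10</virtualWire>      <!-- VXLAN logical switch -->
    <dvportGroup>dvportgroup-55</dvportGroup>      <!-- VLAN-backed portgroup -->
  </bridge>
</bridges>
"""

resp = requests.put(
    f"{NSX_MANAGER}/api/4.0/edges/{DLR_ID}/bridging/config",
    data=BRIDGE_CONFIG,
    headers={"Content-Type": "application/xml"},
    auth=AUTH,
    verify=False,   # lab only
)
resp.raise_for_status()
print("Bridge configured on", DLR_ID)
```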
Secondly, we probably want to start thinking about moving the network services (gateways) to the new cloud platform. In preparation, we deploy an Edge Services Gateway (in HA mode as explained before) and create a VLAN-backed connection back to the on-premises site. This link will be used for a point-to-point transport network later.
Now we are ready to migrate servers 2 and 4. There are lots of migration tools out there (Zerto, Veeam, etc.) that can copy across virtual workloads with a reasonably short outage window.
As you can see in the diagram above, servers 2 and 4 are now on the new cloud platform. The important part to highlight here is that server 1 and server 2 are still communicating with each other on the same L2 segment (the same goes for server 3 and server 4). From the servers' perspective nothing has really changed; however, the underlying network is different.
Also important to note is that when server 2 wants to talk to server 4, the traffic gets routed via the Site A router. In our case this can be acceptable as we are using a "metro" deployment. However, the further the distance between the datacentres, the higher the round-trip time and therefore the latency. This can become a problem, and you should take it into account when planning your migration by grouping servers according to their communication flows with other servers.
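As a rough back-of-the-envelope check (my own illustration, not from the original design), fibre propagation alone adds roughly 1 ms of round-trip time per 100 km of separation, and hairpinned traffic crosses the inter-site link twice in each direction:

```python
# Back-of-the-envelope latency estimate for traffic hairpinning via the
# Site A router. Assumes ~200,000 km/s signal speed in fibre; real networks
# add switching and routing delay on top of this.
def rtt_ms(distance_km: float, crossings_per_direction: int = 1) -> float:
    """Propagation-only round-trip time in milliseconds."""
    speed_km_per_ms = 200.0
    return 2 * crossings_per_direction * distance_km / speed_km_per_ms

# 50 km between sites: direct ~0.5 ms RTT, hairpinned via Site A ~1.0 ms
print(rtt_ms(50, 1), rtt_ms(50, 2))
```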
You might ask yourself why not migrate the gateways to the cloud platform now. Yes, you could do this now, however keep security in mind. In the current traffic flow all traffic is routed via the Site A router, which most likely is also the firewall. By leaving the gateway where it is we ensure that all routed traffic is still inspected.
Most likely you are moving to an NSX platform with the intent to start using its micro-segmentation capabilities. Let's say at this stage of the migration we did move the gateway services to the new cloud platform, and we are using the NSX Distributed Firewall (DFW) for security and the Distributed Logical Router (DLR) for routing purposes. See the diagram below.
In the scenario above, traffic flowing from server 1 to server 3 is no longer inspected.
The distributed firewall works at the vNIC level of objects known to the vCenter connected to the NSX Manager. In the diagram it's shown as the green firewall on the server's connection to its network. As both Server 1 and Server 3 are NOT connected to the cloud vCenter, the DFW does not have a vNIC to inspect traffic on and in turn cannot block that traffic. So Server 1 and Server 3 can communicate freely and are no longer secured. For this reason it's probably best to migrate network services per subnet only when no more workloads reside on the originating site.
OK, so we are halfway through our migration; the next one on the list is Server 1. As in the previous step, use your migration tool of choice.
As you can see, there are no more servers at Site A in the VLAN X network. We can now disconnect the gateway service at Site A for VLAN X and connect it to the routing instance in the new cloud platform. We also must update the routing tables on both routers (Site A and the new cloud platform) to reflect these changes. The Site A router/firewall probably requires updates to its firewall rules as well.
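If the cloud-side gateway is an NSX Edge, the routing table update can be scripted. The sketch below is illustrative only: the endpoint and XML shape are assumptions about the NSX-v 6.x static routing API (note that a PUT typically replaces the whole static routing config, so merge with the existing one), and the subnet, next hop, edge ID and credentials are placeholders.

```python
# Minimal sketch: push an updated static route to an NSX Edge after moving a
# gateway. Path and payload are assumptions for NSX-v 6.x; verify before use.
import requests

NSX_MANAGER = "https://nsx-manager.example.local"   # hypothetical hostname
AUTH = ("admin", "changeme")
EDGE_ID = "edge-1"                                   # hypothetical edge ID

STATIC_ROUTES = """
<staticRouting>
  <staticRoutes>
    <route>
      <network>10.10.100.0/24</network>   <!-- example: VLAN X subnet -->
      <nextHop>10.10.0.1</nextHop>        <!-- example: new cloud gateway -->
    </route>
  </staticRoutes>
</staticRouting>
"""

resp = requests.put(
    f"{NSX_MANAGER}/api/4.0/edges/{EDGE_ID}/routing/config/static",
    data=STATIC_ROUTES,
    headers={"Content-Type": "application/xml"},
    auth=AUTH,
    verify=False,   # lab only
)
resp.raise_for_status()
print("Static routes updated on", EDGE_ID)
```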
In the new cloud platform, any server talking to Server 3 will be inspected by both the DFW and the router/firewall at Site A.
Lastly, we need to migrate server 3.
In the same fashion as previously, we migrate server 3. We also change over the gateway service from Site A to the new cloud platform. Now all our servers are natively connected and routed using VXLAN only. Make sure you update the routing tables on the routers where needed.
As there are no more workloads at Site A, we can disconnect the VLAN-to-VXLAN bridges and tidy up any other configuration. Congratulations, you just migrated your infrastructure to the cloud.
Hope this
article helps with your migration planning and cloud design. Please provide any
comments or insights and I will update the article accordingly.
Cheers
Frits
https://www.linkedin.com/in/fritsreusens/