I have been hearing more and more people talk about the virtues of using Cluster Continuous Replication (CCR) with a node in each site. This talk has escalated now that Windows Server 2008 has been released to manufacturing. With Windows Server 2008, Failover Cluster environments can now have nodes in multiple sites without having to use Virtual LANs (VLANs) to provide the networking support.
On the surface, CCR and Windows Server 2008 in a multi-site cluster sounds like the answer to many organizations' needs. Obviously, I am setting up the argument against this kind of implementation. OK, maybe it wasn’t obvious to some of you. <G>
Anyways, here is a rough sketch (meaning that lots of components not discussed here are left out, e.g. CAS, DC/GC, and DNS) of how this would look if you had two physical locations that are both in the same AD site to support CCR. In the drawing, Node1 is the active node, and replication traffic flows over the WAN link to Node2, which is the passive node. If you look at the drawing, you should immediately see some issues.
Consideration number 1. Where should you put the File Share Witness (FSW)? In this drawing, it is in the site on the left. Well, what if that is the site that goes down in a flood, tornado, meteor strike, or whatever? If the FSW is lost along with one of the nodes, there will not be an automated failover. OK, this is fixable since we can manually force the cluster to start on the surviving node, but it will impact life in the real world if there is a major disaster, especially if you lose your administrators along with the site. Make sure you document the process in your DR documentation, as somebody else might need to perform the task.
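The quorum arithmetic behind this consideration can be sketched in a few lines of Python. This is a simplified model, not cluster code: it assumes a two-node cluster in which each node and the FSW carries exactly one vote and a strict majority of votes is needed to stay up automatically. The names `VOTERS`, `has_quorum`, and `survives_site_loss` are made up for illustration.

```python
# Simplified model of a two-node cluster with a File Share Witness (FSW).
# Assumption: each node and the FSW holds exactly one vote, and the cluster
# stays up automatically only while a strict majority of votes is reachable.

# Hypothetical layout matching the drawing: the FSW sits with Node1.
VOTERS = {"Node1": "left", "FSW": "left", "Node2": "right"}


def has_quorum(surviving_votes: int, total_votes: int) -> bool:
    """Return True when the survivors hold a strict majority of all votes."""
    return surviving_votes > total_votes // 2


def survives_site_loss(voters: dict[str, str], lost_site: str) -> bool:
    """Return True if the cluster keeps quorum after an entire site is lost."""
    survivors = [name for name, site in voters.items() if site != lost_site]
    return has_quorum(len(survivors), len(voters))


if __name__ == "__main__":
    for site in ("left", "right"):
        kept = survives_site_loss(VOTERS, site)
        print(f"lose {site} site -> quorum kept: {kept}")
```

Under these assumptions, losing the right-hand site leaves two of three votes and the cluster stays up, while losing the left-hand site leaves Node2 with one vote of three, so it has to be forced to start manually. Moving the FSW to the right-hand site only changes which site is the bad one to lose.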
Consideration number 2. How do you know which Hub Transport (HT) to use for the transport dumpster in order to back fill the surviving node? After all, HT1 and HT2 are in the same AD site, which means that they would be used in a load balanced manner, so neither one holds everything needed for a full replay of lost transactions. Yes, you can hard code which HT to use, but that makes no sense to me in an HA environment, because you would lose the redundancy/load balancing functionality gained by having multiple HTs in a site. Of course, you might even have two in the same physical location. Also, let’s say you hard code HT1 for the Clustered Mailbox Server (CMS) and the CMS is active on Node1. If you do that, then you lose the transport dumpster along with the location in the event of a major disaster. OK, so let’s say you hard code HT2 for the CMS, which is active on Node1. That would mean all of your mail traffic would be going across the WAN link, which is not exactly a good idea.
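To make the trade-off easier to see, here is a toy Python model of the three choices. It is only a sketch under stated assumptions: Node1 is active in the left location, HT1 sits in the left location and HT2 in the right, and the transport dumpster lives on whichever HT(s) the CMS submits mail through. The option names and function names are hypothetical and do not correspond to any Exchange setting.

```python
# Toy model of the Hub Transport (HT) choices for a stretched CCR cluster.
# Assumptions: Node1 is active in the "left" location, HT1 is in "left",
# HT2 is in "right", and the transport dumpster lives on whichever HT(s)
# the CMS submits mail through.

HT_LOCATION = {"HT1": "left", "HT2": "right"}
ACTIVE_LOCATION = "left"

# Hypothetical option names mapped to the HTs the CMS would use.
OPTIONS = {
    "load balanced": {"HT1", "HT2"},
    "pinned to HT1": {"HT1"},
    "pinned to HT2": {"HT2"},
}


def full_replay_after_site_loss(used_hts: set[str]) -> bool:
    """Full replay needs every dumpster holder to survive the lost location."""
    return all(HT_LOCATION[ht] != ACTIVE_LOCATION for ht in used_hts)


def sends_mail_over_wan(used_hts: set[str]) -> bool:
    """Mail crosses the WAN if any HT in use sits in the other location."""
    return any(HT_LOCATION[ht] != ACTIVE_LOCATION for ht in used_hts)


def evaluate(option: str) -> tuple[bool, bool]:
    """Return (full replay possible, mail crosses WAN) for the given option."""
    used = OPTIONS[option]
    return full_replay_after_site_loss(used), sends_mail_over_wan(used)


if __name__ == "__main__":
    for name in OPTIONS:
        replay, wan = evaluate(name)
        print(f"{name}: full replay={replay}, mail over WAN={wan}")
```

In this model no option gives both a complete dumpster after losing the active location and mail flow that stays off the WAN, which is the point of the consideration.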
Consideration number 3. What about the Wide Area Network (WAN) and its uncontrolled use by many different services? After all, if both physical locations are in the same AD site, will you have issues with clients logging on and authenticating across the WAN link? Will you have problems with the Clustered Mailbox Server (CMS) using the Hub Transport (HT) on the other side of the WAN link? What about the HT using the wrong Domain Controller/Global Catalog server, so that all of its queries run over the WAN link? Again, you can hard code some of these settings for some applications and services, but even if you do that, there is again the issue of potentially losing redundancy/load balancing.
Consideration number 4. Using Windows Server 2008 and its multi-site improvements impacts DNS and name resolution. For example, when Node1 is active, its VIP address is registered with the CMS name. If there is a failover, then the other VIP (for the physical location of Node2) must be registered in DNS, and the DNS update needs to be replicated to all DNS servers in the organization. During the update and shortly after, there will be clients that still have the old VIP address in their caches, so the name will resolve incorrectly until those caches are updated. This is not an Exchange issue, but something else that should be considered.
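A back-of-the-envelope Python sketch shows how the stale window adds up. It is a deliberately simple model: it assumes the only factors are the time for the DNS update to reach the client's DNS server and the TTL of the cached record, and it ignores any intermediate resolver caches. The function name and the numbers in the example are hypothetical, not defaults taken from Windows or Exchange.

```python
# Rough model of how long a client may keep resolving the CMS name to the
# old VIP after a failover.
# Assumptions: the worst case is a client that looked up the name just before
# its DNS server received the replicated update, and the client honors the
# record's TTL. Intermediate resolver caches are ignored.


def worst_case_stale_seconds(replication_delay_s: int, record_ttl_s: int) -> int:
    """Return the longest time a client could hold the old address."""
    if replication_delay_s < 0 or record_ttl_s < 0:
        raise ValueError("delays and TTLs must be non-negative")
    # The client can still receive the old record right up until its DNS
    # server gets the update, and then caches it for a full TTL.
    return replication_delay_s + record_ttl_s


if __name__ == "__main__":
    # Hypothetical values: 15 minutes of DNS replication, 20 minute TTL.
    stale = worst_case_stale_seconds(replication_delay_s=900, record_ttl_s=1200)
    print(f"clients may use the old VIP for up to {stale // 60} minutes")
```

With these made-up numbers, a client could keep trying the old address for up to 35 minutes after the failover, which is worth comparing against whatever downtime target the cluster was supposed to meet.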
So, what do I recommend? I am glad you asked that question. If you didn’t, too bad, I will answer it anyways.
I highly recommend using CCR within a single physical site that is also an AD site. For disaster recovery reasons, I recommend using Standby Continuous Replication (SCR) to copy transactions to a remote site’s Exchange mailbox server.
FYI, I updated this post based on some of Scott Schnoll’s comments to me. Scott had some excellent points regarding my concerns listed above. I won’t go through them one by one, but it basically came down to my assumption that CCR in a multi-site (stretched AD site) environment would be configured for automatic failover. I made this assumption because if we were looking for a manual process that requires administrator intervention to get things up and running, then we should be talking SCR, not CCR. High Availability (HA) and Disaster Recovery (DR) are very different in my mind. HA means that processes are automated to reduce downtime to a minimum. DR is something that is done when there is a major disaster that requires steps to be taken to recover the environment. CCR is an HA technology and SCR is a DR technology, in my opinion.