Friday, 25 September 2015

Randomly dismounting Exchange 2010 databases and the travelling domain controllers

Some time ago a customer reported that their Exchange 2010 databases were failing over at random times during the day. They had an Exchange 2010 DAG environment with four servers at the primary site and two at the secondary site. Databases at the primary site would randomly crash and fail over to another Exchange 2010 server. When inspecting the logs, I found this event in the Application log:

Source: MSExchange ADAccess
Event ID: 2389
Description: Process STORE.EXE (PID=2668). A request to directory server somedc03.somedomain.hr did not return a result within 30 seconds and is being abandoned. The search will be retried if possible.

Just after that, the databases would crash. This is a screenshot from the Application log:



So I started to investigate why the server could not talk to this DC. I focused on this event:



Log Name:      Application
Source:        MSExchange ADAccess
Date:          4.12.2013. 16:26:34
Event ID:      2080
Task Category: Topology
Level:         Information
Keywords:      Classic
User:          N/A
Computer:      someex.somedomain.hr
Description:
Process STORE.EXE (PID=5264). Exchange Active Directory Provider has discovered the following servers with the following characteristics:
 (Server name | Roles | Enabled | Reachability | Synchronized | GC capable | PDC | SACL right | Critical Data | Netlogon | OS Version)
In-site:
somedc01.somedomain.hr     CDG 1 7 7 1 0 0 1 7 1
somedc02.sektor2.somedomain.hr   CDG 1 7 7 1 0 1 1 7 1
somedc05.somedomain.hr    CDG 1 7 7 1 0 0 1 7 1
somedc06.somedomain.hr CDG 1 7 7 1 0 0 1 7 1
somedc07.somedomain.hr  CDG 1 7 7 1 0 1 1 7 1
somedc08.somedomain.hr    CD- 1 6 6 0 0 1 1 6 1
somedc09.somedomain.hr    CDG 1 7 7 1 0 1 1 7 1
somedc10.somedomain.hr    CDG 1 7 7 1 0 1 1 7 1
somedc11.somedomain.hr    CDG 1 7 7 1 0 0 1 7 1
somedc03.somedomain.hr  CD- 1 0 0 0 0 0 0 0 0
somedc04.somedomain.hr  CD- 1 6 0 0 0 0 0 0 0
Out-of-site:
somedc12.somedomain.hr    CDG 1 7 7 1 0 1 1 7 1
somedc13.somedomain.hr  CDG 1 7 7 1 0 1 1 7 1
somedc14.somedomain.hr     CDG 1 7 7 1 0 1 1 7 1
somedc15.somedomain.hr    CDG 1 7 7 1 0 1 1 7 1


The zeroes next to the domain controllers somedc03 and somedc04 suggest there is a problem with them. On Microsoft TechNet I found this:

“If you see other numbers here (especially 0), there may be a problem with the connection from the Exchange server to the directory service.”
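Scanning a long 2080 event by eye gets tedious, so here is a minimal Python sketch that parses the server lines and flags the suspicious ones. It assumes the column order shown in the event header above. The rule "flag a server when Reachability, Synchronized or Netlogon is 0" is my own simple heuristic based on the quoted hint, not an official Exchange check, and the names `parse_2080` and `suspicious` are made up for illustration.

```python
import re

# Numeric columns in the order given by the event 2080 header line.
COLUMNS = ["enabled", "reachability", "synchronized", "gc_capable", "pdc",
           "sacl_right", "critical_data", "netlogon", "os_version"]

# A server line: name, three role flags (C, D, G or "-"), then nine numbers.
LINE_RE = re.compile(r"^\s*(\S+)\s+([CDG-]{3})((?:\s+\d+){9})\s*$")


def parse_2080(description):
    """Parse the server lines of an event 2080 description into dicts."""
    servers = []
    site = None
    for line in description.splitlines():
        stripped = line.strip()
        if stripped in ("In-site:", "Out-of-site:"):
            site = stripped.rstrip(":")
            continue
        match = LINE_RE.match(line)
        if match:
            values = [int(v) for v in match.group(3).split()]
            entry = {"name": match.group(1), "roles": match.group(2), "site": site}
            entry.update(zip(COLUMNS, values))
            servers.append(entry)
    return servers


def suspicious(servers):
    """Heuristic (assumption): a 0 in any of these columns deserves a look."""
    return [s["name"] for s in servers
            if 0 in (s["reachability"], s["synchronized"], s["netlogon"])]


# A few lines copied from the event above.
SAMPLE = """In-site:
somedc01.somedomain.hr     CDG 1 7 7 1 0 0 1 7 1
somedc08.somedomain.hr    CD- 1 6 6 0 0 1 1 6 1
somedc03.somedomain.hr  CD- 1 0 0 0 0 0 0 0 0
somedc04.somedomain.hr  CD- 1 6 0 0 0 0 0 0 0
Out-of-site:
somedc12.somedomain.hr    CDG 1 7 7 1 0 1 1 7 1
"""

if __name__ == "__main__":
    for name in suspicious(parse_2080(SAMPLE)):
        print(name)
```

With the sample lines this flags somedc03 and somedc04, while somedc08 (a DC that is not a global catalog and reports 6 instead of 7) is left alone.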

It turned out that these domain controllers were not local to the Exchange servers. They were located at remote sites, which were in fact ships sailing around the world, connected over satellite links! When I pinged them, the latency was a couple of seconds, and sometimes the connection was down completely. Whenever the Exchange servers picked one of these domain controllers for their directory lookups, they would start logging errors.

The root cause was that Active Directory Sites and Services was not configured correctly: the subnets in which these domain controllers resided had not been assigned to separate AD sites, so the Exchange servers believed the domain controllers were local to them. Once we fixed the subnets, the Exchange AD Topology Discovery process no longer discovered the travelling domain controllers and the problem went away.
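To illustrate why a missing subnet definition matters, here is a small Python sketch of a subnet-to-site lookup that picks the most specific matching subnet. It only illustrates the idea and is not how Active Directory itself is implemented. The subnets, site names and the `site_for` helper are all hypothetical and are not taken from the customer's environment.

```python
import ipaddress


def site_for(ip, subnet_to_site):
    """Return the site of the most specific subnet containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    matches = []
    for cidr, site in subnet_to_site.items():
        network = ipaddress.ip_network(cidr)
        if addr in network:
            matches.append((network.prefixlen, site))
    if not matches:
        return None
    # The longest prefix is the most specific subnet.
    return max(matches)[1]


# Hypothetical "before": one broad subnet covers everything, ships included.
before = {"10.0.0.0/8": "HQ"}

# Hypothetical "after": the ship's subnet is mapped to its own site.
after = {"10.0.0.0/8": "HQ", "10.50.3.0/24": "Ship-03"}

if __name__ == "__main__":
    ship_dc = "10.50.3.10"  # made-up address of a DC on a ship
    print(site_for(ship_dc, before))
    print(site_for(ship_dc, after))
```

In the "before" mapping the ship's address resolves to the HQ site. Once its own subnet is defined, it resolves to a separate site.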