I swear, every time somebody talks about patching, I cringe. While I support patching, I also understand that there is risk associated with it. Last week, I patched my Group Chat servers. Life has sucked since then.
Previous State: In this particular case, Group Chat was migrated from OCS 2007 R2 to Lync 2010. It was working fine for months. The previous environment had two OCS Group Chat servers.
What Happened: After patching, the Lync Server Channel Service and the Lync Server Lookup Service would not start and stay started. Here is the level of the patches. At this time, these are all current.
Yep, Lync Server 2010 Group Chat is fully up to date.
Errors Experienced: From the user perspective, users saw the error "The DomainName.com is not available" in their notices. Of course, this error is misleading, as it was caused simply by the fact that the Group Chat services would not start on the server.
After years and years of working on Windows servers, I have finally learned to use the Event Viewer. On the server side, Event Viewer displayed:
Event ID 6381: An error MGCLOOKU is stopping due to an unhandled exception.
Event ID 6381: An error MGCCHANS is stopping due to an unhandled exception.
The lookup and channel logs showed that the services were trying to establish connections to the old OCS Group Chat servers. In this case, the old OCS Group Chat servers had never been properly removed from the OCS environment. Instead, the Group Chat services were disabled, and the servers were shut down and recycled. This, of course, meant that there were legacy artifacts floating around in Active Directory.
The Fix, It Took Lots of Work:
I started with Randy Wintle’s blog at http://blog.ucmadeeasy.com/2010/11/09/lync-server-2010-active-directory-references-and-how-to-remove-them/ which provided some great information on removing entries for trusted services from Active Directory. Using his steps, I used LDP to identify each server’s DN, and then used AdsiEdit to remove each entry.
Randy’s blog is an excellent source of info, and this post in particular has fantastic instructions. However, cleaning up Active Directory didn’t fix the problem all by itself.
After turning up the logging to the Trace level in the Group Chat configuration tool, I found that a stored procedure named procGetPeerServers was still finding the old OCS Group Chat server objects in the Group Chat SQL database. I opened up the stored procedure and saw that it was querying entries from a table, tblServerIdentity.
Deleting the old servers from this table sounded like the answer, and it was, with one catch. Trying to delete certain entries threw an error that referred to entries in the tblActivePeers table. The relationship between two of the old servers was recorded in that table, and it had to be deleted first. The aplServerID referred to one of the old servers and the aplPeerID referred to another. Once this record was removed, the old server entries could be deleted, and the Lync Group Chat services started again.
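For anyone who wants to see the order of operations spelled out, here is a rough sketch. It is not the real Group Chat database: it uses an in-memory SQLite stand-in so it can be run anywhere, and apart from the table names and the aplServerID and aplPeerID columns mentioned above, everything in it (the serverID and serverName columns, the server names, the foreign keys) is made up for illustration. The point is only that the peer record has to go before the server rows it points at. Back up the real database before trying anything like this.

```python
import sqlite3


def remove_stale_servers(conn, stale_ids):
    """Delete peer relationships first, then the server rows they reference."""
    ids = list(stale_ids)
    marks = ",".join("?" for _ in ids)
    with conn:  # both deletes commit together, or roll back together
        conn.execute(
            f"DELETE FROM tblActivePeers "
            f"WHERE aplServerID IN ({marks}) OR aplPeerID IN ({marks})",
            ids + ids,
        )
        conn.execute(
            f"DELETE FROM tblServerIdentity WHERE serverID IN ({marks})",
            ids,
        )


# Hypothetical stand-in schema; the real Group Chat schema is not shown here.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE tblServerIdentity (
        serverID   INTEGER PRIMARY KEY,
        serverName TEXT
    );
    CREATE TABLE tblActivePeers (
        aplServerID INTEGER REFERENCES tblServerIdentity(serverID),
        aplPeerID   INTEGER REFERENCES tblServerIdentity(serverID)
    );
    INSERT INTO tblServerIdentity VALUES (1, 'old-gc-01');
    INSERT INTO tblServerIdentity VALUES (2, 'old-gc-02');
    INSERT INTO tblServerIdentity VALUES (3, 'lync-gc-01');
    INSERT INTO tblActivePeers VALUES (1, 2);
""")

# Deleting a server row directly is blocked while a peer record references it.
try:
    conn.execute("DELETE FROM tblServerIdentity WHERE serverID = 1")
    blocked = False
except sqlite3.IntegrityError:
    conn.rollback()
    blocked = True

# Removing the peer record first lets the server rows go cleanly.
remove_stale_servers(conn, [1, 2])
remaining = [row[0] for row in conn.execute("SELECT serverName FROM tblServerIdentity")]
```

Running both deletes in one transaction means that if the second one fails, the peer record comes back too, so the tables are never left half cleaned.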
1. Why did this only become a problem after patching? I have no clue.
2. Why were these servers not gracefully removed from OCS in the first place? Well, that sounds like it should have happened, but we know how it all works in real life. Tasks are assigned to people, and they get dropped.
3. Why was it a blocker? Who knows? I am still trying to figure out how Group Chat worked in the first place if the other servers had been disabled.
Oh well, life goes on, and Group Chat is once again up and going. I will sleep well tonight.