This is part 1 of a two-part blog.
Part 1: Time for a Virtual Domain Controller and vMotion gotchas educational trip
Part 2: DC gone wild
We received two brand-new HP DL380 servers earlier this week, plus more memory for our existing ESXi boxes. One DL380 will serve as the third ESXi host in an existing two-node cluster, and the other will be added to the VMware View 5 cluster. I needed to add memory to the existing two-node cluster and introduce one of the new DL380s to it, so I did not waste any time and started unboxing the server. I inserted the vSphere 5.0 CD and installed it on the 4 GB SD card (still using 5.0 because I have not updated my vCenter Appliance to 5.1; stay tuned, that is coming up this month). The install finished. At this point I just wanted to get the host up with the bare minimum, so I planned on installing patches the following day. I added it to the existing vCenter cluster and configured the standard switches.
Next I wanted to add the new memory to the other two ESXi boxes, so I started vMotioning VMs to the brand-new DL380. I highlighted the VMs I needed to move, right-clicked, and hit “Migrate”. The wizard took me through the checks and everything passed, so I hit OK and vMotion began moving the machines to the new DL380. The VMs included DC, File, Print, WSUS, Exchange, etc. When the vMotion finished, I shut down the other ESXi boxes one at a time and upgraded the memory. Little did I know that the time on all machines was now off by decades after the vMotion, causing “ALL” machine accounts to stop talking properly. The routine vMotion that I have done a number of times had now caused the Domain Controller’s time to be off by decades.
What happened? What can cause vMotion to behave like that? This happened for a number of reasons.
First, the Domain Controller VM that I vMotioned also holds the PDC emulator role. One of the PDC emulator’s responsibilities is to be the timekeeper for the entire domain. You can only have one PDC emulator in a given domain, so it is imperative that this VM source its time from a reliable NTP server. This Domain Controller VM had been configured to grab time from a stratum 1 time source for a couple of years, but about two months ago a security policy changed and I am now firewalled off from getting NTP from that specific stratum 1. The NTP issue was not a surprise, but I was a little relaxed about reconfiguring it. Besides, what’s a couple of seconds off? It was like “no harm, no foul”. NTP is an entirely separate blog, but in short, I was told to take NTP from an internal stratum 1 GPS time source that requires an MD5 password. Windows NTP has no setting for MD5; that’s the result of my short Google search. I wanted to research it more, but I had to scramble and get NTP working right away, so what I ended up doing was pointing my Cisco 6500 and 3820 at the stratum 1 (Cisco supports MD5 on NTP) and making them a stratum 5 NTP server. I could then configure the Domain Controller VM to point to my Cisco device for NTP without a password.
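
Before pointing a Domain Controller at a new time source, it helps to confirm that the source actually answers and to see what stratum it advertises. Below is a minimal Python sketch of an unauthenticated SNTP query. It does not do MD5 authentication, so it only fits a source that accepts plain requests, like the Cisco device in my workaround. The function names are my own invention for illustration, not part of any tool mentioned here.

```python
import socket
import struct
from datetime import datetime, timezone

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_OFFSET = 2208988800


def build_sntp_request() -> bytes:
    # First byte 0x1B = leap indicator 0, version 3, mode 3 (client).
    # The remaining 47 bytes of the 48-byte header are left as zero.
    return b"\x1b" + 47 * b"\x00"


def parse_sntp_response(packet: bytes):
    """Return (stratum, server transmit time in UTC) from a raw NTP reply."""
    if len(packet) < 48:
        raise ValueError("NTP packet is shorter than 48 bytes")
    stratum = packet[1]
    # Transmit timestamp: 32-bit seconds plus 32-bit fraction at bytes 40-47.
    seconds, fraction = struct.unpack("!II", packet[40:48])
    unix_time = seconds - NTP_EPOCH_OFFSET + fraction / 2**32
    return stratum, datetime.fromtimestamp(unix_time, tz=timezone.utc)


def query_time_source(host: str, timeout: float = 2.0):
    """Send one SNTP request over UDP port 123 and parse the reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_sntp_request(), (host, 123))
        packet, _ = sock.recvfrom(512)
    return parse_sntp_response(packet)
```

Calling `query_time_source("ntp.example.internal")` (a made-up hostname) would return the advertised stratum and the server's idea of the current time, or raise a timeout error if the source is firewalled off, which is exactly the situation I should have caught earlier.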
Second, the new ESXi DL380 that I introduced to the cluster came with a misconfigured default time. It was off by decades. I did not even venture to look at that section; I just wanted to get the host up. You might be thinking, “Well Totie, you should have done an initial check first.” The only check I did, which I thought was a valid one, was to vMotion a less important server and check that it still worked. I did that, but since that less important server is part of the domain and gets its time from the PDC emulator, it did not have any problem at all. Disclosure: none of my VMs use “Synchronize guest time with host”.
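
The check I skipped boils down to comparing the new host's clock against a trusted reference before moving anything onto it. Here is a minimal Python sketch of that comparison. The five-minute tolerance is the Active Directory Kerberos default for clock skew; the function names and the sample dates are made up for illustration, and how you actually read the host's clock (vSphere Client, console, a script) is left out.

```python
from datetime import datetime, timedelta, timezone

# Default Kerberos clock-skew tolerance in an Active Directory domain.
KERBEROS_DEFAULT_TOLERANCE = timedelta(minutes=5)


def clock_skew(host_time: datetime, reference_time: datetime) -> timedelta:
    """Absolute difference between a host clock and a trusted reference."""
    return abs(host_time - reference_time)


def safe_to_host_vms(host_time: datetime,
                     reference_time: datetime,
                     tolerance: timedelta = KERBEROS_DEFAULT_TOLERANCE) -> bool:
    """True when the host clock is within tolerance of the reference."""
    return clock_skew(host_time, reference_time) <= tolerance


# Hypothetical readings: a trusted reference and two candidate hosts.
reference = datetime(2012, 10, 1, 12, 0, 0, tzinfo=timezone.utc)
healthy_host = reference + timedelta(seconds=30)
decades_off_host = datetime(1990, 1, 1, 0, 0, 0, tzinfo=timezone.utc)

print(safe_to_host_vms(healthy_host, reference))      # True
print(safe_to_host_vms(decades_off_host, reference))  # False
```

A host that fails this check should get its NTP settings fixed before any VM, and especially the PDC emulator, is migrated onto it.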
The combination of these misconfigurations caused the PDC emulator to be decades off after vMotion, and here is why. (Disclosure: this is my understanding based on observation.) When you vMotion a VM, the entire VM is paused on the source ESXi host and resumed on the destination ESXi host. Once it is migrated, its time becomes unreliable, so it is forced to grab time from elsewhere. For a member server, the PDC emulator takes care of that, so no problem at all. But if you are the PDC emulator, you are configured to grab NTP from an external source, and that external source is unreachable, where do you get time? On a physical or virtual box, the answer is the BIOS. The BIOS supplies the time, and in the case of my decades-off ESXi box, that BIOS time is what it handed to the PDC emulator.
The bottom line is “Time”: the importance of having reliable time. A huge lesson learned for me. And do you think I will let that happen again? It is ingrained in my brain. Mistakes happen along the way, but it is how you deal with and learn from those mistakes that is most important. Turn calamity into discovery and learn from it.
Read the next blog, where we’ll venture into the Windows admin role and I will explain how I fixed this “DC gone wild”.