vMotion changed the world

Reminiscing about the past, I want to appreciate what is, in my opinion, the greatest thing to come out of IT in the past decade.  Nothing comes close to VMware vMotion.  vMotion is a feature so genius that I am still amazed somebody could have envisioned it.  Imagine virtualization without vMotion: I think cold migration would be an acceptable feature, where you shut down the VM and move it to a new host.  That’s a reasonable solution, right? Yet VMware did the unthinkable.  Being able to live migrate a virtual machine is so far out there that I am still amazed VMware made it possible.  Imagine the pitch when it was first introduced.  It was probably a typical Monday meeting with all the top engineers, a whiteboard with just the word VMOTION in big letters, and somebody saying something like this:

“Guys, I don’t know how we’re going to do it.  We will migrate a live/hot VM without skipping a beat.  Meeting adjourned, let’s make it happen.”

I wonder how everyone reacted.

vMotion is the one feature that took virtualization to a whole new level.  I don’t think virtualization would be as good as it is now without vMotion.  The ability to live migrate a VM so you can service or replace the underlying hardware is brilliant.  vMotion is the standard that every type 1 hypervisor is copying, and I honestly don’t think there is any close competition out there.  You might say Microsoft and Xen, but really it is not even close; vMotion can run circles around Live Migration and XenMotion.

In the future vMotion will be a feature that just works so well that we will take for granted how powerful, out of the ordinary and cutting edge it is.  It is already becoming a common word, just like Google did.  I remember when we called AltaVista, Yahoo, Google, Lycos, etc. search engines.  Now you hardly hear the words “search engine”; it is far more common to hear “Google it”.  In the same way, it is now common to hear “vMotion it”.

I just want to pause and give credit where credit is due.  I applaud the brains behind vMotion.  I also applaud VMware for not stopping at this one good idea.  SVMotion, DRS, SDRS, etc. are all rising stars which I firmly believe came from the vMotion idea.

Part2: DC Gone Wild

This is part 2 of a two part blog.

Part1:  Time for a Virtual Domain Controller and vMotion gotchas educational trip

Part2:  DC gone wild


In Part 1 I explained how I screwed up the domain controller time for the entire domain.  In this part I will go through how to fix it, but first let me introduce the cast so we won’t get confused.

DC1 – Bad virtual domain controller.  Holds the PDC emulator FSMO role.

DC2 – Good domain controller.

Once we realized that we had a time issue, we manually changed the time on the two domain controllers.  After the manual time change we started getting phone calls from users about expired passwords, and the DC logs confirmed machine/user account issues.  Some users were working and some were not.  The users that were working all seemed to authenticate to DC2, which I validated by going to the command prompt, typing “set” and looking for “LOGONSERVER=\\DC2”.  At this point I knew that DC2 was working and DC1 was not.  I also knew that the two were not talking, because I added a bogus file to the NETLOGON share on DC1 and it never synced to the NETLOGON share on DC2.  The logs confirmed the Kerberos machine account issue on DC1.
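If you have to check a lot of workstations, the “set” trick can be scripted.  Here is a minimal sketch in Python, assuming Python is available on the client; the function name logon_server is my own and not part of any Windows tooling.  It only reads the LOGONSERVER environment variable that “set” displays.

```python
import os


def logon_server(env=None):
    """Return the DC that authenticated this session, or None if unknown."""
    env = os.environ if env is None else env
    value = env.get("LOGONSERVER")
    if not value:
        return None
    # Windows stores the name with a leading UNC prefix, e.g. \\DC2
    return value.lstrip("\\")


if __name__ == "__main__":
    print("Authenticated by:", logon_server())
```

On a workstation that logged on through the good domain controller, this should print DC2, the same answer you get from reading the “set” output by eye.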

At this point we can fix this in two ways (I’m sure there are other ways).

Option 1, which is the easiest, is just to decommission DC1, seize all of its roles onto DC2, and the problem is solved.  If circumstances permit, this is the easiest fix, but there are a number of gotchas that I was honestly not prepared to tackle, so I kept this option as a last resort at the back of my head.

Option 2 is just to fix it.  If you have already spent a lot of time on that DC, like patching it and getting it all locked down, or you use PKI and that DC has a working certificate for PKI authentication, then fixing it is really the only viable option.

To fix it, I used a combination of burflags and netdom.  If this were just a regular machine I would rejoin it to the domain and the problem would be solved.  This domain controller needed to reset its machine password and convey the new password to DC2, and at the same time I wanted to make sure that DC1 would grab and sync new changes from DC2 and not the other way around.

Burflags – to address the sync issue I use the burflags technique.  I have used this on a handful of occasions where my SYSVOL or NETLOGON directory got out of sync.  There is a surplus of information on the web about this so there is no need to explain it.  What I did was open regedit on the bad DC1, navigate to “HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NtFrs\Parameters\Backup/Restore\Process at Startup” and change “BurFlags” to “D2”.  “D2” tells DC1 to be nonauthoritative and resync its SYSVOL content from its DC2 brother.

Burflags will take care of the SYSVOL inconsistency, but we still have the Kerberos machine account issue to take care of.  To do this we need to employ the help of NETDOM.
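For those who would rather script the registry change than click through regedit, here is a rough sketch using Python’s standard winreg module.  It assumes Python is installed on the DC and that the script runs from an elevated prompt.  The value name BurFlags, the REG_DWORD type and hex D2 are my understanding of the FRS setting, and set_burflags is a name I made up for illustration.

```python
# Key path under HKEY_LOCAL_MACHINE, as described in the post.
BURFLAGS_KEY = (
    r"SYSTEM\CurrentControlSet\Services\NtFrs\Parameters"
    r"\Backup/Restore\Process at Startup"
)
BURFLAGS_NAME = "BurFlags"  # assumed value name (REG_DWORD)
NONAUTHORITATIVE = 0xD2     # "D2": pull a fresh copy from a replication partner


def set_burflags(value=NONAUTHORITATIVE):
    """Write BurFlags under HKLM. Windows only; needs admin rights."""
    import winreg  # imported lazily so this file still loads off Windows

    with winreg.OpenKey(
        winreg.HKEY_LOCAL_MACHINE, BURFLAGS_KEY, 0, winreg.KEY_SET_VALUE
    ) as key:
        winreg.SetValueEx(key, BURFLAGS_NAME, 0, winreg.REG_DWORD, value)


if __name__ == "__main__":
    set_burflags()
    print("BurFlags set to", hex(NONAUTHORITATIVE))
```

As far as I know, the flag is only read when the File Replication Service starts, so nothing happens until the service restarts or the box is rebooted.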

Netdom – I am using 2008 R2 so Netdom is included; you might need to download the Support Tools if you are using Server 2003.  We will reset the DC1 machine account password and let DC2 know about the change.  Log in to DC1, stop the “Kerberos Key Distribution Center” service and set it to Manual.  Next open a command prompt (make sure you run it as admin) and issue the command “netdom.exe resetpwd /s:DC2 /ud:MYDOMAIN\Administrator /pd:MyAdminPassword”.  Make sure it says “successful”; otherwise check your syntax.  Also take note that I am running the NETDOM utility from the bad DC1 and that the target is “resetpwd /s:DC2”, DC2 being the good server.  Once successful, set the Kerberos Key Distribution Center service back to Automatic and reboot DC1.
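The netdom syntax is easy to fat-finger, so here is a small Python sketch that builds the exact same command and prompts for the password instead of leaving it in the command history.  This is just a wrapper around the command shown above, assuming Python is on the DC; the function names are hypothetical, and the KDC service still has to be stopped by hand first.

```python
import getpass
import subprocess


def netdom_resetpwd_args(good_dc, domain_user, password):
    """Build the argument list for 'netdom resetpwd' as used in this post."""
    return [
        "netdom.exe",
        "resetpwd",
        f"/s:{good_dc}",       # the healthy DC that receives the new password
        f"/ud:{domain_user}",  # e.g. MYDOMAIN\Administrator
        f"/pd:{password}",
    ]


def reset_machine_password(good_dc, domain_user):
    """Run netdom on the bad DC; return True if it exited with code 0."""
    password = getpass.getpass(f"Password for {domain_user}: ")
    args = netdom_resetpwd_args(good_dc, domain_user, password)
    result = subprocess.run(args, capture_output=True, text=True)
    print(result.stdout)
    return result.returncode == 0


if __name__ == "__main__":
    ok = reset_machine_password("DC2", "MYDOMAIN\\Administrator")
    print("Reset succeeded" if ok else "Reset failed, check the syntax")
```

Passing the arguments as a list avoids any quoting surprises with the backslash in the account name.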

At this point the machine should be back to a normal working state.  It is normal for the SYSVOL and NETLOGON shares to not exist for the first couple of minutes, depending on how big they are.  Remember that the burflags “D2” entry causes DC1 to delete the contents of these shares and sync up with DC2.  In my case it took between 5 and 10 minutes.

That is all folks, I hope you enjoyed this two part series.

Part1: Time for a Virtual Domain Controller and vMotion gotchas educational trip

This is part 1 of a two part blog.

Part1:  Time for a Virtual Domain Controller and vMotion gotchas educational trip

Part2:  DC gone wild

We received two brand new HP DL380s earlier this week, along with more memory to add to our existing ESXi boxes.  One DL380 will serve as the third ESXi host in an existing 2 node cluster and the other will be added to the VMware View 5 cluster.  I needed to add memory to the existing 2 node cluster and also introduce one of the new DL380s to it, so I did not waste any time and started unboxing the server.  I inserted the vSphere 5.0 CD and installed it on the 4 GB SD card (still using 5.0 because I have not updated my vCenter Appliance to 5.1; stay tuned, that is coming up this month).  The install finished.  At this point I just wanted to get the host up to its bare minimum, so I planned on installing patches the following day.  I added it to the existing vCenter cluster and configured the standard switches.  Since I wanted to add the new memory to the other two ESXi boxes, I started vMotioning VMs to the brand new DL380.  I highlighted the VMs that I needed to move, right-clicked and hit “Migrate”.  The wizard took me through the checks and everything passed, so I hit OK and vMotion began moving the machines to the new DL380.  The VMs included DC, file, print, WSUS, Exchange, etc.  When the vMotion process finished I shut down the other ESXi boxes one at a time and upgraded the memory.  Little did I know that the time on all the machines was now off by decades after the vMotion, causing “ALL” machine accounts to stop talking properly.  The same uneventful vMotion that I had done a number of times before had now caused the domain controller time to be off by decades.

What happened? What can cause vMotion to behave like that?  This happened for a number of reasons.

First, the domain controller VM that I vMotioned also holds the PDC emulator role.  One of the PDC emulator’s responsibilities is to be the timekeeper for the entire domain.  You can only have one PDC emulator in a given domain, so it is imperative that this VM source its time from a reliable NTP server.  This domain controller VM had been configured to get its time from a stratum 1 time source for a couple of years, but two months ago a security policy changed and I am now firewalled from that specific stratum 1 server.  The NTP issue was not a surprise, but I was a little relaxed about reconfiguring it.  Besides, what’s a couple of seconds off? It was like “no harm, no foul”.  NTP is an entirely separate blog post, but in short, I was told to take NTP from an internal stratum 1 GPS time source that requires an MD5 password.  Windows NTP has no setting for MD5, at least according to my short Google search.  I wanted to research it more, but I had to scramble and get NTP working right away, so what I ended up doing is to point my Cisco 6500 and 3820 at the stratum 1 source (Cisco supports MD5 on NTP) and make it a stratum 5 NTP server.  I could then configure the domain controller VM to point to my Cisco device for NTP without a password.
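A quick way to confirm that an NTP source is reachable and what stratum it advertises is to send it a plain, unauthenticated SNTP query.  Below is a minimal Python sketch using only the standard library.  The packet layout (stratum in byte 1, transmit timestamp in bytes 40 to 47, seconds counted from 1900) follows the NTP packet format; the function names and the example host name are my own, and this does not do MD5 authentication.

```python
import socket
import struct

NTP_TO_UNIX = 2208988800  # seconds between 1900-01-01 and 1970-01-01


def parse_ntp_reply(packet):
    """Return (stratum, transmit time in Unix seconds) from an NTP reply."""
    if len(packet) < 48:
        raise ValueError("NTP packet must be at least 48 bytes")
    stratum = packet[1]
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return stratum, seconds - NTP_TO_UNIX + fraction / 2**32


def query_ntp(server, timeout=2.0):
    """Send one client-mode request to server:123 and parse the answer."""
    request = b"\x1b" + 47 * b"\0"  # LI=0, version 3, mode 3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(request, (server, 123))
        reply, _ = sock.recvfrom(512)
    return parse_ntp_reply(reply)


if __name__ == "__main__":
    # "my-cisco-switch" is a placeholder for your own NTP device
    print(query_ntp("my-cisco-switch"))
```

If the query times out, the time source is unreachable from that machine, which is exactly the situation my PDC emulator ended up in.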

Second, the new ESXi DL380 that I introduced to the cluster had a misconfigured default time.  It was off by decades.  I did not even venture to look at that section; I just wanted to get the host up.  You might be thinking, “Well Totie, you should have done an initial check first.”  The only check I did, which I thought was a valid check, was to vMotion a less important server and check that server’s functionality.  I did that, but since that less important server is part of the domain and gets its time from the PDC emulator, it did not have any problem at all.  Disclosure: none of my VMs use “Synchronize guest time with host”.
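The check I should have done is simple: compare the new host’s clock against a clock I trust before moving anything onto it.  Here is a small Python sketch of that comparison.  The 5 minute limit is the default Kerberos clock skew tolerance in Active Directory; how you read the host’s time is up to you, and the function names and sample dates are only for illustration.

```python
from datetime import datetime, timedelta, timezone

MAX_SKEW = timedelta(minutes=5)  # default Kerberos tolerance in AD


def clock_offset(host_time, reference_time):
    """How far the host clock is from the reference (positive = ahead)."""
    return host_time - reference_time


def is_clock_sane(host_time, reference_time, max_skew=MAX_SKEW):
    """True if the host clock is within max_skew of the reference."""
    return abs(clock_offset(host_time, reference_time)) <= max_skew


if __name__ == "__main__":
    # Made-up values: a trusted clock and a host that thinks it is 1990
    reference = datetime(2012, 10, 1, 12, 0, tzinfo=timezone.utc)
    new_host = datetime(1990, 1, 1, 0, 0, tzinfo=timezone.utc)
    if not is_clock_sane(new_host, reference):
        print("Do NOT vMotion yet, host is off by",
              abs(clock_offset(new_host, reference)))
```

A host that is decades off fails this check immediately, long before a domain controller ever lands on it.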

The combination of these misconfigurations caused the PDC emulator to be decades off after the vMotion, and here is why.  (Disclosure: this is my understanding based on observation.)  When you vMotion a VM, the entire VM is paused on the source ESXi host and resumed on the destination ESXi host.  Once it is migrated its time becomes unreliable, so it is forced to get time from elsewhere.  For a member server the PDC emulator takes care of that, so no problem at all.  But if you are the PDC emulator, you are configured to get NTP from an external source, and that external source is unreachable, where do you get time? On a physical or virtual box the answer is the BIOS.  The BIOS supplies the time, and in the case of my decades-off ESXi box, that BIOS time is what got handed to the PDC emulator.

The bottom line is “time”: the importance of having reliable time.  A huge lesson learned for me.  And do you think I will let that happen again? It is ingrained in my brain.  Mistakes happen along the way, but it is how you deal with and learn from those mistakes that matters most.  Turn calamity into discovery and learn from it.

Read the next blog and we’ll venture into the Windows admin role, where I will explain how I fixed this “DC gone wild”.