I was called by a client on a Sunday morning after a scheduled power outage in the computer room the night before. The client was having trouble bringing up their VMware™ environment, which consisted of 120 virtual servers. Each time they attempted to bring up the virtual server environment, the systems failed to ‘come up’ on the network. They had been working on the problem for several hours before calling me at 7:50 AM. I arrived on-site and found that all of the non-VMware™ systems had been fully restored to operational status after the power was turned back on (beginning at 3:00 AM). However, their attempts to get the VMware™ servers operational (using the console ports) had been unsuccessful. Suspecting a problem with the datacomm gear in the computer room, they asked for additional assistance.
After briefly consulting with them about their findings, I learned that the VMware™ systems kept reporting a duplicate IP address in use whenever one of the servers attempted to come online. The server team tried bringing many different VMware™ servers up on the network, with the same result. They even changed the IP addresses on multiple servers in an attempt to bring them online, to no avail (same error message on the server console each time). The strange part was that 1) the servers had been hard-coded with the same static IP addresses since they were installed months earlier, 2) when the server team pinged the addresses reported as ‘duplicate IP addresses in use’, nothing responded, 3) any server brought up on an isolated network switch worked fine, and kept working when it was connected back to the main switch gear after it was operational (very puzzling, and leading them to believe it was a problem with the main datacomm gear), and 4) everything had been working fine for the previous several months with these same addresses.
Identifying the problem
It was evident that a packet analyzer was needed, and the process of collecting data began. With packets pouring in by the thousands, I left the analyzer running for a few minutes prior to pausing to look at the data for clues on what was causing this problem.
When a VM system comes online (other systems do this as well), it sends an ARP packet out on the network to determine if anyone else is using the IP address that it is attempting to use. If it determines some other system is using the IP address, it will no longer use layer 3 services, and will not provision itself with an IP address (it uses 0.0.0.0 for all additional network packets).
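The duplicate-address check described above can be sketched in a few lines of Python. This is a minimal illustration of the decision logic, not VMware's implementation: the function names (`build_arp`, `parse_arp`, `address_in_use`) and the sample MAC addresses are assumptions made for the example. Only the 28-byte ARP layout for Ethernet/IPv4 is taken from the ARP specification (RFC 826).

```python
import socket
import struct

# ARP payload layout for Ethernet/IPv4 (RFC 826): 28 bytes, network byte order.
ARP_FORMAT = "!HHBBH6s4s6s4s"
ARP_REQUEST, ARP_REPLY = 1, 2

def build_arp(oper, sha, spa, tha, tpa):
    """Pack an ARP payload; IPs are dotted strings, MACs are 6 raw bytes."""
    return struct.pack(ARP_FORMAT, 1, 0x0800, 6, 4, oper,
                       sha, socket.inet_aton(spa), tha, socket.inet_aton(tpa))

def parse_arp(payload):
    """Unpack the fields needed for the duplicate-address check."""
    _, _, _, _, oper, sha, spa, tha, tpa = struct.unpack(ARP_FORMAT, payload[:28])
    return {"oper": oper, "sha": sha, "spa": socket.inet_ntoa(spa),
            "tha": tha, "tpa": socket.inet_ntoa(tpa)}

def address_in_use(candidate_ip, own_mac, captured_payloads):
    """True if any ARP reply from another MAC claims candidate_ip."""
    for payload in captured_payloads:
        arp = parse_arp(payload)
        if (arp["oper"] == ARP_REPLY and arp["spa"] == candidate_ip
                and arp["sha"] != own_mac):
            return True
    return False

# Hypothetical MAC addresses, for illustration only.
server_mac = bytes.fromhex("020000000001")
other_mac = bytes.fromhex("020000000002")
reply = build_arp(ARP_REPLY, other_mac, "192.168.0.10", server_mac, "192.168.0.10")

print(address_in_use("192.168.0.10", server_mac, [reply]))  # True: the server would not configure the address
print(address_in_use("192.168.0.10", server_mac, []))       # False: no one claimed the address
```

A single reply claiming the candidate address is enough to make the check fail, which is why one misbehaving device can keep a server off the network.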
Looking through the packet captures, I could see the server (the VMware™ server we were initially troubleshooting) send out an ARP packet requesting the MAC address for 192.168.0.10 (the IP address assigned to the server). A device on the network responded with the MAC address xx:xx:xx:ce:20:cd (manufacturer code obscured) and the IP address 192.168.0.10. From the manufacturer’s code portion of the MAC address, it was easy to tell that this system was not a VMware™ box.
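The manufacturer check is simple to automate. The sketch below compares the first three octets of a MAC address (the OUI) against prefixes commonly published as belonging to VMware. The prefix list should be verified against the IEEE registry before relying on it, and the function names are assumptions for this example. Because the real manufacturer code is obscured above, the second test address uses a placeholder prefix (02:00:00), not the clock's actual one.

```python
# OUI prefixes commonly listed for VMware (first three octets of the MAC).
# Assumption: verify against the IEEE registry before relying on this list.
VMWARE_OUIS = {"00:05:69", "00:0c:29", "00:1c:14", "00:50:56"}

def oui(mac):
    """Return the manufacturer prefix (first three octets), lower-cased."""
    return ":".join(mac.lower().replace("-", ":").split(":")[:3])

def is_vmware_mac(mac):
    """True if the MAC's manufacturer prefix is in the VMware list."""
    return oui(mac) in VMWARE_OUIS

print(is_vmware_mac("00:50:56:ab:cd:ef"))  # True
print(is_vmware_mac("02:00:00:ce:20:cd"))  # False (placeholder prefix, not the real device's)
```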
The next step was to quickly locate this system and isolate it from the rest of the network. The client had a server running scanning software to identify all of the machines connected to the network by IP address. However, this scanner software could not locate the offending box. Next, ping tests were issued from multiple machines in an attempt to elicit a response. The box would not respond to pings. Lastly, we went through the MAC tables for each of the main switches to determine the source port of the MAC address for this system. It was located within a few minutes, and isolated to a switch on the third floor, module #A port #1.
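Searching the MAC tables by hand works, but the lookup can also be scripted once the tables have been exported. The sketch below is only an illustration: the switch names, ports, table contents, and the placeholder MAC prefix are all hypothetical, and the tables are assumed to be already parsed into (MAC, port) pairs. The one practical detail it handles is that different switches print MAC addresses in different notations (colons, hyphens, or dotted groups of four), so addresses are normalized before comparing.

```python
import re

def normalize_mac(mac):
    """Reduce any common MAC notation to 12 lower-case hex digits."""
    digits = re.sub(r"[^0-9a-fA-F]", "", mac).lower()
    if len(digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    return digits

def find_mac(target, mac_tables):
    """Return (switch, port) pairs whose table lists the target MAC."""
    wanted = normalize_mac(target)
    return [(switch, port)
            for switch, entries in mac_tables.items()
            for mac, port in entries
            if normalize_mac(mac) == wanted]

# Hypothetical, pre-parsed MAC tables: {switch name: [(mac, port), ...]}.
mac_tables = {
    "core-1": [("0200.0000.0001", "B/4"), ("0200.0000.0002", "B/7")],
    "floor3-1": [("02-00-00-CE-20-CD", "A/1")],
}

print(find_mac("02:00:00:ce:20:cd", mac_tables))  # [('floor3-1', 'A/1')]
```

On a real network the same MAC address also shows up on the uplink ports of upstream switches, so the port of interest is the one that is not an inter-switch link.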
Problem resolved
Now that the source of the problem was identified, the switch port connecting this device was administratively shut down and the network staff was dispatched to locate the device. The offending device turned out to be a specialized clock that used the network to obtain the time (using NTP). The clock was promptly disconnected from the network. The server team was notified of the corrective action, and once again began the process of bringing up the VMware™ servers. This time, all of the VMware™ servers came back online to full operational status with no errors.
A duplicate IP address on the network would keep a VMware™ server from coming online. But that didn’t explain why none of the VMware™ systems would come up on the network, as the time clock’s static IP address was different from those of all the VMware™ servers. To find out why, more detailed analysis was undertaken.
Typically, when a VMware™ system sends out an ARP request (for the purpose of determining who else is up in the cluster), it pads the THA (Target Hardware Address) of the ARP packet with all zeroes (00:00:00:00:00:00). When it attempts to come online and checks whether its IP address is already in use, it pads the THA with all ones (FF:FF:FF:FF:FF:FF). It was only to this second type of request that the clock sent an ARP reply, claiming that it was indeed 192.168.0.10 (or any other IP address, for that matter). This caused the network layer (Layer 3) of the VMware™ server to shut down and relinquish its IP address, leaving the server unable to operate on the network. This is also why the clock would not respond when the staff attempted to elicit ping responses from it: the ARP request that a host sends before a ping does not have the THA set to all ones. This further complicated the earlier troubleshooting efforts of the IT staff. Fortunately for the client, the VMware™ systems had been stable before this outage; otherwise, they would have had some very strange problems with individual servers not coming online, which would have been difficult for them to troubleshoot.
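The difference between a well-behaved host and the clock can be summarized as two small predicates. This is a sketch of the behavior inferred from the packet capture, not the clock's actual firmware logic: the function names and the clock's own address (192.168.0.99) are hypothetical, and whether the clock also answered ordinary requests for its own address is not something the capture established, so the sketch models only the observed all-ones case.

```python
ZEROS = "00:00:00:00:00:00"
ONES = "ff:ff:ff:ff:ff:ff"

def correct_host_replies(own_ip, request):
    """A well-behaved host answers only when the target IP is its own."""
    return request["tpa"] == own_ip

def faulty_clock_replies(request):
    """Observed behavior: answer any request whose THA is all ones,
    claiming the requested IP regardless of the clock's own address."""
    return request["tha"].lower() == ONES

CLOCK_IP = "192.168.0.99"  # hypothetical; the clock's real address is not given

cluster_query = {"tpa": "192.168.0.10", "tha": ZEROS}
startup_check = {"tpa": "192.168.0.10", "tha": "FF:FF:FF:FF:FF:FF"}

for name, req in (("cluster query", cluster_query), ("startup check", startup_check)):
    print(name,
          "| correct host:", correct_host_replies(CLOCK_IP, req),
          "| faulty clock:", faulty_clock_replies(req))
# cluster query | correct host: False | faulty clock: False
# startup check | correct host: False | faulty clock: True
```

The target IP never enters the faulty predicate, which matches what the server team saw: changing a server's IP address made no difference.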
Post mortem
Based upon the analysis above, the clock was determined to have defective firmware: it responded to any and all gratuitous ARP packets sent out on the network, regardless of the IP address of the sender. In each case, the clock replied to the gratuitous ARP request and confirmed that it had the same IP address (even though it did not). For this reason, the clock was not allowed back on the network. Since the manufacturer was unable to correct the firmware, the malfunctioning clock was exchanged for another model that receives its time using radio waves (non-networked). The manufacturer initially denied the problem, but later stated that one other client had reported a similar problem (the other client had been unable to produce a technical analysis describing why the clock’s firmware was malfunctioning).