Our time clocks keep getting lost

26 02 2008

One of my clients uses time clocks at several of its larger facilities, and a large portion of the staff use them when entering and leaving work. The data feeds the payroll system, which uses CyberShift™ as its workforce management software. On several occasions, they ‘lost’ one or more of the time clocks: the server could no longer see them. They had been working with the CyberShift™ technical support staff, who were not able to come up with a ‘fix’. They called me in to determine why this was happening.

Identifying the problem

Since this was not an everyday occurrence, I asked the client to call me in when it happened again so I could begin the analysis. Upon arrival, I learned that they had lost connections to all time clocks at a specific site. I went to their main site, where the computer room housed the system server, to verify whether the server could connect to the time clocks from there. I saw several connection attempts from the server, but no data coming back from the time clocks in question. After this, we went to the site where the time clocks resided and connected the packet analyzer. This time, we could see the time clocks receive data from the server and send data back to the server. However, each exchange ended abruptly, and a new TCP session was initiated for each request coming in from the server.

So, both the server and the time clocks were sending data to each other, but only the data from the server was actually getting to the time clocks. The data being sent by the time clocks was not getting to the server.

Since 1) everything appeared normal at this point, and 2) we knew the data was not getting back to the server, we focused on determining why the data could be seen on the local LAN but was not getting through to the LAN where the server resided. They were using a T1 to connect this remote site to their main site. They had recently cut over their switch gear to a different manufacturer, and had installed a new router as part of the cutover. That suggested a possible routing issue, but they stated that no other applications or users were exhibiting any problems or symptoms.

We looked for the obvious things first: packet loss on the local LAN, packet loss on the T1, routing configuration, and so on. We found no problems with any of these, so a more detailed look through the packet decodes was required. Starting with layer one of the OSI model and working up towards layer seven (this can be done in reverse as well), we began going over the packets from the time clocks to the server in fine detail. We had already looked at the upper layers during the initial analysis and noted that layers three (network layer) through seven (application layer) appeared normal: no IP address problems, no DNS problems, no transport problems, and applications communicating on valid port numbers. So we focused the analysis on the lower layers (below layer three). Having already checked for layer one and layer two issues on the switch and router using the SNMP counters, we next looked at layer two in the packet decodes. This was where the problem was located. In going through the packet decodes, we determined that the time clocks were addressing the router with a different MAC address than all other traffic on the network used. On further investigation, we found that the time clocks were using the ‘old’ router’s MAC address, even though a new router was in place.
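The layer-two check described above can be sketched in a few lines of Python. This is only an illustration, not the tooling we used on site: it assumes the capture has already been exported as (source IP, destination IP, destination MAC) tuples, and the function name, addresses, and MAC values are all hypothetical. Any local host sending off-subnet traffic to a MAC other than the current gateway's is flagged.

```python
import ipaddress

def find_stale_gateway_hosts(frames, local_subnet, gateway_mac):
    """Return {src_ip: {unexpected dst MACs}} for off-subnet traffic.

    frames: iterable of (src_ip, dst_ip, dst_mac) tuples taken from a capture.
    Traffic leaving the subnet should be addressed to the gateway's MAC.
    """
    net = ipaddress.ip_network(local_subnet)
    expected = gateway_mac.lower()
    stale = {}
    for src_ip, dst_ip, dst_mac in frames:
        src_local = ipaddress.ip_address(src_ip) in net
        dst_remote = ipaddress.ip_address(dst_ip) not in net
        if src_local and dst_remote and dst_mac.lower() != expected:
            stale.setdefault(src_ip, set()).add(dst_mac.lower())
    return stale

# Hypothetical values for illustration only.
OLD_ROUTER_MAC = "00:aa:bb:cc:dd:01"
NEW_ROUTER_MAC = "00:aa:bb:cc:dd:02"
sample_frames = [
    ("192.0.2.50", "198.51.100.10", "00:AA:BB:CC:DD:01"),  # time clock, old MAC
    ("192.0.2.60", "198.51.100.10", "00:AA:BB:CC:DD:02"),  # PC, new MAC
    ("192.0.2.60", "192.0.2.70", "00:aa:bb:cc:dd:99"),     # local traffic, ignored
]

print(find_stale_gateway_hosts(sample_frames, "192.0.2.0/24", NEW_ROUTER_MAC))
```

In this made-up sample, only the host standing in for the time clock is reported, because it alone still addresses frames to the old router's MAC.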

We reconnected the packet analyzer to the network and let it run for quite a while. During this time, the time clock never sent an ARP request for the default gateway. It continued to use the MAC address of the previous router, even though that router had been removed earlier in the week. We were able to document that the time clocks’ ARP cache entry for the router never timed out, even though the router had changed three days prior.
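To make the failure mode concrete, here is a toy model of an ARP cache, written as a sketch rather than a description of the time clock's actual firmware. The class name, the 300-second timeout, and the addresses are assumptions chosen for illustration; real stacks use different and often configurable aging values. A cache that ages entries out forces a fresh ARP request, while one that never expires keeps handing back the old router's MAC.

```python
class ArpCache:
    """Minimal ARP cache model; timeout_s=None means entries never age out."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self._entries = {}  # ip -> (mac, time learned)

    def learn(self, ip, mac, now):
        self._entries[ip] = (mac, now)

    def lookup(self, ip, now):
        """Return the cached MAC, or None if the host must ARP again."""
        entry = self._entries.get(ip)
        if entry is None:
            return None
        mac, learned_at = entry
        if self.timeout_s is not None and now - learned_at > self.timeout_s:
            del self._entries[ip]  # aged out: next packet triggers an ARP request
            return None
        return mac

THREE_DAYS = 3 * 24 * 3600           # seconds
GATEWAY_IP = "192.0.2.1"             # hypothetical
OLD_ROUTER_MAC = "00:aa:bb:cc:dd:01" # hypothetical

healthy = ArpCache(timeout_s=300)    # assumed timeout for illustration
faulty = ArpCache(timeout_s=None)    # models the behaviour we observed
for cache in (healthy, faulty):
    cache.learn(GATEWAY_IP, OLD_ROUTER_MAC, now=0)

print(healthy.lookup(GATEWAY_IP, now=THREE_DAYS))  # None
print(faulty.lookup(GATEWAY_IP, now=THREE_DAYS))   # 00:aa:bb:cc:dd:01
```

The second lookup is the situation we captured: three days after the swap, frames were still being addressed to a router that no longer existed.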

Problem resolved

To get the client back online that day, we applied a temporary fix: powering the time clock (which was under lock and key) off and back on. This cleared the time clock’s ARP cache, forcing it to send out an ARP request to locate the default gateway, which responded properly.

When we passed this information to CyberShift™, they told us they purchased the time clocks from a separate manufacturer. When we spoke with the technical support department at the time clock manufacturer about the ARP cache never timing out, they said that they used a third-party NIC and needed to contact that company before accepting the results. After a few conference calls with CyberShift™, the time clock manufacturer, and the NIC manufacturer, it was decided that a firmware upgrade would correct the problem. They sent the new firmware, and we loaded it onto some test time clocks. We retested to see whether the time clocks could establish communication with the new router when the MAC address of the default gateway changed, either by ARPing for the gateway after an ARP cache timeout or by listening for the gratuitous ARP sent by the ‘new’ router when it came online. They did neither. We reported this to the time clock manufacturer, and they once again went back to researching the problem.
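For readers unfamiliar with gratuitous ARP, the sketch below builds one as raw bytes and shows how to recognize it in a capture: the sender and target IP fields are the same address. It is an illustration only. The helper names and addresses are hypothetical, it builds the request form (opcode 1) with a zeroed target hardware address, and it does not transmit anything on the wire.

```python
import socket
import struct

ETHERTYPE_ARP = 0x0806

def mac_to_bytes(mac):
    return bytes.fromhex(mac.replace(":", ""))

def build_gratuitous_arp(mac, ip):
    """Build a 42-byte Ethernet frame carrying a gratuitous ARP request."""
    hw = mac_to_bytes(mac)
    addr = socket.inet_aton(ip)
    # Ethernet header: broadcast destination, our source MAC, ARP ethertype.
    eth = b"\xff" * 6 + hw + struct.pack("!H", ETHERTYPE_ARP)
    # ARP header: Ethernet (1), IPv4 (0x0800), hlen 6, plen 4, opcode 1.
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    # Sender MAC/IP, zeroed target MAC, target IP equal to sender IP.
    arp += hw + addr + b"\x00" * 6 + addr
    return eth + arp

def is_gratuitous_arp(frame):
    """True if the frame is ARP and sender IP equals target IP."""
    if len(frame) < 42 or frame[12:14] != struct.pack("!H", ETHERTYPE_ARP):
        return False
    sender_ip = frame[28:32]
    target_ip = frame[38:42]
    return sender_ip == target_ip

# Hypothetical new-router values for illustration.
frame = build_gratuitous_arp("00:aa:bb:cc:dd:02", "192.0.2.1")
print(len(frame), is_gratuitous_arp(frame))  # 42 True
```

A host that honors such a frame updates its cache entry for 192.0.2.1 to the new MAC right away; the test time clocks ignored it.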

The time clock manufacturer got back in contact with the client about 30 days after the initial troubleshooting. They agreed that it was a defect in the time clock system, and asked that the client replace all time clocks at all facilities with their new time clock (which had been tested and did not exhibit the same problem). This was done, and the problem has not returned. Kudos to CyberShift™ and the time clock manufacturer for taking ownership and getting a permanent solution to the problem.
