Troubleshooting via a maze of network devices

04.12.2006
Several years ago, a new category of network equipment for bandwidth optimization emerged. Dubbed "traffic managers" or "traffic shapers," these devices go beyond firewalls by classifying traffic based on information deep within the packet.

Essentially, they are OSI Layer 7 switches with buffering capability for queuing and partitioning. Packeteer was an early leader in this field and continues to hold a significant portion of the market today.

Traffic managers are but one component of today's networks that can make troubleshooting connectivity issues quite complicated. There are routers, switches, firewalls, authentication servers, network access control devices, load balancers, intrusion-prevention systems, virtual private networks -- the list goes on and on. Each system adds another layer to troubleshoot when things don't work as planned.

I recently found myself troubleshooting a situation involving intermittent connectivity to a company's e-mail portal. While ultimately the issue was caused by a traffic manager, I reached that conclusion only after navigating a maze of network devices. My path to resolution wasn't perfect, but working through the several devices that could have been at fault reinforced some critical networking lessons.

The problem

Access to a Web portal used for e-mail was intermittent from off-site but worked fine from within the site itself. The problem pointed to a block at the network edge, and that is where I began my investigation.

The portal vendor provided a list of ports that needed to be opened for proper operation. Because this was an intermittent problem, I felt sure that the necessary ports were not blocked; if there were such a block, there would be no access at all from off-site. For completeness, I checked the configuration of all intermediate devices, including firewalls, routers and traffic managers, and verified that none of the ports was blocked.
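
A port list like that can also be sanity-checked from outside with a simple TCP connect test against each published port. The following is a minimal sketch in Python; the host name and port list are placeholders for illustration, not the vendor's actual values.

    # port_check.py -- sketch: attempt a TCP connection to each vendor-listed
    # port and report whether the handshake completes. The host and ports
    # below are hypothetical placeholders.
    import socket

    PORTAL_HOST = "portal.example.com"   # hypothetical portal address
    VENDOR_PORTS = [80, 443]             # assumed vendor port list

    for port in VENDOR_PORTS:
        try:
            with socket.create_connection((PORTAL_HOST, port), timeout=5):
                print(f"{PORTAL_HOST}:{port} reachable (TCP handshake completed)")
        except OSError as exc:
            print(f"{PORTAL_HOST}:{port} blocked or unreachable: {exc}")

Keep in mind that a single pass like this only shows the ports are reachable at that moment; an intermittent block, like the one in this story, requires repeating the test over time.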

At this point, I was fairly convinced that it was either an application issue or that access was needed on another port that the vendor had omitted or was unaware of. I used IPTraf to determine which ports were being accessed on the portal server from within the network (in other words, on a known working connection). The vendor port list was correct.
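
IPTraf answers that question interactively; a rough scripted equivalent on a Linux server is to tally which local ports have established TCP connections by reading /proc/net/tcp. The sketch below only illustrates the idea and is not the tool described here; it also covers IPv4 only (/proc/net/tcp6 holds the IPv6 table).

    # established_ports.py -- sketch: count established TCP connections per
    # local port by parsing /proc/net/tcp (Linux, IPv4 only). This approximates
    # the "which ports are clients actually hitting?" question.
    from collections import Counter

    ESTABLISHED = "01"  # TCP state code for ESTABLISHED in /proc/net/tcp

    counts = Counter()
    with open("/proc/net/tcp") as f:
        next(f)  # skip the header line
        for line in f:
            fields = line.split()
            local_addr, state = fields[1], fields[3]
            if state == ESTABLISHED:
                port = int(local_addr.split(":")[1], 16)  # port is hex-encoded
                counts[port] += 1

    for port, n in counts.most_common():
        print(f"port {port}: {n} established connection(s)")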

A traffic manager was used to manage the company's Internet bandwidth. I created a rule on the traffic manager to classify all packets to the portal's server, to see whether it was misclassifying an application. After a short time, the applications were registering correctly: HTTP and SSL. I removed the classification and turned my attention to the firewall.

Sometimes, firewalls have connection timeout rules that can block applications after a period of inactivity. However, the logs of the firewalls involved revealed no such blocks.

I then used tcpdump at the firewall to analyze traffic at the network edge. From that trace, it was apparent that some clients were never sending a packet with the TCP PUSH flag set to open the portal application. This pointed to a client issue.
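
Spotting that pattern in a saved capture is easy to script. The sketch below assumes the capture was written to a pcap file and uses the Scapy library (not the tools named in this column) to list client connections that sent traffic to the portal but never set the PUSH flag; the file name and server address are placeholders.

    # missing_push.py -- sketch: read a saved capture and report clients that
    # never set the TCP PUSH flag on traffic to the portal server.
    # "edge.pcap" and PORTAL_IP are hypothetical placeholders.
    from scapy.all import rdpcap, IP, TCP  # assumes Scapy is installed

    PORTAL_IP = "192.0.2.10"   # hypothetical portal server address
    PSH = 0x08                 # TCP PUSH flag bit

    seen = set()     # (client, source port) pairs that sent anything to the portal
    pushed = set()   # pairs that sent at least one PUSH segment

    for pkt in rdpcap("edge.pcap"):
        if IP in pkt and TCP in pkt and pkt[IP].dst == PORTAL_IP:
            key = (pkt[IP].src, pkt[TCP].sport)
            seen.add(key)
            if pkt[TCP].flags & PSH:
                pushed.add(key)

    for client, port in sorted(seen - pushed):
        print(f"{client}:{port} never pushed application data to the portal")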

To try to replicate the client problem, I connected my laptop to a public ISP, accessed the portal and took traces with Ethereal. Every test revealed a successful connection on the necessary ports. I initially did not understand why I couldn't replicate the problem, and I came to the conclusion that the network was not the culprit.

Although I did not realize it, by this time I had committed three critical troubleshooting errors. These errors led me to incorrectly conclude that the network was not responsible.

Still, continuing to insist that the network was not at fault served no useful purpose, because the vendor felt otherwise. I had to investigate further to find definitive proof to back my conclusions.

The solution

It was at this time that I realized I was connecting to the site using a VPN. While using the VPN client, I was emulating an internal network machine, not an external machine. All of my outside emulation tests were invalid. I shut down the VPN client and began my outside tests again. This time, access sometimes worked and sometimes didn't.

This test proved that the problem was not client-related, since the laptop worked fine when it emulated an on-site machine via the VPN but had intermittent connectivity without using the VPN. I rechecked every network device in line for a port block and found none. At this point, I began to think the issue had to be above Layer 4.
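
One way to frame "above Layer 4" is that the TCP handshake completes, yet the application exchange still fails. The sketch below separates the two checks; it assumes a hypothetical HTTPS portal address and is meant only to illustrate the distinction, not to reproduce the tests described here.

    # layer_check.py -- sketch: a Layer 4 check (does the TCP handshake
    # complete?) versus a Layer 7 check (does the portal answer an HTTPS
    # request?). Host and path are hypothetical placeholders.
    import http.client
    import socket

    HOST, PORT, PATH = "portal.example.com", 443, "/"

    # Layer 4: can we complete a TCP handshake at all?
    try:
        socket.create_connection((HOST, PORT), timeout=5).close()
        print("Layer 4: TCP connection established")
    except OSError as exc:
        print(f"Layer 4: connection failed: {exc}")

    # Layer 7: does the application actually respond?
    try:
        conn = http.client.HTTPSConnection(HOST, PORT, timeout=10)
        conn.request("GET", PATH)
        print(f"Layer 7: HTTP response {conn.getresponse().status}")
        conn.close()
    except (OSError, http.client.HTTPException) as exc:
        print(f"Layer 7: request failed: {exc}")

A device that classifies and blocks traffic at the application layer can let the first check pass while the second fails intermittently, which matches the behavior described here.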

The only device that could possibly block above Layer 4 in this situation was the traffic manager, so I turned my attention to it again. The traffic manager had a basic configuration to deny popular peer-to-peer applications that a firewall rule based on ports may not catch. Other than that, there were no blocks -- just prioritizations and partitions based on applications and IP addresses.

I created the same inbound logging rule as before and monitored the traffic classifications. This time, I ran the test for a greater period of time than before. The extra time for traffic classification revealed the reason for the intermittent connectivity and the solution.

The traffic manager was misclassifying some of the portal connections as Skype. Skype, a popular peer-to-peer IP telephony application, was banned by corporate security policy, and the traffic manager's peer-to-peer rules enforced that ban. The solution was to allow Skype-identified traffic to the portal server. Since the server was not actually running Skype, this was an acceptable solution.

Lessons learned

It's important to note that the problem was not with the Skype application itself, which was never actually in use. Rather, it was the misclassification of some portal traffic as Skype by the traffic manager, coupled with the peer-to-peer denial rule, that caused the intermittent connectivity. This was unexpected, but it is a reminder to always keep an open mind when troubleshooting.

Mistakes happen, but if lessons are not learned from them, there's a good chance the same mistakes will be repeated. My mistakes emphasized three important lessons.

1. Create a valid test environment.

When testing, the object is to emulate the conditions of those experiencing the problem as closely as possible. While this seems obvious, sometimes creating an exact test environment is not trivial. In my case, I overlooked the fact that I was testing off-site connectivity within a VPN tunnel. This led to incorrect conclusions and wasted time.

2. Run the tests to completion.

Sometimes when taking traces or analyzing logs, the process is stopped before all of the necessary data for formulating proper conclusions have been gathered. This is particularly true when dealing with intermittent problems. In this case, had I let the traffic manager analyze the portal traffic a bit longer, I would have discovered why the application did not work properly.

3. Consider the network topology.

When I analyzed traffic between the firewall and the traffic manager, I had already (incorrectly) discounted the traffic manager as a source of the problem because of the first two mistakes. I still believed I was looking at traffic at the network edge, even though my capture point actually sat beyond the device that was causing the problem.

The best tests are those that validate other test results. Although I made several troubleshooting errors, when subsequent tests did not support my earlier conclusions, I put pride aside and returned to the drawing board. Persistence is often a network administrator's greatest skill.

I am glad, however, that I did not commit a fourth mistake. If I had stuck to my guns, relying on inaccurate test data and assumptions, I would have let the "It's not the network" response be my final one.

Greg Schaffer is director of network services at Middle Tennessee State University. He has over 15 years of experience in networking, primarily in higher education. He can be reached at newtnoise@comcast.net.