Application Performance Troubleshooting

In my last posts, Application Analysis Using TCP Retransmissions, Part 1 and Application Analysis Using TCP Retransmissions, Part 2, I described how application performance is impacted by TCP Retransmissions. Of course, there are many other causes of poor application performance. My review of this topic came about because Pete Welcher and I have just finished investigating whether application slowness at a customer site was due to the network or due to the applications themselves.

Packet Loss
As I described in the blog posts about TCP retransmissions, even small amounts of packet loss can cause significant application slowness. There are several sources of packet loss, including errors and excessive buffering. Another source, which I didn’t mention in those posts, is ingress overruns. Some switch vendors design their interface cards with one ASIC handling multiple ports, typically 2, 4, or 8. A couple of busy servers that happen to be connected to ports sharing the same ASIC can over-subscribe the ASIC’s processing capabilities. When that happens, the ASIC has to drop ingress frames because it has run out of some internal resource, such as buffering or bandwidth to the backplane. Ingress overruns, like congestion-induced egress discards, tend to happen during traffic bursts.

High numbers of ingress overruns, which can be found in the output of show interface (see below), indicate that the interface isn’t able to keep up with the attached server. Note that the Input queue drops and the overrun figures are the same, and that dividing the number of overruns by the total number of packets input gives 0.0018, or about 0.18%. This rate is the average over seven weeks of operation, so the loss rate during busy periods is probably higher. We would need to look at the performance charts recorded by the NMS to see when the busy times are, and perhaps break out the packet capture tools to see the number of TCP retransmissions. I’ll bet that this server isn’t performing very well for the customers who are trying to use it. The server needs to be moved to a switch port that can handle the offered load, as described in a Cisco support forum article.

GigabitEthernet2/10 is up, line protocol is up (connected)
 Hardware is C6k 1000Mb 802.3, address is 0013.aaaa.bbbb (bia 0013.aaaa.bbbb)
 Description: Server N connection
 MTU 1500 bytes, BW 1000000 Kbit, DLY 10 usec,
   reliability 255/255, txload 1/255, rxload 1/255
 Encapsulation ARPA, loopback not set
 Keepalive set (10 sec)
 Full-duplex, 1000Mb/s, media type is 10/100/1000BaseT
 input flow-control is off, output flow-control is off
 Clock mode is auto
 ARP type: ARPA, ARP Timeout 04:00:00
 Last input never, output never, output hang never
 Last clearing of "show interface" counters 7w0d
 Input queue: 0/2000/420145/0 (size/max/drops/flushes); Total output drops: 3041
 Queueing strategy: fifo
 Output queue: 0/40 (size/max)
 5 minute input rate 180000 bits/sec, 16 packets/sec
 5 minute output rate 11000 bits/sec, 8 packets/sec
   231689521 packets input, 330293039276 bytes, 0 no buffer
   Received 708 broadcasts (5 multicasts)
   0 runts, 0 giants, 0 throttles
   0 input errors, 0 CRC, 1 frame, 420145 overrun, 0 ignored
   0 watchdog, 0 multicast, 0 pause input
   0 input packets with dribble condition detected
   1379776652 packets output, 19707878215 bytes, 0 underruns
   0 output errors, 0 collisions, 0 interface resets
   0 babbles, 0 late collision, 0 deferred
   0 lost carrier, 0 no carrier, 0 PAUSE output
   0 output buffer failures, 0 output buffers swapped out
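
The loss-rate arithmetic can be checked with a few lines of Python. This is only a sketch: the two counters are copied from the show interface output above, and the function name overrun_rate is made up for illustration.

```python
def overrun_rate(overruns: int, packets_input: int) -> float:
    """Return the fraction of input packets that were dropped as overruns."""
    if packets_input <= 0:
        raise ValueError("packets_input must be positive")
    return overruns / packets_input


# Counters copied from the GigabitEthernet2/10 output above.
rate = overrun_rate(overruns=420145, packets_input=231689521)
print(f"{rate:.4f} = {rate:.2%}")  # prints "0.0018 = 0.18%"
```

Because the counters were last cleared seven weeks earlier, this is a long-term average and says nothing about how the drops are distributed over time.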

Location
Another source of poor application performance is the location of the servers. An application will typically consist of multiple “tiers”, each of which performs some type of processing that was requested by the client. There may be a web server that talks to a middleware server that in turn talks to a DB server. The DB server may actually talk with several DB servers to collect the data to satisfy the client request. This division of duties works well when the servers are located near one another. But when the servers are far apart, the additional latency can have a big negative impact on the overall system performance. An application that works well when all the servers are in the same data center may perform poorly after the system administrators move a VM from the original data center to another data center. If the application is written poorly, making many DB calls to retrieve data, the overall system performance will degrade quickly as the latency between servers increases.
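
A rough way to see why a chatty application suffers is to multiply the number of sequential DB calls by the round-trip time. The sketch below assumes the calls run strictly one after another; the call count, the round-trip times, and the per-call server time are hypothetical figures chosen for illustration, not measurements from the customer site.

```python
def response_time_ms(db_calls: int, rtt_ms: float, server_ms_per_call: float) -> float:
    """Estimate client-visible time when the DB calls run one after another."""
    return db_calls * (rtt_ms + server_ms_per_call)


# Hypothetical figures: 200 sequential DB calls, 1 ms of server work per call.
for label, rtt_ms in [("same data center", 0.5), ("remote data center", 20.0)]:
    total = response_time_ms(db_calls=200, rtt_ms=rtt_ms, server_ms_per_call=1.0)
    print(f"{label}: {total:.0f} ms")
# prints:
# same data center: 300 ms
# remote data center: 4200 ms
```

Under these assumptions the server does exactly the same work in both cases; the extra 3.9 seconds is time spent waiting on the path between the data centers.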

Silly DB queries
If a key DB server must exist at a separate data center, it becomes critical to system performance that the application developers keep in mind the additional latency introduced by the remote access. The application may need to be written (or re-written) so that the number of transactions to the remote server is minimized. Stored procedures may be needed to reduce the volume of data returned by a DB query, which reduces both the number of round trips and the congestion on the MAN/WAN path.
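
As an illustration of reducing the number of transactions, the sketch below compares one query per customer with a single aggregate query. It uses Python's built-in sqlite3 module with an in-memory database purely as a stand-in: SQLite has no network round trips and no stored procedures, so the counter only represents the round trips a remote DB server would see. The orders table and both function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(cid, 10.0 * cid) for cid in range(1, 51) for _ in range(3)],
)
customer_ids = list(range(1, 51))


def totals_chatty(conn, ids):
    """One query per customer: len(ids) round trips to the DB server."""
    result, round_trips = {}, 0
    for cid in ids:
        row = conn.execute(
            "SELECT SUM(total) FROM orders WHERE customer_id = ?", (cid,)
        ).fetchone()
        result[cid] = row[0]
        round_trips += 1
    return result, round_trips


def totals_batched(conn, ids):
    """One aggregate query for all customers: a single round trip."""
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        "SELECT customer_id, SUM(total) FROM orders "
        f"WHERE customer_id IN ({placeholders}) GROUP BY customer_id",
        ids,
    ).fetchall()
    return dict(rows), 1


chatty, chatty_trips = totals_chatty(conn, customer_ids)
batched, batched_trips = totals_batched(conn, customer_ids)
print(chatty == batched, chatty_trips, batched_trips)  # prints "True 50 1"
```

Both versions return the same totals, but the batched version also lets the DB server do the summing, so only one small row per customer crosses the network.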

DB Optimization
The application developers need to profile the application and look for ways to optimize application performance. Do the critical tables have the correct indexing? Are the queries appropriate, or do they generate a significant load on the DB server? Some queries may need to be broken into smaller queries that operate on temporary tables in order to reduce the working memory that is needed. A good DB developer is worth hiring to make sure that the application performs well, even under the best network conditions.
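
One way to answer the indexing question is to ask the database for its query plan before and after adding an index. The sketch below again uses SQLite through Python's sqlite3 module as a stand-in, since the article does not name a DB product; the table, the index name, and the helper query_plan are hypothetical, and the exact plan wording varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")


def query_plan(conn, sql, params=()):
    """Return the detail column (the last one) of SQLite's EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]


sql = "SELECT total FROM orders WHERE customer_id = ?"

# Without an index, the plan should show a full table SCAN.
plan_before = query_plan(conn, sql, (42,))

# With an index on the filtered column, the plan should show a SEARCH using it.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, sql, (42,))

print(plan_before)
print(plan_after)
```

Other database servers expose the same idea through their own EXPLAIN facilities, with different syntax and output.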

Server performance (CPU, disk, I/O)
After the application is deployed, check the server performance and create alerts when it is too low. Checking CPU, memory, disk, and I/O is obvious, but is often ignored until someone determines that a problem exists. In the world of virtual machines, the VM host performance also needs to be monitored, because it is a shared resource. Just as a new application can cause network congestion on a shared link, a new server instance added to an existing VM host may exhaust some of the resources of that host, affecting all the application servers on that host. Since the network team rarely has visibility into the VM host performance, it is good to build rapport with the server team so that when an application problem is detected, you can determine whether the server (or the VM host on which that server instance is running) is one of the sources of the poor performance.

DNS/LDAP/Active Directory
Finally, there may be problems with the services that many applications depend on. I’ve seen an application that was performing well suddenly slow to a crawl. Upon investigating, I found that the initial connection to several functions took several seconds to get established. Doing a ping from the affected server to its downstream servers showed what was happening: the first lookup hung and the second lookup resolved the correct address. The server was configured with several DNS servers, but the first one in the list had been decommissioned, so the server had to time out the first DNS request for each hostname-to-address lookup (or reverse query, in some cases) before trying the next server. In the best case, the local DNS cache is enabled and the problem affects only a few transactions, when the cache is loaded or refreshed. This case can be maddening to diagnose unless you recognize the symptoms. In the worst case, it happens on every client transaction and looks like a slow network, yet you can’t find anything in the network that is causing the slowness. A packet capture can help here, because you’ll see the initial lookup request, some retries with no reply, and then the server switching to a working name server and immediately completing the transaction.
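
A quick check for this symptom is to time name lookups from the affected server. The sketch below goes through the operating system's resolver via socket.getaddrinfo, so a local DNS cache can hide the problem on repeated runs. The function names and the one-second threshold are assumptions made for illustration, and the resolver parameter exists only so the check can be exercised without a network.

```python
import socket
import time


def time_lookup(hostname, resolver=socket.getaddrinfo):
    """Return (elapsed_seconds, ok) for one name lookup."""
    start = time.perf_counter()
    try:
        resolver(hostname, None)
        ok = True
    except socket.gaierror:
        ok = False
    return time.perf_counter() - start, ok


def slow_lookups(hostnames, slow_after=1.0, resolver=socket.getaddrinfo):
    """Return (name, seconds, ok) for lookups that failed or exceeded slow_after."""
    flagged = []
    for name in hostnames:
        elapsed, ok = time_lookup(name, resolver)
        if not ok or elapsed > slow_after:
            flagged.append((name, round(elapsed, 3), ok))
    return flagged


if __name__ == "__main__":
    # Replace "localhost" with the downstream servers the application talks to.
    for name, elapsed, ok in slow_lookups(["localhost"]):
        print(f"{name}: {elapsed} s, resolved={ok}")
```

A lookup that consistently takes a multiple of the resolver timeout before succeeding is a hint that an earlier name server in the list is not answering.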

Summary
The network is always the first thing blamed when applications run slowly, and sometimes it is at fault. But just as often, it isn’t the network; it is something related to the server or to how the application is architected. The end result is the same in either case. We’re better off when the network, server, and applications teams work closely with each other, treating the whole as one big system that they need to make work well together. Instead of “You have a problem with your network” or “You have a problem with your server”, it should be “We have a problem with the application; let’s work together to solve it.”

-Terry

_____________________________________________________________________________________________

Re-posted with Permission

NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html
