I was recently checking out a product that does syslog correlation and noticed that it had not reported a couple of events that I could see in syslog-ng’s log. I use syslog-ng because it is free, easy to install and configure, performs filtering, and forwards to other destinations. I normally have it configured to log everything to the local filesystem and to filter and forward specific events to other systems. It provides a good de-coupling mechanism between the network devices that are sending syslog messages and the systems that must process syslog. For example, NetMRI needs to receive Cisco CONFIG_I events indicating that a configuration change has been made.
The product that I was configuring was running on a separate server. It needed to receive syslog events, and its display wasn't showing me all the events that syslog-ng was recording. At first I blamed the product, but I then decided to replace it with another copy of syslog-ng to simplify the test. In the test setup, System A, a Red Hat EL5 server, ran syslog-ng and received syslog events from all the network equipment. System B, a CentOS 5.3 server, was configured with a second copy of syslog-ng, also logging to the filesystem. System A was forwarding all Cisco syslog events to System B. The rate of syslog events was on the order of 10 packets per second during peaks. Each packet was pretty small, because Cisco syslog messages tend to be small. I was very surprised to find that a measurable percentage of the syslog messages were being dropped on System B, even with syslog-ng. So it wasn't a problem within the product that I was trying to install.
The next step was to verify that the UDP packets were making it from System A to System B. I ran tcpdump on both systems and verified that System A was sending the forwarded packets and that System B was receiving them. But syslog-ng was still not receiving all the events. Looking through System B’s syslog events and the tcpdump events, I could see that the packets were being received by the system, but were not being received by syslog-ng.
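The tcpdump comparison shows what reaches the network interface; a small counting receiver can show what actually reaches an application socket. The following is a minimal Python sketch of that idea, not the tooling used in this test. The function name count_datagrams and the timeout value are illustrative assumptions.

```python
import socket


def count_datagrams(sock, expected, timeout=1.0):
    """Count datagrams delivered to `sock`.

    Stops once `expected` datagrams have arrived, or when no datagram
    arrives within `timeout` seconds, whichever happens first.
    """
    sock.settimeout(timeout)
    received = 0
    try:
        while received < expected:
            sock.recvfrom(65535)  # large enough for any UDP payload
            received += 1
    except socket.timeout:
        pass  # nothing more arrived within the timeout window
    return received


if __name__ == "__main__":
    # Loopback demonstration: send a known number of datagrams, then count.
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))  # port 0 lets the kernel pick a free port
    port = receiver.getsockname()[1]

    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for i in range(10):
        sender.sendto(b"test message %d" % i, ("127.0.0.1", port))

    print("delivered to the application:", count_datagrams(receiver, 10))
    sender.close()
    receiver.close()
```

If the count seen by a receiver like this is lower than the count tcpdump reports on the same host, the loss is happening between the kernel and the application rather than on the network.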
There are a number of web sites that discuss UDP packet loss. A good one is 29West.com’s UDP Buffer Sizing page, which includes commands for reporting the number of dropped UDP packets for several operating systems. On my system, it showed a lot of UDP packet errors:
$ netstat -us
…
Udp:
    29582255 packets received
    6898 packets to unknown port received.
    15597 packet receive errors
    29934317 packets sent
That definitely looked like the problem, so I worked through a number of recommendations for adjusting the UDP packet buffers. Some of them consume a lot of buffer space, as described in the 29West.com article above. I still had packet drops. I then switched System B to a Red Hat release and the packet errors dropped significantly. It turns out that the CentOS 5.3 release drops many UDP packets, even at relatively low packet rates.
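For readers who write their own receivers, one of the buffer adjustments is available per socket through the SO_RCVBUF option. The Python sketch below requests a larger receive buffer and reads back what the kernel actually granted. It illustrates the general technique only; it is not one of the specific settings tried here, and the function name, address, and buffer size are assumptions chosen for the example.

```python
import socket


def open_udp_receiver(host, port, requested_rcvbuf):
    """Open a UDP socket and ask the kernel for a larger receive buffer.

    Returns the socket and the buffer size the kernel reports afterwards.
    On Linux the reported value is typically double the request (to cover
    bookkeeping overhead), and the request is capped by the
    net.core.rmem_max sysctl.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_rcvbuf)
    sock.bind((host, port))
    effective = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return sock, effective


if __name__ == "__main__":
    # Hypothetical values: loopback, a kernel-chosen port, a 256 KB request.
    sock, effective = open_udp_receiver("127.0.0.1", 0, 256 * 1024)
    print("requested 262144 bytes, kernel reports", effective)
    sock.close()
```

Reading the value back matters because the kernel may silently grant less than was requested when the system-wide limit is lower.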
I would have expected any modern Linux kernel to handle a load of hundreds of UDP packets per second on a single-core server with no competing processes. But for some reason CentOS has trouble handling even modest UDP packet loads. Switching to Red Hat EL5 eliminated most (but not all) of the packet loss.
This brings me to another point that I find myself often making to network management vendors: syslog and traps are inherently unreliable due to the nature of their transport protocol: UDP. My recommendation to vendors is: Don’t write your network management application as if UDP were a reliable protocol. Use multiple mechanisms or multiple requests to get the data that’s needed to create informative answers to common questions. My recommendation to users: verify that the syslog and trap receivers are not dropping packets.
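One way to act on that last recommendation on a Linux receiver is to sample the kernel's UDP counters at intervals and watch whether the error counters grow. The Python sketch below parses the Udp lines of /proc/net/snmp and computes the change between two samples. The function names are illustrative, and the sample text in the demonstration is made up for the example rather than taken from a real system.

```python
def parse_udp_counters(snmp_text):
    """Return the Udp counters from /proc/net/snmp-style text as a dict.

    The file holds two "Udp:" lines: a header row of counter names
    followed by a row of values in the same order.
    """
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith("Udp:")]
    if len(rows) != 2:
        raise ValueError("expected one Udp header line and one Udp value line")
    names, values = rows[0][1:], rows[1][1:]
    return {name: int(value) for name, value in zip(names, values)}


def counter_growth(before, after):
    """Difference between two counter samples, for counters present in both."""
    return {name: after[name] - before[name] for name in after if name in before}


if __name__ == "__main__":
    # Made-up samples; on a live system read /proc/net/snmp twice instead.
    first = ("Udp: InDatagrams NoPorts InErrors OutDatagrams\n"
             "Udp: 1000 5 20 900\n")
    second = ("Udp: InDatagrams NoPorts InErrors OutDatagrams\n"
              "Udp: 1600 5 32 1400\n")
    growth = counter_growth(parse_udp_counters(first), parse_udp_counters(second))
    print("new receive errors since last sample:", growth["InErrors"])
```

A receive-error counter that keeps climbing between samples is a sign that the receiver is dropping datagrams, regardless of what the application itself reports.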
Re-posted with Permission
NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html