Handling NMS Performance Data, Part 2

I described collecting network performance data in last week’s blog, Handling NMS Performance Data, Part 1. This week, I want to describe how to store the collected data efficiently. I have heard stories about vendors that used a relational DB to store interface performance data, and about how those systems performed poorly at large scale: more than 50,000 interfaces per polling engine.

Most NMS developers are good database developers, so they naturally prefer to store data directly in a relational database. It makes their lives easier because they can run SQL queries that do a lot of the work for them. It also gives them one common interface for all their interactions with the data. But this approach has a cost. The DB API is relatively heavyweight because of its relational capabilities. What we have is a typical optimization tradeoff: is the time the developers spend more important than the time the system spends handling the data? A number of NMS development efforts have had poor performance because the wrong tradeoff was selected.

What causes the slow performance? A relational database is powerful because it allows the developer to easily create relations between data and make powerful queries against that data and its relationships. It reduces data storage in many cases because it can store metadata in one place and reference it from multiple places. In a network, the metadata might be the device’s name, its management addresses, location, etc., all referenced by a unique device ID. An interface or configuration entry in the DB can simply reference the device by its ID to get access to the higher-level metadata about the device. One change in the metadata is reflected immediately in all references to that data, instead of being duplicated for each interface. This is all good.
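
To make the metadata point concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The table names, column names, and sample values are invented for illustration; a real NMS schema would be considerably more involved.

```python
import sqlite3

# In-memory database; the schema and sample values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE device (
        device_id INTEGER PRIMARY KEY,
        name      TEXT,
        mgmt_addr TEXT,
        location  TEXT
    );
    CREATE TABLE interface (
        if_id     INTEGER PRIMARY KEY,
        device_id INTEGER REFERENCES device(device_id),
        if_name   TEXT
    );
""")
conn.execute(
    "INSERT INTO device VALUES (1, 'core-rtr-1', '192.0.2.1', 'Building A')"
)
conn.executemany(
    "INSERT INTO interface (device_id, if_name) VALUES (?, ?)",
    [(1, "Gi0/1"), (1, "Gi0/2")],
)


def interface_locations(conn):
    """Return (interface name, device location) pairs via a join on device_id."""
    return conn.execute(
        "SELECT i.if_name, d.location "
        "FROM interface AS i JOIN device AS d ON d.device_id = i.device_id "
        "ORDER BY i.if_name"
    ).fetchall()


# One update to the device row is visible through every interface that references it.
conn.execute("UPDATE device SET location = 'Building B' WHERE device_id = 1")
print(interface_locations(conn))
```

The location is stored once, in the device row, so both interfaces report the new location after a single update.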

The problem occurs when high volumes of data need to be handled. A relational DB needs to index the data as it is inserted so that it can be extracted quickly later. If indexing is not done, the DB read operations take longer. So there’s a performance penalty on either the inserts or the reads (which are called ‘selects’ in the SQL language). On top of the insert operation, we need to add DB logging, which is similar to real-time backups (most DBs will allow the log to be played back from a known checkpoint in order to bring a DB back up to date after a system crash). Even though the log may be (and should be) on a different disk than the DB itself, the DB uses memory and CPU to perform the logging. The ease of use comes with a price.

Is there an alternative? Yes. Every NMS rolls up the collected data over longer time intervals, typically an hour. The roll-up calculations typically record values such as MIN, MAX, AVG, and 95th Percentile. These are the values that are used in performance thresholding, error rate thresholds, trend analysis, and correlation. Keep the collected data that is required for the roll-up period in an in-memory cache (memory is inexpensive these days, so use it to optimize system performance). An efficient data structure will allow very rapid access to the data in the cache. The roll-up data is created from the cache and stored in the DB. This approach applies the power of the relational DB to the summaries, which is where it is normally used. The raw data in the cache is then written directly into the filesystem, using an on-disk data structure that makes it easy to access the raw data.
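
As a rough sketch of the in-memory cache and roll-up calculation, consider the following Python. The RollupCache class, its method names, and the "if-1" interface ID are hypothetical. The 95th percentile uses the nearest-rank method, which is only one of several common definitions; a real product may calculate it differently.

```python
from collections import defaultdict


class RollupCache:
    """Holds raw samples per interface for one roll-up period (hypothetical design)."""

    def __init__(self):
        self._samples = defaultdict(list)  # interface_id -> list of sample values

    def add(self, interface_id, value):
        """Record one polled value for an interface."""
        self._samples[interface_id].append(value)

    def rollup(self, interface_id):
        """Return MIN, MAX, AVG and 95th percentile, or None if there are no samples."""
        values = sorted(self._samples.get(interface_id, []))
        if not values:
            return None
        n = len(values)
        # Nearest-rank percentile: rank = ceil(0.95 * n), done in integer math.
        rank = (95 * n + 99) // 100
        return {
            "min": values[0],
            "max": values[-1],
            "avg": sum(values) / n,
            "p95": values[rank - 1],
        }

    def flush(self, interface_id):
        """Remove and return the raw samples so the caller can write them to disk."""
        return self._samples.pop(interface_id, [])


cache = RollupCache()
for utilization in range(1, 101):  # 100 made-up utilization samples
    cache.add("if-1", utilization)

summary = cache.rollup("if-1")  # this small record is what goes into the DB
raw = cache.flush("if-1")       # these samples go to the filesystem instead
print(summary)
```

Only the four-value summary needs a DB insert per interface per roll-up period; the raw samples never touch the database.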

Why does this work well? In normal use, the raw data is rarely accessed. Its main purpose is to create the roll-up summary data that is used for network performance trending. The network staff typically examines the raw data for only a few interfaces each day, so it makes sense to optimize the raw data storage mechanism for fast, cheap writes rather than for flexible queries. The result is a big performance boost over using the DB to store raw data.
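
One possible on-disk layout for the raw data is a file of fixed-size binary records per interface, appended to as samples arrive. The sketch below assumes a 12-byte record (a 32-bit timestamp plus a 64-bit float) and a made-up file naming scheme; neither comes from any particular NMS.

```python
import struct
import tempfile
from pathlib import Path

# Hypothetical record layout: little-endian uint32 timestamp + float64 value (12 bytes).
RECORD = struct.Struct("<Id")


def raw_path(directory, interface_id):
    """One raw-data file per interface (assumed naming scheme)."""
    return Path(directory) / f"if_{interface_id}.raw"


def append_samples(directory, interface_id, samples):
    """Append (timestamp, value) pairs; no indexing or logging is involved."""
    with open(raw_path(directory, interface_id), "ab") as f:
        for timestamp, value in samples:
            f.write(RECORD.pack(timestamp, value))


def read_samples(directory, interface_id, start_index=0, count=None):
    """Read records by position; fixed-size records allow seeking straight to them."""
    with open(raw_path(directory, interface_id), "rb") as f:
        f.seek(start_index * RECORD.size)
        data = f.read() if count is None else f.read(count * RECORD.size)
    usable = len(data) - len(data) % RECORD.size  # ignore a partial trailing record
    return [RECORD.unpack_from(data, off) for off in range(0, usable, RECORD.size)]


with tempfile.TemporaryDirectory() as tmp:
    append_samples(
        tmp, 42, [(1700000000, 12.5), (1700000300, 14.0), (1700000600, 90.25)]
    )
    last_two = read_samples(tmp, 42, start_index=1)
    first_one = read_samples(tmp, 42, count=1)

print(last_two, first_one)
```

Writes are simple appends, and the occasional read for one interface only has to open that interface’s file and seek to the records of interest.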

What are the advantages of this approach?

Reduced database storage requirements.
Improved database performance.
Less contention for database resources and disk I/O.
Raw data is more efficiently stored.
Historical raw data can be easily moved to a SAN for long-term storage.
Detailed displays of performance data are easy to produce as long as the raw data is easily accessed.
Micro sampling of specific interfaces can be done without a major impact on the polling engine.
Remote collectors can perform the periodic roll-up calculations and forward only the required data to the NMS analysis engine. Or, even better, keep all the data locally and have the central analysis system download rules to the polling engine, where preliminary identification can be performed by matching interfaces against the given criteria.
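
The last item in the list, rules downloaded to the polling engine, could look something like the sketch below. The rule format (metric, op, threshold), the function name, and the sample roll-ups are all assumptions made for illustration.

```python
import operator

# Comparison operators a downloaded rule may name (hypothetical rule format).
OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def matching_interfaces(rollups, rule):
    """Return the sorted IDs of interfaces whose roll-up summary satisfies the rule."""
    compare = OPS[rule["op"]]
    return sorted(
        if_id
        for if_id, summary in rollups.items()
        if compare(summary[rule["metric"]], rule["threshold"])
    )


# Roll-up summaries held locally on the remote collector (made-up values).
rollups = {
    "if-1": {"min": 2.0, "max": 97.0, "avg": 41.0, "p95": 88.0},
    "if-2": {"min": 1.0, "max": 35.0, "avg": 12.0, "p95": 30.0},
    "if-3": {"min": 5.0, "max": 99.0, "avg": 60.0, "p95": 95.0},
}

# A rule as the central analysis system might send it: flag 95th percentile above 80.
rule = {"metric": "p95", "op": ">", "threshold": 80.0}
flagged = matching_interfaces(rollups, rule)
print(flagged)
```

Only the list of flagged interface IDs needs to travel back to the central system; the summaries and raw data stay on the collector.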

Using these techniques, an NMS can increase its data collection performance and decrease its database storage requirements. The end result is an increase in overall system performance, which can be applied to making the UI run faster. And that’s a good thing.

-Terry

_____________________________________________________________________________________________

Re-posted with Permission

NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article, which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html
