Just about all monitoring systems allow you to build your own custom monitoring probes. Since most operating systems have only basic monitoring of system resources, finer granularity is usually something you have to build yourself. Additionally, having monitoring doesn't automatically set up your escalation paths or decide on monitoring thresholds. Custom scripting can be a challenge, but if you know where to look, you can create monitoring that is tailor-made not only to your technical environment, but also to your operational cycles and needs. Monitoring is only as good as you make it. In this article we will explore some useful examples of writing scripts to monitor common hardware, software, and operating systems which we support. We'll also talk about how to monitor your OpenVMS system using Unix infrastructure. The scripting and approach will be open enough for any monitoring system, but we'll use Zabbix as our main test case. One can also use scripting for standalone monitoring. Let's check it out!
Scripting is an essential skill for all system administrators. Without it, you are stuck with only being able to do whatever pre-packaged admin tools allow you to do.
Scripting unlocks automation. Automation is what we all want: it makes our jobs easier, produces clearer results, and provides a consistent baseline of operational jobs over time.
In this article we will first explore creating custom monitoring scripts to be run standalone, then I will show you how to integrate them into a complete monitoring system such as Zabbix or others.
The truth is that there are a very large number of scripting languages and just about all of them work great for monitoring. Most languages can deal with the simple addition and multiplication that you need to compute your metrics. Additionally, most scripting languages will allow you to execute system utilities and then scrape/parse the output into data structures.
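As a minimal sketch of the "run a utility, then parse its output" pattern, the Python below turns POSIX-format `df -P` output into a list of dictionaries. The function names and the 90% limit are illustrative assumptions, not part of any particular monitoring product, and the parser assumes filesystem names contain no spaces.

```python
import subprocess


def parse_df(text):
    """Parse POSIX-format `df -P` output into a list of dicts."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header line
        parts = line.split(None, 5)
        if len(parts) < 6:
            continue
        try:
            rows.append({
                "filesystem": parts[0],
                "blocks": int(parts[1]),
                "used": int(parts[2]),
                "available": int(parts[3]),
                "capacity_pct": int(parts[4].rstrip("%")),
                "mount": parts[5],
            })
        except ValueError:
            # Some pseudo filesystems report "-" instead of numbers; skip them.
            continue
    return rows


def full_filesystems(rows, limit_pct=90):
    """Return mount points at or above the (assumed) capacity limit."""
    return [row["mount"] for row in rows if row["capacity_pct"] >= limit_pct]


def read_df():
    """Run the real utility; not called automatically in this sketch."""
    result = subprocess.run(["df", "-P"], capture_output=True, text=True, check=True)
    return result.stdout


# Illustrative sample in the `df -P` layout, not taken from a real system.
SAMPLE_DF = """\
Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/sda1 1000 950 50 95% /
/dev/sdb1 2000 200 1800 10% /data
"""
```

On a live system one would call `full_filesystems(parse_df(read_df()))` and alert on any mount points it returns.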
So, how do you choose a language to write your scripts in? Well, consider three key factors.
The language you choose matters less than the factors above. So, choose one that will allow your scripts to live longer in the organization than simply your employment term.
Most monitoring systems have some kind of features or functionality for external monitoring programs. They generally involve some client side configuration as well. In most cases, agents either push or pull packed or structured data to the server. However, not every client is going to work the same way. Let's examine some of the most popular monitoring systems to see how they work.
Zabbix is a very popular open source monitoring suite. It's also my personal favorite and what we use here at The PARSEC Group. The way one adds custom monitoring scripts to an agent is to follow these steps.
Adding a Custom Script to Zabbix Agents
The process is covered in detail in the Zabbix documentation. https://www.zabbix.com/documentation/current/manual/config/items/itemtypes/external
Nagios, NetSaint, and similar monitoring tools use a setup similar to Zabbix, though a bit simpler. You must be using NRPE to call external scripts; agents such as check_nt++ will not work. You define a name in the agent configuration, then call that name from the server process. The documentation for the process is here.
Unix was built around standard text files. To some that means traditional Latin + ASCII; to others that might mean UTF-8. In either case, Unix's text manipulation tools are copious, simple, and very effective. Once you have the monitoring data you want as text, you are free to put more frou-frou around it such as HTML, XML, or other obfuscation or window-dressing. However, if you do not start with the raw data, you are going to have a hard time. Here are some reasons why using anything other than straight textual output from basic monitoring tools can be foolish from an administrator's perspective.
When it comes to basic information about the health of a system, KISS rules. Remember that those who want to make things more complex are almost certainly selling something. Of course some fancy-pants monitoring product wants you to try to get XML data for their CIM provider. Then you are stuck with them to parse it for you until the end of time. Great for them, but sucks for you.
Here is a dirty secret most monitoring tool companies will not tell you. They very rarely do even a half-baked job of trying to monitor server hardware. Admittedly it can be quite difficult. The hardware vendors have many ways of implementing hardware checks and some are better than others.
I want to discuss some specifics on each platform to give you some idea where to go to monitor these machines.
There are many operating systems that you can run on an Itanium. OpenVMS, HP-UX, Windows, and Linux are the main ones that come to mind. However, both Linux and Windows dropped IA64 pretty soon after they saw the light of day. The promised uber-compilers never appeared for these platforms and the vendors gave up hoping for them. HP had already drunk a couple of gallons of the Kool-Aid and didn't feel comfortable backing away from it, so they stuck with the Itanic and appear to be ready to go down with the ship to some degree.
So, if you are running Linux or Windows on IA64 your options are severely limited and you are probably better off by using the IPMI or SNMP monitoring that can be done from the system's ILO port. Some of the hardware, like power supplies, can be monitored from that view.
If you happen to have HP-UX, you have one good option for hardware monitoring from a script: 'stm' or 'cstm' (depending on how old your HP-UX box is), which is a system alert & information gathering tool. To see what it can offer you, try this:
HP-UX CSTM Example
echo "selclass qualifier system;info;wait;infolog" | cstm
The only place I ever saw HP put any effort into providing good server hardware monitoring for the Integrity is on their WBEM provider which is really only useful to a few WBEM based monitoring systems like BMC Patrol, who HP had many relationships with but never got married. Instead they tried to foster something called Pegasus CIM (a WBEM provider & consumer) which never really went anywhere. Within the WBEM agent there are actually status indications for drives, DIMMs, fans, CPUs, PCIe cards, and power supplies. It's a bit more specific and detailed versus what you would see in CSTM.
WBEM and CIM do have some open source implementations. However, because the protocols are XML based and use arcane back-end web-queries with massive XML requests, it's still pretty tough to use for most sysadmins. The tools are also very specific. They require the admin to know exactly what objects to query and have no real way to “explore” the CIM objects without first knowing what they are. Sean Swehla of IBM maintains a big set of tools to pull status from WBEM enabled systems, but honestly, you are probably better off trying to get what you need from SNMP (despite it being less well populated) because getting the CIM provider to just “give up the data” is no trivial task. You need WBEM tools which aren't readily accessible nor easy to use at all. Take a look at 'wbemcli' for example. There is no easy way to say “just show me everything you've got” to WBEM. The queries must be structured in XML and they are returned in XML and to my knowledge there is no universal way to request the whole corpus.
Alpha systems are simpler than the IA64 systems that came after. They have a simple firmware interface called the SRM. This is the pre-boot environment many call “The Chevron Prompt” because it looks like this:
# It looks like an officer's shoulder badge
>>>
The Alpha can give you some pretty great diagnostics from the console view, but this isn't really scriptable because the console is not available when the machine is booted. If you have an Alpha down at the firmware level (SRM only; ARCS firmware is only for Windows NT) then try using these commands to show any hardware problems after a system crash.
show crash
show error
show fru
show memory
show config
If you happen to be using Tru64, you have a few options to check the system out. Here is a list of notable commands for Tru64
Tru64 Hardware Related Commands
For an OpenVMS machine on Alpha, there are some options such as running the system analyzer 'sda' to produce a list of hardware components. However, VMS does not track the state on these devices. It will definitely throw read/write errors when it cannot read from a disk device, but as far as I'm aware one cannot monitor things like system fan RPM or power supply status from the OpenVMS command line on an Alpha. If I'm wrong about this, please set me straight!
Despite not having great status checks for hardware from the DCL CLI, one does have quite a bit of information in the SNMP MIB-tree for OpenVMS. If you load the MIB data, you can see the values of the OIDs shown by name, which is much more friendly. Then you can see a maximal list of system runtime values that can be monitored. Occasionally, hardware monitoring bits can be found there.
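To illustrate how named OID values can be turned into something a script can test, here is a rough Python sketch that parses text in the default `NAME = TYPE: VALUE` layout printed by Net-SNMP's `snmpwalk` when the MIBs are loaded. It assumes that layout; the two sample lines use standard HOST-RESOURCES-MIB names purely as an illustration and are not taken from an OpenVMS system.

```python
import re

# One line of walk output: "<name> = <TYPE>: <value>"
SNMP_LINE = re.compile(r"^(?P<oid>\S+)\s+=\s+(?P<type>[A-Za-z0-9-]+):\s*(?P<value>.*)$")


def parse_snmpwalk(text):
    """Map each OID name to a (type, value) tuple of strings.

    Lines that do not follow the assumed layout (for example values
    printed without a type prefix) are silently skipped.
    """
    values = {}
    for line in text.splitlines():
        match = SNMP_LINE.match(line.strip())
        if match:
            values[match.group("oid")] = (match.group("type"), match.group("value"))
    return values


def gauge(values, oid):
    """Return an integer reading for an OID, or None if it is absent or not numeric."""
    entry = values.get(oid)
    if entry is None:
        return None
    try:
        return int(entry[1])
    except ValueError:
        return None


# Illustrative sample only.
SAMPLE_WALK = """\
HOST-RESOURCES-MIB::hrSystemNumUsers.0 = Gauge32: 3
HOST-RESOURCES-MIB::hrSystemProcesses.0 = Gauge32: 112
"""
```

Once the values are in a dictionary, a threshold check is a one-line comparison against `gauge(...)`.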
The SPARC platform dates back to the late 1980s, but really hit its high point in the Dot-Com Era of the late 1990s. It also saw a short resurgence right before Sun was purchased by Oracle; the profits of Sun didn't reflect that, but the units sold did.
The SPARC platform almost universally uses an OBP (Open Boot PROM) firmware setup. That's somewhat good news. The bad news is that OBP has nearly zero standard diagnostic routines (some exceptions, but nothing universal). So, it's almost worthless from a monitoring point of view.
There are two main sources of what I'd call “good” monitoring information from SPARC servers. One is from the Solaris tool 'prtdiag'. Running that with the '-v' flag is enough to get most of the useful information you need such as the state of power supplies, DIMMs, CPUs, and controller cards. Identifying bad drives can sometimes be done from 'cfgadm' or 'luxadm' output, but that's also inconsistent across hardware generations.
I said there were two good sources of info. The other source is the system's controller board such as an ALOM, ILOM (v1, v2, v3), or LOM. These devices are system controllers which stay on anytime the machine is physically plugged in and supervise & monitor the system hardware. They are the same as a Dell DRAC or HP ILO in terms of functionality, but they tend to store a lot more hardware health information. The iLOMs and LOMs are harder to work with (newer) and use SMASH DMTF and other overcomplicated protocols to fetch what should be simple operations asking for hardware health. As dumb as they've made it, it's still somewhat accessible. I'd first check the output from 'prtdiag' and see if it's adequate before you turn to trying to scrape “show /System” from your iLOM. If you do go that route, you probably want to use some kind of Expect script to fetch the information for you into a text file and go from there.
The upside of systems using SMASH DMTF command line interfaces like the iLOM is that they usually also have IPMI support. This is a MUCH better option than using CIM or even SNMP. More discussion on IPMI when we talk about x86 PC servers. Hopefully, you don't need either because you are running Solaris and you have 'prtdiag' satisfying your needs.
Of all the systems I consider to be “hardware Unix platforms” the POWER architecture is one of the worst and weakest when it comes to getting simple hardware health status. First, the attitude of IBM is that you shouldn't be touching it anyway. To their thinking, an IBM field engineer should be on-contract to come and do this sort of thing for you.
One thing you'll notice is that IBM likes to use a lot of hexadecimal codes and obscure error messages. Why? Well, the obvious explanation is that they can charge you more that way!
Hardware health information on POWER comes from far too many places. First off, again IBM's attitude at the AIX level is “we will tell you if something goes wrong.” Notice that is different from “We will show you the status of everything, no matter if it's up/down or good/bad.” This is a problem in the extreme on IBM hardware. If AIX sees it, it's been abstracted to an ODM entry that might have no reflection of the hardware reality. For example, hdisk99 might be SCSI ID 0 and hdisk2 might be SCSI ID 1. Everything is an abstraction.
So, where can we get hardware health information from AIX? Well, there are some of the usual places like the fact that they have an SNMP agent (which does offer quite a bit of hardware info compared with other vendors) and CIM agents, which like other CIM agents should probably be rejected outright due to their narrow applicability, opacity, and low functionality outside of curated environments. There is also the HMC which collects hardware health information from the POWER system controller (the “SC”). The HMC generally stores hardware warnings associated with the “frame” (the whole server, not just an LPAR) and you can review any errors there, but no, you still can't get what I'd describe as an “enumerated healthcheck” where each hardware component's current health (good or bad) is displayed in an easy to read & parse textual format. Thus, the whole arrangement is not as good as something like prtdiag for example.
One would not think that vendors could be so incredibly dumb as to ignore the need for such a tool to exist on pretty much all server systems. Yet IBM seems content to make that blunder, and they aren't alone.
Most x86-based server systems which are cheap and no-name (and some that aren't, like HP or Dell) use some type of IPMI-based hardware monitoring. The situation with IPMI is quite a bit better than with CIM. The tools, such as the venerable 'ipmitool', are actually likely to work and give you some helpful information. Supermicro relies pretty much exclusively on this method for their remote hardware monitoring.
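Because 'ipmitool' prints plain pipe-separated text, scraping it takes very little code. The sketch below assumes the three-column `name | reading | status` layout of `ipmitool sdr` and treats any status other than `ok` or `ns` (no reading) as a problem; that status policy is an assumption to adjust for your hardware, and the sample readings are made up.

```python
import subprocess


def parse_sdr(text):
    """Split `ipmitool sdr` style output into (name, reading, status) tuples."""
    sensors = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 3:
            continue  # not a sensor line in the assumed layout
        sensors.append(tuple(fields))
    return sensors


def unhealthy(sensors, ignore=("ok", "ns")):
    """Return sensors whose status is not in the ignore list (assumed policy)."""
    return [sensor for sensor in sensors if sensor[2] not in ignore]


def read_sdr():
    """Run the real tool; not called automatically in this sketch."""
    result = subprocess.run(["ipmitool", "sdr"], capture_output=True, text=True, check=True)
    return result.stdout


# Illustrative sample only.
SAMPLE_SDR = """\
CPU Temp         | 45 degrees C      | ok
FAN1             | 5400 RPM          | ok
FAN2             | 0 RPM             | cr
PS2 Status       | no reading        | ns
"""
```

A standalone check would print each tuple returned by `unhealthy(parse_sdr(read_sdr()))` and exit non-zero if the list is not empty.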
There are a few other spacey options such as expensive remote access cards for PC servers. These days those are hard to find in the correct form factor and interface. So, using something like an ATEN remote management card is probably not an option for most modern servers as they are full height cards which use an older PCI (not PCIe) interface which most servers have long since dropped & deprecated. However, _if_ you happen to have a server that uses one, great! You are in for a treat because they work fabulously and provide great, easy to use command line tools for Linux and BSD users to set up and probe the cards. Just understand that these days are probably gone and nobody is going to allow server management to ever be this simple again.
Vendors nowadays can't resist throwing these giant XML or binary based protocols at the problem rather than “just freakin' tell me!” via a simple command line program anyone could use on the back-end. They lie to themselves and say that it's easier to integrate into downstream monitoring products. That is a total lie: in fact it makes it significantly harder versus just spewing the data out of a simple CLI program. If you are a sysadmin, be sure when you get to talk to your vendors that you tell them this, too. If all they ever hear is praise while they are out on the golf course selling your boss more servers, then the situation will never change.
Dell OpenManage is a ridiculously large multi-gig set of about 20-30 RPM packages from Dell that have some chance of installing some helpful tools that might actually give your Linux server the ability to perform two critical sysadmin tasks. These are:
OpenManage does a lot of other things too. It's part of an entire suite of Dell products and if you decide to drink the full gallon of Kool-Aid and probably launch a few support tickets, you can make it all work. However, there are several problems with OpenManage.
Here's where I'd normally tell you about the HP Server Support Packs or ProLiant Support Packs (PSPs). However, in just about every way, the PSP is identical to Dell OpenManage. It's a set of packages to further enable your system, install “updated” drivers (which are often even older than what you have), and most importantly to HP: push you to buy and use HP SIM for monitoring your systems.
The good news is that if you hold your mouth right, you can set up SIM for free. It's a pretty bad system with a lot of Java based garbage and slow, poorly designed GUIs. However, some people like that sort of thing. I'm not one of them, but I won't sit here and tell you that SIM doesn't work. It can work if you have the right attitude and put in the appropriate level of effort. However, I'd hesitate to use the PSP and SIM as my primary means of monitoring hardware.
Unless it's going to give me a driver I don't even have available or a monitoring tool I can't get any other way (unlikely) then the PSP is far too much crap to install on a production system without making me nervous and a bit irritated. It's mostly just irrelevant with newer operating systems anyway.
Let's talk about the nuance and methods of monitoring certain system services and hardware. Not all systems can be monitored the same way and not all the values come with the same quantifiers etc…
On a Unix system we often use “CPU Load” but few people actually understand what it means. The load on a CPU in Unix is measured as a floating point number from 0 to infinity. This value represents the average number of jobs that are running or waiting in the run queue.
So, how to interpret the value? Like a lot of things, “it depends”. The run queue of any given system has a 1:1 relationship with the number of CPU threads the system has. On modern systems that could mean the number of cores times some sort of multiplier (e.g. Hyper-Threading on x86, SMT on POWER, or threads on SPARC).
So, let's say you have a system with 4 CPUs and none of them have any threads or extra cores. So, you just have 4 CPU run queues. This means that a “perfect load” on your system would be the value of 4. Anything above that value indicates jobs are waiting for CPU time to get service. Anything below that value indicates that the CPU has some idle moments it could be doing work, but isn't.
So, to know if your system is overloaded, you first need to know how many CPU threads your system thinks it has. Then you can properly interpret the CPU load value.
There are other measures of CPU activity such as the “busy %” of any CPU or all CPUs. So, for example, a 4-CPU system with a load average of 3.9 probably has a CPU busy percentage of 390% in absolute terms or 97.5% as a per-CPU average. In English we'd say that the system is very busy and nearly out of CPU cycles to give to your jobs.
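The arithmetic above is easy to script. This is a small Python sketch that divides a load value by the number of CPU threads and labels the result; the 0.9 "busy" threshold and the label names are assumptions chosen for illustration, not standard values.

```python
import os


def load_per_cpu(load, cpus):
    """Normalize a load average by the number of CPU threads."""
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    return load / cpus


def describe(load, cpus, busy=0.9):
    """Label a load value; the thresholds here are illustrative assumptions."""
    ratio = load_per_cpu(load, cpus)
    if ratio > 1.0:
        return "OVERLOADED"  # jobs are waiting for CPU time
    if ratio >= busy:
        return "BUSY"        # close to a "perfect load"
    return "OK"


def current_status():
    """Check the live system using the 5-minute load average (Unix only)."""
    _one, five, _fifteen = os.getloadavg()
    return describe(five, os.cpu_count() or 1)
```

With the example from the text, `load_per_cpu(3.9, 4)` gives 0.975, which is the 97.5% per-CPU figure, and `describe(3.9, 4)` returns "BUSY".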
Disks have more complex performance dimensions than CPUs. RAM has similar dynamics to disks but obviously is much faster. Ultimately, we care about several aspects of disk health as a systems admin.
Each of these is important, and in many cases it's not clear how to get every statistic. On modern Unix flavors nearly all of this information can be gleaned from the 'iostat' output with some combination of flags.
However, there are some things we need to unpack around hardware health. For example, what about disks which are part of a logical RAID drive? How can I know if that RAID disk is healthy or degraded due to losing some of its members? How do I know if the RAID controller's spare drives are used up or still ready for action? All of this information needs to come together to give us a valid understanding of the system's disk/storage health.
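As one narrow illustration of the RAID question, the Python sketch below looks for degraded arrays in the text of `/proc/mdstat`. It assumes Linux software RAID (md), where a status such as `[2/1] [U_]` means two members are expected and one is active. Hardware RAID controllers do not appear there and need the vendor's own CLI, so treat this as a sketch of the approach rather than a general answer.

```python
import re

MD_HEADER = re.compile(r"^(md\d+)\s*:\s*(\S+)")
MD_STATUS = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def parse_mdstat(text):
    """Return {array: {"state", "expected", "active"}} from mdstat text."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = MD_HEADER.match(line)
        if header:
            current = header.group(1)
            arrays[current] = {"state": header.group(2), "expected": None, "active": None}
            continue
        status = MD_STATUS.search(line)
        if status and current:
            arrays[current]["expected"] = int(status.group(1))
            arrays[current]["active"] = int(status.group(2))
    return arrays


def degraded(arrays):
    """Names of arrays with fewer active members than expected."""
    return sorted(
        name for name, info in arrays.items()
        if info["active"] is not None and info["active"] < info["expected"]
    )


def read_mdstat(path="/proc/mdstat"):
    """Read the live file; not called automatically in this sketch."""
    with open(path) as handle:
        return handle.read()


# Illustrative sample only: md1 has lost one of its two members.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      976630336 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdc1[0]
      488254464 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""
```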
Network monitoring from a hardware perspective mostly just means monitoring for link. However, throughput and latency are also common statistics to log and analyze.
One overlooked gem for network hardware monitoring is mining the statistics from 'netstat' and/or 'ifconfig' to understand any framing errors on your NICs. These types of errors often indicate some kind of hardware issue such as a bad cable, bad switch, or bad fiber media transceiver.
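The same idea can be sketched in Python against Linux's `/proc/net/dev`, which carries the per-interface counters that 'netstat' and 'ifconfig' report. This assumes the Linux layout of sixteen counters per interface; other Unix flavors would need to parse `netstat -i` instead. Because the counters are cumulative, the sketch compares two samples and reports only the counters that grew.

```python
# Column positions in /proc/net/dev after the "iface:" prefix (Linux layout).
COUNTERS = {"rx_errs": 2, "rx_frame": 5, "tx_errs": 10, "tx_carrier": 14}


def parse_net_dev(text):
    """Return {interface: {counter_name: value}} for the error counters."""
    stats = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue
        stats[name.strip()] = {key: int(fields[idx]) for key, idx in COUNTERS.items()}
    return stats


def new_errors(before, after):
    """Return only the counters that increased between two samples."""
    grown = {}
    for iface, counters in after.items():
        old = before.get(iface, {})
        delta = {k: v - old.get(k, 0) for k, v in counters.items() if v > old.get(k, 0)}
        if delta:
            grown[iface] = delta
    return grown


# Illustrative samples only: eth0 gains three receive/framing errors.
SAMPLE_BEFORE = """\
Inter-|   Receive                                  |  Transmit
 face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
    lo: 1000 10 0 0 0 0 0 0 1000 10 0 0 0 0 0 0
  eth0: 500000 4000 7 0 0 5 0 0 300000 3500 0 0 0 0 0 0
"""
SAMPLE_AFTER = SAMPLE_BEFORE.replace("4000 7 0 0 5", "4100 10 0 0 8")
```

A steadily growing framing-error count on one interface is the kind of breadcrumb that points at a cable, switch port, or transceiver.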
What is a “user event”? Anything that the user (in this case probably a systems admin) would want to know about. Here are a few examples of system events that we might want to monitor.
All of these leave enough breadcrumbs that they can be monitored. The key is looking for those breadcrumbs and developing scripts that will successfully parse and report them consistently.
The next section is basically “put up or shut up.” I've talked about how monitoring can be done in multiple scenarios. We've discussed how to interpret the data you get from the tools at hand. Lastly, I've established that, at least in my not-so-humble opinion, getting too fancy with marking up monitoring data is a tremendous mistake.
The fundamental reason we do monitoring is to find and fix problems before they occur and, when they do occur, to understand their scope and impact as quickly as we can. Then, when the problems are addressed, the monitoring should tell us when things are working as expected again. Taking the guesswork out of that process and automating it has tremendous value.
However, for example, what value does XML data have by itself? None really. So, the technology itself is a means to an end. If it doesn't give us basic monitoring functionality, you had better roll up your sleeves and invest your business logic (and IT resources) into producing some useful automation. In other words, start scripting your own solution because the vendors have had it wrong basically forever. Keep in mind they are somewhat disincentivized to solve your IT problems because doing so removes a selling point from some of their other products you haven't bought yet. It's the same reason why automobile companies never create those “tear down” picture-by-picture service guides for your car. They have little or no interest in telling you how the car works. They'd rather just sell you another one.
Next, let's examine some sample standalone monitoring scripts.
Almost all monitoring systems use one of two mechanisms to monitor custom scripts. These are return code and output strings. The most common is the use of the return code. In Unix if a program returns a non-zero exit code, that generally means things did not go well.
When it comes to string based output, I tend to think that the KISS principle is the most important. This is why in my sample scripts I give the option to turn off verbosity and simply output “OK”. This way, on systems which rely on the final strings from the test-script, you can merely check for something other than “OK” and consider that an error.
In the same way, systems that check for exit codes should work with the samples out of the box. Notice how I'm careful to exit non-zero under any kind of failure or warning scenario.
Another thing you'll notice is that once a script outputs an error or warning, there is some attempt to prevent it from repeating. Monitoring systems and the organization behind them are better adapted to watching for one meaningful event or alert in a sea of normality than to ignoring a lot of repeating events that might already be cleared.
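The conventions described above can be sketched in a few lines of Python. This is not one of the sample scripts; it is a minimal illustration under my own assumptions: a check returns `None` when healthy or a message when not, the exit code is non-zero whenever any problem exists, and a small state file (a hypothetical path in the temp directory) suppresses the detailed text when the same problems were already reported on the previous run.

```python
import hashlib
import os
import tempfile

# Hypothetical location for the "already reported" marker.
DEFAULT_STATE = os.path.join(tempfile.gettempdir(), "healthcheck.state")


def run_checks(checks):
    """Run (name, function) pairs; a check returns None or a problem message."""
    problems = []
    for name, func in checks:
        try:
            message = func()
        except Exception as exc:  # a check that crashes counts as a failure
            message = f"check raised {exc!r}"
        if message:
            problems.append(f"{name}: {message}")
    return problems


def report(problems, state_file=DEFAULT_STATE):
    """Return (exit_code, text); exit_code is non-zero while any problem exists."""
    if not problems:
        if os.path.exists(state_file):
            os.remove(state_file)  # problem cleared, so re-arm the alert
        return 0, "OK"
    digest = hashlib.sha256("\n".join(sorted(problems)).encode()).hexdigest()
    previous = ""
    if os.path.exists(state_file):
        with open(state_file) as handle:
            previous = handle.read().strip()
    with open(state_file, "w") as handle:
        handle.write(digest)
    if previous == digest:
        return 1, "FAIL (unchanged since last report)"
    return 1, "FAIL\n" + "\n".join(problems)
```

A wrapper would print the returned text and pass the code to `sys.exit()`, so the same script serves systems that read the output string and systems that read the exit code.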
In most cases, monitoring agents aren't configured to use any external scripts. They have internal counters and items they monitor for on their own. There are quite a few problems with this.
Having even a basic healthcheck which reaches real people is better than having a fancy monitoring agent that will never alert you about anything.
Keep in mind that most triggers and limits that a monitoring vendor might set won't be a one-size fits all. It'll be a one-size annoys most. Good monitoring always starts with knowing what you want to look for.
In Zabbix, configuring an agent for the use of an external script involves three steps. First you need to write the script. Second, assign the script a “key” on the agent side configuration. Last, call that key from the server.
Here is an example for setting up a Zabbix agent key for one of the sample scripts.
.Using an External Script
Zabbix, unlike a lot of other monitoring systems, can also allow you to pass data to scripts. This has a lot of power, but it's also somewhat dangerous if you don't have trust with the monitoring operations team. Zabbix is also powerful enough that most of the types of monitoring I'm doing with the basic health check scripts could also be done just using the internal Zabbix Agent pre-populated “items” (data points and metrics collected by the agent like CPU time etc..).
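Since passing data to scripts is where the danger lies, here is a hedged Python sketch of how a script might validate what the server sends before acting on it. The agent-side line in the comment uses the Zabbix `UserParameter=key[*],command` form with `$1` and `$2` positional references; the key name `custom.health`, the script path, and the allowed check names are all hypothetical.

```python
import re

# Hypothetical agent configuration line that would call this script:
#   UserParameter=custom.health[*],/usr/local/bin/healthcheck.py "$1" "$2"

ALLOWED_CHECKS = {"disk", "load", "raid"}      # hypothetical check names
SAFE_TARGET = re.compile(r"[A-Za-z0-9_./-]{1,64}")


def validate(args):
    """Return (check, target) or raise ValueError for anything unexpected."""
    if len(args) != 2:
        raise ValueError("expected exactly two arguments: <check> <target>")
    check, target = args
    if check not in ALLOWED_CHECKS:
        raise ValueError(f"unknown check: {check!r}")
    if not SAFE_TARGET.fullmatch(target) or ".." in target:
        raise ValueError(f"unsafe target: {target!r}")
    return check, target
```

The design choice is to accept only a known set of check names and a conservative character set, and never to hand the raw arguments to a shell. In a real script `validate(sys.argv[1:])` would run first, and any `ValueError` would become a non-zero exit.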
On the Zabbix server side one needs to go into the GUI and define the new script we want to use as an “item” which is attached to a “host”. Keep in mind that items are just data points. If you want to do anything based on data you get from a script, use a trigger.
Using external scripts in systems besides Zabbix is similar but different enough to be a whole different set of operations. Let's look at a couple of others.
HP OpenView is a monitoring system that has been around a long time and inspires a lot of strong feelings. It can be made to work quite well and heavily relies on SNMP monitoring. The expectation of the HPOV guys is that if you have something worth monitoring, you can simply populate an SNMP OID with the data. This is a bit of lunacy, since that's pretty tough to do (impossible in some cases).
So, maybe I lied a bit. I just wanted to call down HPOV. To really do it right, you need to write your custom script nearly completely with HPOV in mind. You'll need to set some environment variables and make sure you can call utilities like 'opcmon' or you won't be able to pass messages to the monitoring console via your newly created script template.
It's a mess. It's one of the reasons people abandon their monitoring altogether sometimes and start over. When they do start over, they usually start with something made with their past struggles in mind. That is to say, something that can realize the dream HPOV had (like Zabbix) or another system which is radically simple, like a basic health check.
Nagios uses external scripts as bread-and-butter heavy lifting monitoring scripts. The two main agents, NRPE and NSClient++, both have a very simple INI-like configuration file format. They can define external scripts much the same way as Zabbix. Give them a name (similar to a “Key” in Zabbix) and call the name from the server.
However, unlike Zabbix, Nagios is a bit more authoritarian about its exit codes. This can mean that a decent monitoring script that already works might need to be adapted for Nagios to use the exit codes it prefers. However, if all you need is a non-zero code, then even un-adapted scripts will work fine.
.Nagios Script Exit Codes.
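The conventional Nagios plugin return codes are 0 for OK, 1 for WARNING, 2 for CRITICAL, and 3 for UNKNOWN. As a small Python sketch of adapting a script to them, the helper below maps a measured value to one of those codes; the threshold arguments and the "STATUS - message" output style are illustrative assumptions rather than requirements.

```python
# Conventional Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value, warn, crit):
    """Map a reading to a Nagios code; None means the reading was unavailable."""
    if value is None:
        return UNKNOWN
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def status_line(code, message):
    """Build a one-line status string in a commonly used style."""
    return f"{LABELS.get(code, 'UNKNOWN')} - {message}"
```

A script adapted this way would print `status_line(code, ...)` and then exit with `code`, for example via `sys.exit(code)`.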
Also, just for fun, here is an example of an NSClient++ configuration with multiple external scripts defined:
.External Scripts in NSClient++
[/settings/external]
healthcheck=scripts\myhealthcheck.bat
foo=scripts\foo.bat
bar=scripts\foo.bat