**Introduction**
Just about all monitoring systems allow you to build your own custom
monitoring probes. Since most operating systems have only basic monitoring
of system resources, finer granularity is usually something you have to
build yourself. Additionally, having monitoring doesn't automatically set up
your escalation paths or decide on monitoring thresholds. Custom scripting
can be a challenge, but if you know where to look, you can create monitoring
that is tailor-made not only for your technical environment, but also for your
operational cycles and needs. Monitoring is only as good as you make it.
In this article we will explore some useful examples of writing scripts to
monitor the common hardware, software, and operating systems that we support.
We'll also talk about how to monitor your OpenVMS system using Unix
infrastructure. The scripting and approach will be open enough for any
monitoring system, but we'll use Zabbix as our main test case. One can also
use scripting for standalone monitoring. Let's check it out!
Scripting is an essential skill for all system administrators. Without it,
you are stuck doing only what pre-packaged admin tools allow you to do.
Scripting unlocks automation. Automation is what we all want: it makes our
jobs easier, produces clearer results, and provides a consistent baseline of
operational work over time.
In this article we will first explore creating custom monitoring scripts to
be run standalone, then I will show you how to integrate them into a
complete monitoring system such as Zabbix or others.
=== Selection of Scripting Languages ===
The truth is that there are a very large number of scripting languages and
just about all of them work great for monitoring. Most languages can deal
with the simple addition and multiplication that you need to compute your
metrics. Additionally, most scripting languages will allow you to execute
system utilities and then scrape/parse the output into data structures.
So, how do you choose a language to write your scripts in? Well, consider
three key factors.
* Familiarity: How long will it take you to figure out HOW to write your script? Pick a language you either already know or have enthusiasm to learn. If you have no real clue, then choose shell scripting. This is the most straightforward choice.
* Maintainability: How easy will it be for someone who isn't you to pick up the scripts, figure out what you did, and extend or fix them?
* Compatibility: How appropriate is the language to the environment, OS, and business context? Don't choose an obscure and hard-to-use language none of the other sysadmins will know. Don't choose a language which has poor math features or is known to be inadequate at string handling.
The language you choose matters less than the factors above. So, choose one
that will allow your scripts to live longer in the organization than simply
your employment term.
=== Common Monitoring API Features ===
Most monitoring systems have some kind of features or functionality for
external monitoring programs. They generally involve some client-side
configuration as well. In most cases, agents either push raw or structured
data to the server, or the server pulls it from them. However, not every
client works the same way. Let's examine some of the most popular monitoring
systems to see how this works.
==== Zabbix Scripts ====
Zabbix is a very popular open source monitoring suite. It's also my personal
favorite and what we use here at The PARSEC Group. The way one adds custom
monitoring scripts to an agent is to follow these steps.
**Adding a Custom Script to Zabbix Agents**
- Get the agent fully set up and federated to the Zabbix server using standard methods.
- Go into the agent configuration file and add a "key" to invoke the script.
- Allow scripts in the agent configuration. You can choose to also allow parameters to be passed to the script. In some cases this is needed, especially when one script does multiple jobs or handles multiple objects.
- Add an "item" to the monitored system on the Zabbix Management console and call the key from there.
The process is covered in detail in the Zabbix documentation. https://www.zabbix.com/documentation/current/manual/config/items/itemtypes/external
==== Nagios ====
Nagios, NetSaint, and similar monitoring tools use a setup similar to
Zabbix, though a bit simpler. You must be using NRPE to call external
scripts; legacy checks done via the check_nt plugin will not work. You
define a name in the agent configuration, then call that name from the
server process. The documentation for the process is here.
https://support.nagios.com/kb/article.php?id=528
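For example, a single line in nrpe.cfg is enough to expose a script to the
server (the key name and script path here are just illustrative):

.Defining an External Script in nrpe.cfg
----
command[disk_health]=/usr/local/bin/disk_health_check.sh
----

The server side then calls it with something like 'check_nrpe -H myhost -c
disk_health'.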
=== Why Plain Text Rules for Monitoring Data ===
Unix was built around standard text files. To some that means traditional
Latin ASCII; to others it might mean UTF-8. In either case, Unix's text
manipulation tools are copious, simple, and very effective. Once you have
the monitoring data you want as text, you are free to put more frou-frou
around it such as HTML, XML, or other obfuscation and window dressing.
However, if you do not start with the raw data, you are going to have a hard
time. Here are some reasons why using anything other than straight textual
output from basic monitoring tools can be foolish from an administrator's
perspective.
- There are dozens of tools to filter, re-organize, or represent textual data which come with Unix (awk, sed, tr, cut, and many more). Every Unix machine has these tools.
- There is almost no system that can't **import** text.
- The data we want doesn't need markup. Just a string like "Fan1: OK" is perfect and anything else is just a waste of time & effort.
- XML, JSON, YAML, and other structured text formats are fine, but this data doesn't actually need all that context and structure. Did you really need an XML hierarchy to tell you that "PS1 Fan" was a part of your power supply which is a "child" part of your server? No! You already knew that! You just need to know if the darn thing is spinning!
- The value of structured data is less than the value of the **accessibility** of text. Generally the data can be parsed and re-structured later if needed.
When it comes to basic information about the health of a system, KISS rules. Remember that those who want to make things more complex are almost certainly __selling__ something. Of course some fancy-pants monitoring product wants you to try to get XML data for their CIM provider. Then you are stuck with **them** to parse it for you until the end of time. Great for them, but sucks for you.
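As a tiny illustration of the point, here is a sketch of a one-liner (the
90% threshold is an arbitrary assumption) that turns standard 'df' output
into exactly the kind of flat status string any monitoring system can
consume:

.Plain Text Status from Standard Tools
----
# Emit "/var: OK" or "/var: FULL <pct>%" using nothing but df and awk.
df -P /var | awk 'NR==2 { sub("%","",$5);
  print ($5+0 > 90 ? "/var: FULL "$5"%" : "/var: OK") }'
----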
=== Monitoring Server Hardware ===
Here is a dirty secret most monitoring tool companies will not tell you.
They very rarely do even a half-baked job of trying to monitor server
hardware. Admittedly it can be quite difficult. The hardware vendors have
many ways of implementing hardware checks and some are better than others.
I want to discuss some specifics on each platform to give you some idea
where to go to monitor these machines.
==== Itanium Systems ====
There are many operating systems that you can run on an Itanium. OpenVMS,
HP-UX, Windows, and Linux are the main ones that come to mind. However, both
Linux and Windows dropped IA64 pretty soon after they saw the light of day.
The promised uber-compilers never appeared for these platforms and the
vendors gave up hoping for them. HP had already drunk a couple of gallons of
the Kool-Aid and didn't feel comfortable backing away from it, so they
stuck with the Itanic and appear to be ready to go down with the ship to
some degree.
So, if you are running Linux or Windows on IA64 your options are severely
limited, and you are probably better off using the IPMI or SNMP monitoring
that can be done from the system's iLO port. Some of the hardware, like
power supplies, can be monitored from that view.
If you happen to have HP-UX, you have one good option for hardware
monitoring from a script, and that is 'stm' or 'cstm' (depending on how old
your HP-UX box is), a system alert and information gathering tool. To see
what it can offer, try this:
**HP-UX CSTM Example**
echo "selclass qualifier system;info;wait;infolog" | cstm
==== WBEM Monitoring is Painful ====
The only place I ever saw HP put any effort into providing good server
hardware monitoring for the Integrity line is their WBEM provider, which is
really only useful to a few WBEM-based monitoring systems like BMC Patrol,
with whom HP had many relationships but never got married. Instead they
tried to foster something called Pegasus CIM (a WBEM provider & consumer)
which never really went anywhere. Within the WBEM agent there are actually
status indications for drives, DIMMs, fans, CPUs, PCIe cards, and power
supplies. It's a bit more specific and detailed than what you would see in
CSTM.
WBEM and CIM do have some open source implementations. However, because the
protocols are XML based and use arcane back-end web-queries with massive XML
requests, it's still pretty tough to use for most sysadmins. The tools are
also very specific. They require the admin to know exactly what objects to
query and have no real way to "explore" the CIM objects without first
knowing what they are. Sean Swehla of IBM maintains a big set of tools to
pull status from WBEM enabled systems, but honestly, you are probably better
off trying to get what you need from SNMP (despite it being less well
populated) because getting the CIM provider to just "give up the data" is no
trivial task. You need WBEM tools which are neither readily accessible nor
easy to use. Take a look at 'wbemcli' for example. There is no easy way
to say "just show me everything you've got" to WBEM. The queries must be
structured in XML, the results come back in XML, and to my knowledge there
is no universal way to request the whole corpus.
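For the record, here is roughly what a wbemcli enumeration looks like (a
sketch; the host, credentials, and CIM class are placeholders, and your
provider may not even populate CIM_Fan):

.A wbemcli Enumeration
----
# Enumerate fan instances from a WBEM/CIM provider over HTTP.
wbemcli ei 'http://user:password@wbem-host:5988/root/cimv2:CIM_Fan'
----

Note that you had to know the class name 'CIM_Fan' before you could even ask.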
==== Alpha Systems ====
Alpha systems are simpler than the IA64 systems that came after them. They
have a simple firmware interface called the SRM. This is the pre-boot
environment many call "The Chevron Prompt" because it looks like this:
# It looks like an officer's shoulder badge
>>>
The Alpha can give you some pretty great diagnostics from the console view,
but this isn't really scriptable because the console is not available when
the machine is booted. If you have an Alpha down at the firmware level (SRM
only; the ARC firmware is only for Windows NT) then try using these commands
to show any hardware problems after a system crash.
**Alpha SRM Diagnostic Commands**
show crash
show error
show fru
show memory
show config
If you happen to be using Tru64, you have a few options to check the system
out. Here is a list of notable commands for Tru64:
**Tru64 Hardware Related Commands**
* hwmgr show component -full
* hwmgr show scsi -full
* hwmgr show fiber
* scu scan edt
* consvar -l
For an OpenVMS machine on Alpha, there are some options, such as running the
system analyzer 'sda' to produce a list of hardware components. However, VMS
does not track the state of these devices. It will definitely throw
read/write errors when it cannot read from a disk device, but as far as I'm
aware one cannot monitor things like system fan RPM or power supply status
from the OpenVMS command line on an Alpha. If I'm wrong about this, please
set me straight!
Despite not having great status checks for hardware from the DCL CLI, one
does have quite a bit of information in the SNMP MIB-tree for OpenVMS. If you
load the MIB data, you can see the values of the OIDs shown by name, which
is much more friendly. Then you can see a full list of system runtime
values that can be monitored. Occasionally, hardware monitoring bits can be
found there.
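For example, from a nearby Unix box with net-snmp installed you could walk
the DEC enterprise subtree (1.3.6.1.4.1.36) and see what the VMS agent
exposes. The hostname and community string below are placeholders, and your
agent may only speak SNMPv1:

.Walking the OpenVMS SNMP Agent
----
snmpwalk -v1 -c public vms-host.example.com 1.3.6.1.4.1.36
----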
==== SPARC Systems ====
The SPARC platform dates back to the late 1980s, but really hit its high
point in the Dot-Com Era of the late 1990s. It also saw a short resurgence
right before Sun was purchased by Oracle; though Sun's profits didn't
reflect that, the units sold did.
The SPARC platform almost universally uses an OBP (Open Boot PROM) firmware
setup. That's somewhat good news. The bad news is that OBP has nearly zero
standard diagnostic routines (some exceptions, but nothing universal). So,
it's almost worthless from a monitoring point of view.
There are two main sources of what I'd call "good" monitoring information
from SPARC servers. One is from the Solaris tool 'prtdiag'. Running that
with the '-v' flag is enough to get most of the useful information you need
such as the state of power supplies, DIMMs, CPUs, and controller cards.
Identifying bad drives can sometimes be done from 'cfgadm' or 'luxadm'
output, but that's also inconsistent across hardware generations.
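A quick way to script against prtdiag is simply to look for anything it
doesn't consider healthy. A sketch (the keywords to match vary a bit across
hardware generations):

.A Scripted prtdiag Check
----
#!/bin/sh
# Exit non-zero and show the lines if prtdiag reports anything failed/faulty.
/usr/platform/$(uname -i)/sbin/prtdiag -v | egrep -i 'fail|fault' && exit 1
echo "OK"
----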
I said there were two good sources of info. The other source is the
system's controller board such as an ALOM, ILOM (v1, v2, v3), or LOM. These
devices are system controllers which stay on anytime the machine is
physically plugged in and supervise & monitor the system hardware. They are
the same as a Dell DRAC or HP ILO in terms of functionality, but they tend
to store a lot more hardware health information. The newer iLOMs and LOMs
are harder to work with and use DMTF SMASH and other overcomplicated
protocols to perform what should be simple queries for hardware
health. As dumb as they've made it, it's still somewhat accessible. I'd
first check the output from 'prtdiag' and see if it's adequate before you
turn to trying to scrape "show /System" from your iLOM. If you do go that
route, you probably want to use some kind of Expect script to fetch the
information for you into a text file and go from there.
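If you do end up going that route, the skeleton below shows the idea: shell
driving expect over ssh. Everything here (hostname, prompt strings, the
ILOM_PW environment variable) is an assumption to adapt to your controller:

.Scraping "show /System" from an iLOM
----
#!/bin/sh
# Drive an interactive iLOM session with expect and capture the output.
# Export ILOM_PW with the controller password before running.
HOST=ilom.example.com
expect -c '
  spawn ssh admin@'"$HOST"'
  expect "Password:"
  send "$env(ILOM_PW)\r"
  expect "-> "
  send "show /System\r"
  expect "-> "
  send "exit\r"
  expect eof
' > /tmp/ilom_status.txt
----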
The upside of systems using SMASH DMTF command line interfaces like the iLOM
is that they usually also have IPMI support. This is a MUCH better option
than using CIM or even SNMP. More discussion on IPMI when we talk about x86
PC servers. Hopefully, you don't need either because you are running
Solaris and you have 'prtdiag' satisfying your needs.
==== POWER Systems ====
Of all the systems I consider to be "hardware Unix platforms" the POWER
architecture is one of the worst and weakest when it comes to getting simple
hardware health status. First, the attitude of IBM is that you shouldn't be
touching it anyway. To their thinking, an IBM field engineer should be
on-contract to come and do this sort of thing for you.
One thing you'll notice is that IBM likes to use a lot of hexadecimal codes
and obscure error messages. Why? Well, the obvious explanation is that they
can charge you more that way!
Hardware health information on POWER comes from far too many places. First
off, again IBM's attitude at the AIX level is "we will tell you if something
goes wrong." Notice that is different from "We will show you the status of
everything, no matter if it's up/down or good/bad." This is a problem in the
extreme on IBM hardware. If AIX sees it, it's been abstracted to an ODM
entry that might have no reflection of the hardware reality. For example,
hdisk99 might be SCSI ID 0 and hdisk2 might be SCSI ID 1. Everything is an
abstraction.
So, where can we get hardware health information from AIX? Well, there are the usual places: an SNMP agent (which does offer quite a bit of hardware info compared with other vendors) and CIM agents, which, like other CIM agents, should probably be rejected outright due to their narrow applicability, opacity, and low functionality outside of curated environments. There is
also the HMC which collects hardware health information from the POWER system controller (the "SC"). The HMC generally stores hardware warnings associated with the "frame" (the whole server, not just an LPAR) and you can review any errors there, but no, you still can't get what I'd describe as an "enumerated healthcheck" where each hardware component's current health (good or bad) is displayed in an easy to read & parse textual format. Thus, the whole arrangement is not as good as something like **prtdiag** for example.
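The most scriptable of those sources from the AIX side is the error log
itself; 'errpt' will at least show you what AIX decided to tell you about.
The -d H flag limits the report to hardware-class entries:

.Checking the AIX Error Log for Hardware Events
----
# List hardware-class entries from the AIX error log.
errpt -d H
----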
One would not think that vendors could be so incredibly dumb as to ignore the need for such a tool to exist on pretty much all server systems, but while IBM clearly thinks it's a fine blunder to make, they aren't alone.
==== Common x86_64 Systems ====
Most x86-based server systems which are cheap and no-name (and some that aren't, like HP or Dell) use some type of IPMI-based hardware monitoring. The situation with IPMI is quite a bit better than with CIM. The tools, such as the venerable 'ipmitool', are actually likely to work and give you some helpful information. Supermicro relies pretty much exclusively on this method for their remote hardware monitoring.
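A sketch of the kind of one-liner that makes IPMI so pleasant (assumes
'ipmitool' and a local BMC; add -H, -U, and -P options for a remote LAN
interface, and note this will also flag sensors that simply have no
reading):

.Flagging Unhealthy IPMI Sensors
----
# List sensor readings and print any whose status field is not "ok".
ipmitool sdr elist | awk -F'|' '$3 !~ /ok/ { print $1 $3 }'
----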
There are a few other exotic options, such as expensive remote access cards
for PC servers. These days those are hard to find in the correct form
factor and interface. So, using something like an ATEN remote management
card is probably not an option for most modern servers, as they are
full-height cards which use an older PCI (not PCIe) interface that most
servers have long since dropped. However, _if_ you happen to have a
server that uses one, great! You are in for a treat, because they work
fabulously and provide great, easy-to-use command line tools for Linux
and BSD users to set up and probe the cards. Just understand that
these days are probably gone and nobody is going to allow server management
to ever be this simple again.
Vendors nowadays can't resist throwing giant XML or binary based
protocols at the problem rather than "just freakin' telling me!" via a
simple command line program anyone could use on the back-end. They lie to
themselves and say that it's easier to integrate into downstream monitoring
products. That is a total lie: in fact it makes integration significantly
harder versus just spewing the data out of a simple CLI program. If you are a
sysadmin, be sure when you get to talk to your vendors that you tell them
this, too. If all they ever hear is praise while they are out on the golf
course selling your boss more servers, then the situation will never change.
===== Dell OpenManage =====
Dell OpenManage is a ridiculously large, multi-gigabyte set of about 20-30
RPM packages from Dell that have some chance of installing some helpful
tools that might actually give your Linux server the ability to perform two
critical sysadmin tasks. These are:
- Show the status of the hardware sensors (temp, fan speed, VRM status, power supplies, CPU, memory status, etc..)
- Show the status of logical drives on a RAID controller answering the devilishly simple question of "Are the drives in the system okay?"
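If you do install it, the piece you actually want is the 'omreport' CLI,
which answers both questions above in plain text (a sketch; it assumes the
OpenManage services are installed and running, and subcommand availability
varies by version):

.The Two omreport Queries That Matter
----
# Overall chassis health: fans, temps, power supplies, memory, and so on.
omreport chassis
# Status of the logical drives on the RAID controller.
omreport storage vdisk
----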
OpenManage does a lot of other things too. It's part of an entire suite of Dell products and if you decide to drink the full gallon of Kool-Aid and probably launch a few support tickets, you can make it all work. However, there are several problems with OpenManage.
- It's absolutely HUGE for what it really does. Last I looked it was close to 3GB of installed software. All of that just so I can find out if anything is broken in the system or if there is a bad RAID LUN? Pathetic.
- It's incredibly overcomplicated, because they want you to integrate it with more useful Dell products that cost piles of cash. Even if it were 100% free, it would still be way too much work for much too small a payoff.
- The online documentation is terrible and always out of date. They don't seem to understand the sysadmin's needs AT ALL. It generally redirects to marketing materials meant to sell you something. We just want the hardware health!
- They integrate a bunch of firmware update tools that should be part of a different package. Monitoring and lifecycle management are different things. They smear a lot of IT tasks and roles with OpenManage and it makes division of labor more difficult, not easier.
===== HP Support Pack =====
Here's where I'd normally tell you about the HP Server Support Packs or
Proliant Support Packs (PSPs). However, in just about every way, the PSP is
identical to Dell OpenManage. It's a set of packages to further enable your
system, install "updated" drivers (which are often even older than what you
have), and most importantly to HP: push you to buy and use HP SIM for
monitoring your systems.
The good news is that if you hold your mouth right, you can set up SIM for
free. It's a pretty bad system with a lot of Java-based garbage and slow,
poorly designed GUIs. However, some people like that sort of thing. I'm not
one of them, but I won't sit here and tell you that SIM doesn't work. It can
work if you have the right attitude and put in the appropriate level of
effort. However, I'd hesitate to use the PSP and SIM as my primary means of
monitoring hardware.
- Given the fact that LM77-compliant sensors can easily be read without all the reams of extra packages (see the sketch after this list), why bother? You don't need the PSP.
- RAID controllers generally have utilities available from the chipset manufacturers which are much more "to the point" and simply give up the info you want as text.
- Given that the OS you are using (Red Hat, CentOS, SuSE, etc.) already has its own driver updates that are newer than HPE's in most cases, why bother with the PSP? You might get an **older** driver.
- The RPMs installed by the PSP can block other RPMs from installing and then they tend to get stale unless you are using some kind of online repo to update the PSP also (which is possible).
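Here's that sketch: on a typical Linux box the lm_sensors package alone
reads the motherboard sensors, no support pack required (assumes lm_sensors
is installed and 'sensors-detect' has been run once):

.Reading Sensors Without the PSP
----
# Show fan and temperature readings straight from the sensor chips.
sensors | egrep -i 'fan|temp'
----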
Unless it's going to give me a driver I don't even have available or a
monitoring tool I can't get any other way (unlikely) then the PSP is **far**
too much crap to install on a production system without making me nervous
and a bit irritated. It's mostly just irrelevant with newer operating
systems anyway.
== Common System Monitoring Tasks ==
Let's talk about the nuances and methods of monitoring certain system
services and hardware. Not all systems can be monitored the same way, and
not all the values come with the same quantifiers.
=== CPU Load ===
On a Unix system we often talk about "CPU load", but few people actually
understand what it means. The load on a Unix system is measured as a
floating point number from 0 to infinity, averaged over a window (typically
1, 5, and 15 minutes). This value represents the number of jobs occupying or
waiting in the run queue.
So, how do you interpret the value? Like a lot of things, "it depends". The
run queue count of any given system has a 1:1 relationship with the number
of CPU threads the system has. On modern systems that could mean the number
of cores times some sort of multiplier (e.g., Hyperthreading on x86, SMT on
POWER, or threads on SPARC).
Let's say you have a system with 4 CPUs, none of which have extra threads
or cores, so you just have 4 CPU run queues. This means that a "perfect
load" on your system would be the value of 4. Anything above that value
indicates jobs are waiting for CPU time. Anything below that value indicates
the CPU has idle moments when it could be doing work, but isn't.
So, to know if your system is overloaded, you first need to know how many
CPU threads your system thinks it has. Then you can properly interpret the
CPU load value.
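Here is a minimal sketch of that interpretation on Linux (nproc and
/proc/loadavg are Linux-isms; substitute psrinfo and friends on other
platforms):

.Comparing Load Average to Thread Count
----
#!/bin/sh
# Compare the 1-minute load average to the number of CPU threads.
threads=$(nproc)
load=$(cut -d' ' -f1 /proc/loadavg)
echo "load=$load threads=$threads"
# awk handles the floating point compare; a load above the thread
# count means jobs are waiting for CPU time.
if awk -v l="$load" -v t="$threads" 'BEGIN { exit (l > t) }'; then
    echo "CPU: OK"
else
    echo "CPU: BUSY"
fi
----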
There are other measures of CPU activity such as the "busy %" of any CPU or
all CPUs. So, for example, a 4-CPU system with a load average of 3.9
probably has a CPU busy percentage of 390% in absolute terms or 97.5% as a
per-CPU average. In English we'd say that the system is very busy and
nearly out of CPU cycles to give to your jobs.
=== Disk Health ===
Disks have more complex performance dimensions than CPUs. RAM has similar
dynamics to disks but obviously is much faster. Ultimately, we care about
several aspects of disk health as a systems admin.
- Is the disk working at all?
- How fast is its throughput?
- How good is the latency on the disk (lower is better)
- How busy is the I/O queue that's servicing the disk (disk busy %)
- Are there any predictive failures from S.M.A.R.T. or similar?
- Is it an SSD with degraded performance due to TRIM issues or perhaps a spinning drive that's been laid out with a bad sector size?
Each of these is important, and in many cases it's not always clear how to
get every statistic. On modern Unix flavors nearly all of this information
can be gleaned from 'iostat' output with some combination of flags.
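For instance, a sketch of a busy-disk check using Linux sysstat (the 90%
threshold is arbitrary, and %util's column position can vary between
sysstat versions, so verify $NF on your system):

.Flagging Busy Disks with iostat
----
# Print any device whose utilization tops 90% in the sampled interval.
iostat -dx 5 2 | awk '/^(sd|nvme|vd|xvd)/ { if ($NF+0 > 90) print $1": BUSY at "$NF"%" }'
----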
However, there are some things we need to unpack around hardware health.
For example, what about disks which are part of a logical RAID drive? How
can I know if that RAID disk is healthy or degraded due to losing some of
its members? How do I know if the RAID controller's spare drives are used
up or still ready for action? All of this information needs to come
together to give us a valid understanding of the system's disk/storage
health.
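For Linux software RAID at least, the kernel makes those questions easy to
answer; hardware controllers need their vendor's CLI (MegaCli, ssacli, and
the like) instead. A sketch:

.Linux Software RAID Health
----
# A degraded md array shows a "_" in place of a member, e.g. [U_].
grep -A1 '^md' /proc/mdstat
# Or ask mdadm directly (the device name is an example).
mdadm --detail /dev/md0 | grep -i 'state :'
----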
=== Network Monitoring ===
Network monitoring from a hardware perspective mostly just means monitoring
for link. However, throughput and latency are also common statistics to log
and analyze.
One overlooked gem for network hardware monitoring is mining the statistics
from 'netstat' and/or 'ifconfig' to understand any framing errors on your
NICs. These types of errors often indicate some kind of hardware issue such
as a bad cable, bad switch, or bad fiber media transceiver.
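A sketch of mining those counters on Linux with net-tools (column positions
differ on other Unix flavors, so check your 'netstat -i' header first):

.Spotting Interface Errors
----
# List interfaces with a non-zero RX-ERR count.
netstat -i | awk 'NR>2 && $4+0 > 0 { print $1": "$4" RX errors" }'
----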
=== User Events ===
What is a "user event" ? Anything that the user (in this case probably a
systems admin) would want to know about. Here are a few examples of system
events that we might want to monitor.
- OS initiated reboot
- Unexpected reboot based on a machine check exception (hardware)
- Applications we care about are/are-not running.
- The system cover was removed.
- Hardware was added or removed.
- Power went away or came back from one of our power supplies.
- Administrative users logged in and did "stuff"
All of these leave enough breadcrumbs that they can be monitored. The key is
looking for those breadcrumbs and developing scripts that will successfully
parse and report them consistently.
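A few examples of where those breadcrumbs live on a typical Linux system
(the log paths are distribution-specific assumptions; RHEL-family systems
use /var/log/secure, for instance):

.Breadcrumb Sources for User Events
----
# Planned and unplanned restarts.
last -x reboot shutdown | head -5
# Recent root logins (Debian-style path).
grep 'session opened for user root' /var/log/auth.log | tail -5
# Hardware machine check exceptions.
dmesg | egrep -i 'machine check|mce' | tail -5
----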
== Standalone Scripts ==
The next section is basically "put up or shut up." I've talked about how
monitoring can be done in multiple scenarios. We've discussed how to
interpret the data you get from the tools at hand. Lastly, I've established
that, at least in my not-so-humble opinion, getting too fancy with marking
up monitoring data is a tremendous mistake.
The fundamental reasons we do monitoring are to find and fix problems
before they occur and, when they do occur, to understand their scope and
impact as quickly as we can. Then, when the problems are addressed, the
monitoring should tell us when things are working as expected again. Taking
the guesswork out of that process and automating it has tremendous value.
However, for example, what value does XML data have by itself? None really.
So, the technology itself is a means to an end. If it doesn't give us basic
monitoring functionality, you better roll up your sleeves and invest your
business logic (and IT resources) into producing some useful automation. In
other words, start scripting your own solution because the vendors have had
it wrong basically forever. Keep in mind they are somewhat disincentivized
to solve your IT problems because doing so removes a selling point from some
of their other products you haven't bought, yet. It's the same reason why
automobile companies never create those "tear down" picture-by-picture
service guides for your car. They have little or no interest in telling you
how the car works. They'd rather just sell you another one.
Next, let's examine some sample standalone monitoring scripts.
=== Disk Health Script ===
{{ :disk_health_check.sh |Disk Health Check Script}}
=== CPU Health Script ===
{{ :cpu_health_check.sh |CPU Load Check Script}}
=== Network Health Script ===
{{ :net_health_check.sh |Network Health Check Script}}
=== User Event Script ===
{{ :event_check.sh |Event Check Script}}
== Integration of Custom Scripts into Zabbix and Others ==
Almost all monitoring systems use one of two mechanisms to evaluate custom
scripts: return codes and output strings. The most common is the use of the
return code. In Unix, if a program returns a non-zero exit code, that
generally means things did not go well.
When it comes to string based output, I tend to think that the KISS
principle is the most important. This is why in my sample scripts I give the
option to turn off verbosity and simply output "OK". This way, on systems
which rely on the final strings from the test-script, you can merely check
for something other than "OK" and consider that an error.
In the same way, systems that check for exit codes should work with the
samples out of the box. Notice how I'm careful to exit non-zero under any
kind of failure or warning scenario.
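In other words, the scripts stick to a convention like this sketch (the
probe itself is a stand-in):

.The Output and Exit Code Convention
----
#!/bin/sh
# Say "OK" and exit 0 when healthy; say why and exit non-zero otherwise.
fan_ok=true
if $fan_ok; then
    echo "OK"
    exit 0
else
    echo "Fan1: FAILED"
    exit 1
fi
----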
Another thing you'll notice is that once a script outputs an error or
warning, there is some attempt to prevent it from repeating. Monitoring
systems, and the organizations behind them, are better adapted to watching
for one meaningful event or alert in a sea of normality than to ignoring a
lot of repeating events that might already be cleared.
=== Agent Configuration ===
In most cases, monitoring agents aren't configured to use any external
scripts. They have internal counters and items they monitor for on their
own. There are quite a few problems with this.
- The default configuration may have no triggers for events. It may only be collecting data.
- The default agent configuration may not include critical information you need in your organization to ensure your application uptime.
- The default agent configuration may not have any escalation points that work.
Having even a basic healthcheck which reaches real people is better than
having a fancy monitoring agent that will never alert you about anything.
Keep in mind that most triggers and limits a monitoring vendor might set
won't be one-size-fits-all; they'll be one-size-annoys-most. Good monitoring
always starts with knowing what you want to look for.
In Zabbix, configuring an agent for the use of an external script involves
three steps. First you need to write the script. Second, assign the script a
"key" on the agent side configuration. Last, call that key from the server.
Here is an example for setting up a Zabbix agent key for one of the sample
scripts.
.Using an External Script
----
UserParameter=netcheck,/bhc/net_health_check.sh
----
Zabbix, unlike a lot of other monitoring systems, also allows you to pass
data to scripts. This is powerful, but it's also somewhat dangerous if you
don't trust the monitoring operations team. Zabbix is also powerful enough
that most of the types of monitoring I'm doing with the basic health check
scripts could also be done using the internal Zabbix agent's pre-populated
"items" (data points and metrics collected by the agent, like CPU time,
etc.).
=== Server Configuration ===
On the Zabbix server side one needs to go into the GUI interface and define
the new script we want to use as an "item" which is attached to a "host".
Keep in mind that items are just data points. If you want to do anything
based on data you get from a script, use a trigger.
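Before building triggers, it's worth verifying the key end-to-end from the
server with the zabbix_get utility (the hostname is a placeholder; the key
matches the earlier agent example):

.Testing a Key from the Server Side
----
zabbix_get -s monitored-host.example.com -k netcheck
----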
=== Other Monitoring Systems ===
Using external scripts in systems besides Zabbix is similar in spirit, but
different enough to be a whole separate set of operations. Let's look at a
couple of others.
==== HP OpenView ====
HP OpenView is a monitoring system that has been around a long time and
inspires a lot of strong feelings. It can be made to work quite well, and it
relies heavily on SNMP monitoring. The expectation of the HPOV guys is that
if you have something worth monitoring, you can simply populate an SNMP OID
with the data. This is a bit of lunacy, since that's pretty tough to do
(impossible in some cases).
So, maybe I lied a bit. I just wanted to call out HPOV. To really do it
right, you need to write your custom script almost completely with HPOV in
mind. You'll need to set some environment variables and make sure you can
call utilities like 'opcmon' or you won't be able to pass messages to the
monitoring console via your newly created script template.
It's a mess. It's one of the reasons people abandon their monitoring
altogether sometimes and start over. When they do start over, they usually
start with something made with their past struggles in mind. That is to
say, something that can realize the dream HPOV had (like Zabbix) or another
system which is radically simple, like a basic health check.
==== Nagios ====
Nagios uses external scripts as its bread-and-butter monitoring mechanism.
The two main agents, NRPE and NSClient++, both have a very simple INI-like
configuration file format. They can define external scripts much the same
way as Zabbix: give them a name (similar to a "key" in Zabbix) and call the
name from the server.
However, unlike Zabbix, Nagios is a bit more authoritarian about its exit
codes. This can mean that a decent monitoring script that already works
might need to be adapted for Nagios to use the exit codes it prefers.
However, if all you need is a non-zero code, then even un-adapted scripts
will work fine.
.Nagios Script Exit Codes
* 0 = OK
* 1 = WARNING
* 2 = CRITICAL
* 3 (or anything else) = UNKNOWN
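If you want a generic script to speak Nagios's dialect, a thin wrapper is
usually enough. A sketch (the script path reuses the earlier /bhc example
and is an assumption):

.Mapping a Generic Script to Nagios Exit Codes
----
#!/bin/sh
# Run a generic health check and map any failure to Nagios CRITICAL (2).
out=$(/bhc/net_health_check.sh 2>&1)
rc=$?
echo "$out"
[ "$rc" -eq 0 ] && exit 0
exit 2
----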
Also, just for fun, here is an example of an NSClient++ configuration with
multiple external scripts defined:
.External Scripts in NSClient++
----
[/settings/external scripts/scripts]
healthcheck=scripts\myhealthcheck.bat
foo=scripts\foo.bat
bar=scripts\bar.bat
----