The Nagios Setup Explained

Perfect!


In this article, we shall discuss Nagios, an open source tool that is deployed in most data centres to monitor various system and network parameters.

The practice of appointing prefects is a time-honoured one. Whether in schools, colleges, armies or societies, overseers (or prefects), such as district magistrates or priests, have from time immemorial played an important part in regulating and monitoring the performance and day-to-day activities of the groups of people they oversee.

Information Technology is no different — an overseer/monitor is required to constantly regulate and monitor the health of the hardware, software and network in modern-day data centres. These “prefects” of the data centre provide vital information to systems administrators — such as the amount of free disk space, network outages, application and hardware downtime, and even server-room temperatures.

Nagios (a recursive acronym for “Nagios Ain’t Gonna Insist On Sainthood”) has been one of the most favoured “prefects” of the data centre, monitoring parameters such as systems status (whether a system is up and running; CPU/memory/disk usage, etc.), service status (whether a service is up and running — e.g., DNS, Web server, mail server, etc.), and many other factors including room temperature and even humidity! It can generate alerts (through email/SMS) when the monitored parameters exceed preset thresholds.

As I sit down to write this article, I am glad to share with you that this “perfect prefect” has saved many of my clients hundreds of hours of downtime. Just recently, a customer decided to move a problematic database from a central database server host, because Nagios had alerted us about a possible problem with one of the schemas, which was adversely affecting the overall health of the database server, and could have severely affected other mission-critical production database schemas on the same host.

In this article, I shall try to dispel a commonly held myth: that Nagios is difficult to install. I distinctly remember that about five years back, a senior manager in a big IT firm called me and mentioned this as one of the reasons why the company planned to outsource the installation to us. It might have required a bit of tweaking then, but that is no longer the case; you can easily install and configure it to meet your requirements.

Let us install and configure Nagios to monitor a sample service, and hence get an idea of how Nagios can benefit you and your organisation.

We’ll install Nagios on an RHEL 5 host called prefect.knafl.org. We will use it to monitor itself — whenever it is available — and send alerts to nagios-admin@localhost in case of an outage. In a future article, perhaps, we will look at monitoring remote hosts and services.

Installation

On Red Hat Enterprise Linux, Nagios can be easily installed using the EPEL Repository. To the uninitiated, EPEL is: “Extra Packages for Enterprise Linux (or EPEL), a Fedora Special Interest Group that creates, maintains, and manages a high-quality set of additional packages for Enterprise Linux, including, but not limited to, Red Hat Enterprise Linux (RHEL), CentOS and Scientific Linux (SL).”

To ensure that Nagios is available in the EPEL repository, let’s browse the relevant repository (since ours is a 64-bit host, we’re looking at the x86_64 EPEL repository). Jumping to the packages whose names begin with “N”, we can see that, as of this writing, there are 65 Nagios packages (RPMs) available for 64-bit RHEL 5. We can confirm the count with the following command, run against the URL for that group of packages:

[vbg@vbg ~]$ elinks --dump http://download.fedora.redhat.com/pub/epel/5/x86_64/repoview/letter_n.group.html | grep -i nagios | grep -v html | wc -l
65
[vbg@vbg ~]$

To install Nagios from EPEL, add the EPEL repository to yum, and then install the RPMs. The instructions to add the EPEL repository (clearly mentioned on the EPEL site) are as follows:

  1. Download the relevant RPM to set up the repo:
    [vbg@prefect downloads]$ wget -c http://download.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm
  2. Install it:
    [root@prefect ~]# rpm -Uvh /home/vbg/downloads/epel-release-5-4.noarch.rpm
    warning: /home/vbg/downloads/epel-release-5-4.noarch.rpm: Header V3 DSA signature: NOKEY, key ID 217521f6
    Preparing...                ########################################### [100%]
    1:epel-release              ########################################### [100%]

On listing the files installed by the RPM, you will see a GPG key (for checking package signatures) and a repo file to identify the package source:

[root@prefect ~]# rpm -ql epel-release-5-4
/etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL
/etc/yum.repos.d/epel-testing.repo
/etc/yum.repos.d/epel.repo
/usr/share/doc/epel-release-5
/usr/share/doc/epel-release-5/GPL

Let us now install Nagios:

[root@prefect ~]# yum clean all
Loaded plugins: rhnplugin, security
Cleaning up Everything
[root@prefect ~]# yum list nagios*

Experience tells us that you should never install every available package on a host; install only the ones you need. Therefore, begin with the basic packages:

[root@prefect ~]# yum install nagios nagios-common nagios-plugins \
nagios-plugins-http nagios-plugins-disk nagios-plugins-ping

Hopefully, the above packages will be installed on your system without any hiccups. Any doubts about Nagios installation being complex should now be removed.

To configure Nagios, you first need to find its configuration files, which is simple with rpm’s query switches. To locate the configuration files provided by the nagios package, simply run the following command:

[root@prefect ~]# rpm -qc nagios
/etc/httpd/conf.d/nagios.conf
/etc/logrotate.d/nagios
/etc/nagios/cgi.cfg
/etc/nagios/commands.cfg
/etc/nagios/localhost.cfg
/etc/nagios/nagios.cfg
/etc/nagios/private/resource.cfg

Let’s have a look at the various configuration files (each with a specific purpose), and understand how Nagios uses them. They are:

  • The main configuration file — /etc/nagios/nagios.cfg
  • Object definition files — /etc/nagios/commands.cfg and /etc/nagios/localhost.cfg
  • Resource configuration file — /etc/nagios/private/resource.cfg
  • CGI configuration file — /etc/nagios/cgi.cfg

The Apache configuration file (/etc/httpd/conf.d/nagios.conf) contains the directives for the URLs http://<nagios-host>/nagios/ and http://<nagios-host>/nagios/cgi-bin/, whereas /etc/logrotate.d/nagios is a log rotation configuration file.

The main configuration file

The /etc/nagios/nagios.cfg file controls the behaviour of the Nagios process and also the CGIs. There are many configuration directives in this file, and all of them are well documented. Let us look at some of the more important ones to get our basic configuration going:

  • Log file: This should be the first directive — the log file where host and service events are logged. Be careful that the file is accessible and writeable by the nagios user:
    log_file=<path-of-log-file>
  • Nagios user and group: These are the user and group names under which the nagios process runs. The yum installation, as above, creates both a user and a group named nagios, which we will use:
    nagios_user=nagios
    nagios_group=nagios
  • Object definition file(s): This parameter can be specified multiple times. These files contain definitions for each host and service, as well as groups of hosts and services. As an example, the yum installation creates two object configuration files: commands.cfg and localhost.cfg. We will look at these a little later. The parameter syntax is as follows:
    cfg_file=<path-of-object-definition-file_1>
    cfg_file=<path-of-object-definition-file_2>
  • Object cache file (object_cache_file): To speed up operations, the nagios service caches the object definitions it has read in a cache file, which is then read by the CGIs. This also prevents inconsistencies, such as when an object file is being modified and is saved before all the changes are complete.
  • Status file (status_file): This file is where the status of all monitored hosts and services is stored by Nagios, to be processed by the CGI scripts.
  • Resource file (resource_file): This parameter too can be specified multiple times. Resource files contain macros that Nagios expands when executing a command found in the commands file; we will look at an example in the next section. The CGIs do not read these files, and they can contain sensitive information such as user names and passwords. Therefore, restrictive permissions such as 600 (only the owner can read/write) should be set on these files. As you can see, the Nagios RPMs install these files in a separate directory, /etc/nagios/private, which is owned by the root user and accessible to the nagios group:
    [root@prefect ~]# ls -ld /etc/nagios/
    drwxr-xr-x 3 root root 4096 May 21 08:31 /etc/nagios/
    [root@prefect ~]# ls -ld /etc/nagios/private/
    drwxr-x--- 2 root nagios 4096 May 20 07:35 /etc/nagios/private/
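
To see how these directives sit together, here is a minimal shell sketch that writes a miniature main configuration to a temporary file and pulls out directive values. The sample values and the get_directive helper are illustrative assumptions, not the contents of the packaged nagios.cfg.

```shell
#!/bin/sh
# Sketch only: a made-up miniature nagios.cfg, not the packaged one.
sample_cfg=$(mktemp)
cat > "$sample_cfg" <<'EOF'
# main configuration (illustrative values)
log_file=/var/log/nagios/nagios.log
cfg_file=/etc/nagios/commands.cfg
cfg_file=/etc/nagios/localhost.cfg
resource_file=/etc/nagios/private/resource.cfg
nagios_user=nagios
nagios_group=nagios
EOF

# get_directive NAME FILE: print every value given for NAME, one per line.
# (hypothetical helper; it relies on the plain name=value syntax)
get_directive() {
    awk -F= -v key="$1" '$1 == key { print substr($0, length(key) + 2) }' "$2"
}

get_directive cfg_file "$sample_cfg"      # one line per object definition file
get_directive nagios_user "$sample_cfg"   # the user the daemon runs as
```

Pointing the same helper at /etc/nagios/nagios.cfg should list the object definition files your own installation loads, assuming the directives are written without spaces around the equals sign.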

The object and resource definition files

Objects are entities that need to be monitored, or are used for monitoring. Some examples are commands, hosts, groups, services and contacts. Let us explore a host object and a command object in this article.

Host object definitions are used to define a particular host that is being monitored; the mandatory directives are:

  • host_name: a short name for the host. Multiple services can be monitored on a single host. Normally, the FQDN is used.
  • alias: a longer description.
  • address: the IP address of the host being monitored.
  • max_check_attempts: the number of attempts to check the host, if a non-OK state is returned.
  • check_period: the name of the time period (defined as a separate timeperiod object) during which checks should be made.
  • contact_groups: the contact groups (people to be contacted) in case of problems (or recoveries) with this host.
  • notification_interval: the time interval (by default, in minutes) after which notifications will be sent, in case the host is still down.
  • notification_period: the time period in which notifications should be sent. In case the host is down in a time period that is not in this period, no notifications will be sent.
  • notification_options: This directive can have the following values:
    • d: send notifications when the host is down
    • u: send notifications if the host is unreachable
    • r: send notifications on recoveries
    • f: send notifications when the host starts and stops flapping (flapping occurs when a service/host changes state too frequently; flap detection is used to judge whether a service/host is stable)
    • n: no notifications will be sent
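
As a rough illustration of the directives above, the following sketch writes a host definition that spells out each of them explicitly, without a template, and then checks that none is missing. The host name, alias and address are made-up examples, and the file goes to a temporary location rather than /etc/nagios. The check_command line is an addition that the list above does not cover.

```shell
#!/bin/sh
# Sketch only: the host name and address below are made-up examples.
host_cfg=$(mktemp)
cat > "$host_cfg" <<'EOF'
define host{
host_name              web01.example.org   ; hypothetical FQDN
alias                  Example web server
address                192.0.2.10          ; documentation-only address
check_command          check-host-alive
max_check_attempts     10
check_period           24x7
contact_groups         admins
notification_interval  120                 ; minutes
notification_period    workhours
notification_options   d,u,r               ; down, unreachable, recovery
}
EOF

# Report any directive from the list above that the definition does not set.
for d in host_name alias address max_check_attempts check_period \
         contact_groups notification_interval notification_period \
         notification_options; do
    grep -q "^${d}[[:space:]]" "$host_cfg" || echo "missing: $d"
done
```

The loop prints nothing when every directive is present. The referenced objects (the admins contact group, the two time periods and the check-host-alive command) must still be defined elsewhere before Nagios would accept such a host.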

A more efficient way to use host definitions is to define templates and use them. A snippet from the file /etc/nagios/localhost.cfg that defines a template, and then uses it for a host object definition, is shown below:

define host{
use linux-server            ; Name of host template to use
; This host definition will inherit all variables that are defined
; in (or inherited by) the linux-server host template definition.
host_name localhost
alias           localhost
address         127.0.0.1
}

The use statement above specifies that this host definition uses a template called linux-server. It is defined in the same file, as follows:

define host{
name linux-server     ; The name of this host template
use  generic-host     ; This template inherits other values from the generic-host template
check_period 24x7     ; By default, Linux hosts are checked round the clock
max_check_attempts 10 ; Check each Linux host 10 times (max)
check_command check-host-alive   ; Default command to check Linux hosts
notification_period workhours    ; Linux admins hate to be woken; only notify in the day
; Note that notification_period overrides the value
; inherited from the generic-host template!
notification_interval  120           ; Resend notification every 2 hours
notification_options   d,u,r         ; Only send notifications for specific host states
contact_groups         admins        ; Notifications get sent to the admins by default
register               0             ; DONT REGISTER THIS TEMPLATE DEFINITION!
}

This template further uses a template called generic-host, which is also defined in the same file, as:

define host{
name generic-host            ; The name of this host template
notifications_enabled 1      ; Host notifications are enabled
event_handler_enabled 1      ; Host event handler is enabled
flap_detection_enabled 1     ; Flap detection is enabled
failure_prediction_enabled 1 ; Failure prediction is enabled
process_perf_data 1          ; Process performance data
retain_status_information 1  ; Retain status information across program restarts
retain_nonstatus_information  1 ; Retain non-status information across program restarts
notification_period 24x7     ; Send host notifications at any time
register 0                   ; DONT REGISTER THIS TEMPLATE DEFINITION!
}
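
To make the inheritance chain concrete, here is a simplified shell model of how the three definitions combine. It is only a sketch under the assumption that a value set closer to the host wins over an inherited one; it is not how Nagios itself parses templates, and the directive list is abridged.

```shell
#!/bin/sh
# Sketch only: a simplified model of "use" inheritance, not Nagios's parser.
# Directives are listed child first; the first value seen for a name wins.
effective_host() {
    awk '/^#/ { next } !seen[$1]++ { print }' <<'EOF'
# from the localhost host definition
host_name localhost
address 127.0.0.1
# from the linux-server template
check_period 24x7
max_check_attempts 10
notification_period workhours
notification_interval 120
notification_options d,u,r
contact_groups admins
# from the generic-host template
notifications_enabled 1
flap_detection_enabled 1
notification_period 24x7
EOF
}

effective_host
```

In the output, notification_period appears once, with the value workhours: the linux-server template overrides the 24x7 value it would otherwise inherit from generic-host.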

Other objects referenced in the above snippets are:

  • contact_groups called “admins”
  • notification_period called “workhours” and “24x7”
  • check_command called “check-host-alive”

The contactgroup object called “admins” is also defined in the same file:

define contactgroup{
contactgroup_name       admins
alias                   Nagios Administrators
members                 nagios-admin
}

The member nagios-admin of the above contact group is defined as:

define contact{
contact_name                    nagios-admin
alias                           Nagios Admin
service_notification_period     24x7
host_notification_period        24x7
service_notification_options    w,u,c,r
host_notification_options       d,r
service_notification_commands   notify-by-email
host_notification_commands      host-notify-by-email
email                           nagios-admin@localhost
}
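
Note that the contact carries its own notification options, which act as a further filter on top of the host's. The sketch below simply computes the overlap between two option lists; the common_options helper is hypothetical and only models that filtering.

```shell
#!/bin/sh
# Sketch only: models the overlap between two comma-separated option lists.
# common_options A B: print the options present in both lists, space-separated.
common_options() {
    result=""
    for opt in $(echo "$1" | tr ',' ' '); do
        case ",$2," in
            *",$opt,"*) result="$result $opt" ;;
        esac
    done
    echo "${result# }"
}

# The linux-server template allows d,u,r; the nagios-admin contact accepts d,r.
common_options "d,u,r" "d,r"   # prints: d r
```

With the sample values, unreachable (u) notifications are enabled for the host but are not among the options nagios-admin accepts.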

The time period “workhours” is defined as:

define timeperiod{
timeperiod_name workhours
alias           "Normal" Working Hours
monday          09:00-17:00
tuesday         09:00-17:00
wednesday       09:00-17:00
thursday        09:00-17:00
friday          09:00-17:00
}

The time period “24x7” is defined as:

define timeperiod{
timeperiod_name 24x7
alias           24 Hours A Day, 7 Days A Week
sunday          00:00-24:00
monday          00:00-24:00
tuesday         00:00-24:00
wednesday       00:00-24:00
thursday        00:00-24:00
friday          00:00-24:00
saturday        00:00-24:00
}

Command definitions are used to define commands that Nagios will use. They can include macros from resource definition files. The command used in the localhost.cfg file for localhost is defined in /etc/nagios/commands.cfg as:

define command{
command_name    check-host-alive
command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 1
}

$USER1$ is a macro defined in /etc/nagios/private/resource.cfg as a file system path:

$USER1$=/usr/lib64/nagios/plugins
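
To see what the plugin invocation looks like once the macros are filled in, the following sketch imitates the substitution with sed. Nagios performs this expansion internally; the shell variables here merely stand in for the macro values.

```shell
#!/bin/sh
# Sketch only: imitates macro substitution with sed; Nagios does this internally.
USER1=/usr/lib64/nagios/plugins   # value from resource.cfg
HOSTADDRESS=127.0.0.1             # address of the localhost host object

command_line='$USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 1'

# Replace each $MACRO$ with its value ("|" is the sed delimiter because the
# plugin path contains slashes).
expanded=$(printf '%s\n' "$command_line" |
    sed -e 's|\$USER1\$|'"$USER1"'|g' -e 's|\$HOSTADDRESS\$|'"$HOSTADDRESS"'|g')

echo "$expanded"
```

The resulting line can be run by hand on the Nagios host to test the plugin. Here -w and -c set the warning and critical thresholds as a round-trip time in milliseconds and a packet-loss percentage, and -p 1 sends a single packet.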

Once the hosts, contact groups, commands and time periods have been defined, it is time to define services. For the purpose of this introductory article, we will use only the ping service. The service definition from the sample configuration file, listed below, is largely self-explanatory:

define service{
use local-service         ; Name of service template to use
host_name localhost
service_description PING
check_command check_ping!100.0,20%!500.0,60%
}
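
In the check_command value, the exclamation marks separate the command name from its arguments, which Nagios makes available to the command definition as $ARG1$, $ARG2$ and so on. The sketch below only shows that split; reading the two arguments as warning and critical thresholds assumes the check_ping command definition in commands.cfg (not shown in this article) passes them to the plugin's -w and -c options.

```shell
#!/bin/sh
# Sketch only: splits a check_command value on the "!" separator.
check_command='check_ping!100.0,20%!500.0,60%'

old_ifs=$IFS
IFS='!'
set -- $check_command      # split into positional parameters
IFS=$old_ifs

cmd_name=$1                # name of the command object
arg1=$2                    # becomes $ARG1$ (assumed: warning thresholds)
arg2=$3                    # becomes $ARG2$ (assumed: critical thresholds)

echo "command: $cmd_name"
echo "ARG1:    $arg1"
echo "ARG2:    $arg2"
```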

This definition uses a template, local-service, defined as:

define service{
name local-service   ; The name of this service template
use generic-service  ; Inherit default values from the generic-service definition
check_period 24x7    ; The service can be checked at any time of the day
max_check_attempts 4 ; Re-check up to 4 times to determine final (hard) state
normal_check_interval 5 ; Check service every 5 minutes normally
retry_check_interval 1  ; Re-check every minute until a hard state can be determined
contact_groups admins   ; Send notifications to all in the 'admins' group
notification_options w,u,c,r ; Send warning, unknown, critical, and recovery notifications
notification_interval 60     ; Re-notify about problems every hour
notification_period 24x7     ; Notifications can be sent out at any time
register 0                   ; DONT REGISTER THIS TEMPLATE DEFINITION!
}

The local-service template further uses a template, generic-service. For our use-case scenario, please ensure that you comment out all other service definitions in this configuration file.

Therefore, to sum up the various files used, based on the default configuration, our Nagios instance is set up thus:

  • Nagios monitors localhost (IP address 127.0.0.1).
  • The host is checked round the clock (check period 24x7).
  • The host is checked using the command /usr/lib64/nagios/plugins/check_ping.
  • The host definition enables notifications when the host is down, is unreachable or has recovered; the nagios-admin contact itself accepts only down and recovery notifications (host_notification_options d,r).
  • Notifications go to nagios-admin@localhost, but are sent only during workhours, and are resent every two hours if the problem persists.

The CGI configuration file

The CGI configuration file (/etc/nagios/cgi.cfg) configures the CGI scripts and the Web GUI of Nagios. The significant parameters are:

  • main_config_file: The path where the CGI scripts can find the main Nagios configuration file.
  • physical_html_path: The filesystem path for Nagios HTML files.
  • url_html_path: The URL portion appended to the base URL, that will access the Nagios HTML files.
  • refresh_rate: Specifies the refresh rate for various CGIs such as status.
  • use_authentication: Specifies whether the CGI scripts should use authentication.

Once Nagios has been configured, you will need to add an authentication file to be able to access Nagios pages. By default, the Apache configuration directives (specified in /etc/httpd/conf.d/nagios.conf) rely on basic authentication, and allow access only from localhost. The user authentication file /etc/nagios/passwd needs to be created. You can do this using the htpasswd command:

[root@prefect ~]# htpasswd -bc /etc/nagios/passwd nagios-admin admin@123

This creates the nagios-admin user, with the password admin@123 and stores the details in the file /etc/nagios/passwd.

Hopefully, we are now ready to test our base Nagios installation. Restart the Apache service, start the nagios service, and watch the log:

[root@prefect nagios]# /etc/init.d/httpd restart
[root@prefect nagios]# /etc/init.d/nagios start
[root@prefect nagios]# tailf /var/log/nagios/nagios.log
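
Before starting the daemon, it can help to let Nagios check the configuration first, using its -v option. The wrapper below is a minimal sketch: the verify_config function is hypothetical, and the binary name and configuration path are assumptions that may differ on your installation.

```shell
#!/bin/sh
# Sketch only: verify the configuration before (re)starting the daemon.
# verify_config BIN CFG: returns 2 if BIN is not found, otherwise the
# exit status of the configuration check.
verify_config() {
    bin=$1
    cfg=$2
    if ! command -v "$bin" >/dev/null 2>&1; then
        echo "nagios binary not found: $bin" >&2
        return 2
    fi
    "$bin" -v "$cfg"       # -v runs the configuration sanity check
}

# Paths are assumptions for this host; adjust them to your installation.
if verify_config nagios /etc/nagios/nagios.cfg; then
    echo "configuration looks sane; safe to start nagios"
else
    echo "fix the reported problems before starting nagios"
fi
```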

If the Nagios logs are fine, you should now open your browser and connect to http://localhost/nagios/, authenticate as nagios-admin and check the Host summary. The configured host, localhost, should be up.

There is a wealth of information available on Nagios, and the documentation provided along with the installation is also quite good. Go on, build your prefect and manage your data centre.

8 COMMENTS

  1. Hi all,

    I am a regular reader of LFU. I have been searching for a how-to on Nagios and found this useful article on this web page. I would like to thank you all for your contribution. I have a suggestion: the article would be more understandable if you could add some more screenshots along with the how-to. Also, as a learner, I am having trouble understanding the technical terms used in this article; it would be helpful if you gave a brief introduction to the technical terms used in the articles.

    Thanks,

    Karthik.

  2. Nagios is the best option, in a sea of sub-optimal choices. It’s extensible, stable, and has a very simple plugin API. The main problem with Nagios is its dreadful configuration, hence the reason why so many tacked-on DB interfaces were created. It’s not that the configuration is difficult, as it’s actually fairly understandable. The problem is that it’s not easily manageable when scaling over a hundred hosts.

  3. Hi Guys,

    I need a help in nagios notifications setup.

    I have a script that will generate alert for w and critical based on a condition.

    What I want is that “Nagios should send out an alert for warning only if the condition i.e. written in the script exists for more than an 1hr then. No change is required for critical.”

    Please help me on this :(
