Chapter 7
Alarm Management
The Sun Management Center software monitors your hardware and software. When abnormal conditions occur, Sun Management Center notifies you through alarms. These alarms are triggered by conditions that fall outside predetermined ranges, or by Sun Management Center rules. Default alarm conditions and rules are included in the modules. You can also set up your own alarm thresholds.
In Sun Management Center, an alarm rule performs one or more alarm checks. Each alarm check evaluates alarm criteria to determine if the managed property is in a corresponding alarm state. Actions can also be triggered by the alarm states; these actions are known as status actions.
Thus, each alarm rule is associated with a number of alarm criteria, alarm states, and optional status actions. If none of the alarm criteria are satisfied, the node is considered to be in the ok state, and hence nodes without an alarm rule are always considered okay.
Typically, the alarm rule is evaluated after the refresh operation completes (the refresh request and the subsequent data cascade). Some alarm rules are instead triggered whenever a particular error message appears in a log file. These are referred to as log rules.
Each managed property can be assigned a single associated alarm rule. This assignment is done in the model file. A generic rule, called rCompare, is provided. This rule performs numeric comparisons, regular expression checks, or string comparisons. The exact checks that are to be performed are controlled by alarm check and alarm limit parameters (as described later in this chapter).
For standard alarm types (HI, LO, and so on), the rCompare rule is used by default. If a managed property is not assigned an alarm rule, no alarm checking is performed on that managed property.
Custom rules can employ a wide variety of alarm criteria. They can examine the value of the node to which the rule is attached, or the value or status of a different node. A special category of rules, referred to as log rules, can be triggered to fire whenever a message matching a specified regular expression appears in a log file.
For more information, refer to Chapter 8.
The alarm file for the Solaris example is shown below. Note that the tree structure specified in the agent file is used in this file when specifying alarm information for the managed properties.
Alarm severities are specified to reflect the relative significance of the various managed properties. Alarm limits are also specified when appropriate.
Alarm limits are specified for a number of file systems typically found on many systems.
Managing Alarms Using rCompare
1. Create the rule in the models file.
This step is explained in the section "Using the rCompare Rule in the Models File."
2. Create the alarm definition file.
This step is explained in the section "Creating the Alarm File."
3. Specify alarm limits and other alarm criteria.
This step is explained in the section "Specifying the Alarm Criteria."
4. Specify actions to be performed based on the alarm state.
This step is explained in the section "Specifying Status Actions."
In this step, data and alarm type primitives are added to the managed properties defined in the data model structure created previously. For instance, CPU managed properties such as idle, busy, system, user, and average can be represented by the PERCENT data type.
For the properties that represent CPU usage levels (busy, system, user, and average), it is also prudent to perform a high alarm check to detect instances when these properties exceed specified limits. Thus, these properties use the PERCENTHI primitive. Conversely, the idle property reflects the percentage of time the CPU is not in use; as a result, a PERCENTLO primitive can be used to detect times when idle time drops below specified limits, that is, when CPU usage is high.
Similar reasoning is exercised when assigning data and alarm type primitives to the other managed properties. Also, to illustrate the use of rules, the rUsrChk rule (defined in the solaris-example-d.rul file) is attached to the consoleUser object. The resulting data model structure with data and alarm type primitives is:
CODE EXAMPLE 7-2 Solaris Example--Intermediate Data Model

    cpu = { [ use MANAGED-OBJECT ]
        idle = { [ use PERCENTLO MANAGED-PROPERTY ] }
        busy = { [ use PERCENTHI MANAGED-PROPERTY ] }
        system = { [ use PERCENTHI MANAGED-PROPERTY ] }
        user = { [ use PERCENTHI MANAGED-PROPERTY ] }
        average = { [ use PERCENTHI MANAGED-PROPERTY ] }
    }
    system = { [ use MANAGED-OBJECT ]
        userstats = { [ use MANAGED-PROPERTY-CLASS ]
            numUsers = { [ use INTHI MANAGED-PROPERTY ] }
            numSessions = { [ use INTHI MANAGED-PROPERTY ] }
            primaryUser = { [ use STRING MANAGED-PROPERTY ] }
        }
        load = { [ use MANAGED-PROPERTY-CLASS ]
            one = { [ use FLOATHI MANAGED-PROPERTY ] }
            five = { [ use FLOATHI MANAGED-PROPERTY ] }
            fifteen = { [ use FLOATHI MANAGED-PROPERTY ] }
        }
    }
    filesystems = { [ use MANAGED-OBJECT ]
        fileTable = { [ use MANAGED-OBJECT-TABLE ]
            fileEntry = { [ use MANAGED-OBJECT-TABLE-ENTRY ]
                index = mount
                mount = { [ use STRING MANAGED-PROPERTY ] }
                size = { [ use INT MANAGED-PROPERTY ] }
                avail = { [ use INTLO MANAGED-PROPERTY ] }
                pctUsed = { [ use PERCENTHI MANAGED-PROPERTY ] }
            }
        }
    }
Data type primitives can be optionally combined with an alarm type that characterizes the alarm checks performed on the property's data value.
These data and alarm primitives have the following form:
<data type>[<alarm type>]
where
- <data type> represents the type of data stored in the primitive
- <alarm type> optionally specifies the type of alarm checks to perform
The alarm type specification is optional and must be combined with a valid data type. The alarm type defines the alarm check to perform on the data value. The alarm check criteria are specified in Chapter 8. If the alarm type is not specified, no alarm checks are performed on the affected managed property.
The possible values for the alarm type are:
- HI: checks if the data value is greater than the specified alarm limits. This alarm type can only be combined with the INT, FLOAT, PERCENT, and COUNTER data types. It cannot be used in combination with the STRING data type.
- LO: checks if the data value is less than the specified alarm limits. This alarm type can only be combined with the INT, FLOAT, PERCENT, and COUNTER data types. It cannot be used in combination with the STRING data type.
- HILO: checks if the data value is less than or greater than the specified alarm limits. This alarm type can only be combined with the INT, FLOAT, PERCENT, and COUNTER data types. It cannot be used in combination with the STRING data type.
- EQ: checks if the data value is equal to the specified alarm criteria. This alarm type can be combined with the INT, FLOAT, PERCENT, COUNTER, and STRING data types.
- NE: checks if the data value is not equal to the specified alarm criteria. This alarm type can be combined with the INT, FLOAT, PERCENT, COUNTER, and STRING data types.
- REGEXP: checks if the data value matches the regular expression alarm criteria. This alarm type can only be used in combination with the STRING data type.
- RULE: executes a rule check procedure. This alarm type can be used in conjunction with any data type. Optionally, the data type can be left blank if the node exists only to support a rule and has no associated data, as discussed in the Rules chapter.
Examples of data and alarm type primitives are:
- INT: a general integer type with no alarm checking
- FLOATHI: a floating point value that is checked to see if the data value is greater than the specified alarm limits
- STRINGREGEXP: a string type with alarm checks using regular expression patterns
- INTRULE: an integer value to which a rule check is applied
- RULE: contains no value (empty string), but a rule is to be executed for the node
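The naming convention above can be checked mechanically. The following Python sketch is illustrative only: `split_primitive` and the constant names are hypothetical helpers, not part of Sun Management Center. It splits a primitive name into its data type and alarm type and applies the combination restrictions listed above.

```python
# Hypothetical helper for illustration; not a Sun Management Center API.
DATA_TYPES = ("INT", "FLOAT", "PERCENT", "COUNTER", "STRING")
NUMERIC_ONLY = {"HI", "LO", "HILO"}
ALARM_TYPES = NUMERIC_ONLY | {"EQ", "NE", "REGEXP", "RULE"}


def split_primitive(name):
    """Split a primitive such as 'FLOATHI' into (data type, alarm type)."""
    if name == "RULE":
        # A bare RULE primitive has no data type.
        return ("", "RULE")
    for data_type in DATA_TYPES:
        if name.startswith(data_type):
            alarm_type = name[len(data_type):]
            break
    else:
        raise ValueError(f"unknown data type in {name!r}")
    if alarm_type and alarm_type not in ALARM_TYPES:
        raise ValueError(f"unknown alarm type in {name!r}")
    if alarm_type in NUMERIC_ONLY and data_type == "STRING":
        raise ValueError(f"{alarm_type} cannot be combined with STRING")
    if alarm_type == "REGEXP" and data_type != "STRING":
        raise ValueError("REGEXP can only be combined with STRING")
    return (data_type, alarm_type)


print(split_primitive("FLOATHI"))       # data type FLOAT, alarm type HI
print(split_primitive("STRINGREGEXP"))  # data type STRING, alarm type REGEXP
```

An empty alarm type in the result corresponds to a primitive with no alarm checking, such as INT.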
The model realization file must contain the following line to load the alarm file:
[ load <module><-subspec>-d.def ]
The alarm file defines information used by alarm checks performed on managed properties. This file is loaded by the agent file with the following line:
[ load <module><-subspec>-d.def ]
The contents of the alarm file are overlaid on top of the MIB object tree defined in the agent file.
In general, alarm management information is likely to be modified by site administrators, whereas the object hierarchies and DAQ mechanisms specified in the agent file are not typically modified. Therefore, alarm management information is defined in a separate file from the agent file to facilitate the specification of site-specific alarm information defaults.
The file name format is:
<module><-subspec>-d.def
For example:
solaris-example-d.def
Note - The alarm file (*.def) is sometimes interchangeably called the default file.
The alarm file, which is in the module configuration file format, mimics the tree structure specified in its corresponding agent file and contains entries only for those nodes with alarm specifications.
If the alarm file is left empty, the refresh operations are still executed and managed property data is acquired, but the managed properties never go into alarm.
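The overlay of alarm file entries onto the agent file tree can be pictured as a recursive merge of nested mappings. The following Python sketch illustrates the idea only. The `overlay` function, the dictionary representation, and the `type` key are assumptions made for the example and do not reflect how the agent stores its MIB tree.

```python
def overlay(agent_tree, alarm_tree):
    """Return a copy of agent_tree with the alarm_tree entries merged in."""
    merged = dict(agent_tree)
    for key, value in alarm_tree.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Same node in both files: descend and merge its children.
            merged[key] = overlay(merged[key], value)
        else:
            # Qualifier present only in the alarm file: add it to the node.
            merged[key] = value
    return merged


# Hypothetical, simplified stand-ins for the agent file and alarm file trees.
agent = {"cpu": {"busy": {"type": "PERCENTHI"}, "idle": {"type": "PERCENTLO"}}}
alarms = {"cpu": {"busy": {"alarmlimit:error-gt": 95}}}

merged = overlay(agent, alarms)
print(merged["cpu"]["busy"])
print(merged["cpu"]["idle"])
```

Nodes that have no entry in the alarm tree, such as idle here, pass through unchanged. This matches the statement that an empty alarm file leaves data acquisition intact.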
The alarm file can specify the following alarm management related qualifiers for any of the managed property nodes in the agent file tree structure:
    alarmChecks = <alarm check1> <alarm check2> ...
    alarmlimit:<alarm check> = <alarm limit>
    alarmSeverity = <integer [0-9]>
    alarmWindow = <alarm window timex specification>
    statusActions(<alarm event>) = <action1> <action2> ...
    statusService(<action1>) = <service>
    statusCommand(<action1>) = <command>
    statusService(<action2>) = <service>
    statusCommand(<action2>) = <command>
where:
- <alarm check> is a specification of an alarm check in the format <alarm state>-<alarm test>. The sections "Alarm Checks" and "Specifying Alarm Limits" describe this specification.
- <alarm limit> specifies the threshold criterion for this check. The sections "Alarm Checks" and "Specifying Alarm Limits" describe this specification.
- <actionN> specifies a logical name of an action to be executed.
- <service> specifies an execution context for the command to be run.
- <command> specifies a command to execute.
The alarm checks that are performed when using the rCompare rule (with the standard alarm types) are specified by the alarmChecks qualifier. The alarmChecks qualifier does not typically have to be specified in the alarm file, since every alarm type already defines an appropriate set of default alarm checks.
In general, the alarm checks specified by the alarm type primitives, which are used by managed properties, are adequate. However, if a different set of alarm checks must be specified for a managed property, the alarmChecks qualifier can be used to override the default alarm checks specification.
alarmChecks = <alarm check> [<alarm check2> ... ]
where
- <alarm check> is an alarm check specification of the form: <alarm state>-<alarm test>
Possible alarm states are info, warning, or error.
The following sections describe the possible alarm tests and alarm checks.
The standard alarm types and their default alarm checks are:
HI alarmChecks = error-gt warning-gt info-gt
LO alarmChecks = error-lt warning-lt info-lt
HILO alarmChecks = error-gt error-lt warning-gt warning-lt info-gt info-lt
EQ alarmChecks = error-eq warning-eq info-eq
NE alarmChecks = error-ne warning-ne info-ne
REGEXP alarmChecks = error-rx warning-rx info-rx
Alarm checks are performed in the order they are listed. Thus, alarm checks should be listed from highest to lowest alarm severity.
For example, the HI alarm type primitive defines the following alarm checks:
alarmChecks = error-gt warning-gt info-gt
This indicates that three alarm checks may be performed. The first check, error-gt, tests whether the data value is greater than the corresponding alarm limit. If the test is positive, the managed property is given the error alarm state and no more alarm checks are performed.
If the first check is negative, the second check, warning-gt, tests whether the data value is greater than the corresponding alarm limit. If the test is positive, the managed property is given the warning alarm state and no further alarm checks are performed.
If the second check is negative, the last check, info-gt, tests whether the data value is greater than the corresponding alarm limit. If the test is positive, the managed property is given the info alarm state. Otherwise, the managed property is given the ok state.
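The ordered evaluation described above can be sketched in Python as follows. This illustrates the logic only and is not the rCompare implementation. The `evaluate` function and the `TESTS` table are hypothetical names, a missing limit is assumed to mean that the check is skipped, and the use of `re.search` for the rx test is also an assumption.

```python
import operator
import re

# Map each alarm test name to a comparison of (data value, alarm limit).
TESTS = {
    "gt": operator.gt,
    "lt": operator.lt,
    "eq": operator.eq,
    "ne": operator.ne,
    "rx": lambda value, pattern: re.search(pattern, value) is not None,
}


def evaluate(value, alarm_checks, limits):
    """Return the state of the first satisfied check, or 'ok' if none match."""
    for check in alarm_checks:           # checks run in the order listed
        state, test = check.split("-")   # for example "error-gt"
        limit = limits.get(check)
        if limit is None:
            continue                     # assumption: no limit, check skipped
        if TESTS[test](value, limit):
            return state                 # first positive test ends evaluation
    return "ok"


hi_checks = ["error-gt", "warning-gt", "info-gt"]
limits = {"error-gt": 95, "warning-gt": 90}
print(evaluate(97, hi_checks, limits))
print(evaluate(92, hi_checks, limits))
print(evaluate(50, hi_checks, limits))
```

Because the first positive test ends the evaluation, listing warning-gt before error-gt would prevent the error state from ever being reached for values above both limits. This is why checks are listed from highest to lowest severity.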
An alarm limit can be specified for each alarm check defined for the managed property. Alarm limits can be specified for managed properties whose data values are either scalars or vectors.
If no alarm limits are specified for a managed property, then the alarm checks are not performed for that managed property.
Scalars
Alarm limits are specified for scalars as follows:
alarmlimit:<alarm check> = <alarm limit or criteria>
where:
- <alarm check> is an alarm check specification of the form:
<alarm state>-<alarm test>
Solaris Example--Scalar Alarm Limit
The cpu.busy managed property is assigned the PERCENTHI data and alarm type. Since it does not override the alarm checks, it uses the default alarm checks specified by the HI alarm type.
    cpu = {
        busy = {
            alarmlimit:error-gt = 95
            alarmlimit:warning-gt = 90
            alarmlimit:info-gt =
        }
    }
Vectors
Alarm limits can also be specified for a table of managed properties whose data values are vectors. An alarm limit can be specified for each vector element by qualifying the alarm limit with the rowname. The rowname is used as an index to identify rows in the table. The managed property designated to be the rowname is specified in the Model file using the index qualifier.
In addition, default alarm limits can be specified for vector elements that do not have explicitly defined alarm limits.
Alarm limits are specified for vectors as follows:
    alarmlimit:<alarm check>() = <default alarm limit>
    alarmlimit:<alarm check>(<rowname>) = <alarm limit for row element>
where:
- <alarm check> is an alarm check specification of the form: <alarm state>-<alarm test>.
- <rowname> is the data value used to identify the row in the table. The column is specified by the index qualifier.
Solaris Example--Vector Alarm Limit
The following example demonstrates the specification of alarm limits for the avail managed property in the filesystems table of the Solaris Example module. The avail managed property is the amount of available disk space.
Default error, warning, and info alarm limits for the amount of available disk space in a file system are specified by the alarm limit entries qualified by (). These default alarm limits are applied to file systems that do not have explicitly set alarm limits.
The mount managed property was designated as index; hence, its values are used as the rowname. Thus, the mount name of file systems is used to reference specific rows.
Alarm limits for the amount of available disk space for the /usr filesystem are specified by the alarm limits entries qualified by (/usr).
    filesystems = {
        fileTable = {
            fileEntry = {
                avail = {
                    alarmlimit:error-lt() = 7000
                    alarmlimit:warning-lt() = 12000
                    alarmlimit:info-lt() =
                    alarmlimit:error-lt(/usr) = 5000
                    alarmlimit:warning-lt(/usr) = 10000
                    alarmlimit:info-lt(/usr) =
                }
            }
        }
    }
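The fallback from a row-specific limit to the default limit can be sketched as a two-step lookup. The Python below is illustrative only: `limit_for_row` and the dictionary layout are hypothetical, with the empty string standing in for the () default entry and a missing key standing in for a limit left blank.

```python
# Limits from the example above, keyed by alarm check and then by rowname.
# The empty-string key is a stand-in for the () default entry.
AVAIL_LIMITS = {
    "error-lt": {"": 7000, "/usr": 5000},
    "warning-lt": {"": 12000, "/usr": 10000},
}


def limit_for_row(limits, check, rowname):
    """Return the limit for a row, falling back to the default entry."""
    row_limits = limits.get(check, {})
    if rowname in row_limits:
        return row_limits[rowname]   # explicit limit for this row
    return row_limits.get("")        # default limit, or None if unset


print(limit_for_row(AVAIL_LIMITS, "error-lt", "/usr"))
print(limit_for_row(AVAIL_LIMITS, "error-lt", "/var"))
```

Here /usr resolves to its own limit, while /var, which has no explicit entry, falls back to the default.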
The alarm severity provides additional granularity for the ranking of alarms within each alarm state. You can prioritize alarms associated with specific managed properties relative to the alarms of other managed properties within the same alarm state by setting the alarmSeverity qualifier.
alarmSeverity = <integer>
The alarmSeverity can be set to an integer value ranging from 0 to 9. The greater the number, the higher the alarm rank. TABLE 7-1 lists the default alarm severities.
TABLE 7-1 Alarm Severities

    Alarm State    State Value    Default Severity
    OK             0              0
    OFF            0              1
    DIS            0              1
    INF            0              5
    WRN            1              5
    ERR            2              5
    IRR            2              7
    DWN            2              9
Solaris Example--CPU Alarm Severity
The cpu average managed property is assigned a higher alarm severity than the cpu busy managed property since the average CPU is of greater significance than instantaneous CPU measurements.
Thus, if both the busy and average managed properties go into the error state, the average alarm would be ranked higher.
If the busy goes into error while the average property goes into warning, the busy alarm would be ranked higher.
The alarm information for the cpu busy and average managed properties, including their relative severities, is:
    cpu = {
        busy = {
            alarmSeverity = 3
            alarmlimit:error-gt = 95
            alarmlimit:warning-gt = 90
            alarmlimit:info-gt =
        }
        average = {
            alarmSeverity = 7
            alarmlimit:error-gt = 95
            alarmlimit:warning-gt = 90
            alarmlimit:info-gt =
        }
    }
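The ranking behavior described for the busy and average properties can be sketched as a sort on two keys. The Python below is illustrative only. `rank_alarms` is a hypothetical helper, and using the state values from TABLE 7-1 as the primary key, with the severity as the tie-breaker, is an assumption about how the two combine.

```python
# State values taken from TABLE 7-1.
STATE_VALUE = {
    "OK": 0, "OFF": 0, "DIS": 0, "INF": 0,
    "WRN": 1, "ERR": 2, "IRR": 2, "DWN": 2,
}


def rank_alarms(alarms):
    """Sort (property, state, severity) tuples from highest to lowest rank."""
    return sorted(
        alarms,
        key=lambda alarm: (STATE_VALUE[alarm[1]], alarm[2]),
        reverse=True,
    )


# Both properties in error: the higher severity (average) ranks first.
both_error = rank_alarms([("cpu.busy", "ERR", 3), ("cpu.average", "ERR", 7)])
print([name for name, _, _ in both_error])

# busy in error, average in warning: the error state outranks the warning.
mixed = rank_alarms([("cpu.average", "WRN", 7), ("cpu.busy", "ERR", 3)])
print([name for name, _, _ in mixed])
```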
Alarm checking can be set to be active only at particular times for a managed property by specifying the alarmWindow qualifier.
Scalars
alarmWindow = < time specification >
Vectors
    alarmWindow() = <time specification>
    alarmWindow(<rowname>) = <time specification>
where
<rowname> is the data value used to identify the row in the table. The column is specified by the index qualifier.
The alarmWindow qualifier can be set to any valid time specification, and defines the time window during which alarm checking is performed. At times outside the window, the alarm checks are not executed at all. If no alarm window is specified, alarm checks are always performed.
Solaris Example--CPU Alarm Window
In the following example, the cpu busy time alarm window is set so that alarms can only be generated between 1:00 in the afternoon and midnight. At other times, alarm checks are not performed.
    cpu = {
        busy = {
            alarmWindow = time>13:00
            alarmlimit:error-gt = 95
            alarmlimit:warning-gt = 90
            alarmlimit:info-gt =
        }
    }
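The gating effect of an alarm window can be sketched as a guard that runs before any alarm check. The Python below is a rough illustration only. It handles just the single time>HH:MM form used in the example above and makes no attempt to cover the full time specification syntax. The names `parse_after` and `checks_enabled` are hypothetical.

```python
from datetime import time


def parse_after(spec):
    """Parse a 'time>HH:MM' window (the only form handled in this sketch)."""
    prefix = "time>"
    if not spec.startswith(prefix):
        raise ValueError(f"unsupported window specification: {spec!r}")
    hours, minutes = spec[len(prefix):].split(":")
    return time(int(hours), int(minutes))


def checks_enabled(spec, now):
    """Return True if alarm checks should run at the wall-clock time 'now'."""
    if spec is None:
        return True                  # no window: checks are always performed
    return now > parse_after(spec)   # inside the window until midnight


print(checks_enabled("time>13:00", time(14, 30)))
print(checks_enabled("time>13:00", time(9, 0)))
```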
When the alarm state of a managed property changes, an alarm event is generated. These alarm events can trigger proactive or remedial actions based on the detected alarm condition. These actions are referred to as status actions and can be specified in the alarm file:

    statusActions(<alarm event>) = <action1> <action2> ...
    statusService(<action1>) = <service>
    statusCommand(<action1>) = <command>
    statusService(<action2>) = <service>
    statusCommand(<action2>) = <command>
where:
- <actionN> specifies a logical name of an action to be executed.
- <service> specifies an execution context for the command to be run.
- <command> specifies a command to execute.
An event is generated for a managed property whenever the alarm state of the managed property changes. Possible <alarm event> values include:
init when tree is initialized
change on any alarm state change
ok when alarm state goes to the ok state
irr when alarm state goes to the irrational state; this is typically caused by a data acquisition error
down-eq when an entire module changes to the down state (such as a database being unavailable)
off-eq when an entire module turns off as scheduled (through the module active time window qualifier)
disabled-eq when an entire module is disabled manually by an end user at the console
info-<alarm test> when the info alarm check is satisfied
warning-<alarm test> when warning alarm check is satisfied
error-<alarm test> when error alarm check is satisfied
The <alarm test> can be lt, gt, eq, ne, or rx, depending on the alarm type primitive used by the managed property.
In the following example, the "hello" message is sent to the console when the cpu busy time goes into the error state.

    cpu = {
        busy = {
            statusActions(error-gt) = sayhello
            statusService(sayhello) = _services.sh
            statusCommand(sayhello) = echo "hello" > /dev/console
            alarmlimit:error-gt = 95
            alarmlimit:warning-gt =
            alarmlimit:info-gt =
        }
    }
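The relationship between alarm events, action names, services, and commands can be sketched as a pair of lookups. The Python below illustrates that mapping only and is not the agent's dispatch code. `events_for` and `commands_for` are hypothetical names. The sketch assumes that a state change raises the general change event together with one more specific event, and it uses error-gt, the default error check for a HI alarm type.

```python
def events_for(old_state, new_state, satisfied_check=None):
    """Return the alarm event names assumed to fire for a state transition."""
    if old_state == new_state:
        return []                        # no state change, no alarm event
    events = ["change"]                  # fires on any alarm state change
    if new_state == "ok":
        events.append("ok")
    elif satisfied_check is not None:
        events.append(satisfied_check)   # for example "error-gt"
    return events


def commands_for(node, events):
    """Resolve events to (service, command) pairs using a node's qualifiers."""
    commands = []
    for event in events:
        for action in node.get("statusActions", {}).get(event, []):
            commands.append(
                (node["statusService"][action], node["statusCommand"][action])
            )
    return commands


# Hypothetical in-memory form of the qualifiers on the cpu busy node.
busy = {
    "statusActions": {"error-gt": ["sayhello"]},
    "statusService": {"sayhello": "_services.sh"},
    "statusCommand": {"sayhello": 'echo "hello" > /dev/console'},
}

print(commands_for(busy, events_for("ok", "error", "error-gt")))
print(commands_for(busy, events_for("error", "ok")))
```

Only the transition into the error state resolves to a command here, because the node defines no actions for the change or ok events.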