HP

HP Management Agents for Servers (Server component)

Logs

» Integrated Management Log
» Critical Errors
» Correctable Errors
» Power On Messages
» Remote Insight Log

Integrated Management Log

The Integrated Management Log records system events, critical errors, power-on message errors, and memory errors. It also records catastrophic hardware and software errors that typically cause a system to fail. This information helps you quickly identify and correct problems, minimizing downtime.

Each event log entry has a status to identify the severity of the event (OID: 1.3.6.1.4.1.232.6.2.11.3.1.2 - cpqHeEventLogEntrySeverity):

  • Informational - General information about a system event.

  • Repaired - The entry has been marked as repaired. Entries are not marked automatically; a user must mark them as repaired.

  • Caution - A non-fatal error condition has occurred.

  • Critical - A component of the system has failed.

If any entry in the log has a severity of Caution, the overall log condition is marked as degraded. If any Critical entries exist in the log, the overall log condition is marked as failed.
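The severity-to-condition rule above can be sketched in a few lines. This is an illustration only: the Severity enum and log_condition function are hypothetical names, the numeric severity values defined in the MIB are deliberately not used, and the "ok" label for a log with no Caution or Critical entries is an assumption (the text names only degraded and failed).

```python
from enum import Enum


class Severity(Enum):
    # Severity names follow the list above; the MIB's numeric values are not modeled.
    INFORMATIONAL = "informational"
    REPAIRED = "repaired"
    CAUTION = "caution"
    CRITICAL = "critical"


def log_condition(severities):
    """Derive the overall Integrated Management Log condition from entry severities."""
    severities = list(severities)
    if Severity.CRITICAL in severities:
        return "failed"    # any Critical entry marks the whole log as failed
    if Severity.CAUTION in severities:
        return "degraded"  # otherwise any Caution entry degrades the log
    return "ok"            # hypothetical label: only Informational or Repaired entries remain
```

Marking a Caution entry as repaired removes it from the degraded check, which is why the log clears once every offending entry has been marked.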

To clear a degraded or failed event log, first repair the condition that caused each log entry, and then mark the entry as repaired:

  1. Highlight the log entries in the Integrated Management Log.

  2. Click the [Mark Repaired] button at the bottom of the Integrated Management Log section of the web page.

SNMP sets must be enabled on the agents, and you must have the correct SNMP community string, to mark log entries as repaired.

You must enter the Monitor and Control community strings for this device. The HP Insight Management Agent and HP Systems Insight Manager will use these community strings to communicate with the OS SNMP service. If you elect not to create a Control community string, it will not be possible to perform certain operations such as clearing the integrated management log or changing agent configuration settings.

The description column gives a brief description of the error or event (OID: 1.3.6.1.4.1.232.6.2.11.3.1.8 - cpqHeEventLogErrorDesc). The update time column contains the time the entry was last updated (OID: 1.3.6.1.4.1.232.6.2.11.3.1.7 - cpqHeEventLogUpdateTime); the time the entry was first logged is available as OID 1.3.6.1.4.1.232.6.2.11.3.1.6 - cpqHeEventLogInitialTime. The status column contains the severity of the log entry (OID: 1.3.6.1.4.1.232.6.2.11.3.1.2 - cpqHeEventLogEntrySeverity).
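Each column OID listed here names a table column; individual log entries are column instances. A minimal sketch, assuming numeric walk results of the form <column OID>.<entry index> (a single integer index), shows how (oid, value) pairs from any SNMP walk could be grouped back into the rows shown on the page. The group_rows function and the field names are hypothetical, and no particular SNMP library is assumed.

```python
# Column OIDs of the Integrated Management Log entry table, as listed in this section.
IML_ENTRY = "1.3.6.1.4.1.232.6.2.11.3.1"
COLUMNS = {
    f"{IML_ENTRY}.2": "severity",      # cpqHeEventLogEntrySeverity
    f"{IML_ENTRY}.6": "initial_time",  # cpqHeEventLogInitialTime
    f"{IML_ENTRY}.7": "update_time",   # cpqHeEventLogUpdateTime
    f"{IML_ENTRY}.8": "description",   # cpqHeEventLogErrorDesc
}


def group_rows(varbinds):
    """Group (oid, value) pairs from an SNMP walk into one dict per log entry.

    Assumes numeric OIDs of the form <column OID>.<entry index>, with an
    optional leading dot as printed by some command-line tools.
    """
    rows = {}
    for oid, value in varbinds:
        column, _, index = oid.lstrip(".").rpartition(".")
        name = COLUMNS.get(column)
        if name is None:
            continue  # a column this page does not display; skip it
        rows.setdefault(int(index), {})[name] = value
    return rows
```

Values are kept exactly as returned by the walk; decoding the time columns is left out because their encoding is not described here.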

Refer to the Integrated Management Log User Guide for more information.

Critical Errors

The Critical Error Log records non-correctable memory errors, as well as catastrophic hardware and software errors that cause a system to fail. This information helps you quickly identify and correct the problem, minimizing downtime.

This section lists critical errors. Each entry shows the date and time of the error followed by a brief description. The time shown is rounded to the nearest hour.

If any critical error is marked with an exclamation point (!), indicating that corrective action is required, the log condition is degraded. To remove the exclamation point and indicate that an entry has been corrected, select the entries you want to clear and click the [Correct Marked Entries] button, or run Diagnostics on the device. An asterisk (*) marks the log entry to which the Last Failure Message applies.

SNMP sets must be enabled on the agents, and you must have the correct SNMP community string, to mark entries as corrected.
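Two behaviors of the Critical Error Log described above, hour rounding and the degraded condition caused by entries still flagged with an exclamation point, are sketched below. The function names and the (needs_correction, description) entry shape are hypothetical; rounding exactly half past the hour upward and the "ok" label are assumptions, since the text does not specify them.

```python
from datetime import datetime, timedelta


def round_to_hour(ts):
    """Round a timestamp to the nearest hour; exactly half past rounds up (assumed)."""
    base = ts.replace(minute=0, second=0, microsecond=0)
    return base + timedelta(hours=1) if ts - base >= timedelta(minutes=30) else base


def critical_log_condition(entries):
    """entries: iterable of (needs_correction, description) tuples.

    An entry flagged with '!' still needs corrective action; any such entry
    leaves the Critical Error Log degraded. "ok" is a hypothetical label.
    """
    return "degraded" if any(flag for flag, _ in entries) else "ok"
```

Using timedelta arithmetic lets rounding roll correctly past midnight, for example 23:45 becomes 00:00 on the following day.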

The following errors may be logged. If you receive any of them, run Diagnostics on your system or consult your software documentation.

  • Abnormal Program Termination - The device has detected a fatal software error resulting in a device failure.

  • ASR Base Memory Parity Error - The system detected a data error in base memory following a reset due to an ASR timeout.

  • ASR Extended Memory Parity Error - The system detected a data error in extended memory following a reset due to an ASR timeout.

  • ASR Memory Parity Error - The system ROM was unable to allocate enough memory to create a stack. It was unable to put a message on the screen or continue booting the server.

  • ASR Reset Limit Reached - The maximum number of system resets has been reached. The HP System Configuration Utility will be loaded.

  • ASR Reset Occurred - No error data is logged.

  • ASR Test Event - An ASR Test event was generated by the user through the system utilities. No action is required since the event was user-generated to test the ASR configuration.

  • ASR Timeout NMI - The server has generated an ASR NMI because the ASR timer has not been refreshed. This generally indicates that a driver has not relinquished control of the processor, causing a server failure. The resulting ASR NMI was generated to log this event. Note which module was executing.

  • CPU Internal Corrected Error Threshold Exceeded - The system has detected that a CPU has exceeded the threshold for the number of internal ECC cache errors.

  • CPU Processor Power Module Failed - The system has detected that a processor's power module has failed.

  • Critical Temperature - The system's critical temperature has been exceeded and auto shutdown has been initiated.

  • Error Detected On Bootup - The system detected an error during the Power-On Self-Test.

  • Exception - The processor has detected a critical exception resulting in a device failure.

  • Fan Failure - The system or processor fan failed.

  • NMI - CPU Local Error - The processor experienced a fatal error resulting in a device failure.

  • NMI - Expansion Board Error - A board on the expansion bus indicated an error condition causing a device failure.

  • NMI - Expansion Bus Arbitration Error - Memory refresh cycles were delayed, potentially leading to data loss. The error results in a system failure.

  • NMI - Expansion Bus Master Time-out - A bus master expansion board in the indicated slot did not release the bus within its maximum allowed time, resulting in a device failure.

  • NMI - Expansion Bus Slave Time-out - A board on the expansion bus delayed a bus cycle beyond the maximum time, resulting in a device failure.

  • NMI - Failsafe Timer Expiration - The software was unable to reset the system failsafe timer, resulting in a system failure.

  • NMI - Processor Address Error 1 - A processor internal address parity checking error occurred, resulting in a device failure.

  • NMI - Processor Address Error 2 - The processor detected an address parity error during an inquire cycle.

  • NMI - Processor Cache Parity Error - A data error occurred within the processor cache, resulting in a system failure.

  • NMI - Processor Internal Error 1 - A processor internal parity error occurred, resulting in a device failure.

  • NMI - Processor Internal Error 2 - The processor detected an internal parity error or a functional redundancy error.

  • NMI - Processor Parity Error - The processor detected a data error, resulting in a device failure.

  • NMI - Software Generated Interrupt - Software indicated a system error resulting in a system failure.

  • NMI - System Concurrency Error - A potential error condition was detected within the Data Flow Manager, resulting in a system failure.

  • NMI - Uncorrectable Memory Error - The device experienced an uncorrectable memory parity error, resulting in a device failure.

  • NMI - Unknown Error Type - The device driver does not recognize this NMI. You may need to upgrade your health driver.

  • Processor Failure - The processor failed during the Power-On Self-Test.

  • Server Manager Failure - An error occurred in the server interface with the Server Manager/R.

  • UPS A/C Line Failure/Shutdown or Battery Low - The device has initiated a UPS or operating system shutdown, or the battery is almost depleted after an AC line failure.

The Last Failure Message in this window shows the last failure message associated with a critical error.

Correctable Errors

This alarm indicates that a block of memory has failed or is failing and may need to be replaced. The condition is generally not critical, because the memory controller can correct the errors and the system continues to correct any errors it can. However, it does indicate a failing or failed memory component in the system that issued the alarm.

The ECC memory subsystem corrects memory errors as they occur. If you notice an increase in these errors, correct the problem as soon as possible: the memory components may degrade further until the errors are no longer correctable.
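"An increase in these errors" can be made concrete by looking at the rate of new errors rather than the running total. The sketch below is hypothetical: it assumes you record a cumulative correctable-error count at regular intervals (this page defines no such sampling), and it treats a strictly rising number of new errors over the last few intervals as a warning sign. The window size is an arbitrary illustrative choice.

```python
def error_rate_rising(cumulative_counts, window=3):
    """Return True if the number of new correctable errors per interval
    strictly increased across the last `window` intervals.

    cumulative_counts: running totals sampled at regular intervals (hypothetical input).
    """
    # New errors observed in each interval between consecutive samples.
    deltas = [later - earlier for earlier, later in zip(cumulative_counts, cumulative_counts[1:])]
    recent = deltas[-window:]
    if len(recent) < window:
        return False  # not enough samples to judge a trend
    return all(a < b for a, b in zip(recent, recent[1:]))
```

A steady nonzero rate does not trigger this check, so it should complement, not replace, attention to the alarm itself.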

Power On Messages

This section displays the Power-On messages logged when the device was turned on. Refer to your device documentation for a listing of possible Power-On error messages and their meanings. Click the [Clear Power-On Message] button to clear the power-on message log. This button is available only when there are messages to clear.

Remote Insight Log

The Event Log section displays the list of events stored in the Remote Insight Board event log. A user with the appropriate authority can clear these events. Each event includes the following information:

  • Index displays a numeric index for each event.

  • Time of Event displays the time the event occurred.

    OID: 1.3.6.1.4.1.232.9.2.3.2.1.3 - cpqSm2EventLogDate

  • Description displays a text description of the event.

    OID: 1.3.6.1.4.1.232.9.2.3.2.1.4 - cpqSm2EventLogMessage
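To fetch the time and description of one Remote Insight event with an SNMP GET, the column OIDs above need an instance suffix. This sketch assumes the event log table is indexed by a single integer event index (not confirmed by this page); the event_oids function is a hypothetical helper that only builds the OID strings and performs no network access.

```python
# Column OIDs of the Remote Insight Board event log, as listed above.
SM2_EVENT_DATE = "1.3.6.1.4.1.232.9.2.3.2.1.3"     # cpqSm2EventLogDate
SM2_EVENT_MESSAGE = "1.3.6.1.4.1.232.9.2.3.2.1.4"  # cpqSm2EventLogMessage


def event_oids(index):
    """Instance OIDs to GET for one event, assuming a single integer table index."""
    if index < 0:
        raise ValueError("event index must be non-negative")
    return {
        "time": f"{SM2_EVENT_DATE}.{index}",
        "description": f"{SM2_EVENT_MESSAGE}.{index}",
    }
```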