
Netreo How To: Troubleshooting Alarms Not Recovering

When you’re monitoring an enterprise IT infrastructure, alarms and troubleshooting go hand in hand. Sure, Netreo automatically resolves a number of incidents that trigger certain alarms and helps you triage the rest. But what about those times when alarms go hard/critical and take longer than expected to recover?

For example, a customer has an open incident for a service check, threshold, or host-down alarm that went critical. After a few hours or even days, the alarm for that incident has not recovered. This post explains the troubleshooting steps for Host, Threshold, and Service Check scenarios that take longer than expected to recover.

Host Down – Test Connectivity First

In a Host down scenario, an incident created an alarm that has not recovered. Your first step should be to confirm that Netreo can reach the device by pinging it. Use either the Ping method or the Credential and Connectivity Test.

To test connectivity on a device that is currently in Netreo, navigate to the device dashboard and click Reports. Here you will find a few useful commands.


Ping will initiate 6 ping checks from the Netreo Primary/Service Engine server to the host device.

Ping

By default, most devices in Netreo have a “Ping this host” service check, so checking whether this service check is critical should be among your first steps when testing connectivity.

If the Ping service check is Critical, run the Ping check manually to confirm device connectivity by clicking Reports -> Ping.

Result

If the Ping test shows 0% packet loss, there is connectivity to the device, and you should continue troubleshooting. If the Ping test shows 100% packet loss, you have no connectivity and should reach out to your Network team to troubleshoot connectivity between Netreo and the device.
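If you save the ping output to share with your Network team, the decision above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this post, and the regular expression assumes a Linux-style summary line such as "0% packet loss", which may not match what your Netreo instance prints. Partial loss is not covered above, so the sketch simply flags it.

```python
import re


def packet_loss_percent(ping_output: str) -> float:
    """Return the packet-loss percentage from a ping summary line.

    Assumes a Linux-style summary such as
    '6 packets transmitted, 6 received, 0% packet loss'.
    """
    match = re.search(r"(\d+(?:\.\d+)?)%\s+packet loss", ping_output)
    if match is None:
        raise ValueError("no packet-loss summary found in ping output")
    return float(match.group(1))


def next_step(loss: float) -> str:
    """Map a packet-loss percentage to the follow-up described in this post."""
    if loss == 0:
        return "connectivity confirmed: continue troubleshooting"
    if loss == 100:
        return "no connectivity: contact your Network team"
    # Assumption: the post only covers 0% and 100%, so anything in
    # between is flagged for a closer look rather than decided here.
    return "partial loss: investigate intermittent connectivity"


sample = "6 packets transmitted, 6 received, 0% packet loss, time 5007ms"
print(next_step(packet_loss_percent(sample)))
```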

Another useful tool is the Credential and Connectivity Test. Navigate to Administration -> Tools -> Credential and Connectivity Test.

  1. Select ICMP (Ping) and enter the IP address of the device
  2. If the device requires a Service Engine to be reached, select the appropriate SE
  3. Click Test and verify the results of the Ping request
    1. Focus on the “1 host up” part of the output. If this is present, the device is reachable. If the result shows “0 host up”, the device is not reachable
    2. When the result shows “0 host up”, capture a screenshot as evidence of a possible connectivity issue with the device
  4. Furthermore, you can use this same connectivity test to confirm if specific ports are open, filtered or closed on devices
    1. Ports to check
      1. SNMP – UDP/161
      2. Windows – TCP/5985
    2. If the result includes “filtered” or “closed” in the message, follow up with the Network Support Services team to have these ports opened
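For a quick second opinion on the TCP port, here is a minimal Python sketch of the same open/closed/filtered idea. It is not how the Credential and Connectivity Test works internally, and `tcp_port_state` is a hypothetical helper name. Treating a timeout as "filtered" is an assumption borrowed from common port-scanner conventions. Run it from a machine on the same network path as your Netreo server or Service Engine, otherwise the result says little about what Netreo can reach.

```python
import socket


def tcp_port_state(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP port by attempting a plain connection.

    'open'        -> the connection was accepted
    'closed'      -> the host actively refused the connection
    'filtered'    -> no answer before the timeout (assumed firewall drop)
    'unreachable' -> any other network error (routing, DNS, and so on)
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "closed"
    except socket.timeout:
        return "filtered"
    except OSError:
        return "unreachable"
```

For example, `tcp_port_state("192.0.2.10", 5985)` would check the Windows port listed above on a placeholder address. SNMP on UDP/161 cannot be verified this way, because UDP has no connection handshake to confirm; use the built-in test for that port.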

How to Troubleshoot Discovery Polling Failures

To validate that credentials are entered for the device, go to the Device Dashboard. From the dashboard, open the Device Admin Main Page through the gear icon. At the bottom of the page, you’ll find the Authentication fields. If there is a lock icon on the right side of a field, it is locked by a template. To see which template is locking the field, simply hover your mouse over the icon.

For SNMP Devices

The Test SNMP feature shows whether the device’s current credentials return any data from the device.

First, check SysDescr, which will let you know if the device is responding to SNMP credentials.

You can also use Custom OID under the Select Method dropdown to confirm a valid response from a specific OID from a Poller.

For Windows Devices

For all Windows devices, use the Test WMI tool to confirm the device’s credentials. The Test WMI tool confirms whether or not Windows service classes are receiving viable data.

If Test WMI fails, take the following steps:

  • Confirm port 5985 is open and accessible to Netreo
  • Confirm credentials applied to the device have the permission to query WMI classes

Check out WMI Class Reference details in Netreo Documentation for the current list of class variables.

Discovery Failure During Auto Scanning

If a device fails discovery while in the IP Candidates table, you can find it in the Failed group on the Device Management page.

Note: An existing device that failed to rediscover may not be displayed here.

How Automatic Discovery Occurs

In addition to automatic device discovery, onboarding and configuration during initial Netreo deployments, Automatic Discovery occurs:

  1. During the regularly scheduled auto-discovery process
  2. After a device reboots (if uptime is less than 5 minutes)
  3. Occasionally, during software upgrades

For detailed information on Netreo’s Discovery Poll and Auto Discovery / Rediscovery, plus additional help with any questions you may have, be sure to visit Netreo Documentation.

You can also use the Audit Log to confirm whether a device was discovered properly and which templates were applied to it.

System Diagnostics

If you still haven’t uncovered why this pesky alarm hasn’t recovered, your next step is to check System Diagnostics.

A few places within a Netreo instance show system-level information, as well as information about the processes and actions the instance performs. The following walks you through the most relevant widgets on the System Diagnostics page.

Since Netreo maintains the Netreo Cloud for SaaS deployment, the following is only relevant to troubleshooting Netreo on-premises deployments.

System Diagnostics (On-prem instances only)

Overall system performance variables for any on-premises Netreo instance are visible within the System Diagnostics screen. This screen is accessible by going to Administration -> System -> System Diagnostics.

This page displays the availability of the different processes that should be running, as well as general information on system-level performance.

Within the Current Processes tile is a list of some of the core processes that should be “OK” in a healthy system.

If any of these processes show as failed, please reach out to Netreo Support Services for assistance.

The Hardware Information tile shows how much of each disk partition is used, which is also very useful for troubleshooting.

If there are any warning notifications, please reach out to Netreo Support Services to review. You definitely don’t want to have any of these partitions filled up.

Missing Performance Data

The System Timers widget helps you find the underlying issue when performance data is missing. It shows how long the Netreo instance is taking to check the configured devices for availability, as well as the average time Netreo is taking to gather statistics from them.

Click the blue graph at the top right of the widget and select Filter. From there, choose a timeframe that includes the time of the issue you’re troubleshooting.

Latency should be no more than 180 seconds, and Queue size should never be more than 300 seconds. If those values have been reached, reach out to Netreo Support Services and provide details for them to review.

The last item of interest is the widget that shows Memory Statistics.

As with System Timers, click the blue graph and choose a timeframe that covers the issue you’re investigating. Note that memory utilization hovering around 90 to 95 percent is common. However, if you see more than 95 percent, or a large spike in utilization at the time of the issue, you’ll want to reach out to Netreo Support Services for assistance.
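If you jot down the values you read from these widgets, the thresholds above can be captured in a small Python checklist. This is only a sketch: the function and constant names are invented for this post, the values are ones you type in by hand (nothing here reads from Netreo), and flagging a timer as soon as it reaches its limit is one reading of the guidance above. A memory spike at the time of the issue still needs a look at the graph itself.

```python
from typing import List

# Limits as stated in this post; the names are illustrative only.
MAX_LATENCY_SECONDS = 180
MAX_QUEUE_SIZE = 300  # the post quotes this limit in seconds
MAX_MEMORY_PERCENT = 95


def support_review_reasons(latency: float, queue_size: float,
                           memory_percent: float) -> List[str]:
    """Return the reasons, if any, to contact Netreo Support Services."""
    reasons = []
    # Timers are flagged once the limit has been reached.
    if latency >= MAX_LATENCY_SECONDS:
        reasons.append(f"latency {latency} reached {MAX_LATENCY_SECONDS}")
    if queue_size >= MAX_QUEUE_SIZE:
        reasons.append(f"queue size {queue_size} reached {MAX_QUEUE_SIZE}")
    # Memory at 90 to 95 percent is common, so only flag above 95.
    if memory_percent > MAX_MEMORY_PERCENT:
        reasons.append(f"memory {memory_percent}% is above {MAX_MEMORY_PERCENT}%")
    return reasons


# Hand-entered example values, not real measurements.
print(support_review_reasons(latency=200, queue_size=10, memory_percent=92))
```

An empty list means none of the three limits were hit for the timeframe you checked.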

If all of the above widgets look normal, move on to checking your Audit and Debug Logs.

Audit and Debug Logs 

Audit Log 

Netreo captures any action taken by any user for your review in the Audit Log. This comes in handy in many situations. Below are some sample scenarios and examples of the information in the logs that apply to troubleshooting alarms not recovering.

To access the Audit Logs, go to Administration -> System -> Audit Log.

Devices Not Being Added Properly

Use the Audit Log to find out why a device is not getting added by entering the Device Name or IP in the Device Name field and configuring the timeframe around the time in question.

The above device was not added because there are no working credentials in any of the templates. If you see a failure such as the one above, check out the Testing Credentials section in our recent How To post, Troubleshooting Configuration Backup Issues.

Missing Performance Data Issues

Another good example is using the Audit Log to investigate any changes in templates that may have prevented devices from polling correctly (e.g., changes in template credentials).

This can be done by entering the template name in the message field and configuring the timeframe around the time a change may have occurred.

In the above example, we confirm that user ‘netreo’ changed settings in the Cisco Firewall Template. Seeing the username netreo all in lower case indicates an automated process ran. Any other username will identify a user that has access to the system and made a change.

If you see that a template was recently applied by a user other than netreo, you can again refer to the Testing Credentials section in our recent How To post, Troubleshooting Configuration Backup Issues for further information.

Debug Log

The Debug Log is used to troubleshoot issues related to incidents, alarms, and polling. To access the Debug Log, go to Administration -> System -> Debug Log.

Alarms Not Received or Not Recovering

You can review the alarms that have been processed for a device by searching for the device name in the Keywords field.

In the example above, we can see that the service check “TCP Check for Port 777” went critical and created an incident. It then recovered shortly afterwards.

There are several key macros to look for to determine the behavior. “NOTIFICATIONTYPE” shows whether the alarm is a Recovery or a Problem (Warning or Critical). If you are troubleshooting why an alarm is not recovering, confirm that the following entries exist.

  • For an alarm to create an incident and alert, there should be ‘ALERTSTATE’ => ‘CRITICAL’
  • For an alarm to recover, there should be ‘NOTIFICATIONTYPE’ => ‘RECOVERY’
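If you copy the relevant Debug Log entries into a text file, a short Python sketch like the one below can pick out those two macros for you. The entry layout used here is an assumption based on the `'NAME' => 'VALUE'` pairs quoted above, and the sample lines are made up for illustration; adjust the pattern if your log entries look different.

```python
import re
from typing import Dict, List

# Matches pairs such as 'NOTIFICATIONTYPE' => 'RECOVERY'.
# Straight and curly single quotes are both accepted.
MACRO_PATTERN = re.compile(
    r"['‘’]([A-Z]+)['‘’]\s*=>\s*['‘’]([A-Za-z]+)['‘’]"
)


def extract_macros(entry: str) -> Dict[str, str]:
    """Return every macro name and value found in one log entry."""
    return {name: value for name, value in MACRO_PATTERN.findall(entry)}


def has_recovered(entries: List[str]) -> bool:
    """True if any entry carries a RECOVERY notification."""
    return any(
        extract_macros(entry).get("NOTIFICATIONTYPE") == "RECOVERY"
        for entry in entries
    )


# Hypothetical entries, not real Netreo output.
entries = [
    "'NOTIFICATIONTYPE' => 'PROBLEM', 'ALERTSTATE' => 'CRITICAL'",
    "'NOTIFICATIONTYPE' => 'RECOVERY'",
]
print(has_recovered(entries))
```

If `has_recovered` comes back false for the device you searched, no recovery notification appears in the entries you collected, which is a useful detail to pass along to Netreo Support Services.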

Another useful keyword to search for is the Incident ID of the incident you are troubleshooting.

Conclusion

Following these steps should identify the cause of any alarm that fails to recover. Many experienced Netreo users will know how to use this information to correct the issue. However, please never hesitate to contact Netreo Support Services whenever you encounter challenges or at any step of your own troubleshooting efforts.

For those pondering a switch, check out details on how the Netreo Platform delivers maximum value as your infrastructure management solution. Better still, Request a Demo Today!
