“No downtime” is the usual mantra for network professionals, whose users demand 100% availability. But at the Department of Veterans Affairs (VA), another requirement drives the team that manages the agency’s Enterprise Network, Systems and Applications: “Never Be Blind.”
Speaking at the recent SolarWinds Federal and Government User Group Conference in Washington, DC, Jamison C. (J.C.) Jennings, Visibility Systems Engineer for Salient CRGT, said the VA’s determination to always “be in the know” about the state of its Internet gateways led to what was, when first implemented, a unique approach to network monitoring: running two instances of SolarWinds network management and monitoring tools at all times.
This “Active/Active” solution, he explained, keeps end users from seeing interruptions while system upgrades are performed. The approach also protects against the loss of critical information: if data is accidentally deleted in one instance, it is still available in the other.
An onsite VA contractor of 15 years, Jennings described the range of challenges he’s faced while helping to grow and evolve the VA’s critical systems, along with the best practices they’ve developed to ensure better outcomes.
Jennings described the issues that his team has been addressing: “We were lacking visibility into system and server health, lacking visibility into application or database performance.” Along with those challenges, the VA was dealing with limited automation for compliance reporting.
To tackle these concerns, the VA implemented multiple SolarWinds network management and monitoring tools. Jennings and his team manage 14 instances of SolarWinds at the VA’s NOC, and he also provides expert guidance to the Enterprise Management team, which is responsible for 9 additional instances of the tools. In addition to managing 9 separate regional instances, they’ve also consolidated visibility across regions using SolarWinds Enterprise Operations Console. This is important in light of the agency’s technical reorganization, which gives the LAN, WAN, and Platform teams national responsibility.
Dark Clouds Ahead?
To meet a COOP (Continuity of Operations) requirement, Jennings and his team were tasked with developing a process to cut off all connections to the Internet within 15 minutes. This mandate, designed to protect essential data – including the personally identifiable records of millions of veterans – required not only identifying the circuits that need to come down, but also scripting that could make the process simple for a non-technical user. J.C. explained that the VA could use this capability to isolate the network in the case of a particularly nasty zero-day attack.
The team developed the scripts to run on SolarWinds Network Configuration Manager (NCM); the solution they developed requires multiple layers of password security to run, protecting against accidental or malicious insider use. Of course, he added, they also developed the scripts to reconnect to the Internet quickly.
Best Practices
Jennings stressed the importance of after-action reporting in case of incidents. “It’s an opportunity for the engineers to identify what happened, were we monitoring it and if not, can we monitor it? Some of our best application monitoring that we’ve created has come from this kind of after-action report.”
“A constant struggle,” Jennings said, “is to keep the monitoring environment ‘clean’ and current.” For example, he said, “I don’t want to monitor resources that don’t exist anymore, so change control is essential. We constantly explain to end users the impact of making changes: for example, migrating hard drives or changing the network interface on a device without notifying the monitoring team results in (1) not monitoring a resource that needs to be monitored and (2) wasting resources on monitoring an element that no longer exists.”
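The two failure modes Jennings describes amount to a comparison between what the monitoring system polls and what actually exists. A minimal sketch of that comparison in Python follows; it is not the VA's process or a SolarWinds feature, and the function name, the element naming scheme, and the sample inventories are all hypothetical.

```python
def reconcile(monitored, actual):
    """Compare the elements being monitored with the elements that exist.

    Both arguments are iterables of element identifiers (hypothetical
    "device:element" strings here). Returns a dict with two sorted lists:
      - "unmonitored": elements that exist but are not being monitored
      - "stale": elements still monitored that no longer exist
    """
    monitored, actual = set(monitored), set(actual)
    return {
        "unmonitored": sorted(actual - monitored),
        "stale": sorted(monitored - actual),
    }


# Hypothetical example: the network interface on srv-app-01 was changed
# from eth0 to eth1 without notifying the monitoring team.
monitored = ["srv-app-01:eth0", "srv-app-01:disk-c", "srv-db-02:eth0"]
actual = ["srv-app-01:eth1", "srv-app-01:disk-c", "srv-db-02:eth0"]

report = reconcile(monitored, actual)
print(report)
```

In this made-up case, the new interface shows up as unmonitored and the old one as stale, which is the pair of problems a change-control notice is meant to prevent.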
For SolarWinds users with multiple instances, he also recommended using custom properties to standardize your controls, making it much easier to generate reports and alerts, create views, and set account limitations.
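Standardizing custom properties across instances only helps if every instance actually defines the same set. The short Python sketch below shows one way such a consistency check could look. The property names, instance names, and the idea of exporting each instance's property list into a dictionary are assumptions for illustration, not something described in the talk or taken from the SolarWinds API.

```python
# Hypothetical set of custom properties every instance is expected to define.
REQUIRED = {"Region", "Owner_Team", "Criticality"}


def missing_properties(instances, required=REQUIRED):
    """Return {instance name: sorted list of required properties it lacks}.

    `instances` maps an instance name to the custom property names
    defined on it. Instances that define every required property are
    left out of the result.
    """
    return {
        name: sorted(required - set(props))
        for name, props in instances.items()
        if not required <= set(props)
    }


# Hypothetical export of the custom properties defined on two instances.
instances = {
    "region-1": ["Region", "Owner_Team", "Criticality"],
    "region-2": ["Region", "Criticality", "Site_Code"],
}

gaps = missing_properties(instances)
print(gaps)
```

A report or alert that filters on a property such as the hypothetical `Owner_Team` would silently skip nodes on any instance listed in `gaps`, which is why the check is worth running before building shared views.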
If you don’t have a good DBA (database administrator), get one, Jennings suggests, as a good DBA can figure out and fix errors quickly. And if you do have one, be sure to show your appreciation for the value they bring. His recommendation: “Give them chocolate.”
He also recommends that SolarWinds administrators take some SQL querying training themselves, to provide a foundation on how to generate reports and make it easier to understand what can and can’t be done.
But Jennings offered one piece of advice to the User Group attendees that resonates through any career path, technical or otherwise: “Always be willing to learn something new.” In fact, J.C. strongly advocated THWACK, SolarWinds’ online community, where there are many opportunities to learn and share with fellow SolarWinds users.
Want to hear more from J.C.? Sign up for SolarWinds’ May 23rd webinar, which runs from 11 a.m. to noon EDT.