Re: Success Stories of gaining operational value from LEM


Responding with more thoughts has been on my to-do list for a while, but I never got around to it... so here are some thoughts on the examples I listed.

1. Company has a situation where downtime directly costs them money but doesn't invoke any regulatory compliance issues. A virus, an outage, etc., means people are literally not spending money with them, and seconds tick by fast. A security issue causes at least an hour's worth of downtime and could put them out for the entire day. For all these reasons, they have service accounts whose passwords have to be shared (think a whiteboard of passwords with admin access to a set of servers) so that a set of operators can fix issues quickly. Their IT team thought they had restricted usage of these accounts via GPO, since they were highly privileged. Not so - we were able to audit usage of these accounts, find people logging on with them, and catch them making unexpected changes to their own systems (like adding themselves to local admins, installing software, etc.).

 

In this case, their accounts were named consistently - say "svc_XXXXX". We identified them by the following (a rough scripted equivalent is sketched after this list):

  1. Creating a filter looking for Auth Audit Events or Change Management Events with Source or Destination Account of "svc_*"
  2. Doing an nDepth search for "svc_*" (or "User Name = svc_*")
  3. Running the Resource Configuration and Authentication master reports and filtering to source or destination account of "svc_*" - sometimes we'd just run spot checks of things like UserLogonFailure or UserLogon by User instead of the full big reports that can take quite a while to run.
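
If you wanted to run a similar sweep outside of LEM against an exported Security log, a minimal sketch might look like this - the CSV file name and the time/event_id/account/machine column names are my assumptions for the example, not LEM's schema:

# Minimal sketch: flag svc_* account activity in an exported security log.
# Assumes a hypothetical CSV export with "time", "event_id", "account",
# and "machine" columns - adjust to whatever your export actually contains.
import csv
import fnmatch

def find_service_account_events(csv_path, pattern="svc_*"):
    """Yield rows whose account matches the svc_* naming convention."""
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if fnmatch.fnmatch(row["account"].lower(), pattern):
                yield row

for event in find_service_account_events("security_export.csv"):
    print(event["time"], event["event_id"], event["account"], event["machine"])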

 

Since they did have a set of machines this usage was allowed on, we were more restrictive when we built rules. With filters/searches/reports it wasn't so bad since there wasn't a TON of volume, but alerts had to be more specific. Since they had agents on all of their servers, we created a Connector Profile of the core systems these accounts COULD log on to (you could also use a User-Defined Group), and then there were two things we were most interested in (a plain-code sketch of both criteria follows the list):

  1. Interactive User Logons by these users to machines that weren't in the approved list
    1. Criteria: UserLogon.LogonType = *Interactive AND UserLogon.DestinationMachine <> List of Approved Machines AND UserLogon.DestinationAccount = svc_*
    2. In their case they had the action set to email temporarily, then once they were comfortable they set it to use the Log Off User active response.
  2. Any changes made by these users to machines that weren't in the approved list
    1. Criteria: Change Management Events.SourceAccount = svc_* AND Change Management Events.DestinationMachine <> List of Approved Machines
    2. In this case, the action was just to email, though they were thinking about logging off the user here too.
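
For illustration, here's a minimal sketch of those two criteria as plain predicates over a normalized event dict - the field names (logon_type, dest_machine, dest_account, source_account, event_type) and the approved-machine list are assumptions for the example, not LEM's actual schema:

# Minimal sketch of the two rule criteria above as plain predicates.
# Field names and the approved-machine list are hypothetical.
import fnmatch

APPROVED_MACHINES = {"APPSRV01", "APPSRV02", "DBSRV01"}  # hypothetical approved list

def is_unapproved_interactive_logon(event):
    # Rule 1: interactive svc_* logon to a machine outside the approved list.
    return (
        "interactive" in event.get("logon_type", "").lower()
        and event.get("dest_machine") not in APPROVED_MACHINES
        and fnmatch.fnmatch(event.get("dest_account", "").lower(), "svc_*")
    )

def is_unapproved_change(event):
    # Rule 2: change-management event by a svc_* account on an unapproved machine.
    return (
        event.get("event_type") == "ChangeManagement"
        and fnmatch.fnmatch(event.get("source_account", "").lower(), "svc_*")
        and event.get("dest_machine") not in APPROVED_MACHINES
    )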

 

We looked at creating a rule for logons that would tell us when these accounts might have been used on other systems (like someone's workstation), but we hadn't put it in production when I was working with them. In that case you'd create a second UserLogon rule without the LogonType restriction (you also can't use the Log Off User action on it, which is why we kept it as a second rule - in their case they used filters pretty extensively, so they set up a filter notification instead).

 

2. We used an example in the first SolarWinds Lab episode of a customer whose firewall kept going down, down, down, regardless of what they did. Their connections were being used up so quickly they thought there was a bug in their firmware. Interface utilization was off the charts. We were able to figure out it was actually a worm - almost every machine in their infrastructure was infected. They were a healthcare org, so it wasn't entirely business-crippling, but all of their remote sites/clinics were isolated (they connected back via VPN, which couldn't establish or maintain a connection), which affected patient care, access to records, etc. We were able to resolve it by identifying infected machines, cleaning them up, then continuing to filter and monitor for new infections.

 

In this case the "canary in the coal mine" was their firewall grinding to a halt and people complaining. Not the best warning, but that's reality, right?

 

What we did next was look at their Console and create a filter for ALL events from their firewall (so Any Event.DetectionIP = <firewall's IP> - same idea as the stock "All Firewall Events" filter) and saw data just RIPPING through. (We actually didn't even need to create the filter - it was pretty clear just LOOKING at the Console, though it was hard to tell whether that was normal since we hadn't worked with them before... it just didn't smell right.) After seeing that, we headed over to the "All Network Traffic" filter to do some more digging.

There's a stock "Network Event Trends by Source Machine" widget there, which showed a TON of Source Machines - not entirely useful since we didn't know their network beforehand. So we created a similar widget for "Network Event Trends by Source Port" (and another for Destination Port) - for a line chart, use Field: Event Name, Show: Count, Versus: Time, Split By: SourcePort (or DestinationPort). There was a huge gap between one port and all the rest on either the source or destination side (I can't remember which virus it was...) - ONE dominant port for most of their traffic. A quick Google for that port and, voila, malware.
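
That widget is doing nothing more exotic than counting events per port, which you can sanity-check yourself against exported firewall logs - a minimal sketch, assuming event dicts with a dest_port key (my naming, not LEM's):

# Minimal sketch of the "trends by port" idea: count firewall events per
# destination port and surface the dominant one. The "dest_port" key is
# an assumed schema.
from collections import Counter

def top_ports(events, n=5):
    """Return the n most common destination ports and their counts."""
    counts = Counter(e["dest_port"] for e in events if "dest_port" in e)
    return counts.most_common(n)

# One port utterly dominating the counts is the tell:
sample = [{"dest_port": 445}] * 9000 + [{"dest_port": 443}] * 40
print(top_ports(sample))  # [(445, 9000), (443, 40)]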

 

What we did next was build a rule that fired on several hits to that port and dropped the SourceMachine into a UDG. This was our "known bad machines" UDG, which they used to drive cleanup efforts (a rough sketch of the rule logic follows the list):

  1. Criteria: Network Audit Events.SourcePort = <the bad port we identified before>, threshold: 5 in 30 seconds
  2. Action: Add to UDG, Network Audit Events.SourceMachine
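
Outside of LEM, that threshold logic is a simple sliding window - a minimal sketch, with a placeholder port and assumed time/source_port/source_machine fields (epoch-second timestamps):

# Minimal sketch of the threshold rule: 5 hits on the bad port within
# 30 seconds from one source machine adds it to a "known bad" set,
# which stands in for the UDG. All field names are assumptions.
from collections import defaultdict, deque

BAD_PORT = 5554            # placeholder - use the port you identified
THRESHOLD, WINDOW = 5, 30  # 5 hits in 30 seconds, matching the rule

recent_hits = defaultdict(deque)  # source machine -> recent hit timestamps
known_bad = set()                 # stands in for the "known bad machines" UDG

def process(event):
    # Only events on the bad port count toward the threshold.
    if event.get("source_port") != BAD_PORT:
        return
    hits = recent_hits[event["source_machine"]]
    hits.append(event["time"])
    # Age out hits older than the 30-second window.
    while hits and event["time"] - hits[0] > WINDOW:
        hits.popleft()
    if len(hits) >= THRESHOLD:
        known_bad.add(event["source_machine"])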

 

This built a nice list of infected systems (it was bad - there were a lot of them at first). As they cleaned a system, they removed it from the UDG, and slowly the list dwindled. They also had a filter for Network Audit Events.SourceMachine = <known bad machines UDG> to show them whether there was still traffic coming to/from those systems. (You could also create one just for the known bad port.)

 

To figure out the history of what had happened, we identified the specific port and traffic, and used nDepth to dig back (e.g. a search for Network Audit Events.SourcePort = <the bad port>). We slid the window back far enough to find when it started. We also tried to dig for malware events from their firewall or AV, but we never spotted any (e.g. a search for VirusAttack events over the same timeframe).
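
That "slide the window back" step amounts to finding the earliest hit on the bad port - over an exported history it's trivial, something like this (same assumed fields as the earlier sketches):

# Minimal sketch: earliest occurrence of traffic on the bad port,
# i.e. roughly when the infection started. Assumes "time" (epoch
# seconds) and "source_port" fields.
def first_infection_time(events, bad_port):
    times = [e["time"] for e in events if e.get("source_port") == bad_port]
    return min(times) if times else None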

 

3. When I managed IT for TriGeo before the acquisition, I ran into all kinds of stuff that would have taken me forever to untangle without a system aggregating logs. The most amusing part, though, was that everyone knew we used it, and they'd come to me when issues happened before I really knew they were a problem - because they assumed, in a "big brother" sort of way, that I already knew. (Sometimes I did, sometimes I didn't - yet.) I could probably drag out a bunch of stories of how logs saved my bacon or really sped up my job; I'm pretty sure that without it we'd have had to hire "real" IT people beyond myself and the couple of people who helped with both helpdesk-level support and our hardware burn-in/imprint process.

 

I used the console a lot, and a lot of our stock rules were really born out of stuff I used on a day-to-day basis. I'll have to rack my brain on what I was actually using, but it included the following (a sketch of what a few of these checks key on at the raw Windows event level follows the list):

  • Notifications on Account Lockouts (stock rule) - when I got notified of one, I'd check the machine name to see if it matched an expected system and usually just unlock the account if I had time (we had an auto-unlock policy, but people are impatient).
  • Any interactive logons directly to my servers - since only a few people should actually be doing this, if I got an email and didn't know it was happening, I jumped on it immediately.
  • Any logons using domain admin accounts, especially the stock account - our domain admin account was renamed and there were really only a few limited domain admin accounts, so we used runas or a logon directly to a server/DC to do account maintenance. When these accounts were used, it meant business. I would often look at the source account to see who was using it, especially if it was from a workstation and not a server (meaning they were using "runas").
  • Any viruses - even if they got cleaned, since it might indicate someone ... internet-promiscuous.
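
For a rough idea of what those rules look for against raw Windows Security events, here's a minimal sketch - the event IDs are the standard ones (4740 = account lockout, 4624 = successful logon, logon types 2/10 = interactive/remote interactive), but the server and admin-account lists and the event-dict fields are hypothetical:

# Minimal sketch: classify exported Security events into the kinds of
# alerts described above. Lists and field names are hypothetical.
SERVERS = {"DC01", "FILESRV01"}          # hypothetical server list
DOMAIN_ADMINS = {"da_alice", "da_bob"}   # hypothetical renamed admin accounts

def classify(event):
    """Return a human-readable alert label for an event, or None."""
    eid = event.get("event_id")
    if eid == 4740:  # account lockout
        return f"Account lockout: {event['account']} on {event['machine']}"
    if eid == 4624 and event.get("logon_type") in (2, 10):  # (remote) interactive
        if event.get("machine") in SERVERS:
            return f"Interactive logon to server {event['machine']} by {event['account']}"
    if eid == 4624 and event.get("account") in DOMAIN_ADMINS:
        return f"Domain admin logon: {event['account']} from {event.get('source_machine')}"
    return None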

 

I had filters for....

  • Blocked web traffic (we had a web content filter, so I would watch for blocked content); allowed web traffic (here I'd use widgets that would break it down by hostname or username)
  • Network traffic, with widgets by event type, source/destination ports
  • USB-Defender stuff - who was using USB devices and what files/processes they were accessing (I could actually see if someone was running apps off their USB key, which often wasn't a big deal, but a couple of times we did see stuff you could call "out of policy")

 

If I think of more I'll have to add it.

