Educause Security Discussion mailing list archives
ELK v Splunk, was Re: [SECURITY] SIEM Tools; extremely long
From: Kevin Wilcox <wilcoxkm () APPSTATE EDU>
Date: Mon, 22 Jan 2018 19:01:04 -0500
Warning: as the subject says, extremely long message follows.

To give folks who don't know us a little bit of background, Jeff and I have had this conversation at the last two SPCs and probably will again if we're both in Baltimore =) I was trying to stay away from mentioning ELK in that response but since you mention it by name...

On 22 January 2018 at 16:39, Collyer, Jeffrey W. (jwc3f) <jwc3f () virginia edu> wrote:
I do love Splunk. I’m not anti-ELK, but I know what works for me.
Out of curiosity (because I can't remember, and if I *could* remember I wouldn't blurt it on a public list without your approval), what was UVA using for log agg before Splunk?
Schema-on-search is a huge huge win to me. I do not know all the fields I may care about in all my data prior to ingesting, and we're adding more every day. We can argue about needing to know everything in your data prior to indexing, but that's not a Splunk failing if you consider it one. Splunk solves that problem for me.
If you'd be so kind as to indulge me :) Let's say I send you two million logs that look like this (assume they're sent when they were generated):

1st Jan 2018 ron draco fail
3rd Jan 2018 hagrid hermione pass
5th Jan 2018 hagrid ron pass
2nd Jan 2018 hermione cho pass

I then tell you that you can get each user's current house assignment from Active Directory, LDAP or <this web service>.

What has to happen, from a Splunk perspective, if I want only the logs where proxy instances passed traffic for students *except* I don't want logs from the "cho" instance, and I need them sorted by the house each student was in at the time? Is that impacted if I tell you I need only the ones between six and eight months old...but I need the results/dashboard for a presentation in ten minutes? Is that impacted if my data set spans fifty billion proxy logs from ten proxy instances for 30k accounts assigned to one of twenty houses...but there is a hiccup and Splunk can't get to AD, LDAP or the web service with house assignments right now?

Yes, this is a "cherry-picked" scenario, but it's picked because it brings together things I've brought up in other messages and it's a simple example of something I'll mention at the end of this email about enrichment (the 'E' in SIEM, in modern times, should stand for Enrichment).
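To make the "at the time" requirement concrete, here is a toy sketch in Python of the search-time work that question implies. Everything in it is an assumption for illustration: the field layout (date, user, proxy instance, result), the ISO-normalized dates, and the house-assignment history are all invented, not details from any real deployment.

```python
# Toy sketch: schema-on-search filtering plus a point-in-time join.
# Field layout (date, user, instance, result) and all data are hypothetical.
from datetime import date

RAW_LOGS = [
    "2018-01-01 ron draco fail",
    "2018-01-03 hagrid hermione pass",
    "2018-01-05 hagrid ron pass",
    "2018-01-02 hermione cho pass",
]

# user -> list of (effective_date, house). In the scenario this would come
# from AD/LDAP/a web service - and "house at the time" means you need
# history, not just the current value those sources hand back.
HOUSE_HISTORY = {
    "ron":      [(date(2017, 9, 1), "gryffindor")],
    "hagrid":   [(date(2017, 9, 1), "gryffindor"),
                 (date(2018, 1, 4), "hufflepuff")],
    "hermione": [(date(2017, 9, 1), "ravenclaw")],
}

def house_at(user, when):
    """House the user was in on a given date."""
    current = None
    for effective, house in sorted(HOUSE_HISTORY.get(user, [])):
        if effective <= when:
            current = house
    return current

def search():
    results = []
    for line in RAW_LOGS:
        day, user, instance, result = line.split()   # parsed at search time
        when = date.fromisoformat(day)
        if result != "pass" or instance == "cho":    # passed traffic, minus "cho"
            continue
        results.append((house_at(user, when), day, user, instance))
    return sorted(results)                           # sorted by house at the time
```

Note that the same user lands under two different houses depending on the log date, which is exactly the detail a lookup against "current" AD/LDAP values gets wrong.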
Having a company with paid support behind the product and a large community of supporters is invaluable. I'm one guy. I stood up a 2 search head, 3 indexer, 2 forwarder cluster by myself. Managing it is not a full time job. ELK is much more intensive. Support from Splunk itself has always been top notch when I've needed it. At .conf2016 (the Splunk conference) the documentation group had a booth. I complained to them about something I had found unclear. They took notes and actually thanked me for the feedback. The docs were updated to be clearer the Monday following the conference. They take feedback seriously.
For comparison: I stood up six logstash nodes (equivalent to heavy forwarders), three kibana nodes (search heads), two rabbitmq message buffers (which should be part of *any* SIEM deployment) and five Elasticsearch nodes (indexers) by myself with puppet. Currently I can re-deploy the entire environment in about twenty minutes (tested when we pushed out two new elasticsearch nodes and five new logstash nodes last month). The only time I log in to those nodes is to disable/re-enable shard allocation (one command each) during system updates, because we force a reboot as part of the update process - and even then I only log in to one system.

In a couple of months I'm replacing it with a forty-node ecosystem that I plan to roll out in its entirety over the course of about three days. The entire first day is allotted to just making sure the servers have their RAID configured, the base OS installed, the IPs/networking sorted and the TLS certificates issued. On the second day I'll push the stack out, and on day three I'll shut down the eighteen nodes in the existing ecosystem. Easy peasy. By "push the stack out", I mean I will add the new hostnames/IPs to the appropriate configs in puppet and leave for lunch.
When I get back with my sandwich a half hour later, I'll have:

o twenty+ Elasticsearch nodes clustered, with certificates rolled out for TLS everywhere, replicating our existing data to the new nodes
o two RabbitMQ nodes joined to our existing RabbitMQ cluster and serving log data
o three logstash nodes facing the Internet so authenticated systems can send telemetry data, with failover
o ten more logstash nodes with additional services running to do enrichment on logs between the RMQ cluster and Elasticsearch *using our existing patterns and parsers*
o two or three 'machine learning' nodes joined to the cluster, ready for analysis jobs
o a handful of kibana nodes facing our internal network with local users added, AD authentication ready for non-local accounts, and the dashboards from our currently-running instance available

Everything but RabbitMQ will have paid support and a one-hour guaranteed response time from an engineer who knows my OS, my configuration, etc. I plan to spend less than two hours per month maintaining that environment (parsers for new log formats, dashboards, etc., aren't factored in because I would still front Splunk with logstash or nifi, and users can create their own dashboards).

Deployment and configuration management should be mostly automated. It should be fairly trivial to re-deploy the environment on new hardware, or to grow the environment when load becomes an issue. There should be avenues for support regardless of product. Those are commonalities I didn't bring up initially because, well, they're common across all platforms. There's nothing special about the automation we have in our ELK environment (or will have in the new one) - I'd do the same thing if we'd gone with Splunk, because I'm also a one-person SecOps group and as much as I like USING log agg, I don't want to spend a lot of time setting it up.
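For anyone curious what the "one command each" shard-allocation toggle looks like, here is a minimal Python sketch. The node URL is a made-up placeholder; the setting name is Elasticsearch's standard cluster-level allocation switch, set through the cluster-settings API.

```python
# Sketch of toggling shard allocation around a rolling reboot.
# ES_URL is a hypothetical node; the setting is the standard
# cluster.routing.allocation.enable switch in _cluster/settings.
import json
from urllib import request

ES_URL = "https://es-node-1.example.edu:9200"   # hypothetical node

def allocation_payload(enable):
    """Body for PUT _cluster/settings. 'primaries' pauses replica
    shuffling during the reboot; None clears the transient override."""
    return {"transient": {"cluster.routing.allocation.enable": enable}}

def set_allocation(enable):
    """Send the toggle to the cluster (needs a live cluster to run)."""
    req = request.Request(
        f"{ES_URL}/_cluster/settings",
        data=json.dumps(allocation_payload(enable)).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    return request.urlopen(req)

# Before rebooting a node:  set_allocation("primaries")
# After it rejoins:         set_allocation(None)
```

One call before the reboot, one after, from a single system - which is why the maintenance burden stays so low.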
Ultimately it comes down to paying to be able to do more, faster, with Splunk, or devoting manpower/time to managing ELK. You pay either way; it just comes down to what you want to pay with.
I maintain that the management overhead is similar for all but the largest deployments (a couple of hundred nodes consuming hundreds of thousands of events per second at multiple terabytes per day). I know of one company with > 500 indexers and search heads in their Splunk environment; they have forty full-time staff to manage Splunk *only*, and those forty never touch the actual hardware (because it's sitting in a cloud data centre). FireEye have separate Elastic clusters for each of their TAP customers, some in excess of 125 ES nodes - I dare say they have a VERY high server-to-admin ratio. With proper automation and configuration management, it shouldn't matter - pushing two nodes is only a little less work than pushing twenty or two hundred.

Orgs that are indexing a terabyte+ per day into Splunk or QRadar or ArcSight, and that have lots of processes built around it, are probably not in a position to augment their SIEM, because they've been in their current offering long enough that they've *probably* built around its deficiencies. It's the others who have big decisions to make - the ones who need to really think about what they want to accomplish and what they *can* accomplish with presumably very little human capital.

I've had two dozen people tell me that "limited people resources means we need to buy something like QRadar or Splunk + ES". No no no no no. What they NEED is to understand the difference between schema-on-read and schema-on-write; how each one ties in with basic full-text indexing, searching and reporting; how vertical and horizontal scaling affect their deployment and growth strategies; how enrichment will affect their search, reporting and hardware expectations; and how a single interesting network event can blow through their license (Splunk has thankfully addressed this), knock their SIEM over for <minutes/hours/days> or cause <x amount> of existing log data to be lost.
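The schema-on-read versus schema-on-write distinction can be shown in a few lines. This is a toy illustration in Python - not Splunk's or Elastic's actual internals - using an invented key=value log line to show what each approach can answer later.

```python
# Toy contrast of schema-on-write vs schema-on-read.
# The event format and field names are invented for illustration.
import re

RAW = ["src=10.1.2.3 user=ron action=pass bytes=1234"]

# Schema-on-write: you pick the fields at ingest; those are the
# only fields your index knows about afterwards.
INDEX_TIME_FIELDS = ("src", "user")            # decided before ingest

def index_on_write(raw_events):
    indexed = []
    for event in raw_events:
        fields = dict(re.findall(r"(\w+)=(\S+)", event))
        indexed.append({k: fields[k] for k in INDEX_TIME_FIELDS})
    return indexed                              # "bytes" never made it in

# Schema-on-read: keep the raw event and extract whatever the
# current question needs at search time.
def search_on_read(raw_events, wanted):
    results = []
    for event in raw_events:
        fields = dict(re.findall(r"(\w+)=(\S+)", event))
        results.append({k: fields.get(k) for k in wanted})
    return results                              # new questions, no re-ingest
```

The trade is the usual one: schema-on-write buys fast, predictable searches over fields you anticipated; schema-on-read buys the ability to ask about fields you didn't, at the cost of doing the extraction on every search.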
They NEED to consider whether they're going to buy something for a half-million dollars that checks <x> boxes, or whether they can spend 1/3 of that and get 90% of those same boxes checked with a similar time investment.
From a usability and efficiency perspective, it's pointless for me to say "schema-on-write is better than schema-on-read" (or the reverse) unless I can show specific examples of why, or "make sure you enrich before you index" unless I can show them where it will save them time (and analysis money) down the line. They need to understand the difference between giving their analysts and admins a dashboard with a log event that looks like this:

{
  "query": "foo.com",
  "client": "10.1.2.3"
}

and something that looks like this within seconds of the logged query being sent to the SIEM:

{
  "query": "foo.com",
  "added_query_time": "2017-01-01T12.32.123445Z",
  "added_response_time": "325 ms",
  "added_org_dns_server": "10.20.30.40",
  "added_auth_dns_server": "98.76.54.32",
  "added_auth_dns_asn_name": "Bulletproof Hosting Subsidiary",
  "added_domain_registrant": "Some Random Company",
  "added_domain_age_days": "12",
  "added_domain_entropy": "0.2",
  "added_domain_ip": "12.34.56.78",
  "added_domain_ip_asn": "Bulletproof Hosting",
  "added_domain_ip_geoip_city": "London",
  "added_domain_ip_geoip_country": "England",
  "added_domain_first_seen": "NEW",
  "added_domain_in_intel": "true",
  "added_domain_intel_reason": "coinminer malware",
  "client": "10.1.2.3",
  "added_client_os": "macOS",
  "added_client_type": "macbook pro",
  "added_client_ou": "IT - Linux Systems",
  "added_client_location": "Some building on your campus"
}

Furthering that understanding, and helping make sure folks make informed architectural/implementation decisions that make life easier for their analysts, is the whole reason for my pre-con workshop at SPC in a few months and why I started the SIEM From Scratch project over Christmas.

kmw
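The shape of the pipeline that turns the first event into the second can be sketched in a few lines of Python. The lookup functions below are stubs with invented return values - in a real deployment they would be logstash filter stages (or equivalent) querying GeoIP, whois, threat intel and asset inventory; only the field names come from the example above.

```python
# Minimal sketch of ingest-time enrichment. The lookups are stubs with
# invented data; a real pipeline would query GeoIP/whois/intel/inventory.

def lookup_domain(domain):
    """Stub domain intel lookup (whois, GeoIP, threat feeds)."""
    return {
        "added_domain_age_days": "12",
        "added_domain_in_intel": "true",
        "added_domain_intel_reason": "coinminer malware",
    }

def lookup_client(ip):
    """Stub asset lookup (inventory, AD/LDAP)."""
    return {
        "added_client_os": "macOS",
        "added_client_ou": "IT - Linux Systems",
    }

def enrich(event):
    """Return the event plus added_* fields, computed once at ingest so
    every later search and dashboard gets them for free."""
    enriched = dict(event)
    enriched.update(lookup_domain(event["query"]))
    enriched.update(lookup_client(event["client"]))
    return enriched
```

The point of doing this between the message buffer and the indexer, rather than at search time, is that the expensive lookups happen once per event instead of once per search - and the enriched copy is what gets indexed, so the context is still there even if the intel source is unreachable later.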