Educause Security Discussion mailing list archives
ELK v Splunk, was Re: [SECURITY] SIEM Tools; extremely long
From: Kevin Wilcox <wilcoxkm () APPSTATE EDU>
Date: Mon, 22 Jan 2018 19:01:04 -0500
Warning: as the subject says, extremely long message follows.

To give folks who don't know us a little bit of background, Jeff and I have had this conversation at the last two SPCs and probably will again if we're both in Baltimore =) I was trying to stay away from mentioning ELK in that response but since you mention it by name...

On 22 January 2018 at 16:39, Collyer, Jeffrey W. (jwc3f) <jwc3f () virginia edu> wrote:
I do love Splunk. I’m not anti-ELK, but I know what works for me.
Out of curiosity (because I can't remember, and if I *could* remember I wouldn't blurt it on a public list without your approval), what was UVA using for log agg before Splunk?
Schema-on-search is a huge huge win to me. I do not know all the fields I may care about in all my data prior to ingesting, and we're adding more every day. We can argue about needing to know everything in your data prior to indexing, but that's not a Splunk failing if you consider it one. Splunk solves that problem for me.
If you'd be so kind as to indulge me :) Let's say I send you two million logs that look like this (assume they're sent when they were generated):

1st Jan 2018 ron draco fail
3rd Jan 2018 hagrid hermione pass
5th Jan 2018 hagrid ron pass
2nd Jan 2018 hermione cho pass

I then tell you that you can get each user's current house assignment from Active Directory, LDAP or <this web service>.

What has to happen, from a Splunk perspective, if I want only the logs where proxy instances passed traffic for students *except* I don't want logs from the "cho" instance, and I need them sorted by the house each student was in at the time? Is that impacted if I tell you I need only the ones between six and eight months old...but I need the results/dashboard for a presentation in ten minutes? Is that impacted if my data set spans fifty billion proxy logs from ten proxy instances for 30k accounts assigned to one of twenty houses...but there is a hiccup and Splunk can't get to AD, LDAP or the web service with house assignments right now?

Yes, this is a "cherry-picked" scenario, but it's picked because it brings together things I've brought up in other messages and it's a simple example of something I'll mention at the end of this email about enrichment (the 'E' in SIEM, in modern times, should stand for Enrichment).
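To make the "at the time" requirement concrete, here is a toy sketch in Python of the search-time work that question implies. Everything in it is an assumption for illustration: the field layout (date, user, proxy instance, result), the ISO-normalized dates, and the house-assignment history are all invented, not details from any real deployment.

```python
# Toy sketch: schema-on-search filtering plus a point-in-time join.
# Field layout (date, user, instance, result) and all data are hypothetical.
from datetime import date

RAW_LOGS = [
    "2018-01-01 ron draco fail",
    "2018-01-03 hagrid hermione pass",
    "2018-01-05 hagrid ron pass",
    "2018-01-02 hermione cho pass",
]

# user -> list of (effective_date, house). In the scenario this would come
# from AD/LDAP/a web service - and "house at the time" means you need
# history, not just the current value those sources hand back.
HOUSE_HISTORY = {
    "ron":      [(date(2017, 9, 1), "gryffindor")],
    "hagrid":   [(date(2017, 9, 1), "gryffindor"),
                 (date(2018, 1, 4), "hufflepuff")],
    "hermione": [(date(2017, 9, 1), "ravenclaw")],
}

def house_at(user, when):
    """House the user was in on a given date."""
    current = None
    for effective, house in sorted(HOUSE_HISTORY.get(user, [])):
        if effective <= when:
            current = house
    return current

def search():
    results = []
    for line in RAW_LOGS:
        day, user, instance, result = line.split()   # parsed at search time
        when = date.fromisoformat(day)
        if result != "pass" or instance == "cho":    # passed traffic, minus "cho"
            continue
        results.append((house_at(user, when), day, user, instance))
    return sorted(results)                           # sorted by house at the time
```

Note that the same user lands under two different houses depending on the log date, which is exactly the detail a lookup against "current" AD/LDAP values gets wrong.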
Having a company with paid support behind the product and a large community of supporters is invaluable. I'm one guy. I stood up a 2 search head, 3 indexer, 2 forwarder cluster by myself. Managing it is not a full time job. ELK is much more intensive. Support from Splunk itself has always been top notch when I've needed it. At .conf2016 (the Splunk conference) the documentation group had a booth. I complained to them about something I had found unclear. They took notes and actually thanked me for the feedback. The docs were updated to be clearer the Monday following the conference. They take feedback seriously.
For comparison: I stood up six logstash nodes (equivalent to heavy forwarders), three kibana nodes (search heads), two rabbitmq message buffers (which should be part of *any* SIEM deployment) and five Elasticsearch nodes (indexers) by myself with puppet. Currently I can re-deploy the entire environment in about twenty minutes (tested when we pushed out two new elasticsearch nodes and five new logstash nodes last month). The only time I log in to those nodes is to disable/re-enable shard allocation (one command each) during system updates, because we force a reboot as part of the update process - and even then I only log in to one system.

In a couple of months I'm replacing it with a forty-node ecosystem that I plan to roll out in its entirety over the course of about three days. The entire first day is allotted to just making sure the servers have their RAID configured, the base OS installed, the IPs/networking sorted and the TLS certificates issued. On the second day I'll push the stack out, and on day three I'll shut down the eighteen nodes in the existing ecosystem. Easy peasy. By "push the stack out", I mean I will add the new hostnames/IPs to the appropriate configs in puppet and leave for lunch.
When I get back with my sandwich a half hour later, I'll have:

o twenty+ Elasticsearch nodes clustered, with certificates rolled out for TLS everywhere, replicating our existing data to the new nodes
o two RabbitMQ nodes joined to our existing RabbitMQ cluster and serving log data
o three logstash nodes facing the Internet so authenticated systems can send telemetry data, with failover
o ten more logstash nodes with additional services running to do enrichment on logs between the RMQ cluster and Elasticsearch *using our existing patterns and parsers*
o two or three 'machine learning' nodes joined to the cluster, ready for analysis jobs
o a handful of kibana nodes facing our internal network with local users added, AD authentication ready for non-local accounts, and the dashboards from our currently-running instance available

Everything but RabbitMQ will have paid support and a one-hour guaranteed response time from an engineer who knows my OS, my configuration, etc. I plan to spend less than two hours per month maintaining that environment (parsers for new log formats, dashboards, etc., aren't factored in because I would still front Splunk with logstash or nifi, and users can create their own dashboards).

Deployment and configuration management should be mostly automated. It should be fairly trivial to re-deploy the environment on new hardware, or to grow the environment when load becomes an issue. There should be avenues for support regardless of product. Those are commonalities I didn't bring up initially because, well, they're common across all platforms. There's nothing special about the automation we have in our ELK environment (or will have in the new one) - I'd do the same thing if we'd gone with Splunk, because I'm also a one-person SecOps group and as much as I like USING log agg, I don't want to spend a lot of time setting it up.
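For anyone curious what the "one command each" shard-allocation toggle looks like, here is a minimal Python sketch. The node URL is a made-up placeholder; the setting name is Elasticsearch's standard cluster-level allocation switch, set through the cluster-settings API.

```python
# Sketch of toggling shard allocation around a rolling reboot.
# ES_URL is a hypothetical node; the setting is the standard
# cluster.routing.allocation.enable switch in _cluster/settings.
import json
from urllib import request

ES_URL = "https://es-node-1.example.edu:9200"   # hypothetical node

def allocation_payload(enable):
    """Body for PUT _cluster/settings. 'primaries' pauses replica
    shuffling during the reboot; None clears the transient override."""
    return {"transient": {"cluster.routing.allocation.enable": enable}}

def set_allocation(enable):
    """Send the toggle to the cluster (needs a live cluster to run)."""
    req = request.Request(
        f"{ES_URL}/_cluster/settings",
        data=json.dumps(allocation_payload(enable)).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    return request.urlopen(req)

# Before rebooting a node:  set_allocation("primaries")
# After it rejoins:         set_allocation(None)
```

One call before the reboot, one after, from a single system - which is why the maintenance burden stays so low.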
Ultimately it comes down to paying to be able to do more, faster, with Splunk, or devoting manpower/time to managing ELK. You pay either way; it just comes down to what you want to pay with.
I maintain that the management overhead is similar for all but the largest deployments (a couple of hundred nodes consuming hundreds of thousands of events per second at multiple terabytes per day). I know of one company with > 500 indexers and search heads in their Splunk environment; they have forty full-time staff to manage Splunk *only*, and those forty never touch the actual hardware (because it's sitting in a cloud data centre). FireEye have separate Elastic clusters for each of their TAP customers, some in excess of 125 ES nodes - I dare say they have a VERY high server-to-admin ratio. With proper automation and configuration management, it shouldn't matter - pushing two nodes is only a little less work than pushing twenty or two hundred.

Orgs that are indexing a terabyte+ per day into Splunk or QRadar or ArcSight, and that have lots of processes built around it, are probably not in a position to augment their SIEM, because they've been in their current offering long enough that they've *probably* built around its deficiencies. It's the others who have big decisions to make - the ones who need to really think about what they want to accomplish and what they *can* accomplish with presumably very little human capital.

I've had two dozen people tell me that "limited people resources means we need to buy something like QRadar or Splunk + ES". No no no no no. What they NEED is to understand the difference between schema-on-read and schema-on-write; how each one ties in with basic full-text indexing, searching and reporting; how vertical and horizontal scaling affect their deployment and growth strategies; how enrichment will affect their search, reporting and hardware expectations; and how a single interesting network event can blow through their license (Splunk has thankfully addressed this), knock their SIEM over for <minutes/hours/days> or cause <x amount> of existing log data to be lost.
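The schema-on-read versus schema-on-write distinction can be shown in a few lines. This is a toy illustration in Python - not Splunk's or Elastic's actual internals - using an invented key=value log line to show what each approach can answer later.

```python
# Toy contrast of schema-on-write vs schema-on-read.
# The event format and field names are invented for illustration.
import re

RAW = ["src=10.1.2.3 user=ron action=pass bytes=1234"]

# Schema-on-write: you pick the fields at ingest; those are the
# only fields your index knows about afterwards.
INDEX_TIME_FIELDS = ("src", "user")            # decided before ingest

def index_on_write(raw_events):
    indexed = []
    for event in raw_events:
        fields = dict(re.findall(r"(\w+)=(\S+)", event))
        indexed.append({k: fields[k] for k in INDEX_TIME_FIELDS})
    return indexed                              # "bytes" never made it in

# Schema-on-read: keep the raw event and extract whatever the
# current question needs at search time.
def search_on_read(raw_events, wanted):
    results = []
    for event in raw_events:
        fields = dict(re.findall(r"(\w+)=(\S+)", event))
        results.append({k: fields.get(k) for k in wanted})
    return results                              # new questions, no re-ingest
```

The trade is the usual one: schema-on-write buys fast, predictable searches over fields you anticipated; schema-on-read buys the ability to ask about fields you didn't, at the cost of doing the extraction on every search.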
They NEED to consider whether they're going to buy something for a half-million dollars that checks <x> boxes, or whether they can spend 1/3 of that and get 90% of those same boxes checked with a similar time investment.
From a usability and efficiency perspective, it's pointless for me to say "schema-on-write is better than schema-on-read" (or the reverse) unless I can show specific examples of why, or "make sure you enrich before you index" unless I can show them where it will save them time (and analysis money) down the line. They need to understand the difference between giving their analysts and admins a dashboard with a log event that looks like this:

{
  "query": "foo.com",
  "client": "10.1.2.3"
}

and something that looks like this within seconds of the logged query being sent to the SIEM:

{
  "query": "foo.com",
  "added_query_time": "2017-01-01T12.32.123445Z",
  "added_response_time": "325 ms",
  "added_org_dns_server": "10.20.30.40",
  "added_auth_dns_server": "98.76.54.32",
  "added_auth_dns_asn_name": "Bulletproof Hosting Subsidiary",
  "added_domain_registrant": "Some Random Company",
  "added_domain_age_days": "12",
  "added_domain_entropy": "0.2",
  "added_domain_ip": "12.34.56.78",
  "added_domain_ip_asn": "Bulletproof Hosting",
  "added_domain_ip_geoip_city": "London",
  "added_domain_ip_geoip_country": "England",
  "added_domain_first_seen": "NEW",
  "added_domain_in_intel": "true",
  "added_domain_intel_reason": "coinminer malware",
  "client": "10.1.2.3",
  "added_client_os": "macOS",
  "added_client_type": "macbook pro",
  "added_client_ou": "IT - Linux Systems",
  "added_client_location": "Some building on your campus"
}

Furthering that understanding, and helping make sure folks make informed architectural/implementation decisions that make life easier for their analysts, is the whole reason for my pre-con workshop at SPC in a few months and why I started the SIEM From Scratch project over Christmas.

kmw
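The shape of the pipeline that turns the first event into the second can be sketched in a few lines of Python. The lookup functions below are stubs with invented return values - in a real deployment they would be logstash filter stages (or equivalent) querying GeoIP, whois, threat intel and asset inventory; only the field names come from the example above.

```python
# Minimal sketch of ingest-time enrichment. The lookups are stubs with
# invented data; a real pipeline would query GeoIP/whois/intel/inventory.

def lookup_domain(domain):
    """Stub domain intel lookup (whois, GeoIP, threat feeds)."""
    return {
        "added_domain_age_days": "12",
        "added_domain_in_intel": "true",
        "added_domain_intel_reason": "coinminer malware",
    }

def lookup_client(ip):
    """Stub asset lookup (inventory, AD/LDAP)."""
    return {
        "added_client_os": "macOS",
        "added_client_ou": "IT - Linux Systems",
    }

def enrich(event):
    """Return the event plus added_* fields, computed once at ingest so
    every later search and dashboard gets them for free."""
    enriched = dict(event)
    enriched.update(lookup_domain(event["query"]))
    enriched.update(lookup_client(event["client"]))
    return enriched
```

The point of doing this between the message buffer and the indexer, rather than at search time, is that the expensive lookups happen once per event instead of once per search - and the enriched copy is what gets indexed, so the context is still there even if the intel source is unreachable later.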