BreachExchange mailing list archives

Needle in a Haystack: Harnessing Big Data for Security


From: Audrey McNeil <audrey () riskbasedsecurity com>
Date: Mon, 16 Sep 2013 13:55:23 -0600

http://www.technewsworld.com/story/Needle-in-a-Haystack-Harnessing-Big-Data-for-Security-78961.html


The combination of the polymorphic nature of malware, failure of
signature-based security tools, and massive amounts of data and traffic
flowing in and out of enterprise networks is making threat management using
traditional approaches virtually impossible.

Until now, security has been based largely on the opinions of researchers
who investigate attacks through reverse engineering, homegrown tools and
general hacking. In contrast, the Big Data movement makes it possible to
analyze an enormous volume of widely varied data to prevent and contain
zero-day attacks without details of the exploits themselves. The four-step
process outlined below illustrates how Big Data techniques lead to
next-generation security intelligence.

Information Gathering

Malware is transmitted between hosts (e.g., server, desktop, laptop, tablet,
phone) only after an Internet connection is established. Every Internet
connection begins with the three Rs: Request, Route and Resolve. The
contextual details of the three Rs reveal how malware, botnets and phishing
sites relate at the Internet layer, not simply at the network or endpoint
layer.

Before users can publish a tweet or update a status, their device must
resolve the IP address currently linked to a particular domain name (e.g.,
www.facebook.com) within a Domain Name System record. With extremely few
exceptions, every application, whether benign or malicious, performs this
step.

Multiple networks then route this request over the Internet, but the two
hosts never connect directly. Internet Service Providers connect the hosts
and route data between their networks using the Border Gateway Protocol.
Once the connection is established, content is transmitted.

If researchers can continuously store, process, and query data gathered
from BGP routing tables, they can identify associations for nearly every
Internet host and publicly routable network. If they can do the same for
data gathered from DNS traffic, they can learn both current and historical
Host IP Address/Host Name associations across nearly the entire Internet.

By combining these two Big Data sets, researchers can relate any host's
name, address, or network to another host's name, address, or network. In
other words, the data describes the current and historical topology of the
entire Internet -- regardless of device, application, protocol, or port
used to transmit content.
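
As a rough illustration of combining the two data sets, the sketch below
joins a handful of DNS observations with a small BGP prefix table, using a
longest-prefix match to find the origin network of each address. The sample
records, the names DNS_RECORDS and BGP_PREFIXES, and both helper functions
are assumptions made for this example; they are not taken from any vendor's
system, and real inputs would be far larger.

```python
import ipaddress
from collections import defaultdict

# Hypothetical sample data. Real inputs would come from DNS traffic logs
# and BGP routing-table dumps. Addresses and ASNs use documentation ranges.
DNS_RECORDS = [
    # (host name, resolved IP address, date first seen)
    ("www.example.com", "192.0.2.10", "2013-09-01"),
    ("mail.example.com", "192.0.2.25", "2013-09-02"),
    ("shop.example.net", "198.51.100.7", "2013-09-03"),
    ("cdn.example.org", "192.0.2.200", "2013-09-04"),
]
BGP_PREFIXES = [
    # (announced prefix, origin ASN)
    ("192.0.2.0/24", 64500),
    ("192.0.2.128/25", 64501),
    ("198.51.100.0/24", 64502),
]


def origin_asn(ip, prefixes):
    """Return the origin ASN of the longest announced prefix covering ip."""
    addr = ipaddress.ip_address(ip)
    best = None  # (prefix length, asn) of the most specific match so far
    for prefix, asn in prefixes:
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, asn)
    return best[1] if best else None


def hosts_by_asn(dns_records, prefixes):
    """Group host names by the network that announces their address."""
    grouped = defaultdict(set)
    for name, ip, _first_seen in dns_records:
        grouped[origin_asn(ip, prefixes)].add(name)
    return dict(grouped)


NEIGHBORS = hosts_by_asn(DNS_RECORDS, BGP_PREFIXES)
```

With this mapping in hand, a query on any one name, address or network can
be expanded to the other hosts that share it. The linear scan over prefixes
is chosen for readability only; a full routing table would call for a
prefix tree or similar index.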

Extracting Actionable Information

While storing contextual details on a massive volume of Internet
connections in real-time is no easy task, processing this data in order to
extract useful information about an ever-changing threat landscape might be
nearly impossible. There is an art to querying these giant data sets in
order to find the needles in the haystack.

First, start with known threats. It's possible to learn about these from
multiple sources, such as security technology partners or security
community members that publicly share discoveries on a blog or other media
site.

Second, form a hypothesis. Analyze known threats to develop theories on how
criminals will continue to exploit the Internet's infrastructure to get
users or their infected devices to connect to malware, botnets and phishing
sites. Observing patterns and statistical variances regarding the requests,
routes and resolutions for malicious hosts is one of the keys to predicting
the presence and behavior of malicious hosts in the future.

Spatial patterns can reveal malicious hosts, since they often share
infrastructure with other malicious websites -- for example, the same
publicly routable network (an autonomous system, identified by its ASN),
geographic location, domain name, IP address, or name server storing the
DNS record. Infected devices connect with these hosts more often than clean
devices do.
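
One simple way to express a spatial pattern is to count how many known
malicious hosts share each piece of infrastructure with a candidate host.
The sketch below does only that. The Host record, its fields and the sample
hosts are hypothetical, and a real system would weigh many more attributes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    """Hypothetical record of the infrastructure behind one host name."""
    name: str
    ip: str
    asn: int
    name_server: str


# Made-up hosts standing in for a list of previously confirmed threats.
KNOWN_MALICIOUS = [
    Host("bad-one.example", "203.0.113.5", 64510, "ns1.bad-dns.example"),
    Host("bad-two.example", "203.0.113.5", 64510, "ns2.bad-dns.example"),
    Host("bad-three.example", "203.0.113.77", 64511, "ns1.bad-dns.example"),
]


def shared_infrastructure(candidate, known_bad):
    """Count, per attribute, the known-bad hosts that match the candidate."""
    counts = {"ip": 0, "asn": 0, "name_server": 0}
    for bad in known_bad:
        for attr in counts:
            if getattr(candidate, attr) == getattr(bad, attr):
                counts[attr] += 1
    return counts


candidate = Host("new-site.example", "203.0.113.5", 64510,
                 "ns1.bad-dns.example")
clean = Host("clean.example", "192.0.2.1", 64500, "ns.clean.example")

candidate_overlap = shared_infrastructure(candidate, KNOWN_MALICIOUS)
clean_overlap = shared_infrastructure(clean, KNOWN_MALICIOUS)
```

A high count is a reason to look closer, not proof: shared hosting providers
place many unrelated sites on the same address and network.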

Temporal patterns can be used to identify malicious hosts by showing
evidence of irregular connection request volume or new domains with sudden
high spikes in volume immediately after domain registration. Statistical
variances, such as a domain name with abnormal entropy (gibberish), can
also reveal malicious hosts.
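
Both signals are easy to approximate. The sketch below computes the Shannon
entropy of a domain name's characters and checks for a query spike in the
first days after registration. The function names, the three-day window and
the threshold of 1,000 queries are illustrative assumptions, not values
from the article.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy of a string, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def domain_entropy(domain):
    """Entropy of everything before the final dot (the TLD is dropped)."""
    return shannon_entropy(domain.rsplit(".", 1)[0])


def spikes_after_registration(daily_queries, window_days=3, threshold=1000):
    """True if any of the first window_days reaches the query threshold.

    daily_queries[0] is assumed to be the count on the registration day.
    """
    return any(count >= threshold for count in daily_queries[:window_days])


gibberish = domain_entropy("xk7qz2vb9wjd.com")
readable = domain_entropy("google.com")
early_spike = spikes_after_registration([5, 4000, 3500, 20])
late_spike = spikes_after_registration([5, 8, 12, 4000])
```

Entropy grows with the number of distinct characters, so short names and
long names are not directly comparable, and a fixed threshold would need
tuning against real traffic.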

Third, process the data -- repeatedly. On the Internet, threats are always
changing. Processing a constant flow of new data calls for a machine-learning
system that adapts in real time. It needs classifiers that are each based on
a hypothesis. Alternatively, the data can be clustered based on general
objects and elements. Either way, training requires a positive set of known
malicious hosts as well as a negative set of known benign hosts.
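
To make the idea of an adaptable classifier concrete, here is a minimal
sketch of an online logistic-regression model that is updated one labeled
host at a time. The article does not say which algorithms are used, so this
technique is a stand-in chosen for brevity. The class name, the two feature
columns and the training values are all invented for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class OnlineLogisticClassifier:
    """Logistic regression trained by one gradient step per example."""

    def __init__(self, n_features, learning_rate=0.5):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, features):
        """Estimated probability that the host is malicious."""
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return sigmoid(z)

    def update(self, features, label):
        """Apply one update; label is 1 for malicious, 0 for benign."""
        error = self.predict_proba(features) - label
        step = self.learning_rate * error
        self.weights = [w - step * x for w, x in zip(self.weights, features)]
        self.bias -= step


# Hypothetical features per host, both scaled to the range 0..1:
# (name randomness, share of neighboring hosts already known to be bad)
POSITIVE_SET = [[0.9, 0.8], [0.8, 0.9]]  # known malicious hosts
NEGATIVE_SET = [[0.2, 0.1], [0.3, 0.0]]  # known benign hosts

model = OnlineLogisticClassifier(n_features=2)
for _ in range(200):  # repeated passes stand in for a stream of new data
    for features in POSITIVE_SET:
        model.update(features, 1)
    for features in NEGATIVE_SET:
        model.update(features, 0)

likely_bad = model.predict_proba([0.85, 0.85])
likely_clean = model.predict_proba([0.1, 0.1])
```

Because each update touches only one example, newly confirmed hosts can be
folded in as they arrive without retraining from scratch.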

Fourth, run educated queries to reveal patterns and test hypotheses. After
processing, the data becomes actionable, but there may be too much
information to effectively validate hypotheses. At this stage,
visualization tools can help to organize the data and bring meaning to the
surface.

For instance, a researcher may query one host attribute, such as its domain
name, but receive multiple scored features, one or more from each
classifier. Each score or score combination can be categorized as
malicious, suspicious or benign and then fed back into the machine-learning
system to improve threat predictions.
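
A minimal sketch of the categorization step, assuming each classifier
returns a score between 0 and 1 and that the scores are simply averaged.
The classifier names, the averaging rule and the two cut-off values are
assumptions for illustration only.

```python
def categorize(scores, malicious_at=0.8, benign_below=0.3):
    """Map per-classifier scores to malicious, suspicious or benign.

    scores: dict of classifier name -> score in the range 0..1.
    """
    combined = sum(scores.values()) / len(scores)
    if combined >= malicious_at:
        return "malicious"
    if combined < benign_below:
        return "benign"
    return "suspicious"


def training_label(category):
    """Label to feed back into training, or None if still unresolved."""
    return {"malicious": 1, "benign": 0}.get(category)


verdict_a = categorize({"spatial": 0.9, "temporal": 0.95, "entropy": 0.85})
verdict_b = categorize({"spatial": 0.5, "temporal": 0.6, "entropy": 0.4})
verdict_c = categorize({"spatial": 0.1, "temporal": 0.2, "entropy": 0.0})
```

In this sketch only clear verdicts produce a training label; a host that
lands in the suspicious band yields None and waits for validation.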

When a host is categorized as "suspicious," there is a possibility of a
false positive, which could result in employee downtime for customers of
Internet security vendors. Therefore, continuous training and retraining of
the machine-learning system is required to positively determine whether a
host is malicious or benign.

Host Validation

The process of determining whether suspicious hosts are malicious or benign
can be cost- and resource-prohibitive. To validate threats across the
entire Internet would require an army of analysts. The good news is that
there are thousands of potential analysts in the security community,
including security-savvy customers. The bad news is that security vendors
typically keep their threat intelligence to themselves and guard it as core
intellectual property.

A different approach is to move from unidirectional relationships with
customers to multidirectional communication and communities. Crowdsourcing
threat intelligence requires an extension of trust to customers, partners
and other members of a security vendor's ecosystem, so the vendor must
provide dedicated support to train and certify the crowdsourced researchers.

However, the upside potential is significant. Given an anointed team of
researchers across the globe, the reach and visibility into real-time
threats will expand, along with the ability to quickly and accurately
respond, minute by minute, day by day, to evolving threats.

As for tactical requirements, the community needs access to query tools
similar to those used by the vendor's own expert researchers. A simplified
interface would display threat predictions with all the relevant security
information, related meta-scores and data visualizations, and allow the
volunteer to confirm or reject a host as malicious.
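
How the confirm and reject decisions might be combined is not described,
but one plausible approach is a trust-weighted vote. The sketch below
assumes each certified volunteer has a numeric trust weight and that a
verdict requires both a minimum number of votes and a clear majority. The
function name, the weights and both thresholds are hypothetical.

```python
def crowd_verdict(votes, trust, min_votes=3, margin=0.6):
    """Combine confirm/reject votes into a verdict for one host.

    votes: list of (analyst id, True if the analyst confirmed "malicious").
    trust: dict of analyst id -> weight; unknown analysts count as zero.
    """
    if len(votes) < min_votes:
        return "undecided"
    total = sum(trust.get(analyst, 0.0) for analyst, _ in votes)
    if total == 0:
        return "undecided"
    confirmed = sum(trust.get(analyst, 0.0)
                    for analyst, is_malicious in votes if is_malicious)
    share = confirmed / total
    if share >= margin:
        return "malicious"
    if share <= 1 - margin:
        return "benign"
    return "undecided"


TRUST = {"alice": 1.0, "bob": 0.5, "carol": 0.5}

mostly_confirmed = crowd_verdict(
    [("alice", True), ("bob", True), ("carol", False)], TRUST)
all_rejected = crowd_verdict(
    [("alice", False), ("bob", False), ("carol", False)], TRUST)
too_few = crowd_verdict([("alice", True), ("bob", True)], TRUST)
```

Hosts that stay undecided would go back to the vendor's own researchers,
which keeps a small or split crowd from blocking a legitimate site.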

Applying Threat Intelligence

Threat intelligence derived from Big Data can prevent device infections,
network breaches and data loss. As advanced threats continue to proliferate
at an uncontrollable rate, it becomes vital that the security industry
evolve to stay one step ahead of criminals.

The marriage of Big Data analytics, science and crowdsourcing is making it
possible to achieve near real-time detection and even prediction of
attacks. Big Data will continue to transform Internet security, and it's up
to vendors to build products that effectively harness its power.