oss-sec mailing list archives

Re: Offset2lib: bypassing full ASLR on 64bit Linux


From: Loganaden Velvindron <loganaden () gmail com>
Date: Wed, 10 Dec 2014 09:03:20 +0400

On Wed, Dec 10, 2014 at 12:24 AM, Daniel Micay <danielmicay () gmail com> wrote:
On 09/12/14 11:18 AM, Steve Grubb wrote:

I studied this area 2 years ago for a gray hat talk and in preparation to help
set the policy going forward for Fedora and RHEL. The general reason I've
heard for why it's not used as fully as possible is that it adds
memory pages that can't be coalesced or consolidated because they are no
longer identical across processes.

AFAIK, it doesn't cause a significant increase in memory usage. The
whole point of position independent code is that it can be reused across
processes. Dynamic libraries are already fully position independent.

Windows uses a relocation-only model rather than PIC, so their full ASLR
has zero runtime cost after start-up. The downside is that they use an awful
hack to work around the reuse issue. It caches the random base and
reuses it for all instances of the library / executable.

For Fedora and RHEL 7, the intended policy is PIE for all daemons, privileged
apps, network-facing apps, and parsers of untrusted media. Enforcement is not so
easy. How do you identify in an automated way parsers of untrusted media? I
have a script that can grade an installed system that uses rpms:

http://people.redhat.com/sgrubb/files/rpm-chksec

It has options to grade the system or an individual rpm for compliance with
the intended policy.

Why not use it across the board on x86_64? The cost is in the range of
0-5% at the moment (usually ~1%) and will be reduced to nearly 0% in
every case when the next GCC / binutils versions are released since it
won't add indirection to global accesses.
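
A quick way to see what PIE buys you is to print the address of a
function across runs: built with -fPIE -pie the text base changes on
every execution, while a non-PIE binary always loads at the same fixed
address. A minimal sketch (not from the measurements below):

    #include <stdio.h>

    /* Build once normally and once with `gcc -fPIE -pie`, then run a
     * few times and compare the printed addresses. */
    int main(void) {
        printf("main is at %p\n", (void *)main);
        return 0;
    }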

During my research, I found a couple of interesting things. These ASLR-related
tidbits below are pulled from the speech I gave about when open source
security mechanisms don't work as intended (all measured on a 64-bit system):

1) On non-PIE applications, the heap doesn't get much randomization. Just 14
bits.

It seems that the dss section (sbrk) isn't randomized at all on a
non-PaX kernel.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>   /* sbrk() */

    int main(void) {
        /* print the current program break; compare across runs */
        printf("%p\n", sbrk(0));
        return 0;
    }

2) Also, on non-PIE applications there seems to be some bias in the numbers
chosen. This could be an effect of the 14 bits of randomization. The addresses
did follow a bell curve, such that guessing some addresses was much more likely
to succeed than guessing others. (I did not get a sample size large enough on
PIE apps to see if the same bias could be measured.)

3) When using PIE, you pretty much got 29 bits of randomness everywhere. That
led to the question of why the heap on non-PIE is so limited in address
scope. As I remember, there was some coupling with sbrk() that caused
this... which might need revisiting.

Linux puts the dss section near the start of the address space because
it grows up and the mmap base is chosen near the end of the address
space. In an ideal world, dss wouldn't have existed at all. PaX has
*much* higher entropy for mmap and I think it occasionally inserts a gap
rather than just using a random base.
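
A rough sketch (just to make the layout visible, the region labels are
my own) is to print one address from each region and compare them: the
brk/dss heap sits near the low end just above the executable, while
mmap allocations and the stack sit near the high end.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        int on_stack;
        void *small = malloc(16);   /* usually served from the brk heap by glibc */
        void *map = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        printf("text  %p\n", (void *)main);       /* executable image */
        printf("brk   %p\n", sbrk(0));            /* program break (dss) */
        printf("small %p\n", small);
        printf("mmap  %p\n", map);                /* mmap base region */
        printf("stack %p\n", (void *)&on_stack);
        return 0;
    }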

4) Then I started wondering about the heap when you use other memory manager
libraries such as jemalloc. This turned out to be interesting. You get about
19 bits of randomness using it. It's not as bad as non-PIE glibc but not as
good as PIE glibc. You also got the same amount of randomness whether the app
was PIE or not. This is an area ripe for more experimenting, exploiting, and
patching. Supposedly some of these heap managers use mmap as the underlying
allocator. So, why aren't they getting 29 bits, too? :-)

In jemalloc, *all* memory is allocated via chunks and in a default build
it never unmaps them. It does virtual memory management via a global
red-black tree tracking extents of chunks. The default is 4M chunks
aligned to 4M boundaries. It can obtain chunks from sbrk until failure
and then use mmap, but the default is the opposite (mmap first, falling
back to sbrk).

It can distinguish a huge allocation (>= 4M) from the small/large ones
managed inside a chunk with a header by checking for 4M alignment, and
can then find the metadata for the small/large allocs via an offset from
the chunk base - which is how it keeps metadata overhead at ~2% even for
small sizes.
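
The lookup is just pointer arithmetic on the chunk alignment; a
simplified sketch of the idea (not jemalloc's actual code):

    #include <stdbool.h>
    #include <stdint.h>

    #define CHUNK_SIZE ((uintptr_t)4 << 20)   /* 4M chunks, 4M aligned */
    #define CHUNK_MASK (CHUNK_SIZE - 1)

    struct chunk_header;   /* per-chunk metadata lives at the chunk base */

    /* A chunk-aligned pointer is a huge allocation in its own right. */
    bool is_huge(const void *ptr) {
        return ((uintptr_t)ptr & CHUNK_MASK) == 0;
    }

    /* Anything else lives inside a chunk: masking off the low bits of
     * the pointer yields the chunk header holding its metadata. */
    struct chunk_header *chunk_of(const void *ptr) {
        return (struct chunk_header *)((uintptr_t)ptr & ~CHUNK_MASK);
    }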

Reducing the chunk size wouldn't hurt much for small/large allocations,
but it would wipe out the large size classes and turn them into huge
allocations, where global locking is required.

You've made me realize that jemalloc has a potential security issue
here. It uses getenv without an issetugid check for the MALLOC_CONF
environment variable, so an attacker could control some aspects of the
allocator for a setuid / setcap binary. I'll send a pull request fixing this.
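
The usual pattern is to ignore the variable entirely in set-id
processes, e.g. via secure_getenv() on glibc or issetugid() on the
BSDs. A sketch of that general pattern (not the actual jemalloc change):

    #define _GNU_SOURCE   /* for secure_getenv() on glibc */
    #include <stdlib.h>
    #include <unistd.h>

    const char *malloc_conf_from_env(void) {
    #ifdef __GLIBC__
        /* returns NULL when the process is privileged / set-id */
        return secure_getenv("MALLOC_CONF");
    #else
        if (issetugid())
            return NULL;   /* setuid/setgid: don't trust the environment */
        return getenv("MALLOC_CONF");
    #endif
    }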

OpenBSD malloc does ASLR at the allocator level, which is quite neat. It
uses a similar chunk allocation model, but the chunks are always page-sized
and the number of cached chunks is limited.


OpenBSD's malloc is awesome. The code is very clean and easy to
understand. I hope that other operating systems will consider
borrowing from it!




-- 
This message is strictly personal and the opinions expressed do not
represent those of my employers, either past or present.

