WebApp Sec mailing list archives

Re: Combatting automated download of dynamic websites?


From: Javier Fernandez-Sanguino <jfernandez () germinus com>
Date: Tue, 30 Aug 2005 13:24:53 +0200

Matthijs R. Koot wrote:

Thanks for your reply zeno! But actually, referer-based anti-leeching
won't do it for me, and mod_throttle isn't suitable for Apache 2. I need
a throttling function based on something more advanced, like a
'request history stack' that checks the order in which pages were
requested, probably within a certain time period, et cetera. Maybe it'd
be better to move such security measures into the web application
itself, but I'm still hoping someone knows of a service-based solution
(i.e. like the aforementioned Apache module).

Several web-oriented proxy firewalls implement a "request history stack" like the one you mention, to prevent a client IP address from going directly to a given resource without following the "flow" established by the webapp programmer.
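
Just to illustrate the idea (the page names are made up and a real proxy firewall tracks far more than this), such a check boils down to remembering the last page each client fetched and only serving a protected page right after one of its expected predecessors:

# Rough sketch of a per-client "request history stack".
# Page names are illustrative only.
ALLOWED_PREDECESSORS = {
    "/book/details":  {"/book"},
    "/book/contents": {"/book/details"},
}

last_page = {}  # client IP -> last page requested

def request_allowed(client_ip, page):
    previous = last_page.get(client_ip)
    required = ALLOWED_PREDECESSORS.get(page)
    # Pages with no defined flow are always reachable.
    allowed = required is None or previous in required
    if allowed:
        last_page[client_ip] = page
    return allowed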

You could implement this yourself by way of session handling: tie session identifiers to the client (through IP or User-Agent) and then check, at the application level, whether the session is being used as you would normally expect. Don't rely on referer information; instead, tie the session to some kind of finite state machine that tells you whether the user went through your defined procedure. In your Amazon example: the user must first look at the book, then at the book details, and only then is allowed to browse its contents.
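
A minimal sketch of that state machine, with made-up page paths and state names and no particular framework in mind:

# Tie each session to a finite state machine encoding the defined
# procedure: book -> book details -> contents.
# page requested -> (states it may be requested from, state afterwards)
TRANSITIONS = {
    "/book":          ({"start", "viewed_book", "viewed_details", "browsing"}, "viewed_book"),
    "/book/details":  ({"viewed_book"},               "viewed_details"),
    "/book/contents": ({"viewed_details", "browsing"}, "browsing"),
}

sessions = {}  # session id -> {"state": ..., "ip": ..., "agent": ...}

def check_request(session_id, client_ip, user_agent, page):
    sess = sessions.setdefault(
        session_id, {"state": "start", "ip": client_ip, "agent": user_agent})
    # Stick the session to the client: same IP and User-Agent only.
    if sess["ip"] != client_ip or sess["agent"] != user_agent:
        return False
    if page not in TRANSITIONS:
        return True   # pages outside the protected flow are not restricted
    allowed_from, next_state = TRANSITIONS[page]
    if sess["state"] not in allowed_from:
        return False  # the user skipped a step in the defined procedure
    sess["state"] = next_state
    return True

The "/book" entry accepts any state so the user can go back and start the flow again with another book; a real application would also expire idle sessions.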

Of course, a user can reuse his session ID and spoof the identifiers (User-Agent) in an alternative download tool and still retrieve the content in the end, but it might raise the bar somewhat. I'm not aware of the capabilities of Teleport Pro or similar software, but I would defeat those checks by implementing a targeted web crawler with Perl's LWP::UserAgent.
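
The spoofing itself is trivial; sketched here with Python's standard urllib rather than LWP::UserAgent, with a made-up host, cookie name and header values:

# Walk the expected flow with a reused session cookie and a
# browser-looking User-Agent so the server-side checks are satisfied.
import urllib.request

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 5.1)",  # pretend to be a browser
    "Cookie": "SESSIONID=abcdef0123456789",        # replay a valid session
}

def fetch(url):
    req = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Request the pages in the order the state machine expects.
for page in ("/book", "/book/details", "/book/contents"):
    fetch("http://www.example.com" + page)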

If you want to stop even a determined (malicious?) user from retrieving the content, then you will want to impose resource limits, as suggested elsewhere in the thread. The problem is that you can only tie those limits to the IP address (all other browser-presented information is spoofable), and some IP addresses (dynamic ranges from ISPs) have only one "client" behind them while others (ISPs' transparent proxies, company proxies) might have many. So you either monitor usage, investigate deviations and raise the limits for those IP addresses that are legitimately more resource-intensive, or you risk blocking legitimate users from accessing the content in the latter situation (i.e. proxies used by a large number of users).
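
A crude sketch of such a per-IP limit, with made-up thresholds, window and proxy addresses, where known proxies get a higher allowance:

# Per-IP request limits over a sliding window; addresses known to front
# many users (e.g. a customer's proxy) get a higher limit.
import time
from collections import defaultdict, deque

WINDOW = 60          # seconds
DEFAULT_LIMIT = 30   # requests per window for an ordinary client address
PROXY_LIMIT = 300    # higher limit for known multi-user proxies
KNOWN_PROXIES = {"203.0.113.10", "203.0.113.11"}

recent = defaultdict(deque)  # client IP -> timestamps of recent requests

def over_limit(client_ip):
    now = time.time()
    times = recent[client_ip]
    while times and now - times[0] > WINDOW:
        times.popleft()
    times.append(now)
    limit = PROXY_LIMIT if client_ip in KNOWN_PROXIES else DEFAULT_LIMIT
    return len(times) > limit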

My 2c.

Javier

