Feb. 5, 2010

A GIL Adventure (with a happy ending)

I just halved the running time of one of my test suites.

The tests in question are multi-threaded, and while they perform a lot of IO they still push the CPU pretty hard. For some time now, nose has been reporting a happy little message along these lines:

Ran 35 tests in 24.893s

I wouldn't have thought anything of it, but every so often this number would drop dramatically – often down to as little as 15 seconds. After a lot of puzzling, I realised that the tests would run faster whenever I had another test suite running at the same time. Making my computer work harder made these tests run almost twice as fast!

Could it be? Yes, I was finally seeing a manifestation of Python's dreaded Global Interpreter Lock - a.k.a. the "GIL of Doom". Because I'm running on a dual core system, the different threads in this test suite were spreading themselves over both processors and engaging in an epic GIL Battle that bogged down the whole process.

The typical response to this awful multi-core behaviour is "just use multiprocessing". That's not an option here, not least because these tests are supposed to be checking the thread safety of my code!
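For readers who want to poke at this kind of contention themselves, here is a minimal sketch (not the test suite from this post) that times the same CPU-bound work run sequentially and then spread across threads. The function names and the loop size are made up for illustration, and the timings depend heavily on the interpreter version and the hardware, so no particular result is promised.

```python
import threading
import time


def count_down(n):
    # Pure-Python CPU-bound loop; the running thread holds the GIL while it works.
    while n > 0:
        n -= 1


def run_sequential(n, workers):
    # Do the work `workers` times, one after another, in the calling thread.
    start = time.perf_counter()
    for _ in range(workers):
        count_down(n)
    return time.perf_counter() - start


def run_threaded(n, workers):
    # Do the same total work, but with one thread per unit of work.
    start = time.perf_counter()
    threads = [threading.Thread(target=count_down, args=(n,)) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 2_000_000  # arbitrary size, chosen only so the run takes a noticeable time
    print("sequential: %.3fs" % run_sequential(n, 2))
    print("threaded:   %.3fs" % run_threaded(n, 2))
```

If the threaded run is no faster than the sequential one (or is slower), the threads are competing for the GIL rather than running in parallel.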

Continue reading...

Sept. 9, 2009
[Python]

Mimetypes and Threading don't mix

I've just spent weeks (yes, weeks) battling a bug that turns out to have been caused by everyone's favourite broken stdlib module, mimetypes. I'm far from the first to be bitten by this module's strangeness – Jacob Rus has compiled a long list of reasons why the mimetypes module is pathologically broken, while Armin Ronacher recently got a 1000% speedup just by changing the way he imported things from the module (yes, 1000%).

So consider this another little heads-up about the mimetypes module: it doesn't play nice with threads. If two threads call mimetypes.guess_type at the same time, and the module happens to need to initialise its internal database, then one of the threads will go into an infinite recursive loop and blow your stack. What fun!

To be fair, the mimetypes module is slowly being converted into a healthy state, and this particular bug will be fixed in the next release. But in the meantime, if you need to do mimetype guesswork in Python, make sure you do it very carefully.
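One cautious way to be "very careful" is sketched below. It is an assumption on my part rather than a fix described in this excerpt: initialise the database once with mimetypes.init() before any threads start, and serialise later lookups behind a lock. The names safe_guess_type and _guess_lock are hypothetical helpers.

```python
import mimetypes
import threading

# Initialise the internal database once, in the main thread,
# before any worker threads have a chance to race on it.
mimetypes.init()

# Hypothetical module-level lock that serialises all lookups (belt and braces).
_guess_lock = threading.Lock()


def safe_guess_type(url):
    # Same return value as mimetypes.guess_type: a (type, encoding) tuple.
    with _guess_lock:
        return mimetypes.guess_type(url)


def guess_in_threads(filenames):
    # Run one lookup per thread and collect the results in input order.
    results = [None] * len(filenames)

    def worker(index, name):
        results[index] = safe_guess_type(name)

    threads = [threading.Thread(target=worker, args=(i, name))
               for i, name in enumerate(filenames)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The lock costs a little concurrency, but lookups are cheap once the database is loaded.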

Continue reading...

Aug. 16, 2009
[Django]

More Django Paranoia

As Ryan pointed out in response to my previous post on django-paranoid-sessions, the only way to truly prevent sniffing or man-in-the-middle attacks is to operate over a secure connection. Fair enough, but HTTPS ain't free. The general consensus seems to be that a secure connection is too much overhead for anything but the high-value or high-risk sections of your website (login submissions, payment processing, nuclear launch codes, etc).

Ideally, it should be possible to place selected sections of your website behind a secure connection and gain added attack-resistance for those sections, while still sharing session data with the rest of the site. Using a recommendation from the OWASP session management guide, the latest release of django-paranoid-sessions now lets you do exactly that.

The idea is to maintain a second randomly-generated session key that is only sent when the client connects over a secure channel. Unencrypted requests within your session are oblivious to the second key, but if a secure request doesn't provide both valid session keys then it is rejected. You can think of this extra key as a second "security enhanced" session that transparently piggybacks on top of the standard session data.
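The check described above can be sketched in plain Python. This is an illustration of the idea only, not the django-paranoid-sessions code; ParanoidSession, check_request and their parameters are hypothetical names.

```python
import hmac
import secrets


class ParanoidSession:
    """Hypothetical session record holding two independent random keys."""

    def __init__(self):
        self.session_key = secrets.token_hex(16)  # sent on every request
        self.secure_key = secrets.token_hex(16)   # only ever sent over a secure channel
        self.data = {}


def _same(a, b):
    # Constant-time comparison, done on bytes so any string input is accepted.
    return hmac.compare_digest(a.encode("utf-8"), b.encode("utf-8"))


def check_request(session, is_secure, session_key, secure_key=None):
    # Every request must present the ordinary session key.
    if not _same(session.session_key, session_key):
        return False
    # Unencrypted requests are oblivious to the second key.
    if not is_secure:
        return True
    # Secure requests are rejected unless the second key is also valid.
    if secure_key is None:
        return False
    return _same(session.secure_key, secure_key)
```

An attacker who sniffs the ordinary key from unencrypted traffic can still pass the first check, but fails the second on any secure request.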

Continue reading...

Aug. 15, 2009

Announcing: django-paranoid-sessions

Like most web frameworks, Django provides a convenient mechanism for storing data across requests in a persistent "session" object. Like most web frameworks, Django implements sessions using a simple mapping from a "session key" to a session object stored on the server. And like most web frameworks, Django's default session implementation is trivially vulnerable to session hijacking attacks.

Django's session implementation is quite similar to that provided by PHP; for all the gory details here is an excellent article on The Truth about Sessions, but the simplified version is as follows. When you first visit a Django-powered site, the server generates a random "session key" and returns it to your browser in a cookie. Any data that the server wants to remember about you (say, whether you have logged in and under what username) is stored in a giant dictionary indexed by the session key. On each subsequent visit your browser sends the key back to the server, which looks up your data in this dictionary and proceeds merrily on its way. The interaction looks something like the following:

  • You log in at the (hypothetical) Django-powered website http://www.my-todo-list.com/.
  • The server stores your login details in its session database, and sends back a session key of "123456".
  • You send a request to update your todo list, presenting a session key of "123456".
  • The server looks up "123456" in its session database, checks that the session is correctly logged in as you, and proceeds with the requested update.
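The steps above can be sketched as a toy session store. This is a simplified model for illustration, not Django's actual implementation; SESSIONS, login and handle_request are hypothetical names.

```python
import secrets

# The "giant dictionary": session key -> data remembered about the visitor.
SESSIONS = {}


def login(username):
    # Generate a random session key; the server would send it back in a cookie.
    key = secrets.token_hex(16)
    SESSIONS[key] = {"username": username, "logged_in": True}
    return key


def handle_request(session_key):
    # Look up the presented key; return the logged-in username, or None.
    session = SESSIONS.get(session_key)
    if session is None or not session.get("logged_in"):
        return None
    return session["username"]
```

Note that handle_request looks at nothing except the key it is given.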

It's a simple and convenient mechanism, but it has an important security issue: anyone who knows your session key can impersonate you to the server! Consider what happens next:

Continue reading...

Aug. 11, 2009

Making a Movable Django Project

Deploying Django projects is in general a straightforward affair, but it still suffers from a pain-point that's as old as web apps themselves: deploying at an arbitrary root URL. In my ideal world, I would push my shiny new Django project to the server, instruct Apache to mount it at "/my/shiny/app", and everything would just work – all URLs would magically have "/my/shiny/app" stripped off on their way into Django and prepended again on their way out. In the real world, Django comes pretty close to this ideal but stops just far enough short to be annoying.
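As a rough sketch of that ideal-world behaviour (not of anything Django or Apache actually provides), the strip-on-the-way-in and prepend-on-the-way-out steps might look like this. The helper names are hypothetical; the mount point is the example from the paragraph above.

```python
MOUNT_POINT = "/my/shiny/app"  # example mount point from the text


def strip_mount_point(request_path, mount_point=MOUNT_POINT):
    # Incoming: turn a deployment-level path into an application-level path.
    if request_path == mount_point:
        return "/"
    if request_path.startswith(mount_point + "/"):
        return request_path[len(mount_point):]
    raise ValueError("path is not under the mount point: %r" % request_path)


def add_mount_point(app_path, mount_point=MOUNT_POINT):
    # Outgoing: turn an application-level path into a deployment-level URL path.
    return mount_point + app_path


if __name__ == "__main__":
    inner = strip_mount_point("/my/shiny/app/todo/1/")
    print(inner)
    print(add_mount_point(inner))
```

The application only ever sees the inner path, so moving it means changing MOUNT_POINT in one place.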

First, here's what Django gets right: reverse(), permalink() and {% url %} are awesome. They introspect Django's runtime environment to translate an application-level name or object into a deployment-level URL. Your applications have no excuse for hard-coding URLs or even URL fragments. In theory, these three tools should be enough to make Django completely agnostic about its deployment location.

Now here's what Django gets wrong: some of its core components don't use them. Instead they use hard-coded URLs defined in the settings module, such as settings.ADMIN_MEDIA_PREFIX and settings.LOGIN_URL. Attempts to patch these components to avoid hard-coded URLs have been closed wontfix, so I guess we're stuck with them for a while.

Continue reading...