- PyPy's powerful just-in-time compiler, which can optimize the hot loops of your program into efficient native code.
By translating the PyPy interpreter into asm.js code, and by having its JIT backend emit specialized asm.js code at runtime, it should theoretically be possible to have an in-browser Python implementation whose hot loops perform within a factor of two of native code.
I'm excited to report a small but important milestone on the road to making this a reality.
It's certainly not a full Python interpreter, and it comes with many caveats and question-marks and todos, but I have been able to produce a simple demo interpreter, with JIT, that approaches the theoretical factor-of-two comparison to native code under some circumstances. There's a long way to go, but this seems like a very promising start.
TL;DR? Feel free to jump straight to the important graph.
The inspiration for this project was the article A JIT for Regular Expression Matching by PyPy developer Carl Friedrich Bolz. It demonstrates how to use the PyPy interpreter-building toolchain to generate a simple regular-expression matcher, and how to plug in PyPy's JIT compiler to give it a free speed boost.
My test code, adapted from that article, is available here: rematcher.py.
Essentially this file is a small "interpreter" for regular expressions, written using the same RPython toolchain as the full PyPy interpreter. I will highlight a couple of details below, but please see the linked article for a full explanation, it is well worth the read.
The program's internal state is a tree of objects that represent a particular regular expression, and it matches input strings by feeding them into this tree one character at a time. The main loop of the "interpreter" looks like this:
def match(re, s): """Match input string 's' against the given Regex object.""" # Empty strings only match empty regexes. if not s: return re.empty # Feed each character into the regex. result = re.shift(s, 1) i = 1 while i < len(s): result = re.shift(s[i], 0) i += 1 # Reset the regex to be used on the next string. re.reset() return result
$> rpython --backend=js --opt=2 ./rematcher.py [...lots and lots of compiler output...] $> $> js ./rematcher-js warning: successfully compiled asm.js code (total compilation time 203ms) Generating 1000 strings of length 50 ... Matching all strings against the regex... Done! Matching time for 1000 strings: 0.083000 Performed 12048.200800 matches per second. $>
The key output here is "matches per second", which gives the mean time taken to match a string against the regex. Higher numbers are better, and we'll be looking at this metric a lot.
To speed things up, I implemented a very basic asm.js JIT backend for RPython. The code is on my github fork of PyPy and is far from complete – it implements just enough of the RPython JIT functionality to successfully run this example.
From the point-of-view of our regex interpreter, we need only add a couple of hints to enable the JIT compiler. Specifically, we have to mark the entry-point of its main loop, like so:
jitdriver = jit.JitDriver(reds="auto", greens=["re"]) def match(re, s): if not s: return re.empty result = re.shift(s, 1) i = 1 while i < len(s): jitdriver.jit_merge_point(re=re) # <-- like this result = re.shift(s[i], 0) i += 1 re.reset() return result
This annotation does not affect the semantics of the code, but it tells RPython to generate all the extra hooks and control-flow to enable just-in-time compilation over this loop. Again, please see the original article for all the details. When compiled with this JIT backend enabled, the following will happen:
- The matcher will start running normally, but the JIT machinery will keep track of how many times the marked loop is executed.
- Once the execution count passes a certain threshold, the matcher will run for a single iteration in "tracing mode" to record each low-level operation that it performs.
- The JIT machinery analyzes and optimizes the trace, and passes it into my new asm.js JIT backend.
- The JIT backend renders the trace into a string of asm.js source code and invokes "new Function()" to compile it.
- Subsequent iterations of the loop execute the new specialized function rather than the generic matcher code.
Here's the result:
$> rpython --backend=js --opt=jit ./rematcher.py [...even MORE compiler output...] $> $> js rematcher-js warning: successfully compiled asm.js code (total compilation time 1450ms) Generating 1000 strings of length 50 ... Matching all strings against the regex... rematcher-jit.js:1827:0 warning: successfully compiled asm.js code (total compilation time 2ms) rematcher-jit.js:1827:0 warning: successfully compiled asm.js code (total compilation time 0ms) Done! Matching time for 1000 strings: 0.078000 Performed 14705.865441 matches per second. $>
Notice two additional lines about "successfully compiled asm.js code" when compared to the non-JIT version. This is the JIT machinery generating and compiling additional asm.js code at run-time.
The JIT-compilation does indeed result in a speedup for this test, but frankly it's pretty modest.
$> rpython --backend=c --opt=2 ./rematcher.py [...yet more compiler output...] $> $> ./rematcher-c Generating 1000 strings of length 50 ... Matching all strings against the regex... Done! Matching time for 1000 strings: 0.036899 Performed 27100.939483 matches per second. $>
And the JIT-enabled version:
$> rpython --backend=c --opt=jit ./rematcher.py [...yet more compiler output...] $> $> ./rematcher-c Generating 1000 strings of length 50 ... Matching all strings against the regex... Done! Matching time for 1000 strings: 0.008123 Performed 123108.423833 matches per second. $>
JIT compilation can be a delicate tradeoff in practice. The code spends more time up-front in order to monitor, trace, and compile frequently-executed loops, but we expect that this time will be more than compensated for by the increased performance of the compiled code. How much of the performance difference here is due to JIT compilation overhead, and how much to the performance of the generated code itself?
One simple way to determine this is to just run the program for longer. The performance cost of the JIT compilation is fixed, and doing more iterations of the loop lets you extract proportionally more benefit from that cost. Here is how the matches/second numbers look for a range of different running times:
As expected, the performance of the non-JIT versions is essentially constant regardless of the number of iterations. Both JIT-enabled versions start off performing worse than their respective counterparts, since they don't run for long enough to pay back the costs of compilation. But the matches/second figure steadily improves as the cost of compilation is amortized over more and more iterations.
(By the way, the code for running and graphing these benchmarks is here: jitbench.py)
There's a second knob we can twist on this particular test: the length of each individual string that is fed into the matcher. When the input strings are longer, the program will spend proportionally more of its time inside the main loop, rather than in the test harness or other supporting code.
Increasing the string length from 50 to 100 gives the following:
If we go all-out and try it on strings of length 1000, we get the following very exciting graph:
- The JIT-compiled asm.js code is indeed within a factor of two of the performance of native code.
It's a promising start. Compilation overheads I can work on and chip away at over time; raw performance of the generated code I can't do much about.
This really highlights the difference made by SpiderMonkey paying special attention to asm.js code. The node-js version still sees a performance improvement, but it is much less and grows at a much slower rate. It would be interesting to compare the latest version of V8 here as well; I simply haven't got around to that yet.
It's really pretty neat to consider the kind of nested interpretation and optimization that is taking place here:
- The hosted interpreter is in turn running its own "program" in the form of a regular expression, and is tracing and monitoring its execution.
- The host interpreter in turn will recognize this as hot code and JIT-compile it down to optimized machine code.
You'll pardon me if I can't resist a bit of a meme:
So where to from here?
There's certainly a long road still to travel before these preliminary results can apply to a full python interpreter. In roughly the order I plan to work on them, the next steps for PyPy.js will be:
- Implementing support for more JIT opcodes. This includes a lot of trickier things like garbage-collection, guard invalidation, and calling back into functions in the interpreter.
- Improving the performance of non-JIT code. I suspect there's low-hanging fruit here, especially with regards to memory-management.
- Making it run in the browser. Hopefully I can lean on the awesome work done by projects like Skulpt and Brython in producing a friendly and familiar interpreter interface.
blog comments powered by Disqus