Thursday, December 15, 2011

Box2D as a Measure of Runtime Performance

For those unfamiliar with it, Box2D is a great 2D physics library written by Erin Catto, which is at the core of a large number of casual games on consoles and mobile devices. Angry Birds is one you might have heard of, but there are many, many others.

It's also not a simple library by any means. When porting Angry Birds to HTML5, we found that in some cases Box2D performance could be the limiting factor in the game's frame-rate (on the more complex levels). It turns out this little library is doing a lot of work under the hood. And the work it's doing isn't limited to any one tight loop or hotspot. Rather, its work is distributed all over the place -- matrix and vector math, creation of lots of small objects, and general object-oriented logic distributed over a complex code base.

Motivation

Box2d makes a great general benchmark -- it's a bottleneck on real-world apps, and there's no one thing that a VM or compiler can optimize to "fix it". It also has the nice property that it's been ported to lots of different platforms -- the original is written in C++, and it's been ported to ActionScript, Java, Javascript, and several other systems. So I took it upon myself to put together a little benchmark that measures the time it takes to simulate one frame of a reasonably complex Box2D world.

The goal of this little experiment is not to add fuel to the flames of the Internet's already-tiresome "Compiler and VM Wars" -- rather, my intention is to get some hard data on what behavior real-world performance-sensitive code can actually expect to see in practice on various platforms. Measuring the performance of virtual machines in isolation is particularly tricky, but this benchmark has the nice property that, if a VM or compiler improves it, then real-world problems are actually solved in the wild, and everyone wins.

The Platforms

My intention is to measure the performance of various platforms, not particular Box2D ports. The ports themselves necessarily vary somewhat from one-another, but they should still be broadly equivalent. The reason Javascript VMs are represented four times in this list is that I wanted to make sure that we compared Javascript VMs, at their best, to the JVM, NaCl, and native code.

  • Native : This is the standard Box2D code, compiled via gcc or clang/llvm (the latter on my test machine, as described below).
  • NaCl : The same code, compiled via the NaCl SDK's custom gcc build, and running within Chrome.
  • Java : The JRE (1.6), as currently shipped by Apple on Mac OS 10.6.
  • Box2D-web : The hand-written Javascript Box2D port, on various browsers.
  • Emscripten : The original C++ code, compiled via Emscripten to Javascript.
  • Mandreel : The original C++ code, compiled via Mandreel to Javascript.
  • GwtBox2D : The Java port, compiled via GWT to Javascript.

The Test

World

Picking the right world structure for this kind of benchmark is a bit tricky, because it needs to have the following properties:

  • A high per-frame running time.
  • Not settle quickly: simulations that settle eventually stop doing work, as the physics engine skips calculations for settled objects.
  • Stable: Subtle variations in the behavior of floating-point math can cause the behavior on different VMs to diverge badly, invalidating comparisons.

I eventually settled on a simple pyramid of boxes with 40 boxes at the base, for a total of about 800 boxes. I manually verified that this simulation is stable on all the systems tested, and that it doesn't settle within the number of frames simulated for each test. It takes at least 3-4ms to simulate in native code, and only gets slower from there, so the running time is sufficient to avoid problems with timer resolution.

Code

All the test code is available on GitHub. Some of it requires a bit of care and feeding to get running (especially the C++ version), but should allow you to reproduce these results if so desired. There are also copies of the Mandreel and Emscripten Javascript output checked in, so you won't have to go through the pain of building those yourself.

Test System

I used my MacBook Pro 2.53 GHz Intel Core i5 as a test machine. It seems a fairly middle-of-the road machine (maybe a bit on the fast side for a laptop). As always, your mileage may vary.

Data

The raw data I collected is in the following spreadsheet. I ran each test several times, to mitigate spikes and hiccups that might be caused by other activity, and to give each system a fair chance. Each run warms up over 64 frames, and then runs for 256 frames -- I've confirmed that the simulation is stable over this duration on all the platforms under test.

First, let's look at all the results together, on a log-scale graph:

The first thing you'll notice is that there are three clusters of results, centered roughly around 5ms, 10ms, and 100ms. These clusters are associated with native code, the JVM, and Javascript VMs.

Now let's compare the best of the best of each of these three groups.

This is similar to the first graph, except that we've removed the noise of NaCl (which is within 20-30% of raw native performance) and all but the best of the various Javascript approaches. This looks a bit better for Javascript, clocking in at a mean of around 50ms (interestingly, this number is achieved by the Mandreel C++ cross-compiler running on Chrome 17; more on this below).

Analysis

Looking at the last graph above, I believe the most important observation is that, while Javascript implementations have improved drastically in recent years, the best result achieved by any Javascript implementation is still more than an order-of-magnitude slower than native code. This is perhaps an unsurprising result to many, but some have begun suggesting that modern Javascript VMs are "nearing native performance". While this may be true for some workloads, they still have a long way to go in this case.

This also demonstrates that native code compiled via NaCl stays within 20-30% of the performance of un-sandboxed native code, which is in line with what I've been told to expect.

Finally, it's somewhat interesting to see that the JVM is running 3x slower than native code. I've been told, anecdotally, that OpenJDK on Mac OS X is producing numbers closer to 6ms/frame, which would be ~50-60% slower than native, but I need to confirm this. I don't think it can be proven, but I suspect the JVM's performance can be taken as a rough lower-bound on what one can expect from any dynamic VM. Thus, Javascript VM implementors can take the JVM's performance as a reasonable target to shoot for.

Appendix: Javascript Implementations and Compilers

While it wasn't a primary goal of these benchmarks, they do give some interesting data about Javascript VMs, and the various compilers that use Javascript as a target language.

The following is a graph of the Javascript VMs in Chrome, Safari, Firefox, and Opera (IE9's Chakra VM is not yet included because it seems that Box2D-Web is using Javascript properties, which it doesn't support yet).

First off, all the VMs tested are well within 3x of each other, which is wonderful, because wildly variant performance across browsers would make it exceedingly difficult to depend upon them for any heavy lifting. V8 and JSCore are quite close to one-another, but JSCore has an edge in variance. It's not immediately obvious what's causing this, but GC pauses are a likely culprit given the regularly-periodic spikes we see in this graph.

Now we move to the various Javascript compilers. Here we have Box2D-Web (representing "raw" Javascript), Emscripten, Mandreel, and GWT (Closure Compiler should give performance roughly in line with "raw" Javacript on any modern Javascript VM).

The real shocker here is that Mandreel outperforms all the others fairly consistently, given that it's translating C++ to Javascript (!). Note that the Emscripten results are not as good as they could be -- the author is currently finishing up a new optimizer. I also believe the GWT compiler should be doing a better job than it is; there are a couple likely culprits in the compiled output involving static initializers and array initialization overhead. It should be possible to optimize these out and improve the results.

Note: See the update below about Emscripten performance

Caveats

As with all benchmarks, especially ones as fuzzy as this, there are a lot of caveats.

The code's not identical

These are all ports of the same C++ source code, so by their nature they must vary from one another. It may be the case that a particular port is unfairly negatively biased, because of particular idioms used in the code. If you suspect this is the case, please say so and preferably offer a patch to the maintainer. These aren't being used in the same way as, e.g., the V8 or Kraken benchmarks, so it's entirely fair to optimize the code to get better numbers.

I may have made mistakes

I suppose this goes without saying, but there are a lot of knobs to be tweaked here, and there could easily be something sub-optimal in my configuration, makefiles, or what-have you. If you notice anything amiss, please say so and I'll try to address it.

This is just one machine

As described above, I ran these tests on my MacBook Pro. The relative performance of these tests might not be the same on different machines.

Update

Based on feedback from comments and elsewhere, I've updated a few of the numbers in the linked spreadsheet, along with the graphs.

  • Emscripten: I got an updated build with a better compiler, and the numbers are now much more in line with the other Javascript implementations.
  • Java: I updated the JVM implementation to use a faster sin()/cos() implementation (there was a flag for this in the JBox2D code that I simply missed. It's no about 2.5x the speed of the native implementation. I also wasn't entirely clear about what I meant by the "JVM" -- so to be clear, this means the standard Mac OS X JDK 1.6 server JVM (there is no standard client JVM on the Mac).
None of this changes the core conclusions in any substantive way, however.

36 comments:

Dave Wolfe said...

Any chance we'll see Flash and Silverlight added to the mix? I don't know if the C# port of box2d works with Silverlight, but there are a couple box2d ports for Flash.

Joel Webber said...

Dave: I'd like to add both, but haven't gotten around to it yet (and I don't have the Flash tools installed at the moment). There are several Flash Box2D ports out there, and Mandreel can also generate Flash output, which would be an interesting data point as well.

.NET/Silverlight would be worthwhile as well -- I suspect they'll be in the same ballpark as the JVM, though you never know until you try.

In either case, if you (or anyone else) wants to take a stab at it, please feel free to do so and send a pull request to the benchmark repo. The more data, the merrier!

azakai said...

Thanks for mentioning that there is a bug in Emscripten that slows it down right now, I'm working on fixing it.

To be honest I am surprised it is as fast as GWT with that bug present...

- azakai

René said...

Very interesting.

I would be nice to see .NET in there as well. A suggestion would be - Farseer Physics Engine 3.0 which is built on Box2D 2.1.2 (Acutally, the Box2D.XNA port of Box2D).

http://farseerphysics.codeplex.com/

David Masover said...

Out of curiosity, which JVM? You mention "JVM" and "JRE", but not a version number and, perhaps more importantly, not whether you used the server or "hotspot" VM. I'm especially curious whether there's a meaningful difference between running with "-server" and without on the latest JVM.

John said...

I would also be curious about the code size of the various ports

Toby Reyelts said...

FYI Joel,

I ran hprof on the Java version and it shows 40% of the runtime in Math.sin/cos. These are notoriously "slow" on the JVM, because they care about accuracy compared to the implementations you get in your C runtime. (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4857011).

For an apples to oranges comparison, you'd probably want to have jbox2d use approximations for these functions.

If I had to hazard a guess, it would be that the JS ports probably have a similarly loose implementation of the transcendental functions as the C runtime.

Unknown said...

If you made a test case for Flash, you could use hand-written port of Box2D (http://www.box2dflash.org/) or C++ version compiled with Alchemy (https://github.com/jesses/wck/wiki/box2d-flash-alchemy-port).

Markus said...

In relation to graph 3 you write that JSCore has an edge in variance maybe because of GC pauses given the regularly-periodic spikes.

In graph 3 Chrome 17 has the huge spikes, not Safari?

Do I get something wrong?

Joel Webber said...

@azakai: I commented on the emscripten bug as well -- looks like it's a good bit better now. I'll update the numbers soon.

Joel Webber said...

@René: I'd like to do that as well, but am not currently setup to do so (and need to get back to my "day job"!). As noted above, if anyone has the time to deal with .NET or Flash, please feel free to do so and send a pull request to the github repo.

Joel Webber said...

@David Masover: Sorry, wasn't clear about this -- it's the standard JRE 1.6 on my MacBook Pro, which means that it's always --server (there's no client VM on the Mac, and I've not mucked about with OpenJDK).

So it is running Hotspot, and I gave it 64 frames of warmup time, which seems to be enough to make Hotspot happy (I gave all the Javascript VMs the same warmup time; the native/NaCl variants don't really need to warm up at all).

Joel Webber said...

@Toby: Damn, somehow I had no idea that this was the case on the JVM! That sucks very much badly; I'll find some reasonable fast-transcendental implementations and update the numbers in a followup edit.

-1 to pedantry. If I wanted slow arithmetic I would have asked for it :(

Joel Webber said...

@Unknown: Note as above -- if you or anyone else has time to setup a Flash version of the test, that would be greatly appreciated.

w.r.t. Alchemy -- Mandreel actually has a Flash target as well. I haven't set it up yet, but given its performance on V8, I'm guessing it would show the AS3 VM in the best light. I'll see if I can find time to add it to the mix.

Joel Webber said...

@Markus: That's actually what I meant -- JSCore (Safari) has a lower variance, *possibly* because those spikes in the V8 numbers are GC pauses. But I haven't yet verified whether they're actually GC pauses, or something else entirely.

gsnedders said...

What version of Opera are you testing there? 11.60? There were some fairly major changes made between 11.50 and 11.60.

Joel Webber said...

@gsnedders: 11.60. Just updated and re-ran the tests to be sure. Results are pretty consistent.

Osvaldo Doederlein said...

"I suspect the JVM's performance can be taken as a rough lower-bound on what one can expect from any dynamic VM. Thus, Javascript VM implementors can take the JVM's performance as a reasonable target to shoot for."

The limitations of Java performance come not much from its dynamic (JIT) aspect, but from other things like simple typesystem (.NET's CLR is much better here and still managed), memory model (safe, GC'd, no control over layout / locality etc.), limited use of some HW facilities from CPU's vector units to GPU's programmable shaders, and portability (btw this is the reason for the expensive transcendental functions: the spec demands well-defined results independent from OS/CPU/FPU).

OTOH, Java has some intrinsic efficiency aspects that JS will hardly ever match, enabling most classic compiler optimizations and them some. IMHO Java will always be a very distant target for JS, almost as much as native code. Hoping though that Dart can fix at least a big part of this gap.

Isaac Gouy said...

@Joel Webber -Damn, somehow I had no idea that this was the case on the JVM!

This came up a few years ago in the benchmarks game, see the FastMath workaround in this Java partial-sums program.

Joel Webber said...

@Osvaldo: Yes -- i was using "dynamic VM" as rough shorthand for "a safe virtual machine with garbage collection and all that good stuff". I doubt it will ever be possible for Javascript VMs to match JVM or CLR performance, but it's still a useful target, even if it's an asymptote.

I brought this up primarily because I think it would be simply unrealistic to expect Javascript VMs to ever reach native performance (except perhaps on very narrow synthetic benchmarks). JVM/CLR performance is a bit more reasonable to expect.

Joel Webber said...

@Isaac: Thanks for the link. I just noticed that JBox2D has a flag to turn on LUT-based transcendentals. Turning this on yields better performance -- just under 2.5x native.

I remain stunned that there's no built-in fast-sin implementation in Java.

azakai said...

@Osvaldo:

> OTOH, Java has some intrinsic efficiency aspects that JS will hardly ever match, enabling most classic compiler optimizations and them some

What in your opinion is a classic compiler optimization that *cannot* be done in an SSA-optimizing JS JIT like CrankShaft?

Osvaldo Doederlein said...

@azakai: In theory, none; but in theory Java can also be as fast as 100%, which often happens but not always. Compiler optimization is not magic, for one thing most of the time it can reduce overheads but not completely remove them. Method dispatch is a good example, you can devirtualize and even inline but you may need to insert some guard code to bail out when optimistic assumptions are violated. Now in a language like Java, the JIT can often avoid that guard code, or use classloading for that, because the static-typed typesystem and the rigid class structure are critically helpful to the compiler. But JS is wildly more difficult, anything can change anytime, so optimized code will always be shock-full of guard code, and general bloat like fast- plus slow-paths, deopt metadata, instrumentation etc. And the very best optimization is only achievable when optimizing huge basic blocks built by very aggressive inlining; which is great for benchmarks that spend 90% of the time in a few "hot spots", but doesn't map nicely to a next-gen app with a MLOC of JS code and a mostly flat profile.

kripken said...

@Osvaldo

> But JS is wildly more difficult, anything can change anytime, so optimized code will always be shock-full of guard code, and general bloat like fast- plus slow-paths, deopt metadata, instrumentation etc.

Not necessarily, analysis methods like SpiderMonkey TI can detect what things cannot change and do not need guards at all.

In theory anything can change, but it usually won't, especially in high-performance code, and that behavior is detectable both statically and dynamically. Again, this isn't just theoretical, it's going into a stable browser next week (FF9).

- azakai

jpvincent said...

Is there a way you provide the benchmark as a standalone page ? I would like to compare JS engines on windows 8

Joel Webber said...

@jpvincent: I don't have much spare time to do so, but it shouldn't be hard to just check out the repository and stick it in a local HTTP server. The Mandreel and Emscripten compiled output is included in the repo, and you can run the Javascript version from source.

Do be aware that they all seem to have problems on IE9/10. Box2d-web appears to use some of the newer get/set Javascript syntax, and Mandreel/Emscripten really need TypedArrays to perform reasonably.

Jeff said...

Hey Joel,

I was doing some searching today and found

http://code.google.com/p/dartbox2d/

I haven't tried it yet, but was wondering if you were interested in seeing how dart vm preformed with this aswell.

Joel Webber said...

Jeff: I'm pretty sure the author is about to do just that, but feel free to take a crack at it. I suspect it will still be behind V8 by a pretty wide margin at this point (still very early days for the optimizing Dart VM). But the existence of the benchmark should give the team something to work from in their optimization efforts. I hear they really like to make things fast :)

Luis said...

Would love to see how azakai's new one compares to the others.

http://mozakai.blogspot.com/2012/02/box2djs-box2d-on-web-is-getting-faster.html

Nathan Hammond said...

Luis, Alon included a JS version of the bench2d test code against the JS bindings. Take a look at: https://github.com/kripken/box2d.js/blob/master/test.js

It is quite simple to run that test, but it isn't quite an even-handed comparison to Joel's existing Emscripten version–those numbers measured the performance of a transpiled LLVM BC/C++ benchmark.

However, I feel like this is a much more interesting approach: it is effectively generating a dynamically-linkable JS library.

I'm very curious as to the performance impact of using JS to interface with the lib as opposed to transpiled JS linking.

(However, I really don't feel like massaging all of the tests to run on my machine so that I can have consistent values. So Joel, can you simply run test.js on the same machine you've run all of the others on? Pretty please?)

Joel Webber said...

@Nathan: I just merged your pull request, and am getting something on the order of 90ms/f on the same MacBook Pro I used for the other benchmarks. I've asked Alon to have a look to make sure I'm not screwing anything up.

Nathan Hammond said...
This comment has been removed by the author.
Nathan Hammond said...

Here is the comparison I was talking about in my last comment:

JS Linked vs. C++ Compiled

The JS linked version is the one running a version of the benchmark written in JavaScript dynamically linked to the emscripten library output.

The C++ compiled version runs a C++ version of the benchmark which was compiled along with the box2d library by emscripten.

On my machine the JavaScript version is either faster on the whole or exhibits a better garbage collection/loop tracing profile. Also worth noting is how different the performance is between V8 and SpiderMonkey.

nornagon said...

I wonder if ChipmunkJS would go faster -- it's a hand port, with more JS-specific optimisations than box2dweb.

jam said...

I'm coming to this late, but something for future blog posts...

It would have been a lot easier to read the graphs if you had picked a color for a given implementation, and then kept that color between the various graphs.

Eg, you used Blue for the C code in graphs 1 and 2, but then re-used Blue in the third graph for Box2d-web (safari nightly). At first glance, I thought the second graph was just a zoom-in on the bottom 3 lines, not a zoom-in on the bottom 4 lines with one omitted.

I'm guessing these were just the default colors for graphs, but it would have made it easier to read.

Joel Webber said...

@nornagon: I wouldn't mind trying Chipmunk, but I'm simply more familiar with box2d. Also, there appears to be no Java port of Chipmunk -- not a showstopper, but I wanted to get a clear picture of performance across a range of [V]Ms. I'm hoping that we'll see a better Javascript Box2D port in the near future, which we can use to get better numbers.

@jam: I know the graphs aren't beautiful. I was in a hurry, and used whatever came out of Google Spreadsheets by default. Maybe next time around I'll choose colors more explicitly.