Taming the Sasquatch


Imagine it’s 1993, and you just bought a new shiny car: not just any car, but your dream car.

It has all the features one could wish for and a lot more, from the turbocharger and air conditioning to fuzzy dice and a DVD jukebox. 

Everyone on the street turns their head when they see it, and the salesman claims it is the top of the line in the automobile industry. 

Then, one day, you need to pop the hood for whatever reason - add more coolant, who knows - and what you see is your worst nightmare: a jumble of cables, pipes, wires, and chains. It is a mess and, after some digging, you realize that the air conditioning is actually in what you thought was the ski box, and the turbocharger is under the passenger seat: which explains why it tends to get rather hot, and why asbestos is mixed into the seats’ upholstery. On top of that, if you remove the fuzzy dice, the car just won’t start, and nobody can tell you why; all they can tell you is that you should never have touched the fuzzy dice, and you have just voided your warranty.

That car is comparable to the sight I usually get when I look at the rendering system of a commercial game engine: a giant hairy monstrosity that ordinary people are not supposed to touch. (I’m using the Hacker’s Dictionary definition of ‘hairy’ here: complicated, entangled, and fragile.)

The rendering system used in many Housemarque games was internally named Sasquatch simply because it needed a distinct name. Ironically, it remained clean and straightforward over the years, with a reasonably small memory and performance footprint.

True to the purpose it was built for, the engine was in constant development, with roots going back to the PS2 concept demo The Trader, while Super Stardust Delta, Resogun, Alienation, and Nex Machina used it in its final form.

The Core Concepts

To begin with, we wanted to figure out what the smallest element is from which rendered frames are built. The result was one of the defining features of Sasquatch: the draw call.

A second defining element, the compute dispatch, came into play later. Together they defined the primary function of Sasquatch: to dispatch draw calls as efficiently as possible.

However, draw calls are meaningless on their own: they require context (or state), resources - such as vertex buffers and textures - and parameters - such as vertex count and primitive type. The distinction is not always clear cut, though: for instance, a constant buffer used in a draw call is a resource, but its content can be considered state.
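
As a concrete illustration, here is a minimal sketch of what such a self-contained draw call description could look like. Every name in it is hypothetical, not Sasquatch’s actual interface:

    #include <cstdint>

    // Opaque handles into platform-specific resource pools (illustrative).
    using PipelineStateHandle = std::uint32_t;
    using BufferHandle        = std::uint32_t;
    using TextureHandle       = std::uint32_t;

    enum class PrimitiveType : std::uint8_t { TriangleList, TriangleStrip, LineList };

    struct DrawCall {
        // State: the pipeline configuration this call runs under.
        PipelineStateHandle pipeline;      // shaders, blend/depth state, ...

        // Resources: data the GPU reads.
        BufferHandle  vertexBuffer;
        BufferHandle  constantBuffer;      // a resource whose content acts as state
        TextureHandle textures[4];

        // Parameters: how to launch the work.
        PrimitiveType primitiveType;
        std::uint32_t vertexCount;
        std::uint32_t instanceCount;
    };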

Once we established a solid foundation, the rendering engine kept growing, focusing on being as data-driven as possible, using resources optimized for the platform, and requiring little load-time processing. The low-level concepts were universal and largely independent of the platform the engine was running on. We simply wrapped them into opaque data types specific to each platform, and restricted the low-level implementation to opaque functions in platform-specific source files.
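
This is essentially the classic opaque-type idiom. A minimal sketch, assuming a forward-declared context type and reusing the hypothetical DrawCall from above:

    // Shared header: only a forward declaration is visible to engine code.
    struct DrawCall;       // the draw call sketch from above
    struct DeviceContext;  // defined separately per platform

    DeviceContext* CreateDeviceContext();
    void           Submit(DeviceContext* ctx, const DrawCall& call);
    void           DestroyDeviceContext(DeviceContext* ctx);

    // A platform-specific source file then provides the real definition,
    // e.g. wrapping that platform's native command buffer:
    //
    //   struct DeviceContext {
    //       NativeCommandBuffer commands;  // hypothetical platform type
    //   };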

Data-Driven into Modularity

While the rendering system already had plenty of modularity usable from C++, the PS3 title Dead Nation was the straw that broke the camel’s back for the render loop: the game’s top-level Render() function grew to a thousand lines or more and relied on several sub-functions of comparable size.

Dead Nation (2010)

Inspired by the use of Lua as a DDL for the game-side state machines of Dead Nation, I decided that a complete top-level rewrite of the rendering engine was now necessary. It meant breaking down the functionality of Dead Nation’s renderer and all its utilities, figuring out the common patterns and use cases, and distilling those into modules or, as they were known in the code, render nodes. Since Lua was used only as a constructor DDL, the performance of an interpreted language was not an issue: running the initializer script took only a few hundred milliseconds, which disappeared into load-time latency.

The render nodes fell into three major categories:

  • GPU resource nodes, such as the render targets

  • CPU resource nodes, such as scene PVS nodes or cameras

  • Rendering nodes that did the actual rendering work, such as rendering scene objects, full screen passes, and custom HUD drawing

Adding a new node type was relatively straightforward, as it just required implementing two classes: a factory class invoked from Lua, and the node class itself, with only a handful of functions each, as sketched below. This did not mean the nodes had to be simple, but “one node for one task” was the primary guideline, and the result was better multithreaded performance.
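
A hedged guess at what those two classes might have looked like, following the “handful of functions each” description; all interfaces below are illustrative, not the engine’s real ones:

    struct ResourceRegistry;  // collects declared reads/writes (hypothetical)
    struct FrameContext;      // per-frame data passed to nodes (hypothetical)
    struct LuaTable;          // parameters parsed from the Lua script (hypothetical)

    class RenderNode {
    public:
        virtual ~RenderNode() = default;
        // Declare the CPU/GPU resources this node reads and writes so the
        // scheduler can derive dependencies between nodes.
        virtual void DeclareResources(ResourceRegistry& registry) = 0;
        // Perform the node's one task for the current frame.
        virtual void Execute(FrameContext& frame) = 0;
    };

    class RenderNodeFactory {
    public:
        virtual ~RenderNodeFactory() = default;
        // Construct a node from the parameters given in the Lua script.
        virtual RenderNode* Create(const LuaTable& params) = 0;
    };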

The initial version, for Super Stardust Delta, was single-threaded, as the PS Vita graphics API was also single-threaded. We still had good occupancy, since the game code ran multithreaded on the other cores while the renderer used just one.

Scheduling Made Simple

We often hear that Resogun was a spiritual successor to Defender, but it inherited its bones and blood from Super Stardust Delta. The multithreaded scheduling on the game side was one of those bones, and since the PS4 has vastly more CPU power than the PS Vita, running Resogun’s gameplay was a breeze. A single-threaded renderer did not fly, though, so it was time for the renderer to follow suit.

Resogun (2013)

The render node system turned out to be quite timely: the nodes were self-contained C++ objects that were called to do their thing in sequence. GPU resource nodes did nothing at runtime, and CPU resource nodes didn’t do much either, except for a view frustum node that picked the objects to render from various sources.

The problematic “resource” was the implicit rendering device context itself - but really, all it needed was to be made explicit: turn it into an actual GPU resource, add a couple more for management, and voilà!

Easier said than done, though: modifying all the existing render nodes was a simple but large and tedious task.

Of course, having just one device context doesn’t help much. On PS4, the core of the context was a GPU command buffer that the console’s GPU processes natively. The end goal was to fill several of these in parallel and dispatch them to the GPU in order.
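
In outline, that scheme might look like the following sketch, where ParallelFor and SubmitToGpu stand in for the engine’s and platform’s actual facilities:

    #include <cstddef>
    #include <functional>
    #include <vector>

    struct CommandBuffer { /* wraps the platform's native GPU commands */ };

    // Hypothetical helpers: a fork-join loop over worker threads, and the
    // platform call that hands a finished buffer to the GPU.
    void ParallelFor(std::size_t count, const std::function<void(std::size_t)>& body);
    void SubmitToGpu(const CommandBuffer& buffer);

    void RenderFrame(const std::vector<std::function<void(CommandBuffer&)>>& recordFns)
    {
        std::vector<CommandBuffer> buffers(recordFns.size());

        // Fill one command buffer per node, in parallel.
        ParallelFor(buffers.size(),
                    [&](std::size_t i) { recordFns[i](buffers[i]); });

        // Dispatch strictly in the original node order.
        for (const CommandBuffer& cb : buffers)
            SubmitToGpu(cb);
    }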

A big part of the scheduling system was turning the full ordering (from the Lua file) into a partial ordering: this is what allowed us to parallelize it.

Graph theory to the rescue! The resources from the node configurations became DAG edges from the previous writer to the next reader. While the device context was the most prevalent dependency, other CPU resources were just as necessary. 
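
A sketch of that derivation, assuming each node has declared which resource ids it reads and writes (illustrative code):

    #include <unordered_map>
    #include <vector>

    struct NodeDeps {
        std::vector<int> reads;   // resource ids this node reads
        std::vector<int> writes;  // resource ids this node writes
    };

    // edges[i] lists the nodes that must run after node i.
    std::vector<std::vector<int>> BuildEdges(const std::vector<NodeDeps>& nodes)
    {
        std::vector<std::vector<int>> edges(nodes.size());
        std::unordered_map<int, int> lastWriter;  // resource id -> node index

        for (int i = 0; i < static_cast<int>(nodes.size()); ++i) {
            // Edge from the previous writer of each read resource to this reader.
            for (int r : nodes[i].reads) {
                auto it = lastWriter.find(r);
                if (it != lastWriter.end())
                    edges[it->second].push_back(i);
            }
            for (int w : nodes[i].writes)
                lastWriter[w] = i;
        }
        return edges;
    }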

This process resulted in way more edges than needed, though: the overhead was far from negligible.

Graph theory to the rescue again! To eliminate the excess, I computed the transitive closure of each node and, when adding new edges, also removed all edges pointing to nodes already in the transitive closure, producing a transitive reduction of the original DAG. This reduced the number of edges to one for most nodes and just a few for the rest, and the edge overhead nearly vanished.
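
The pruning can be sketched roughly as below, assuming nodes are added in the script’s full order; this is illustrative code, not the original:

    #include <algorithm>
    #include <set>
    #include <vector>

    struct Dag {
        std::vector<std::vector<int>> preds;      // kept direct edges into each node
        std::vector<std::set<int>>    ancestors;  // transitive closure per node

        // Add node i (== current node count) with every candidate predecessor
        // produced by the resource analysis.
        void AddNode(int i, const std::vector<int>& candidates)
        {
            preds.emplace_back();
            ancestors.emplace_back();

            for (int p : candidates) {
                if (ancestors[i].count(p)) continue;  // already implied transitively

                // Drop kept edges that the new edge makes redundant.
                auto& kept = preds[i];
                kept.erase(std::remove_if(kept.begin(), kept.end(),
                               [&](int q) { return ancestors[p].count(q) != 0; }),
                           kept.end());

                kept.push_back(p);
                ancestors[i].insert(p);
                ancestors[i].insert(ancestors[p].begin(), ancestors[p].end());
            }
        }
    };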

The final part of the graph preprocessing was to count dependencies and then reverse the graph: once a worker thread finished a node, it would atomically decrement the unprocessed-dependency count of its dependants and, once that count reached zero, the dependant node was ready to run.
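
A minimal sketch of that mechanism, with WorkQueue standing in for whatever fed the worker threads:

    #include <atomic>
    #include <vector>

    struct WorkQueue { void Push(int nodeIndex); };  // hypothetical ready queue

    struct SchedNode {
        std::atomic<int> pending;     // unprocessed dependency count
        std::vector<int> dependants;  // reversed edges: who waits on this node
    };

    void OnNodeFinished(std::vector<SchedNode>& nodes, int finished, WorkQueue& ready)
    {
        for (int d : nodes[finished].dependants) {
            // fetch_sub returns the previous value; 1 means this was the last
            // outstanding dependency, so the dependant can now run.
            if (nodes[d].pending.fetch_sub(1, std::memory_order_acq_rel) == 1)
                ready.Push(d);
        }
    }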

Optimizer’s Dream

It was straightforward to tweak the Lua script, reset the renderer, and profile again. The scripting interface could already re-run the script file, making iteration nearly instant. If there was an error in the script, the game merely went blank until a fix was in; our debug overlay was not part of the script, so it remained visible regardless, and an external tool handled uploading the script.

However, optimizing was not an easy task, as there were two things to look after: CPU time for multithreading, and GPU time. For the former, this meant looking for gaps in the CPU profile and figuring out how to get rid of them; looking at the visualized DAG was very useful for that purpose. The process was similar for the latter, though a touch more challenging.

Starting with Resogun, we used async compute extensively in all our games whenever possible. Our particle simulation was one of the first things that had to run, to produce the data for the opaque particles, and it left a gap in rasterizer utilization. Filling that gap needed something that did not depend on the depth buffer or anything else in the scene. There were only two such items, the HUD and menu layers, as they had to be on top of everything else. They did not fill the hole completely, but the HUD became practically free on the GPU, as it vanished into the gaps left while the async compute work ran. In the end, Alienation, with its numerous shadow-casting lights, took 4.5 ms of CPU time on four worker threads, and Nex Machina just 1.4 ms on PS4.
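
To make the overlap concrete, here is an entirely hypothetical sketch of the submission pattern; none of these types or calls are a real console API:

    struct Fence    { unsigned long long value = 0; };
    struct Workload;  // some recorded batch of GPU work

    struct ComputeQueue  { Fence Dispatch(const Workload& w); };
    struct GraphicsQueue { void  Draw(const Workload& w);
                           void  Wait(const Fence& f); };

    void SubmitFrame(ComputeQueue& compute, GraphicsQueue& graphics,
                     const Workload& particleSim, const Workload& hud,
                     const Workload& menus, const Workload& opaqueParticles)
    {
        // Particle simulation starts early on the async compute queue.
        Fence particlesDone = compute.Dispatch(particleSim);

        // HUD and menus depend on nothing in the scene, so the graphics queue
        // can rasterize them while the compute work occupies the shader cores.
        graphics.Draw(hud);
        graphics.Draw(menus);

        // Passes that consume the particle data wait on the fence.
        graphics.Wait(particlesDone);
        graphics.Draw(opaqueParticles);
    }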

Not All Honey and Sunflowers

Communication between game and render nodes was complex because the render nodes had to be self-contained and isolated to ensure thread safety. 

Concurrency bugs did happen, mainly because a node implementor had not correctly registered all the CPU resources in use. We had different approaches to different problems, but none seemed to be a satisfactory universal solution: sometimes we used Lua variables to carry information, sometimes named objects or tweakables, occasionally even (gasp!) global variables.

Another weakness came up with dynamic rendering.

While Lua was an easy choice for the DDL, as we had already integrated it into the engine, it did not easily allow us to add or remove nodes at runtime. This became an obvious problem with lighting and shadows.

In-engine footage of Alienation (2016)

Our spotlights in Alienation were feature-rich for the time, with shadows, volumetrics, and similar elements, but the Lua script generated the maximum number of shadow rendering nodes, which then wasted some microseconds checking and noticing that they had nothing to do. That maximum was a semi-hard constraint in the script and could not be adjusted dynamically. It was a working solution, but far from elegant.

All That Is, Well, Ends … 

The decision to use Unreal Engine for the development of Stormdivers and Returnal marked the end of the Sasquatch rendering system. It was a sensible decision: we did not have enough tools or systems programmers to ramp up the technology for the project, and our indie background meant we did not yet have the technical capital to continue using Sasquatch.

We made games, great games - but when you laser-focus on making said games, you are likely not thinking about, or don’t have the bandwidth for, making the game engine and reusable tools as well. We had tools for making games similar to Nex Machina, but those did not translate readily into making anything else.

There were some wild plans - in retrospect, laughably unrealistic ones - for somehow conjuring tools and systems out of nowhere while the content team and the designers were prototyping Returnal on Unreal Engine, but we decided against that before wasting development time.

Regardless, I consider the Sasquatch rendering system a success story. During its ten-year life span, it shipped many critically acclaimed titles, ran on platforms ranging from handhelds such as the PS Vita to powerhouse consoles such as the PS4 Pro, and powered amazing visuals and wildly different art styles, from the apocalyptic voxelscapes of Resogun to the windswept snowbanks and steamy jungles of Alienation.

Not too bad, really.

Written by: Seppo Halonen