We released Murderous Pursuits a few months ago, and I wanted to share a bit of info on what I was working on. This blog post will look at the audio solution for one of the core features, the “Vignette” system. This will cover:
– What the original concept and audio requirements were
– How we handled the dialogue recording and editing
– The systems design and implementation using Wwise and Unity.
If you haven’t played or seen it yet, Murderous Pursuits is a kill-or-be-killed Victorian stealth-em-up for 1-8 players in which you must hunt and kill your quarry before your hunters do the same to you, all while avoiding witnesses. You can buy it on Steam right now!
Firstly, what are Vignettes? As part of Murderous Pursuits’ stealthy gameplay you can use various spots around the level to blend into the environment or join crowds to either hide from hunters or stalk your quarry, and strike without warning. It looks like this in action:
We use Vignettes to bring more life to the world as well as serving a gameplay purpose. When people group together we want them to strike up conversations, make the levels feel (and sound!) busier, and pepper in a little world and character building to boot. The characters also have different talking animations based around three states: positive, negative and neutral, and incidental reactions such as nodding or disagreeing, to stop things looking and sounding too samey. Basically, we wanted to emulate actual group conversation.
You’re probably reading that and thinking “That sounds like a lot of VO needed there” and you’re not wrong, especially if we were looking to avoid the dreaded looping dialogue lines!
As a compromise, and a budget/time-friendlier solution, our Creative Director, Kitkat, floated using Simlish dotted with the occasional word or phrase to make it feel like the characters were discussing something instead. It could also play into the game’s comedic edge, with them going off on random tangents about misadventures they may have had. A bit like the Drunk Guy sketch from the Fast Show:
This would reduce the writing required for conversation vignettes down to about 200 words/phrases per character as opposed to… a lot more, and pushed us towards a more technical solution: setting up a playback system that can string random lines of Simlish together while occasionally throwing a word in. That is, of course, after we worked out how we wanted the Simlish to sound.
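At its core, the idea can be sketched in a few lines of Python. This is purely illustrative: the clip names, pool sizes and the `word_chance` value are my stand-ins, not anything from the shipped game.

```python
import random

# Hypothetical clip pools, standing in for the recorded Simlish and word files.
SIMLISH = ["sim_short_01", "sim_med_01", "sim_long_01", "sim_question_01"]
WORDS = ["word_murder_01", "word_teatime_01"]

def build_utterance(length=5, word_chance=0.2, rng=random):
    """String random lines of Simlish together, occasionally throwing a word in."""
    return [
        rng.choice(WORDS) if rng.random() < word_chance else rng.choice(SIMLISH)
        for _ in range(length)
    ]
```

In the real game this selection happens inside Wwise rather than in script, as described below, but the random-with-occasional-word principle is the same.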
Simlish, for those who are unaware, is the language used in The Sims series of games. It was designed to be as universal as possible to provide a degree of context to what a player’s Sims were saying, while also acting as a practical solution to VO repetition and localisation. Currently it has an alphabet, rough but expansive phrase book, and even real world pop songs translated into the language. You can read a bit more here.
Planet Coaster also has a similar solution with Planco, their own in-world language. However, that was designed to be a working, functional language and even has its own official dictionary. You can read more about the design process of it here.
Unfortunately, we had neither the time nor the budget to do something quite as detailed, so we opted for a slimmer solution: having our actors voice different lengths of Simlish in each of the three tones. We also grouped these requirements into four lengths: Short, Medium, Long and Questions. Even within these sets, we have some variation in the length of the phrases for each character to help establish their quirks.
For example: The Brute’s short phrases are around 1 to 2 seconds, and his longer ones around the 8 second mark. He’s a pretty to-the-point kinda guy.
The Admiral, on the other hand, has shorter phrases around the 3 to 5 second mark and his longer ones reaching up to and beyond 20 seconds in length, reflecting his more blowhard, longwinded nature.
We also had to consider conversation flow and the fact that any word or phrase, Simlish or otherwise, could potentially tie into another.
But what about the actual recording process? While The Sims and Planet Coaster went for a more unified world language, each individual person in those games is practically a blank slate, with whatever the player projects or builds into them forming their background. Murderous Pursuits has characters that have some degree of backstory, not to mention varying nationalities, so we opted to give our actors a bit more freedom in terms of what they performed so long as it met our phrase length and tone criteria. All of them took the ball and ran with it, resulting in several different approaches. Kim Allan used news articles as a base, scrambling the words and making it sound more like Gaelic for the Scottish Duchess, while Jay Britton’s take on the Dodger involved him making up short stories, garbling the words while maintaining the ebb and flow of his tales that played into his Cockney cheeky chappy character. Here’s an example phrase:
Sounds about right! In terms of scope, each character archetype has roughly 600 individual VO clips in game, which is about 4,800 files. When we add in Mr. X and the guards that comes to around 5,100, trimmed down from over 12,000 takes in total. This covers everything from Simlish and spoken words and phrases to reactions, grunts for attacking and dying, and a whole variety of weird requests that never made it into the game.
Which is a lot of talking, and even more editing, especially for a single person! So, a special shout out to Stephen, who was a QA tester at the time before making the jump to marketing, and who took on the grunt work of chopping up the Admiral’s lines when a deadline was looming. Here’s how the final session of the Duchess looked:
As far as clean up went, I used iZotope’s RX suite to remove some of the pops and crackles that occur when people speak, like when your lips smack or open. You don’t really notice them in real life conversation, but in a quiet space close to a mic they can stand out and sound unnatural. After that, I used a gate and volume automation to cut out some of the background noise and control the tail-offs of words during longer takes, plus some de-essing to tame sibilant sounds (like… “ess”es). There was also some compression and EQ to even things out.
One handy tip that I’ve seen a few other dialogue editors share is to record a few extra takes and pronunciations of troublesome letters and word endings, such as plosives, stops and sibilants (t’s, f’s and esses, to name a few), in case you need to do some further edits and repairs. While I didn’t do this at the time, there was enough content recorded that I was able to stitch together takes that would have otherwise been unusable, and stretch out the number of variations we have even further.
Below is a quick example, where a rogue “th” got a little lost post-processing, so instead of trying to automate volume and EQs I grabbed a clean one from another take, before the processing was applied, and dropped it in. It might look a bit odd to have a mono clip in between two stereo ones: Ableton froze the tracks to stereo despite the original source being mono, and everything was summed back to mono afterwards, so there’s no weird spatial stuff going on in the end!
Now let’s talk about how it works inside Wwise, the audio middleware solution we used. Firstly, here’s the hierarchy for the conversation container to give you a quick overview of what it contains.
At the top is a Switch Container, VOX_CharConvo, which is used to set which character should be talking; the character’s own Switch Container then selects which type of Simlish or conversation audio we want played. In this case, I’ve branched out the Brute’s: VOX_BruteConvo. Each of the conversation types is contained in a Random Container, which lets us randomly select an audio file from a group of sounds when an audio event is triggered.
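Conceptually, the hierarchy is just nested random picks: switch to a character, switch to a tone, pick a phrase type, pick a clip. Here’s a toy model in Python (the container and clip names are illustrative, not the actual Wwise objects):

```python
import random

# A toy model of the Wwise hierarchy: the character and tone switches pick a
# branch, then nested "random containers" pick a phrase type and a clip.
CONVO = {
    "Brute": {
        "Neutral": {
            "Short":    ["brute_neu_short_01", "brute_neu_short_02"],
            "Medium":   ["brute_neu_med_01"],
            "Long":     ["brute_neu_long_01"],
            "Question": ["brute_neu_q_01"],
            "Word":     ["brute_word_01"],
        },
    },
}

def pick_clip(character, tone, rng=random):
    phrase_types = CONVO[character][tone]
    phrase_type = rng.choice(list(phrase_types))   # random sub-container
    return rng.choice(phrase_types[phrase_type])   # random clip within it
```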
However, as you might have guessed from the references to them above, the Neutral, Positive and Negative containers are a little different. We need these to not just pick a file at random, but also to pick between Simlish phrases of varying lengths and words, while maintaining the flow of conversation. Picking a random phrase type or word is pretty simple: we just use more Random Containers. The actual flow took a little more refinement. Here’s the Property Editor of the Brute’s Neutral Convo container:
There are a few things going on here. First is the Initial Delay setting.
This is used as a basic starting offset for the looping animation, with the randomizer being used to pick a value. Nothing fancy.
Next, the Play Type.
This is set to randomly pick one of the 6 possible options in the Neutral Container. Normally we’d want to avoid repeating clips that have recently been played, to dodge the “machine gunning” effect. Here, though, we want to be able to hit any of the sub-containers even if they’ve already been used. This helps mask the transitions between each type, and allows for more dynamic and varied conversations.
Play Mode is set to Loop Continuously.
They’ll keep talking until we trigger an event to tell them to stop, such as someone else piping up to speak, or reacting to someone being murdered nearby. We don’t want monologues after all, even if the Admiral could spout one easily. The Transition Delay is set between 0 and 0.4 seconds when the random offsets are accounted for, giving the characters a little bit of a breather outside of any natural gaps and pauses in their phrases.
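Put together, the loop-plus-delay behaviour amounts to something like the sketch below. The clip lengths are made up, and the fixed clip count is just so the function terminates; in game the loop runs until a Wwise stop event fires.

```python
import random

def schedule_conversation(clip_lengths, total_clips, rng=random):
    """Build a playback schedule: each clip, then a 0-0.4 s breather.

    Models the Loop Continuously play mode with a randomized Transition
    Delay; 'total_clips' stands in for the external stop event.
    """
    schedule, t = [], 0.0
    for _ in range(total_clips):
        length = rng.choice(clip_lengths)
        schedule.append((t, length))                 # (start time, duration)
        t += length + rng.uniform(0.0, 0.4)          # transition delay
    return schedule
```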
We initially tried cross-fading between each clip, but in practice that felt more like a stream of audio and not quite natural enough. There’s also a little bit of silence at the end of each Question snippet of Simlish which, combined with the actors’ inflections, helps create a more natural flow to their conversations by letting the question hang in the air a little.
Finally, the weightings.
Characters are slightly more biased towards Simlish, as we wanted each word to pop and minimise the potential for a string of words occurring while not removing it entirely.
Outside of the playback settings, each Random Container has two silent audio clips built with the Wwise Silence plug-in. These are also varied in length depending on which container they’re in, and introduce additional pauses in speech to reduce fatigue and make the conversation flow more naturally. There’s also a chance for multiple silence clips to be triggered in a row, but that’s okay. Not everyone has the stamina to talk for a while.
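The weighting and silence entries together behave like a weighted random draw. Here’s a sketch; the entry names, weights and silence lengths are mine, not the shipped values:

```python
import random

# Illustrative weights: Simlish phrase containers outweigh the word container
# so single words "pop" without vanishing entirely, and two silence entries of
# different lengths stand in for the Wwise Silence clips.
ENTRIES = {
    "Short": 22, "Medium": 22, "Long": 18, "Question": 18,
    "Word": 8, "Silence_1s": 6, "Silence_2s": 6,
}

def pick_entry(rng=random):
    """Weighted draw with replacement, so silences can land back to back."""
    names = list(ENTRIES)
    return rng.choices(names, weights=[ENTRIES[n] for n in names], k=1)[0]
```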
Here’s how the whole thing sounds for a single character in Wwise:
It’s not too bad on its own. But what about in game?
That’s a bit better. The little snippets of conversation as you walk past groups of people help the world feel a bit more populated and alive. We also made use of Wwise’s ShareSets to control how the sound changes over distance. ShareSets are attenuation and rolloff curves over distance that we can attach to groups of sounds, instead of setting those values individually. Here’s how the Conversation one looks:
The curves control the volume drop-off, and introduce some high- and low-pass filtering at further distances. The green curve is Spread, which determines how much of the soundfield a sound takes up. At its minimum, the sound is a point source and the direction it’s coming from is easy to pinpoint. At its maximum, it fills the soundfield regardless of its position around you. I have it set to around 75% when you’re next to someone, as that still gives a sense of directionality to the speaker while reinforcing their proximity to you and making you feel like part of the conversation by filling out the soundfield.
I also used the cone attenuation feature, which can attenuate the volume and introduce filtering based on which way a sound source is facing. As sound is projected from a person’s mouth on the front of their body, it made sense to have things a little quieter if you’re behind them. Here’s a quick little video demo (watch the bottom right corner to see what’s changing):
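The cone idea boils down to measuring the angle between the source’s facing direction and the direction to the listener, then blending between no attenuation (inside an inner cone) and full attenuation (outside an outer cone). A 2D sketch, with the cone angles and attenuation amount chosen purely for illustration:

```python
import math

def cone_attenuation_db(source_forward, to_listener, inner_deg=90.0,
                        outer_deg=180.0, max_atten_db=-6.0):
    """Attenuate based on the angle between a source's facing and the listener.

    Inside the inner cone: no attenuation. Beyond the outer cone: full
    attenuation. In between: a linear blend.
    """
    def norm(v):
        m = math.hypot(v[0], v[1])
        return (v[0] / m, v[1] / m)
    f, d = norm(source_forward), norm(to_listener)
    dot = max(-1.0, min(1.0, f[0] * d[0] + f[1] * d[1]))
    angle = math.degrees(math.acos(dot))
    if angle <= inner_deg / 2:
        return 0.0
    if angle >= outer_deg / 2:
        return max_atten_db
    t = (angle - inner_deg / 2) / (outer_deg / 2 - inner_deg / 2)
    return t * max_atten_db
```

Standing directly behind a character lands you outside the outer cone, hence a little quieter, which matches the intuition of sound projecting from the front of the body.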
In-game playback is relatively simple too, at least on my end. The coders made a custom Anim Event in Unity that allowed me to attach Wwise events to an animation, with looping animations having a separate start-up animation so that they don’t constantly retrigger the audio. The Wwise events contain a switch action to select the type of conversation and a play action for the VOX_CharConvo container. We handle the character selection through code. We also have a generic Stop event, with a very slight fade out, for all dialogue types; it’s triggered when characters start moving or play other animations (such as attacks). It’s not pretty but it works.
The Anim Event takes in a String argument, in this case the Wwise event name, and uses that in an AkSoundEngine.PostEvent call. We use this for every animation that requires audio outside of locomotion, and have all of the required actions baked into a single Wwise event instead of multiple Anim Events scattered across the animation timeline. We discovered that the scattered approach led to slowdown and misfires and was generally unreliable, so the folks at Audiokinetic advised us to go down the single-event route, which worked a treat (thanks Max!). As a quick example, here’s the event for a Bludgeon Attack, which has 14 different Actions going on:
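In spirit, the Anim Event handler is tiny: one named event in from the animation timeline, one PostEvent call out. A conceptual sketch in Python, not the actual Unity C# (the real call is AkSoundEngine.PostEvent; the sound-engine object here is a stand-in):

```python
class AnimEventAudio:
    """Conceptual stand-in for the custom Unity Anim Event handler.

    All switch and play actions are baked into a single named Wwise
    event, so exactly one PostEvent call fires per animation event.
    """
    def __init__(self, sound_engine, game_object_id):
        self.sound_engine = sound_engine
        self.game_object_id = game_object_id

    def on_anim_event(self, event_name: str):
        # Mirrors AkSoundEngine.PostEvent(eventName, gameObject).
        self.sound_engine.post_event(event_name, self.game_object_id)
```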
And let’s stop there. Audio in games can be a bit of a black box at times, so I hope this peek behind the curtain gave you a bit of insight into what goes into part of the process, and all the things we need to work out in order to make it to your ears! If you’re interested in Wwise itself, it’s free to download and comes with a lot of tutorials and resources. You can grab it here.
If you’re looking to try it with Unity then check out Berrak Nil Boya’s videos, which start with Unity’s own tutorials then go into more interesting things. Some layout things may have changed since they were made, but everything still works the same way!
Also, a big thank you to our actors; be sure to look them up:
Don’t forget to check out Murderous Pursuits on Steam too!
Thanks for reading,