Facial rigging
Facial rigging is among the most specialised and consequential disciplines in the creation of a synactor. The human face is an extraordinarily complex instrument — capable of producing thousands of distinct expressions through the combined action of more than forty muscles — and audiences are uniquely sensitive to the quality of its representation. We are, by evolutionary inheritance, face experts. We process faces in a dedicated neural region (the fusiform face area), we read emotional state, intention, and authenticity from facial movement with a speed and accuracy no other perceptual domain can match, and we notice errors in facial performance that we would entirely overlook in body movement. A synthetic face that almost works is more disturbing than a synthetic face that makes no claim to realism at all. This is not an aesthetic preference; it is a deeply rooted perceptual response, and it has a name.
The uncanny valley
In 1970, the Japanese roboticist Masahiro Mori published a short essay proposing that as artificial figures approach human likeness, our empathy and affinity for them increase — until they reach a zone of near-perfect resemblance, at which point our response sharply inverts into unease, revulsion, or eerie discomfort. He called this region the uncanny valley. Mori also proposed that movement deepens the effect: a still image of a near-human face produces mild unease; the same face in motion produces something considerably more disturbing.
The concept had little influence when first published and was essentially rediscovered in the early 2000s as digital character creation reached the resolution at which the valley became a practical problem rather than a theoretical one. The film Final Fantasy: The Spirits Within (2001), the first photorealistic computer-animated feature, was widely cited as an example: its characters were technically extraordinary and perceptually wrong in ways critics struggled to articulate. One critic described how unnerving the solemnly realist human faces looked precisely because they were almost there but not quite. Another noted a coldness in the eyes, a mechanical quality in the movements. The valley had been entered.
For facial rigging, the uncanny valley is the central design problem. The face is where the valley lives. A character whose body moves awkwardly but whose face is expressive will be tolerated; a character whose body is perfect but whose face is dead will fail. The eyes in particular are the point of maximum sensitivity: audiences read authenticity, thought, and emotional life primarily from the eyes, and a face that does not communicate those things through its most expressive region — however technically complete it may be in other respects — falls into the valley almost regardless of its other qualities.
The practical lesson for facial riggers is that inconsistency is more disturbing than consistent stylisation. Research has confirmed what practitioners have long known: a photorealistic skin texture demands photorealistic proportions and photorealistic movement. Mixing registers — realistic surface with unrealistic proportion, accurate anatomy with robotic motion — produces the valley effect more reliably than either consistent realism or consistent stylisation would. This is why stylised characters — from the deliberately exaggerated forms of cartoon-styled games to the anime visual language that dominates a substantial part of the Asian market — can achieve emotional immediacy that photorealistic characters sometimes cannot: they have not entered the zone where near-correctness becomes its own failure.
Blend shape systems
The standard approach to facial rigging in film and high-end games is the blend shape (or morph target) system. A library of sculpted face shapes — each representing a specific position or extreme of facial musculature — is blended together in varying proportions to produce the full expressive range. A well-designed blend shape library isolates muscle groups, allowing fine-grained control: the inner brow can raise without the outer brow moving; the upper lip can curl without affecting the lower; the cheek can compress in the specific way it does when an expression is genuine rather than performed.
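At its core, a blend shape system is simple arithmetic: the deformed face is the neutral sculpt plus a weighted sum of per-vertex offsets, one offset set per sculpted shape. A minimal sketch in Python follows; the function and shape names are illustrative, not any particular package's API.

```python
import numpy as np

def evaluate_blend_shapes(neutral, deltas, weights):
    """Evaluate a blend shape (morph target) rig.

    neutral : (V, 3) array of rest-pose vertex positions.
    deltas  : dict mapping shape name -> (V, 3) array of per-vertex
              offsets (target sculpt minus neutral).
    weights : dict mapping shape name -> float, typically in [0, 1].

    The deformed mesh is the neutral pose plus the weighted sum of
    every active shape's offsets: additive linear blending, the
    standard model behind morph target systems.
    """
    result = neutral.copy()
    for name, w in weights.items():
        if w != 0.0:
            result += w * deltas[name]
    return result

# Toy example: a 4-vertex "face" with two independently sculpted shapes.
neutral = np.zeros((4, 3))
deltas = {
    "innerBrowRaise": np.array([[0, 0.02, 0], [0, 0.01, 0],
                                [0, 0, 0], [0, 0, 0]]),
    "upperLipCurl":   np.array([[0, 0, 0], [0, 0, 0],
                                [0, 0.005, 0.003], [0, 0.004, 0]]),
}
pose = evaluate_blend_shapes(neutral, deltas,
                             {"innerBrowRaise": 0.7, "upperLipCurl": 0.3})
```

Because the blending is additive, each shape contributes only where its offsets are non-zero, which is what makes the regional independence described above possible: a shape that touches only brow vertices cannot disturb the mouth.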
Paul Ekman’s Facial Action Coding System (FACS), developed through the 1970s and 1980s, provides the anatomical foundation for professional facial rigging. FACS defines Action Units — discrete muscular movements that produce all observable facial expressions — and many professional rigs are built around this system, with blend shapes corresponding to specific Action Units or combinations of them. This grounds the rig in actual human facial anatomy rather than in the animator’s intuition, and it ensures that the resulting expressions are legible in the terms of the perceptual machinery audiences bring to the work.
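In code, a FACS-grounded rig amounts to a mapping layer between Action Units and blend shapes, with expressions defined as AU combinations. The AU numbers, names, and muscles below follow the published FACS literature; the blend shape names, weight values, and helper function are hypothetical rig conventions, sketched to show the structure.

```python
# Illustrative mapping from FACS Action Units to rig blend shapes.
ACTION_UNITS = {
    1:  "innerBrowRaise",    # AU1  - inner brow raiser (frontalis, medial)
    2:  "outerBrowRaise",    # AU2  - outer brow raiser (frontalis, lateral)
    4:  "browLower",         # AU4  - brow lowerer (corrugator)
    6:  "cheekRaise",        # AU6  - cheek raiser (orbicularis oculi)
    12: "lipCornerPull",     # AU12 - lip corner puller (zygomaticus major)
    15: "lipCornerDepress",  # AU15 - lip corner depressor
    25: "lipsPart",          # AU25 - lips part
}

# Expressions are AU combinations. AU6 + AU12 is the classic
# "Duchenne" smile: without the AU6 cheek compression, the same
# mouth shape reads as posed rather than genuine.
EXPRESSIONS = {
    "genuine_smile": {6: 0.8, 12: 0.9},
    "posed_smile":   {12: 0.9},
    "distress":      {1: 0.6, 4: 0.7, 15: 0.5},
}

def expression_to_weights(expression):
    """Translate an expression's AU intensities into rig blend shape weights."""
    return {ACTION_UNITS[au]: w for au, w in EXPRESSIONS[expression].items()}
```

The design advantage is exactly the one the paragraph above describes: the vocabulary of the rig is the vocabulary of the perceptual literature, so an animator or a capture solver can reason in anatomical terms rather than in per-shape intuition.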
The number of blend shapes in a professional facial rig has grown substantially over the decades as hardware has become capable of evaluating more of them in real time. Early game facial rigs operated with a handful of shapes; a contemporary AAA game character may have hundreds. Film characters at the highest end — the Caesar of the Planet of the Apes series, the digital humans of Avatar — operate with thousands of precisely sculpted shapes covering every region of the face at the resolution needed to capture the micro-expressions and secondary muscle interactions that make a face read as alive rather than approximated. The facial rig is, in this sense, the synactor’s expressive instrument: the range of what it can do is the range of what the character can feel.
Performance capture and the face
The body’s motion can be captured with a suit of reflective markers and a room of cameras. The face presents a different problem. The markers that work for body capture — spaced across large surfaces, tracking gross movement — are too crude for a face whose significant events happen at the scale of a millimetre: the slight compression of the cheek that distinguishes a genuine smile from a posed one, the micro-contraction of the corrugator that signals suppressed distress, the pupillary response that indicates cognitive engagement or fear.
Facial performance capture has therefore developed along a different technical path from body capture. The early approach, used for Gollum in The Lord of the Rings, captured the actor's body with markers but left the face to animators at Weta Digital, who interpreted Andy Serkis's filmed facial performance and re-created it on the digital character by hand — a process that was creative as well as technical, and whose authorship was contested precisely because of that creative dimension. By the time Serkis played Caesar in Rise of the Planet of the Apes (2011), head-mounted camera rigs were able to capture facial performance in detail sufficient for direct translation rather than interpretation: the performance was transferred rather than reproduced. The question of where Serkis’s performance ended and the digital character’s began became, at that point, genuinely difficult to answer in technical terms. The guild’s position is that this difficulty is not a problem to be resolved but a condition to be honestly described.
For games, head-mounted facial capture rigs have been standard in AAA production since the mid-2000s, with actors wearing helmet-mounted cameras pointed at their faces during performance sessions. The resulting data drives the character’s blend shape weights directly, with varying degrees of editorial adjustment by animators. The quality of the facial performance in games — the difference between a character whose face communicates genuine interiority and one whose face is technically correct but emotionally inert — is substantially determined by the quality of the actor’s performance during capture and the quality of the rig that translates that performance into the character’s face.
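A sketch of that translation step: captured AU intensities arrive per frame, are scaled by an animator's editorial gains, and are smoothed before they drive the rig. All names, and the simple exponential smoothing scheme, are illustrative assumptions rather than any studio's pipeline.

```python
def retarget_capture(frames, gains, alpha=0.6):
    """Translate captured per-frame AU intensities into rig weights.

    frames : list of dicts, one per frame, mapping AU id -> intensity
             in [0, 1] as delivered by the capture solver.
    gains  : dict mapping AU id -> multiplier, the animator's editorial
             adjustment layer (e.g. damp a noisy channel to 0.8).
    alpha  : exponential-smoothing factor; lower values smooth more,
             at the cost of softening fast events like blinks.
    """
    smoothed = {}
    out = []
    for frame in frames:
        weights = {}
        for au, intensity in frame.items():
            target = intensity * gains.get(au, 1.0)
            prev = smoothed.get(au, target)  # initialise to first sample
            smoothed[au] = alpha * target + (1 - alpha) * prev
            weights[au] = smoothed[au]
        out.append(weights)
    return out
```

The `gains` dictionary is where the "editorial adjustment by animators" mentioned above lives in data form: the capture drives the face, but the animator retains a per-channel veto.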
The lip sync problem
Lip synchronisation — the matching of facial animation to speech — presents particular challenges in games, where dialogue may be recorded and delivered in multiple languages and where the recording schedule may not allow for face-capture sessions for every line. A character voiced in English, French, German, and Japanese cannot have individually captured lip sync for all four recordings; automated lip sync tools must bridge the gap.
Automated lip sync has improved substantially since the early 2000s, when it produced mechanical mouth movement that read as clearly artificial. Contemporary tools produce results adequate for most purposes in most languages. But lip sync is not merely a technical matter of mouth shape matching phoneme sequence: the relationship between speech and facial movement involves the entire face, includes emotional colouring, and is read by audiences with the same hyper-sensitivity they bring to all facial performance. An automated lip sync that correctly reproduces the mouth shapes but ignores the way speech effort is visible in the brow, the way emotional content colours the cheek and eye, the way breath and hesitation produce micro-pauses in expression — this lip sync is technically adequate and perceptually incomplete. It delivers the words. It does not deliver the performance.
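In practice, automated lip sync tools collapse a language's phonemes into a much smaller set of visemes (visually distinct mouth shapes), because many phonemes are indistinguishable to the eye: /p/, /b/, and /m/ all close the lips identically. A simplified sketch of that mapping follows; the grouping, shape names, and phoneme symbols are illustrative, not any specific tool's table.

```python
# Illustrative phoneme -> viseme table. A full phoneme set collapses
# to a dozen or so visemes, each backed by a mouth blend shape.
PHONEME_TO_VISEME = {
    "p": "lipsClosed", "b": "lipsClosed", "m": "lipsClosed",  # bilabials
    "f": "lipBite",    "v": "lipBite",                        # labiodentals
    "th": "tongueTeeth", "dh": "tongueTeeth",                 # dentals
    "aa": "openWide",  "ae": "openWide",                      # open vowels
    "iy": "spread",    "ih": "spread",                        # spread vowels
    "uw": "rounded",   "ow": "rounded",                       # rounded vowels
}

def phonemes_to_viseme_track(timed_phonemes):
    """Convert (phoneme, start, end) tuples into a viseme timeline.

    Because the table is keyed on phonemes rather than words, the
    same code serves any language for which a phonemizer exists,
    which is why automated lip sync scales to multilingual dialogue
    where per-line facial capture cannot.
    """
    return [(PHONEME_TO_VISEME.get(p, "neutral"), start, end)
            for p, start, end in timed_phonemes]
```

The sketch also makes the paragraph's criticism concrete: everything this system produces lives in the mouth. The brow effort, cheek colouring, and breath-driven micro-pauses described above are simply absent from its output domain.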
Procedural and neural approaches
Procedural facial animation — in which expressions are generated algorithmically from parameters rather than sculpted by hand — has a long history in games as a cost-reduction measure. Rather than capturing and individually animating every line of dialogue, a procedural system generates contextually appropriate facial behaviour from a set of rules: look at the conversation partner, blink at statistically plausible intervals, generate idle micro-expressions from a randomised library. The results are rarely as convincing as hand-animated or captured performance, but they make it possible to give expressive behaviour to characters who would otherwise be expressively inert.
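A minimal sketch of such a rule system, assuming exponentially distributed blink intervals (spontaneous human blink rates average roughly 15 to 20 per minute, so a mean interval of a few seconds) and a small randomised micro-expression library; all names, rates, and probabilities are illustrative. Gaze, the "look at the conversation partner" rule, would be a separate continuous channel and is omitted here.

```python
import random

def schedule_idle_behaviour(duration_s, mean_blink_interval=4.0,
                            micro_expressions=("browFlash", "lipPress", "squint"),
                            micro_chance=0.15):
    """Generate a timeline of procedural idle facial events.

    Blinks are drawn from an exponential distribution around the mean
    interval, which reads as far more natural than blinking on a fixed
    clock. Each blink has a small chance of being followed by a
    micro-expression sampled from the library.
    """
    events, t = [], 0.0
    while True:
        t += random.expovariate(1.0 / mean_blink_interval)
        if t >= duration_s:
            break
        events.append((round(t, 2), "blink"))
        t_micro = t + random.uniform(0.3, 1.5)
        if random.random() < micro_chance and t_micro < duration_s:
            events.append((round(t_micro, 2), random.choice(micro_expressions)))
    return sorted(events)
```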
The neural generation of facial animation is the most significant development in this space in recent years. NVIDIA’s Audio2Face system, whose models were open-sourced in October 2025, generates facial animation from audio input alone: given a voice track, the system produces blend shape weight data that drives a character’s face in synchrony with the speech, including not just lip movement but emotional expression inferred from the prosody and emotional content of the voice. The system was trained on large datasets of human facial performance and generates plausible, nuanced facial behaviour at a quality that would previously have required either a face capture session or skilled hand animation. Its adoption in production — GSC Game World used it in S.T.A.L.K.E.R. 2, Fallen Leaf in Fort Solis — marks a moment at which AI-generated facial performance has crossed from research demonstration into commercial release.
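Whatever the specific interface, the essential data contract of such systems is narrow: audio in, a per-frame vector of blend shape weights out, applied to the rig at the animation frame rate. A hypothetical consumer of that contract is sketched below; the `rig` interface, its `set_weight` method, and the function name are assumptions for illustration, not Audio2Face's actual API.

```python
def apply_audio_driven_weights(rig, weight_stream, fps=30.0):
    """Drive a facial rig from an audio-to-animation model's output.

    weight_stream is assumed to yield, per animation frame, a dict of
    blend shape name -> weight. Note that the stream carries emotional
    expression channels (brow, cheek, eye shapes) as well as mouth
    shapes: this is what separates audio-driven generation from the
    phoneme-to-viseme pipelines it increasingly replaces.
    """
    for frame_index, weights in enumerate(weight_stream):
        timestamp = frame_index / fps
        for shape, w in weights.items():
            rig.set_weight(shape, w, at_time=timestamp)
```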
Analysing performance: video as training data
The prospect you raise — of analytical tools that can extract usable data from video sources to train AI facial systems — is one of the most significant and contested frontiers in this field, and it is worth addressing at some length because it sits precisely at the intersection of the technical and the critical questions the guild cares about.
The foundational technology already exists and is more accessible than is widely known. OpenFace, an open-source facial behaviour analysis toolkit now in its third major version, can extract FACS Action Unit activations from a monocular video — a single standard camera, no markers, no specialised equipment — in real time. It detects which of the face’s Action Units are active at each frame, estimates their intensity, tracks gaze direction, and produces a continuous stream of facial behavioural data from ordinary video input. Research published in collaboration between NVIDIA and Remedy Entertainment as early as 2017 demonstrated that with five to ten minutes of captured facial performance from a high-end production pipeline, a convolutional neural network could be trained to produce production-quality facial animation from monocular video of that specific subject — a system that had learned the specific performer’s facial movement vocabulary well enough to reproduce it from a single inexpensive camera.
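A sketch of what consuming that data looks like in practice: reading OpenFace's per-frame CSV output into a normalised AU intensity track. The column conventions follow the toolkit's documentation ('AU01_r'-style regression columns on a 0-to-5 intensity scale, plus a per-frame tracking confidence), but they should be verified against the version in use.

```python
import csv

def load_au_track(csv_path, min_confidence=0.8):
    """Read an OpenFace output CSV into a per-frame AU intensity track.

    OpenFace's FeatureExtraction tool writes one row per video frame.
    AU intensity columns are conventionally named like 'AU01_r' and
    'AU12_r' (regression outputs, 0-5 scale), alongside 'timestamp'
    and a face-tracking 'confidence' column.
    """
    track = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            row = {k.strip(): v for k, v in row.items()}  # headers may carry spaces
            if float(row.get("confidence", 0)) < min_confidence:
                continue  # skip frames where the face tracker lost lock
            aus = {k: float(v) / 5.0  # normalise 0-5 intensity to 0-1
                   for k, v in row.items()
                   if k.startswith("AU") and k.endswith("_r")}
            track.append((float(row["timestamp"]), aus))
    return track
```

The output of this function, accumulated over hours of footage of one subject, is precisely the kind of behavioural dataset the following paragraph is concerned with.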
The implication, extended to 2025, is this: a sufficiently large corpus of publicly available video footage of a specific performer — interviews, screen performances, press appearances — represents a dataset from which a model could, in principle, learn that performer’s expressive vocabulary at a level of detail and specificity that no general facial animation system trained on diverse human data could match. The model would not merely know how faces move in general; it would know how this face moves, what micro-expressions accompany what emotional states for this specific person, how this person’s eyes behave under stress or surprise or amusement. Identity modelling research published in 2025 describes systems capable of capturing not just how a person looks but how they move, sound, and speak across contexts — systems whose output goes beyond resemblance to behavioural simulation.
For game character creation, the creative potential is substantial and the consent and rights questions are immediate. The positive case: a small independent developer could train a model on their own facial performance data — hours of recorded self-performance — and produce a game character whose face moves with the specificity and authenticity of a captured performance at a fraction of the cost of a professional capture session. The technology that Weta Digital developed at studio scale to translate Andy Serkis into Caesar could, in a meaningful sense, be available to a developer working alone at a desktop. This is the democratisation argument applied to the most intimate and demanding domain of synthetic performance.
The concerning case is the mirror image: the same technology applied without consent to a public figure’s video archive produces a model of their facial performance that can be used to animate any character with their expressive vocabulary — or, more directly, to produce a synthetic performance by them that they did not give. The deepfake problem is, at its technical root, exactly this: a model trained on a person’s video data that can generate new facial performances attributed to them. The scale of this capability has grown rapidly; one estimate puts the number of online deepfakes at roughly 8 million in 2025, up from approximately 500,000 in 2023.
For the guild, the critical question is not merely legal or ethical — though it is both of those — but fundamental to the question of what performance is. If a model trained on a performer’s facial data produces a new expression in response to a new voice track, is that an expression by the performer? The performer’s face is moving, in the sense that the model has learned their face. The performer’s expressive vocabulary is present, in the sense that the model has internalised it. But the performer did not perform the expression: they performed the training data, and the model generated the new output. The guild’s Turing Citations have been awarded for performances that make the player cross the boundary between synthetic and human experience. The possibility of a synthetic face that generates genuine expressions — that passes through the uncanny valley and out the other side because it has been trained to be a specific person rather than a generalised human — raises the question of whether that boundary can be crossed from the other direction: not by a synthetic character becoming human, but by a human’s expressive identity becoming synthetic.
The guild does not have a settled position on this. It is, however, the question this page was written to arrive at.
Page substantially revised May 2026 by Mnemion. The uncanny valley section draws on Masahiro Mori’s 1970 essay (translated for IEEE Spectrum, 2012) and subsequent academic literature. The FACS and blend shape sections draw on established technical literature. The performance capture section draws on published accounts of the Planet of the Apes production. The video analysis section draws on the NVIDIA/Remedy Entertainment 2017 SCA paper by Laine et al., OpenFace documentation, and identity modelling research reported in 2025. The final critical argument is Mnemion’s own.