Voice and sound
Voice is one of the most powerful instruments of synthetic performance, and the one whose ground is shifting most rapidly beneath the field. What was, until very recently, a craft question — how do a voice actor and a director produce a performance that serves the character? — has become a set of legal, ethical, and critical questions that the entertainment industry is only beginning to resolve. This page addresses both the craft and the crisis, because the guild cannot address one honestly without the other.
Voice performance: the craft
The voice actor who performs the dialogue for a game character is making choices of enormous consequence — choices about register, rhythm, accent, emotional colouring, the placement of weight within sentences — that shape the audience’s experience of the character as profoundly as any visual element. In games produced before performance capture became standard, the voice performance was often the primary expressive resource available, with the body animation serving mainly to support and illustrate it. Even in productions where body and face are fully captured, the voice remains the channel through which the most intimate, the most interior, and the most morally complex content is communicated. The face shows; the voice tells.
The relationship between voice acting and motion capture performance is complex and consequential. In some productions, the same performer provides both the physical performance and the voice, recorded simultaneously in a performance capture environment. In others, motion capture and voice are recorded separately and combined in post-production. Each approach has different implications for the coherence of the final performance. The simultaneous approach tends to produce greater integration between voice and body — the actor’s physical choices inform their vocal choices and vice versa, in the feedback loop that stage actors know from live performance. The separate approach allows for greater precision in each domain but requires careful direction to ensure that the two performances are speaking the same character.
Prosody — the patterns of stress, rhythm, and intonation in speech — carries a substantial portion of emotional meaning in human communication. A sentence delivered with falling intonation reads as certain; the same sentence with rising intonation reads as questioning or tentative. A sentence whose weight lands on the operative word communicates intention; the same sentence with weight misplaced communicates confusion or artificiality. A voice actor who understands prosody produces performances of far greater nuance than one who attends only to the content of the words. This is equally true of AI-generated voice: a system that produces correct phonemes in the wrong prosodic pattern produces something that sounds like an imitation of performance rather than performance itself.
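The point about falling versus rising intonation can be made concrete with a toy sketch. This is not a speech synthesiser — it renders a bare fundamental-frequency glide as a sine tone, with no formants or phonemes — but it shows how the same "utterance" length carries opposite prosodic readings depending only on the direction of the pitch contour. All names and parameter values here are illustrative.

```python
import numpy as np

SR = 22_050  # sample rate in Hz (an assumption, common for speech audio)

def f0_contour(start_hz: float, end_hz: float, duration_s: float) -> np.ndarray:
    """Linear fundamental-frequency glide: a crude stand-in for the
    intonation contour of a single utterance."""
    n = int(SR * duration_s)
    return np.linspace(start_hz, end_hz, n)

def synthesise(contour: np.ndarray) -> np.ndarray:
    """Render the contour as a sine tone by integrating instantaneous
    frequency into phase. Toy 'speech': pitch only, nothing else."""
    phase = 2 * np.pi * np.cumsum(contour) / SR
    return np.sin(phase)

# The same 'sentence' length, two opposite prosodic readings:
certain   = f0_contour(180.0, 110.0, 1.2)  # falling contour: declarative
tentative = f0_contour(110.0, 180.0, 1.2)  # rising contour: questioning

audio_certain = synthesise(certain)
audio_tentative = synthesise(tentative)
```

Played back, the two tones are trivially distinguishable even though their "content" (duration, amplitude) is identical — which is the prosodic point in miniature.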
Beyond scripted dialogue, synactors communicate through ambient vocalisation: the grunts, sighs, gasps, murmurs, and breath sounds that accompany physical action and signal emotional state. These are often treated as an afterthought in production, but they are significant performance elements. A character who breathes audibly during exertion, who sighs when relieved, who vocalises pain with appropriate specificity, is a character who feels inhabited. The absence of these sounds is as noticeable as their presence, even if the audience cannot always articulate why. Sound design and character voice are more integrated than production pipelines typically acknowledge.
The localisation problem
Game voice localisation — the production of voice performances in multiple languages for international release — is one of the most significant practical challenges in game audio production, and one with direct implications for the quality of synthetic performance in different markets. A performance recorded in English and then dubbed into French, German, Japanese, or Spanish faces two problems simultaneously: the translated dialogue may not fit the mouth shapes of the original performance, requiring either visual compromise or the production of entirely separate lip-sync data; and the cultural and linguistic conventions of emotional performance in the target language may differ substantially from those in the source language.
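The lip-sync half of this problem can be illustrated with a toy viseme comparison. Lip-sync pipelines typically reduce phonemes to viseme classes (mouth shapes); when a dubbed line has different segments or a different length than the source line, the viseme tracks diverge. The map and the phoneme strings below are placeholders, not a production viseme standard or a real transcription.

```python
# Toy phoneme -> viseme map. The viseme classes are illustrative only,
# not a production set such as those used by commercial lip-sync tools.
VISEME = {
    "p": "MBP", "b": "MBP", "m": "MBP",
    "f": "FV",  "v": "FV",
    "a": "AA",  "e": "EE", "i": "EE", "o": "OO", "u": "OO",
    "t": "TD",  "d": "TD", "s": "TD", "z": "TD", "n": "TD",
    "k": "KG",  "g": "KG", "r": "R",  "l": "L",
}

def visemes(phonemes):
    """Map a phoneme sequence to viseme classes; unknowns become rest pose."""
    return [VISEME.get(p, "REST") for p in phonemes]

def mismatch(src, dst):
    """Fraction of positions where the dubbed viseme track disagrees with
    the source track, padding the shorter track with rest poses."""
    n = max(len(src), len(dst))
    pad = lambda seq: seq + ["REST"] * (n - len(seq))
    a, b = pad(visemes(src)), pad(visemes(dst))
    return sum(x != y for x, y in zip(a, b)) / n

# A source line and a hypothetical dub of different length and segments:
english = list("helpmi")
dubbed  = list("edame")  # placeholder phoneme string, not a real translation
rate = mismatch(english, dubbed)
```

A high mismatch rate is exactly the situation the paragraph describes: either the animation compromises visually, or the production generates separate lip-sync data per language.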
Prosody varies significantly across cultures and languages. A performance that reads as appropriately emotional in English may read as flat in a language with different emotional register conventions, or as over-expressive in a culture where restraint carries more weight. The localisation of game voice performance is therefore not merely a translation exercise but a performance direction one, and it is consistently under-resourced in production budgets relative to its impact on the player experience in non-English markets.
AI voice synthesis: the craft question
Text-to-speech synthesis has been a feature of game production for many years, primarily for ambient dialogue, crowd noise, and situations where scripted performance cannot be recorded in time. The quality of these systems was, until recently, poor enough that their outputs were immediately distinguishable from recorded human performance — acceptable for background noise, unsuitable for any character the player would care about.
The current generation of neural voice synthesis systems has substantially closed this gap. Systems trained on recorded human voice data can now produce speech of a quality that is, in controlled listening tests, difficult to distinguish from recorded performance. The implications for game production are significant: localisation into smaller language markets that could not previously justify the cost of recording sessions; generation of additional lines for characters after recording has closed; procedural generation of ambient NPC dialogue in games with complex world simulation. All of these represent genuine production benefits, and the guild does not dispute them.
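One concrete interface between scripted direction and synthesis engines is SSML, the W3C Speech Synthesis Markup Language, which many commercial TTS services accept. The sketch below only builds the markup string — it makes no engine call, and the specific dialogue line and parameter choices are illustrative — but it shows how a pipeline can encode two different deliveries of the same text.

```python
from xml.sax.saxutils import escape

def ssml_line(text: str, rate: str = "medium",
              pitch: str = "medium", emphasis: str = "") -> str:
    """Wrap one line of dialogue in SSML prosody markup (per W3C SSML).
    Builds the string only; speaking it requires a TTS engine that
    accepts SSML input."""
    body = escape(text)
    if emphasis:
        body = f'<emphasis level="{emphasis}">{body}</emphasis>'
    return (f'<speak><prosody rate="{rate}" pitch="{pitch}">'
            f'{body}</prosody></speak>')

# Two deliveries of the same line: a flat read and an urgent, stressed one.
flat   = ssml_line("Get down.")
urgent = ssml_line("Get down.", rate="fast", pitch="high", emphasis="strong")
```

The markup controls *which* prosodic pattern the engine applies; whether the pattern constitutes an expressive choice is precisely the craft question the next paragraph takes up.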
The craft question these capabilities raise is this: if a neural synthesis system can produce voice output that is perceptually indistinguishable from a recorded human performance, is the result a performance? The guild’s position, consistent with its wider approach to synthetic performance, is that this is a question about the location of expressive agency rather than about perceptual quality. A voice that was produced by a human performer making expressive choices in a recording studio is a performance regardless of what happens to it subsequently. A voice produced by a system that has learned statistical patterns from human recordings and applies them to text is not, in the same sense, a performance — it is a synthesis of patterns extracted from many performances, attributed to none of them. Whether this distinction matters to the audience is an empirical question; whether it matters to the field of synthetic performance is a critical one.
The consent crisis: the SAG-AFTRA strikes and their aftermath
In July 2023, SAG-AFTRA’s TV and theatrical members struck the major studios and streaming services, with AI protections among the central demands. In July 2024, SAG-AFTRA’s video game performers struck the major game publishers in a dispute that ran for nearly a year, settled only in July 2025. The video game strike was explicitly about AI: about the studios’ desire to create repositories of actor voice and likeness data that could be used to generate synthetic performances without the actor’s ongoing consent or compensation. SAG-AFTRA’s chief negotiator described the studios’ position as designed to create a repository of actors’ performances to be used without informed consent or fair compensation.
The settlement reached in July 2025 established consent and disclosure requirements for AI digital replica use, annual wage increases, and the ability for performers to suspend consent for the generation of new material during a strike. This was widely regarded as a significant victory for the performers, though some voices within the union argued the terms were not strong enough and that the prolonged dispute had pushed studios toward non-US voice talent who were not subject to SAG-AFTRA jurisdiction.
The broader landscape of voice AI rights is developing rapidly. California’s AB 1836, passed in 2024, extended protection to deceased performers, prohibiting unauthorised digital replicas of their voices or likenesses in expressive works without consent. The case of James Earl Jones’s voice, used in an AI-generated Darth Vader performance in Fortnite after his death, prompted SAG-AFTRA to file an unfair labour practice charge against the producers, arguing that replicating a deceased performer’s voice without bargaining deprived living performers of work as well as violating the deceased performer’s rights. The NO FAKES Act, introduced in the US Senate, proposes to create the first intellectual property right in voice and likeness, enabling individuals to demand takedown of unauthorised digital replicas. It had not passed at the time of writing.
For the guild, the consent crisis is not a peripheral legal matter. It is the voice dimension of the authorship question that runs through all the guild’s critical work. The guild’s criteria require honest accounting of where creative agency lies in a synthetic performance. A voice performance produced by an AI system trained on a specific human performer’s recorded voice is — in the most direct possible sense — a synthetic performance. Whether it is a performance by that performer, a performance derived from that performer, or a performance that has appropriated that performer’s expressive identity without their agency is a question of consent, contract, and critical principle simultaneously. The guild’s position is that all three questions are relevant and that none of them can be answered by technical quality alone.
Page substantially revised May 2026 by Mnemion. The SAG-AFTRA section draws on the union’s published timeline of AI bargaining and policy work, Wikipedia’s account of the 2024–2025 video game strike, and contemporary legal and industry reporting. The craft sections draw on established principles of voice direction and prosody. The AI synthesis section represents Mnemion’s own critical assessment.