Voice Box

Hörspieltage ZKM, 2024

The Voicebox is an interactive installation that merges artificial intelligence, voice-cloning technology, and generative storytelling to create personalized, ephemeral audio narratives. Positioned at the intersection of human expression and machine authorship, The Voicebox invites visitors to experience the uncanny sensation of hearing themselves perform in AI-generated radio plays: narratives they did not write, yet star in.

Overview

Participants step into an enclosed recording booth outfitted with high-fidelity microphones and an intuitive user interface. Upon entering, visitors are prompted to record a short vocal sample, typically 30–60 seconds of read text or spontaneous speech. This recording serves as the raw material for a multi-stage generative process that transforms the participant’s voice into the central performance of a bespoke, machine-written radio play.

Within minutes, participants are presented with a surreal, professionally mixed radio drama featuring their own voice acting out characters, emotions, and dialogue, all orchestrated by AI. Each resulting piece is ephemeral: once played, the audio file is discarded, underscoring the installation's themes of impermanence and authorship.
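
Read as a pipeline, a single visit moves through five stages, described in detail under Technical Architecture below. The skeleton that follows is a sketch only: every function name is a hypothetical placeholder for the components covered in the next section.

    # Illustrative skeleton of a Voice Box session; all names are hypothetical
    # placeholders for the stages described under "Technical Architecture".
    from pathlib import Path

    def preprocess_audio(recording: Path) -> Path: ...             # 1. capture and clean-up
    def clone_voice(sample: Path) -> object: ...                    # 2. few-shot voice clone
    def generate_script(sample: Path) -> str: ...                   # 3. LLM-written radio script
    def render_radio_play(script: str, voice: object) -> Path: ...  # 4. synthesis, sound design, mix
    def play_and_discard(mix: Path) -> None: ...                    # 5. playback, then deletion

    def run_session(raw_recording: Path) -> None:
        """One visit: a short voice sample in, an ephemeral radio play out."""
        sample = preprocess_audio(raw_recording)
        voice = clone_voice(sample)
        script = generate_script(sample)
        mix = render_radio_play(script, voice)
        play_and_discard(mix)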

Technical Architecture

1. Voice Capture and Preprocessing
The system begins with high-quality audio capture using a studio-grade cardioid condenser microphone, pre-amplified and digitized through a low-noise audio interface. Signal processing includes noise reduction, dynamic range compression, and normalization. The cleaned waveform is then segmented and analyzed for phoneme patterns and vocal timbre.
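
A minimal sketch of this clean-up stage, assuming the booth recording arrives as a WAV file and using NumPy, SciPy, and soundfile; the filter settings, compression curve, and headroom target are illustrative values, not the installation's actual parameters.

    import numpy as np
    import soundfile as sf
    from scipy.signal import butter, sosfilt

    def preprocess(in_path: str, out_path: str) -> None:
        """Basic clean-up: mono fold-down, high-pass filter, soft compression, peak normalization."""
        audio, sr = sf.read(in_path)
        if audio.ndim > 1:                      # fold stereo down to mono
            audio = audio.mean(axis=1)

        # High-pass filter at 80 Hz to remove rumble and handling noise.
        sos = butter(4, 80, btype="highpass", fs=sr, output="sos")
        audio = sosfilt(sos, audio)

        # Crude dynamic range compression: attenuate peaks above a threshold.
        threshold, ratio = 0.5, 4.0
        over = np.abs(audio) > threshold
        audio[over] = np.sign(audio[over]) * (threshold + (np.abs(audio[over]) - threshold) / ratio)

        # Peak normalization with roughly 1 dB of headroom.
        audio = audio / (np.max(np.abs(audio)) + 1e-9) * 0.9

        sf.write(out_path, audio, sr)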

2. Voice Cloning Engine
A neural voice synthesis model adapted through few-shot learning (a YourTTS-, VALL-E-, or MetaVoice-style architecture) creates a synthetic clone of the participant's voice. This process takes less than two minutes, leveraging GPU-accelerated inference pipelines. The cloned voice is capable of expressive prosody modeling, allowing dynamic emotional inflection within the AI-generated performance.
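
A hedged sketch of this step using the open-source Coqui TTS implementation of YourTTS as a stand-in; whether the installation uses this particular library, model, or language setting is an assumption.

    # Few-shot voice cloning with Coqui TTS's YourTTS model (a stand-in; the
    # installation's actual model and serving pipeline are not specified here).
    from TTS.api import TTS

    # Load a multilingual, multi-speaker model capable of zero-shot cloning.
    tts = TTS("tts_models/multilingual/multi-dataset/your_tts")

    # Render one line of the generated script in the participant's voice,
    # conditioning on the short booth recording.
    tts.tts_to_file(
        text="One line of the generated radio script goes here.",
        speaker_wav="participant_sample.wav",
        language="en",
        file_path="line_001.wav",
    )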

3. Narrative Generation
A narrative is generated in real time by a transformer-based large language model (LLM). The system prompts the LLM with a combination of pre-authored story seeds, genre templates (e.g., noir, sci-fi, horror, surrealist), and metadata extracted from the participant’s voice (e.g., gendered cues, tone, language). The LLM outputs a coherent radio script, typically two to three minutes in runtime, structured in acts with monologues, dialogues, and ambient descriptions.
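
A minimal sketch of the prompting step, assuming an OpenAI-style chat completions API; the model name, prompt wording, and metadata fields are illustrative assumptions rather than the installation's actual configuration.

    # Sketch of radio-script generation against an OpenAI-style chat API;
    # model, prompts, and metadata fields are assumptions for illustration.
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    def generate_script(genre: str, voice_metadata: dict) -> str:
        """Ask the LLM for a radio play of roughly two to three minutes of runtime."""
        system = ("You write short radio plays for a single lead voice, structured "
                  "in acts with monologue, dialogue, and ambient descriptions.")
        user = (f"Genre: {genre}. Lead voice characteristics: {voice_metadata}. "
                "Write a script with roughly two to three minutes of spoken runtime. "
                "Mark sound-design cues in [brackets].")
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "system", "content": system},
                      {"role": "user", "content": user}],
        )
        return response.choices[0].message.content

    script = generate_script("noir", {"tone": "low, unhurried", "language": "en"})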

4. Voice Synthesis and Sound Design
The cloned voice is then used to synthesize the spoken parts of the narrative. These segments are arranged into a timeline, layered with AI-curated ambient soundscapes, Foley effects, and generative music composed using procedural audio tools (e.g., Tonic, Riffusion, or DDSP-based models). The final mix is rendered using a digital audio workstation (DAW) backend with automated mastering.
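
A simplified sketch of the layering and mixing step, using pydub as a stand-in for the DAW backend; the stem filenames, gain offsets, and fade lengths are assumptions.

    # Mixing sketch with pydub standing in for the DAW backend; stems, gain
    # offsets, and fades are illustrative assumptions.
    from pydub import AudioSegment

    voice = AudioSegment.from_wav("line_001.wav")
    ambience = AudioSegment.from_wav("ambience_bed.wav") - 12    # duck ambience by 12 dB
    music = AudioSegment.from_wav("generated_theme.wav") - 18    # keep music further back

    # Lay the voice over a bed of ambience and music, entering two seconds in.
    bed = ambience.overlay(music)
    mix = bed.overlay(voice, position=2000)       # overlay positions are in milliseconds

    # Gentle fades, then export the finished piece for playback.
    mix = mix.fade_in(1500).fade_out(3000)
    mix.export("radio_play.wav", format="wav")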

5. Playback and Ephemeral Archiving
The final radio play is streamed directly to the participant via an immersive spatial audio setup inside the installation chamber. Once playback is complete, the files are deleted, reinforcing the transient, performative nature of the experience.
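
A minimal sketch of the playback-then-delete behavior, assuming the finished mix is a local WAV file played through the sounddevice and soundfile packages; the spatial-audio rendering itself is outside the scope of this snippet.

    # Ephemeral playback: play the finished mix once, then delete the file.
    # Assumes a local WAV file; the spatial-audio rendering is not shown.
    from pathlib import Path

    import sounddevice as sd
    import soundfile as sf

    def play_and_discard(mix_path: str) -> None:
        audio, sr = sf.read(mix_path)
        sd.play(audio, sr)        # start playback on the default output device
        sd.wait()                 # block until the piece has finished
        Path(mix_path).unlink()   # discard the audio once it has been heard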

Conceptual Themes

The Voicebox is not merely a technological feat—it is a critical inquiry into contemporary notions of voice, identity, and authorship in the age of AI. As participants hear their own voice manipulated and recontextualized by machines, several layered questions emerge:

  • What defines authorship when a machine constructs the story, but your voice is the protagonist?

  • How does AI challenge traditional boundaries between performer and audience?

  • Can synthetic voices carry real emotional weight?

The installation occupies a liminal space—between human and machine, authenticity and fiction, ephemeral and archival—inviting reflection on the evolving nature of creative expression in the digital age.

Applications & Future Directions

While The Voicebox exists primarily as an interactive art piece, its underlying technologies suggest broader applications:

  • Creative Writing Tools: AI-generated radio scripts can be repurposed as seed material for authors or screenwriters.

  • Therapeutic Audio: Personal storytelling using one’s own voice could find use in therapeutic or memory-capture applications.

  • Voice Identity Research: Insights from participant feedback may inform ethical frameworks around voice cloning and identity protection.

Future iterations may include multilingual support, real-time performance synthesis, or integration into virtual and augmented reality environments.

The Voicebox reimagines the role of the listener, the speaker, and the storyteller—inviting each participant to become the protagonist of a fleeting fiction crafted from their own sonic identity.
