Son of Sara — LLM-Based Embodied Conversational Agent

Overview

Son of Sara, the successor to SARA, is a new LLM-based Embodied Conversational Agent (ECA) developed at CMU ArticuLab and INRIA (National Institute for Research in Digital Science and Technology, Paris). The system is designed to support natural and effective interaction with human users, relying on both rapport and task effectiveness to ensure good collaboration between human and agent.

Building on the lab’s tradition of socially-aware ECAs, Son of Sara uses the generalization capabilities of large language models and recent deep learning methods to improve naturalism and adaptability, while keeping the modular, cascaded pipeline architecture of prior systems.

My Role

Sole developer on the Unity avatar client. Designed and implemented:

JSON API interface for external control of the avatar’s gestures, expressions, and audio
Avatar animation state machine (Unity Mecanim)
Lip sync system from audio input
Blend shape-driven facial expression pipeline
Modular behavior architecture for extensibility in research experiments
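
As an illustration of what a JSON control message for the avatar could look like, here is a minimal Python sketch of the sending side. The schema is an assumption made for this example: the field names (gesture, expression, audio_path) and the helpers encode_command and decode_command are hypothetical and are not taken from the project's actual API.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, Optional


@dataclass
class AvatarCommand:
    """Hypothetical control message; field names are illustrative only."""
    gesture: Optional[str] = None          # name of an animation state to trigger
    expression: Dict[str, float] = field(default_factory=dict)  # blend shape -> weight
    audio_path: Optional[str] = None       # audio clip to play and lip-sync against


def encode_command(cmd: AvatarCommand) -> str:
    """Validate blend shape weights and serialize the command to JSON."""
    for name, weight in cmd.expression.items():
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"blend shape weight out of range: {name}={weight}")
    return json.dumps(asdict(cmd), sort_keys=True)


def decode_command(payload: str) -> AvatarCommand:
    """Parse a JSON payload back into a command object."""
    return AvatarCommand(**json.loads(payload))


cmd = AvatarCommand(gesture="wave", expression={"smile": 0.8}, audio_path="reply.wav")
payload = encode_command(cmd)
```

Keeping gestures, expressions, and audio in one message is only one possible design; it makes it easy for an external module to request behaviors that should start together.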

System Pipeline

The full Son of Sara pipeline takes the user’s voice as input and produces a synchronized, embodied agent response:

Input → Understanding:

Microphone + ASR (Automatic Speech Recognition) — captures and transcribes user voice in real time
VAD (Voice Activity Detection) — handles dialogue dynamics and turn boundaries
Dialogue Manager — manages conversation state and flow

Reasoning → Generation:

LLM — generates a contextually appropriate text response
TTS (Text-to-Speech) — converts to spoken audio
NVB Generation — produces synchronized non-verbal behaviors (gestures, face, gaze)

Output → Unity Avatar:

JSON API → Unity avatar body, blend shape expressions, lip sync
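
The cascade above can be sketched as a single function that passes one user turn through each stage in order. This is a simplified illustration, not the project's code: every stage is a stub callable, the names DialogueManager and run_turn are hypothetical, and VAD and predictive turn-taking are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DialogueManager:
    """Hypothetical minimal dialogue state: an ordered list of turns."""
    history: List[Dict[str, str]] = field(default_factory=list)

    def add(self, role: str, text: str) -> None:
        self.history.append({"role": role, "text": text})


def run_turn(
    user_audio: bytes,
    dm: DialogueManager,
    asr: Callable[[bytes], str],
    llm: Callable[[List[Dict[str, str]]], str],
    tts: Callable[[str], bytes],
    nvb: Callable[[str], List[Dict[str, str]]],
) -> Dict[str, object]:
    """Run one turn: ASR -> dialogue state -> LLM -> TTS + non-verbal behaviors."""
    user_text = asr(user_audio)
    dm.add("user", user_text)
    reply = llm(dm.history)
    dm.add("agent", reply)
    # The returned dict stands in for the message sent to the avatar client.
    return {"text": reply, "audio": tts(reply), "behaviors": nvb(reply)}


# Stub stages, used only to show how the pieces connect.
dm = DialogueManager()
message = run_turn(
    b"\x00\x01",
    dm,
    asr=lambda audio: "hello",
    llm=lambda history: "Hi! You said: " + history[-1]["text"],
    tts=lambda text: text.encode("utf-8"),
    nvb=lambda text: [{"type": "gesture", "name": "wave"}],
)
```

Because each stage is just a callable, any one of them can be swapped out without touching the others, which is the main appeal of a modular cascade in research experiments.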

Predictive Turn-Taking

The system implements Voice Activity Projection (VAP) — a predictive turn-taking model that monitors ongoing conversations in real time and predicts future voice activity patterns for both speakers within a 2-second window.

When VAP predicts an imminent turn yield, the system begins generating a response before the user’s turn concludes, and aborts and restarts generation only if the user goes on to introduce substantially new information. Because generation overlaps with the end of the user’s turn, response latency drops and the conversational rhythm improves.
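
The start-early, abort-if-needed logic can be sketched as follows. This is a minimal synchronous illustration under stated assumptions: the novelty measure (fraction of unseen words), the 0.3 threshold, and the names SpeculativeResponder, on_yield_predicted, and on_turn_end are all hypothetical, and a real system would run generation asynchronously and cancel it mid-stream.

```python
from typing import Callable, Optional


def novelty(partial: str, final: str) -> float:
    """Fraction of words in the final transcript that were absent from the partial one."""
    seen = set(partial.lower().split())
    final_words = final.lower().split()
    if not final_words:
        return 0.0
    new_words = [w for w in final_words if w not in seen]
    return len(new_words) / len(final_words)


class SpeculativeResponder:
    """Drafts a reply on a predicted turn yield; keeps or discards it at turn end."""

    def __init__(self, generate: Callable[[str], str], threshold: float = 0.3):
        self.generate = generate
        self.threshold = threshold  # assumed cut-off for "substantially new"
        self.draft_input: Optional[str] = None
        self.draft: Optional[str] = None

    def on_yield_predicted(self, partial_transcript: str) -> None:
        # Start generating from what has been heard so far.
        self.draft_input = partial_transcript
        self.draft = self.generate(partial_transcript)

    def on_turn_end(self, final_transcript: str) -> str:
        usable = (
            self.draft is not None
            and self.draft_input is not None
            and novelty(self.draft_input, final_transcript) <= self.threshold
        )
        # Keep the draft if little changed; otherwise abort it and regenerate.
        reply = self.draft if usable else self.generate(final_transcript)
        self.draft_input = None
        self.draft = None
        return reply
```

For example, if the draft was started on "i want to book a flight" and the user finishes with "i want to book a flight to paris", only one of the eight final words is new, so the draft is kept. A completion that adds mostly unseen words crosses the threshold and triggers a restart.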

The model was developed in collaboration with Prof. Koji Inoue of Kyoto University, jointly building one of the first predictive turn-taking models for French dyadic interactions.

Non-Verbal Behavior

Gesture generation uses a retrieval-based approach: transformer encoders (BGE, MiniLM) embed the utterance, and the system selects a matching gesture from a gesture library based on verbal and syntactic cues. Current development is transitioning to a generative model capable of producing:

Deictic (pointing) gestures
Iconic (depicting concrete concepts) gestures
Metaphoric (representing abstract ideas) gestures

Evaluated frameworks include diffusion models (DiffSHEG, STARGATE) and token-based autoregressive models (VQ-VAE + GPT-style generation), trained on BEAT/BEAT2 motion capture data and on 3D motion reconstructed from monocular video.
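
For the retrieval-based approach currently in use, the core idea can be sketched as nearest-neighbor search by cosine similarity over a gesture library. This is an illustration under assumptions: toy_embed is a bag-of-words stand-in for a real sentence encoder such as BGE or MiniLM, and the library entries, gesture names, and retrieve_gesture function are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Dict, Tuple

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Stand-in for a sentence encoder: a bag-of-words count vector."""
    return {word: float(n) for word, n in Counter(text.lower().split()).items()}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_gesture(
    utterance: str,
    library: Dict[str, str],
    embed: Callable[[str], Vector] = toy_embed,
) -> Tuple[str, float]:
    """Return the library gesture whose description best matches the utterance."""
    query = embed(utterance)
    scored = [(cosine(query, embed(desc)), name) for name, desc in library.items()]
    score, name = max(scored)
    return name, score


# Hypothetical library: gesture name -> text description used for matching.
gesture_library = {
    "point_left": "look over there on the left",
    "big_circle": "a huge round big thing",
    "shrug": "i do not know maybe",
}
best_name, best_score = retrieve_gesture("it is over there on the left side", gesture_library)
```

Here the utterance shares five words with the point_left description and none with the others, so that gesture is selected. With a real encoder, the embed argument would be replaced by a function that returns dense vectors, and the similarity function adapted accordingly.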

Collaborators

Justine Cassell — Principal Investigator, CMU ArticuLab
INRIA (Paris) — National Institute for Research in Digital Science and Technology
Prof. Koji Inoue — Kyoto University (turn-taking collaboration)