AR / VR / XR 2025 Shipped

Son of Sara — LLM-Based Embodied Conversational Agent

An LLM-powered embodied conversational agent and successor to the SARA virtual agent platform, developed at CMU ArticuLab and INRIA. Features a modular real-time pipeline with ASR, LLM, TTS, predictive turn-taking (VAP), and a Unity-based virtual body with JSON-driven gesture, facial expression, and lip-sync control.

Unity · C# · Avatar · LLM · JSON API · Gestures · Facial Animation · INRIA · ArticuLab · CMU · NLP · Turn-Taking · Deep Learning · VAP

Overview

Son of Sara is the successor to SARA — a new LLM-based Embodied Conversational Agent (ECA) developed at CMU ArticuLab and INRIA (National Institute for Research in Digital Science and Technology, Paris). The system is designed to support natural and effective interaction with human users, relying on both rapport and task effectiveness to ensure good collaboration between human and agent.

Building on the lab’s tradition of socially-aware ECAs, Son of Sara uses the generalization capabilities of large language models and recent deep learning methods to make the agent’s behavior more natural and adaptable, while keeping the modular, cascaded pipeline architecture of prior systems.

My Role

Sole developer on the Unity avatar client. Designed and implemented:

  • JSON API interface for external control of the avatar’s gestures, expressions, and audio
  • Avatar animation state machine (Unity Mecanim)
  • Lip sync system from audio input
  • Blend shape-driven facial expression pipeline
  • Modular behavior architecture for extensibility in research experiments
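
As an illustration of what external control through a JSON API could look like, here is a minimal Python sketch of a client-side check that parses and validates one command message before it is sent to the avatar. The field names (`gesture`, `expression`, `weight`, `audio`) and the allowed-value sets are hypothetical, chosen only for this example; they are not the project's actual schema.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical vocabularies; the real gesture and expression sets are not specified here.
KNOWN_GESTURES = {"wave", "point_left", "nod"}
KNOWN_EXPRESSIONS = {"neutral", "smile", "surprise"}


@dataclass
class AvatarCommand:
    gesture: Optional[str] = None
    expression: Optional[str] = None
    expression_weight: float = 1.0   # blend shape intensity in [0, 1]
    audio_path: Optional[str] = None


def parse_command(raw: str) -> AvatarCommand:
    """Parse one JSON command and reject unknown gesture or expression names."""
    data = json.loads(raw)
    cmd = AvatarCommand(
        gesture=data.get("gesture"),
        expression=data.get("expression"),
        expression_weight=float(data.get("weight", 1.0)),
        audio_path=data.get("audio"),
    )
    if cmd.gesture is not None and cmd.gesture not in KNOWN_GESTURES:
        raise ValueError(f"unknown gesture: {cmd.gesture}")
    if cmd.expression is not None and cmd.expression not in KNOWN_EXPRESSIONS:
        raise ValueError(f"unknown expression: {cmd.expression}")
    # Clamp the weight so an out-of-range value cannot overdrive a blend shape.
    cmd.expression_weight = min(1.0, max(0.0, cmd.expression_weight))
    return cmd


cmd = parse_command('{"gesture": "wave", "expression": "smile", "weight": 1.7, "audio": "utt_001.wav"}')
```

Validating names and clamping weights on the sending side keeps malformed research scripts from reaching the animation layer, which is one plausible reason to put a typed schema in front of the avatar.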

System Pipeline

The full Son of Sara pipeline processes the user’s voice through to a synchronized embodied agent response:

Input → Understanding:

  • Microphone + ASR (Automatic Speech Recognition) — captures and transcribes user voice in real time
  • VAD (Voice Activity Detection) — handles dialogue dynamics and turn boundaries
  • Dialogue Manager — manages conversation state and flow

Reasoning → Generation:

  • LLM — generates a contextually appropriate text response
  • TTS (Text-to-Speech) — converts to spoken audio
  • NVB Generation — produces synchronized non-verbal behaviors (gestures, face, gaze)

Output → Unity Avatar:

  • JSON API → Unity avatar body, blendshape expressions, lip-sync
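
The cascaded flow above can be sketched as a single function that passes one user turn through each stage in order. This is a simplified illustration, not the project's code: every stage is a stand-in callable, the message fields (`text`, `audio`, `behaviors`) are assumptions, and voice activity detection and turn-taking are left out.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DialogueState:
    """Conversation history kept by the dialogue manager."""
    history: List[str] = field(default_factory=list)


def run_turn(
    audio: bytes,
    state: DialogueState,
    asr: Callable[[bytes], str],
    llm: Callable[[List[str]], str],
    tts: Callable[[str], str],
    nvb: Callable[[str], List[Dict[str, str]]],
) -> str:
    """Run one user turn through the cascade and return a JSON message for the avatar."""
    user_text = asr(audio)                      # Input -> Understanding
    state.history.append(f"user: {user_text}")
    reply = llm(state.history)                  # Reasoning -> Generation
    state.history.append(f"agent: {reply}")
    audio_ref = tts(reply)                      # reference to the synthesized speech
    behaviors = nvb(reply)                      # non-verbal behaviors for the same reply
    return json.dumps({"text": reply, "audio": audio_ref, "behaviors": behaviors})


# Stub stages so the sketch runs on its own; real stages would call actual models.
state = DialogueState()
message = run_turn(
    b"",
    state,
    asr=lambda audio: "hello",
    llm=lambda history: "Hi there!",
    tts=lambda text: "reply.wav",
    nvb=lambda text: [{"type": "gesture", "name": "wave"}],
)
```

Passing each stage in as a callable mirrors the modular design: any one component can be swapped for an experiment without touching the others.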

Predictive Turn-Taking

The system implements Voice Activity Projection (VAP) — a predictive turn-taking model that monitors ongoing conversations in real time and predicts future voice activity patterns for both speakers within a 2-second window.

When VAP predicts an imminent turn yield, the system begins generating a response before the user’s turn concludes. If the user then introduces substantially new information, it aborts that response and restarts generation. Overlapping generation with the end of the user’s turn reduces response latency and improves conversational rhythm.
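
The start-early, abort-if-needed logic can be sketched as follows. This is a minimal illustration under stated assumptions: the class name, the yield-probability threshold, and the word-overlap test for "substantially new information" are all hypothetical, and generation is synchronous here whereas a real-time system would run it asynchronously.

```python
from typing import Callable, Optional


class SpeculativeResponder:
    """Drafts a reply when a turn yield looks imminent; redrafts if the user adds new content."""

    def __init__(self, generate: Callable[[str], str],
                 yield_threshold: float = 0.7, max_new_words: int = 2) -> None:
        self.generate = generate
        self.yield_threshold = yield_threshold   # assumed cut-off on the predicted yield probability
        self.max_new_words = max_new_words       # assumed tolerance before a draft is discarded
        self._draft_input: Optional[str] = None
        self._draft: Optional[str] = None

    def on_vap_update(self, p_yield: float, partial_transcript: str) -> None:
        # Start a draft once, as soon as the predicted yield probability is high enough.
        if p_yield >= self.yield_threshold and self._draft is None:
            self._draft_input = partial_transcript
            self._draft = self.generate(partial_transcript)

    def on_turn_end(self, final_transcript: str) -> str:
        draft, draft_input = self._draft, self._draft_input
        self._draft, self._draft_input = None, None
        if draft is not None and draft_input is not None:
            seen = set(draft_input.lower().split())
            new_words = [w for w in final_transcript.lower().split() if w not in seen]
            if len(new_words) <= self.max_new_words:
                return draft                       # draft still valid: reply without extra latency
        return self.generate(final_transcript)     # abort the draft and restart on the full turn


calls = []

def fake_llm(text: str) -> str:
    calls.append(text)
    return f"reply to: {text}"

responder = SpeculativeResponder(fake_llm)
responder.on_vap_update(0.9, "can you book a table")
reply = responder.on_turn_end("can you book a table please")
```

In this usage the draft is reused because the final transcript adds only one new word, so the generator is called once. A production system would likely compare meaning rather than word sets.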

The model is developed in collaboration with Prof. Koji Inoue of Kyoto University, as a joint effort to build one of the first predictive turn-taking models for French dyadic interactions.

Non-Verbal Behavior

Gesture generation uses a retrieval-based approach in which transformer encoders (BGE, MiniLM) select from a gesture library based on verbal and syntactic cues. Development is now transitioning to a generative model capable of producing:

  • Deictic (pointing) gestures
  • Iconic (depicting concrete concepts) gestures
  • Metaphoric (representing abstract ideas) gestures

Evaluated frameworks include diffusion models (DiffSHEG, STARGATE) and token-based autoregressive models (VQ-VAE + GPT-style generation), trained on BEAT/BEAT2 motion capture data and on 3D motion reconstructed from monocular video.
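
For the retrieval-based approach, the core idea of ranking library gestures by similarity to the utterance can be sketched in a few lines. To keep the example self-contained, a toy bag-of-words vector stands in for the BGE/MiniLM sentence embeddings, and the gesture names, descriptions, and score threshold are invented for illustration.

```python
import math
from collections import Counter
from typing import Dict, Optional


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical gesture library: each gesture is indexed by a short text description.
GESTURE_LIBRARY: Dict[str, str] = {
    "point_forward": "look over there at that one",
    "big_circle": "a large round big object",
    "shrug": "i do not know maybe not sure",
}


def select_gesture(utterance: str,
                   library: Dict[str, str] = GESTURE_LIBRARY,
                   min_score: float = 0.1) -> Optional[str]:
    """Return the best-matching gesture name, or None if nothing scores above min_score."""
    query = embed(utterance)
    best_name, best_score = None, min_score
    for name, description in library.items():
        score = cosine(query, embed(description))
        if score > best_score:
            best_name, best_score = name, score
    return best_name


choice = select_gesture("I am not sure about that")
```

Swapping `embed` for a real sentence encoder would leave the ranking loop unchanged, which is what makes retrieval a convenient baseline before moving to a generative model.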

Collaborators

  • Justine Cassell — Principal Investigator, CMU ArticuLab
  • INRIA (Paris) — National Institute for Research in Digital Science and Technology
  • Prof. Koji Inoue — Kyoto University (turn-taking collaboration)