I gave an invited talk at Utrecht University in September 2025, titled “Speech and Gesture Interaction in Face-to-Face Dialogue: Analysis and Generation,” as part of the Joint Special Interest Group NLP@UU.

Abstract: Human language is primarily used in face-to-face interactions, where we convey meaning and build common ground through a variety of signals, including speech, gestures, facial expressions, and gaze. In this talk, I present our recent work on modelling speech and gestures in such interactions, covering both the analysis and the generation of gestures accompanying speech. First, I show that self-supervised pre-training learns stronger gesture representations when they are grounded in co-occurring speech. The resulting multimodal representations align with human judgements of gesture similarity in dialogue and improve object reference prediction, including when speech is unavailable at inference time. Second, I introduce a speech-driven gesture generator that integrates semantic, acoustic, and prosodic cues and is trained to produce semantically coherent gestures. The model generates gestures better matched to the content and timing of what is said, improving gestural realisation in virtual avatars. Together, these studies point towards practical directions for multimodal, gesture-aware conversational systems that analyse human gesturing and generate context-appropriate gestures for naturalistic human-computer interaction.