Esam Ghaleb

In brief: I study and model how verbal and non-verbal cues work together in a variety of human behaviors. Since September 2024 I have been a researcher in the Multimodal Language Department at the Max Planck Institute for Psycholinguistics, where I model multimodal communication for both human insight and machine applications.

Trained in computer science and engineering, I work across AI, cognitive science, psycholinguistics, psychology and healthcare to computationally model human behaviour for both fundamental and applied research. My work focuses on multimodal interaction, particularly in the context of dialogue. My research spans gesture generation, multimodal dialogue systems and, previously, affective computing, with applications in healthcare, human-computer interaction, and social robotics.

During my PhD and post-doc at Maastricht University, I developed explainable multimodal emotion-recognition techniques; at the Institute for Logic, Language & Computation (University of Amsterdam) I investigated linguistic–gestural alignment and automatic gesture segmentation in dialogues. My applied projects include two EU-funded studies (200+ participants) and a work package that combined clinicians’ expertise with machine intelligence for socio-economic contexts.

News

Jun 27, 2025	Paper accepted at ICCV on Semantics-Aware Co-Speech Gesture Generation!
Jun 23, 2025	Plenary Talk and Workshop on Multimodal Interaction at Summer School with Raquel Fernández
May 16, 2025	Two papers accepted at the Annual Meeting of the Association for Computational Linguistics (ACL 2025)
Dec 12, 2024	I gave a talk on “Understanding and Modelling Multimodal Dialogue Coordination” at the Max Planck Institute for Psycholinguistics.
Oct 21, 2024	I gave a talk on “Learning Representations in Dialogue through Contrastive Learning: An Intrinsic Evaluation” at the UvA SignLab.

Selected Publications

SemGes: Semantics-aware Co-Speech Gesture Generation using Semantic Coherence and Relevance Learning

Lanmiao Liu, Esam Ghaleb, Aslı Özyürek, and 1 more author

In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Oct 2025

Creating a virtual avatar with semantically coherent gestures that are aligned with speech is a challenging task. Existing gesture generation research mainly focused on generating rhythmic beat gestures, neglecting the semantic context of the gestures. In this paper, we propose a novel approach for semantic grounding in co-speech gesture generation that integrates semantic information at both fine-grained and global levels. Our approach starts with learning the motion prior through a vector-quantized variational autoencoder. Built on this model, a second-stage module is applied to automatically generate gestures from speech, text-based semantics and speaker identity that ensures consistency between the semantic relevance of generated gestures and co-occurring speech semantics through semantic coherence and relevance modules. Experimental results demonstrate that our approach enhances the realism and coherence of semantic gestures. Extensive experiments and user studies show that our method outperforms state-of-the-art approaches across two benchmarks in co-speech gesture generation in both objective and subjective metrics. The qualitative results of our model can be viewed at https://semgesture.github.io/. Our code, dataset and pre-trained models will be shared upon acceptance.
@inproceedings{liu202SemGes,
  title = {SemGes: Semantics-aware Co-Speech Gesture Generation using Semantic Coherence and Relevance Learning},
  author = {Liu, Lanmiao and Ghaleb, Esam and {\"O}zy{\"u}rek, Asl{\i} and Yumak, Zerrin},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year = {2025},
  month = oct,
  publisher = {CVF/IEEE},
  address = {Honolulu, Hawai'i, USA},
  url = {https://semgesture.github.io/},
  numpages = {10},
  keywords = {Gesture generation, semantics, generative AI},
  location = {Honolulu, Hawai'i, USA},
  series = {ICCV '25},
}
LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks

Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, and 8 more authors

In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL) Jul 2025

There is an increasing trend towards evaluating NLP models with LLMs instead of human judgments, raising questions about the validity of these evaluations, as well as their reproducibility in the case of proprietary models. We provide JudgeBench, an extensible collection of 20 NLP datasets with human annotations covering a broad range of evaluated properties and types of data, and comprehensively evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations. Our evaluations show substantial variance across models and datasets. Models are reliable evaluators on some tasks, but overall display substantial variability depending on the property being evaluated, the expertise level of the human judges, and whether the language is human or model-generated. We conclude that LLMs should be carefully validated against human judgments before being used as evaluators.
@inproceedings{bavaresco2024llms,
  title = {{LLMs} instead of Human Judges? {A} Large Scale Empirical Study across 20 {NLP} Evaluation Tasks},
  author = {Bavaresco, Anna and Bernardi, Raffaella and Bertolazzi, Leonardo and Elliott, Desmond and Fern{\'a}ndez, Raquel and Gatt, Albert and Ghaleb, Esam and Giulianelli, Mario and Hanna, Michael and Koller, Alexander and others},
  year = {2025},
  booktitle = {Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL)},
  month = jul,
  address = {Cedarville, Ohio, 45314, United States},
  location = {Vienna, Austria},
  publisher = {Association for Computational Linguistics},
  url_github = {https://github.com/dmg-illc/JUDGE-BENCH}
}
I see what you mean: Co-Speech Gestures for Reference Resolution in Multimodal Dialogue

Esam Ghaleb, Bulat Khaertdinov, Aslı Özyürek, and 1 more author

In Proceedings of the 63rd Conference of the Association for Computational Linguistics (ACL Findings) Jul 2025

In face-to-face interaction, we use multiple modalities, including speech and gestures, to communicate information and resolve references to objects. However, how representational co-speech gestures refer to objects remains understudied from a computational perspective. In this work, we address this gap by introducing a multimodal reference resolution task centred on representational gestures, while simultaneously tackling the challenge of learning robust gesture embeddings. We propose a self-supervised pre-training approach to gesture representation learning that grounds body movements in spoken language. Our experiments show that the learned embeddings align with expert annotations and have significant predictive power. Moreover, reference resolution accuracy further improves when (1) using multimodal gesture representations, even when speech is unavailable at inference time, and (2) leveraging dialogue history. Overall, our findings highlight the complementary roles of gesture and speech in reference resolution, offering a step towards more naturalistic models of human-machine interaction.
@inproceedings{ghaleb-etal-acl-2025,
  author = {Ghaleb, Esam and Khaertdinov, Bulat and {\"O}zy{\"u}rek, Asl{\i} and Fern{\'a}ndez, Raquel},
  title = {I see what you mean: Co-Speech Gestures for Reference Resolution in Multimodal Dialogue},
  booktitle = {Proceedings of the 63rd Conference of the Association for Computational Linguistics (ACL Findings)},
  publisher = {Association for Computational Linguistics},
  address = {Cedarville, Ohio, 45314, United States},
  month = jul,
  year = {2025},
  location = {Vienna, Austria},
  url_github = {https://github.com/EsamGhaleb/MultimodalReferenceResolution},
}
Learning Co-Speech Gesture Representations in Dialogue through Contrastive Learning: An Intrinsic Evaluation

Esam Ghaleb, Bulat Khaertdinov, Wim Pouw, and 4 more authors

In International Conference on Multimodal Interaction Jul 2024

In face-to-face dialogues, the form-meaning relationship of co-speech gestures varies depending on contextual factors such as what the gestures refer to and the individual characteristics of speakers. These factors make co-speech gesture representation learning challenging. How can we learn meaningful gesture representations considering gestures’ variability and relationship with speech? This paper tackles this challenge by employing self-supervised contrastive learning techniques to learn gesture representations from skeletal and speech information. We propose an approach that includes both unimodal and multimodal pre-training to ground gesture representations in co-occurring speech. For training, we utilize a face-to-face dialogue dataset rich with representational iconic gestures. We conduct thorough intrinsic evaluations of the learned representations through comparison with human-annotated pairwise gesture similarity. Moreover, we perform a diagnostic probing analysis to assess the possibility of recovering interpretable gesture features from the learned representations. Our results show a significant positive correlation with human-annotated gesture similarity and reveal that the similarity between the learned representations is consistent with well-motivated patterns related to the dynamics of dialogue interaction. Moreover, our findings demonstrate that several features concerning the form of gestures can be recovered from the latent representations. Overall, this study shows that multimodal contrastive learning is a promising approach for learning gesture representations, which opens the door to using such representations in larger-scale gesture analysis studies.
@inproceedings{Ghaleb2024le,
  title = {Learning Co-Speech Gesture Representations in Dialogue through Contrastive Learning: An Intrinsic Evaluation},
  author = {Ghaleb, Esam and Khaertdinov, Bulat and Pouw, Wim and Rasenberg, Marlou and Holler, Judith and {\"O}zy{\"u}rek, Asl{\i} and Fern{\'a}ndez, Raquel},
  booktitle = {International Conference on Multimodal Interaction},
  volume = {26},
  number = {26},
  year = {2024},
  isbn = {9798400704628},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3678957.3685707},
  doi = {10.1145/3678957.3685707},
  pages = {274--283},
  numpages = {10},
  keywords = {Gesture analysis, diagnostic probing, face-to-face dialogue, intrinsic evaluation, representation learning},
  location = {San Jose, Costa Rica},
  series = {ICMI '24},
}
Analysing Cross-Speaker Convergence in Face-to-Face Dialogue through the Lens of Automatically Detected Shared Linguistic Constructions

Esam Ghaleb, Marlou Rasenberg, Wim Pouw, and 4 more authors

In Proceedings of the Annual Meeting of the Cognitive Science Society Jul 2024

Conversation requires a substantial amount of coordination between dialogue participants, from managing turn taking to negotiating mutual understanding. Part of this coordination effort surfaces as the reuse of linguistic behaviour across speakers, a process often referred to as alignment. While the presence of linguistic alignment is well documented in the literature, several questions remain open, including the extent to which patterns of reuse across speakers have an impact on the emergence of labelling conventions for novel referents. In this study, we put forward a methodology for automatically detecting shared lemmatised constructions (expressions with a common lexical core used by both speakers within a dialogue) and apply it to a referential communication corpus where participants aim to identify novel objects for which no established labels exist. Our analyses uncover the usage patterns of shared constructions in interaction and reveal that features such as their frequency and the amount of different constructions used for a referent are associated with the degree of object labelling convergence the participants exhibit after social interaction. More generally, the present study shows that automatically detected shared constructions offer a useful level of analysis to investigate the dynamics of reference negotiation in dialogue.
@inproceedings{ghaleb2024an,
  title = {Analysing Cross-Speaker Convergence in Face-to-Face Dialogue through the Lens of Automatically Detected Shared Linguistic Constructions},
  author = {Ghaleb, Esam and Rasenberg, Marlou and Pouw, Wim and Toni, Ivan and Holler, Judith and {\"O}zy{\"u}rek, Asl{\i} and Fern{\'a}ndez, Raquel},
  booktitle = {Proceedings of the Annual Meeting of the Cognitive Science Society},
  volume = {45},
  number = {45},
  year = {2024},
  url = {https://escholarship.org/uc/item/43h970fc}
}