Esam Ghaleb
In brief: I develop computational models of multimodal language and behaviour: how speech, text, gesture, sign, facial expressions, and whole-body movement jointly encode meaning in interaction. Previously, I worked at the University of Amsterdam and Maastricht University on linguistic–gestural alignment and explainable multimodal modelling of human behaviour.
I am a member of the research staff in the Multimodal Language Department at the Max Planck Institute for Psycholinguistics (Nijmegen), where I lead the department's Multimodal Modelling Cluster. My work develops and studies machine-learning methods for the segmentation, coding, and representation of visual communicative signals from motion-capture and video data, and uses the learned representations as testbeds for theories of multimodal language across languages and interactional settings. I also study how large language models and multimodal models integrate and generate multimodal behaviour, and I develop generative models of gesture for virtual agents. Recent work includes an NWO XS-funded project on grounded, object- and interaction-aware gesture generation in context.



