Abstract: Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle to reason over information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges lie in precisely localizing critical cues within dense layouts and in performing multi-hop reasoning to integrate dispersed evidence.
In this talk, I will introduce Speculative Verdict (SV), a training-free framework inspired by speculative decoding: multiple lightweight draft experts first generate candidate reasoning paths, and a large verdict model then synthesizes them into a final answer. Empirically, SV achieves consistent gains on challenging information-intensive and high-resolution visual question answering benchmarks. By consolidating the correct insights scattered across partially accurate reasoning paths, SV delivers both error correction and cost efficiency relative to large proprietary models or costly training pipelines.
Short bio: Yuhan Liu is a senior majoring in Data Science and Mathematics at NYU Shanghai. Her research interests focus on multimodal models and embodied AI. She works with Prof. Shengjie Wang at NYU Shanghai and Prof. Lianhui Qin at UC San Diego.
