Ai2 unveils Molmo 2, an open source video understanding model built to prove that smaller, transparent AI systems can rival closed, proprietary platforms in grounded video intelligence.
The Allen Institute for AI (Ai2) has released Molmo 2, an open source video understanding model aimed at demonstrating that smaller, open models can serve as credible alternatives to large proprietary systems for enterprise video analysis.
Molmo 2 is designed to challenge the dominance of closed models in grounded vision, a critical capability for video understanding that links visual elements directly to language and reasoning. In its press release, Ai2 stated that Molmo 2 “takes Molmo’s strengths in grounded vision and expands them to video and multi-image understanding.” The institute added, “One of our core design goals was to close a major gap in open models: grounding.”
Ai2 released three variants: Molmo 2 8B, a Qwen 3-based model described as its “best overall model for video grounding and QA”; Molmo 2 4B, optimised for more efficient deployments; and Molmo 2-O 7B, built on Ai2’s Olmo model. All variants support single-image, multi-image, and variable-length video inputs, enabling tasks such as video grounding, object tracking, and question answering.
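For readers who want a concrete sense of how such a model might be queried, the sketch below shows one plausible way to run multi-frame video question answering, assuming Molmo 2 follows the original Molmo’s Hugging Face remote-code interface (`AutoProcessor`, `processor.process`, and `generate_from_batch`). The repository name and the frame-list input are illustrative assumptions, not confirmed details of the Molmo 2 release.

```python
# Minimal sketch: querying a Molmo 2 checkpoint with sampled video frames.
# Assumes Molmo 2 keeps the original Molmo's Hugging Face remote-code
# interface; the repo id below is hypothetical, for illustration only.
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from PIL import Image

MODEL_ID = "allenai/Molmo-2-8B"  # hypothetical repo id

processor = AutoProcessor.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

# Frames sampled from a clip; passing a list of images mirrors how the
# original Molmo processor accepts multi-image input (assumed to carry over).
frames = [Image.open(f"frame_{i:03d}.jpg") for i in range(8)]

inputs = processor.process(
    images=frames,
    text="Which object moves across these frames? Point to it.",
)
# Add a batch dimension and move tensors to the model's device.
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
# Decode only the newly generated tokens, skipping the prompt.
answer = processor.tokenizer.decode(
    output[0, inputs["input_ids"].size(1):], skip_special_tokens=True
)
print(answer)
```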
According to Ai2, Molmo 2 surpasses earlier Molmo versions in accuracy, temporal understanding, and pixel-level grounding, and in some cases performs competitively with larger proprietary models such as Google’s Gemini 3. Despite their smaller size, Molmo 2 models outperformed Gemini 3 Pro as well as open-weight competitors on video tracking benchmarks.
Ai2 noted that Molmo 2’s strongest gains appear in video grounding and video counting. However, the institute acknowledged remaining challenges, stating, “These results highlight both progress and remaining headroom — video grounding is still hard, and no model yet reaches 40% accuracy.”