eCommerceNews Asia - Technology news for digital commerce decision-makers

Alibaba Cloud unveils Qwen2.5-Omni-7B multimodal AI model

Fri, 28th Mar 2025

Alibaba Cloud has announced the release of Qwen2.5-Omni-7B, its latest multimodal artificial intelligence model capable of processing text, images, audio, and videos while generating text and speech in real time.

This model, despite its relatively small size of 7 billion parameters, is designed to perform high-quality AI tasks and is optimised for deployment on edge devices such as smartphones and laptops. The company stated, "This sets a new standard for optimal deployable multimodal AI for edge devices like mobile phones and laptops." It is aimed at applications including assistive technologies for the visually impaired, AI-powered customer service, and smart cooking assistants.

The launch makes Qwen2.5-Omni-7B available on platforms like Hugging Face and GitHub, with further access provided through Qwen Chat and Alibaba Cloud's ModelScope. Over the years, Alibaba Cloud has made over 200 generative AI models open-source, supporting a wide range of AI development initiatives.
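For developers pulling the weights from Hugging Face, interaction typically follows the multimodal chat-messages pattern used across the Qwen model cards. The sketch below is illustrative only: the model ID is as published, but the message schema and the commented-out loading API are assumptions that should be checked against the official Qwen2.5-Omni-7B model card.

```python
# Illustrative multimodal chat payload in the style of Qwen model cards on
# Hugging Face. The model ID is as published; the schema and the loading
# API sketched in comments are assumptions -- verify against the official
# Qwen2.5-Omni-7B model card before use.

MODEL_ID = "Qwen/Qwen2.5-Omni-7B"

# A single user turn mixing an audio clip and a text prompt
# ("clip.wav" is a hypothetical local file).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "clip.wav"},
            {"type": "text", "text": "What is said in this recording?"},
        ],
    }
]

# Loading and generation (downloads ~7B parameters of weights), sketched only:
#
#   from transformers import AutoProcessor, AutoModel  # exact class names may differ
#   processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
#   model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)
```

The same message list can carry image or video entries alongside text, which is how a single request exercises several modalities at once.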

The model is described as delivering "remarkable performance across all modalities, rivaling specialized single-modality models of comparable size." Its architecture introduces three key components: a Thinker-Talker design, which separates text generation from speech synthesis to enhance output quality; TMRoPE, a position-embedding method that aligns video and audio inputs along a shared timeline; and block-wise streaming processing for low-latency audio responses.

"Qwen2.5-Omni-7B was pre-trained on a vast, diverse dataset, including image-text, video-text, video-audio, audio-text, and text data, ensuring robust performance across tasks," Alibaba Cloud noted. The model is said to excel at tasks spanning multiple modalities, with particular emphasis on understanding voice instructions and generating robust speech in response.

Reinforcement learning optimisation has further improved the stability and performance of the model, reducing issues such as attention misalignment and pronunciation errors during speech generation.

Recent launches by Alibaba Cloud have included Qwen2.5 and Qwen2.5-Max, with the latter recognised for its capabilities in comparison to other proprietary large language models. Previous open-source releases in the series include Qwen2.5-VL and Qwen2.5-1M, which focus on enhancing visual understanding and handling long-context input, respectively.
