We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-Omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single model to efficiently process and fuse multimodal inputs within a unified framework, thereby supporting diverse tasks without requiring separate models, task-specific fine-tuning, or structural redesign. Importantly, Ming-Omni extends beyond conventional multimodal models by supporting audio and image generation. This is achieved through the integration of an advanced audio decoder for natural-sounding speech and of Ming-Lite-Uni for high-quality image generation, which also allows the model to engage in context-aware chatting, perform text-to-speech conversion, and carry out versatile image editing. Our experimental results show that Ming-Omni offers a powerful solution for unified perception and generation across all modalities. Notably, Ming-Omni is, to our knowledge, the first open-source model to match GPT-4o in modality support, and we release all code and model weights to encourage further research and development in the community.
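To make the modality-specific routing idea concrete, the sketch below shows a toy PyTorch MoE block in which every modality shares one expert pool but gets its own router. This is purely illustrative and not the Ling implementation: the class name `ModalitySpecificMoE`, the sizes, the modality list, and the dense (non-dispatched) expert evaluation are all assumptions made for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalitySpecificMoE(nn.Module):
    """Toy MoE block: a shared expert pool with one router per modality (illustrative only)."""

    def __init__(self, d_model=512, d_ff=1024, num_experts=8, top_k=2,
                 modalities=("text", "image", "audio", "video")):
        super().__init__()
        self.top_k = top_k
        # Shared pool of expert feed-forward networks, used by every modality.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        # A separate lightweight gating network (router) for each modality.
        self.routers = nn.ModuleDict(
            {m: nn.Linear(d_model, num_experts, bias=False) for m in modalities}
        )

    def forward(self, tokens: torch.Tensor, modality: str) -> torch.Tensor:
        # tokens: (batch, seq_len, d_model); the modality tag selects which router scores them.
        logits = self.routers[modality](tokens)               # (B, S, E)
        weights, indices = logits.topk(self.top_k, dim=-1)    # top-k experts per token
        weights = F.softmax(weights, dim=-1)                  # (B, S, K)

        # Dense reference: run all experts, then gather and mix the selected ones.
        # (Real MoE implementations dispatch tokens sparsely instead.)
        expert_out = torch.stack([e(tokens) for e in self.experts], dim=-2)          # (B, S, E, D)
        gather_idx = indices.unsqueeze(-1).expand(*indices.shape, tokens.size(-1))   # (B, S, K, D)
        selected = expert_out.gather(-2, gather_idx)                                 # (B, S, K, D)
        return (weights.unsqueeze(-1) * selected).sum(dim=-2)                        # (B, S, D)

# Tokens from different encoders share the experts but are routed by different gates.
layer = ModalitySpecificMoE()
audio_tokens = torch.randn(2, 16, 512)
image_tokens = torch.randn(2, 64, 512)
fused = torch.cat([layer(audio_tokens, "audio"), layer(image_tokens, "image")], dim=1)
print(fused.shape)  # torch.Size([2, 80, 512])
```

The key design point the sketch tries to capture is that the expert parameters are shared across modalities (enabling fusion in one model), while only the small per-modality routers differ.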
**Dialect Understanding**

| Input (audio) | Output |
| --- | --- |
| (Cantonese speech clip) | "[方言-粤语] 你在干什么, 是不是不想聊天" ([Dialect: Cantonese] "What are you doing? Do you not want to chat?") |
| (Shanghainese speech clip) | "[方言-上海话] 我们考试还没定下来呢" ([Dialect: Shanghainese] "Our exam hasn't been scheduled yet.") |
| (Hokkien speech clip) | "[方言-闽南语] 宝贝, 早点休息, 晚安" ([Dialect: Hokkien] "Sweetheart, get some rest soon. Good night.") |
| (Sichuan-Chongqing speech clip) | "[方言-川渝方言] 我难受的很, 别人都睡了" ([Dialect: Sichuan-Chongqing] "I feel terrible, and everyone else is already asleep.") |
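The dialect outputs above follow a simple convention: a bracketed dialect tag followed by the transcript. As a purely illustrative sketch (the helper name and regex below are ours, not part of the Ming-Omni release, and assume exactly the "[方言-…]" prefix format shown above), such tagged output could be split like this:

```python
import re

def split_dialect_output(text: str):
    """Hypothetical helper: split '[方言-<dialect>] <transcript>' into its parts."""
    match = re.match(r"^\[方言-(?P<dialect>[^\]]+)\]\s*(?P<transcript>.*)$", text)
    if match is None:
        return None, text  # no tag found: treat the whole string as the transcript
    return match.group("dialect"), match.group("transcript")

dialect, transcript = split_dialect_output("[方言-粤语] 你在干什么, 是不是不想聊天")
print(dialect)     # 粤语 (Cantonese)
print(transcript)  # 你在干什么, 是不是不想聊天
```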
**Voice Cloning**

| Input 1 (reference voice, audio) | Input 2 (text to synthesize) | Output (cloned speech, audio) |
| --- | --- | --- |
| (audio clip) | "全球每年有超过一百三十五万人,因交通事故而死亡" ("More than 1.35 million people worldwide die in traffic accidents every year.") | (audio clip) |
| (audio clip) | "The stained glass offered a hypnotic atmosphere" | (audio clip) |
**Spoken Chatting**

| Input (audio) | Output (audio) |
| --- | --- |
| (audio clip) | (audio clip) |
| (audio clip) | (audio clip) |
```bibtex
@article{Mingomni2025,
  title   = {Ming-Omni: A Unified Multimodal Model for Perception and Generation},
  author  = {Inclusion AI, Ant Group},
  journal = {Technical Report},
  year    = {2025}
}

@article{Mingunify2025,
  title   = {Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction},
  author  = {Inclusion AI, Ant Group},
  journal = {Technical Report, arXiv:2505.02471},
  year    = {2025}
}
```