Product Introduction
- MiniCPM-V 4.5 is an 8-billion-parameter open-source multimodal large language model (MLLM) designed for efficient image, video, and document understanding on local devices such as smartphones. It is built on Qwen3-8B as the language backbone and SigLIP2-400M as the vision encoder, achieving GPT-4o-level performance while remaining optimized for mobile deployment.
- The core value of MiniCPM-V 4.5 lies in its ability to deliver state-of-the-art multimodal capabilities with minimal computational overhead, enabling high-resolution visual processing, long-context video analysis, and complex document parsing directly on consumer hardware.
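For orientation, here is a minimal single-image inference sketch in the style of the Hugging Face model card. The repo id openbmb/MiniCPM-V-4_5 and the exact chat() signature are assumptions to verify against the official card:

```python
# Minimal single-image chat sketch; repo id and chat() parameters are assumed
# from the MiniCPM-V model-card pattern, not guaranteed.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-V-4_5"  # assumed Hugging Face repo id
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,       # chat() ships as remote code with the weights
    attn_implementation="sdpa",
    torch_dtype=torch.bfloat16,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("page.png").convert("RGB")
msgs = [{"role": "user", "content": [image, "Transcribe the text in this image."]}]

# The remote-code chat() helper handles LLaVA-UHD image slicing and prompt
# construction internally.
answer = model.chat(msgs=msgs, tokenizer=tokenizer)
print(answer)
```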
Main Features
- State-of-the-Art Vision-Language Performance: MiniCPM-V 4.5 achieves an average score of 77.2 on OpenCompass, outperforming GPT-4o-latest, Gemini-2.0 Pro, and Qwen2.5-VL 72B in vision-language tasks. It processes high-resolution images of up to 1.8 million pixels (e.g., 1344x1344) using the LLaVA-UHD architecture, with 4x fewer visual tokens than most comparable MLLMs.
- Efficient Video Understanding: A unified 3D-Resampler compresses each group of six 448x448 video frames into 64 tokens, a 96x token compression rate that enables high-refresh-rate (10 FPS) video analysis; see the video sketch after this list. This supports benchmarks like Video-MME, LVBench, and MotionBench without increasing LLM inference costs.
- Controllable Hybrid Fast/Deep Thinking: Users can toggle between fast thinking for low-latency responses and deep thinking for complex problem-solving, balancing efficiency and accuracy across scenarios like real-time OCR or detailed document parsing; a toggle sketch follows this list.
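A minimal sketch of the fast/deep toggle, reusing model, tokenizer, and msgs from the sketch above; the enable_thinking parameter name is an assumption based on the model card's documented pattern:

```python
# Fast thinking: low-latency, direct answer (assumed flag name).
fast_answer = model.chat(msgs=msgs, tokenizer=tokenizer, enable_thinking=False)

# Deep thinking: slower, multi-step reasoning before the final answer.
deep_answer = model.chat(msgs=msgs, tokenizer=tokenizer, enable_thinking=True)
```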
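For video, a hedged sketch of the usual pattern: sample frames with decord and pass them to chat() as a list of images. The use_image_id and max_slice_nums parameters follow earlier MiniCPM-V model cards and are assumed to carry over to 4.5:

```python
# Sample a video at ~10 FPS, cap at 180 frames, and chat over the frames.
# model and tokenizer come from the first sketch above.
from decord import VideoReader, cpu
from PIL import Image

def sample_frames(video_path, target_fps=10, max_frames=180):
    vr = VideoReader(video_path, ctx=cpu(0))
    stride = max(1, round(vr.get_avg_fps() / target_fps))
    idx = list(range(0, len(vr), stride))[:max_frames]
    return [Image.fromarray(f) for f in vr.get_batch(idx).asnumpy()]

frames = sample_frames("demo.mp4")
msgs = [{"role": "user", "content": frames + ["Describe what happens in this video."]}]
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    use_image_id=False,   # assumed from earlier MiniCPM-V model cards
    max_slice_nums=1,     # disable high-res slicing for video frames
)
print(answer)
```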
Problems Solved
- High Computational Costs for Multimodal Tasks: Addresses the inefficiency of processing high-resolution images and long videos by reducing token counts and enabling CPU-only inference through llama.cpp and ollama support.
- Mobile Deployment Limitations: Targets developers and researchers needing desktop-level AI performance on smartphones, with quantized models (int4, GGUF, AWQ) and iOS app optimizations for iPhone and iPad; see the loading sketch after this list.
- Specialized Use Case Gaps: Closes gaps in OCR, document parsing, and multilingual support, outperforming GPT-4o-latest on OCRBench and achieving state-of-the-art results on OmniDocBench for PDF parsing, with coverage of 30+ languages.
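As a sketch of the low-memory path, the int4 weights load through the same transformers interface. The repo id openbmb/MiniCPM-V-4_5-int4 is an assumption modeled on the naming of earlier MiniCPM-V releases; the GGUF builds are consumed through llama.cpp/ollama instead:

```python
# int4 loading sketch; the repo id below is hypothetical, based on earlier
# release naming -- verify it on Hugging Face before use.
from transformers import AutoModel, AutoTokenizer

int4_id = "openbmb/MiniCPM-V-4_5-int4"
model = AutoModel.from_pretrained(int4_id, trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(int4_id, trust_remote_code=True)
# chat() usage is identical to the full-precision sketch above, at roughly a
# quarter of the fp16 memory footprint.
```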
Unique Advantages
- Superior Efficiency-to-Performance Ratio: At 8B parameters, it surpasses far larger models such as Qwen2.5-VL 72B in vision-language tasks while keeping resource usage within reach of phone-class hardware.
- Innovative Token Compression: The 3D-Resampler reduces video token counts by a factor of 96, and LLaVA-UHD cuts image tokens by 75%, enabling processing of up to 180 video frames at once or 1.8MP images.
- Commercial Accessibility: Free for academic use and available for commercial applications after registration, with enterprise-ready deployment via SGLang and vLLM (see the serving sketch after this list) and fine-tuning support through LLaMA-Factory.
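A sketch of offline batch inference through vLLM, which lists MiniCPM-V among its supported multimodal models. The prompt template and the (<image>./</image>) placeholder follow vLLM's MiniCPM-V examples and should be checked against the current docs:

```python
# Offline multimodal inference via vLLM; the prompt format is an assumption
# taken from vLLM's MiniCPM-V examples.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="openbmb/MiniCPM-V-4_5", trust_remote_code=True, max_model_len=4096)
image = Image.open("invoice.png").convert("RGB")

outputs = llm.generate(
    {
        "prompt": "USER: (<image>./</image>)\nWhat is the total amount due?\nASSISTANT:",
        "multi_modal_data": {"image": image},
    },
    SamplingParams(max_tokens=256, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```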
Frequently Asked Questions (FAQ)
- Can MiniCPM-V 4.5 be used commercially? Yes. The repository code is released under the Apache-2.0 license, and after completing a registration questionnaire the model weights are also free for commercial use, with enterprise deployment supported via quantized formats and the inference frameworks above.
- How does it handle long videos? The 3D-Resampler compresses each group of six 448x448 frames into 64 tokens (a 96x rate), so up to 180 frames can be fed to the LLM at once, enabling efficient analysis of high-refresh-rate (10 FPS) and long-duration videos without increasing LLM inference costs.
- What makes its OCR capabilities superior? MiniCPM-V 4.5 scores 85.7 on OCRBench, outperforming GPT-4o-latest, aided by RLAIF-V training and LLaVA-UHD's support for inputs up to 1344x1344 for dense text extraction.
- Is local phone deployment feasible? Yes. Optimized iOS apps and llama.cpp/ollama integrations enable CPU-based inference on iPhone and iPad, with quantized builds (offered in 16 sizes) reducing memory usage by up to 75%.
- Does it support multilingual inputs? The model supports 30+ languages, with benchmarks showing improved accuracy over GPT-4o-latest in non-English document parsing and video understanding tasks.