Qwen3-VL
Alibaba Qwen's open-weights vision-language model family (2B–235B, Apache-2.0): image and document understanding, OCR, visual reasoning, and video.
Qwen3-VL is the vision-language model series from Alibaba's Qwen team: it reads images, documents, and video alongside text for OCR, visual reasoning, spatial grounding, and agentic use. Open-weights under Apache-2.0 (dense 2B–32B plus 30B-A3B and 235B-A22B MoE variants, Instruct and Thinking editions) on Hugging Face and ModelScope — a strong open VLM you can self-host.
Qwen3-VL is the open-weights vision-language model family from Alibaba's Qwen team — models that read images, documents, and video alongside text. It targets the full range of multimodal work: OCR and document understanding, visual reasoning, spatial grounding, video comprehension, and agentic use (driving tools from what it sees). Released under Apache-2.0, it's one of the strongest open VLMs you can download and run yourself.
It's aimed at teams who want capable multimodal understanding without sending documents or images to a proprietary API — for privacy, cost control at volume, offline operation, or simply control over the model.
Highlights
- Open weights (Apache-2.0) — free for research and commercial use; self-hostable.
- Document & OCR — reads layout, tables, charts, and handwriting; strong on document understanding, not just transcription.
- Visual reasoning & grounding — answers questions about images, with spatial grounding and long-context understanding.
- Video — temporal understanding for captioning, search, and event detection.
- A family of sizes — dense models from 2B to 32B plus 30B-A3B and 235B-A22B MoE variants, in Instruct and Thinking editions, so you can fit the model to your hardware and latency budget.
In an AI-assisted workflow
# pull the weights from Hugging Face and serve with a high-throughput engine
# e.g. Qwen/Qwen3-VL-8B-Instruct -> vLLM -> an OpenAI-compatible endpointYou can self-host the weights (Hugging Face or ModelScope) behind a serving engine, or call the models through Alibaba Cloud's hosted API if you'd rather not run infrastructure.
TIP
Right-size the variant to the task: a 2B–8B model often handles routine OCR and form extraction at a fraction of the cost and latency of the largest model — reserve the 32B/MoE variants for the hardest reasoning. Measure on your own documents before committing.
Good to know
Qwen3-VL is open source (Apache-2.0) and the weights are free; you provide the compute (your own GPUs or a hosted endpoint). To decide between self-hosting and a hosted API, see Self-Host vs API; to serve it efficiently, the llm-inference-engineer. For structured document extraction with it, use the multimodal-document-extractor skill, and for the broader picture see Using Vision-Language Models for OCR, Documents, and Video.
Related
- Using Vision-Language Models for OCR, Documents, and Video UnderstandingHow to use vision-language models for OCR, documents, and video: how they differ from traditional OCR, their failure modes, and getting reliable output.
- Multimodal Document ExtractorExtract structured data from documents and images with a vision-language model — define the target schema, prompt the VLM to fill it from the page (invoices, forms, receipts, statements, IDs), and verify critical fields against the source. Use when you need reliable structured output from messy, varied, or scanned documents that defeat template-based OCR.
- Self-Host vs API: When Does Running Your Own LLM Actually Pay Off?The real economics of self-hosting an LLM vs. calling a hosted API — GPU utilization, privacy, latency, and the hidden ops costs that decide the crossover.
- LLM Inference EngineerUse this agent to serve and optimize self-hosted LLM inference — sizing GPUs, configuring a serving engine like vLLM (continuous batching, PagedAttention, tensor parallelism), applying quantization, and tuning throughput and tail latency against a cost and p95 budget. Examples — "serve Llama-3-70B at p95 under 2s on our GPUs", "our self-hosted model is slow and the GPUs sit half-idle — raise throughput", "quantize this model to fit one GPU without wrecking quality".