Modal
Serverless AI infrastructure in pure Python — GPU functions with sub-second cold starts, secure sandboxes for agent code, batch jobs, and per-second billing.
Modal is serverless compute that feels like writing Python: decorate a function, declare its container image and GPU in code, and it runs in the cloud with sub-second cold starts and per-second billing. For agent builders, Sandboxes execute untrusted LLM-generated code in secure containers; for ML teams, it's GPU inference and massive batch jobs without Kubernetes.
Modal's pitch collapsed an entire DevOps stack into a decorator: infrastructure as Python. Container images, GPUs, autoscaling, schedules — all declared in the code that uses them, deployed in seconds, billed per second. It became a default substrate for AI teams — and, through its Sandboxes, for agents that need somewhere safe to run the code they write.
Highlights
- Functions with GPUs in one line: @app.function(gpu="h100"); container images defined in Python, cold starts in sub-second territory.
- Sandboxes for agent code: secure containers created at runtime, with sandbox.exec(), timeouts from 5 minutes to 24 hours, readiness probes, tags, and reattach via from_id(); built for LLM-generated code execution.
- Scale without ceremony: autoscaling inference endpoints, massively parallel batch jobs, scheduled functions, web endpoints.
- Storage that follows the code: Volumes (distributed filesystems), secrets, and env vars usable across functions and sandboxes.
- Beyond Python callers: define apps in Python, invoke from JavaScript/TypeScript or Go; GPU notebooks with live collaboration round it out.
In an AI-assisted workflow
pip install modal && modal setup
# @app.function(gpu="a100", image=image)
# def embed(batch): ...
# modal run pipeline.py
Two agent-era fits: the sandbox tool (the agent's execute_code pointed at a Modal Sandbox), and the self-serve inference layer. Serving open-weight models with vLLM on per-second GPUs is a canonical Modal workload, directly relevant to the self-host economics question.
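The sandbox-tool fit can be sketched as a small agent tool with a swappable runner. This is a minimal sketch, not Modal's reference integration: the app name "agent-sandbox", the helper names, and the result shape are hypothetical, and the Modal calls (modal.App.lookup, modal.Sandbox.create, exec, terminate) reflect the SDK as I understand it, so check them against the current docs. The local runner exists only so the wiring can be exercised without an account; it provides no isolation.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExecResult:
    stdout: str
    stderr: str
    returncode: int


def run_in_modal_sandbox(code: str, timeout_s: int = 300) -> ExecResult:
    """Run `code` in a fresh Modal Sandbox (needs `pip install modal && modal setup`)."""
    import modal  # imported lazily so the rest of the sketch runs without the SDK

    # "agent-sandbox" is a hypothetical app name chosen for this example.
    app = modal.App.lookup("agent-sandbox", create_if_missing=True)
    sb = modal.Sandbox.create(app=app, timeout=timeout_s)
    try:
        proc = sb.exec("python", "-c", code)
        stdout = proc.stdout.read()
        stderr = proc.stderr.read()
        returncode = proc.wait()
        return ExecResult(stdout, stderr, returncode)
    finally:
        sb.terminate()  # do not leave the container running until its timeout


def run_locally_unsafe(code: str, timeout_s: int = 10) -> ExecResult:
    """Local stand-in for testing the tool wiring. NOT a sandbox: no isolation."""
    done = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return ExecResult(done.stdout, done.stderr, done.returncode)


def execute_code(
    code: str,
    runner: Callable[[str], ExecResult],
    max_chars: int = 4000,
) -> dict:
    """Agent-facing tool: run code via `runner`, return a truncated, JSON-friendly result."""
    result = runner(code)
    return {
        "stdout": result.stdout[:max_chars],
        "stderr": result.stderr[:max_chars],
        "exit_code": result.returncode,
    }


# Exercise the wiring locally; swap in run_in_modal_sandbox for untrusted code.
demo = execute_code("print(2 + 3)", runner=run_locally_unsafe)
print(demo)
```

Keeping the runner as a parameter means the agent loop never knows where code executes, so moving from the local stand-in to a real sandbox is a one-argument change. Truncating output keeps a runaway print loop from flooding the model's context.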
TIP
The platform's killer property for spiky AI workloads is scale-to-zero with fast cold starts: experiments and bursty pipelines pay only for seconds used — the failure mode it eliminates is the idle GPU.
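To make the idle-GPU point concrete, here is a small worked comparison. The rate and the workload shape are hypothetical placeholders, not Modal's price list, and the sketch assumes the same per-second rate for both modes and ignores any seconds billed during cold starts.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def always_on_cost(rate_per_second: float, days: int) -> float:
    """Cost of a GPU that is reserved around the clock, busy or not."""
    return rate_per_second * SECONDS_PER_DAY * days


def per_second_cost(rate_per_second: float, busy_seconds_per_day: float, days: int) -> float:
    """Cost when only busy seconds are billed (scale-to-zero in between)."""
    return rate_per_second * busy_seconds_per_day * days


# Hypothetical inputs for illustration only.
RATE = 0.001            # dollars per GPU-second (placeholder)
BUSY_SECONDS = 45 * 60  # 45 minutes of actual work per day
DAYS = 30

reserved = always_on_cost(RATE, DAYS)
on_demand = per_second_cost(RATE, BUSY_SECONDS, DAYS)
utilization = BUSY_SECONDS / SECONDS_PER_DAY

print(f"always-on:   ${reserved:.2f}")
print(f"per-second:  ${on_demand:.2f}")
print(f"utilization: {utilization:.1%}")
```

At equal rates, the per-second bill is simply the always-on bill multiplied by utilization, so the advantage shrinks as the workload approaches constant load. In practice on-demand rates are often higher than reserved ones, which moves the break-even point below 100% utilization.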
Good to know
The client SDK is Apache-2.0; the platform is proprietary SaaS. Python-first by design (3.10+). Momentum is unambiguous: an $87M Series B (September 2025) followed by a $355M Series C at a $4.65B valuation (May 2026, General Catalyst and Redpoint) with $300M+ annualized revenue claimed. For how it stacks up against the sandbox-only specialists, see Sandboxing AI-Generated Code.
Frequently asked questions
- What is Modal in one sentence?
- A serverless platform where infrastructure is Python code — @app.function(gpu='h100', image=...) deploys a GPU function with autoscaling, no YAML, no cluster — billed per second of actual use.
- How do Modal Sandboxes compare to E2B?
- Same job — secure containers for executing agent-generated code, with exec, timeouts up to 24h, and reattachment by ID — different center of gravity. E2B is sandbox-first with code-interpreter ergonomics and an open infra stack; Modal's sandboxes live inside a broader compute platform, which wins when the same team also needs GPU inference, batch pipelines, and scheduled jobs in one place.
- What does Modal cost?
- Per-second usage against vendor-listed rates (e.g. H100s by the second, CPU cores and GiB-seconds likewise), with plan credits softening it: the free Starter tier includes $30/month of credits, Team $100/month on top of its subscription. You pay for compute you use and nothing while idle.
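The billing arithmetic in the last answer can be sketched as follows. The credit amounts ($30 Starter, $100 Team) come from the text above; the per-second rates, the usage figures, and the Team subscription fee are hypothetical placeholders, and the sketch assumes credits simply offset usage within the month and do not roll over.

```python
STARTER_CREDITS = 30.0  # dollars per month, from the plan description above
TEAM_CREDITS = 100.0    # dollars per month, from the plan description above


def usage_usd(seconds_by_resource: dict[str, float], rate_by_resource: dict[str, float]) -> float:
    """Sum seconds * rate over every metered resource."""
    return sum(
        seconds * rate_by_resource[name]
        for name, seconds in seconds_by_resource.items()
    )


def monthly_bill(usage: float, included_credits: float, subscription: float = 0.0) -> float:
    """Subscription fee plus whatever usage exceeds the included credits."""
    return subscription + max(0.0, usage - included_credits)


# Hypothetical rates (dollars per second) and usage; not vendor-listed prices.
rates = {"gpu": 0.001, "cpu_core": 0.00001, "mem_gib": 0.000002}
seconds = {"gpu": 20_000, "cpu_core": 400_000, "mem_gib": 1_000_000}

usage = usage_usd(seconds, rates)
print(f"usage:        ${usage:.2f}")
print(f"Starter bill: ${monthly_bill(usage, STARTER_CREDITS):.2f}")
```

With these placeholder numbers the month's usage stays under the Starter credits, so nothing is owed. Passing a subscription fee and TEAM_CREDITS models the Team plan the same way.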
Related
- Sandboxing AI-Generated Code: E2B vs Modal vs Daytona vs Vercel Sandbox
  Where should agent-written code run? The four sandbox platforms compared (isolation models, persistence, economics), plus the design rules that keep execution safe.
- E2B
  Open-source Firecracker-microVM sandboxes where AI agents safely execute untrusted code: stateful code interpreters with full Linux, pause/resume, and desktop VMs.
- Daytona
  Sub-90ms agent sandboxes: isolated computers with snapshots, volumes, Git and LSP tools, on Linux, Windows, or Android; AGPL self-host or managed cloud.
- vLLM
  A high-throughput, memory-efficient inference and serving engine for LLMs, with PagedAttention, continuous batching, and an OpenAI-compatible API server.
- LLM Inference Engineer
  Use this agent to serve and optimize self-hosted LLM inference: sizing GPUs, configuring a serving engine like vLLM (continuous batching, PagedAttention, tensor parallelism), applying quantization, and tuning throughput and tail latency against a cost and p95 budget. Examples: "serve Llama-3-70B at p95 under 2s on our GPUs", "our self-hosted model is slow and the GPUs sit half-idle; raise throughput", "quantize this model to fit one GPU without wrecking quality".
- Self-Host vs API: When Does Running Your Own LLM Actually Pay Off?
  The real economics of self-hosting an LLM vs. calling a hosted API: GPU utilization, privacy, latency, and the hidden ops costs that decide the crossover.
- Vercel Sandbox
  Ephemeral Firecracker microVMs on Vercel for untrusted and AI-generated code: millisecond startup, Node and Python runtimes, persistent by default.