Zero-cost tokens: hosted free tiers and local runtimes.
Free API tier (LLM + audio)
Free tier, no card: ~30 RPM / 6,000 TPM / 1,000 requests/day on most models (small 8B models up to 14,400 RPD; Llama 4 Scout has 30K TPM; Llama 4 Maverick halved to 15 RPM / 500 RPD). Whisper large-v3 / v3-turbo: 20 RPM, 2,000 requests/day, 7,200 audio-seconds/hour, 25MB upload cap. Limits are per key per model; reset midnight UTC.
Speed-critical text and tool-calling (LPU hardware, 3-10x faster than GPU providers) and free Whisper transcription pipelines. OpenAI-compatible endpoint.
⚠ 6K TPM is the binding constraint for long prompts, not the daily cap. You hit whichever limit (RPM/TPM/RPD) arrives first, so read the x-ratelimit-* headers and back off. Keys are occasionally revoked or go dead; rotate and re-verify. Batch workloads outgrow it fast.
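The advice to read the x-ratelimit-* headers and back off can be sketched as a small helper that turns response headers into a pause length. The exact header names used here (`retry-after`, `x-ratelimit-remaining-requests`, `x-ratelimit-reset-tokens`, and so on) and the `2m59.56s` duration format are assumptions about what the provider sends, so check a real response before relying on them.

```python
import re

# One "<number><unit>" piece of a duration such as "2m59.56s" or "350ms".
# "ms" is listed before "m" so milliseconds are not misread as minutes.
_DURATION_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}


def parse_duration(text: str) -> float:
    """Convert '2m59.56s', '350ms', or a bare number of seconds to float seconds."""
    text = text.strip()
    try:
        return float(text)  # plain seconds, e.g. "12"
    except ValueError:
        pass
    parts = _DURATION_PART.findall(text)
    return sum(float(value) * _UNIT_SECONDS[unit] for value, unit in parts)


def seconds_to_wait(headers: dict[str, str]) -> float:
    """Return how long to pause before the next request (0.0 if no pause is needed)."""
    h = {key.lower(): value for key, value in headers.items()}
    if "retry-after" in h:
        # Assumed to be a number of seconds; an HTTP-date value would parse as 0.0 here.
        return parse_duration(h["retry-after"])
    waits = [0.0]
    for kind in ("requests", "tokens"):
        remaining = h.get(f"x-ratelimit-remaining-{kind}")
        reset = h.get(f"x-ratelimit-reset-{kind}")
        # Only wait for a bucket that is actually empty.
        if remaining is not None and reset is not None and float(remaining) <= 0:
            waits.append(parse_duration(reset))
    return max(waits)
```

Call `seconds_to_wait(response.headers)` after every response and sleep for the result before sending the next request; whichever bucket (requests or tokens) is empty determines the pause.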
Free API tier (LLM + vision + embeddings)
No card, no expiry. Post-Dec-2025 quota cuts: Gemini 2.5 Flash-Lite ~15 RPM / 1,000 RPD; Flash ~10 RPM / 250 RPD; Pro ~5 RPM / 50-100 RPD; ~250K TPM across models. Multimodal (vision/audio input) and gemini-embedding included in free quota. Limits per project, reset midnight Pacific.
Highest-quality free model access, long context (1M tokens), free vision/multimodal calls, and free embeddings: the volume workhorse of a free-provider stack.
⚠ Free-tier prompts may be used to improve Google products, so don't send sensitive data. Sources conflict on exact RPD (some still quote the pre-cut 1,500 RPD for Flash; ~250 is the safer planning number as of early 2026, so verify in your console). Quotas have been volatile, and image generation has undocumented stricter limits.
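To see why the daily cap, not the per-minute one, drives planning on this tier, the arithmetic can be sketched as below. The function names are illustrative, the figures are the approximate ones quoted above, and spreading requests evenly across 24 hours is a simplifying assumption.

```python
def min_interval_seconds(rpm: float, rpd: float) -> float:
    """Smallest gap between requests that respects both the per-minute and per-day caps."""
    return max(60.0 / rpm, 86_400.0 / rpd)


def minutes_to_exhaust(rpm: float, rpd: float) -> float:
    """Minutes until the daily quota is gone when sending at the full per-minute rate."""
    return rpd / rpm


# Flash at ~10 RPM / ~250 RPD: one request every 345.6 s if spread over the day,
# or the whole daily quota spent in 25 minutes at full speed.
flash_gap = min_interval_seconds(rpm=10, rpd=250)
flash_burst = minutes_to_exhaust(rpm=10, rpd=250)
```

A scheduler that sleeps for `min_interval_seconds(...)` between calls never trips either limit, at the cost of using the quota slowly.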
Free API tier (LLM)
1,000,000 tokens/day free, no card, no waitlist. ~30 RPM, 60K-100K TPM depending on model. Llama, Qwen, GPT-OSS-120B, GLM families. OpenAI-compatible. Daily reset, no rollover.
Token-hungry batch jobs (classification, generation pipelines, daily reports) and anything latency-sensitive: wafer-scale hardware pushes 1,000-3,000 tokens/sec.
⚠ Free-tier context window is capped at 8,192 tokens across all models ('temporary', up to 128K on request), which kills long-context use cases. Preview models have much tighter caps (e.g. GLM 4.7 at 10 RPM / 100 RPD). Model lineup rotates.
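A batch job on this tier has to respect two numbers at once: the 8,192-token context cap per request and the 1,000,000-token daily budget. A minimal planning sketch follows; `plan_batch` is a hypothetical helper, and it assumes `tokens_per_request` counts prompt plus completion.

```python
import math


def plan_batch(
    n_requests: int,
    tokens_per_request: int,
    daily_tokens: int = 1_000_000,
    context_cap: int = 8_192,
) -> tuple[int, int]:
    """Return (requests that fit per day, days needed for the whole batch)."""
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    if tokens_per_request > context_cap:
        # The request would not fit in the free-tier context window at all.
        raise ValueError(
            f"{tokens_per_request} tokens exceeds the {context_cap}-token context cap"
        )
    per_day = daily_tokens // tokens_per_request
    return per_day, math.ceil(n_requests / per_day)


# 5,000 classification calls at ~2,000 tokens each: 500 per day, 10 days in total.
per_day, days = plan_batch(n_requests=5_000, tokens_per_request=2_000)
```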
Free API tier (aggregator)
Models with ':free' suffix: 20 RPM; 50 requests/day with <$10 lifetime credits purchased, 1,000 requests/day once you've topped up $10 (one-time, the $10 stays spendable). Dozens of free models: DeepSeek R1/V3, Llama, Gemini Flash variants, etc.
Trying many models behind one OpenAI-compatible key; fallback routing when a primary free provider 429s. The $10 unlock is the best $10 in free inference.
⚠ Free variants are served by third-party providers (e.g. Chutes) that throttle free traffic to favour their own paying users, so 429s are common at peak and aren't OpenRouter's doing. Free model lineup churns weekly. Provider data policies vary; check the model page before sending anything private. Negative balance = 402 even on free models. Multiple accounts don't help (global limits).
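Fallback routing when a primary free provider returns 429 can be sketched with the standard library alone, since the providers listed here expose OpenAI-compatible endpoints. The provider tuples, base URLs, and model names passed in are placeholders you supply, and the `/chat/completions` path relative to the base URL is an assumption about each endpoint's layout.

```python
import json
import urllib.error
import urllib.request


def http_send(base_url: str, api_key: str, payload: dict) -> tuple[int, dict]:
    """POST a chat-completions payload and return (HTTP status, parsed JSON body)."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=60) as response:
            return response.status, json.loads(response.read())
    except urllib.error.HTTPError as error:
        return error.code, {}


def complete_with_fallback(providers, payload, send=http_send):
    """Try (name, base_url, api_key, model) entries in order, skipping any that answer 429."""
    rate_limited = []
    for name, base_url, api_key, model in providers:
        # Each provider uses its own model identifier, so set it per attempt.
        status, body = send(base_url, api_key, {**payload, "model": model})
        if status == 200:
            return name, body
        if status == 429:
            rate_limited.append(name)
            continue
        raise RuntimeError(f"{name} returned HTTP {status}")
    raise RuntimeError("all providers rate-limited: " + ", ".join(rate_limited))
```

Only 429 triggers the fallback in this sketch; other errors (including the 402 on a negative balance) are raised so they are not silently masked by the next provider.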
Free API tier (LLM)
Free trial with no time limit: the old 1,000-5,000 credit system was replaced by per-model rate limits (~40 RPM typical, shown top-right of build.nvidia.com; not published per model). Hosts Nemotron, Llama, Mistral, Microsoft and other open models. OpenAI-compatible.
Free access to big open models (e.g. Nemotron-3-Super-120B) for prototyping, research, and agent backends without self-hosting.
⚠ Licensed for trial/dev/test only; production use formally requires an NVIDIA AI Enterprise license. Rate-limit increases need a forum request (requests to go from 40 to 200 RPM routinely sit unanswered). Older accounts may still show the legacy credit system; personal-email signups got fewer credits. Limits unpublished, so budget conservatively.
Free API tier (LLM + code + embeddings)
Experiment plan: $0, no card (phone verification required). Per-model limits: ~1 req/sec, 500K tokens/min, 1B tokens/month, and limits are per model, so multiple models multiply your quota. Includes proprietary models (Mistral Large/Medium, Codestral, Pixtral) plus mistral-embed.
Huge free token volume for prototyping; free Codestral for code completion; an EU-jurisdiction provider if data residency matters.
⚠ Free tier requires opting in to data training: your prompts train their models. Explicitly for evaluation/prototyping, not production. The ~1 request/sec rate limit, not the monthly token cap, is the real constraint for agent loops.
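Because the roughly one-request-per-second limit is what an agent loop runs into first, a client-side throttle is the usual remedy. The class below is a minimal single-threaded sketch; `Throttle` is a hypothetical name, and the injectable `clock` and `sleep` parameters exist only to make it testable.

```python
import time


class Throttle:
    """Enforce a minimum gap between calls (single-threaded use only)."""

    def __init__(self, min_interval: float = 1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time the previous call was allowed through

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        if self._last is None:
            delay = 0.0
        else:
            delay = max(0.0, self._last + self.min_interval - now)
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay
```

Create one `Throttle` per model, since the limits are per model, and call `throttle.wait()` immediately before each API request.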
Free API tier (LLM playground/API)
Free with any GitHub account: OpenAI GPT-4.1/o-series, Llama, Phi, Mistral, DeepSeek etc. via one GitHub PAT. Per-model tiered limits (roughly: 'low' tier ~15 RPM / 150 RPD, 'high' tier ~10 RPM / 50 RPD, reasoning models single-digit RPD) with small context/output caps (~8K in / 4K-8K out on free).
Zero-signup model comparison and CI/Actions-integrated LLM calls using credentials you already have.
⚠ Rate-limit numbers above are from documented 2025 tiers. GitHub has been folding this into Copilot premium-request billing, so re-verify current per-model RPDs (uncertain as of June 2026). Token windows are clipped versus the same model elsewhere. 403s can come from org policy or region blocks, not quota. Not for production.
Free API tier (edge: LLM + embeddings + image + STT)
10,000 'neurons'/day free, no card, on Workers Free. Catalog of 50+ open models: Llama 3.x, Gemma, Mistral, Stable Diffusion XL (image gen), BGE embeddings, Whisper STT. Runs on Cloudflare's 300+ edge locations; OpenAI-compatible via AI Gateway.
Edge-latency inference inside Workers apps, and the rare free tier that covers text + embeddings + image generation + speech-to-text under one quota.
⚠ 'Neurons' are an opaque compute unit: per-model neuron cost is hard to find, so the 10K/day budget is unpredictable until you measure it. Models are small/open-weight only, with no frontier quality. 10K neurons is only a few hundred LLM responses in practice.
Free API tier (aggregator)
Free accounts get a small monthly credit bucket (~$0.10/month) routed through the HF router to Groq, Together, Fireworks, Hyperbolic, Cerebras etc. PRO ($9/mo) gets ~$2/month included credits (20x) plus ZeroGPU. No markup over provider rates.
One key + one SDK over many providers; quick smoke tests of any Hub model; BYO provider keys routed through HF for unified billing.
⚠ The free $0.10/month is exhausted in a handful of calls; it's a taster, not an inference budget. No pay-as-you-go on free accounts (hard stop at the cap). Error messages name the downstream provider (e.g. 'limit for groq'), which confuses people into thinking it's a provider-side ban.
Free API tier (embeddings + rerank + chat)
Free trial API keys, no card: historically ~20 RPM and ~1,000 calls/month covering Chat, Embed (embed-v4), Rerank, and Classify endpoints.
Free reranking (Cohere Rerank is the standout free citizen here for RAG pipelines), plus solid multilingual embeddings at prototype volume.
⚠ Exact 2026 trial quotas unverified: Cohere has tweaked trial limits several times, so confirm in the dashboard before building on them. Trial keys are for evaluation only; production requires a paid key. Trial traffic may be used for model improvement.
Free API tier (embeddings + utilities)
Free API key on signup with a one-time token grant (historically ~1M tokens, promotional grants up to 10M) covering jina-embeddings-v3/v4, reranker, and the Reader API (r.jina.ai URL-to-markdown, generous free rate without a key).
Free long-context multilingual embeddings and the Reader API, the easiest free way to turn web pages into clean LLM-ready markdown.
⚠ Token grant is one-time, not monthly, and it runs out silently. Exact grant size as of mid-2026 unverified; check the dashboard. Reader API rate limits tightened over time for keyless use.
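The Reader API works by prefixing the target URL with r.jina.ai, which a short standard-library sketch can show. The helper names are illustrative, and passing the key as a Bearer token is an assumption to confirm against the current docs; without a key the call uses the keyless rate.

```python
import urllib.request

READER_PREFIX = "https://r.jina.ai/"


def reader_url(target: str) -> str:
    """Build the Reader URL for a page by prefixing the full target URL."""
    if not target.startswith(("http://", "https://")):
        raise ValueError("target must be an absolute http(s) URL")
    return READER_PREFIX + target


def fetch_markdown(target: str, api_key=None, timeout: float = 30.0) -> str:
    """Fetch a page as markdown; the API key is optional."""
    request = urllib.request.Request(reader_url(target))
    if api_key:
        # Assumed auth scheme for keyed use.
        request.add_header("Authorization", f"Bearer {api_key}")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, `fetch_markdown("https://example.com/post")` requests `https://r.jina.ai/https://example.com/post` and returns the response body as text.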
Free API tier (embeddings + rerank)
Free token grant on signup: historically 200M free tokens on most models (50M on some premium ones), which is enormous for embeddings. voyage-3.5 family + rerankers.
Embedding a large corpus once for free: 200M tokens covers most personal/startup RAG indexes end-to-end. Top-tier retrieval quality.
⚠ Grant size/terms as of June 2026 unverified (MongoDB acquired Voyage in 2025, so the free tier may have shifted under MongoDB billing); confirm before relying on it. One-time grant, not recurring. Low default RPM until you add a card.
Local inference (server + CLI)
Fully free, open source, unlimited, bounded only by your hardware. One-line model pulls (Llama 3.x, Qwen, DeepSeek-R1 distills, Gemma, Phi, embeddings like nomic-embed/BGE), OpenAI-compatible local server on :11434.
Default local backend for agents, bulk/recurring jobs, and private data: zero marginal cost, no rate limits, works offline. Easiest local on-ramp.
⚠ Quality ceiling is your RAM/VRAM: a 32GB laptop ≈ 7B-14B models at useful speed (CPU/iGPU inference is several tokens/sec, not Groq-fast). Quantization (Q4 etc.) trades quality for fit. Ollama's cloud/web-search add-ons are NOT free; local-only is the free path.
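The trade-off between quantization and fit comes down to simple arithmetic: weight size is roughly parameters times bits per weight divided by eight. The sketch below covers weights only; the KV cache, runtime overhead, and whatever else is using RAM come on top, and the 50% headroom default is an arbitrary assumption, not a rule.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8.0


def fits(params_billion: float, bits_per_weight: float, ram_gb: float,
         headroom: float = 0.5) -> bool:
    """True if the weights fit within `headroom` (a fraction) of total RAM."""
    return weight_memory_gb(params_billion, bits_per_weight) <= ram_gb * headroom


# A 14B model at 4 bits per weight needs about 7 GB for weights;
# the same model at 16 bits needs about 28 GB and no longer fits the budget.
q4_size = weight_memory_gb(14, 4)
fp16_size = weight_memory_gb(14, 16)
```

Real Q4 formats typically store somewhat more than 4 bits per weight, so treat the result as a lower bound.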
Local inference (engine)
Free, MIT-licensed. Runs any GGUF model on CPU, CUDA, Metal, Vulkan, SYCL/oneAPI (Intel iGPU/NPU paths). llama-server gives an OpenAI-compatible endpoint; also does embeddings, speculative decoding, grammar-constrained output.
Maximum control and best performance-per-watt on non-NVIDIA hardware (e.g. Intel Meteor Lake iGPU via Vulkan/SYCL); the engine underneath Ollama/LM Studio when you need flags they don't expose.
⚠ DIY ergonomics: you choose build flags, pick quants, and tune context/offload yourself. Build options (Vulkan vs SYCL) materially change speed on iGPUs, so benchmark both. Moves fast; pin a release for anything durable.
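Tuning context and offload yourself mostly means choosing a few llama-server flags. A minimal sketch of building that command line from Python follows; the model path is a placeholder, the default values are arbitrary, and flag names can change between releases, so check `llama-server --help` for the build you have pinned.

```python
import shlex


def llama_server_argv(model_path: str, ctx_size: int = 8192,
                      gpu_layers: int = 0, port: int = 8080) -> list[str]:
    """Build an argument list for llama-server (pass it to subprocess.Popen)."""
    return [
        "llama-server",
        "--model", model_path,              # path to a GGUF file
        "--ctx-size", str(ctx_size),        # context window in tokens
        "--n-gpu-layers", str(gpu_layers),  # layers offloaded to the GPU/iGPU
        "--port", str(port),
    ]


# Placeholder path; a large gpu_layers value requests offloading every layer.
argv = llama_server_argv("models/example.gguf", ctx_size=4096, gpu_layers=99)
command_line = shlex.join(argv)
```

Keeping the flags in one function makes it easy to launch the same configuration against a Vulkan build and a SYCL build and compare throughput.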
Local inference (GUI app)
Free for personal AND work use (license relaxed 2024). GUI model browser/downloader for GGUF + MLX, chat UI, and a local OpenAI-compatible server with per-model load/offload controls.
Friendliest way to discover, download, and A/B local models before promoting one into a headless Ollama/llama.cpp deployment; good for demoing local AI to non-technical people.
⚠ Closed source (free ≠ open). Heavier than headless llama.cpp, so not what you want running as a daemon. Hardware ceiling caveats same as all local inference.
Local inference (audio)
Fully free, open source, unlimited local speech-to-text. whisper.cpp (GGML, CPU/iGPU-friendly) and faster-whisper (CTranslate2, ~4x realtime on CPU with int8) run Whisper large-v3/turbo and distil-whisper locally.
Private or bulk transcription with no per-minute API cost and no 25MB upload caps: podcasts, meeting archives, voice pipelines that must work offline.
⚠ Slower than Groq-hosted Whisper (seconds-to-minutes vs near-instant), so use hosted for interactive, local for batch/private. Accuracy on accents/noise depends on model size; large-v3 needs ~3-4GB RAM quantized. Diarization needs a separate tool (e.g. pyannote).
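For the batch/private case, a local transcription pass with faster-whisper on CPU with int8 might look like the sketch below. It assumes the `faster-whisper` package is installed and that the model can be downloaded on first use; the import is deferred so the helper functions load without it, and the output format is just one illustrative choice.

```python
def format_timestamp(seconds: float) -> str:
    """Render a position in seconds as HH:MM:SS (fractions are truncated)."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def transcribe(path: str, model_size: str = "large-v3") -> list[str]:
    """Transcribe one audio file locally and return timestamped lines."""
    # Imported here so this module loads even when faster-whisper is not installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    # transcribe() returns a lazy generator of segments plus audio metadata.
    segments, _info = model.transcribe(path)
    return [
        f"[{format_timestamp(segment.start)}] {segment.text.strip()}"
        for segment in segments
    ]
```

Because the segments are produced lazily, the transcription work happens while the list is being built, so a long file blocks inside `transcribe` until it finishes.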