Agent Infrastructure Comparison
This project runs the same LangGraph agent across four different ways to deploy an LLM — self-hosted on a home GPU, a rented cloud GPU, an API service, and directly in the visitor's browser. Each one has different trade-offs in cost, latency, and how much you actually have to maintain. The browser version runs a smaller model since most people's devices can't handle an 8B parameter model client-side.
Running inference in the browser is more interesting than it sounds — not just for chatbots, but for things like in-page search, navigation hints, or lightweight automation that never needs to touch a server.
Shared Design Decisions
Chosen because it runs on consumer GPUs (RTX 3090 / 4080), is available through OpenRouter, and fits Ollama's GGUF format. All three server deployments use the same model for an apples-to-apples comparison. Ollama serves it as a Q4_K_M quantized GGUF, requiring ~5.2 GB VRAM — well within the 16 GB available on the RTX 4080.
Four providers evaluated. DuckDuckGo: scrapes HTML, triggers aggressive rate-limiting under normal load. Brave Search: raw JSON requires extra LLM parsing. Jina AI: moved from free to auth-required. Tavily: purpose-built for AI agents, returns clean LLM-optimized text, 1,000 req/month free tier.
Each query is capped at 2 searches to prevent runaway ReAct loops.
Per-IP abuse gate: 50 requests/IP/day with in-memory burst protection. Sits in front of all four server proxies, protecting Tavily quota and GPU compute regardless of which inference backend is handling the request.
This demo caps at 3 concurrent users — Ollama handles that fine and runs natively on Windows. For a real production system, vLLM would replace it: continuous batching, PagedAttention, prefix caching, and features like DFlash and TurboQuant give it far better throughput under actual load. The swap is trivial since both expose OpenAI-compatible APIs — only MODEL_BASE_URL changes.
vLLM requires Linux (or a Docker container on Windows). At 3 concurrent users the added complexity isn't worth it.
Provides SSE streaming, thread and session management, and concurrent request slots out of the box. The Nuxt/Nitro proxy posts a query and pipes the SSE stream to the browser — no custom streaming server needed.
Every response emits a metrics event (ttft_ms, total_ms, tokens_per_sec, cost_usd) — this is what makes the comparison concrete and not just theoretical.
The Four Deployments
Ollama is used here because the demo caps at 3 concurrent users — sufficient for a portfolio. For production, vLLM would replace it: continuous batching, PagedAttention, prefix caching, and FlashAttention 2 give dramatically higher throughput. The swap requires only changing MODEL_BASE_URL — both expose OpenAI-compatible APIs.
Cloudflare Tunnel exposes the local LangGraph Server over HTTPS with no port-forwarding or firewall changes. Free tier, one command to start.
GitHub Actions starts the instance at 06:30 and stops it at 17:00 Israel time, Sunday through Thursday. No point serving a portfolio demo at 3 AM or on a weekend.
Vast.ai offers serverless (scale-to-zero), but cold starts from zero GPUs take too long for a live demo — no visitor will wait 2–3 minutes for a model to load. A minimum of 1 GPU defeats the cost savings. Scheduled on/off achieves near-zero idle cost without cold-start latency.
A startup.sh script runs on every boot: pulls the repo, installs deps, starts Ollama, pulls the model, and launches LangGraph Server. Restart = redeploy.
Railway's free tier is sufficient for a backend that proxies HTTP requests. Auto-deploys from main on push. No GPU needed — the model runs on OpenRouter's infrastructure.
Per-query cost is higher than self-hosted, but infrastructure overhead is zero. At scale the math can shift: managed inference services can cost less than the engineers needed to build and maintain self-hosted serving — especially given the pace of change in quantization, memory management, and token optimization.
Mobile devices always use SmolLM2-135M over WASM. Desktop tries WebGPU first; if unavailable or device RAM is below 6 GB, it falls back to WASM with the smaller model.
The ReAct loop runs entirely in a Web Worker. The model outputs Action: web_search / Query: ... as text; the worker parses it, calls /api/agent/search (Nitro → Tavily), injects the result as an observation, and loops until Final Answer.
At 0.5B–135M scale, browser inference is more practical for navigation assistance, form automation, and local search than for general chatbots — and nothing leaves the device except the search query.