Portfolio — Allan Perez Feldman

Archived: no longer maintained

DEPL_2025.04.05

Agent Infrastructure Comparison

Self-Hosted vs. GPU Renting vs. API vs. Client Browser

This project runs the same LangGraph agent across four different ways to deploy an LLM — self-hosted on a home GPU, a rented cloud GPU, an API service, and directly in the visitor's browser. Each one has different trade-offs in cost, latency, and how much you actually have to maintain. The browser version runs a smaller model since most people's devices can't handle an 8B parameter model client-side.

Running inference in the browser is more interesting than it sounds — not just for chatbots, but for things like in-page search, navigation hints, or lightweight automation that never needs to touch a server.

Shared Design Decisions

Model — Qwen3 8B

Chosen because it runs on consumer GPUs (RTX 3090 / 4080), is available through OpenRouter, and ships as a GGUF build that Ollama can serve. All three server deployments use the same model for an apples-to-apples comparison. Ollama serves it as a Q4_K_M quantized GGUF whose weights take ~5.2 GB of VRAM (the KV cache adds to that), well within the 16 GB available on the RTX 4080.

Search — Tavily

Four providers were evaluated. DuckDuckGo scrapes HTML and triggers aggressive rate-limiting under normal load. Brave Search returns raw JSON that requires extra LLM parsing. Jina AI moved from free to auth-required. Tavily is purpose-built for AI agents, returns clean LLM-optimized text, and has a free tier of 1,000 requests/month.

Each query is capped at 2 searches to prevent runaway ReAct loops.
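A minimal sketch of how such a cap could be enforced. The `SearchBudget` class and its method names are hypothetical and not taken from the project; only the limit of 2 comes from the text.

```python
class SearchBudget:
    """Tracks how many searches a single query has used (hypothetical helper)."""

    def __init__(self, max_searches: int = 2) -> None:
        self.max_searches = max_searches  # 2 is the cap stated in the text
        self.used = 0

    def try_consume(self) -> bool:
        """Count one search and return True if budget remains, else return False."""
        if self.used >= self.max_searches:
            return False
        self.used += 1
        return True
```

An agent loop would call `try_consume()` before each search and, once it returns `False`, ask the model to answer from the observations it already has.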

Rate Limiting — Upstash Redis

Per-IP abuse gate: 50 requests/IP/day with in-memory burst protection. Sits in front of all four server proxies, protecting Tavily quota and GPU compute regardless of which inference backend is handling the request.

Why Ollama, not vLLM

This demo caps at 3 concurrent users — Ollama handles that fine and runs natively on Windows. For a real production system, vLLM would replace it: continuous batching, PagedAttention, prefix caching, and features like DFlash and TurboQuant give it far better throughput under actual load. The swap is trivial since both expose OpenAI-compatible APIs — only MODEL_BASE_URL changes.

vLLM requires Linux (or a Docker container on Windows). At 3 concurrent users the added complexity isn't worth it.

Why LangGraph Server

Provides SSE streaming, thread and session management, and concurrent request slots out of the box. The Nuxt/Nitro proxy posts a query and pipes the SSE stream to the browser — no custom streaming server needed.

Every response emits a metrics event (ttft_ms, total_ms, tokens_per_sec, cost_usd) — this is what makes the comparison concrete and not just theoretical.
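One plausible way to derive those four fields from timestamps is sketched below. The field names come from the text; the `build_metrics` helper and the choice to measure tokens_per_sec over the decode phase only (after the first token) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    ttft_ms: float         # time to first token
    total_ms: float        # end-to-end latency
    tokens_per_sec: float  # decode throughput
    cost_usd: float        # per-query cost for this deployment


def build_metrics(
    start_s: float,
    first_token_s: float,
    end_s: float,
    output_tokens: int,
    cost_usd: float,
) -> Metrics:
    """Compute the metrics event from three timestamps (in seconds) and a token count."""
    decode_s = end_s - first_token_s
    # Assumption: throughput excludes the time spent waiting for the first token.
    tokens_per_sec = output_tokens / decode_s if decode_s > 0 else 0.0
    return Metrics(
        ttft_ms=(first_token_s - start_s) * 1000,
        total_ms=(end_s - start_s) * 1000,
        tokens_per_sec=tokens_per_sec,
        cost_usd=cost_usd,
    )
```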

The Four Deployments

01 — SELF_HOSTED

Local desktop via Cloudflare Tunnel

GPU: RTX 4080 · 16 GB VRAM
Inference: Ollama · qwen3:8b · Q4_K_M
Tunnel: Cloudflare (free)
Concurrency: 3 parallel slots
Cost: $0 / query

Ollama is used here for the reason given under "Why Ollama, not vLLM": the demo caps at 3 concurrent users, which is sufficient for a portfolio. Moving to vLLM for production would require only changing MODEL_BASE_URL, since both expose OpenAI-compatible APIs.

Cloudflare Tunnel exposes the local LangGraph Server over HTTPS with no port-forwarding or firewall changes. Free tier, one command to start.

02 — VAST_AI_GPU

Rented cloud GPU

GPU: RTX 3090 · 24 GB VRAM
Inference: Ollama · qwen3:8b
Schedule: 06:30–17:00 IL · Sun–Thu
Billing: Per-hour while running
Cost: ~$0.15/hr · $0 / query

GitHub Actions starts the instance at 06:30 and stops it at 17:00 Israel time, Sunday through Thursday. No point serving a portfolio demo at 3 AM or on a weekend.
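The schedule window itself can be expressed as a small check. This sketch is not the project's GitHub Actions workflow; `should_be_running` is a hypothetical helper that takes a datetime already expressed in Israel local time.

```python
from datetime import datetime, time

START = time(6, 30)   # 06:30 Israel time, from the text
STOP = time(17, 0)    # 17:00 Israel time, from the text
# Python's weekday(): Monday=0 ... Sunday=6, so Sunday through Thursday is:
ACTIVE_WEEKDAYS = {6, 0, 1, 2, 3}


def should_be_running(local_dt: datetime) -> bool:
    """Return True if the instance should be up at this Israel-local time."""
    in_hours = START <= local_dt.time() < STOP
    return local_dt.weekday() in ACTIVE_WEEKDAYS and in_hours
```

A caller holding a UTC timestamp would first convert it, for example with `zoneinfo.ZoneInfo("Asia/Jerusalem")`, so that daylight-saving changes are handled by the time zone database.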

Vast.ai offers serverless (scale-to-zero), but cold starts from zero GPUs take too long for a live demo: no visitor will wait 2–3 minutes for a model to load. Keeping a minimum of 1 GPU warm avoids the cold start but defeats the cost savings. Scheduled on/off achieves near-zero idle cost without cold-start latency.

A startup.sh script runs on every boot: pulls the repo, installs deps, starts Ollama, pulls the model, and launches LangGraph Server. Restart = redeploy.

03 — OPENROUTER_API

Railway + OpenRouter

GPU: None (API call)
Model: qwen/qwen3-8b via OpenRouter
Hosting: Railway · free tier · Docker
Deploy: Auto on git push
Cost: ~$0.0005 / query

Railway's free tier is sufficient for a backend that proxies HTTP requests. Auto-deploys from main on push. No GPU needed — the model runs on OpenRouter's infrastructure.

Per-query cost is higher than self-hosted, but infrastructure overhead is zero. At scale the math can shift: managed inference services can cost less than the engineers needed to build and maintain self-hosted serving — especially given the pace of change in quantization, memory management, and token optimization.

04 — CLIENT_BROWSER

In-browser inference · Transformers.js + ONNX

Desktop model: Qwen2.5-0.5B · WebGPU → WASM
Mobile model: SmolLM2-135M · WASM only
Runtime: Transformers.js + ONNX
Search: Proxied · Nitro → Tavily
Cost: $0 / query

Mobile devices always use SmolLM2-135M over WASM. Desktop tries WebGPU first; if unavailable or device RAM is below 6 GB, it falls back to WASM with the smaller model.
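The selection rule can be written out as a small decision function. The real logic runs in the browser; this Python version is only a language-neutral sketch, `pick_runtime` is a hypothetical name, and it assumes that "the smaller model" in the fallback case means SmolLM2-135M.

```python
def pick_runtime(is_mobile: bool, has_webgpu: bool, device_ram_gb: float) -> tuple[str, str]:
    """Return (model, backend) following the rule described above."""
    if is_mobile:
        # Mobile always uses the small model over WASM.
        return ("SmolLM2-135M", "wasm")
    if has_webgpu and device_ram_gb >= 6:
        return ("Qwen2.5-0.5B", "webgpu")
    # Assumption: the desktop fallback uses SmolLM2-135M as "the smaller model".
    return ("SmolLM2-135M", "wasm")
```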

The ReAct loop runs entirely in a Web Worker. The model outputs Action: web_search / Query: ... as text; the worker parses it, calls /api/agent/search (Nitro → Tavily), injects the result as an observation, and loops until Final Answer.
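The parse-and-loop step can be sketched as follows. The actual worker runs in the browser, so this Python version only illustrates the control flow; the regular expressions, the `run_react` name, the `max_steps` guard, and the exact "Final Answer:" prefix are assumptions, and `generate` and `search` are stand-ins for the model call and the /api/agent/search request.

```python
import re
from typing import Callable, Optional

# Accepts "Action: web_search" followed by "Query: ..." on the next line
# or after a "/" separator (the exact output format is an assumption).
ACTION_RE = re.compile(r"Action:\s*web_search\s*/?\s*Query:\s*(.+)", re.IGNORECASE)
FINAL_RE = re.compile(r"Final Answer:\s*(.+)", re.IGNORECASE | re.DOTALL)


def run_react(
    question: str,
    generate: Callable[[str], str],
    search: Callable[[str], str],
    max_steps: int = 5,
) -> Optional[str]:
    """Loop model output -> parsed action -> observation until a final answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        output = generate(transcript)
        transcript += output + "\n"

        final = FINAL_RE.search(output)
        if final:
            return final.group(1).strip()

        action = ACTION_RE.search(output)
        if action is None:
            # Unparseable output: treat it as the answer rather than loop forever.
            return output.strip()

        observation = search(action.group(1).strip())
        transcript += f"Observation: {observation}\n"
    return None  # step limit reached without a final answer
```

Because small models often drift from the expected format, the fallback for unparseable output matters as much as the happy path.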

At 135M–0.5B scale, browser inference is more practical for navigation assistance, form automation, and local search than for general chatbots, and nothing leaves the device except the search query.

Cost Comparison

Deployment | Per Query | Monthly | Notes
SELF_HOSTED | $0.00 | ~$4 electricity | Hardware is sunk cost. RTX 4080 draws ~150 W under inference load.
VAST_AI_GPU | $0.00 | ~$34 | $0.15/hr × ~227 hrs/month (scheduled on/off only).
OPENROUTER_API | ~$0.0005 | < $1 at demo traffic | Scales linearly with query volume. No fixed infrastructure cost.
CLIENT_BROWSER | $0.00 | | Computation runs entirely in the visitor's browser.

Vast.ai runs on a fixed schedule (06:30–17:00 IL, Sun–Thu) — roughly 227 billable hours/month at $0.15/hr.
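The monthly figure can be reproduced with a few lines of arithmetic. The sketch below assumes an average month of 52/12 weeks; the break-even calculation against the OpenRouter per-query price is an added illustration, not a number from the project.

```python
HOURS_PER_DAY = 10.5          # 06:30 to 17:00
DAYS_PER_WEEK = 5             # Sunday through Thursday
WEEKS_PER_MONTH = 52 / 12     # assumption: average month length
RATE_USD_PER_HOUR = 0.15      # Vast.ai rate from the text
API_USD_PER_QUERY = 0.0005    # OpenRouter per-query cost from the text


def monthly_hours() -> float:
    """Billable hours per month under the fixed schedule."""
    return HOURS_PER_DAY * DAYS_PER_WEEK * WEEKS_PER_MONTH


def monthly_gpu_cost() -> float:
    """Monthly rental cost at the hourly rate."""
    return monthly_hours() * RATE_USD_PER_HOUR


def break_even_queries() -> float:
    """Monthly query volume at which the API bill equals the rented GPU bill."""
    return monthly_gpu_cost() / API_USD_PER_QUERY
```

Under these assumptions the schedule works out to about 227.5 hours and about $34 per month, and the API deployment stays cheaper than the rented GPU until roughly 68,000 queries per month.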

Notes

- Active sessions are polled every 10 seconds per deployment, so visitors can see shared load as a live explanation for slower inference.

- The Vast.ai instance uses an RTX 3090 (24 GB VRAM). The RTX 4080 (16 GB VRAM) is the home machine running SELF_HOSTED.

- Outsourcing inference costs more per call but may cost less than the engineers needed to maintain self-hosted infrastructure at scale, especially given how fast compression, caching, and optimization patterns are evolving.