Active · 2026 · Solo

Switchboard

A FastAPI service that arbitrates a single 16 GB Tesla P100 — checking VRAM, evicting what's resident, and routing each request to the right backend so a stack of local-AI front-ends can share one GPU without fighting over it.

Role: Solo (design, code, deploy)
Stack: Python, FastAPI, Ollama, KoboldCPP, CUDA, Tesla P100
Status: Active
[FIG. 01, 2026: abstract flat diagram of scattered points and crossing trend-lines resolving toward a single path, in the site's ink on cream.]

① The problem.

I have one GPU — a 16 GB Tesla P100, Pascal-era, cheap on the used market — and more jobs that want it than fit on it at once. SillyTavern wants a chat-and-narration model. ComfyUI wants an image model. An OCR pass wants a vision model. Sixteen gigabytes does not hold all of that at the same time, and swapping by hand — unload this, load that, remember what was where — got old in about a day.

② Approach.

A small FastAPI service in front of the GPU that owns the one question nothing else should have to answer: what's loaded right now. A request comes in, the switchboard checks VRAM, clears whatever's resident if it has to, routes the payload to the right backend — Ollama or KoboldCPP — and lets that model have the full card. Without that clear-first step you can't run more than one model in a single flow on 16 GB; with it, a pipeline can hand off between models that would never co-reside.
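
The flow in this section can be sketched in a few lines. This is a minimal, synchronous sketch and not the service's actual code: the `Switchboard` class and its `load`, `unload`, and `send` callables are hypothetical stand-ins for whatever talks to the real backends, and remembering `resident` in memory stands in for a real VRAM check.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Switchboard:
    """Hypothetical sketch: keep one model resident, route payloads to it."""

    load: Callable[[str], None]        # assumed hook: ask a backend to load a model
    unload: Callable[[str], None]      # assumed hook: ask a backend to free a model
    send: Callable[[str, dict], dict]  # assumed hook: forward the payload
    resident: Optional[str] = None     # what this sketch believes is on the card

    def handle(self, model: str, payload: dict) -> dict:
        # Clear first, then load, only when a different model is needed.
        if self.resident != model:
            if self.resident is not None:
                self.unload(self.resident)
            self.load(model)
            self.resident = model
        # The requested model now has the whole card to itself.
        return self.send(model, payload)


# Usage with fake hooks that only record what would have happened.
events = []
board = Switchboard(
    load=lambda m: events.append(("load", m)),
    unload=lambda m: events.append(("unload", m)),
    send=lambda m, p: {"model": m, "echo": p},
)
board.handle("vision", {"image": "page.png"})
board.handle("chat", {"prompt": "hello"})
```

A second request for the resident model skips the clear and load steps entirely, which is what lets a pipeline hand off between models without paying for a swap on every call.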

The models are picked per job: Gemma 4 (e4b) for image recognition and OCR, Mistral Nemo for SillyTavern chat and narration, and a general-purpose model on Ollama for everything else. The front-ends stay dumb about the GPU; the switchboard stays dumb about what the front-ends are for. That seam is the whole design.
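
The per-job model choice amounts to a small lookup table. The sketch below is illustrative only: the job names, the model tags, and the `Route` and `route_for` names are placeholders, and which backend serves Gemma or Mistral Nemo is an assumption here, since the text only says the general-purpose model runs on Ollama.

```python
from typing import NamedTuple


class Route(NamedTuple):
    backend: str  # "ollama" or "koboldcpp"
    model: str    # placeholder tag, not a real registry name


# Assumed mapping from job type to backend and model.
ROUTES = {
    "ocr": Route("ollama", "gemma-e4b"),
    "image-recognition": Route("ollama", "gemma-e4b"),
    "chat": Route("koboldcpp", "mistral-nemo"),
    "narration": Route("koboldcpp", "mistral-nemo"),
}

# Everything else falls through to the general-purpose model on Ollama.
DEFAULT_ROUTE = Route("ollama", "general")


def route_for(job: str) -> Route:
    """Return the route for a job, or the general-purpose default."""
    return ROUTES.get(job, DEFAULT_ROUTE)
```

Keeping the table as plain data is one way to preserve the seam: a front-end names a job, and nothing about the GPU leaks into it.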

③ What's in the box.

  • GPU — NVIDIA Tesla P100, 16 GB. Pascal, no tensor cores, surprisingly game for 7B–13B models with quantization.
  • Switchboard — a FastAPI service that checks VRAM, clears whatever's resident when a different model is needed, and routes each request to the backend that serves it. It serializes loads so two requests can't both try to claim the card at once.
  • Backends — Ollama and KoboldCPP. The switchboard hands each payload to whichever one serves the requested model.
  • Models — Gemma 4 (e4b) for image recognition and OCR, Mistral Nemo for chat and narration, and a general model on Ollama for everything else.
  • Front-ends — SillyTavern and ComfyUI point at the switchboard rather than the GPU directly. Each backend's native endpoint stays exposed too, so I can bypass the switchboard and talk to a model directly when I want.
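
One possible way to do the VRAM check and the Ollama half of the clear step, as a sketch under assumptions rather than the switchboard's real code. It assumes `nvidia-smi` is on the PATH and that Ollama listens on its default port, 11434. The helper names are hypothetical. Sending a generate request with `keep_alive` set to 0 is Ollama's documented way to unload a model; unloading from KoboldCPP is left out here.

```python
import json
import subprocess
import urllib.request
from typing import Tuple

# Ask the driver for used and total memory in MiB, as bare CSV.
NVIDIA_SMI = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_vram(output: str) -> Tuple[int, int]:
    """Parse (used_mib, total_mib) from the first GPU line of the CSV output."""
    first = output.strip().splitlines()[0]
    used, total = (int(field.strip()) for field in first.split(","))
    return used, total


def query_vram() -> Tuple[int, int]:
    """Run nvidia-smi and return (used_mib, total_mib) for the first GPU."""
    result = subprocess.run(NVIDIA_SMI, capture_output=True, text=True, check=True)
    return parse_vram(result.stdout)


def ollama_unload_request(
    model: str, base_url: str = "http://localhost:11434"
) -> urllib.request.Request:
    """Build (but do not send) the request that asks Ollama to unload a model."""
    body = json.dumps({"model": model, "keep_alive": 0}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Made-up sample line in the shape nvidia-smi prints for one GPU.
sample_used, sample_total = parse_vram("1234, 16384\n")
```

Passing the built request to `urllib.request.urlopen` would actually send it; the sketch stops short of that so it can be read and tested without a GPU or a running backend.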

④ What broke.

Pretty much everything VRAM-shaped, the first time. Asking it to keep two models warm at once was the first wall — two 13B models do not co-reside in 16 GB no matter how politely you ask. The fix was admitting the GPU holds one big thing at a time and making the clear-before-load step a first-class operation instead of an accident.

Then the queue: batched requests arriving faster than the slowest load could finish meant a fast request could sit behind a slow one for no reason. Serializing the loads, while letting inference run concurrently once a model was resident, bought most of that back. The fixes were boring, which is the goal.

⑤ Where it's going.

It's the quiet dependency under most of the local AI I run — including the hermes-agent dashboard worker on the days it wants a local model instead of a hosted one. The next move is to put it under a small career assistant I'm building for this site: its own bounded, retrieval-backed thing, built in public. The switchboard is what makes hosting that on a single closet GPU realistic.