Local model runtime

LLMs, vision, and model tools

Tater can run local models through llama.cpp, Hugging Face Transformers, and MLX Engine, or connect to remote OpenAI-compatible providers when you want an external server.

llama.cpp, GGUF, Transformers, MLX Engine, Vision, Chat templates
Provider choices

Choose the local runtime that matches the model and machine.

The Base model controls normal chat, Hydra planning, Memory Core extraction, briefs, and most internal AI calls. Vision can use the Base model when supported or a dedicated vision model.

Provider: llama_cpp
  • Best for: Fast GGUF text and vision on NVIDIA, Apple Metal, CPU, and other llama.cpp-supported backends.
  • Model shape: Single GGUF file, or GGUF plus matching mmproj for vision.
  • Notes: Supports context, batch, micro-batch, GPU KV offload, Flash Attention, MTP settings, and chat template overrides.

Provider: hf_transformers
  • Best for: Transformers models that need PyTorch, custom architectures, or Hugging Face-native loading.
  • Model shape: Full Hugging Face repo with config, tokenizer, and weights.
  • Notes: Supports device, dtype, device map, attention implementation, trust remote code, context, and chat template overrides.

Provider: mlx_lm
  • Best for: Apple Silicon local text and vision through Tater's MLX Engine path.
  • Model shape: Full MLX repo, including sharded safetensors, tokenizer, config, and processor files when needed.
  • Notes: MLX Engine is always used for MLX text and vision. There is no fallback to the older plain MLX-LM or MLX-VLM runtime.

Provider: openai_compatible
  • Best for: Remote servers, or local servers run outside Tater, that expose OpenAI-compatible chat endpoints.
  • Model shape: Provider model name served by the external endpoint.
  • Notes: Useful when another service owns model loading, GPU scheduling, or hosted inference.
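
As a rough illustration of the model shapes listed for each provider, the sketch below checks that a selection has the pieces its provider expects. The `ModelSelection` class, its field names, and `missing_pieces` are hypothetical and are not Tater's actual settings schema. Only the four provider identifiers come from the text above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ModelSelection:
    """Hypothetical description of one configured model (not Tater's schema)."""
    provider: str                  # llama_cpp, hf_transformers, mlx_lm, openai_compatible
    model: str                     # GGUF path, repo id, or served model name
    mmproj: Optional[str] = None   # projector file, only relevant for llama.cpp vision
    wants_vision: bool = False


def missing_pieces(sel: ModelSelection) -> List[str]:
    """Return readable problems; an empty list means the shape looks right."""
    problems: List[str] = []
    if sel.provider == "llama_cpp":
        if not sel.model.lower().endswith(".gguf"):
            problems.append("llama_cpp expects a single GGUF file")
        if sel.wants_vision and not sel.mmproj:
            problems.append("llama_cpp vision also needs a matching mmproj")
    elif sel.provider in ("hf_transformers", "mlx_lm"):
        if sel.model.lower().endswith(".gguf"):
            problems.append(f"{sel.provider} expects a full repo, not one GGUF file")
    elif sel.provider == "openai_compatible":
        if not sel.model:
            problems.append("openai_compatible needs the model name served by the endpoint")
    else:
        problems.append(f"unknown provider: {sel.provider}")
    return problems
```

For example, `missing_pieces(ModelSelection("llama_cpp", "model.gguf", wants_vision=True))` reports the missing projector, while the same selection without vision passes.
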
Model browser

Hugging Face downloads

The Hugging Face tab can browse, filter, and download local models for each runtime.

  • llama.cpp uses the Hub apps=llama.cpp filter and downloads individual GGUF files, including matching projectors when needed.
  • MLX downloads the full repo instead of one file, because tokenizer, config, safetensors shards, and processor files must stay together.
  • Transformers downloads the full repo for PyTorch/Transformers loading.
  • Download cards show repo/file progress, bytes, speed, ETA, and cancellation state across refreshes and tab changes.
  • A Hugging Face token can be saved through the Hugging Face integration for gated/private models and higher Hub rate limits.
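
The difference between single-file and full-repo downloads can be sketched with the `huggingface_hub` package. This is an assumption about one way to do it, not a description of Tater's downloader: `plan_download` and `run_download` are hypothetical helpers, while `hf_hub_download` and `snapshot_download` are the library's own functions.

```python
from typing import Dict, Optional


def plan_download(runtime: str, repo_id: str, filename: Optional[str] = None) -> Dict[str, str]:
    """Decide between a single-file and a full-repo download (hypothetical helper)."""
    if runtime == "llama_cpp":
        if not filename:
            raise ValueError("llama.cpp downloads need a GGUF filename")
        return {"mode": "file", "repo_id": repo_id, "filename": filename}
    if runtime in ("mlx_lm", "hf_transformers"):
        # Tokenizer, config, shards, and processor files must stay together.
        return {"mode": "repo", "repo_id": repo_id}
    raise ValueError(f"no Hub download path for runtime {runtime!r}")


def run_download(plan: Dict[str, str], token: Optional[str] = None) -> str:
    """Execute a plan with huggingface_hub and return the local path."""
    # Imported lazily so the planning logic works without the package installed.
    from huggingface_hub import hf_hub_download, snapshot_download

    if plan["mode"] == "file":
        return hf_hub_download(
            repo_id=plan["repo_id"], filename=plan["filename"], token=token
        )
    return snapshot_download(repo_id=plan["repo_id"], token=token)
```

For llama.cpp vision, this sketch would be called twice: once for the model GGUF and once for the matching projector file. Passing `token` covers gated or private repos.
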
Runtime state

Loading and monitoring

Save & Load warms selected local models. The runtime pill and popup show loaded models, CPU/GPU/RAM/VRAM, LLM calls, vision calls, context estimates, and recent activity.

  • The Debug mini-tab shows live prompt and generation events for local model calls.
  • Context cards estimate model fit without forcing dashboard refreshes or expensive reloads.
  • Local vision workers isolate crash-prone llama.cpp vision calls so Tater stays online if a native backend fails.
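
The worker isolation described in the last bullet can be illustrated with a child process. This is a minimal sketch under stated assumptions, not Tater's worker code: a throwaway Python subprocess stands in for the vision worker, `os.abort()` stands in for a native backend fault, and `run_isolated_vision_call` is a hypothetical name.

```python
import json
import subprocess
import sys

# Source for the stand-in worker. A real worker would load the vision backend here.
WORKER_SOURCE = r"""
import json, os, sys
request = json.loads(sys.stdin.read())
if request.get("simulate_crash"):
    os.abort()  # stand-in for a crash inside a native library
print(json.dumps({"caption": f"described {request['image']}"}))
"""


def run_isolated_vision_call(request: dict, timeout: float = 30.0) -> dict:
    """Run one vision request in a child process so a crash cannot take down the caller."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", WORKER_SOURCE],
            input=json.dumps(request),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "error": "worker timed out"}
    if proc.returncode != 0:
        # The child died; the parent is still running and can report the failure.
        return {"ok": False, "error": f"worker exited with code {proc.returncode}"}
    return {"ok": True, "result": json.loads(proc.stdout)}
```

The parent only ever sees a structured success or failure, so a failed call becomes an error message rather than an outage.
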
Local tuning

The settings are runtime-specific.

llama.cpp (GGUF)

Context and GPU controls

Set text context separately from vision context, then tune eval batch, micro-batch, Flash Attention, GPU KV offload, and Multi-Token Prediction draft tokens.
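
To make these controls concrete, the sketch below maps them onto constructor arguments of the `llama-cpp-python` bindings. Whether Tater uses those bindings is an assumption, and `llama_cpp_kwargs` and `load_llama` are hypothetical helpers. The keyword names (`n_ctx`, `n_batch`, `n_ubatch`, `flash_attn`, `offload_kqv`, `n_gpu_layers`) are real `Llama` parameters. Multi-Token Prediction draft tokens are left out because their mapping is not stated in the text.

```python
from typing import Any, Dict


def llama_cpp_kwargs(
    model_path: str,
    text_ctx: int = 8192,
    eval_batch: int = 512,
    micro_batch: int = 512,
    flash_attention: bool = True,
    gpu_kv_offload: bool = True,
    gpu_layers: int = -1,
) -> Dict[str, Any]:
    """Translate the tuning controls into llama-cpp-python keyword arguments."""
    return {
        "model_path": model_path,
        "n_ctx": text_ctx,
        "n_batch": eval_batch,
        # The micro-batch subdivides the eval batch, so clamp it to that size.
        "n_ubatch": min(micro_batch, eval_batch),
        "flash_attn": flash_attention,
        "offload_kqv": gpu_kv_offload,
        "n_gpu_layers": gpu_layers,  # -1 asks the bindings to offload every layer
    }


def load_llama(kwargs: Dict[str, Any]):
    """Load the model; imported lazily so the mapping works without the package."""
    from llama_cpp import Llama

    return Llama(**kwargs)
```

A separate vision context would simply be a second call with a different `text_ctx` value for the vision model.
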

Transformers (PyTorch)

Device and precision

Select device, dtype, device map, attention implementation, trust remote code, and context length for Transformers models.
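
As an illustration, most of these settings correspond to keyword arguments of `from_pretrained` in Hugging Face Transformers. The helpers `transformers_kwargs` and `load_transformers` are hypothetical, and the defaults are assumptions rather than Tater's values. Context length is not a load-time argument in this sketch, `device_map="auto"` needs the `accelerate` package, and newer Transformers releases may prefer `dtype` over `torch_dtype` as the keyword name.

```python
from typing import Any, Dict

ALLOWED_ATTENTION = {"eager", "sdpa", "flash_attention_2"}


def transformers_kwargs(
    dtype: str = "auto",
    device_map: str = "auto",
    attn_implementation: str = "sdpa",
    trust_remote_code: bool = False,
) -> Dict[str, Any]:
    """Collect the load settings as from_pretrained keyword arguments."""
    if attn_implementation not in ALLOWED_ATTENTION:
        raise ValueError(f"unsupported attention implementation: {attn_implementation}")
    return {
        "torch_dtype": dtype,
        "device_map": device_map,
        "attn_implementation": attn_implementation,
        # Only enable for repos you trust: it runs code shipped with the model.
        "trust_remote_code": trust_remote_code,
    }


def load_transformers(repo_id: str, **overrides: Any):
    """Load tokenizer and model; imports are lazy so the mapping stays lightweight."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    kwargs = transformers_kwargs(**overrides)
    if kwargs["torch_dtype"] != "auto":
        # Turn a name such as "bfloat16" into the torch dtype object.
        kwargs["torch_dtype"] = getattr(torch, kwargs["torch_dtype"])
    tokenizer = AutoTokenizer.from_pretrained(
        repo_id, trust_remote_code=kwargs["trust_remote_code"]
    )
    model = AutoModelForCausalLM.from_pretrained(repo_id, **kwargs)
    return tokenizer, model
```
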

MLX Engine (Apple Silicon)

Engine-backed MLX

Tater routes MLX text and vision through MLX Engine with context length, lazy load, trust remote code, prefill step, and quantized KV settings.

Vision

Image understanding

Vision calls are separate from text chat. If a core, Verba, or chat attachment asks for image understanding, Tater routes the request through the configured vision path.

  • llama.cpp vision needs a compatible model GGUF and matching mmproj projector.
  • MLX vision uses the MLX Engine path and expects the selected model/repo to include the required vision processor and weights.
  • Dedicated vision models can be selected when the Base text model is not vision-capable.
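
The routing rules above can be sketched as a small selection function. Everything here is hypothetical: `ModelRef`, its fields, and `pick_vision_model` are illustrative names, and giving a dedicated vision model precedence over the Base model is an assumption the text does not confirm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelRef:
    """Hypothetical handle for a configured model."""
    name: str
    provider: str
    vision_capable: bool = False
    mmproj: Optional[str] = None  # llama.cpp projector file, if any


def pick_vision_model(base: ModelRef, dedicated: Optional[ModelRef] = None) -> ModelRef:
    """Choose the model that should serve an image-understanding request."""
    # Assumption: a dedicated vision model, when set, wins over the Base model.
    chosen = dedicated if dedicated is not None else base
    if not chosen.vision_capable:
        raise LookupError(f"{chosen.name} is not vision-capable; select a vision model")
    if chosen.provider == "llama_cpp" and not chosen.mmproj:
        raise LookupError("llama.cpp vision needs a matching mmproj projector")
    return chosen
```

With a text-only Base model and no dedicated model, the function raises instead of silently sending the image to a model that cannot read it.
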
Thinking control

Templates and response shape

Tater strips visible thinking blocks where possible, supports provider-specific chat template overrides, and keeps the prompt separate from the model chat template.

  • Use the Chat Template button beside local models to inspect embedded templates and save overrides.
  • Overrides persist in Redis and are reused after restart until reset to the embedded template.
  • For models whose template supports thinking flags, edit the template itself rather than injecting extra system prompt text.
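
Two of these behaviours are easy to show in miniature. The sketch assumes thinking blocks are wrapped in `<think>...</think>` tags, which is common but not universal, and it uses a plain dictionary as a stand-in for the Redis store. The key format and both function names are hypothetical.

```python
import re
from typing import Dict

# Assumed tag format; some models use different markers for reasoning text.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_thinking(text: str) -> str:
    """Remove visible thinking blocks and tidy the surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()


def effective_template(model_id: str, embedded: str, overrides: Dict[str, str]) -> str:
    """Return the saved override if one exists, otherwise the embedded template."""
    return overrides.get(f"chat_template:{model_id}", embedded)
```

Resetting an override in this sketch is just deleting the key, after which the embedded template is used again.
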
How calls flow

Hydra and direct calls share the same configured model layer.

Direct (Base model)

Normal LLM calls

Dashboard briefs, Memory Core extraction, Guardian checks, direct chat, and API direct mode use the active Base model unless a feature explicitly selects a dedicated local model.

Hydra (Tools)

Reasoning and orchestration

Hydra uses the configured model to plan, validate tool calls, run Verbas, and return final answers. Beast Mode can assign different models to Hydra heads while Base remains available for normal AI calls.
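
The two call paths can be summarised as a pair of lookups. This is a sketch of the described behaviour, not Tater's code: the function names, the feature keys, and the Hydra head names such as `planner` are hypothetical.

```python
from typing import Dict, Optional


def resolve_direct_model(
    feature: str, base: str, dedicated: Optional[Dict[str, str]] = None
) -> str:
    """Direct calls use Base unless the feature explicitly selects its own model."""
    return (dedicated or {}).get(feature, base)


def resolve_hydra_model(
    head: str,
    configured: str,
    beast_mode: bool = False,
    head_models: Optional[Dict[str, str]] = None,
) -> str:
    """Hydra heads share the configured model unless Beast Mode assigns one."""
    if beast_mode and head_models and head in head_models:
        return head_models[head]
    return configured
```

For example, `resolve_direct_model("briefs", "base-model")` returns the Base model, while a Beast Mode mapping for `planner` only takes effect when `beast_mode` is true.
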