Native ESPHome

ESPHome

Built-in ESPHome device runtime inside Tater for VoicePE, Sat1, and ESP32-S3-BOX-3 display devices, with firmware builds, browser USB recovery, remote openWakeWord, remote NanoWakeWord, voice intercom, live logs, display feeds, voice satellites, and the full voice pipeline on the main app port.

Native device runtime · Bundled · Settings -> ESPHome · Voice device runtime
Why it matters

What native ESPHome unlocks

Tater now owns the ESPHome device experience directly: discovery, room-aware voice sessions, remote wake detection, intercom sessions, live device state, firmware flashing, display notification feeds, and playback routing all run inside the main app instead of a downloadable core.

Feature set

What makes the built-in ESPHome stack feel like a real device platform

  • Built into Tater itself, always on, and served from the main app port rather than a separate external voice service.
  • Settings -> ESPHome now owns Satellites, Firmware, Settings, and Stats so operators can manage discovery, pairing, rooms, firmware builds, logs, live entities, and voice metrics in one place.
  • Wake engine controls support device-local microWakeWord, remote openWakeWord, and remote NanoWakeWord with per-device server URLs, so the engine can be switched from the satellite itself while Tater or a standalone wake server handles detection.
  • Remote openWakeWord keeps a live WebSocket stream to Tater's /api/openwakeword/stream endpoint, uses model threshold, patience, and debounce tuning, and falls back to on-device microWakeWord when the remote server is unavailable.
  • Remote NanoWakeWord uses the same satellite streaming pattern through /api/nanowakeword/stream, can run custom .onnx/.pt/.pth models downloaded from the NanoWakeWord trainer, and resets in-memory detectors after trainer downloads so replacement models load on the next stream.
  • Room-level wake arbitration keeps two satellites in the same room from running simultaneous turns, holding the room through STT, TTS, and follow-up mic reopen windows.
  • Voice intercom flows let Tater broadcast or target spoken messages across ESPHome satellites while preserving the normal voice pipeline and auto-reply behavior.
  • The firmware tab supports Tater VoicePE, Tater Sat1, and Tater S3Box Display targets, including device images, editable substitutions, Environment Core sensor dropdowns, reply playback options, update checks, and per-device update actions.
  • Browser USB flashing and USB logs let operators recover ESP32 devices from the browser, choose the USB device before building, erase flash for safe-mode recovery, and watch logs after flashing.
  • Tater S3Box Display firmware uses LVGL for Tater-themed status, weather bubbles, history bars, voice states, tool-call states, display brightness, and camera snapshot notifications.
  • Display feed and display event APIs let apps send compact sensor values, transient cards, camera snapshots, doorbell notices, and tool-progress states to ESPHome screens.
  • Shared speech backends live in Settings -> Models, with Faster Whisper, Vosk, Wyoming, Kokoro, Pocket TTS, Piper, and Home Assistant announcement TTS available where they make sense.
  • Runtime model files auto-download into agent_lab/models/stt and agent_lab/models/tts so rebuilds do not require hand-seeding models.
  • Speaker ID and Emotion ID can warm SpeechBrain models at startup and feed speaker/tone context into voice turns when enabled.
  • Reply playback targets can stay on the listening satellite, go silent/display-only, or route TTS to another media/announcement device without breaking mic reopening.
  • Live entity views expose sensors plus writable controls such as switches, buttons, numbers, selects, lights, wake engine, openWakeWord URL, NanoWakeWord URL, and RGB color when the device supports it.
  • Firmware update checks show per-device current and available versions, include connected older Tater devices with unknown firmware versions, and can update all matched devices one at a time.
  • Per-device logs, stats, room awareness, display refresh nudges, and direct playback make Tater hardware feel local to the room instead of remote to the browser.
Operator controls

How operators use it in Tater

ESPHome is configured through Settings -> ESPHome for Satellites, Settings, Stats, firmware, wake engine, openWakeWord URL, NanoWakeWord URL, intercom, and Tater Voice Extras. Shared STT/TTS, VAD, speaker/emotion ID, openWakeWord model choices, and NanoWakeWord model choices live in Settings -> Models. Firmware builds, browser recovery, update checks, live logs, display control, voice settings, and wake-word capture toggles live in the same native UI.

Voice experience

How native ESPHome makes Tater feel like a real device assistant.

These notes focus on the built-in ESPHome runtime, the live voice pipeline, shared speech backends, and the operator tools now living directly inside Tater.

Built in · One app · Main port

Native ESPHome runtime

ESPHome is no longer a shop core. It is part of the main Tater app and starts with Tater.

  • The old external voice runtime has been folded into Tater's built-in ESPHome runtime so the voice stack no longer depends on a separate downloadable core or its own HTTP listener.
  • That keeps the device lifecycle simpler: discovery, session handling, playback URLs, and operator screens now all live inside the same main application.
  • This built-in shape also leaves room for future ESPHome device types beyond the current voice-pipeline hardware.
Firmware tab · Browser USB · Live logs

Firmware manager and browser recovery

Tater can build, flash, recover, and log ESPHome firmware from the WebUI.

  • Firmware templates expose device-specific substitutions through guided controls, so common setup does not require raw YAML edits.
  • Browser USB recovery mirrors the familiar ESPHome web flashing flow: choose the USB serial device, build, flash, erase safe-mode state when needed, and stream USB logs.
  • Update checks compare connected Tater devices against the current firmware package, including older flashed satellites that do not report a version yet.
  • Per-device update buttons and Update All run OTA uploads one at a time and advance only after each upload finishes.
  • OTA and USB log windows use the same Tater firmware session UI so operators can debug failed boots without leaving the app.
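The one-at-a-time Update All behavior described above can be pictured with a small sketch. This is not Tater's code: `update_all` and the blocking `upload` callable are hypothetical names, and continuing the queue after a failed upload is an assumption, since the text only says the queue advances after each upload finishes.

```python
from typing import Callable, Dict, Iterable


def update_all(devices: Iterable[str], upload: Callable[[str], bool]) -> Dict[str, bool]:
    """Run OTA uploads strictly in sequence and record each outcome.

    `upload` is a hypothetical blocking call that returns True on success.
    The next device is only started after the previous call has returned.
    """
    results: Dict[str, bool] = {}
    for device in devices:
        try:
            results[device] = bool(upload(device))
        except Exception:
            # Assumption: one failed upload does not stop the remaining queue.
            results[device] = False
    return results
```

Because `upload` blocks, no two devices are ever flashed at the same time, which keeps network and build load predictable during a fleet update.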
Tater S3Box · LVGL · Display events

S3Box display platform

ESP32-S3-BOX-3 devices can run a Tater-native LVGL display firmware.

  • The Tater S3Box Display firmware shows assistant identity, online state, Environment Core sensor readings, weather/history bars, voice pipeline states, and tool-call activity.
  • Display sensor fields use dropdowns sourced from Environment Core readings, with source labels and a clean install warning when Environment Core is missing.
  • Apps and cores can publish display cards with text, image URLs, snapshot IDs, TTL, target display names, and event kinds such as notification, camera, doorbell, image, tool_call, voice, status, and alert.
STT · TTS · Models

Voice pipeline and shared models

The live voice loop uses shared STT/TTS choices from the Models tab while keeping ESPHome-specific controls in one native screen.

  • STT can use Faster Whisper, Vosk, or Wyoming depending on the install and hardware, while TTS can use Wyoming, Kokoro, Pocket TTS, Piper, or external announcement paths.
  • Runtime model files auto-download into agent_lab/models/stt and agent_lab/models/tts so rebuilds do not require hand-seeding speech models.
  • Hugging Face tokens saved in Integrations are passed into model download environments for speech models that need authenticated Hub access.
  • Shared model choices live in Settings -> Models, while satellite behavior, wake words, reply playback, Speaker ID, and Emotion ID live under Settings -> ESPHome.
  • The mic reopen path keeps follow-up turns room-aware after TTS finishes, without letting another satellite in the same room start a competing turn.
openWakeWord · Server URL · Fallback

Remote openWakeWord

Satellites can use remote openWakeWord while retaining on-device microWakeWord fallback.

  • Firmware exposes a wake engine selector and an openWakeWord server URL live entity so users can switch between microWakeWord and remote openWakeWord (OWW) from the device side.
  • When remote OWW is selected, satellites stream live wake audio to /api/openwakeword/stream on Tater or to the same WebSocket endpoint on the standalone Tater OWW Server.
  • Tater applies the configured model, framework, threshold, patience, and debounce settings before reporting a wake back to the device.
  • If the remote endpoint fails repeatedly, firmware falls back to microWakeWord and continues listening locally instead of leaving the room without a wake path.
  • Companion projects: https://github.com/TaterTotterson/Tater-OWW-Server and https://github.com/TaterTotterson/openWakeWord-Trainer.
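The fallback rule above (repeated remote failures switch the satellite back to microWakeWord) can be modeled in a few lines. The real logic lives in device firmware, so this Python sketch only illustrates the state machine. The class name, the failure limit of 3, and the return to the remote engine after a successful reconnect are all assumptions, not documented behavior.

```python
from dataclasses import dataclass


@dataclass
class WakeEngineSelector:
    """Toy model of remote-to-local wake engine fallback (names are hypothetical)."""

    preferred: str = "openwakeword"   # engine the operator selected
    fallback: str = "microwakeword"   # on-device engine
    max_failures: int = 3             # assumed limit; the source only says "repeatedly"
    failures: int = 0
    active: str = "openwakeword"

    def on_remote_failure(self) -> str:
        # Count consecutive failures and switch once the limit is reached.
        self.failures += 1
        if self.failures >= self.max_failures:
            self.active = self.fallback
        return self.active

    def on_remote_success(self) -> str:
        # Assumption: a healthy stream clears the counter and restores the preferred engine.
        self.failures = 0
        self.active = self.preferred
        return self.active
```

The point of the counter is that a single dropped connection does not flip the engine, while a sustained outage still leaves the room with a working wake path.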
NanoWakeWord · Trainer · Server URL

Remote NanoWakeWord

Satellites can switch to NanoWakeWord remote detection and use models trained by the Tater NanoWakeWord trainer.

  • Firmware exposes NanoWakeWord as a wake-engine option alongside microWakeWord and openWakeWord, plus a NanoWakeWord server URL live entity for remote streaming.
  • When NanoWakeWord is selected, satellites stream 16 kHz mono audio to /api/nanowakeword/stream on Tater or to the standalone Tater NWW Server.
  • Tater loads NanoWakeWord models from agent_lab/models/nanowakeword, including trainer-downloaded .onnx, .pt, and .pth artifacts.
  • Settings -> Models exposes NanoWakeWord enablement, model source, threshold, patience, debounce, trainer URL, trainer model catalog, and download action.
  • Downloading a trainer model overwrites the same local artifact path when the filename matches, saves that path as the active model source, and resets loaded detectors so the next audio stream uses the new model bytes.
  • The NanoWakeWord trainer now uses a stronger default recipe: synthetic positives, adversarial negatives, phoneme hard negatives, validation splits, augmentation, and optional Colab-style negative feature banks from feature_banks/.
  • Companion projects: https://github.com/TaterTotterson/Tater-NWW-Server and https://github.com/TaterTotterson/nanoWakeWord-Trainer.
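The detector reset after a trainer download matters because a same-name model file would otherwise keep serving the old bytes from memory. A minimal sketch of that idea follows; `DetectorCache` and its `loader` callable are hypothetical stand-ins, not Tater's actual classes.

```python
from typing import Any, Callable, Dict


class DetectorCache:
    """Keeps one loaded detector per model path until reset() is called."""

    def __init__(self, loader: Callable[[str], Any]) -> None:
        self._loader = loader              # hypothetical: builds a detector from a path
        self._loaded: Dict[str, Any] = {}

    def get(self, path: str) -> Any:
        # Load lazily, then reuse the in-memory detector for later streams.
        if path not in self._loaded:
            self._loaded[path] = self._loader(path)
        return self._loaded[path]

    def reset(self) -> None:
        # Drop every cached detector so the next stream reloads from disk.
        self._loaded.clear()
```

Without the `reset()` call, overwriting a file at the same path would have no effect until the process restarted.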
Arbitration · Rooms · Follow-up

Wake arbitration and room ownership

Multiple satellites can live in one room without double-answering the same wake.

  • Tater claims the room when a satellite starts a wake-driven session and rejects competing starts from other satellites in that room while the turn is active.
  • The claim is held through STT, assistant work, TTS playback, announcement-finished handling, and the short follow-up mic reopen window.
  • Arbitration still allows satellites in different rooms to run independently, and stale claims expire if a device drops or a session aborts.
  • This makes same-room VoicePE and Satellite1 installs practical without needing to disable one device manually.
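The claim, reject, and expire rules above can be sketched as a small lock table keyed by room. The class name, the 30-second default lifetime, and the injected clock are assumptions made for illustration; Tater's actual arbitration code and timings are not shown in this document.

```python
import time
from typing import Callable, Dict, Tuple


class RoomArbiter:
    """One active voice turn per room, with stale claims expiring on their own."""

    def __init__(self, ttl_s: float = 30.0, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_s      # assumed claim lifetime in seconds
        self._clock = clock    # injectable so the logic can be tested without sleeping
        self._claims: Dict[str, Tuple[str, float]] = {}  # room -> (device, expires_at)

    def try_claim(self, room: str, device: str) -> bool:
        now = self._clock()
        holder = self._claims.get(room)
        # Reject only when a different device holds a claim that has not expired.
        if holder is not None and holder[0] != device and holder[1] > now:
            return False
        # New claim, or a refresh by the current holder (STT, TTS, follow-up window).
        self._claims[room] = (device, now + self._ttl)
        return True

    def release(self, room: str, device: str) -> None:
        holder = self._claims.get(room)
        if holder is not None and holder[0] == device:
            del self._claims[room]
```

Refreshing the claim at each pipeline stage is one way to hold the room through a long turn, while the expiry covers devices that drop mid-session without releasing.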
Intercom · Rooms · Announcements

Voice intercom

Tater can use ESPHome satellites as targeted room intercom endpoints.

  • Intercom requests resolve Tater device names, rooms, and speaking targets before generating or routing the spoken message.
  • Announcements use the same speech backends and playback routing as normal assistant replies, so external media players and satellite speakers stay consistent.
  • Auto-reply and follow-up behavior can reopen the mic after the intercom message when the selected conversation flow calls for it.
  • The flow shares native ESPHome session tracking, so LED/display states and room arbitration stay aligned with normal voice turns.
Speaker ID · Emotion ID · SpeechBrain

Speaker ID and Emotion ID

Tater can identify enrolled speakers and optionally add voice-tone context to Hydra prompts.

  • Speaker ID is for recognizing who is talking after a voice turn is captured; it uses enrolled voice samples and reports the best speaker plus match score.
  • Speaker ID aliases can be linked in Settings -> People so the recognized voice maps to the same master user as that person's portal accounts.
  • Emotion ID is separate from Speaker ID: it classifies the user's tone after STT and can add a soft prompt hint when enabled, confident enough, and not filtered as neutral.
  • Both features are optional, and normal voice turns still work when either model is disabled, missing, warming, or unable to make a confident detection.
  • The Dashboard voice section shows last detection information so operators can see Speaker ID and Emotion ID behavior without digging through logs.
  • Settings -> ESPHome contains the SpeechBrain model controls, enable/disable toggles, confidence thresholds, neutral-tone handling, enrollment controls, and warmup actions.
  • Speaker ID and Emotion ID models are stored under agent_lab/models so container rebuilds keep the downloaded model cache when agent_lab is bind-mounted.
  • In the NVIDIA image, SpeechBrain models can use CUDA when configured, with CPU fallback if the GPU path is unavailable.
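The three conditions for adding an Emotion ID hint (enabled, confident enough, not filtered as neutral) reduce to a short guard. In this sketch the function name, the 0.6 default threshold, and the wording of the hint are all hypothetical; only the gating order reflects the description above.

```python
from typing import Optional


def emotion_hint(
    label: str,
    score: float,
    *,
    enabled: bool = True,
    min_confidence: float = 0.6,   # assumed default, not a documented value
    skip_neutral: bool = True,
) -> Optional[str]:
    """Return a soft prompt hint, or None when the detection should be ignored."""
    tone = label.strip().lower()
    if not enabled or score < min_confidence:
        return None
    if skip_neutral and tone == "neutral":
        return None
    # Hypothetical hint text; the real prompt wording is not given in this document.
    return f"The user sounds {tone}."
```

Returning `None` in every uncertain case matches the rule that normal voice turns keep working when no confident detection is available.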
This device · External player · Silent display

Reply playback routing

Satellites can listen locally while replies play somewhere else.

  • Each satellite can keep reply playback on its own speaker, go silent/display-only, or route the reply to another selected media or announcement target.
  • This is useful for S3Box units that should act as microphones and displays while another speaker handles the answer.
  • After external playback, the listening device reopens the mic so follow-up conversation behavior stays natural.
Satellites · Stats · Live logs

Runtime observability

The ESPHome screen now separates devices, settings, and stats so tuning is based on real behavior instead of guesswork.

  • Satellites shows discovered devices, saved room assignments, live entity state, device facts, and an ESPHome-style live log console.
  • Stats surfaces wake behavior, no-op rates, false wakes, backend latency, fallback usage, and per-device voice summaries for tuning.
  • Writable entity controls are available inline for things like switches, lights, numbers, buttons, and select options.
Conversation flow · Live progress · Early TTS

Tater Voice Extras

Tune the higher-level voice behavior that sits around the standard ESPHome pipeline.

  • Conversation Flow controls follow-up behavior, automatic mic reopen, external-player follow-up markers, and how long Tater keeps a room ready for the next turn.
  • Wake arbitration controls whether active voice turns are protected per room or more broadly across the home.
  • Live Tool Progress Speech can speak short Hydra tool-progress lines and drive updated VoicePE/Sat1 LED animations while tools run.
  • Partial STT can keep partial transcript state during live capture so the system gets earlier visibility into what the user is saying.
  • Early-Start TTS can begin speaking long replies sooner by preparing smaller response chunks before the whole answer is finished.
  • Wake word settings can use prebuilt microWakeWord models, trained microWakeWord models, remote openWakeWord, remote NanoWakeWord, or standalone wake-word server URLs.
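Early-Start TTS, as described above, depends on cutting a reply into speakable chunks before the full answer exists. One simple way to do that is to split a streamed reply at sentence boundaries once enough text has accumulated. The function name, the regular expression, and the 20-character minimum are assumptions for illustration and say nothing about how Tater actually chunks replies.

```python
import re
from typing import Iterable, Iterator

# A sentence boundary: whitespace that directly follows ., ! or ?
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def early_tts_chunks(pieces: Iterable[str], min_chars: int = 20) -> Iterator[str]:
    """Yield speakable chunks as soon as a long-enough sentence boundary appears."""
    buffer = ""
    for piece in pieces:
        buffer += piece
        while True:
            # Find the first boundary that leaves at least min_chars to speak.
            match = next(
                (m for m in _SENTENCE_END.finditer(buffer) if m.start() >= min_chars),
                None,
            )
            if match is None:
                break
            yield buffer[: match.start()].strip()
            buffer = buffer[match.end():]
    # Flush whatever is left once the stream ends.
    if buffer.strip():
        yield buffer.strip()
```

For example, feeding the pieces `"Turning on the "`, `"kitchen lights now. The "`, and `"temperature is 21 degrees. Done."` yields the first sentence as soon as the second piece arrives, so synthesis can start while the rest is still being generated.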
Built-in APIs

HTTP endpoints exposed by this runtime.

GET /api/settings/esphome/runtime

Load the native ESPHome runtime view used by Settings -> ESPHome.

Returns the current Satellites, Settings, and Stats payload so the WebUI can render discovery state, device cards, voice metrics, and runtime controls.

POST /api/settings/esphome/runtime/action

Run a native ESPHome runtime action from the WebUI.

Handles refresh, connect/disconnect, save/forget satellite actions, live log lifecycle, and direct entity-control actions from the ESPHome settings screen.

GET /tater-ha/v1/voice/native/status

Inspect current voice-pipeline runtime state and backend availability.

Returns selected speech backends, effective fallback state, model roots, discovery state, selector sessions, and availability of local STT/TTS backends.

WS /api/openwakeword/stream

Accept live remote openWakeWord audio streams from satellites, using the same stream contract as the standalone Tater OWW Server.

Runs the configured OWW model with threshold, patience, debounce, and stale-frame handling, then sends wake detections back to the device over the same stream.
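How threshold, patience, and debounce combine can be shown with a small gate over per-frame scores. This sketch assumes patience means a number of consecutive frames at or above the threshold and debounce means a minimum gap in seconds between detections. The class name and defaults are hypothetical, and stale-frame handling is left out.

```python
from typing import Optional


class WakeGate:
    """Turns a stream of model scores into debounced wake detections."""

    def __init__(self, threshold: float = 0.5, patience: int = 3, debounce_s: float = 2.0) -> None:
        self.threshold = threshold
        self.patience = patience        # assumed: consecutive frames required
        self.debounce_s = debounce_s    # assumed: minimum seconds between wakes
        self._streak = 0
        self._last_fire: Optional[float] = None

    def update(self, score: float, now: float) -> bool:
        # Any frame below the threshold breaks the streak.
        self._streak = self._streak + 1 if score >= self.threshold else 0
        if self._streak < self.patience:
            return False
        # Suppress repeats that arrive too soon after the previous detection.
        if self._last_fire is not None and now - self._last_fire < self.debounce_s:
            return False
        self._last_fire = now
        self._streak = 0
        return True
```

Raising patience trades a little latency for fewer false wakes, while debounce mainly prevents one spoken wake word from triggering twice.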

WS /api/nanowakeword/stream

Accept live remote NanoWakeWord audio streams from satellites, using the same stream contract as the standalone Tater NWW Server.

Runs the configured NanoWakeWord model with threshold, patience, debounce, per-detector reset handling, and optional diagnostic logging, then sends wake detections back to the device over the same stream.

POST /api/settings/nanowakeword/trainer-models

Load available NanoWakeWord artifacts from a trainer URL.

Reads /api/artifacts and /api/trained_wake_words/catalog from the NanoWakeWord trainer so operators can pick newly trained models from Settings -> Models.

POST /api/settings/nanowakeword/download-trainer-model

Download a NanoWakeWord trainer artifact into Tater's local model store.

Downloads the selected artifact into agent_lab/models/nanowakeword/trainer/{model}, saves it as the active model source, and resets NanoWakeWord detectors so a same-name replacement model is loaded on the next stream.
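The destination path above can be resolved defensively so that an artifact name cannot escape the trainer folder. This sketch assumes `{model}` is a plain filename with one of the documented extensions. `trainer_model_path` is a hypothetical helper, and whether Tater validates names this way is not stated in this document.

```python
from pathlib import PurePosixPath

MODEL_ROOT = PurePosixPath("agent_lab/models/nanowakeword/trainer")
ALLOWED_SUFFIXES = {".onnx", ".pt", ".pth"}


def trainer_model_path(model: str, root: PurePosixPath = MODEL_ROOT) -> PurePosixPath:
    """Map an artifact name to its local path under the trainer model folder."""
    # Keep only the final path component so "../" segments are discarded.
    name = PurePosixPath(model).name
    if PurePosixPath(name).suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported model artifact: {model!r}")
    return root / name
```

Because the same name always maps to the same path, a retrained model with an unchanged filename overwrites the earlier artifact, which is why the detector reset is needed afterwards.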

POST /tater-ha/v1/voice/esphome/entities

Fetch live ESPHome entity rows for one connected satellite.

Returns the live entity snapshot so verbas and operators can inspect sensors, buttons, numbers, switches, lights, wake-engine controls, openWakeWord/NanoWakeWord URL state, and other exposed device entities.

POST /tater-ha/v1/voice/esphome/entities/command

Command a writable ESPHome entity on one satellite.

Supports button, number, switch, select, text, and light-control actions so device-local flows can act directly on the speaking device.

POST /tater-ha/v1/voice/esphome/play

Queue direct audio playback on a selected ESPHome satellite.

Used for device-local playback flows such as announcements, generated audio, and other responses that should play on the speaking satellite itself.

GET/POST /tater-ha/v1/display/feed

Serve compact display sensor data for ESPHome screens.

Returns display-ready slot values, flat readings, text labels, online state, and clock data; firmware profiles can map slots to Environment Core readings instead of hard-coded Home Assistant entity IDs.

GET /tater-ha/v1/display/events

Poll queued display events for a target screen.

Returns transient notification/display cards after a sequence number, with optional target filtering so one display can receive a specific camera, doorbell, tool-call, voice, status, or alert event.

POST /tater-ha/v1/display/events

Publish a display event card.

Accepts display event payloads with kind, title, message, image_url, snapshot_id, target, TTL, and optional metadata such as tool phase/status and step counts.
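A display event body could be assembled as in the sketch below. The field names `kind`, `title`, `message`, `image_url`, `snapshot_id`, and `target` come from the description above, and the event kinds come from the S3Box display section. The exact JSON spelling of the TTL and metadata keys (`ttl`, `metadata`), the TTL unit, and the helper name are assumptions. Authentication is omitted.

```python
import json
from typing import Any, Dict, Optional

EVENT_KINDS = {
    "notification", "camera", "doorbell", "image",
    "tool_call", "voice", "status", "alert",
}


def build_display_event(
    kind: str,
    title: str,
    message: str = "",
    *,
    image_url: Optional[str] = None,
    snapshot_id: Optional[str] = None,
    target: Optional[str] = None,
    ttl: Optional[int] = None,                    # assumed key name and unit
    metadata: Optional[Dict[str, Any]] = None,    # assumed key name
) -> Dict[str, Any]:
    """Build a JSON-serializable display event, omitting unset optional fields."""
    if kind not in EVENT_KINDS:
        raise ValueError(f"unknown display event kind: {kind!r}")
    payload: Dict[str, Any] = {"kind": kind, "title": title, "message": message}
    optional = {
        "image_url": image_url,
        "snapshot_id": snapshot_id,
        "target": target,
        "ttl": ttl,
        "metadata": metadata,
    }
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload


if __name__ == "__main__":
    event = build_display_event("doorbell", "Front door", "Someone is at the door", ttl=30)
    print(json.dumps(event))
```

The resulting dictionary would then be sent as the JSON body of a POST to /tater-ha/v1/display/events with whatever display API token the install requires.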

GET /tater-ha/v1/display/snapshots/{snapshot_id}

Serve Redis-backed awareness snapshots to displays.

Allows ESPHome displays to show camera snapshots that were stored by Awareness Core, using the same display API token rules as the feed and event endpoints.