Vision input
Arducam 16MP Autofocus Camera Module
IMX519 16MP autofocus camera with ABS case for Raspberry Pi models.
JARVIS is intentionally split in two: a Raspberry Pi 5 sensor body for presence, and a local GPU brain for cognition, memory, voice, learning, and governance.
Purchase list
These are outbound Amazon links for the hardware stack. The brain machine should be treated as an offline AI processor, not a Windows desktop.
For the Senses
The physical JARVIS presence: camera, touch display, mic, speaker, Pi 5, and Hailo acceleration.
Vision input
IMX519 16MP autofocus camera with ABS case for Raspberry Pi models.
Sensor computer
16GB Pi 5 kit (128GB storage edition) for the always-on senses node.
Presence display
800x480 IPS capacitive touchscreen over MIPI DSI for the Pi display surface.
Voice output
Compact stereo speaker with enhanced bass for local TTS playback.
Audio input
USB mic with 360-degree adjustment, mute button, LED indicator, and noise cancellation.
Vision acceleration
Hailo-10H accelerator with 8GB on-board RAM and 40 TOPS class AI capability for Pi 5.
For the Brain
The brain machine runs cognition, memory, models, governance, and self-improvement locally.
CPU backbone
High-thread-count Ryzen 9 class processor for the brain and CPU-resident model lanes.
Premium GPU brain
Strong baseline GPU for local STT, TTS, LLM residency, vision support, and fast iteration.
More power
32GB GDDR7 option for heavier local model residency and a bigger extreme-tier brain.
Complete brain option
A ready-to-go tower option if you want the brain hardware assembled first, then reinstalled with Linux.
Prebuilt brain machine
Ryzen 9 9950X3D, RTX 5080, 96GB DDR5, 4TB NVMe class build. Install Pop!_OS over Windows before using it as the brain.
Brain OS requirement
Install Pop!_OS on the PC used for the brain, replacing Windows. Do not run the brain on a Windows install. Treat that machine as the offline local AI processor for JARVIS.
Sensor body
The Pi is the senses. It does not own memory, personality, policy, self-improvement, or the LLM. It captures the room and streams events to the brain.
Thin sensor node
Runs the senses layer: camera capture, Hailo vision inference, audio capture/playback, WebSocket transport, and the particle display.
Vision accelerator
Runs YOLOv8s person detection, SCRFD face detection, and YOLOv8s-Pose locally on the Pi so the brain receives structured perception.
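As a rough illustration of what "structured perception" could look like on the wire, the sketch below serializes one frame's detections to JSON. The schema is an assumption made for this example: `Detection`, `PerceptionEvent`, and every field name are hypothetical and are not taken from the JARVIS codebase.

```python
import json
import time
from dataclasses import dataclass, field, asdict


@dataclass
class Detection:
    """One detected object. All field names here are hypothetical."""
    label: str         # e.g. "person" or "face"
    confidence: float  # detector score in [0, 1]
    box: tuple         # (x, y, width, height) in pixels


@dataclass
class PerceptionEvent:
    """One frame's worth of structured perception (hypothetical schema)."""
    source: str
    timestamp: float
    detections: list = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses; tuples become JSON arrays.
        return json.dumps(asdict(self))


event = PerceptionEvent(
    source="pi5-senses",
    timestamp=time.time(),
    detections=[Detection("person", 0.91, (120, 40, 200, 380))],
)
payload = event.to_json()  # text frame the Pi could send to the brain
```

Sending compact events like this, rather than raw frames, keeps the Pi a thin node: the brain only has to parse JSON to learn who is in the room.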
Vision input
Feeds Picamera2 for person detection, pose, facial expression, face crops, and scene summaries.
Audio input
Captures audio at 44.1 kHz, resamples it to 16 kHz int16, and streams raw PCM to the brain over a local WebSocket.
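The capture path described above (44.1 kHz in, 16 kHz int16 PCM out) can be sketched with the standard library alone. This is a minimal illustration and not the project's implementation: `resample_linear` and `to_pcm16` are hypothetical names, and plain linear interpolation is used for brevity, with no anti-aliasing filter. A production pipeline would more likely use a dedicated resampler.

```python
import struct


def resample_linear(samples, src_rate=44100, dst_rate=16000):
    """Resample a mono float sequence by linear interpolation (no low-pass filter)."""
    if not samples:
        return []
    n_out = int(len(samples) * dst_rate / src_rate)
    step = src_rate / dst_rate  # source samples advanced per output sample
    out = []
    for i in range(n_out):
        pos = i * step
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


def to_pcm16(samples):
    """Clip floats to [-1, 1] and pack them as little-endian int16 bytes."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack("<%dh" % len(ints), *ints)


# One second of constant-level audio at 44.1 kHz becomes 16000 samples (32000 bytes).
chunk = to_pcm16(resample_linear([0.5] * 44100))
```

The resulting `bytes` object is what a WebSocket client would send as a binary frame; the transport itself is left out here.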
Voice output
Plays back brain-synthesized TTS audio. The Pi does not run the language or speech intelligence.
Presence surface
Runs the JARVIS particle visualizer in kiosk mode and reflects bounded system state, not private dashboard internals.
Brain equipment
The brain auto-detects NVIDIA VRAM, CPU threads, and RAM at startup. It then chooses LLM size, STT model, TTS device, vision availability, model keep-alive, and whether ancillary ML should live on CPU or GPU.
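The text does not say how VRAM is detected. One plausible approach, shown here only as a sketch, is to query `nvidia-smi` and parse its output. The function names are hypothetical, and picking the largest GPU on a multi-GPU machine is an assumption of this example.

```python
import shutil
import subprocess


def parse_vram_gb(smi_output: str) -> float:
    """Parse `nvidia-smi` memory.total output (one MiB value per GPU line).

    Returns the largest GPU's memory in GiB, or 0.0 if nothing parses.
    """
    values = []
    for line in smi_output.splitlines():
        try:
            values.append(float(line.strip()))
        except ValueError:
            continue  # skip blank or non-numeric lines such as "[N/A]"
    return max(values, default=0.0) / 1024.0


def detect_vram_gb() -> float:
    """Return detected VRAM in GiB, or 0.0 when no NVIDIA GPU is visible."""
    if shutil.which("nvidia-smi") is None:
        return 0.0
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return 0.0
    return parse_vram_gb(result.stdout)
```

Returning 0.0 instead of raising lets a machine without an NVIDIA GPU fall through to the lowest tier rather than fail at startup.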
Hardware tiers
The brain auto-detects GPU VRAM at startup and selects model sizes, compute types, and memory strategy from seven tiers. Local-first guarantee: all core capabilities run entirely on local hardware.
| Tier | VRAM | LLM | Fast LLM | Vision | STT | TTS | Keep-alive |
|---|---|---|---|---|---|---|---|
| minimal | <4 GB | qwen3:1.7b | qwen3:1.7b | disabled | tiny/int8 | none | 5m |
| low | 4-6 GB | qwen3:4b | qwen3:1.7b | disabled | small/int8 | none | 5m |
| medium | 6-8 GB | qwen3:8b | qwen3:4b | disabled | medium/int8_fp16 | kokoro_cpu | 5m |
| high | 8-12 GB | qwen3:8b | qwen3:4b | qwen2.5vl:7b | large-v3-turbo | kokoro_cpu | 10m |
| premium | 12-16.5 GB | qwen3:8b | qwen3:8b | qwen2.5vl:7b | large-v3/int8_fp16 | kokoro_gpu | 30m |
| ultra | 16.5-24.5 GB | qwen3:14b | qwen3:8b | qwen2.5vl:7b | large-v3/float16 | kokoro_gpu | always |
| extreme | 24.5 GB+ | qwen3:32b | qwen3:14b | qwen2.5vl:7b | large-v3/float16 | kokoro_gpu | always |
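The table can be read as a simple threshold lookup. The sketch below encodes the rows as data and picks the first tier whose lower bound is met. The table does not say which tier owns a shared boundary such as exactly 6 GB, so this example assumes lower bounds are inclusive and upper bounds exclusive. `GpuTier`, `TIERS`, and `select_gpu_tier` are hypothetical names.

```python
from typing import NamedTuple


class GpuTier(NamedTuple):
    name: str
    min_vram_gb: float  # inclusive lower bound (assumption)
    llm: str
    fast_llm: str
    vision: str
    stt: str
    tts: str
    keep_alive: str


# Rows copied from the tier table, ordered from highest to lowest threshold.
TIERS = [
    GpuTier("extreme", 24.5, "qwen3:32b", "qwen3:14b", "qwen2.5vl:7b",
            "large-v3/float16", "kokoro_gpu", "always"),
    GpuTier("ultra", 16.5, "qwen3:14b", "qwen3:8b", "qwen2.5vl:7b",
            "large-v3/float16", "kokoro_gpu", "always"),
    GpuTier("premium", 12.0, "qwen3:8b", "qwen3:8b", "qwen2.5vl:7b",
            "large-v3/int8_fp16", "kokoro_gpu", "30m"),
    GpuTier("high", 8.0, "qwen3:8b", "qwen3:4b", "qwen2.5vl:7b",
            "large-v3-turbo", "kokoro_cpu", "10m"),
    GpuTier("medium", 6.0, "qwen3:8b", "qwen3:4b", "disabled",
            "medium/int8_fp16", "kokoro_cpu", "5m"),
    GpuTier("low", 4.0, "qwen3:4b", "qwen3:1.7b", "disabled",
            "small/int8", "none", "5m"),
    GpuTier("minimal", 0.0, "qwen3:1.7b", "qwen3:1.7b", "disabled",
            "tiny/int8", "none", "5m"),
]


def select_gpu_tier(vram_gb: float) -> GpuTier:
    """Return the highest tier whose VRAM threshold is satisfied."""
    for tier in TIERS:
        if vram_gb >= tier.min_vram_gb:
            return tier
    return TIERS[-1]  # negative or bogus readings fall back to minimal
```

Under these assumptions a 16 GB card lands in premium (`select_gpu_tier(16).llm` is `qwen3:8b`) and a 32 GB card lands in extreme.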
Self-improvement coder — RAM tiers
The Qwen3-Coder-Next model is selected by system RAM, independent of GPU tier. It runs purely on CPU through llama-server and never contends with VRAM.
| System RAM | GGUF Quant | Model Size | Quality | Headroom |
|---|---|---|---|---|
| 56GB+ | UD-Q4_K_XL | ~46GB | Best | ~10GB+ for OS/JARVIS |
| 48-55GB | UD-IQ4_XS | ~38GB | Good | ~10GB+ for OS/JARVIS |
| 32-47GB | UD-IQ2_M | ~25GB | Acceptable | ~7GB+ for OS/JARVIS |
| <32GB | Disabled | would OOM | Do not force-enable | Not enough RAM |
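The RAM table reduces to three thresholds. A minimal sketch, assuming each row's lower bound is inclusive and that fractional readings (for example 55.5 GB) fall into the row below the next threshold; `select_coder_quant` is a hypothetical name.

```python
from typing import Optional


def select_coder_quant(system_ram_gb: float) -> Optional[str]:
    """Map system RAM to the GGUF quant listed in the table.

    Returns None when the coder should stay disabled (under 32 GB).
    """
    if system_ram_gb >= 56:
        return "UD-Q4_K_XL"  # ~46GB model, best quality
    if system_ram_gb >= 48:
        return "UD-IQ4_XS"   # ~38GB model
    if system_ram_gb >= 32:
        return "UD-IQ2_M"    # ~25GB model
    return None              # would OOM; do not force-enable
```

Because the choice depends only on system RAM, it can be made independently of the GPU tier, which matches the CPU-only design described above.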
CPU tiers
Strong and beast CPUs can offload ancillary ML from the GPU, freeing VRAM for STT, LLM residency, TTS, and vision.
| CPU Tier | Requirement | Typical Hardware | Effect |
|---|---|---|---|
| weak | <4 threads | SBCs / cheap VPS | Minimal CPU headroom |
| standard | 4-7 threads | Laptop i5 / older desktop | GPU carries ancillary ML when VRAM allows |
| strong | 8-15 threads + 8GB RAM | Desktop i7 / Ryzen 7 | Offloads emotion, speaker ID, embeddings, hemispheres to CPU |
| beast | 16+ threads + 16GB RAM | Ryzen 9 / Threadripper / Xeon | Best partner for premium+ GPUs and coder workflows |
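The CPU tiers combine a thread count with a RAM floor. The table does not say what happens when a machine meets the thread requirement but not the RAM requirement, so the sketch below assumes it drops to the highest tier whose requirements are both met. `classify_cpu_tier` is a hypothetical name.

```python
def classify_cpu_tier(threads: int, ram_gb: float) -> str:
    """Classify a machine into the CPU tiers from the table.

    Assumption: a tier applies only if both its thread and RAM
    requirements are met; otherwise the next lower tier is tried.
    """
    if threads >= 16 and ram_gb >= 16:
        return "beast"
    if threads >= 8 and ram_gb >= 8:
        return "strong"
    if threads >= 4:
        return "standard"
    return "weak"
```

For example, a 32-thread machine with 96 GB of RAM classifies as beast, while a 16-thread machine with only 8 GB falls back to strong under this assumption.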
Recommended serious build
Premium tier is the sweet spot: qwen3:8b warm, large-v3 STT, GPU TTS, speaker ID, emotion, embeddings, policy, memory, and governance without pretending a bigger LLM is free.