
Run Local LLMs on Raspberry Pi 5: A Practical Guide Using the AI HAT+ 2

2026-01-25
11 min read

Hands‑on guide to running local LLMs on Raspberry Pi 5 with the AI HAT+ 2 — install, quantize, and deploy privacy-first micro-apps in 2026.

Cut context switching — run a private, local LLM on Raspberry Pi 5 with the AI HAT+ 2

Developers and IT admins building edge micro-apps face the same friction: multiple cloud services, slow iteration when prototyping, and privacy rules that forbid sending sensitive data offsite. If you want to prototype a local assistant, an on-premise data labeling tool, or an IoT device that reasons without networking latency or egress risk, the Raspberry Pi 5 plus the new AI HAT+ 2 make a compelling, low-cost platform for offline inference in 2026.

What you’ll get from this guide

  • Hands-on setup instructions for Raspberry Pi 5 + AI HAT+ 2 (hardware + OS tips).
  • Steps to prepare, quantize and run lightweight local LLMs (edge-optimized models <= 3B).
  • Example micro-app deploy patterns (FastAPI service, systemd auto-start, secure local API).
  • Performance and memory tradeoffs, troubleshooting, and security recommendations for production-ish deployments.

Why this matters in 2026: the edge-first AI and micro-apps moment

By late 2025 and into 2026, the landscape shifted decisively toward edge-first AI. Two macro trends matter for this guide:

  • Hardware democratization: low-cost NPUs and specialized HATs for ARM devices matured, making sub-3B model inference practical on single-board computers.
  • Micro-app renaissance: people and dev teams increasingly build short-lived, privacy-first micro-apps that run locally or on-device — reducing cloud dependency and speeding iteration.

That combination makes the Pi 5 + AI HAT+ 2 a practical platform for developers who need low-latency, private, and cost-effective LLM inference.

Quick hardware and software checklist (what you need)

  • Raspberry Pi 5 (64‑bit OS recommended; the 8 GB or 16 GB RAM variant gives the most headroom for 3B-class models) with adequate cooling.
  • AI HAT+ 2 installed on the 40-pin header (firmware latest from vendor).
  • Power supply: the official 27 W (5.1 V / 5 A) USB-C supply or better (Pi 5 + HAT + NPU can draw more than lighter chargers deliver).
  • SSD for model storage (USB 3.0 NVMe or USB-C SSD) — models and quantized files are large; use an SSD to avoid SD wear and I/O bottlenecks.
  • Ubuntu 24.04 LTS (64-bit) or Raspberry Pi OS 64-bit (2026 builds) — 64-bit userland improves memory usage for AI runtimes.
  • Developer tooling: git, build-essential, python3, pip, docker or podman (optional), and a C compiler toolchain for compiling inference runtimes.

Step 1 — Assemble hardware and prep OS

  1. Attach the AI HAT+ 2 to the Pi 5 header. Mount heatsinks and an active fan if you plan sustained inference.
  2. Install SSD and boot from it or mount it with fast I/O for model files. On Ubuntu use /etc/fstab to mount by UUID and set noatime.
  3. Flash Ubuntu 24.04 or the Pi 64-bit OS image. Enable SSH and update packages:
    sudo apt update && sudo apt upgrade -y
  4. Create a non-root developer user and add sudo access:
    sudo adduser dev && sudo usermod -aG sudo dev

Step 2 — Install AI HAT+ 2 runtime and drivers

The AI HAT+ 2 ships with an SDK / runtime that exposes its NPU to user-space libraries. Vendor details change, so follow the HAT provider's install guide — typical patterns are:

  1. Download the SDK package or clone the vendor repo.
  2. Run the install script (may add kernel modules and system services):
    sudo bash ./install_ai_hat2.sh
  3. Verify the device is present, and the runtime shows the NPU and memory:
    ai-hat2-cli info
    If there is no helper CLI, check dmesg and look for new device nodes under /dev.
  4. Install optional hardware acceleration wrappers: ONNX Runtime with NPU plugin, or vendor-supplied Python bindings for direct inference offload.

Troubleshooting

  • If the HAT is not recognized, re-check the header seating and confirm the firmware is up to date.
  • Missing kernel modules usually mean you need a newer kernel — confirm you’re on the recent 6.x kernel build recommended by the vendor (the Pi 5 ships with 6.x kernels).

Step 3 — Choose an edge-optimized model and quantization strategy

In 2026 several models are commonly used on-device. The practical guidance:

  • Target model size: for a Pi 5 with an AI HAT+ 2, choose models in the 1B–3B parameter range for good latency and responsiveness.
  • Quantization: quantize to int8 or 4-bit variants (GGUF Q4_0 / Q4_K_M) to cut memory by 2–6x while keeping quality acceptable for micro-apps.
  • Format: prefer GGUF files (the llama.cpp format that superseded GGML) or ONNX with NPU plugin support if your HAT runtime supports it.

Common edge candidates in 2026: distilled/open-instruct 1–3B models or vendor-provided tiny models optimized for NPUs. If you need an instruction-following assistant, pick a model already fine-tuned for instructions (or fine-tune locally, see later).
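
Before downloading anything, it helps to sanity-check whether a candidate model will fit in the Pi 5's RAM once quantized. The sketch below uses ballpark bytes-per-parameter figures for common formats (rough planning numbers, not vendor specifications), plus a flat allowance for the KV cache and runtime buffers:

# estimate_model_memory.py: rough sizing only; measure on the device before committing
BYTES_PER_PARAM = {
    "fp16": 2.0,
    "int8": 1.0,
    "q4": 0.56,   # roughly 4.5 bits/param once block scales are included
}

def estimate_gb(params_billions: float, fmt: str, overhead_gb: float = 0.8) -> float:
    """Weights plus a rough allowance for KV cache and runtime buffers."""
    weights_gb = params_billions * BYTES_PER_PARAM[fmt]
    return weights_gb + overhead_gb

if __name__ == "__main__":
    for size in (1.0, 3.0):
        for fmt in ("fp16", "int8", "q4"):
            print(f"{size:.0f}B {fmt}: ~{estimate_gb(size, fmt):.1f} GB")

On an 8 GB Pi 5 this makes the tradeoff concrete: a 3B model in fp16 is already tight once the OS and your service are loaded, while the same model at Q4 leaves comfortable headroom.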

Step 4 — Prepare the model: convert and quantize

Two practical routes:

  1. ggml / llama.cpp route (works well on ARM):
    • Clone llama.cpp (or a maintained fork optimized for aarch64 and NPUs):
      git clone https://github.com/ggerganov/llama.cpp.git
      cd llama.cpp
      make clean && make -j4
    • Convert a Hugging Face model to GGUF and quantize it. Example (script and binary names change between releases, so follow the repo docs):
      python3 convert_hf_to_gguf.py ./hf-model-dir --outfile ./models/my-model-f16.gguf
      ./llama-quantize ./models/my-model-f16.gguf ./models/my-model-q4_0.gguf Q4_0
  2. ONNX / vendor NPU route:
    • Export model to ONNX and run vendor quantization tooling to produce NPU-optimized blobs. The vendor runtime will provide commands like ai-hat2-quantize or an ONNX plugin.

Tips:

  • Use the SSD when converting and storing models to avoid SD card I/O limits.
  • If conversion runs out of memory on the Pi, perform conversion on a workstation and copy the quantized artifact to the Pi.
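
Once the quantized artifact is on the Pi, run a quick smoke test before wiring it into a service. One convenient option (an assumption here, not a requirement) is the llama-cpp-python bindings, installable with pip install llama-cpp-python; the model path is whatever the conversion step produced:

# smoke_test.py: load the quantized model and generate a few tokens
from llama_cpp import Llama

llm = Llama(
    model_path="./models/my-model-q4_0.gguf",  # artifact from the conversion step
    n_ctx=2048,   # keep the context window modest on the Pi
    n_threads=4,  # match the Pi 5's four cores
)

out = llm("Q: What is the capital of France? A:", max_tokens=16, temperature=0.0)
print(out["choices"][0]["text"].strip())

If this prints a sensible completion in a few seconds, the artifact is good; if it fails to load, revisit the conversion or quantization step first.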

Step 5 — Run an inference runtime locally

Pick one of two practical deployment flows:

Option A — Native runtime (llama.cpp style)

  1. Run the compiled binary (named main in older llama.cpp builds, llama-cli in newer ones) against the quantized model:
    ./main -m ./models/my-model-q4_0.gguf -t 4 -n 128 -p "Summarize the following notes:"
  2. Adjust flags for generation length, temperature, top_p, and token streaming. On the Pi you’ll prioritize shorter generation lengths and simpler sampling to keep latency acceptable.

Option B — Model server (FastAPI + vendor runtime)

Wrap the runtime in a small HTTP microservice to make it easy to integrate into micro-apps and developer toolchains:

python3 -m venv venv && source venv/bin/activate
pip install fastapi uvicorn ai-hat2-sdk   # ai-hat2-sdk is a stand-in for your vendor's package name

# app.py (outline; ai_hat2_sdk stands in for whatever Python bindings your HAT vendor ships)
from fastapi import FastAPI
from pydantic import BaseModel
from ai_hat2_sdk import InferenceClient

app = FastAPI()
client = InferenceClient(model_path="/models/my-quantized-model")

class Prompt(BaseModel):
    text: str

@app.post("/generate")
def generate(prompt: Prompt):
    return {"completion": client.generate(prompt.text, max_tokens=128, temperature=0.3)}

# run (bind to localhost only; see the security notes below)
uvicorn app:app --host 127.0.0.1 --port 8000

Use FastAPI to keep the service lightweight and easy to containerize, and use systemd to daemonize this service (example later).
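
From another shell (or from the micro-app that consumes the service), a request is a plain HTTP POST. This assumes the /generate route and {"text": ...} payload from the outline above:

# client.py: call the local inference service
import requests

resp = requests.post(
    "http://127.0.0.1:8000/generate",
    json={"text": "Summarize: the weekly standup covered two deployment blockers."},
    timeout=60,  # generation on the Pi can take a while
)
resp.raise_for_status()
print(resp.json())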

Step 6 — Production-ish concerns: resilience, security and maintenance

Resilience

  • Set OOM protections and keep swap modest (swap on an SSD is a safety net, not a performance aid; heavy swapping will crater inference speed). If OOM persists, move to a smaller model or a stronger quantization level rather than adding swap.
  • Monitor NPU and CPU temps and throttle inference rates based on thermal headroom.
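
For the thermal point, a lightweight guard is to read the SoC temperature from sysfs before accepting new work and back off while it is hot. The path below is the usual one on Raspberry Pi OS and Ubuntu, but confirm it on your image; the 75 °C soft limit is an arbitrary starting point:

# thermal_guard.py: delay inference when the SoC is running hot
import time
from pathlib import Path

TEMP_PATH = Path("/sys/class/thermal/thermal_zone0/temp")  # value in millidegrees C
SOFT_LIMIT_C = 75.0

def soc_temp_c() -> float:
    return int(TEMP_PATH.read_text().strip()) / 1000.0

def wait_for_headroom(poll_seconds: float = 5.0) -> None:
    """Block until the SoC cools below the soft limit."""
    while soc_temp_c() >= SOFT_LIMIT_C:
        time.sleep(poll_seconds)

if __name__ == "__main__":
    print(f"SoC temperature: {soc_temp_c():.1f} C")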

Security and privacy

Running locally reduces egress risk, but you still need to secure the local API:

  • Bind HTTP services to 127.0.0.1 by default. If remote access is required, front with a reverse proxy and enforce TLS and authentication (Caddy or Nginx with mTLS).
  • Use API keys or JWT tokens for service-to-service auth.
  • Encrypt model artifacts at rest if they contain sensitive licensed data, and protect backups.
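
For the API-key option, FastAPI dependencies keep the check out of your route logic. A minimal sketch, assuming the key is provided via an environment variable (LLM_API_KEY is a placeholder name) and sent by clients in an X-API-Key header:

# auth.py: shared-secret check for the /generate route
import os
import secrets

from fastapi import Depends, FastAPI, Header, HTTPException

API_KEY = os.environ["LLM_API_KEY"]  # set in the systemd unit or your shell profile

def require_api_key(x_api_key: str = Header(default="")) -> None:
    # constant-time comparison avoids leaking key material via timing
    if not secrets.compare_digest(x_api_key, API_KEY):
        raise HTTPException(status_code=401, detail="invalid API key")

app = FastAPI()

@app.post("/generate", dependencies=[Depends(require_api_key)])
def generate(prompt: dict):
    ...  # call the inference client exactly as in the Step 5 outline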

Updates and reproducibility

  • Keep a small release process: version your quantized model artifact, keep conversion scripts in Git, and tag releases.
  • Automate conversion on CI (use a larger runner) and push artifacts to a private registry or S3 bucket; Pi then pulls the final quantized artifact.
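
On the Pi side, a small verification step makes rollouts predictable: record the artifact's SHA-256 when CI builds it, and refuse to load anything that does not match. A minimal sketch; the JSON manifest layout here is an assumption, not a standard:

# verify_artifact.py: check a downloaded model against its recorded checksum
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path, chunk: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        while data := f.read(chunk):
            h.update(data)
    return h.hexdigest()

def verify(model_path: str, manifest_path: str) -> None:
    manifest = json.loads(Path(manifest_path).read_text())  # e.g. {"version": "1.2", "sha256": "..."}
    actual = sha256_of(Path(model_path))
    if actual != manifest["sha256"]:
        raise SystemExit(f"checksum mismatch: expected {manifest['sha256']}, got {actual}")
    print(f"OK: {model_path} matches release {manifest.get('version', '?')}")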

Step 7 — Example: build a private “Docs Assistant” micro-app

Goal: locally host a searchable Q&A assistant that answers from a local docs corpus without sending data offsite.

Architecture

  • Indexer: run on laptop/CI to convert docs to embeddings and store them in a local vector DB (Chroma, SQLite+FAISS).
  • Pi service: local FastAPI that accepts queries, retrieves top-k docs, constructs a prompt, and runs the on-device model.
  • Client: CLI or browser UI only accessible on the local network or via a secured reverse proxy.

Minimal request flow

  1. Client sends question -> Pi service
  2. Pi service runs vector similarity against local DB -> returns top 3 snippets
  3. Pi service formats prompt: system instruction + snippets + user question
  4. Run model inference locally with short max tokens and streaming
  5. Return response
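
The retrieval and prompt-construction steps are only a few lines once the index exists. The sketch below assumes the indexer persisted a Chroma collection named "docs" to a directory the Pi service can read; both names are placeholders:

# retrieve.py: fetch top-k snippets and build the prompt for the on-device model
import chromadb

chroma = chromadb.PersistentClient(path="/home/dev/app/chroma")  # written by the indexer
docs = chroma.get_collection("docs")

def build_prompt(question: str, k: int = 3) -> str:
    hits = docs.query(query_texts=[question], n_results=k)
    snippets = "\n\n".join(hits["documents"][0])
    return (
        "You are a documentation assistant. Answer only from the context below.\n\n"
        f"Context:\n{snippets}\n\n"
        f"Question: {question}\nAnswer:"
    )

The returned string is what the Pi service passes to the model with a short max_tokens setting.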

Systemd service example (save as /etc/systemd/system/llm-docs.service, then enable with sudo systemctl enable --now llm-docs)

[Unit]
Description=Local LLM microservice
After=network.target

[Service]
User=dev
Group=dev
WorkingDirectory=/home/dev/app
Environment="VIRTUAL_ENV=/home/dev/app/venv"
ExecStart=/home/dev/app/venv/bin/uvicorn app:app --host 127.0.0.1 --port 8000
Restart=on-failure
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

Performance tuning and benchmarks (practical tips)

  • Start with threads = number of physical cores and tune down if latency increases due to context switching.
  • Use streaming generation with small chunks for UI responsiveness (e.g., token streaming every 8–16 tokens).
  • Prefer lower temperature and reduced top_p for deterministic outputs with lower compute cost.
  • Measure tokens/sec and memory usage. Expect ~1–5 tokens/sec for unaccelerated models and better when the AI HAT+ 2 can offload matrix ops — vendor numbers will vary.
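
A crude benchmark against the running service is enough to compare quantization levels and thread counts. This assumes the /generate endpoint from Step 5; exact token counting depends on your runtime, so a whitespace split stands in as an approximation:

# bench.py: rough tokens-per-second against the local service
import time
import requests

PROMPT = {"text": "List three benefits of running inference on-device."}

start = time.perf_counter()
resp = requests.post("http://127.0.0.1:8000/generate", json=PROMPT, timeout=300)
elapsed = time.perf_counter() - start
resp.raise_for_status()

text = str(resp.json())            # adapt to your actual response schema
approx_tokens = len(text.split())  # crude proxy for generated tokens
print(f"~{approx_tokens} tokens in {elapsed:.1f}s -> {approx_tokens / elapsed:.2f} tok/s")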

Common problems and fixes

  • Out of memory: use a smaller model, a stronger quantization level (e.g., step down from Q4_K_M to Q3_K_M), or reduce max_tokens and context length.
  • High latency: enable NPU offload or simplify sampling (disable nucleus sampling, lower top_k).
  • Service crashes under heavy load: add rate limiting and queue requests; Redis or a local in-memory queue smooths bursts (a minimal in-process sketch follows this list).
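
For the overload case, the simplest in-process fix is to cap concurrent generations and make extra requests wait briefly or fail fast instead of exhausting RAM. A minimal sketch using an asyncio semaphore inside the FastAPI app (an external queue such as Redis is the sturdier option once you have multiple clients):

# app_limited.py: cap concurrent generations to protect memory
import asyncio
from fastapi import FastAPI, HTTPException

app = FastAPI()
slots = asyncio.Semaphore(1)  # one generation at a time on a Pi 5

def run_inference(text: str) -> dict:
    ...  # call the vendor client or llama.cpp bindings here

@app.post("/generate")
async def generate(prompt: dict):
    try:
        # fail fast if no slot frees up within 30 s instead of queueing forever
        await asyncio.wait_for(slots.acquire(), timeout=30)
    except asyncio.TimeoutError:
        raise HTTPException(status_code=503, detail="busy, retry later")
    try:
        # run the blocking inference call off the event loop
        return await asyncio.to_thread(run_inference, prompt["text"])
    finally:
        slots.release()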

Integration patterns for developer toolchains

Developers and IT teams want automations. Practical integrations:

  • Git hooks + a local LLM-assisted commit message generator that runs on the Pi and returns suggestions via the CLI (see the sketch after this list).
  • CI/CD pipelines that push newly converted quantized artifacts into a secure artifact store; Pi pulls and rotates models during maintenance windows.
  • IoT devices using the Pi as an inference gateway: small devices send short contexts; the Pi returns intent and commands to local devices without cloud round trips — see patterns for running scalable micro-event streams at the edge when integrating many low-power clients.
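
For the git-hook pattern, a prepare-commit-msg hook can post the staged diff summary to the Pi and append the suggestion as a comment. The sketch below assumes the /generate service from Step 5 is reachable from your workstation (for example through the secured reverse proxy from Step 6); the host name is a placeholder:

#!/usr/bin/env python3
# .git/hooks/prepare-commit-msg: append a suggested commit message from the local LLM
import subprocess
import sys

import requests

PI_HOST = "http://pi5.local:8000"  # placeholder; use your Pi's address

def main() -> None:
    msg_file = sys.argv[1]
    diff = subprocess.run(
        ["git", "diff", "--cached", "--stat"], capture_output=True, text=True
    ).stdout
    if not diff.strip():
        return
    resp = requests.post(
        f"{PI_HOST}/generate",
        json={"text": f"Write a one-line commit message for these changes:\n{diff}"},
        timeout=30,
    )
    if resp.ok:
        with open(msg_file, "a") as f:
            f.write(f"\n# suggested: {str(resp.json()).strip()}\n")

if __name__ == "__main__":
    main()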

What’s next for edge inference (2026–2027)

  • Edge NPUs will standardize around ONNX + vendor-neutral runtimes — expect broader cross-HAT compatibility in 2026–2027.
  • Model distillation and compiler-level optimizations will continue to improve usable quality at 1B–3B scales, making many micro-apps indistinguishable from cloud models for domain-specific tasks.
  • Privacy-first apps and regulation will accelerate on-device inference adoption for sensitive verticals (health, finance, etc.).

Rule of thumb for 2026: If a model can be distilled to 3B or less with robust quantization, it’s likely viable on a Pi 5 + AI HAT+ 2 for many micro-app use cases.

Real-world example: a 48-hour weekend prototype

Scenario: you want a local “meeting minutes summarizer” that runs on office Pi devices.

  1. Day 1 morning: install OS and AI HAT+ 2 drivers (~2 hours).
  2. Day 1 afternoon: convert a 2B instruction-tuned model on a workstation and copy the quantized file (~3 hours).
  3. Day 2 morning: build a FastAPI that receives meeting audio transcripts (or text), retrieves context, and runs the model locally (~4 hours).
  4. Day 2 afternoon: secure the API, add a simple UI, test latency, and tune sampling parameters (~3–4 hours).

In under two days you can have a private, offline prototype that demonstrates privacy benefits and developer velocity — and that’s precisely why micro-apps are taking off.

Checklist before you roll to users

  • Model license compliance and local use permissions (verify you can run and distribute the model locally).
  • Production health checks: automated restarts, logs retention, and secure backups of model artifacts.
  • Performance SLA definition: expected latency per inference and rate limiting.

Final notes — what to experiment with next

  • Local fine-tuning or LoRA on small datasets to make micro-apps domain-specialized.
  • Hybrid architectures: a small local LLM for private, short tasks plus a cloud model for heavy-lift or long-context jobs (see the sketch after this list).
  • Edge federated learning patterns that keep raw data local while sharing model deltas centrally for aggregated improvements.
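
For the hybrid pattern in particular, the routing rule can start out trivially simple: keep anything short or explicitly sensitive on-device and escalate the rest. The cloud endpoint and payload shape below are placeholders for whichever provider you pair with the Pi:

# router.py: send prompts to the local model or a cloud model
import requests

LOCAL_URL = "http://127.0.0.1:8000/generate"
CLOUD_URL = "https://api.example.com/v1/generate"  # placeholder provider endpoint
LOCAL_WORD_BUDGET = 512                            # rough prompt-size cutoff

def route(prompt: str, sensitive: bool) -> dict:
    use_local = sensitive or len(prompt.split()) < LOCAL_WORD_BUDGET
    if use_local:
        resp = requests.post(LOCAL_URL, json={"text": prompt}, timeout=120)
    else:
        resp = requests.post(CLOUD_URL, json={"prompt": prompt}, timeout=60)
    resp.raise_for_status()
    return resp.json()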

Actionable takeaways

  • Choose a 1–3B model and quantize appropriately (Q4 variants often hit the best quality-for-memory tradeoff on ARM NPUs).
  • Use an SSD for model storage and run conversions on a workstation if necessary.
  • Wrap inference in a small HTTP microservice with local-only binding and strong auth when exposed externally.
  • Automate conversion and artifact distribution via CI to maintain reproducibility and updates.

Call to action

Ready to prototype an offline micro-app? Start with the Raspberry Pi 5 + AI HAT+ 2. Follow the steps above, pick a small instruction-tuned model, and ship a private assistant in a weekend. For a reproducible starter kit, template systemd files, and example FastAPI code ready for Pi 5, clone our community repo and join the conversation — share benchmarks, deployment patterns, and your micro-app ideas with other edge AI builders.
