Solo Server Setup Guide

Solo Server is a lightweight orchestration layer for hardware‑aware inference. Spin up Ollama, vLLM, or llama.cpp back‑ends in seconds with an opinionated CLI and a consistent REST API.


# Install
pip install solo-server

# Interactive setup (detects hardware, writes solo.json)
solo setup

✨ Features

  • Seamless setup – one‑command solo setup auto‑detects CPU/GPU/RAM and writes an optimised config
  • 📚 Open model registry – pull weights from Hugging Face, Ollama, or local GGUF bins
  • 🖥️ Cross‑platform – macOS (Apple Silicon & Intel), Linux, Windows 10/11
  • 🛠️ Configurable framework – tweak ports, back‑end, quantisation, & device mapping in ~/.solo_server/solo.json

Table of Contents

  • Installation
  • Commands
  • REST API
  • Configuration (solo.json)
  • Project inspiration

Installation

🔹 Prerequisites

# Install uv (see full instructions: https://docs.astral.sh/uv/getting-started/installation/)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create and activate a virtualenv
uv venv
source .venv/bin/activate  # Linux/macOS
# .venv\Scripts\activate   # Windows PowerShell

# Install Solo Server
uv pip install solo-server

# Run the interactive wizard
solo setup

The wizard detects hardware, selects the optimal compute back‑end (CUDA, HIP, Metal, CPU, …) and writes solo.json.
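
To double‑check what the wizard decided, inspect the file it wrote and rerun it whenever your hardware changes (the path is the default location described in the Configuration section below):

# Show the detected hardware and chosen back‑end
cat ~/.solo_server/solo.json

# Re-run detection, e.g. after adding a GPU
solo setup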


Solo Server block diagram (image omitted)


Commands

Serve a model

solo serve -m llama3.2:latest

Flag           Description                          Default
-s, --server   Back‑end: ollama, vllm, llama.cpp    ollama
-m, --model    Model name or path                   –
-p, --port     HTTP port                            5070
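
For example, the flags combine to serve a model on the vLLM back‑end at a non‑default port (the model ID below is only a placeholder, substitute any model available to you):

solo serve -s vllm -m meta-llama/Llama-3.2-1B-Instruct -p 8080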

Test inference

solo test            # quick health‑check
solo test --timeout 120  # increase timeout for large models

List models

solo list   # scans Hugging Face cache & Ollama store

Check server status

solo status

Stop servers

solo stop   # gracefully shut down running back‑ends

REST API

Solo exposes a thin proxy so your code never needs to change when you swap back‑ends.

Ollama‑style endpoints

curl http://localhost:5070/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?"
}'
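
The upstream Ollama API streams tokens by default and accepts a stream field to return a single JSON object instead; assuming the Solo proxy forwards the request body unchanged, the same field works here:

curl http://localhost:5070/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'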

OpenAI‑compatible endpoints (vLLM & llama.cpp)

curl http://localhost:5070/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Why is the sky blue?"}]
}'
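
Standard OpenAI request fields such as temperature and max_tokens belong to the same schema; how each field is honoured depends on the underlying back‑end (vLLM or llama.cpp), so treat this as a sketch rather than a guarantee:

curl http://localhost:5070/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Why is the sky blue?"}],
  "temperature": 0.2,
  "max_tokens": 256
}'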

⚙️ Configuration (solo.json)

solo setup writes a machine‑specific config at ~/.solo_server/solo.json. Edit it manually or rerun the wizard any time.

{
  "hardware": {
    "use_gpu": true,
    "compute_backend": "CUDA",
    "gpu_memory": 6144.0
  },
  "server": {"type": "ollama", "default_port": 5070},
  "active_model": {"server": "ollama", "name": "llama3.2:1b"}
}
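
For example, to change the port by hand, edit default_port in the file and then restart the back‑end; the restart step is an assumption (only solo stop and solo serve are documented above), but it is the safe way to make sure the new value is read:

# After editing ~/.solo_server/solo.json (e.g. setting default_port to 8080),
# restart so the new setting is picked up
solo stop
solo serve -m llama3.2:1b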

📝 Project inspiration

Solo Server stands on the shoulders of:

  • uv – blazing‑fast Python package manager
  • llama.cpp, vLLM, Ollama – state‑of‑the‑art inference back‑ends
  • Hugging Face Hub, whisper.cpp, llamafile, podman, cog

If you find Solo useful, please ⭐ the repo!