Introduction

In 2025, running a capable large language model (LLM) locally is no longer reserved for data centers or specialized AI servers. With modern tooling and an NVIDIA RTX 3080 (or better), you can host and interact with a powerful local LLM right from your Arch Linux workstation.

In this guide, we’ll go step-by-step through:

  1. Preparing your Linux system for GPU-accelerated inference
  2. Running a local vLLM container with OpenAI-compatible API
  3. Connecting from your .NET 8/9 applications
  4. Tuning model performance and memory usage
  5. Alternative setup using Ollama for simpler management

We’ll close with a short summary and performance notes.


⚙️ 1. Prerequisites: System and GPU Setup

🧩 Hardware Requirements

  • CPU: Intel i9 (or AMD Ryzen 9 equivalent)
  • RAM: ≥ 32 GB (64 GB preferred)
  • GPU: NVIDIA RTX 3080 (10 GB VRAM) or higher
  • Storage: At least 50 GB free for models and cache

🧠 Why Local?

Running an LLM locally gives you:

  • Full data privacy — nothing leaves your machine.
  • Zero token costs — your GPU does the heavy lifting.
  • Offline availability — ideal for restricted or air-gapped systems.
  • Near-instant iteration for .NET prototyping and agent frameworks.

🧰 2. Installing the NVIDIA Container Runtime on Arch Linux

Before Docker can use your GPU, install the official NVIDIA runtime layer:

sudo pacman -S nvidia-dkms nvidia-utils nvidia-container-toolkit docker
sudo systemctl enable --now docker

Now configure Docker to use NVIDIA’s runtime:

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Test your setup with a CUDA base image:

docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

You should see your RTX 3080 listed under “GPU 0”.


🚀 3. Running vLLM — a High-Performance Local LLM Server

vLLM is one of the most efficient open-source inference engines available today.
It supports PagedAttention, CUDA, and an OpenAI-compatible API out of the box.

✅ Basic Container Run

docker run -d \
  --name vllm \
  --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --tensor-parallel-size 1

This command:

  • Uses GPU acceleration (--gpus all)
  • Maps your Hugging Face cache to avoid repeated downloads
  • Exposes the service on localhost:8000
  • Loads the Mistral 7B-Instruct model, a fast, multilingual model that suits this GPU class (on a 10 GB RTX 3080, pair it with the quantized setup from section 4)

Once started, the OpenAI-compatible API is available at:

http://localhost:8000/v1

⚙️ 4. Performance Tuning: Quantization and Memory Efficiency

If your GPU runs close to its 10 GB VRAM limit, consider quantized loading:

docker run -d \
  --name vllm \
  --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --quantization bitsandbytes \
  --load-format bitsandbytes

This loads the model weights in 4-bit precision via BitsAndBytes (vLLM's in-flight quantization, enabled by the two flags above). The arithmetic is simple: a 7B-parameter model needs roughly 7 × 2 bytes ≈ 14 GB of VRAM for FP16 weights, but only about 3.5 GB at 4 bits, which leaves room for the KV cache on a 10 GB card. Quality loss is usually minor.


🧩 5. Integrating vLLM with .NET 8/9

Because vLLM exposes an OpenAI-compatible REST interface, integrating it into your .NET apps is extremely straightforward.

📦 Using HttpClient

using System.Net.Http.Json;

var client = new HttpClient { BaseAddress = new Uri("http://localhost:8000/v1/") };

var request = new
{
    model = "mistralai/Mistral-7B-Instruct-v0.2",
    prompt = "Explain the benefits of Arch Linux in two sentences.",
    max_tokens = 100
};

var response = await client.PostAsJsonAsync("completions", request);
var content = await response.Content.ReadAsStringAsync();
Console.WriteLine(content);

This mirrors the OpenAI API structure, so your application can switch between local and cloud models by changing nothing more than the base address and API key in configuration.
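The response body follows the OpenAI completions schema, so the generated text sits in choices[0].text. A minimal sketch that continues from the content string above, using System.Text.Json (the property names are the standard OpenAI ones, not vLLM-specific):

using System.Text.Json;

// Parse the raw JSON returned by /v1/completions and pull out the generated text.
using var doc = JsonDocument.Parse(content);

var text = doc.RootElement
    .GetProperty("choices")[0]   // first (and here only) completion choice
    .GetProperty("text")
    .GetString();

Console.WriteLine(text);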


🧠 Using the OpenAI .NET SDK

If you prefer using the official SDK:

using System.ClientModel;
using OpenAI;
using OpenAI.Chat;

// Point the official OpenAI .NET SDK at the local vLLM endpoint.
// vLLM does not validate the key by default, but the SDK needs a non-empty value.
var client = new ChatClient(
    "mistralai/Mistral-7B-Instruct-v0.2",
    new ApiKeyCredential("sk-local"),
    new OpenAIClientOptions { Endpoint = new Uri("http://localhost:8000/v1") });

ChatCompletion completion = await client.CompleteChatAsync(
    new UserChatMessage("Summarize the core principles of Arch Linux."));

Console.WriteLine(completion.Content[0].Text);

The SDK doesn’t care whether it’s talking to OpenAI or your local vLLM — both follow the same REST schema.
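For interactive scenarios you will usually want token streaming rather than a single blocking response. A minimal sketch, assuming the 2.x official OpenAI .NET package and the client instance configured above (vLLM streams over the same chat/completions route):

using OpenAI.Chat;

// Stream the answer token by token instead of waiting for the full completion.
await foreach (StreamingChatCompletionUpdate update in
    client.CompleteChatStreamingAsync("Explain pacman in one short paragraph."))
{
    foreach (ChatMessageContentPart part in update.ContentUpdate)
    {
        Console.Write(part.Text);
    }
}

Console.WriteLine();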


🐋 6. Using Docker Compose for Persistent Setup

To make your local model server reproducible, define a simple docker-compose.yml:

version: "3.8"

services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm
    restart: unless-stopped
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    ipc: host
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    command: >
      --model mistralai/Mistral-7B-Instruct-v0.2
      --tensor-parallel-size 1

Start it (and keep it running across reboots, thanks to the restart policy) with:

docker compose up -d

You now have a persistent, GPU-accelerated LLM service running 24/7.
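On the .NET side it pays off to register the compose-hosted endpoint once via IHttpClientFactory instead of newing up HttpClient instances ad hoc. A minimal ASP.NET Core sketch; the configuration key Llm:BaseUrl, the client name "vllm", and the /models route are illustrative choices, not anything vLLM prescribes:

// Program.cs of a .NET 8 web project
var builder = WebApplication.CreateBuilder(args);

// "Llm:BaseUrl" is a hypothetical config key; fall back to the compose service.
var llmBaseUrl = builder.Configuration["Llm:BaseUrl"] ?? "http://localhost:8000/v1/";

builder.Services.AddHttpClient("vllm", client =>
{
    client.BaseAddress = new Uri(llmBaseUrl);
    client.Timeout = TimeSpan.FromMinutes(2); // local inference can be slow on first load
});

var app = builder.Build();

// Simple pass-through so you can check which model the container is serving.
app.MapGet("/models", async (IHttpClientFactory factory) =>
{
    var http = factory.CreateClient("vllm");
    return Results.Content(await http.GetStringAsync("models"), "application/json");
});

app.Run();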


💡 7. Simpler Alternative: Ollama

If you prefer a no-configuration setup, Ollama offers a compact, user-friendly alternative with built-in GPU support.

docker run -d --gpus all --name ollama -v ollama:/root/.ollama -p 11434:11434 ollama/ollama
docker exec -it ollama ollama pull llama3

You can query it with:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Explain Arch Linux briefly."
}'

Ollama automatically downloads, quantizes, and caches models — great for smaller GPUs or local experimentation.
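From .NET, Ollama's native API is just another HTTP endpoint. A minimal sketch that sets "stream": false so the reply arrives as one JSON object (the field names model, prompt, stream, and response come from Ollama's documented /api/generate contract):

using System.Net.Http.Json;
using System.Text.Json;

var http = new HttpClient { BaseAddress = new Uri("http://localhost:11434/") };

// stream = false makes Ollama return a single JSON object instead of NDJSON chunks.
var response = await http.PostAsJsonAsync("api/generate", new
{
    model = "llama3",
    prompt = "Explain Arch Linux briefly.",
    stream = false
});

using var doc = JsonDocument.Parse(await response.Content.ReadAsStringAsync());
Console.WriteLine(doc.RootElement.GetProperty("response").GetString());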


🧠 8. Model Recommendations for RTX 3080

Model                       Size      Performance               Notes
Mistral 7B-Instruct-v0.2    ~13 GB    ⚡ Very fast               Great general-purpose model
Meta Llama 3 8B             ~15 GB    🔥 Excellent reasoning     Slightly heavier on VRAM
Phi-3 Mini (3.8B)           ~8 GB     🧠 Efficient               Ideal for code and Q&A
Llama 3 70B                 > 48 GB   ❌ Too large               Not suitable for 3080

If you want the best trade-off between speed, quality, and memory, Mistral 7B-Instruct is the sweet spot; on a 10 GB 3080, run it quantized as shown in section 4.


🧩 9. Debugging and Health Checks

You can verify your vLLM service with:

curl http://localhost:8000/v1/models

Expected response:

{
  "data": [{
    "id": "mistralai/Mistral-7B-Instruct-v0.2",
    "object": "model"
  }]
}

To view container logs:

docker logs -f vllm
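The same readiness check is easy to run from .NET, for example as a startup probe before your service begins sending prompts. A minimal sketch:

using System.Text.Json;

var http = new HttpClient { BaseAddress = new Uri("http://localhost:8000/") };

try
{
    // /v1/models only answers once the server is up and the model is loaded.
    using var doc = JsonDocument.Parse(await http.GetStringAsync("v1/models"));
    var modelId = doc.RootElement.GetProperty("data")[0].GetProperty("id").GetString();
    Console.WriteLine($"vLLM is up and serving: {modelId}");
}
catch (HttpRequestException ex)
{
    Console.WriteLine($"vLLM is not reachable yet: {ex.Message}");
}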

🧾 10. Typical Use Cases

Once running, your local LLM becomes a building block for many projects:

  • Offline Copilots or AI assistants for developers
  • Automated report generation using .NET services
  • Local code completion and documentation tools
  • Data privacy–critical applications (finance, healthcare, public sector)

Because you can integrate vLLM directly via its OpenAI-compatible REST API, it fits neatly into any modern microservice or .NET backend.
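As a concrete starting point for the "offline copilot" case, here is a minimal console chat loop. It reuses the ChatClient setup from section 5 and simply keeps the conversation history in memory; anything beyond that (tool calls, retrieval, persistence) is left to your application:

using System.ClientModel;
using OpenAI;
using OpenAI.Chat;

var client = new ChatClient(
    "mistralai/Mistral-7B-Instruct-v0.2",
    new ApiKeyCredential("sk-local"),
    new OpenAIClientOptions { Endpoint = new Uri("http://localhost:8000/v1") });

// Keep plain user/assistant turns, which every chat template supports.
var history = new List<ChatMessage>();

while (true)
{
    Console.Write("you> ");
    var input = Console.ReadLine();
    if (string.IsNullOrWhiteSpace(input)) break;

    history.Add(new UserChatMessage(input));
    ChatCompletion completion = await client.CompleteChatAsync(history);

    var answer = completion.Content[0].Text;
    history.Add(new AssistantChatMessage(answer));

    Console.WriteLine($"llm> {answer}");
    Console.WriteLine();
}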


🧭 Conclusion

Running a high-quality large language model locally on an RTX 3080 with Arch Linux is not just possible — it’s efficient, fast, and developer-friendly.

With vLLM, you get:

  • GPU-optimized inference
  • OpenAI-compatible API endpoints
  • Smooth integration with .NET 8 and 9
  • Reproducible Docker setups for enterprise or research

And with Ollama as a fallback, you can experiment freely without complex dependencies.

Whether you’re prototyping .NET Copilots, building private AI agents, or exploring model performance tuning, a local LLM stack empowers you to stay independent, private, and fast — all while leveraging the power of your RTX 3080.


Happy coding, and welcome to the era of private, GPU-accelerated AI on your own workstation! 🧠💻

