Model building is only one part of the journey. The real work is turning the model into a service you can run, monitor, and scale.
A fine-tuned model is not a product yet. If you want a custom model to matter in the real world, you need a reliable way to serve it on hardware you control.
In this guide, I use Padauk as a small-language-model example and show the deployment path I would use on a custom Ubuntu VM. The exact model can change, but the deployment pattern stays the same.
If you want the project context first, the main page is here: Padauk.
What we are building
The goal is simple:
turn a fine-tuned model into a production-style inference service on a VM you own.
That means we want:
- a quantized model file that fits the machine
- a runtime that can serve inference reliably
- a process manager that restarts the service if it dies
- a reverse proxy so the endpoint feels like a real API
- logs and health checks so we can debug problems quickly
Here is the path we will follow:
- Fine-tuned checkpoint
- GGUF / quantized model
- Custom Ubuntu VM
- llama.cpp server
- systemd service
- Nginx / TLS
That is the difference between “I have a model” and “I have a service.”
Before you start
For a small fine-tuned model, start with a VM that gives you enough room for the weights, the context window, and the runtime overhead.
My practical starting point is:
- Ubuntu 22.04 or 24.04
- 4 to 8 vCPU
- 8 to 16 GB RAM
- SSD or NVMe storage
- SSH access
You do not need a giant machine to begin. You do need enough memory to avoid swapping all the time.
If your model is already quantized, deployment is much easier. If it is not, export and quantize first, then move on to the VM.
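If you are unsure whether a VM size fits, a back-of-the-envelope estimate helps. The numbers below are assumptions for a roughly 7B-parameter model at about 5 bits per weight (in the Q4_K_M range), not measurements of Padauk itself:

```shell
# Rough RAM estimate for a quantized model (all numbers are assumptions).
PARAMS_B=7         # parameters, in billions
BITS_PER_WEIGHT=5  # ~Q4_K_M average
OVERHEAD_GB=2      # KV cache + runtime overhead, rough guess

WEIGHTS_GB=$(( PARAMS_B * BITS_PER_WEIGHT / 8 ))
TOTAL_GB=$(( WEIGHTS_GB + OVERHEAD_GB ))
echo "~${WEIGHTS_GB} GB weights, ~${TOTAL_GB} GB total"
```

If the total lands near your VM's RAM, either quantize harder or pick a bigger machine, because swapping will ruin latency.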
Step 1: Prepare the VM
Start with a clean Ubuntu machine and install the basic tools you will need.
sudo apt update
sudo apt upgrade -y
sudo apt install -y build-essential cmake git curl htop ufw
sudo ufw allow OpenSSH
sudo ufw allow 80
sudo ufw allow 443
sudo ufw enable
If you are using the VM for inference only, keep the attack surface small. Open only the ports you actually need.
Step 2: Install llama.cpp
llama.cpp is a practical choice because it gives you CPU-native inference, GGUF support, and an OpenAI-compatible server mode.
Build it directly on the VM:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DLLAMA_BUILD_SERVER=ON
cmake --build build -j$(nproc)
If you prefer a binary release for faster setup, you can use that too. The deployment idea is the same either way.
Step 3: Copy the model to the VM
By the time you reach deployment, you should already have a GGUF file.
In this example I will call it padauk.gguf.
Create a place for it on the server:
sudo mkdir -p /opt/padauk
sudo chown -R $USER:$USER /opt/padauk
Copy the model from your local machine:
scp /path/to/padauk.gguf user@YOUR_VM_IP:/opt/padauk/padauk.gguf
If the file is large, rsync is a better choice because it can resume interrupted transfers:
rsync -av --partial --progress /path/to/padauk.gguf user@YOUR_VM_IP:/opt/padauk/
Step 4: Run a first inference test
Before adding systemd, reverse proxies, or TLS, run the model manually once. That makes it easier to debug startup issues.
/path/to/llama.cpp/build/bin/llama-server \
-m /opt/padauk/padauk.gguf \
--host 127.0.0.1 \
--port 8080 \
--threads 4 \
--ctx-size 4096
If your machine has enough RAM, you can also experiment with --mlock, which pins the weights in memory so they are never swapped out. Do not force it if memory is tight.
Test it with curl:
curl -X POST http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "padauk",
"messages": [
{"role": "user", "content": "Hello. Give me a short introduction."}
]
}'
If this works locally, the model is ready for the next layer.

Step 5: Turn it into a systemd service
Running a server in a terminal is fine for testing.
Running it under systemd is what makes it behave like a real service.
Create a unit file at /etc/systemd/system/padauk.service. The unit below assumes a dedicated llm user that owns the model files, and that llama.cpp lives under /opt/llama.cpp; adjust User, Group, and the paths to match your setup.
[Unit]
Description=Padauk Inference Server
After=network-online.target
Wants=network-online.target
[Service]
User=llm
Group=llm
WorkingDirectory=/opt/llama.cpp
ExecStart=/opt/llama.cpp/build/bin/llama-server -m /opt/padauk/padauk.gguf --host 127.0.0.1 --port 8080 --threads 4 --ctx-size 4096
Restart=always
RestartSec=5
LimitNOFILE=65535
[Install]
WantedBy=multi-user.target
Then enable it:
sudo systemctl daemon-reload
sudo systemctl enable padauk.service
sudo systemctl start padauk.service
sudo systemctl status padauk.service
This is the first big production step: your model can now restart automatically if the process crashes.
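Optionally, you can tighten the service with a systemd drop-in. The directives below are standard systemd sandboxing options; treat the file as a sketch to adapt, not a requirement:

```ini
# /etc/systemd/system/padauk.service.d/hardening.conf
[Service]
NoNewPrivileges=true
ProtectSystem=full
ProtectHome=true
PrivateTmp=true
# LimitMEMLOCK=infinity   # uncomment if you run llama-server with --mlock
```

After adding a drop-in, run sudo systemctl daemon-reload and restart the service so the changes take effect.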
Step 6: Put Nginx in front of it
You usually do not want the model server exposed directly to the public internet. Put a reverse proxy in front so your app talks to a stable public endpoint while the model stays on localhost.
Example Nginx config:
server {
listen 80;
server_name llm.example.com;
location / {
proxy_pass http://127.0.0.1:8080;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_read_timeout 300s;
}
}
Install Nginx, place the config under /etc/nginx/sites-available/ and symlink it into sites-enabled/, then reload. If you want HTTPS, install Certbot and let it rewrite the config for TLS.
sudo apt install -y nginx
sudo systemctl enable nginx
sudo systemctl restart nginx
sudo apt install -y certbot python3-certbot-nginx
sudo certbot --nginx -d llm.example.com
Now the model looks more like a production API and less like a local experiment.
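If several clients will share the endpoint, basic rate limiting at the proxy is worth considering. This sketch uses Nginx's standard limit_req module; the zone name and limits are placeholders to tune:

```nginx
# In the http context (e.g. /etc/nginx/nginx.conf): define a per-IP zone.
limit_req_zone $binary_remote_addr zone=llm_limit:10m rate=5r/s;

# In the server block from above: apply it to the proxied location.
location / {
    limit_req zone=llm_limit burst=10 nodelay;
    proxy_pass http://127.0.0.1:8080;
}
```

Inference requests are slow and expensive compared with normal web traffic, so even a generous limit protects the VM from accidental floods.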
Step 7: Tune for production-style inference
This is where AI engineering becomes systems engineering.
The first knobs I tune are:
- --threads: match your CPU, but do not oversubscribe blindly
- --ctx-size: only make it as large as your use case needs
- model quantization: smaller is usually better for a CPU VM, as long as quality stays acceptable
- storage: keep the GGUF file on fast disk
- process restart behavior: systemd should recover the service automatically
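The memory cost of --ctx-size comes mostly from the KV cache, which grows linearly with context length: roughly 2 (keys and values) × layers × context × per-layer KV width × bytes per value. The model numbers below are assumptions for a small 7B-class model with grouped-query attention, not Padauk's actual architecture:

```shell
# Rough KV-cache cost of --ctx-size (model numbers are assumptions).
LAYERS=32     # transformer layers
KV_DIM=1024   # per-layer K/V width (grouped-query attention)
CTX=4096      # --ctx-size
BYTES=2       # f16 cache entries

KV_BYTES=$(( 2 * LAYERS * CTX * KV_DIM * BYTES ))
echo "KV cache at ctx=${CTX}: $(( KV_BYTES / 1024 / 1024 )) MB"
```

Doubling the context doubles this cost, which is why oversizing --ctx-size on an 8 GB VM hurts.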
If you serve multiple clients, test latency under load. A model that works in a single curl request may fall apart when several users hit it at once.
Useful commands:
journalctl -u padauk.service -f
curl http://127.0.0.1:8080/v1/models
curl http://127.0.0.1:8080/health
That gives you logs, a model listing, and a simple liveness signal.
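For the latency signal, a percentile is more honest than an average. Assuming you log one numeric per-request latency per line (the log format and file path here are hypothetical), a quick p95 looks like:

```shell
# p95 of request latencies, one numeric value per line (hypothetical log format).
p95() {
  sort -n "$1" | awk '{ a[NR] = $1 } END { print a[int((NR * 95 + 99) / 100)] }'
}

# Example: p95 /var/log/padauk/latency.log
```

The index is computed with integer arithmetic to avoid floating-point rounding surprises in awk.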
Step 8: Connect your app
At this point the model is no longer just a checkpoint. It is an endpoint your application can call.
Your app can point to:
https://llm.example.com/v1
From there, you can use the same server for chat, function calling, and internal workflows.
That is the real benefit of deploying on a VM: you control the serving layer, the network layer, and the runtime behavior.
Common mistakes
The biggest problems I see are usually not model problems. They are deployment problems.
- using a VM that is too small
- exposing the inference port directly to the internet
- skipping systemd
- ignoring log files until something breaks
- choosing too much context size for no reason
- trying to run a full-size model when a smaller quantized model would work better
If the service is meant to feel production-like, it should also behave production-like: predictable startup, restart on failure, visible logs, and a clear network boundary.
Why Padauk works well as the example
Padauk is a good example because it keeps the focus on deployment. The point is not that this is a special model. The point is that once you know how to serve one small fine-tuned model well, the same pattern applies to the rest of your stack.
That is the mindset I want to encourage:
- model building is important
- but model serving is what turns the work into something people can actually use
Final takeaway
If you are an AI engineer, the job is not finished when fine-tuning ends. The harder and more valuable part is turning that fine-tuned model into a service that scales, restarts, logs, and integrates cleanly.
Padauk is just the example. The lesson applies to every custom model you want to put into production:
build the model, then build the path that makes it usable.
Frequently Asked Questions
1. What VM size should I start with for a fine-tuned LLM?
For a small quantized model, start with 4 to 8 vCPU and 8 to 16 GB RAM. If your context window or traffic is larger, scale up after you measure real usage.
2. Do I need a GPU to do this?
No. A lot of custom fine-tuned models can run well on CPU if they are quantized properly and the VM is sized sensibly.
3. Why use GGUF?
GGUF is practical because it works well with llama.cpp and makes CPU deployment much easier than shipping a raw training checkpoint.
4. Why use systemd instead of just running the model in a terminal?
systemd gives you automatic restart, boot-time startup, and better operational control. That is what makes the deployment feel production-like.
5. Why put Nginx in front of the model server?
Nginx gives you a stable public endpoint, better control over traffic, and an easier path to TLS. It also keeps the model server on localhost.
6. What should I monitor first?
Start with logs, startup time, latency, memory usage, and restart frequency. Those five signals tell you a lot about whether the VM deployment is healthy.
7. Can I use the same process for another fine-tuned model?
Yes. The deployment pattern is reusable. Only the model file and the tuning values change.