Choosing the right Ollama model for your WordPress site is not trivial. The official registry lists 50+ options, and each one trades off quality, speed, and RAM consumption differently. This guide helps you decide based on your actual use case.
Methodology
I tested 8 representative models on a 4 vCPU / 16 GB RAM VPS without a GPU, running real WordPress tasks:
- Meta-description generation (10 articles, 155 char max)
- Draft rewriting (8 articles ~800 words)
- SEO analysis (keyword research + optimized titles)
- Comment summarization (50 real comments)
- Comment moderation (spam/legit classification)
Each model was evaluated on 3 axes:
- Quality (1-5): human-rated output. Publishable as-is or needs editing?
- Speed (tokens/second): measured with ollama benchmark.
- RAM: GB consumption of the loaded model (model size + KV cache).
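A tokens/second figure can also be derived from the timing fields Ollama's HTTP API returns with each completion. The sketch below assumes measurements go through `/api/generate` with `stream: false`; the sample object is made up for illustration and is not one of the measured results.

```javascript
// Compute generation speed from an Ollama /api/generate (or /api/chat) response.
// eval_count is the number of generated tokens; eval_duration is in nanoseconds.
function tokensPerSecond(response) {
  const { eval_count: tokens, eval_duration: nanos } = response;
  if (!Number.isFinite(tokens) || !Number.isFinite(nanos) || nanos <= 0) {
    throw new Error("response lacks eval_count / eval_duration");
  }
  return tokens / (nanos / 1e9);
}

// Hypothetical response fragment, not a measured result.
const sample = { eval_count: 450, eval_duration: 10_000_000_000 };
console.log(tokensPerSecond(sample).toFixed(1)); // 45.0
```

Averaging this value over several prompts per task gives a steadier number than a single run.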
Comparison table
| Model | Parameters | RAM | tok/s | Quality | Best for |
|---|---|---|---|---|---|
| llama3.1:8b | 8B | ~5 GB | ~45 | 4.0 | General, first deploy |
| qwen2.5:14b | 14B | ~10 GB | ~28 | 4.4 | Multilingual EN/ES, SEO |
| mistral:7b | 7B | ~4.5 GB | ~52 | 3.6 | Speed, low consumption |
| codellama:13b | 13B | ~9 GB | ~25 | 4.2 | Code generation, scripts |
| phi3:14b | 14B | ~9 GB | ~32 | 4.1 | Structured tasks, JSON |
| gemma2:9b | 9B | ~6 GB | ~40 | 4.0 | Good all-round balance |
| llama3.2:3b | 3B | ~2.5 GB | ~80 | 3.2 | Trivial tasks, max speed |
| deepseek-r1:8b | 8B | ~5.5 GB | ~38 | 4.3 | Reasoning, debugging |
Analysis by use case
General WordPress operation (recommended starting point)
llama3.1:8b is the sensible default. It generates acceptable meta-descriptions, summarizes comments correctly, and its throughput (~45 tok/s) means a typical 500-token call takes about 11 seconds (500 / 45 ≈ 11.1). It is the only model where the output is usually publishable as-is for basic SEO tasks.
If you have 16 GB of RAM and want more quality, jump to qwen2.5:14b. It is slower (~28 tok/s versus ~45), but it is especially good in Spanish, since it was trained on a larger multilingual corpus than llama.
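The per-call timing quoted in this section is plain division, and the same arithmetic works for any model in the comparison table. A minimal sketch, using the table's approximate tok/s figures and ignoring model load time and prompt processing (so real calls will take somewhat longer):

```javascript
// Approximate throughput from the comparison table (tokens/second, CPU-only VPS).
const TOK_PER_SEC = {
  "llama3.1:8b": 45,
  "qwen2.5:14b": 28,
  "llama3.2:3b": 80,
};

// Estimated generation time in seconds for a given number of output tokens.
function estimateSeconds(model, outputTokens) {
  const rate = TOK_PER_SEC[model];
  if (!rate) throw new Error(`no throughput figure for ${model}`);
  return outputTokens / rate;
}

console.log(estimateSeconds("llama3.1:8b", 500).toFixed(1)); // 11.1
console.log(estimateSeconds("qwen2.5:14b", 500).toFixed(1)); // 17.9
```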
Code generation (WP-CLI scripts, PHP snippets, SQL)
codellama:13b is the king here. It understands WordPress context (hooks, filters, SQL queries), generates code that usually runs on the first try, and respects the conventions of the code base. For tasks like “give me a WP-CLI script that exports all posts to JSON with their SEO metadata” it produces functional code in 1-2 iterations.
Structured tasks and JSON (classification, extraction, RAG)
phi3:14b shines here. Its tuning for structured instructions makes outputs like “give me a JSON with {category, priority, action}” consistently valid without extra characters. For workflows that feed Qdrant, phi3 is more reliable than llama.
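Even with a model that is reliable at structured output, it is cheap to validate the reply before passing it downstream. The sketch below checks for the three fields from the example prompt; the function name and the rule that every field must be a non-empty string are assumptions for illustration. Ollama's API also accepts `format: "json"` in the request body, which further reduces stray text.

```javascript
// Parse a model reply and check that it carries the expected string fields.
// Returns the parsed object, or null if the reply is not usable.
function parseTriage(reply) {
  let data;
  try {
    data = JSON.parse(reply);
  } catch {
    return null; // extra prose or markdown around the JSON ends up here
  }
  if (data === null || typeof data !== "object" || Array.isArray(data)) return null;
  const required = ["category", "priority", "action"];
  const ok = required.every((key) => typeof data[key] === "string" && data[key] !== "");
  return ok ? data : null;
}

console.log(parseTriage('{"category":"bug","priority":"high","action":"escalate"}'));
console.log(parseTriage('Sure! Here is the JSON: {"category":"bug"}')); // null
```

A null result is a natural trigger for a single retry or for routing the item to manual review.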
Reasoning and debugging (log analysis, diagnosis)
deepseek-r1:8b stands out. It works through a chain of thought before answering, weighing hypotheses before it proposes a solution, which makes its log analysis more accurate. If your main use will be “I have this error, what is going on?”, deepseek-r1 is your model.
Raw speed and low consumption
llama3.2:3b for trivial tasks like “classify this comment as spam/legit”. At ~80 tok/s, you can moderate 100 comments in 4 minutes. Its 2.5 GB RAM footprint makes it ideal for running in parallel with a larger model.
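For spam/legit moderation, a small model works best with a narrow prompt and a strict parser around its reply. A minimal sketch follows; the prompt wording and both function names are illustrative assumptions, not the prompts used in the tests above.

```javascript
// Build a short classification prompt (wording is an illustrative assumption).
function buildModerationPrompt(comment) {
  return [
    "Classify the following blog comment as spam or legit.",
    "Answer with a single word: spam or legit.",
    "",
    `Comment: ${comment}`,
  ].join("\n");
}

// Normalize the model's reply to "spam", "legit", or "unknown".
// Anything other than a bare label is treated as unknown on purpose,
// so that replies like "not spam" are never misread as "spam".
function parseLabel(reply) {
  const text = reply
    .trim()
    .toLowerCase()
    .replace(/^["']+/, "")
    .replace(/[.!"']+$/, "");
  return text === "spam" || text === "legit" ? text : "unknown";
}

console.log(parseLabel(" Spam.\n")); // spam
console.log(parseLabel("This is not spam")); // unknown
```

Comments labeled "unknown" can be left in the WordPress moderation queue for a human to review.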
Multi-model strategy (recommended)
In production, the best approach is not to pick a single model but to combine several, one per task type:
# /opt/d0a1/compose/ollama/.env
# Main model (quality, multilingual)
OLLAMA_MODEL_QUALITY=qwen2.5:14b
# Fast model (classification, simple JSON)
OLLAMA_MODEL_FAST=llama3.2:3b
# Code model (scripts, SQL)
OLLAMA_MODEL_CODE=codellama:13b
# Reasoning model (analysis, debug)
OLLAMA_MODEL_REASON=deepseek-r1:8b
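A script that consumes these variables might resolve them as sketched below. This assumes the variables have already been exported into the process environment (for example by Docker Compose); Node.js does not read `.env` files on its own by default. The `resolveModels` helper is hypothetical.

```javascript
// Fallbacks mirror the .env file above; used when a variable is unset or empty.
const DEFAULTS = {
  OLLAMA_MODEL_QUALITY: "qwen2.5:14b",
  OLLAMA_MODEL_FAST: "llama3.2:3b",
  OLLAMA_MODEL_CODE: "codellama:13b",
  OLLAMA_MODEL_REASON: "deepseek-r1:8b",
};

// Returns an object keyed by role: { quality, fast, code, reason }.
function resolveModels(env = process.env) {
  const models = {};
  for (const [name, fallback] of Object.entries(DEFAULTS)) {
    const role = name.replace("OLLAMA_MODEL_", "").toLowerCase();
    models[role] = env[name] || fallback;
  }
  return models;
}

console.log(resolveModels({ OLLAMA_MODEL_FAST: "mistral:7b" }).fast); // mistral:7b
```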
n8n workflows can route based on task type. Example:
// In an n8n Code node (JavaScript):
const task = $json.task_type;
// Map each task type to the model best suited for it
const modelMap = {
  'seo': 'qwen2.5:14b',
  'classify': 'llama3.2:3b',
  'code': 'codellama:13b',
  'analyze': 'deepseek-r1:8b',
  'moderate': 'llama3.2:3b',
  'generate': 'qwen2.5:14b'
};
// Fall back to the general-purpose model for unknown task types
return { model: modelMap[task] || 'llama3.1:8b' };
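Outside n8n, the same routing can feed a direct call to Ollama's `/api/generate` endpoint. The sketch below only builds the request; the base URL assumes Ollama is listening on its default port on the same host, and `buildGenerateRequest` is a hypothetical helper.

```javascript
// Same task-to-model mapping as in the n8n Code node above.
const MODEL_BY_TASK = new Map([
  ["seo", "qwen2.5:14b"],
  ["classify", "llama3.2:3b"],
  ["code", "codellama:13b"],
  ["analyze", "deepseek-r1:8b"],
  ["moderate", "llama3.2:3b"],
  ["generate", "qwen2.5:14b"],
]);
const FALLBACK_MODEL = "llama3.1:8b";

// Build the URL and fetch options for a non-streaming /api/generate call.
function buildGenerateRequest(taskType, prompt, baseUrl = "http://localhost:11434") {
  const model = MODEL_BY_TASK.get(taskType) ?? FALLBACK_MODEL;
  return {
    url: `${baseUrl}/api/generate`,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, prompt, stream: false }),
    },
  };
}

// Usage (needs a running Ollama server and Node 18+ for the global fetch):
// const { url, options } = buildGenerateRequest("classify", "Is this spam? ...");
// const { response } = await (await fetch(url, options)).json();
```

Keeping request construction separate from the network call makes the routing easy to check without a server running.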
Hardware requirements
| Setup | Concurrent models | Recommended RAM | Recommended vCPU |
|---|---|---|---|
| Minimum | 1 (one model at a time) | 8 GB | 2 |
| Standard | 2 (fast + quality) | 16 GB | 4 |
| Pro | 3-4 (with Q4 quantization) | 32 GB | 8 |
| With GPU | Any (roughly 10x faster) | 16 GB RAM + 8 GB or more VRAM | 4 |
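To check whether a given combination fits a setup from this table, the RAM figures from the comparison table can simply be added up. In the sketch below, the 3 GB reserve for the OS, WordPress, and the database is an assumed allowance, not a measured value, and `fitsInRam` is a hypothetical helper.

```javascript
// Approximate loaded size in GB, taken from the comparison table.
const MODEL_RAM_GB = {
  "llama3.1:8b": 5, "qwen2.5:14b": 10, "mistral:7b": 4.5, "codellama:13b": 9,
  "phi3:14b": 9, "gemma2:9b": 6, "llama3.2:3b": 2.5, "deepseek-r1:8b": 5.5,
};

// reserveGb is an assumed allowance for everything else running on the VPS.
function fitsInRam(models, totalGb, reserveGb = 3) {
  const needed = models.reduce((sum, name) => {
    if (!Object.hasOwn(MODEL_RAM_GB, name)) throw new Error(`unknown model: ${name}`);
    return sum + MODEL_RAM_GB[name];
  }, 0);
  return { needed, fits: needed + reserveGb <= totalGb };
}

console.log(fitsInRam(["qwen2.5:14b", "llama3.2:3b"], 16)); // { needed: 12.5, fits: true }
console.log(fitsInRam(["qwen2.5:14b", "codellama:13b"], 16)); // { needed: 19, fits: false }
```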
Quantization: how it affects things
By default Ollama downloads models with 4-bit quantization (Q4_0 or Q4_K_M, depending on the model). Compared with FP16, this reduces RAM usage by ~75% with <5% quality loss. For more quality, you can download other variants:
# Q4_0 (default, best balance)
ollama pull qwen2.5:14b
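# Q5_K_M (more quality, +30% RAM)
# Check the model's Tags page in the Ollama library for a quantization-specific tag;
# if none is published, import a GGUF file through a Modelfile
# Q8_0 (max CPU quality, +60% RAM)
# Same approach as for Q5_K_M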
# FP16 (original quality, +200% RAM, GPU only)
# Don't use on CPU
On CPU, stay with the default 4-bit quantization. The difference with Q5 is barely noticeable and the RAM cost is not worth it.
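The effect of quantization on memory follows from a back-of-the-envelope formula: parameters times bits per weight, divided by 8 to get bytes. The sketch below uses nominal bit widths; real quantized formats also store scaling factors, and a loaded model adds KV cache on top, so actual usage is higher than these figures.

```javascript
// Rough size of the weights alone, in GB (decimal):
// (billions of parameters) x (bits per weight) / 8 bits per byte.
function estimateWeightsGb(paramsBillions, bitsPerWeight) {
  return (paramsBillions * bitsPerWeight) / 8;
}

// A 14B model at three nominal precisions.
for (const [label, bits] of [["Q4", 4], ["Q8", 8], ["FP16", 16]]) {
  console.log(label, estimateWeightsGb(14, bits), "GB");
}
// Q4 7 GB
// Q8 14 GB
// FP16 28 GB
```

For a 14B model this puts FP16 weights alone well past the 16 GB of the test VPS, which is why that variant is only practical on a GPU host with enough memory.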
Conclusion
If you are only going to use one model, start with llama3.1:8b or qwen2.5:14b (the latter if you need strong EN/ES). As your usage grows, add llama3.2:3b for classification tasks and codellama:13b if you generate scripts regularly.
And remember: no local model charges you per token. The only cost is the VPS.