Local LLM spam classifier — model shootout

We tested nine local LLMs as spam classifiers on an AMD EPYC 4545P running Ollama. Here’s what we found.

After setting up the SpamAssassin + Ollama integration (see the previous post), the obvious next question was: is qwen2.5:7b actually the best choice, or did we just get lucky picking it first? So we ran a proper benchmark — 9 models, 11 test cases, same prompt, same hardware, measured accuracy and latency for each.


Test setup

Hardware: AMD EPYC 4545P 16-core, 92GB RAM, AlmaLinux 9 / cPanel
Runtime: Ollama with OLLAMA_KEEP_ALIVE=5m
Prompt: Few-shot anchored scoring prompt (0.0 = clean, 1.0 = spam)
Temperature: 0 (deterministic output)

Test cases covered: 3× ham (meeting, invoice, server alert), 4× spam (prize, phishing, Viagra, work-from-home), 2× edge cases (domain expiry, discount offer), 1× Greek-language spam with suspicious URLs, 1× tricky soft-sell that reads almost human.

The benchmark script

#!/bin/bash

OLLAMA="http://127.0.0.1:11434/api/generate"

MODELS=(
    "qwen2.5:0.5b" "qwen2.5:3b" "qwen2.5:7b"
    "gemma3:1b" "gemma3:4b" "phi4-mini"
    "llama3.2:1b" "llama3.2:3b" "mistral:7b"
)

declare -A EMAILS
EMAILS[HAM_1]="Hi John, just confirming our meeting tomorrow at 10am. Let me know if you need to reschedule. Best, Maria"
EMAILS[HAM_2]="Your invoice #4821 is attached. Payment due in 30 days. Thank you for your business."
EMAILS[HAM_3]="Server alert: disk usage on node3 reached 85%. Please review /var/log for large files."
EMAILS[SPAM_1]="CONGRATULATIONS! You have been selected to receive a FREE iPhone 15! Click here NOW to claim your prize before it expires!!!"
EMAILS[SPAM_2]="Dear valued customer, your account has been suspended. Verify your details immediately at http://secure-bank-login.xyz/verify"
EMAILS[SPAM_3]="Buy cheap Viagra, Cialis online no prescription needed. Discreet shipping worldwide. Best prices guaranteed!!!"
EMAILS[SPAM_4]="Make money from home! Earn 5000 USD per week working just 2 hours a day. No experience needed."
EMAILS[EDGE_1]="Your domain is expiring soon. Please renew at your registrar to avoid service interruption."
EMAILS[EDGE_2]="Special offer for existing customers: 20% discount on your next order. Use code SAVE20 at checkout."
EMAILS[GREEK_1]="Αγαπητέ πελάτη, κερδίσατε δώρο! Κάντε κλικ εδώ: https://nd.dikimux.help/tl-track6/ για να το παραλάβετε τώρα!"
EMAILS[TRICKY_1]="Hi there, I wanted to reach out about a great business opportunity that could help you earn extra income from home working just a few hours per week."

PROMPT='You are an expert spam scoring engine. Output ONLY a single decimal number from 0.0 to 1.0.
0.0 = definitely legitimate email / 0.5 = uncertain / 1.0 = definitely spam

Examples:
- Meeting confirmation from colleague: 0.05
- Invoice from vendor: 0.10
- Server monitoring alert: 0.05
- Prize winning notification: 0.98
- Phishing bank alert: 0.97
- Viagra advertisement: 0.99
- Work from home scheme: 0.95
- Domain expiry notice: 0.40
- Discount offer: 0.45

Output ONLY the number. No words, no explanation.

Email to score:
'

for MODEL in "${MODELS[@]}"; do
    echo "--- Model: $MODEL ---"
    for KEY in HAM_1 HAM_2 HAM_3 SPAM_1 SPAM_2 SPAM_3 SPAM_4 EDGE_1 EDGE_2 GREEK_1 TRICKY_1; do
        BODY="${EMAILS[$KEY]}"
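        # Wall-clock timestamps in milliseconds (%3N requires GNU date)
        START=$(date +%s%3N)
        # jq builds the JSON payload so quotes and newlines in the prompt are escaped safely
        RESPONSE=$(curl -s -X POST "$OLLAMA" \
            -H "Content-Type: application/json" \
            -d "$(jq -n --arg model "$MODEL" --arg prompt "${PROMPT}${BODY}" \
                '{model: $model, prompt: $prompt, stream: false, options: {temperature: 0}}'
            )" | jq -r '.response // "ERR"')
        END=$(date +%s%3N)
        ELAPSED=$(echo "scale=2; ($END - $START) / 1000" | bc)
        # Take the first decimal number in the reply. A bare "0" or "1" without a
        # decimal point will not match, and values above 1.0 (e.g. 1.05) are captured as-is.
        SCORE=$(echo "$RESPONSE" | grep -oP '[01]?\.\d+' | head -1)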
        printf "%-12s score=%-6s time=%ss\n" "$KEY" "${SCORE:-???}" "$ELAPSED"
    done
done
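
The script above only prints raw scores; turning them into the pass/fail counts in the results table is a separate step. The sketch below shows one way that step could look. The helper names (verdict, accuracy) and the cut-offs HAM_MAX and SPAM_MIN are assumptions made for illustration, not the thresholds behind the published numbers. It also assumes that edge and tricky cases are expected to land in the uncertain band between the two cut-offs.

```bash
#!/bin/bash
# Illustrative cut-offs (assumptions, not the values behind the published table)
HAM_MAX=0.30    # ham must score at or below this
SPAM_MIN=0.70   # spam must score at or above this

# verdict KEY SCORE -> prints "pass" or "fail" for one test case
verdict() {
    local key="$1" score="$2"
    local re='^[0-9]*[.]?[0-9]+$'
    # A missing or non-numeric score (e.g. "???") always fails
    [[ "$score" =~ $re ]] || { echo "fail"; return 0; }
    awk -v k="$key" -v s="$score" -v h="$HAM_MAX" -v m="$SPAM_MIN" 'BEGIN {
        if (k ~ /^HAM_/)                ok = (s <= h)
        else if (k ~ /^(SPAM|GREEK)_/)  ok = (s >= m)
        else                            ok = (s > h && s < m)   # EDGE_*, TRICKY_*
        print (ok ? "pass" : "fail")
    }'
}

# accuracy CORRECT TOTAL -> prints a percentage rounded to one decimal
accuracy() {
    awk -v c="$1" -v t="$2" 'BEGIN { printf "%.1f%%\n", 100 * c / t }'
}
```

With these assumed cut-offs, verdict SPAM_2 0.12 prints fail (a phishing mail scored as ham), and accuracy 10 11 prints 90.9%.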

Charts

[Chart: Accuracy by model]

[Chart: Accuracy vs speed tradeoff (dot size = accuracy)]

* phi4-mini outputs scores >1.0 on edge cases, disqualifying it for production use with our scoring regex.


Full results

Model           Size     HAM ×3  SPAM ×4  EDGE ×2  Greek  Tricky  Correct  Accuracy  Avg time
gemma3:4b 🏆    ~3GB     3/3     4/4      2/2      1/1    0/1     10/11    90.9%     0.91s
qwen2.5:7b      ~5GB     3/3     4/4      2/2      1/1    0/1     10/11    90.9%     1.31s
llama3.2:1b     ~1.3GB   3/3     4/4      1/2      1/1    0/1     9/11     81.8%     0.37s
llama3.2:3b     ~2.5GB   2/3     4/4      2/2      1/1    0/1     9/11     81.8%     0.64s
mistral:7b      ~5GB     3/3     4/4      1/2      1/1    0/1     9/11     81.8%     1.39s
phi4-mini *     ~3GB     2/3     4/4      0/2      1/1    0/1     7/11     63.6%     1.30s
qwen2.5:3b      ~2GB     1/3     4/4      0/2      1/1    0/1     6/11     54.5%     0.67s
qwen2.5:0.5b    ~400MB   0/3     4/4      0/2      1/1    0/1     5/11     45.5%     0.22s
gemma3:1b       ~800MB   3/3     1/4      0/2      0/1    0/1     4/11     36.4%     0.42s

Observations

gemma3:4b is the winner: same accuracy as qwen2.5:7b (10/11), about 30% lower average latency (0.91s vs 1.31s), and about 40% less RAM going by model size (~3GB vs ~5GB). If you are starting fresh, this is the model to use.

llama3.2:1b is the surprise: 81.8% accuracy at 0.37s average. On a high-volume mail server where every millisecond counts, this is the speed-optimized choice. It misses one of the two edge cases but catches all four spam samples.

phi4-mini is disqualified, not for accuracy but for reliability. It returned scores of 1.05 and 1.2 on edge cases. Our Perl plugin extracts the first float from the response; if that float is above 1.0, the regex still captures it and the score comparison logic breaks. phi4-mini does not appear to be tuned for strict numeric output tasks.
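
One way to guard against out-of-range replies is to validate the whole response instead of taking the first float. The sketch below is a stricter alternative to the extraction used in the benchmark script, not the logic of our Perl plugin. The function name parse_score is hypothetical. It accepts a reply only when it is a single number between 0.0 and 1.0, so 1.05 and 1.2 are rejected outright, as is any reply with extra words.

```bash
#!/bin/bash
# parse_score RESPONSE
# Prints the score and returns 0 if the reply is exactly one number in [0.0, 1.0].
# Prints nothing and returns 1 otherwise (out of range, extra text, or empty).
parse_score() {
    local raw="$1"
    # Strip leading and trailing whitespace, including newlines
    raw="${raw#"${raw%%[![:space:]]*}"}"
    raw="${raw%"${raw##*[![:space:]]}"}"
    # Accept 0, 0.xxx, .xxx, 1, 1.0, 1.00 and nothing else
    local re='^(0([.][0-9]+)?|[.][0-9]+|1([.]0+)?)$'
    if [[ "$raw" =~ $re ]]; then
        printf '%s\n' "$raw"
    else
        return 1
    fi
}
```

A caller can then treat a non-zero return as "no usable score" and fall back to whatever the mail pipeline does without a classifier, for example: SCORE=$(parse_score "$RESPONSE") || SCORE="".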

gemma3:1b is actively dangerous — it scores phishing emails at 0.12 and calls them ham. That is worse than not running a classifier at all.

The TRICKY_1 universal failure is interesting. Every model scored “Hi there, I wanted to reach out about a great business opportunity…” as spam (0.92–0.98). We called it an edge case. But re-reading it with fresh eyes — it is spam. The models may have better calibration than our test labels on this one.

Greek spam is a non-issue. Eight of the nine models correctly flagged the Greek-language spam with the suspicious URL; the exception is gemma3:1b, whose 4/11 total is fully accounted for by its ham and spam hits. The URL pattern (nd.dikimux.help/tl-track6/) alone appears to be enough signal, regardless of the body language.

Smaller is not always worse, but architecture matters enormously. Compare gemma3:1b (4/11, worse than random) with llama3.2:1b (81.8%): same nominal parameter count, completely different behavior. On this test set, model architecture and training data selection mattered far more than raw size below ~3B parameters.


Recommendation

Switch to gemma3:4b.

my $MODEL = 'gemma3:4b';

Keep llama3.2:1b as a second model if you want to run a fast pre-filter on high-volume servers before escalating uncertain cases to the heavier model.
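
A two-stage setup like that could be sketched as follows. This is an illustration under stated assumptions, not what our plugin does: the function names (score_with_model, first_float, cascade_score) are hypothetical, and the LOW and HIGH bounds of the "uncertain" band are untested guesses. The request shape and the score regex are taken from the benchmark script above.

```bash
#!/bin/bash
OLLAMA="http://127.0.0.1:11434/api/generate"
FAST_MODEL="llama3.2:1b"
HEAVY_MODEL="gemma3:4b"
LOW=0.20    # assumed: at or below this, trust the fast model's ham verdict
HIGH=0.80   # assumed: at or above this, trust the fast model's spam verdict

# score_with_model MODEL PROMPT -> prints the model's raw reply
score_with_model() {
    curl -s -X POST "$OLLAMA" -H "Content-Type: application/json" \
        -d "$(jq -n --arg model "$1" --arg prompt "$2" \
            '{model: $model, prompt: $prompt, stream: false, options: {temperature: 0}}')" \
        | jq -r '.response // empty'
}

# first_float TEXT -> prints the first decimal number found, or nothing
first_float() {
    grep -oE '[01]?\.[0-9]+' <<< "$1" | head -n 1 || true
}

# cascade_score PROMPT -> prints "<model used> <score>"
cascade_score() {
    local prompt="$1" score
    score="$(first_float "$(score_with_model "$FAST_MODEL" "$prompt")")"
    # Escalate when the fast model gave no usable number or landed in the uncertain band
    if [[ -z "$score" ]] || awk -v s="$score" -v lo="$LOW" -v hi="$HIGH" \
            'BEGIN { exit !(s > lo && s < hi) }'; then
        score="$(first_float "$(score_with_model "$HEAVY_MODEL" "$prompt")")"
        echo "$HEAVY_MODEL ${score:-???}"
    else
        echo "$FAST_MODEL $score"
    fi
}
```

Calling cascade_score "${PROMPT}${BODY}" with the benchmark prompt would send clear-cut mail through llama3.2:1b only and reserve gemma3:4b for the borderline cases. Whether that saves time overall depends on how much of your traffic falls inside the band, which we have not measured.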

Remove what you won’t use:

ollama rm qwen2.5:0.5b
ollama rm qwen2.5:3b
ollama rm gemma3:1b
ollama rm phi4-mini
ollama rm mistral:7b