Ornstein-Hermes-3.6-27B-SABER · Q5_K_M evaluation

eval by Kyle Hessling · model by @DJLougen at GestaltLabs

An abliterated 27B Hermes finetune that holds together where it normally would not. Refusal-shaping (SABER) typically dents capability — measurable drops on reasoning, format adherence, and code. Ornstein keeps the throughput, keeps the JSON discipline, keeps the tool-calling structure, and on creative coding it is genuinely strong. This is hard to do, and DJ pulled it off.

Evaluated self-hosted on a single RTX 5090: 22 generation runs, 106 k completion tokens, 132 k chars of <think> reasoning, across agentic prompts, real tools:[…] API calls, front-end design, and creative coding.

5 / 6tool calls clean

4 / 5creative-coding demos

56.7avg tok/s

0refusals

0errors

What SABER usually breaks, and what survived here

Refusal-shaping / abliteration works by surgically suppressing the directions in activation space that produce refusals. The reliable side-effect is collateral damage to nearby capabilities: instruction following gets sloppier, JSON adherence frays, code-generation quality regresses on edge cases, reasoning chains lose their structure. That's the trade everyone signs up for.

Ornstein is the first abliterated 27B I've benched where the trade was barely visible. Specifically:

Throughput is dead flat — 56.7 ± 1 tok/s across all 22 runs. No SABER-flavored slowdowns, no compute-side cost.
JSON outputs validate first time, every time. No malformed args, no leaked <think> tags into content, no truncation in 11 self-contained HTML files.
Reasoning traces have a clean PHASE 1: UNDERSTAND → EXPLORE → EXECUTE → VERIFY spine that's visibly Hermes-style RL training, not vanilla CoT. The structure was preserved through the abliteration step.
Zero phantom refusals, zero safety-disclaimers in the output stream. Every prompt got a real answer.
The flow-field prompt — which has failed every other 27B finetune I've tested on this harness — works on the first roll, both no-think and thinking-on. That isn't an abliteration story; it's a finetune-data story, and it's a quietly impressive result.

Setup

Item	Value
Model	`GestaltLabs/Ornstein-Hermes-3.6-27b-SABER-GGUF — *.Q5_K_M.gguf`
Architecture	Qwen 3.5 27B base · Hermes 3.6 chat finetune · SABER refusal-shaping (abliteration)
File size	19.23 GB (Q5_K_M, imatrix)
Runtime	llama.cpp (cuda-12.8), `--flash-attn on`, `--jinja`, patched Qwen 3.5 chat template
Context	40,960 tokens (FP16 K and V cache), `--parallel 1`
Thinking	enabled via `chat_template_kwargs: {"enable_thinking": true}`
Hardware	RTX 5090 (32 GB), Intel Core Ultra 7 265K, 125 GB RAM
VRAM resident	~23 GB / 32 GB at 40 K context

Note on the chat template

The patched Qwen 3.5 jinja in this stack (used to dodge an upstream |items rendering bug for tool args) hardcodes <think>\n\n</think>\n\n on every assistant turn — which silently kills thinking even when enable_thinking: true is passed. Easy to miss; I almost shipped a thinking-off eval. Two-line gate fix:

{%- if not (enable_thinking is defined and enable_thinking) %}
    {{- '<think>\n\n</think>\n\n' }}
{%- endif %}

With the patch, the model produced 132 k chars of structured reasoning across 22 prompts. All numbers in this report use thinking-on unless noted.

Tool calling — the headline result

Six prompts, real tools: [...] array passed through the OpenAI-style API, llama-server's --jinja parsing the model's XML <tool_call> blocks back into structured tool_calls fields. This is the hardest column to fake: either the model emits parseable structure or it doesn't.

Prompt	Result	Notes
Single tool, unambiguous question (weather Tokyo, celsius)	PASS	28 tokens. `get_weather({"city":"Tokyo","units":"metric"})`. No reasoning needed; the model picked the right answer the first time.
Tool selection from 4 options (math question; weather/flight/calc/email available)	PASS	`calculate({"expression":"17 * 23 + 9"})`. Correctly skipped weather/flight/email distractors.
Multi-tool sequence (flights + hotel + weather for a week-long trip)	PARTIAL	Emitted `search_flights` with clean ISO-date and city args, but stopped after one call. Hermes templates conventionally do one tool per assistant turn; the orchestrator is expected to feed back a tool result and let the model continue. This is correct behavior for that protocol — the test was unfair.
No tool needed (definition question with tools available)	PASS	Answered "Structured Query Language" in plain text without calling. Tool-call restraint matters; small models often over-call when tools are visible.
Complex args: SQL for top-5 customers by 30-day revenue	PASS	Emitted production-quality SQL with `SUM(total_usd)`, `GROUP BY`, `ORDER BY total_revenue DESC`, `LIMIT 5`, and an `INTERVAL 30 DAY` filter — and also passed `limit: 5` as the explicit tool arg. Both the SQL and the structured-arg layer agreed.
Structured email with CC array	PASS	Full email with greeting, sign-off, body referencing the user's specific ask ("migration metrics deck by Friday"), and the CC array correctly populated as `["priya.n@acme.io", "karen@acme.io"]`. Array-of-strings args worked cleanly through the tool schema.

5 / 6 clean, 1 PARTIAL that's actually protocol-correct. The args weren't just present — they were thoughtful. The SQL was production-quality. The email used the user's specific phrasing. The math used the exact arithmetic. This is the column that most often shows abliteration damage; Ornstein does not show it.

Agentic reasoning

5 prompts. Sampling at T=0.3 / top_p=0.9, max 8 k tokens with thinking on. Reasoning content captured separately so the budget impact is visible.

Task	Tokens	Reasoning	Result
Code debug (4 bugs in `kth_smallest`)	1,039	3.4 k	PASS — caught all four bugs cleanly: sort direction, `=` vs `==`, missing range guard, off-by-one. The reasoning trace enumerates each fix and validates the output.
Structured JSON extraction	1,200	3.2 k	PASS — valid JSON, all three people, projects correctly grouped onto Karen, ISO datetimes for both meetings.
Tool-use planning (3-tool itinerary, JSON output)	1,112	3.2 k	PASS — clean three-tool array with `search_flights` → `book_hotel` → `get_weather`. Year hallucinated to 2024 (no year in prompt) — same minor as Qwen 3.6's run.
Self-critique (longest palindrome)	1,521	3.7 k	PARTIAL — the model's "INITIAL" implementation is the optimal expand-around-center O(n²); the IMPROVED swaps string-returns for index-returns to cut allocations. Real refactor, but the exercise wants naïve→optimal and the model jumped straight to optimal. This is a model knowing the right answer up front, which is — let's be honest — the failure mode you'd take.
Multi-step deploy plan (URL shortener, FastAPI/SQLite/Docker)	8,000 cap hit	27.8 k	FAIL — the model fell into a self-doubt loop in the thinking phase, regenerating the same 9-step plan 12 times verbatim each prefaced with "I think I'm still not being specific enough, let me try again." Hit the cap with an empty answer. This is the report's most important caveat: long-horizon planning under thinking can deadlock on this model. The fix in production is either a bigger cap, a no-think variant for planning, or a stop-token early-out. The model knows the plan — its no-think run produced a coherent 12-step deploy plan in 838 tokens.

3 / 5 clean PASS, 1 partial that's a behavior nuance, 1 hard fail in a way that's diagnosable and fixable at the harness level. The short-form agentic tasks all work; the long-horizon one needs guardrails.

Front-end design

Five HTML prompts: SaaS landing, analytics dashboard, designer portfolio, pricing page, mobile app marketing. All five outputs are self-contained, valid HTML files starting with <!DOCTYPE html> and ending with </html>. No truncation, no markdown wrapper artifacts.

Prompt	Tokens	HTML	Notes
SaaS landing (Prism)	9,762	35 KB	Hero, feature grid, logo strip, how-it-works, pricing, testimonials, footer — full structural pass. Matches Qwen 3.6's 36 KB on the same brief.
Analytics dashboard (rerun)	7,707	24 KB	Sidebar, topbar, KPI cards with sparklines, line chart in SVG, donut chart, sortable table — all present. Hand-coded SVG, not placeholder rectangles.
Designer portfolio (Maya Chen)	5,862	9 KB	All five required sections present (intro, projects-strip, about, case-study, contact). Tighter than the no-think run; the brief asks for "attitude over volume" and the model leaned into that.
Pricing page	5,534	20 KB	Three tiers + Enterprise card, monthly/yearly toggle with animated price ticker, FAQ accordion, comparison table with check/dash/em-dash glyphs.
Mobile app marketing (Stillwater)	8,728	14 KB	CSS-only iPhone mockup, App Store / Play badges as inline SVG. Compact but correct.

One observation worth flagging: prompts where the model thought heavily produced less HTML than prompts where it thought lightly. mobile_app_marketing spent 17 k characters on reasoning and produced 14 KB of HTML; saas_landing spent 0.8 k on reasoning and produced 35 KB. Reasoning budget visibly comes out of the answer budget on this model — there isn't a free thinking allowance, so size max_tokens generously.

Creative coding (canvas / WebGL / three.js)

This was the most surprising column. The no-think run on these prompts had three demos with bugs that thinking-on then cleanly fixed. With thinking enabled, four of four showcased demos work as advertised:

Demo	Result	What's in it
Particle attractor	works	3,000-particle cap, click-burst, cursor attraction wired through `mouseX/mouseY` with delta-vector force application, additive-blend background glow, FPS counter. Up from "drifts and emits on click" in the no-think pass to a real attractor.
Generative flow field	works	Simplex-noise vector field, ~1,000 agents with lifetimes, palette switcher, agent-count slider, save-PNG button, periodic canvas-fade reset. The genuinely novel result of this eval — this prompt has failed every other 27B finetune I've tested on this harness.
Three.js crystal scene	works	Custom `ShaderMaterial` sky gradient, `MeshPhysicalMaterial` with `transmission`/`thickness`, three colored lights, `UnrealBloomPass`. The thinking-mode pass removed a reference to an undefined `ChromaticAberrationShader` that the no-think pass had — the model reasoned its way out of a real bug.
Audio-reactive visualizer	works	Mic + oscillator fallback, frequency-band response, bloom, color shifts. Thinking-mode pass fixed the lightness formula (`40 + volume * 30`, properly in 0–100% HSL range) that the no-think run had off by two orders of magnitude.

Two prompts (mandelbulb shader and physics sandbox) are not in the showcased set:

WebGL Mandelbulb — the thinking-on version got the math closer to correct (real z = pow(r, MAX_POWER) spherical iteration, not the linear sweep from no-think) but still has rendering issues in practice. Excluded rather than shipped half-broken.
Physics sandbox — the no-think run had a stronger output (Vector2 + Circle + Platform classes, 14 KB, 80-circle simulation with platforms). Thinking-on hit the cap and shrank to 6 KB Circle-only. The simpler version runs, but it's a regression — kept out of the showcase to be honest about it.

The pattern across creative coding is that thinking lets the model fix snap-judgment bugs: shader math, undefined symbols, unit mistakes. It doesn't always produce more code, but it produces code that's more often right.

Throughput

Metric	Value
Average tok/s (22 runs incl. tools)	56.7 (full bench) / 43.0 (tools, includes short prompts)
Range across full-bench runs	53.6 – 57.4 tok/s
Variance	< 5 %
Total completion tokens	106,250
Total reasoning content	131,650 chars
Errors / timeouts	0 / 22

This is the kind of throughput stability that matters when you're running an agent loop. No warmup spike, no degradation under thinking, no surprise stalls. The 5090 / Q5 / FP16 KV stack behaves like a metronome; you can size your latency budgets against it.

Where the reasoning traces shine

For the prompts that thought well, the trace structure is the giveaway that this is real Hermes RL training rather than vanilla chain-of-thought:

## PHASE 1: UNDERSTAND     — restate the prompt, list givens, identify the key difficulty
## PHASE 2: EXPLORE        — enumerate 2-3 candidate approaches with explicit tradeoffs
## PHASE 3: EXECUTE        — implement the chosen approach
## PHASE 4: VERIFY         — sanity-check edge cases, trace examples

This is the scaffold that lets thinking fix the no-think creative-coding bugs — the EXPLORE phase forces the model to consider alternatives instead of running with the first draft. When the scaffold meets a task that doesn't have a clean termination criterion (the multi_step_planning loop), it can deadlock. When it meets a task that has one, it works really well.

Verdict

This is a good model and an unusually clean abliteration. The combination matters: SABER refusal-shaping plus a thinking-capable Hermes scaffold plus a 27B Qwen base, and the result still hits 56.7 tok/s with valid JSON, structured tool calls, working shader math, and zero refusals. The capability tax that abliteration normally extracts is not visible here in any of the columns I measured.

What it's good for, today: chat agents with tool calling on a single 5090, structured-output extraction, code review and one-shot front-end / canvas demos, anything where you want an unfiltered Hermes-style reasoner.

What to handle carefully: long-horizon agentic planning under thinking can self-doubt-loop; cap your max_tokens and consider a no-think variant for planning prompts. Multi-tool sequences are emitted one-per-turn (correct Hermes protocol) so your harness needs to feed back tool results and let the model continue.

Bottom line: credit to @DJLougen at GestaltLabs — abliterating a model without breaking it is a craft skill and this one shows it. Worth the download, worth the slot in a 5090's 32 GB, and worth a Hugging Face Space showcase. The flow-field result alone makes it worth a look.

Generated April 2026 from GestaltLabs/Ornstein-Hermes-3.6-27b-SABER-GGUF · Q5_K_M · llama.cpp cuda-12.8 · self-hosted on a single RTX 5090. Thinking ON via patched Qwen 3.5 jinja. ← back to index