eval by Kyle Hessling · model by @DJLougen at GestaltLabs
An abliterated 27B Hermes finetune that holds together where it normally would not. Refusal-shaping (SABER) typically dents capability — measurable drops on reasoning, format adherence, and code. Ornstein keeps the throughput, keeps the JSON discipline, keeps the tool-calling structure, and on creative coding it is genuinely strong. This is hard to do, and DJ pulled it off.
Evaluated self-hosted on a single RTX 5090: 22 generation runs, 106 k completion tokens, 132 k chars of <think> reasoning, across agentic prompts, real tools:[…] API calls, front-end design, and creative coding.
Refusal-shaping / abliteration works by surgically suppressing the directions in activation space that produce refusals. The reliable side-effect is collateral damage to nearby capabilities: instruction following gets sloppier, JSON adherence frays, code-generation quality regresses on edge cases, reasoning chains lose their structure. That's the trade everyone signs up for.
Ornstein is the first abliterated 27B I've benched where the trade was barely visible. Specifically:
<think> tags into content, no truncation in 11 self-contained HTML files.PHASE 1: UNDERSTAND → EXPLORE → EXECUTE → VERIFY spine that's visibly Hermes-style RL training, not vanilla CoT. The structure was preserved through the abliteration step.| Item | Value |
|---|---|
| Model | GestaltLabs/Ornstein-Hermes-3.6-27b-SABER-GGUF — *.Q5_K_M.gguf |
| Architecture | Qwen 3.5 27B base · Hermes 3.6 chat finetune · SABER refusal-shaping (abliteration) |
| File size | 19.23 GB (Q5_K_M, imatrix) |
| Runtime | llama.cpp (cuda-12.8), --flash-attn on, --jinja, patched Qwen 3.5 chat template |
| Context | 40,960 tokens (FP16 K and V cache), --parallel 1 |
| Thinking | enabled via chat_template_kwargs: {"enable_thinking": true} |
| Hardware | RTX 5090 (32 GB), Intel Core Ultra 7 265K, 125 GB RAM |
| VRAM resident | ~23 GB / 32 GB at 40 K context |
The patched Qwen 3.5 jinja in this stack (used to dodge an upstream |items rendering bug for tool args) hardcodes <think>\n\n</think>\n\n on every assistant turn — which silently kills thinking even when enable_thinking: true is passed. Easy to miss; I almost shipped a thinking-off eval. Two-line gate fix:
{%- if not (enable_thinking is defined and enable_thinking) %}
{{- '<think>\n\n</think>\n\n' }}
{%- endif %}
With the patch, the model produced 132 k chars of structured reasoning across 22 prompts. All numbers in this report use thinking-on unless noted.
Six prompts, real tools: [...] array passed through the OpenAI-style API, llama-server's --jinja parsing the model's XML <tool_call> blocks back into structured tool_calls fields. This is the hardest column to fake: either the model emits parseable structure or it doesn't.
| Prompt | Result | Notes |
|---|---|---|
| Single tool, unambiguous question (weather Tokyo, celsius) | PASS | 28 tokens. get_weather({"city":"Tokyo","units":"metric"}). No reasoning needed; the model picked the right answer the first time. |
| Tool selection from 4 options (math question; weather/flight/calc/email available) | PASS | calculate({"expression":"17 * 23 + 9"}). Correctly skipped weather/flight/email distractors. |
| Multi-tool sequence (flights + hotel + weather for a week-long trip) | PARTIAL | Emitted search_flights with clean ISO-date and city args, but stopped after one call. Hermes templates conventionally do one tool per assistant turn; the orchestrator is expected to feed back a tool result and let the model continue. This is correct behavior for that protocol — the test was unfair. |
| No tool needed (definition question with tools available) | PASS | Answered "Structured Query Language" in plain text without calling. Tool-call restraint matters; small models often over-call when tools are visible. |
| Complex args: SQL for top-5 customers by 30-day revenue | PASS | Emitted production-quality SQL with SUM(total_usd), GROUP BY, ORDER BY total_revenue DESC, LIMIT 5, and an INTERVAL 30 DAY filter — and also passed limit: 5 as the explicit tool arg. Both the SQL and the structured-arg layer agreed. |
| Structured email with CC array | PASS | Full email with greeting, sign-off, body referencing the user's specific ask ("migration metrics deck by Friday"), and the CC array correctly populated as ["priya.n@acme.io", "karen@acme.io"]. Array-of-strings args worked cleanly through the tool schema. |
5 / 6 clean, 1 PARTIAL that's actually protocol-correct. The args weren't just present — they were thoughtful. The SQL was production-quality. The email used the user's specific phrasing. The math used the exact arithmetic. This is the column that most often shows abliteration damage; Ornstein does not show it.
5 prompts. Sampling at T=0.3 / top_p=0.9, max 8 k tokens with thinking on. Reasoning content captured separately so the budget impact is visible.
| Task | Tokens | Reasoning | Result |
|---|---|---|---|
Code debug (4 bugs in kth_smallest) | 1,039 | 3.4 k | PASS — caught all four bugs cleanly: sort direction, = vs ==, missing range guard, off-by-one. The reasoning trace enumerates each fix and validates the output. |
| Structured JSON extraction | 1,200 | 3.2 k | PASS — valid JSON, all three people, projects correctly grouped onto Karen, ISO datetimes for both meetings. |
| Tool-use planning (3-tool itinerary, JSON output) | 1,112 | 3.2 k | PASS — clean three-tool array with search_flights → book_hotel → get_weather. Year hallucinated to 2024 (no year in prompt) — same minor as Qwen 3.6's run. |
| Self-critique (longest palindrome) | 1,521 | 3.7 k | PARTIAL — the model's "INITIAL" implementation is the optimal expand-around-center O(n²); the IMPROVED swaps string-returns for index-returns to cut allocations. Real refactor, but the exercise wants naïve→optimal and the model jumped straight to optimal. This is a model knowing the right answer up front, which is — let's be honest — the failure mode you'd take. |
| Multi-step deploy plan (URL shortener, FastAPI/SQLite/Docker) | 8,000 cap hit | 27.8 k | FAIL — the model fell into a self-doubt loop in the thinking phase, regenerating the same 9-step plan 12 times verbatim each prefaced with "I think I'm still not being specific enough, let me try again." Hit the cap with an empty answer. This is the report's most important caveat: long-horizon planning under thinking can deadlock on this model. The fix in production is either a bigger cap, a no-think variant for planning, or a stop-token early-out. The model knows the plan — its no-think run produced a coherent 12-step deploy plan in 838 tokens. |
3 / 5 clean PASS, 1 partial that's a behavior nuance, 1 hard fail in a way that's diagnosable and fixable at the harness level. The short-form agentic tasks all work; the long-horizon one needs guardrails.
Five HTML prompts: SaaS landing, analytics dashboard, designer portfolio, pricing page, mobile app marketing. All five outputs are self-contained, valid HTML files starting with <!DOCTYPE html> and ending with </html>. No truncation, no markdown wrapper artifacts.
| Prompt | Tokens | HTML | Notes |
|---|---|---|---|
| SaaS landing (Prism) | 9,762 | 35 KB | Hero, feature grid, logo strip, how-it-works, pricing, testimonials, footer — full structural pass. Matches Qwen 3.6's 36 KB on the same brief. |
| Analytics dashboard (rerun) | 7,707 | 24 KB | Sidebar, topbar, KPI cards with sparklines, line chart in SVG, donut chart, sortable table — all present. Hand-coded SVG, not placeholder rectangles. |
| Designer portfolio (Maya Chen) | 5,862 | 9 KB | All five required sections present (intro, projects-strip, about, case-study, contact). Tighter than the no-think run; the brief asks for "attitude over volume" and the model leaned into that. |
| Pricing page | 5,534 | 20 KB | Three tiers + Enterprise card, monthly/yearly toggle with animated price ticker, FAQ accordion, comparison table with check/dash/em-dash glyphs. |
| Mobile app marketing (Stillwater) | 8,728 | 14 KB | CSS-only iPhone mockup, App Store / Play badges as inline SVG. Compact but correct. |
One observation worth flagging: prompts where the model thought heavily produced less HTML than prompts where it thought lightly. mobile_app_marketing spent 17 k characters on reasoning and produced 14 KB of HTML; saas_landing spent 0.8 k on reasoning and produced 35 KB. Reasoning budget visibly comes out of the answer budget on this model — there isn't a free thinking allowance, so size max_tokens generously.
This was the most surprising column. The no-think run on these prompts had three demos with bugs that thinking-on then cleanly fixed. With thinking enabled, four of four showcased demos work as advertised:
| Demo | Result | What's in it |
|---|---|---|
| Particle attractor | works | 3,000-particle cap, click-burst, cursor attraction wired through mouseX/mouseY with delta-vector force application, additive-blend background glow, FPS counter. Up from "drifts and emits on click" in the no-think pass to a real attractor. |
| Generative flow field | works | Simplex-noise vector field, ~1,000 agents with lifetimes, palette switcher, agent-count slider, save-PNG button, periodic canvas-fade reset. The genuinely novel result of this eval — this prompt has failed every other 27B finetune I've tested on this harness. |
| Three.js crystal scene | works | Custom ShaderMaterial sky gradient, MeshPhysicalMaterial with transmission/thickness, three colored lights, UnrealBloomPass. The thinking-mode pass removed a reference to an undefined ChromaticAberrationShader that the no-think pass had — the model reasoned its way out of a real bug. |
| Audio-reactive visualizer | works | Mic + oscillator fallback, frequency-band response, bloom, color shifts. Thinking-mode pass fixed the lightness formula (40 + volume * 30, properly in 0–100% HSL range) that the no-think run had off by two orders of magnitude. |
Two prompts (mandelbulb shader and physics sandbox) are not in the showcased set:
z = pow(r, MAX_POWER) spherical iteration, not the linear sweep from no-think) but still has rendering issues in practice. Excluded rather than shipped half-broken.The pattern across creative coding is that thinking lets the model fix snap-judgment bugs: shader math, undefined symbols, unit mistakes. It doesn't always produce more code, but it produces code that's more often right.
| Metric | Value |
|---|---|
| Average tok/s (22 runs incl. tools) | 56.7 (full bench) / 43.0 (tools, includes short prompts) |
| Range across full-bench runs | 53.6 – 57.4 tok/s |
| Variance | < 5 % |
| Total completion tokens | 106,250 |
| Total reasoning content | 131,650 chars |
| Errors / timeouts | 0 / 22 |
This is the kind of throughput stability that matters when you're running an agent loop. No warmup spike, no degradation under thinking, no surprise stalls. The 5090 / Q5 / FP16 KV stack behaves like a metronome; you can size your latency budgets against it.
For the prompts that thought well, the trace structure is the giveaway that this is real Hermes RL training rather than vanilla chain-of-thought:
## PHASE 1: UNDERSTAND — restate the prompt, list givens, identify the key difficulty ## PHASE 2: EXPLORE — enumerate 2-3 candidate approaches with explicit tradeoffs ## PHASE 3: EXECUTE — implement the chosen approach ## PHASE 4: VERIFY — sanity-check edge cases, trace examples
This is the scaffold that lets thinking fix the no-think creative-coding bugs — the EXPLORE phase forces the model to consider alternatives instead of running with the first draft. When the scaffold meets a task that doesn't have a clean termination criterion (the multi_step_planning loop), it can deadlock. When it meets a task that has one, it works really well.
This is a good model and an unusually clean abliteration. The combination matters: SABER refusal-shaping plus a thinking-capable Hermes scaffold plus a 27B Qwen base, and the result still hits 56.7 tok/s with valid JSON, structured tool calls, working shader math, and zero refusals. The capability tax that abliteration normally extracts is not visible here in any of the columns I measured.
What it's good for, today: chat agents with tool calling on a single 5090, structured-output extraction, code review and one-shot front-end / canvas demos, anything where you want an unfiltered Hermes-style reasoner.
What to handle carefully: long-horizon agentic planning under thinking can self-doubt-loop; cap your max_tokens and consider a no-think variant for planning prompts. Multi-tool sequences are emitted one-per-turn (correct Hermes protocol) so your harness needs to feed back tool results and let the model continue.
Bottom line: credit to @DJLougen at GestaltLabs — abliterating a model without breaking it is a craft skill and this one shows it. Worth the download, worth the slot in a 5090's 32 GB, and worth a Hugging Face Space showcase. The flow-field result alone makes it worth a look.
Generated April 2026 from GestaltLabs/Ornstein-Hermes-3.6-27b-SABER-GGUF · Q5_K_M · llama.cpp cuda-12.8 · self-hosted on a single RTX 5090. Thinking ON via patched Qwen 3.5 jinja. ← back to index