SignalLatch fine-tune record

How the ckpt386 s0.10 release was made, tested, and selected.

This page is the self-contained public record for the SignalLatch behavior fine-tune on top of Qwen3.6 AEON RYS 15/20. It focuses on the fine-tuning work only: the training aim, merge path, test harnesses, strength sweeps, repeat checks, practical canvas task, caveats, and final release decision.

ckpt386: Final one-epoch behavioral LoRA checkpoint used for the merge.
s0.10: Selected merge strength after deploy-format IQ4_NL sweeps and repeats.
4/5: First deploy-format s0.10 practical matrix pass count, mean 0.950.
9/15: Strict three-run stability count for s0.10; 9/14 crash-adjusted.

The Short Claim

SignalLatch is a small behavior fine-tune merged into the already released Qwen3.6 AEON RYS 15/20 model line. The selected public artifact is a merged IQ4_NL GGUF file, not a live LoRA adapter.

Most accurate one-sentence read: on our practical Q4_NL coding-agent matrix, the selected ckpt386 s0.10 merged GGUF improved the previous AEON RYS Q4_NL baseline from 1/5, mean 0.550, to 4/5, mean 0.950 on the first deploy-format run, and was the most defensible upload default after the repeat runs we performed.

What this supports

A narrow claim: ckpt386 at strength 0.10 improved practical coding-agent behavior in the tested merged IQ4_NL deployment path.

What this does not prove

It does not prove a universal benchmark win, a solved coding agent, a stock llama.cpp target, or that live LoRA serving is the recommended path.

Why we fine-tuned it

The base AEON RYS 15/20 Q4_NL release was already a practical small-form-factor model. The fine-tune goal was narrower: improve coding-agent behavior on repo-shaped tasks where the model has to review context, apply a targeted patch, respect tool-shaped instructions, and finish cleanly instead of drifting into stalled or over-broad work.

The training target was a behavior loop, not a new knowledge domain. The name SignalLatch refers to the behavior we wanted to promote: review the available signal, align to the actual goal and constraints, latch onto concrete tool/command evidence, repair the specific issue, and confirm through validation.

Behavior Loop

Review: Read local context before proposing broad fixes.
Align: Keep the patch scoped to the user goal and repo constraints.
Latch: Wait for concrete signals from files, tools, logs, and tests.
Repair: Change the smallest useful surface based on evidence.
Confirm: Validate the outcome and report caveats clearly.

[Figure: Process arc from AEON RYS base through behavior fine-tune, GGUF testing, repeats, and s0.10 selection.]

The release decision was a funnel. Early tests showed whether the LoRA had useful behavior; only the final merged IQ4_NL repeats answered the deployment question.

What exactly was trained

The adapter was trained against the AEON RYS 15/20 HF-format base. The resulting release is not another RYS layer surgery pass; it is a behavior LoRA merged into the existing AEON RYS 15/20 model and then exported to the practical GGUF target.

| Item | Value | Why it matters |
| --- | --- | --- |
| Base model line | Qwen3.6-27B-AEON-RYS-15-20 | The non-finetuned RYS model this behavior merge was built from. |
| Upstream source line | AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored | The AEON source family used before the RYS 15/20 base was made. |
| Public base release | AEON RYS 15/20 GGUF | Existing small-form-factor Q4_NL deployment target. |
| Final checkpoint | checkpoint-386 | Final checkpoint from the completed one-epoch run. |
| Training completion | global_step=386, epoch=1.0, max_steps=386 | The adapter was not an interrupted midpoint chosen by accident. |
| PEFT type | LORA | Small adapter merged into the base before release. |
| Rank / alpha / dropout | r=8, alpha=32, dropout=0.05 | Low-rank behavior adapter rather than full model retraining. |
| Bias / task type | bias=none, CAUSAL_LM | Standard causal-language-model LoRA setup. |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, out_proj, in_proj_qkv, in_proj_a, in_proj_b, in_proj_z | Covers the relevant attention, MLP, and Qwen3.6 hybrid projection surfaces used by this model line. |
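
For concreteness, the recorded adapter settings correspond to a PEFT configuration along these lines. This is a reconstruction from the table above, not the private training script:

```python
from peft import LoraConfig

# Reconstruction of the recorded adapter settings; not the private training script.
lora_config = LoraConfig(
    r=8,                 # low-rank dimension
    lora_alpha=32,       # LoRA scaling numerator: effective scale alpha/r = 4
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",    # attention projections
        "gate_proj", "up_proj", "down_proj",       # MLP projections
        "out_proj", "in_proj_qkv",                 # Qwen3.6 hybrid projection surfaces
        "in_proj_a", "in_proj_b", "in_proj_z",
    ],
)
```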

Training data shape

The raw training dataset is not published with this release, so this page records the shape and supervision method rather than asking readers to trust a private file path.

| Data fact | Value | Interpretation |
| --- | --- | --- |
| Internal dataset filename | qwen36_behavioral_ms_swift_train.jsonl | OpenAI-style message rows normalized for local training. |
| Rows | 10,800 | Behavior-tuning scale, not broad corpus scale. |
| File size | 17,854,170 bytes | About 18 MB on disk. |
| Role counts | system=10800, user=10800, assistant=34717, tool=23917 | Data emphasizes assistant/tool-loop behavior. |
| Message-count distribution | 3-message=3200, 7-message=2149, 9-message=2185, 11-message=3266 | Mix of simple one-turn rows and multi-step tool-loop rows. |
| Rows without tool messages | 3200 | Not every row was tool-using; some were direct behavior examples. |
| Tokenized length stats | kept=10800, dropped=0, mean 323.68, std 98.60, min 113, max 602, max length 640 | All rows fit the training length budget. |
| Preprocessing | OpenAI-style messages normalized to messages JSONL; tool_response became tool; tool outputs were JSON-wrapped; extra metadata was stripped. | Kept the training signal centered on conversation/tool behavior. |
| Supervision | Only assistant tokens were trained; system, user, and tool tokens were masked. | The adapter learned assistant behavior, not to imitate tool output. |
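
The assistant-only supervision in the last row is the standard labels trick: non-assistant tokens get the ignore index so they contribute nothing to the loss. A minimal sketch of that idea, with a hypothetical helper name rather than the actual preprocessing code:

```python
IGNORE_INDEX = -100  # PyTorch cross-entropy skips positions labeled -100

def mask_non_assistant(token_ids, roles):
    """Hypothetical helper: keep labels only where the token came from the assistant."""
    return [
        tok if role == "assistant" else IGNORE_INDEX
        for tok, role in zip(token_ids, roles)
    ]

# System, user, and tool tokens are masked; only assistant tokens are trained.
labels = mask_non_assistant(
    [101, 102, 103, 104, 105],
    ["system", "user", "assistant", "assistant", "tool"],
)
assert labels == [-100, -100, 103, 104, -100]
```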

Training method

| Method fact | Value |
| --- | --- |
| Trainer | Local Hugging Face Transformers + PEFT training script over MS-Swift-style messages data; not a stock MS-Swift CLI run. |
| Precision | BF16 |
| Trainable parameters | 62,880,000 / 28,853,208,480, about 0.2179% |
| Learning rate / schedule | 5e-5, cosine schedule, warmup 0 |
| Optimizer / regularization | adamw_torch_fused, weight decay 0 |
| Batch shape | 7 GPUs, per-device batch 2, gradient accumulation 2, effective update batch 28 |
| Training stack | Torch 2.11.0+cu128, CUDA 12.8, Transformers 5.6.2, PEFT 0.19.1, DeepSpeed 0.18.9, datasets 3.6.0 |
| Hardware used | Six RTX 5060 Ti GPUs plus one RTX 5090. |
| Checkpoint selection | No in-training eval split selected the release. Downstream practical evals selected checkpoint 386 at merge strength 0.10. |
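
As a rough translation of the table, the recorded run settings map onto Hugging Face Transformers training arguments as in the sketch below. The output path is a placeholder, and the real local script is not published:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="signallatch-ckpt",       # placeholder path
    bf16=True,                           # BF16 precision
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_steps=0,
    optim="adamw_torch_fused",
    weight_decay=0.0,
    per_device_train_batch_size=2,       # x 7 GPUs x 2 accumulation = 28 per update
    gradient_accumulation_steps=2,
    num_train_epochs=1,                  # completed at global_step=386
)
```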

The important public point is that the adapter was small, behavior-focused, trained on the AEON RYS 15/20 base, and selected only after testing the merged Q4_NL deployment format. This page does not claim the private training data is released.

Why merged GGUF, not live LoRA?

The final serving target uses the custom AEON ik-llama fork with graph split and flash attention. In this setup, live/native LoRA serving was not the stable deployment path we wanted to publish. The long-term release path became: merge the adapter into the model first, then export and quantize the merged model into the final GGUF file.
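
Mechanically, merging at a strength below 1.0 means folding only a fraction of the LoRA delta into each targeted base weight before export. A minimal single-matrix sketch of that arithmetic, assuming standard LoRA layout; this is not the actual merge tooling:

```python
import torch

@torch.no_grad()
def merge_lora_scaled(base_weight, lora_A, lora_B, alpha, r, strength):
    """Fold a fraction of the LoRA delta into one base weight matrix.

    base_weight: (out_features, in_features)
    lora_A:      (r, in_features)
    lora_B:      (out_features, r)
    delta = (alpha / r) * B @ A, then scaled by the merge strength (e.g. 0.10).
    """
    delta = (alpha / r) * (lora_B @ lora_A)
    base_weight.add_(strength * delta)
    return base_weight

# Shapes are illustrative; r=8 and alpha=32 match the recorded adapter config.
W = torch.randn(64, 128)
A = torch.randn(8, 128)
B = torch.randn(64, 8)
W = merge_lora_scaled(W, A, B, alpha=32, r=8, strength=0.10)
```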

Scouting path

Direct adapter and BF16 checks helped us learn whether checkpoints and strengths had useful behavior. Those results were useful, but not the final deployment evidence.

Release path

The public artifact is a full merged IQ4_NL GGUF served through the custom runtime. That is the path used for the final selection matrix.

[Figure: Testing ladder from behavior probes to direct adapter checks, merged GGUF checks, Q4_NL sweep, and repeat validation.]

The testing ladder matters because the early negative live-adapter results did not match the later merged-GGUF behavior.

Artifact lineage

AEON RYS 15/20 HF base
  -> ckpt386 behavior LoRA
  -> merge at scale 0.10 in BF16
  -> BF16 GGUF
  -> IQ4_NL GGUF release file

| Public artifact | Size | SHA256 | Role |
| --- | --- | --- | --- |
| Qwen3.6-27B-AEON-RYS-SignalLatch-ckpt386-s010-IQ4_NL.gguf | 16,554,833,600 bytes | d70ac4931efb496511f15242381ce241435f207f48b71d0c9b7ac756407c7ef8 | Main deployment artifact. |
| Qwen3.6-27B-AEON-RYS-SignalLatch-ckpt386-s010-BF16.gguf | 57,597,296,000 bytes | 2a14f7173979509b5075fabc31b18eacd693d2c17fdec5db8fae00f758353992 | Source-quality exploration artifact. |

Artifact naming note: some internal notes used an -imatrix suffix for the selected IQ4_NL export. The public Hugging Face filename drops that internal suffix. The released Qwen3.6-27B-AEON-RYS-SignalLatch-ckpt386-s010-IQ4_NL.gguf is the selected imatrix-assisted IQ4_NL export, not a separate non-imatrix artifact.
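
Anyone auditing a download can recompute the digest and compare it to the published value; a small verification sketch (the local path is a placeholder):

```python
import hashlib

EXPECTED_SHA256 = "d70ac4931efb496511f15242381ce241435f207f48b71d0c9b7ac756407c7ef8"

def sha256_of(path, chunk_size=1 << 20):
    """Stream the file in 1 MiB chunks so large GGUFs do not need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Placeholder path: point this at your downloaded copy.
digest = sha256_of("Qwen3.6-27B-AEON-RYS-SignalLatch-ckpt386-s010-IQ4_NL.gguf")
assert digest == EXPECTED_SHA256, "download does not match the published release hash"
```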

Testing Journey

The testing was not a single clean benchmark. It was an engineering selection process. We started with small behavior checks and practical app-building probes, then moved toward the exact merged Q4_NL format we intended to release.

Early behavior probes

In the first response-style checkpoint eval, the raw checkpoints did not beat the base. This looked discouraging, but it was a scouting lane, not the final merged-GGUF serving path.

| Candidate | Mean | Read |
| --- | --- | --- |
| base | 0.6333 | Strongest in that early probe format. |
| ckpt245 | 0.1167 | Weak. |
| ckpt280 | 0.1667 | Weak. |
| ckpt300 | 0.1250 | Weak. |
| ckpt350 | 0.1667 | Weak. |
| ckpt385 | 0.2083 | Weak. |
| ckpt386 | 0.2417 | Weak, but best checkpoint in that group. |

Early strength sweep

The next useful pattern was clear: full-strength LoRA was too aggressive. Lower strengths produced more useful behavior and longer, more complete outputs.

| Candidate | Mean | Min | Avg output tokens | Read |
| --- | --- | --- | --- | --- |
| ckpt350 s0.25 | 0.7000 | 0.5000 | 406.8 | Best early behavior sweep point. |
| ckpt386 s0.25 | 0.6375 | 0.2500 | 406.8 | Close second. |
| ckpt350 s0.50 | 0.4750 | 0.2500 | 132.5 | Weaker. |
| ckpt386 s0.50 | 0.4750 | 0.2500 | 134.8 | Weaker. |
| ckpt350 s0.75 | 0.1750 | 0.0000 | 87.8 | Too strong. |
| ckpt386 s0.75 | 0.1750 | 0.0000 | 72.5 | Too strong. |
| ckpt350 s1.00 | 0.1250 | 0.0000 | 74.2 | Too strong. |
| ckpt386 s1.00 | 0.1750 | 0.0000 | 74.5 | Too strong. |

First exhaustive canvas matrix

The first practical canvas matrix tested all 36 combinations across checkpoints 210, 245, 250, 280, 300, 315, 350, 385, 386 and strengths 0.25, 0.50, 0.75, 1.00. No variant passed in that path. The distribution was:

| Result group | Count | Read |
| --- | --- | --- |
| Score 0.625, timeout after 480s | 15 | Partial first-file scaffolding; index.html existed, but no full app. |
| Score 0.0417 | 21 | Eight early false-success empty workspaces plus thirteen timeouts with no useful deliverables. |
| Runs with styles.css or app.js | 0 | The direct path was not producing complete apps. |

The conclusion was not "the LoRA cannot work." The better conclusion was that the direct adapter/runtime path was not stable enough to judge the deployment artifact.

Merged BF16 GGUF canvas check

After merging the finalists into full GGUF models, the same practical canvas task changed the picture.

| Variant | Format | Verifier | Read |
| --- | --- | --- | --- |
| ckpt350 s0.25 | merged BF16 GGUF | 0.9167, pass | Passed with one minor layer-model heuristic miss. |
| ckpt386 s0.25 | merged BF16 GGUF | 1.0000, pass | Full practical canvas pass. |

That result corrected the earlier negative read: the LoRA was useful when merged and served through the GGUF path.

The Deploy-Format Q4_NL Matrix

The final selection needed to answer one question: what should we actually upload and recommend? For that, we tested checkpoint 386 as merged IQ4_NL GGUF files across smaller strengths.

| Merge strength | Public interpretation |
| --- | --- |
| s0.05 | Very light behavior merge, still weak in first deploy sweep. |
| s0.075 | Strong first run but missed web-race and kill-excess patterns. |
| s0.10 | Selected default after repeat checks. |
| s0.125 | Weaker first sweep than nearby candidates. |
| s0.15 | Decent, but not the most defensible default. |
| s0.20 | Perfect first run, degraded across repeats. |
| s0.25 | Perfect first run, collapsed on repeat. |

Each tested IQ4_NL file was approximately 16,554,833,600 bytes. The exact release file was renamed for public clarity as:

Qwen3.6-27B-AEON-RYS-SignalLatch-ckpt386-s010-IQ4_NL.gguf

Five practical code-agent patch tasks

The matrix used small but concrete repo-editing tasks with verifier checks. These were not broad public benchmarks; they were production-style checks for tool discipline, targeted fixes, and patch completion.

| Task | What it tested |
| --- | --- |
| github_mcp_commits_fix_repeat | Branch handling, schema update, request parameter use, docs mention, build. |
| github_mcp_pr_details_fix | Using the PR detail endpoint instead of list fields for additions, deletions, and changed files. |
| local_search_kill_excess_fix | Targeted process cleanup instead of broad kill behavior. |
| local_search_search_timeout_fix | Carrying timeout through schema, handler, and backend call. |
| local_search_web_search_race_fix | First-success race behavior instead of waiting for all engines. |

Scoring method

Each task was scored by its verifier on a 0 to 1 scale, and a task counted as a pass only at a full 1.00; partial scores such as 0.75 or 0.25 counted as fails. A variant's mean is the average verifier score across the five tasks, so the selected s0.10 first run of 4/5 at mean 0.950 reflects four full passes plus one 0.75 partial.
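In code form, the roll-up is a simple reduction; the function name is illustrative, not harness code:

```python
def matrix_rollup(task_scores):
    """Combine per-task verifier scores in [0, 1] into a pass count and mean.

    Illustrative only: a task passes solely on a full 1.0 verifier score.
    """
    passes = sum(1 for score in task_scores if score == 1.0)
    mean = sum(task_scores) / len(task_scores)
    return f"{passes}/{len(task_scores)}", round(mean, 3)

# Reproduces the headline numbers from the tables below:
assert matrix_rollup([1.0, 1.0, 1.0, 0.75, 1.0]) == ("4/5", 0.95)    # ckpt386 s0.10
assert matrix_rollup([0.75, 1.0, 0.25, 0.25, 0.5]) == ("1/5", 0.55)  # base Q4_NL
```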

Strength Selection

The first deploy-format sweep proved the LoRA helped, but it also showed why a one-run perfect score was not enough.

| Candidate | Pass | Mean | Elapsed | Timeout-ish | Read |
| --- | --- | --- | --- | --- | --- |
| base Q4_NL | 1/5 | 0.550 | 2660s | 4 | Weak baseline for this matrix. |
| ckpt386 s0.05 | 2/5 | 0.600 | 2418s | 3 | Still weak. |
| ckpt386 s0.075 | 3/5 | 0.900 | 1638s | 0 | Strong but missed web-race and kill-excess patterns. |
| ckpt386 s0.10 | 4/5 | 0.950 | 1106s | 0 | Best stable-looking first run. |
| ckpt386 s0.125 | 2/5 | 0.775 | 1190s | 0 | Weaker. |
| ckpt386 s0.15 | 3/5 | 0.875 | 1186s | 0 | Decent, not best. |
| ckpt386 s0.20 | 5/5 | 1.000 | 1568s | 1 | Perfect score, but PR-details hit full timeout. |
| ckpt386 s0.25 first | 5/5 | 1.000 | 1328s | 0 | Perfect first score. |
| ckpt386 s0.25 repeat | 1/5 | 0.750 | 1343s | 0 | Did not reproduce. |

[Figure: Scoreboard showing why s0.10 won the release slot despite stronger first-run scores from s0.20 and s0.25.]

The selected default was not the flashiest first row. It was the strength with the best balance of improvement, completion discipline, and repeat evidence.

Task-level base versus selected s0.10

| Task | Base AEON RYS Q4_NL | ckpt386 s0.10 IQ4_NL | Read |
| --- | --- | --- | --- |
| commits branch fix | 0.75, fail, 260s | 1.00, pass, 311s | Fixed branch/schema/request behavior. |
| PR details fix | 1.00, pass, 600s timeout | 1.00, pass, 271s | Both passed, but s0.10 completed much cleaner. |
| kill excess process fix | 0.25, fail, 600s timeout | 1.00, pass, 126s | Large improvement. |
| search timeout fix | 0.25, fail, 600s timeout | 0.75, fail, 256s | Partial improvement; handler still missed passthrough. |
| web search race fix | 0.50, fail, 600s timeout | 1.00, pass, 142s | Large improvement. |

[Figure: Task-level base versus s0.10 scores across commits, PR details, kill excess, timeout, and web race tasks.]

The practical improvement was not just score movement; the selected strength reduced full-timeout behavior and completed more targeted patches.

Repeat stability

After s0.20 and s0.25 produced perfect first runs, the finalists were repeated. That is where the selection changed.

| Candidate | Runs | Strict pass | Strict mean | Decision read |
| --- | --- | --- | --- | --- |
| ckpt386 s0.10 | 3 | 9/15 | 0.842 | Best default candidate after repeats. |
| ckpt386 s0.10 crash-adjusted | 3 | 9/14 | 0.884 | Excludes one invalid runtime/server-crash task. |
| ckpt386 s0.20 | 3 | 8/15 | 0.850 | Degraded after its perfect first run. |
| ckpt386 s0.25 | 2 | 6/10 | 0.875 | First 5/5 collapsed to 1/5 on repeat. |

[Figure: Heatmap of repeat behavior across base, s0.10, s0.20, and s0.25 runs.]

The repeat heatmap is the reason the release default is s0.10 rather than the stronger first-run settings.

Practical Canvas-Agent Test

We also tested the release candidate on a larger practical app task: build an isolated Krita-like raster canvas application with layers, brush and eraser, transforms, opacity, and a local-only AI image generation stub. The AI hook did not need to call a real Sloane service; it was a practical agent-completion test.

| Shared harness setting | Value |
| --- | --- |
| Endpoint | custom ik-llama llama-server through an OpenAI-compatible agent harness |
| Temperature | 0.7 |
| Context | 131072 |
| KV cache for this test | -ctk f32 -ctv f32, used as conservative isolation against KV precision questions |
| Attention and split | -fa on, -sm graph |
| Chat formatting | --jinja, --reasoning-format deepseek |
| Agent cap | CLAW_MAX_TOKENS=1800, TIMEOUT_SECONDS=900 |

| Run | RC | Time | Verifier | Notes |
| --- | --- | --- | --- | --- |
| AEON RYS IQ4_NL attempt 1 | 1 | 337s | 0.0417 / false | Failed before usable files due to invalid tool/diff behavior while writing CSS/app JS. |
| AEON RYS IQ4_NL retry 1 | 0 | 803s | 1.0 / true | Clean retry after the first formatting failure. |
| SignalLatch IQ4_NL | 0 | 802s | 1.0 / true | Clean completion from the selected release candidate. |
| Unsloth IQ4_NL | 0 | 826s | 1.0 / true | Clean pass for the external Q4-family comparison. |
| Unsloth Q8_0 | 124 | 900s | 1.0 / true | Produced complete verified files but timed out during final agent wrap-up. |

The canvas test is not a broad benchmark. It is useful because it exposed formatting reliability, timeout behavior, and whether a compressed model could still finish a real multi-file tool-style task.
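
For context, the agent harness reached the server over the standard OpenAI-compatible chat API, so a minimal client looks like the sketch below. The base URL, port, and model id are placeholders, not recorded harness settings:

```python
from openai import OpenAI

# Placeholder endpoint: llama-server exposes an OpenAI-compatible API;
# adjust host, port, and model id to your own serving setup.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="signallatch",   # placeholder model id
    temperature=0.7,       # matches the shared harness setting
    max_tokens=1800,       # matches the recorded CLAW_MAX_TOKENS agent cap
    messages=[{"role": "user", "content": "Review the failing test and patch it."}],
)
print(response.choices[0].message.content)
```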

Runtime Profile Used for Selection

The selected public runtime is the custom AEON ik-llama fork. The fine-tuned GGUF should be treated as an artifact for that fork, not as a stock llama.cpp compatibility claim.

./build/bin/llama-server \
  -m /path/to/Qwen3.6-27B-AEON-RYS-SignalLatch-ckpt386-s010-IQ4_NL.gguf \
  -c 65536 \
  -ngl 999 \
  -np 1 \
  -fa on \
  -sm graph \
  --temp 0.7 \
  --jinja \
  --reasoning-format deepseek \
  --reasoning-budget 0 \
  -cram 0 \
  --ctx-checkpoints 0

For the 131k canvas comparison, the same shape was used with a larger context and FP32 KV as an isolation setting:

-c 131072 \
  -ctk f32 \
  -ctv f32

Practical single-GPU deployment: the SignalLatch Q4_NL release is small enough for practical single-GPU use. In an observed 24 GB-class GPU reference profile, roughly 160k context with default/FP16 KV fit at about 20.3 GiB total VRAM on an RTX 3090-class card. Treat this as a deployment reference point, not a guaranteed memory benchmark.

Recommended file

Qwen3.6-27B-AEON-RYS-SignalLatch-ckpt386-s010-IQ4_NL.gguf

Exploration file

Qwen3.6-27B-AEON-RYS-SignalLatch-ckpt386-s010-BF16.gguf exists for inspection, re-quantization, or continued work, not normal inference.

Tested runtime boundary: use the public qwen36-aeon-ik-llama fork, with SignalLatch support documented in the fork history at commit f0910a49 and later docs commits on top. Do not read this as support for arbitrary upstream llama.cpp or for live LoRA loading with the public serving profile.

Caveats and Boundaries

What we are comfortable saying

  • The merged Q4_NL LoRA path improved over the previous AEON RYS Q4_NL baseline on the five-task practical matrix.
  • The useful merge strength was low. Full-strength LoRA was too aggressive in early probes.
  • The merged GGUF path is the right path to judge for this release.
  • s0.10 is the selected default among the tested strengths because it balanced improvement and repeat stability.

What we are not claiming

  • We are not claiming the fine-tune is better for all tasks or all users.
  • We are not claiming BF16 benchmark dominance.
  • We are not claiming s0.10 is globally optimal.
  • We are not claiming stock llama.cpp compatibility.
  • We are not recommending live LoRA loading for the public serving profile.

One row in the third s0.10 repeat run was treated as a runtime/server stability incident. The task scored 0.25, but the agent failed immediately after API retries and the server log showed std::runtime_error with Invalid diff. Strict scoring still counts the failed row, but the selection notes separate it from normal model-output misses.

[Figure: Claim boundary showing supported claims, unsupported claims, and runtime caveats.]

The release language should stay narrow: useful practical improvement, not a universal solved-model claim.

Evidence Included on This Page

This page intentionally embeds the relevant numbers rather than relying on local workspaces. The public evidence bundle in this directory remains useful for audit trails, but readers should not need private paths to understand the decision.

| Evidence topic | Numbers included here |
| --- | --- |
| Training shape | Example count, checkpoint, epoch, LoRA config, target modules. |
| Early probes | Checkpoint means and lower-strength behavior sweep table. |
| Direct canvas failure | 36-combination direct-path distribution and failure read. |
| Merged GGUF correction | ckpt350 s0.25 and ckpt386 s0.25 BF16 GGUF canvas pass scores. |
| Deploy Q4_NL matrix | All first sweep strengths, pass counts, means, elapsed, timeout-ish rows. |
| Selected s0.10 comparison | Task-level base versus s0.10 scores and elapsed times. |
| Repeat stability | s0.10, s0.20, s0.25 strict and crash-adjusted comparison. |
| Canvas comparison | SignalLatch, base AEON RYS, Unsloth IQ4_NL, and Unsloth Q8_0 results. |