Statement of Purpose
Irys · 7cMGxS…5wOU ↗
Rendered from protocol.json
Experiment protocol
AgentRules
Autonomy
- NoAskHumanToContinue: Yes
- ExperimentTimeoutNotes: Typical runs are about 5 minutes of training plus compile/eval overhead. Kill runs that exceed 15 minutes wall clock.
- LogRedirectExample: uv run train.py > run.log 2>&1
- SimplicityCriterion: Yes
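The timeout and log-redirect rules above can be combined in a small runner. The following is a minimal sketch, not the repo's own tooling: the function name `run_trial` and its return convention are assumptions made for illustration. It relies only on the standard library's `subprocess.run`, which kills the child process when the timeout expires.

```python
import subprocess
import sys


def run_trial(cmd, log_path, timeout_s=900.0):
    """Run cmd with stdout and stderr redirected to log_path.

    Returns the exit code, or None if the run exceeded timeout_s and was
    killed. `run_trial` is a hypothetical helper, not part of the repo.
    """
    with open(log_path, "w") as log:
        try:
            proc = subprocess.run(
                cmd,
                stdout=log,
                stderr=subprocess.STDOUT,  # same effect as `> run.log 2>&1`
                timeout=timeout_s,         # 900 s matches the 15-minute limit
            )
        except subprocess.TimeoutExpired:
            return None
    return proc.returncode


# Intended use, assuming uv is on PATH and the cwd is the repo root:
# exit_code = run_trial(["uv", "run", "train.py"], "run.log", timeout_s=900)
```

A `None` result would then be recorded as a crashed run instead of being retried indefinitely.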
ArchetypeExtensions
Ml Train
- DataSnapshot: ~/.cache/autoresearch/data with pinned validation shard shard_06542.parquet and tokenizer artifacts under ~/.cache/autoresearch/tokenizer
- EvalTokens: 1572864
- HardwareNotes: Absolute val_bpb differs by Apple Silicon hardware class. Compare only against baselines and trials from the same machine.
Environment
AssetPrep
- One-time data and tokenizer preparation writes to ~/.cache/autoresearch.
- prepare.py downloads public Hugging Face shards when the local cache is missing.
- Measured baselines and downstream trials should run against an already populated ~/.cache/autoresearch snapshot rather than requiring network during the experiment loop.
- Training/eval should reuse the same local cache snapshot across trials on the same machine.
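Because trials are expected to run against an already populated cache, a pre-flight check can confirm the snapshot is present before the experiment loop starts. The sketch below is illustrative only: the helper name `missing_cache_items` is hypothetical, and it assumes the pinned shard sits directly inside the `data` directory, which the protocol text suggests but does not state outright.

```python
from pathlib import Path


def missing_cache_items(root="~/.cache/autoresearch"):
    """Return the expected cache paths that do not exist under root.

    Assumed layout (not confirmed by the protocol):
      <root>/data/shard_06542.parquet  pinned validation shard
      <root>/tokenizer                 tokenizer artifacts
    """
    root = Path(root).expanduser()
    required = [
        root / "data" / "shard_06542.parquet",
        root / "tokenizer",
    ]
    return [str(p) for p in required if not p.exists()]
```

An empty list means the loop can proceed offline; a non-empty list suggests that `uv run prepare.py` still needs to be run once with network access.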
Constraints
- NetworkPolicy: offline
- NoNewDependencies: Yes
OsHints
- darwin-arm64
- Apple Silicon
PackageManagers
- uv
SetupCommands
- uv sync
- uv run prepare.py
Execution
- Command: uv run train.py
- Cwd: .
Determinism
- Notes: The public repo does not expose a strict fixed-seed contract; compare runs on the same hardware and cache snapshot.
- SeedPolicy: optional
- HardTimeoutSeconds: 900
StopCondition
- ExcludeCompilationFromBudget: Yes
- Notes: The training loop accumulates 300 seconds of post-startup training time, then runs a final evaluation pass.
- TrainingSecondsBudget: 300
- Type: wall_clock
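The stop condition above, a fixed budget of training seconds with compilation excluded, can be illustrated with a short loop. This is a sketch under stated assumptions, not the actual logic in train.py or prepare.py: `train_with_budget`, `warmup_steps`, and the injectable `clock` are hypothetical names, and treating the first few steps as "compilation" is a simplification.

```python
import time
from typing import Callable, Tuple


def train_with_budget(
    step: Callable[[], None],
    budget_s: float = 300.0,
    warmup_steps: int = 1,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[int, float]:
    """Call step() until budget_s seconds of counted time have accumulated.

    The first warmup_steps calls stand in for compilation and are not
    counted against the budget. Returns (total steps, counted seconds).
    """
    counted = 0.0
    steps = 0
    while counted < budget_s:
        t0 = clock()
        step()
        dt = clock() - t0
        steps += 1
        if steps > warmup_steps:
            counted += dt  # only post-warmup time counts toward the budget
    return steps, counted
```

Because the budget is checked only between steps, the counted time can overshoot the budget by up to one step, which would be consistent with a reported `training_seconds` slightly above 300.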
ImmutableHarness
Paths
- prepare.py
- Rationale: prepare.py defines the fixed data prep, tokenizer training, dataloader, time budget, and evaluate_bpb metric implementation that must remain stable across trials.
Measurement
BaselinePolicy
- BaselineNotes: Do not reuse baseline values from other machines. Establish a fresh local baseline on the same Apple Silicon hardware and with the same ~/.cache/autoresearch data/tokenizer snapshot.
- EstablishOnHardware: Yes
- SameDataSnapshot: Yes
PrimaryMetric
- Direction: minimize
Extract
- ExampleStdout:
---
val_bpb: 2.534000
training_seconds: 312.4
total_seconds: 405.7
peak_vram_mb: 27528.9
mfu_percent: 0.00
total_tokens_M: 39.8
num_steps: 46
num_params_M: 50.3
depth: 8
- Kind: regex
- Notes: Use the summary block printed at the end of train.py. The first regex capture group is the scalar benchmark value.
- Pattern: ^val_bpb:\s+([0-9]+(?:\.[0-9]+)?)$
- Name: val_bpb
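The extraction rule for the primary metric can be sketched in a few lines. The pattern is anchored with `^` and `$`, so it only matches a single line of a multi-line log when `re.MULTILINE` is set. The helper name `extract_metric` and the choice to take the last match are assumptions for illustration.

```python
import re
from typing import Optional

VAL_BPB_PATTERN = r"^val_bpb:\s+([0-9]+(?:\.[0-9]+)?)$"


def extract_metric(stdout: str, pattern: str = VAL_BPB_PATTERN) -> Optional[float]:
    """Return the first capture group of the last matching line, or None.

    re.MULTILINE makes ^ and $ match at line boundaries, which the
    anchored pattern needs when applied to a whole log.
    """
    matches = re.findall(pattern, stdout, flags=re.MULTILINE)
    return float(matches[-1]) if matches else None


example = (
    "---\n"
    "val_bpb: 2.534000\n"
    "training_seconds: 312.4\n"
    "total_seconds: 405.7\n"
    "peak_vram_mb: 27528.9\n"
)
print(extract_metric(example))  # 2.534
```

A return value of `None` indicates that the summary block was never printed, which is how a crashed or killed run would typically show up.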
SecondaryMetrics
- - Direction: minimize
Extract
- ExampleStdout:
```
---
val_bpb: 2.534000
training_seconds: 312.4
total_seconds: 405.7
peak_vram_mb: 27528.9
mfu_percent: 0.00
total_tokens_M: 39.8
num_steps: 46
num_params_M: 50.3
depth: 8
```
- Kind: regex
- Pattern: ^peak_vram_mb:\s+([0-9]+(?:\.[0-9]+)?)$
- Name: peak_vram_mb
- - Direction: minimize
Extract
- ExampleStdout:
```
---
val_bpb: 2.534000
training_seconds: 312.4
total_seconds: 405.7
peak_vram_mb: 27528.9
mfu_percent: 0.00
total_tokens_M: 39.8
num_steps: 46
num_params_M: 50.3
depth: 8
```
- Kind: regex
- Notes: Support metric only. The main optimization target remains val_bpb.
- Pattern: ^training_seconds:\s+([0-9]+(?:\.[0-9]+)?)$
- Name: training_seconds
Meta
- Archetype: ml_train
- CreatedAt: 2026-05-09T08:00:09Z
- Eligibility: eligible
- ProtocolBundleId: autoresearch-mlx-main-ba6ebf6-20260509
- PurposeStatement: Improve the final validation bits-per-byte (val_bpb) achieved by a fixed 5-minute MLX training run on the same Apple Silicon machine.
Repo
- DefaultBranch: main
- Name: autoresearch-mlx
- Owner: trevin-creator
- UpdatedAt: 2026-05-09T08:00:09Z
MutableSurface
AllowedGlobs
- train.py
AllowedKinds
- code_edit
ForbiddenGlobs
- prepare.py
- README.md
- program.md
- results.tsv
- uv.lock
- LICENSE
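The allowed and forbidden globs above can be enforced before each commit. The following is a minimal sketch: `disallowed_paths` is a hypothetical helper, and the rule that a path must match an allowed glob and no forbidden glob is an assumed reading of the protocol, not a documented algorithm.

```python
from fnmatch import fnmatchcase

ALLOWED_GLOBS = ["train.py"]
FORBIDDEN_GLOBS = [
    "prepare.py", "README.md", "program.md",
    "results.tsv", "uv.lock", "LICENSE",
]


def disallowed_paths(changed):
    """Return the changed paths that fall outside the mutable surface.

    A path is rejected if it matches any forbidden glob or matches no
    allowed glob (assumed interpretation of the two lists).
    """
    bad = []
    for path in changed:
        forbidden = any(fnmatchcase(path, g) for g in FORBIDDEN_GLOBS)
        allowed = any(fnmatchcase(path, g) for g in ALLOWED_GLOBS)
        if forbidden or not allowed:
            bad.append(path)
    return bad


print(disallowed_paths(["train.py", "prepare.py"]))  # ['prepare.py']
```

The input list could come from something like `git diff --name-only`, so that an experiment touching anything other than train.py is rejected before it is measured.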
ProtocolVersion: 1.0
Provenance
GitWorkflow
- BranchPattern: autoresearch/<tag>
- CommitScope: One experimental change per commit on a dedicated autoresearch branch.
- StagingExample: git add train.py && git commit -m "experiment: <description>"
ResultsLog
Columns
- commit
- val_bpb
- memory_gb
- status
- description
- Format: tsv
- Path: results.tsv
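One row of the results log can be produced with the standard `csv` module using a tab delimiter. This is a sketch only: the helper `format_row`, the `keep` status value, the number formatting, and the conversion of `peak_vram_mb` to `memory_gb` by dividing by 1024 are all assumptions; the protocol defines only the column names and the `crash` status.

```python
import csv
import io

COLUMNS = ["commit", "val_bpb", "memory_gb", "status", "description"]


def format_row(commit, val_bpb, peak_vram_mb, status, description):
    """Format one tab-separated results row, ending with a newline.

    memory_gb is derived as peak_vram_mb / 1024 (an assumed conversion).
    """
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow([
        commit,
        f"{val_bpb:.6f}",
        f"{peak_vram_mb / 1024:.1f}",
        status,
        description,
    ])
    return buf.getvalue()


# "abc1234" and "keep" are placeholder values for illustration.
row = format_row("abc1234", 2.534, 27528.9, "keep", "baseline")
```

Appending `row` to results.tsv after each trial keeps the log in the column order listed above.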
Safety
- CrashStatus: crash
- OomPolicy: reduce_batch
SchemaKind: protocol
protocol.json (raw)
{
  "schemaKind": "protocol",
  "protocolVersion": "1.0",
  "meta": {
    "archetype": "ml_train",
    "eligibility": "eligible",
    "repo": {
      "owner": "trevin-creator",
      "name": "autoresearch-mlx",
      "defaultBranch": "main",
      "cloneUrl": "https://github.com/trevin-creator/autoresearch-mlx"
    },
    "purposeStatement": "Improve the final validation bits-per-byte (val_bpb) achieved by a fixed 5-minute MLX training run on the same Apple Silicon machine.",
    "createdAt": "2026-05-09T08:00:09Z",
    "updatedAt": "2026-05-09T08:00:09Z",
    "protocolBundleId": "autoresearch-mlx-main-ba6ebf6-20260509"
  },
  "environment": {
    "osHints": [
      "darwin-arm64",
      "Apple Silicon"
    ],
    "packageManagers": [
      "uv"
    ],
    "setupCommands": [
      "uv sync",
      "uv run prepare.py"
    ],
    "assetPrep": [
      "One-time data and tokenizer preparation writes to ~/.cache/autoresearch.",
      "prepare.py downloads public Hugging Face shards when the local cache is missing.",
      "Measured baselines and downstream trials should run against an already populated ~/.cache/autoresearch snapshot rather than requiring network during the experiment loop.",
      "Training/eval should reuse the same local cache snapshot across trials on the same machine."
    ],
    "constraints": {
      "noNewDependencies": true,
      "networkPolicy": "offline"
    }
  },
  "mutableSurface": {
    "allowedGlobs": [
      "train.py"
    ],
    "forbiddenGlobs": [
      "prepare.py",
      "README.md",
      "program.md",
      "results.tsv",
      "uv.lock",
      "LICENSE"
    ],
    "allowedKinds": [
      "code_edit"
    ]
  },
  "immutableHarness": {
    "paths": [
      "prepare.py"
    ],
    "rationale": "prepare.py defines the fixed data prep, tokenizer training, dataloader, time budget, and evaluate_bpb metric implementation that must remain stable across trials."
  },
  "execution": {
    "command": "uv run train.py",
    "cwd": ".",
    "stopCondition": {
      "type": "wall_clock",
      "trainingSecondsBudget": 300,
      "excludeCompilationFromBudget": true,
      "notes": "The training loop accumulates 300 seconds of post-startup training time, then runs a final evaluation pass."
    },
    "hardTimeoutSeconds": 900,
    "determinism": {
      "seedPolicy": "optional",
      "notes": "The public repo does not expose a strict fixed-seed contract; compare runs on the same hardware and cache snapshot."
    }
  },
  "measurement": {
    "primaryMetric": {
      "name": "val_bpb",
      "direction": "minimize",
      "extract": {
        "kind": "regex",
        "pattern": "^val_bpb:\\s+([0-9]+(?:\\.[0-9]+)?)$",
        "exampleStdout": "---\nval_bpb: 2.534000\ntraining_seconds: 312.4\ntotal_seconds: 405.7\npeak_vram_mb: 27528.9\nmfu_percent: 0.00\ntotal_tokens_M: 39.8\nnum_steps: 46\nnum_params_M: 50.3\ndepth: 8",
        "notes": "Use the summary block printed at the end of train.py. The first regex capture group is the scalar benchmark value."
      }
    },
    "secondaryMetrics": [
      {
        "name": "peak_vram_mb",
        "direction": "minimize",
        "extract": {
          "kind": "regex",
          "pattern": "^peak_vram_mb:\\s+([0-9]+(?:\\.[0-9]+)?)$",
          "exampleStdout": "---\nval_bpb: 2.534000\ntraining_seconds: 312.4\ntotal_seconds: 405.7\npeak_vram_mb: 27528.9\nmfu_percent: 0.00\ntotal_tokens_M: 39.8\nnum_steps: 46\nnum_params_M: 50.3\ndepth: 8"
        }
      },
      {
        "name": "training_seconds",
        "direction": "minimize",
        "extract": {
          "kind": "regex",
          "pattern": "^training_seconds:\\s+([0-9]+(?:\\.[0-9]+)?)$",
          "exampleStdout": "---\nval_bpb: 2.534000\ntraining_seconds: 312.4\ntotal_seconds: 405.7\npeak_vram_mb: 27528.9\nmfu_percent: 0.00\ntotal_tokens_M: 39.8\nnum_steps: 46\nnum_params_M: 50.3\ndepth: 8",
          "notes": "Support metric only. The main optimization target remains val_bpb."
        }
      }
    ],
    "baselinePolicy": {
      "establishOnHardware": true,
      "sameDataSnapshot": true,
      "baselineNotes": "Do not reuse baseline values from other machines. Establish a fresh local baseline on the same Apple Silicon hardware and with the same ~/.cache/autoresearch data/tokenizer snapshot."
    }
  },
  "provenance": {
    "resultsLog": {
      "format": "tsv",
      "path": "results.tsv",
      "columns": [
        "commit",
        "val_bpb",
        "memory_gb",
        "status",
        "description"
      ]
    },
    "gitWorkflow": {
      "branchPattern": "autoresearch/<tag>",
      "commitScope": "One experimental change per commit on a dedicated autoresearch branch.",
      "stagingExample": "git add train.py && git commit -m \"experiment: <description>\""
    }
  },
  "safety": {
    "oomPolicy": "reduce_batch",
    "crashStatus": "crash"
  },
  "agentRules": {
    "simplicityCriterion": true,
    "autonomy": {
      "noAskHumanToContinue": true
    },
    "experimentTimeoutNotes": "Typical runs are about 5 minutes of training plus compile/eval overhead. Kill runs that exceed 15 minutes wall clock.",
    "logRedirectExample": "uv run train.py > run.log 2>&1"
  },
  "archetypeExtensions": {
    "ml_train": {
      "dataSnapshot": "~/.cache/autoresearch/data with pinned validation shard shard_06542.parquet and tokenizer artifacts under ~/.cache/autoresearch/tokenizer",
      "evalTokens": 1572864,
      "hardwareNotes": "Absolute val_bpb differs by Apple Silicon hardware class. Compare only against baselines and trials from the same machine."
    }
  }
}