
Training Simplex Specialists

From Infrastructure to Fine-Tuned Models

Simplex's cognitive architecture relies on domain-specific specialists—fine-tuned language models that excel at particular tasks. This post documents our complete training pipeline: infrastructure selection, AWS provisioning, LoRA fine-tuning scripts, and why we're starting with Python before porting to pure Simplex.

The Training Challenge

Simplex's hive architecture routes tasks to specialists based on semantic similarity. Each specialist is a LoRA adapter trained on top of a shared base model (Qwen3-8B). The challenge: training 52+ specialists efficiently without breaking the bank.

Our requirements:

  • Cost efficiency: Full pipeline under $15 per run
  • Reproducibility: Same inputs produce identical adapters
  • Modularity: Train individual specialists or the full catalog
  • Export ready: Output GGUF files compatible with Ollama

Infrastructure Selection

We evaluated three approaches before settling on AWS spot instances:

Option       | Pros                            | Cons                                           | Cost (Full Pipeline)
Local GPU    | No cloud costs, instant access  | Requires 24GB+ VRAM, ties up dev machine       | ~$2 electricity
Colab/Kaggle | Free tier available, no setup   | Session limits, inconsistent GPU availability  | $0-10 (Pro tier)
AWS Spot     | Reliable, scriptable, 24GB A10G | Potential interruption, initial setup          | $8-14

AWS spot instances won on reliability and automation. The g5.xlarge instance type provides a single NVIDIA A10G with 24GB VRAM—enough for 8B parameter models with LoRA. Spot pricing typically runs 60-70% below on-demand rates.

Instance Selection

g5.xlarge: 1x A10G (24GB), 4 vCPU, 16GB RAM — $1.01/hr on-demand, ~$0.35/hr spot. Sweet spot for single-adapter training. For parallel training of multiple specialists, scale to g5.12xlarge (4x A10G, 96GB total).

AWS Provisioning

Infrastructure is defined in Terraform for reproducibility. The setup creates:

  • EC2 instance with Deep Learning AMI (Ubuntu 22.04, NVIDIA drivers pre-installed)
  • Security group allowing SSH access
  • IAM role with S3 access for artifact storage
  • 200GB gp3 EBS volume for model weights and datasets

Terraform Configuration

# variables.tf - Key configuration options
variable "instance_type" {
  description = "EC2 instance type"
  type        = string
  default     = "g5.xlarge"

  validation {
    condition = contains([
      "g4dn.xlarge", "g4dn.2xlarge",
      "g5.xlarge", "g5.2xlarge", "g5.12xlarge",
      "p3.2xlarge"
    ], var.instance_type)
    error_message = "Must be a GPU instance type."
  }
}

variable "use_spot_instance" {
  description = "Use spot instance for cost savings"
  type        = bool
  default     = true
}

variable "spot_max_price" {
  description = "Maximum hourly price for spot"
  type        = string
  default     = "0.50"  # 50% of on-demand
}

The bootstrap script (user_data.sh) runs on instance launch, installing dependencies automatically:

#!/bin/bash
# Verify NVIDIA drivers
nvidia-smi

# Create Python environment
python3 -m venv ~/simplex-training
source ~/simplex-training/bin/activate

# Install PyTorch with CUDA 12.1
pip install torch torchvision torchaudio \
  --index-url https://download.pytorch.org/whl/cu121

# Install training dependencies
pip install "transformers>=4.40.0" "accelerate>=0.27.0" \
  "datasets>=2.18.0" "peft>=0.10.0" "bitsandbytes>=0.43.0" \
  "trl>=0.8.0" "wandb>=0.16.0"

# Configure credentials (from Terraform variables)
wandb login $WANDB_API_KEY
huggingface-cli login --token $HF_TOKEN

Quick Launch Scripts

For those who prefer shell scripts over Terraform:

# Launch a spot instance
./launch_instance.sh

# SSH once ready (~3-5 minutes)
ssh -i ~/.ssh/<your-private-key>.pem ubuntu@<instance-ip>

# When finished
./terminate_instance.sh
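
If you'd rather drive the launch from Python, the same spot request can be made with boto3. This is a minimal sketch rather than the contents of launch_instance.sh; the AMI ID, key pair, and security group below are placeholders to replace with your own values.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-xxxxxxxxxxxxxxxxx",   # Deep Learning AMI (Ubuntu 22.04)
    InstanceType="g5.xlarge",
    KeyName="your-key-pair",
    SecurityGroupIds=["sg-xxxxxxxxxxxxxxxxx"],
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {"MaxPrice": "0.50", "SpotInstanceType": "one-time"},
    },
    BlockDeviceMappings=[{
        "DeviceName": "/dev/sda1",
        "Ebs": {"VolumeSize": 200, "VolumeType": "gp3"},
    }],
)
print("Launched", response["Instances"][0]["InstanceId"])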

The Training Pipeline

Training proceeds through five stages, each producing a LoRA adapter that's merged into the final model:

Stage                     | Purpose                              | Training Time  | Dataset Size
1. Context Protocol       | Simplex memory format understanding  | 4-6 hours      | 100K examples
2. Confidence Calibration | Well-calibrated confidence outputs   | 2-4 hours      | 50K examples
3. Belief Revision        | Updating beliefs on new evidence     | 2-4 hours      | 30K examples
4. Neural IR/Gates        | Soft logic, gradient-aware outputs   | 2-3 hours      | 25K examples
5. Specialists            | Domain-specific LoRA adapters        | 30-60 min each | 10-50K per domain

Stage 1: Context Protocol Training

The context protocol teaches the model Simplex's memory format. The model learns to parse and generate responses using episodic, semantic, procedural, and working memory contexts:

# Example training prompt
<context>
  <episodic>User asked about Rust async patterns yesterday</episodic>
  <semantic>Rust uses async/await with the tokio runtime</semantic>
  <belief confidence="0.85">User prefers concrete examples</belief>
</context>

Query: How do I handle multiple futures?

# Expected output references context appropriately
Response: Building on our async discussion, here's a concrete example
using tokio::join! to await multiple futures concurrently...

The training script generates 100K synthetic examples covering all context types and threshold levels (30% Anima, 50% Hive, 70% Divine).

Stage 2: Confidence Calibration

Simplex requires well-calibrated confidence scores. A model claiming 90% confidence should be correct 90% of the time. We train on four data distributions:

  • 35% Factual QA: High confidence (0.97-0.99) on verifiable facts
  • 30% Ambiguous: Medium confidence (0.45-0.65) on uncertain questions
  • 20% Unknowable: Low confidence (0.05-0.15) on impossible-to-know queries
  • 15% Threshold: Comparison and boundary cases

Target metric: Expected Calibration Error (ECE) < 0.05.
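
For reference, here is a minimal sketch of how ECE can be computed on a held-out evaluation set; the function and its inputs are illustrative, not the pipeline's actual evaluation code.

import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by stated confidence and compare accuracy to confidence."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap   # weight each bin by its share of samples
    return ece

# Calibration training passes when the held-out ECE comes in under 0.05.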

Stage 3: Belief Revision

The model learns to update confidence appropriately when presented with new evidence:

# Training example
Initial belief: "The meeting is at 3pm" (confidence: 0.75)

Evidence: "Calendar shows meeting moved to 4pm"
Evidence strength: strong_confirm

Expected output:
Updated belief: "The meeting is at 4pm" (confidence: 0.92)
Reasoning: Calendar is authoritative source, directly contradicts
prior belief about time while confirming meeting exists.

Evidence categories range from strong_confirm (+15 to +30% confidence) through strong_contradict (-15 to -30%), with the model learning the appropriate magnitude of each update.
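
A rough sketch of how synthetic targets can encode this behavior is shown below. The strong_* delta ranges mirror the +15-30% and -15-30% updates just described; the intermediate categories and the helper function are illustrative assumptions, not the actual generator code.

import random

# Additive confidence deltas per evidence category. The strong_* ranges follow
# the +15-30% / -15-30% updates above; weak_* and neutral are assumptions.
EVIDENCE_DELTAS = {
    "strong_confirm":    (0.15, 0.30),
    "weak_confirm":      (0.05, 0.15),
    "neutral":           (0.00, 0.00),
    "weak_contradict":   (-0.15, -0.05),
    "strong_contradict": (-0.30, -0.15),
}

def target_confidence(prior: float, evidence_strength: str) -> float:
    """Sample the target posterior confidence for a synthetic example."""
    lo, hi = EVIDENCE_DELTAS[evidence_strength]
    return min(1.0, max(0.0, prior + random.uniform(lo, hi)))

# e.g. target_confidence(0.75, "strong_confirm") lands roughly in 0.90-1.00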

Stage 5: Specialist Adapters

The specialist catalog defines 52 domain-specific LoRA adapters:

Category            | Specialists                                          | LoRA Rank
Document Processing | Invoice, receipt, contract, form extraction          | r=16
Coding              | Generation, review, debugging, SQL, API integration  | r=32
Reasoning           | Math, logic, commonsense, causal                     | r=16
Sentiment           | Classification, aspect-based, opinion mining         | r=8
Finance             | Analysis, sentiment, risk assessment                 | r=16
Legal               | Contract review, compliance, analysis                | r=16

All training datasets use verified open-source licenses (MIT, Apache 2.0, CC BY, CC0) ensuring commercial viability.
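
To make the catalog concrete, a single entry might look roughly like this; the field names and dataset placeholders are hypothetical, shown only to illustrate the shape of the data train_specialists.py works from.

# Hypothetical catalog entries -- illustrative field names, not the real schema.
SPECIALIST_CATALOG = {
    "invoice_processing": {
        "category": "document_processing",
        "base_model": "Qwen/Qwen3-8B",
        "lora_rank": 16,
        "datasets": ["<open-licensed invoice dataset>"],
    },
    "sql_generation": {
        "category": "coding",
        "base_model": "Qwen/Qwen3-8B",
        "lora_rank": 32,
        "datasets": ["<open-licensed text-to-SQL dataset>"],
    },
}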

The Fine-Tuning Scripts

Each training stage has a dedicated Python script with a consistent interface:

# Run all stages sequentially
python scripts/run_full_training.py --all

# Run specific stage
python scripts/run_full_training.py --stage 2

# Local test mode (CPU, small dataset)
python scripts/train_context_protocol.py --local-test --generate-data

# Train specific specialist
python scripts/train_specialists.py --specialist invoice_processing

# List available specialists
python scripts/train_specialists.py --list

Core Training Configuration

# training_config.yaml
model:
  name: Qwen/Qwen3-8B
  torch_dtype: bfloat16
  trust_remote_code: true

lora:
  r: 16
  lora_alpha: 32
  target_modules: [q_proj, v_proj]
  lora_dropout: 0.05

training:
  num_train_epochs: 3
  learning_rate: 2.0e-4
  per_device_train_batch_size: 4
  gradient_accumulation_steps: 4
  optim: adamw_8bit
  bf16: true
  gradient_checkpointing: true

data:
  max_seq_length: 4096
  train_split: 0.9
  seed: 42
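
For context, the sketch below shows roughly how these settings map onto PEFT objects inside the training scripts. It is a simplified illustration, not the scripts verbatim.

# Simplified sketch: load training_config.yaml and apply LoRA to the base model.
import yaml
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

with open("training_config.yaml") as f:
    cfg = yaml.safe_load(f)

model = AutoModelForCausalLM.from_pretrained(
    cfg["model"]["name"],                      # Qwen/Qwen3-8B
    torch_dtype=torch.bfloat16,
    trust_remote_code=cfg["model"]["trust_remote_code"],
)

lora_config = LoraConfig(
    r=cfg["lora"]["r"],
    lora_alpha=cfg["lora"]["lora_alpha"],
    target_modules=cfg["lora"]["target_modules"],
    lora_dropout=cfg["lora"]["lora_dropout"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # LoRA trains only a small fraction of the 8B weights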

Synthetic Data Generation

Training data is generated programmatically to ensure coverage of edge cases:

import random

def generate_context_example():
    """Generate a single context protocol training example."""
    # Helper generators (generate_episodic_memory, etc.) are not shown here;
    # this sketch illustrates the overall structure of the generator.
    context_type = random.choice(['episodic', 'semantic', 'belief', 'hive'])

    if context_type == 'episodic':
        context = generate_episodic_memory()
        query = generate_related_query(context)
        response = generate_contextual_response(context, query)
    elif context_type == 'belief':
        belief, confidence = generate_belief_with_confidence()
        threshold = random.choice([0.3, 0.5, 0.7])
        # ... generate a threshold-aware context, query, and response

    return format_training_example(context, query, response)
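
One way to materialize the generated examples for training is to write them to JSONL. This assumes format_training_example returns a JSON-serializable dict, and the file path is illustrative.

import json

# Write the 100K generated examples to a JSONL file the trainer can load.
with open("data/context_protocol_train.jsonl", "w") as f:
    for _ in range(100_000):
        f.write(json.dumps(generate_context_example()) + "\n")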

Validation Pipeline

Before training all 52 specialists, we validate the approach with three pilot tasks:

Pilot Task         | Difficulty                     | Baseline | Target
Sentiment Analysis | Easy (classification)          | 70-75%   | 85%+
SQL Generation     | Medium (structured output)     | 30-40%   | 60%+
Invoice Extraction | Hard (information extraction)  | 40-50%   | 75%+

# Run pilot validation
cd validation
./run_pilots.sh --local-test   # CPU-only, ~30 min
./run_pilots.sh --full         # Full GPU training + evaluation

Export to Production

Trained adapters export to GGUF format for deployment with Ollama:

python scripts/export_to_gguf.py \
  --adapter-path outputs/context_protocol/final \
  --quantization q4_k_m

# Creates: exports/simplex-cognitive-8b-q4_k_m.gguf
# Plus:    exports/Modelfile

# Deploy to Ollama
ollama create simplex-cognitive-8b -f exports/Modelfile
ollama run simplex-cognitive-8b

The Modelfile includes the system prompt with Simplex-specific instructions for confidence calibration, memory context handling, and threshold awareness.

Why Python? And the Path to Pure Simplex

The training pipeline is written in Python. This deserves explanation given that Simplex is its own language.

Pragmatic Reasons for Python

  • Ecosystem maturity: PyTorch, Transformers, PEFT, and the entire ML stack are Python-native. Fighting this adds friction without benefit.
  • Rapid iteration: Training experiments need quick turnaround. Python's REPL and debugging tools accelerate development.
  • Community resources: Documentation, tutorials, and troubleshooting are overwhelmingly Python-centric.
  • Contributor accessibility: Most ML practitioners know Python. Lowering the barrier to contribution matters for an open-source project.

The Honest Assessment

Writing training scripts in Simplex today would mean building FFI bindings to PyTorch, reimplementing data loading pipelines, and debugging at the intersection of two complex systems. The cognitive overhead isn't justified when Python scripts work reliably and cost nothing extra to run.

The Roadmap to Pure Simplex

That said, "training infrastructure in Simplex" is explicitly on our roadmap. The path:

  1. Phase 1 (Current): Python training scripts, Simplex runtime consumes trained models
  2. Phase 2: Simplex FFI bindings to PyTorch C++ API (libtorch)
  3. Phase 3: Native tensor operations in Simplex compiler (post Neural IR)
  4. Phase 4: Self-hosted training—Simplex trains its own specialists

Phase 4 is the goal: a Simplex program that generates training data, fine-tunes adapters, and validates results—all in pure Simplex. This becomes viable after Neural IR lands (see Neural IR roadmap), when the compiler can emit differentiable computation graphs.

Until then, Python training scripts are a pragmatic bridge. They're isolated from the runtime, well-tested, and produce artifacts (GGUF files) that the Simplex runtime consumes cleanly.

Running the Full Pipeline

End-to-end execution:

# 1. Provision infrastructure
cd infrastructure
terraform init
terraform apply -var="use_spot_instance=true"

# 2. Wait for instance (~5 min), SSH in
ssh -i ~/.ssh/<your-private-key>.pem ubuntu@$(terraform output -raw public_ip)

# 3. Activate environment (auto-configured by user_data.sh)
source ~/start-training.sh

# 4. Run full pipeline
python scripts/run_full_training.py --all

# 5. Monitor progress
nvidia-smi -l 5  # GPU utilization
tail -f outputs/training.log

# 6. Export trained model
python scripts/export_to_gguf.py --adapter-path outputs/final

# 7. Download artifacts and terminate
scp -r ubuntu@<instance-ip>:~/simplex-training/exports ./
terraform destroy

Total wall time: 10-14 hours. Total cost: $8-14 on spot instances.

What's Next

The training infrastructure is functional and documented. Immediate next steps:

  • Pilot validation: Run the three-task validation to confirm approach before full specialist training
  • Continuous training: GitHub Actions workflow for automated retraining on dataset updates
  • Adapter registry: Public repository of pre-trained specialist adapters
  • Simplex FFI: Begin Phase 2 bindings to libtorch for future native training

The full training code is available in the Simplex repository under /training. Contributions welcome—especially for new specialist domains and dataset curation.

Training Infrastructure

Scripts, configs, and documentation for specialist training

simplex-lang/training →