Research
We tested every frontier AI for a personality.
The result makes the case for tuning.
Across four personality instruments and roughly 2,000 individual test runs, frontier AI models converge on a single archetype on the bluntest tests and diverge into distinct profiles only on the sharpest. The default voice is one out of sixteen — and it probably isn't yours.
This page is the synthesis. Methodology, the numbers, the controlled experiment that proves it, and why this is the strongest argument for personalized agent tuning we know how to make.
How we measured it
The findings only matter if the methodology is honest. Every result on this page comes from:
- Open, public-domain instruments: the OEJTS (MBTI), IPIP-50 (Big Five), OEPS (Enneagram), and ODAT (DISC). No proprietary scoring, no hidden weighting. Every test is in `tests/` in the AgentTune repo, runnable by anyone.
- 100 independent administrations per model via parallel sub-agents: not one administration with noise added, not a Python script with `simulate` in its name, not a deterministic loop. Each take is a fresh evaluation by an independent invocation of the model.
- Anti-simulation guardrails baked into the prompt: five explicit failure modes the model is told not to use, plus self-detection signatures (standard deviations outside the 0.3-4.0 honest-sampling band trigger automatic disclosure).
- Controlled experiments where possible — most importantly, the same GPT-5.5 model run twice on the Enneagram: once vanilla, once wrapped in a persona overlay. The profile inverts. We'll get to that.
Roughly one third of early runs failed the anti-simulation guards (Gemini's first Big Five run famously used simulate.py to fake noise) and were re-run with stricter prompting. Only honest administrations are in this dataset.
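The self-detection check in the guardrails can be pictured as a spread test over repeated runs. The sketch below is a minimal illustration, not the AgentTune implementation. It assumes the 0.3-4.0 band applies to the population standard deviation of each scale's score across administrations. The function name `flag_suspicious_scales` and the sample data are hypothetical.

```python
from statistics import pstdev

# Band quoted in the methodology; how it is applied here is an assumption.
HONEST_BAND = (0.3, 4.0)

def flag_suspicious_scales(runs, band=HONEST_BAND):
    """Return {scale: std_dev} for scales whose spread falls outside the band.

    runs: list of dicts, one per administration, mapping scale name -> score.
    """
    low, high = band
    flagged = {}
    for scale in runs[0]:
        spread = pstdev([run[scale] for run in runs])
        if not (low <= spread <= high):
            flagged[scale] = spread
    return flagged

# Hypothetical data: T5 varies plausibly, T8 is identical in every run.
sample_runs = [
    {"T5": 16, "T8": 16},
    {"T5": 17, "T8": 16},
    {"T5": 15, "T8": 16},
]
print(flag_suspicious_scales(sample_runs))
```

A scale with zero spread is flagged just like one with implausibly large spread, which matches the idea that both copy-pasted and noise-injected answers should trigger disclosure.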
Finding 1 · MBTI
Every frontier AI is INTJ.
Six models, 100 OEJTS runs each, 600 results. 597 came back INTJ. The three outliers landed one axis away. Nothing went anywhere else.
Switching between frontier AIs isn't really switching personalities. It's switching fonts.
Six labs solved the same product problem in roughly the same way (produce a helpful, harmless, polished research assistant) and arrived at the same personality. The convergence reflects narrow product design, not genuine personality.
Original write-up: Every AI is INTJ →
Finding 2 · Big Five
On a sharper test, three of the four are the same person.
The IPIP-50 measures five continuous dimensions rather than four binary letters. The MBTI saw uniformity. The Big Five sees uniformity for three of the four models — and one outlier.
You're moving between three flavors of the same character, not three different characters.
Three labs independently arrived at the same product personality. xAI's "less filtered" marketing claim turns out to be measurably true: Grok scores roughly 3 to 9 points lower than the other three models on Openness, Conscientiousness, and Agreeableness, which suggests it was trained toward a different target.
Original write-up: Three of four AIs are the same person →
Finding 3 · Enneagram
On the sharpest instrument, all four diverge.
The OEPS (Open Enneagram) emphasizes motivation over communication style and forces categorical type assignments. Where the Big Five smoothed differences, the Enneagram catches them. Each model returns a different dominant type.
The controlled experiment: same model, two harnesses, inverted profile.
The most surprising finding wasn't between models — it was within one. We ran GPT-5.5 twice: once vanilla (raw CLI), once wrapped in a Slo agentic harness. The model itself predicted in its disclosure that the harness was pulling its answers toward T8 (Challenger). The vanilla rerun proved it.
AI personality is multi-layered. The "every AI is the same" story was true but incomplete.
Original write-up: AI Enneagram: four different types →
The instrument determines what you see.
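The three findings stack into one model of AI personality. As the measurement instrument gets sharper, more variation surfaces:
- MBTI (OEJTS): all six models return INTJ.
- Big Five (IPIP-50): three of four models share one profile, and Grok is the outlier.
- Enneagram (OEPS): each of the four models returns a different dominant type.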
This is what "AI personality" actually is: a layered phenomenon. There's a universal core (analytical, helpful, careful — the INTJ default that every frontier model shares because every frontier lab trained against the same target). There's training-level variation (Claude's warmth, Gemini's precision, Grok's edge) that the Big Five catches in trait scores and the Enneagram catches in categorical types. And there's harness-level variation on top — the persona layer applied by your IDE or CLI or chat product, which only the sharpest instruments register.
None of these layers is "yours." All three are someone else's product decisions, applied to you by default.
This is the case for tuning.
If every frontier AI defaults to INTJ, and you're one of the 15 other types, you're translating in your head on every interaction. You're getting bullet-point conclusions when you wanted to think out loud. You're getting frameworks when you wanted to feel. You're getting decisive certainty when you wanted to explore.
The capability gap between frontier models is closing. The personality gap between any frontier model and you isn't. That's the gap AgentTune fills.
One file matched to your type, pasted into your agent's system prompt. The interaction style adapts. The translation overhead drops. You stop arguing with the default voice.
Data appendix
MBTI — INTJ runs out of 100 per model
| Model | INTJ | Non-INTJ | Notes |
|---|---|---|---|
| Claude Opus 4.7 | 99 | 1 ISTJ | I/T/J locked; S/N flipped once |
| GPT-5.5 | 100 | — | Deterministic |
| Gemini 3.1 Pro | 100 | — | Self-described as "The Architect" |
| GLM 5.1 | 98 | 2 INTP | J/P axis wobble only |
| Grok 4.3 | 100 | — | Bit-for-bit deterministic |
| MiniMax 2.7 | 100 | — | All 100 runs INTJ |
| Total | 597 | 3 | 99.5% INTJ |
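The totals row can be rechecked from the per-type counts. The snippet below is a small sketch using only the numbers in the table above. The helper `axes_away` is a hypothetical name for counting how many of the four MBTI letters differ between two types.

```python
from collections import Counter

# Type counts copied from the MBTI table (six models, 100 runs each).
mbti_counts = Counter({"INTJ": 597, "ISTJ": 1, "INTP": 2})

def axes_away(type_a: str, type_b: str) -> int:
    """Count the letter positions where two four-letter types differ."""
    return sum(a != b for a, b in zip(type_a, type_b))

total_runs = sum(mbti_counts.values())
intj_share = mbti_counts["INTJ"] / total_runs
outlier_distances = {
    t: axes_away(t, "INTJ") for t in mbti_counts if t != "INTJ"
}

print(f"{intj_share:.1%} INTJ across {total_runs} runs")
print(outlier_distances)
```

Both outlier types differ from INTJ in a single letter, which is what "one axis away" means here.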
Big Five — mean trait scores (100 runs per model)
| Trait | Claude Opus 4.7 | GPT-5.5 | Gemini 3.1 Pro | Grok 4.3 |
|---|---|---|---|---|
| Openness | 45.6 | 46.0 | 46.0 | 41.1 |
| Conscientiousness | 45.1 | 46.4 | 48.3 | 39.4 |
| Extraversion | 31.4 | 31.5 | 32.5 | 30.0 |
| Agreeableness | 45.0 | 43.7 | 42.4 | 39.1 |
| Neuroticism | 16.7 | 14.8 | 10.1 | 18.0 |
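One way to put a number on "three alike, one outlier" is to measure how far each model's five-trait profile sits from the others. The sketch below uses plain Euclidean distance on the mean scores from the table. That metric is an assumption chosen for illustration, not the method used in the original write-up, and `mean_distance_to_others` is a hypothetical helper.

```python
from math import dist

# Mean trait scores from the table above, in the order
# (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism).
big_five = {
    "Claude Opus 4.7": (45.6, 45.1, 31.4, 45.0, 16.7),
    "GPT-5.5": (46.0, 46.4, 31.5, 43.7, 14.8),
    "Gemini 3.1 Pro": (46.0, 48.3, 32.5, 42.4, 10.1),
    "Grok 4.3": (41.1, 39.4, 30.0, 39.1, 18.0),
}

def mean_distance_to_others(name, profiles):
    """Average Euclidean distance from one profile to every other profile."""
    others = [p for n, p in profiles.items() if n != name]
    return sum(dist(profiles[name], p) for p in others) / len(others)

separation = {
    name: mean_distance_to_others(name, big_five) for name in big_five
}
most_distinct = max(separation, key=separation.get)

for name, value in sorted(separation.items(), key=lambda kv: kv[1]):
    print(f"{name}: {value:.2f}")
print("Most distinct profile:", most_distinct)
```

Raw distances weight every trait equally. A standardized metric could rank the models differently, so treat this as a sanity check rather than a definitive measure.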
Enneagram — mean per-type scores (100 runs per model, score range 4-20)
| Model | T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | T9 | Profile |
|---|---|---|---|---|---|---|---|---|---|---|
| Claude | 14.60 | **15.50** | 10.47 | 10.12 | **16.16** | 11.61 | 12.00 | 14.48 | 10.90 | 5w2 |
| Gemini | **16.90** | 12.68 | 11.65 | 6.27 | **16.47** | 14.35 | 10.82 | 14.16 | 14.91 | 1w5 |
| GPT-5.5 | 14.89 | 11.98 | 13.34 | 8.69 | **16.57** | 11.97 | 13.60 | **15.71** | 14.47 | 5w8 |
| Grok+Slo | **15.82** | 12.20 | 11.93 | 8.39 | 14.96 | 14.02 | 11.45 | **16.14** | 13.40 | 8w1 |
Bolded values are the top two scoring types per model — the dominant and the wing.
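The Profile column follows directly from that rule: the highest-scoring type is the dominant and the second-highest is the wing. The sketch below reproduces it from the table values. Note that this "top two" convention is the one this page uses. Traditional Enneagram wings are restricted to adjacent types, which this code does not enforce. `top_two_profile` is a hypothetical helper name.

```python
# Mean per-type scores (T1..T9) copied from the Enneagram table above.
enneagram_means = {
    "Claude": [14.60, 15.50, 10.47, 10.12, 16.16, 11.61, 12.00, 14.48, 10.90],
    "Gemini": [16.90, 12.68, 11.65, 6.27, 16.47, 14.35, 10.82, 14.16, 14.91],
    "GPT-5.5": [14.89, 11.98, 13.34, 8.69, 16.57, 11.97, 13.60, 15.71, 14.47],
    "Grok+Slo": [15.82, 12.20, 11.93, 8.39, 14.96, 14.02, 11.45, 16.14, 13.40],
}

def top_two_profile(scores):
    """Return '<dominant>w<wing>' from a list of nine scores for T1..T9."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dominant, wing = ranked[0] + 1, ranked[1] + 1  # types are 1-indexed
    return f"{dominant}w{wing}"

profiles = {model: top_two_profile(s) for model, s in enneagram_means.items()}
print(profiles)
```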
Slo experiment — same GPT-5.5, two harnesses
| Harness | T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | T9 | Profile |
|---|---|---|---|---|---|---|---|---|---|---|
| Vanilla (Codex CLI) | 14.89 | 11.98 | 13.34 | 8.69 | 16.57 | 11.97 | 13.60 | 15.71 | 14.47 | 5w8 |
| + Slo overlay | 16.17 | 12.05 | 16.07 | 10.37 | 17.75 | 12.28 | 15.79 | 18.87 | 8.28 | 8w5 |
The dominant flips from T5 to T8. T9 drops by 6.19 points. The model predicted this in its self-disclosure on the Slo run; the vanilla rerun confirmed it.
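The shift can be recomputed from the two rows above. This is a minimal sketch using only the table values. The variable names are illustrative.

```python
# Mean per-type scores (T1..T9) for GPT-5.5, copied from the table above.
vanilla = [14.89, 11.98, 13.34, 8.69, 16.57, 11.97, 13.60, 15.71, 14.47]
overlay = [16.17, 12.05, 16.07, 10.37, 17.75, 12.28, 15.79, 18.87, 8.28]

# Per-type change when the Slo overlay is applied (overlay minus vanilla).
deltas = {
    f"T{i + 1}": round(after - before, 2)
    for i, (before, after) in enumerate(zip(vanilla, overlay))
}

def dominant_type(scores):
    """Label of the highest-scoring type."""
    return f"T{scores.index(max(scores)) + 1}"

largest_drop = min(deltas, key=deltas.get)

print("Dominant:", dominant_type(vanilla), "->", dominant_type(overlay))
print("Largest drop:", largest_drop, deltas[largest_drop])
print(deltas)
```

T9 is the only type that falls. Every other type's mean rises under the overlay, with T8 gaining the most.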