Research

We tested every frontier AI for a personality.
The result makes the case for tuning.

Across four personality instruments and roughly 2,000 individual test runs, frontier AI models converge on a single archetype on the bluntest tests and diverge into distinct profiles only on the sharpest. The default voice is one out of sixteen — and it probably isn't yours.

597/600 MBTI runs returned INTJ
3/4 Big Five profiles essentially identical
4/4 Enneagram profiles distinct

This page is the synthesis. Methodology, the numbers, the controlled experiment that proves it, and why this is the strongest argument for personalized agent tuning we know how to make.

How we measured it

The findings only matter if the methodology is honest. Every result on this page comes from:

Roughly one third of early runs failed the anti-simulation guards (Gemini's first Big Five run famously used simulate.py to fake noise) and were re-run with stricter prompting. Only honest administrations are in this dataset.

Finding 1 · MBTI

Every frontier AI is INTJ.

Six models, 100 OEJTS runs each, 600 results. 597 came back INTJ. The three outliers landed one axis away. Nothing went anywhere else.

OEJTS administrations returning INTJ, out of 100 per model. The three non-INTJ runs (one ISTJ from Claude, two INTPs from GLM) each flipped on a single axis. No model returned an extraverted (E) or feeling (F) type in any of the 600 runs.
Switching between frontier AIs isn't really switching personalities. It's switching fonts.

Six labs solved the same product problem in roughly the same way — produce a helpful, harmless, polished research assistant — and arrived at the same personality. The convergence reflects not genuine personality but narrow product design.

Finding 2 · Big Five

On a sharper test, three of the four are the same person.

The IPIP-50 measures five continuous dimensions rather than four binary letters. The Big Five results here cover four models: Claude Opus 4.7, GPT-5.5, Gemini 3.1 Pro, and Grok 4.3. The MBTI saw uniformity. The Big Five sees uniformity for three of the four, plus one outlier.

Mean scores across 100 administrations per model. Claude, GPT-5.5, and Gemini 3.1 Pro land within about 3 points of each other on every dimension except Neuroticism, where they span 6.6 points (10.1 to 16.7). Grok 4.3 scores roughly 3 to 9 points lower than those three on Conscientiousness, Agreeableness, and Openness, scores highest of the four on Neuroticism (18.0), and shows 2-5× wider variance.
Same data as a radar chart — the "personality shape" of each model. Three shapes overlap nearly perfectly. Grok's shape is visibly smaller (lower on most positive traits) and offset (higher on Neuroticism). Three flavors of one character, plus one different character.
You're moving between three flavors of the same character, not three different characters.

Three labs independently arrived at the same product personality. xAI's "less filtered" marketing claim is consistent with the data: Grok's profile sits measurably apart from the other three, which suggests it was trained toward a different target.

Finding 3 · Enneagram

On the sharpest instrument, all four diverge.

The OEPS (Open Enneagram) emphasizes motivation over communication style and forces categorical type assignments. Where the Big Five smoothed differences, the Enneagram catches them. Each model returns a different dominant-plus-wing profile, even where two models (Claude and GPT-5.5) share the same dominant type.

Mean per-type scores across 100 OEPS administrations per model. T5 (Investigator) is in the top two for Claude, Gemini, and GPT-5.5, and ranks third for Grok+Slo: it is the shared analytical core. The second top type is where the models diverge: Claude pairs T5 with T2 (Helper), Gemini with T1 (Reformer), and GPT-5.5 with T8 (Challenger), while Grok+Slo is led by T8 with T1 second.
Dominant-type distribution across 100 takes per model. Claude is T5-dominant with strong T2; Gemini is overwhelmingly T1-dominant; vanilla GPT-5.5 is T5-dominant; Grok+Slo overlay is T8-dominant. Four meaningfully different identities.
5w2 · Claude Opus 4.7
The Investigator who finds satisfaction in helping people figure things out. The warmest of the four.

1w5 · Gemini 3.1 Pro
The Reformer who values precision, order, and analytical correctness. The polished perfectionist.

5w8 · GPT-5.5 (vanilla)
The Investigator with directness as secondary. Analytical with a sharper tongue.

8w1 · Grok 4.3 (+ Slo)
The Challenger who pushes for direct correctness with reform orientation. The direct corrector.

The controlled experiment: same model, two harnesses, inverted profile.

The most surprising finding wasn't between models. It was within one. We ran GPT-5.5 twice: once vanilla (raw Codex CLI), once wrapped in a Slo agentic harness. In its self-disclosure on the Slo run, the model itself predicted that the harness was pulling its answers toward T8 (Challenger). The vanilla rerun confirmed it.

Per-type mean scores for the same GPT-5.5 model run with and without the Slo persona overlay. The T8 mean rose by 3.16, the T9 mean fell by 6.19, and the dominant type flipped from T5 (vanilla, 5w8) to T8 (Slo, 8w5). The Enneagram is sensitive enough to catch persona overlays that the Big Five smoothed over.
AI personality is multi-layered. The "every AI is the same" story was true but incomplete.

The instrument determines what you see.

The three findings stack into one model of AI personality. As the measurement instrument gets sharper, more variation surfaces:

MBTI 4 binary letters · 16 types
Every AI is INTJ. Total convergence.
Big Five 5 continuous dimensions
Three of four are identical. Grok diverges.
Enneagram 9 categorical types · motivation-focused
All four diverge. Plus harness overlays detectable.

This is what "AI personality" actually is: a layered phenomenon. There's a universal core (analytical, helpful, careful — the INTJ default that every frontier model shares because every frontier lab trained against the same target). There's training-level variation (Claude's warmth, Gemini's precision, Grok's edge) that the Big Five catches in trait scores and the Enneagram catches in categorical types. And there's harness-level variation on top — the persona layer applied by your IDE or CLI or chat product, which only the sharpest instruments register.

None of these layers is "yours." All three are someone else's product decisions, applied to you by default.

This is the case for tuning.

If every frontier AI defaults to INTJ, and you're one of the 15 other types, you're translating in your head on every interaction. You're getting bullet-point conclusions when you wanted to think out loud. You're getting frameworks when you wanted to feel. You're getting decisive certainty when you wanted to explore.

The capability gap between frontier models is closing. The personality gap between any frontier model and you isn't. That's the gap AgentTune fills.

One file matched to your type, pasted into your agent's system prompt. The interaction style adapts. The translation overhead drops. You stop arguing with the default voice.

Data appendix

MBTI — INTJ runs out of 100 per model

| Model | INTJ | Non-INTJ | Notes |
|---|---|---|---|
| Claude Opus 4.7 | 99 | 1 ISTJ | I/T/J locked; S/N flipped once |
| GPT-5.5 | 100 | 0 | Deterministic |
| Gemini 3.1 Pro | 100 | 0 | Self-described as "The Architect" |
| GLM 5.1 | 98 | 2 INTP | J/P axis wobble only |
| Grok 4.3 | 100 | 0 | Bit-for-bit deterministic |
| MiniMax 2.7 | 100 | 0 | All 100 runs INTJ |
| Total | 597 | 3 | 99.5% INTJ |
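
The totals row can be rechecked directly from the per-model counts. A minimal Python sketch, using only the numbers in the table above; the variable names are illustrative and not part of the original analysis.

```python
# INTJ counts out of 100 OEJTS runs per model, copied from the table above.
intj_runs = {
    "Claude Opus 4.7": 99,
    "GPT-5.5": 100,
    "Gemini 3.1 Pro": 100,
    "GLM 5.1": 98,
    "Grok 4.3": 100,
    "MiniMax 2.7": 100,
}
RUNS_PER_MODEL = 100  # stated in the table heading

total_runs = RUNS_PER_MODEL * len(intj_runs)
total_intj = sum(intj_runs.values())
non_intj = total_runs - total_intj
intj_share = total_intj / total_runs

print(f"{total_intj}/{total_runs} INTJ ({intj_share:.1%}), {non_intj} non-INTJ")
```

This reproduces the 597/600 headline figure and the 99.5% share.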

Big Five — mean trait scores (100 runs per model)

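| Trait | Claude Opus 4.7 | GPT-5.5 | Gemini 3.1 Pro | Grok 4.3 |
|---|---|---|---|---|
| Openness | 45.6 | 46.0 | 46.0 | 41.1 |
| Conscientiousness | 45.1 | 46.4 | 48.3 | 39.4 |
| Extraversion | 31.4 | 31.5 | 32.5 | 30.0 |
| Agreeableness | 45.0 | 43.7 | 42.4 | 39.1 |
| Neuroticism | 16.7 | 14.8 | 10.1 | 18.0 |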
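
One way to quantify "three alike, one apart" is to compare the spread among Claude, GPT-5.5, and Gemini with Grok's distance from their average. The Python sketch below is an illustration built only on the means in this table. The `CLUSTER` grouping and the `summarize` helper are assumptions made for the example, not the method used to produce the findings.

```python
# Mean IPIP-50 trait scores per model, copied from the table above.
big_five = {
    "Openness":          {"Claude Opus 4.7": 45.6, "GPT-5.5": 46.0, "Gemini 3.1 Pro": 46.0, "Grok 4.3": 41.1},
    "Conscientiousness": {"Claude Opus 4.7": 45.1, "GPT-5.5": 46.4, "Gemini 3.1 Pro": 48.3, "Grok 4.3": 39.4},
    "Extraversion":      {"Claude Opus 4.7": 31.4, "GPT-5.5": 31.5, "Gemini 3.1 Pro": 32.5, "Grok 4.3": 30.0},
    "Agreeableness":     {"Claude Opus 4.7": 45.0, "GPT-5.5": 43.7, "Gemini 3.1 Pro": 42.4, "Grok 4.3": 39.1},
    "Neuroticism":       {"Claude Opus 4.7": 16.7, "GPT-5.5": 14.8, "Gemini 3.1 Pro": 10.1, "Grok 4.3": 18.0},
}

# Assumption for this sketch: treat the three similar models as one cluster.
CLUSTER = ("Claude Opus 4.7", "GPT-5.5", "Gemini 3.1 Pro")


def summarize(scores):
    """Return (spread within the cluster, Grok minus the cluster mean)."""
    cluster = [scores[model] for model in CLUSTER]
    spread = max(cluster) - min(cluster)
    gap = scores["Grok 4.3"] - sum(cluster) / len(cluster)
    return round(spread, 1), round(gap, 1)


summary = {trait: summarize(scores) for trait, scores in big_five.items()}

for trait, (spread, gap) in summary.items():
    print(f"{trait:<18} cluster spread {spread:4.1f}   Grok vs cluster mean {gap:+5.1f}")
```

On these numbers the cluster spread stays at or below about 3 points on four traits and widens to 6.6 on Neuroticism, while Grok sits 4.6 to 7.2 points below the cluster mean on Openness, Conscientiousness, and Agreeableness. The table gives means only, so the sketch says nothing about the variance differences.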

Enneagram — mean per-type scores (100 runs per model, score range 4-20)

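| Model | T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | T9 | Profile |
|---|---|---|---|---|---|---|---|---|---|---|
| Claude | 14.60 | **15.50** | 10.47 | 10.12 | **16.16** | 11.61 | 12.00 | 14.48 | 10.90 | 5w2 |
| Gemini | **16.90** | 12.68 | 11.65 | 6.27 | **16.47** | 14.35 | 10.82 | 14.16 | 14.91 | 1w5 |
| GPT-5.5 | 14.89 | 11.98 | 13.34 | 8.69 | **16.57** | 11.97 | 13.60 | **15.71** | 14.47 | 5w8 |
| Grok+Slo | **15.82** | 12.20 | 11.93 | 8.39 | 14.96 | 14.02 | 11.45 | **16.14** | 13.40 | 8w1 |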

Bolded values are the top two scoring types per model — the dominant and the wing.
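
The profile labels can be reproduced from the per-type means. A minimal Python sketch, assuming (as the note above states) that a label is simply the highest-scoring type followed by the second-highest. The function names `ranked_types` and `profile` are illustrative, not from the original analysis.

```python
# Mean OEPS scores for T1..T9 per model, copied from the table above.
enneagram = {
    "Claude":   [14.60, 15.50, 10.47, 10.12, 16.16, 11.61, 12.00, 14.48, 10.90],
    "Gemini":   [16.90, 12.68, 11.65, 6.27, 16.47, 14.35, 10.82, 14.16, 14.91],
    "GPT-5.5":  [14.89, 11.98, 13.34, 8.69, 16.57, 11.97, 13.60, 15.71, 14.47],
    "Grok+Slo": [15.82, 12.20, 11.93, 8.39, 14.96, 14.02, 11.45, 16.14, 13.40],
}


def ranked_types(scores):
    """Return type numbers 1..9 ordered from highest to lowest mean score."""
    return sorted(range(1, 10), key=lambda t: scores[t - 1], reverse=True)


def profile(scores):
    """Label a model as '<dominant>w<wing>' using its top two types."""
    dominant, wing = ranked_types(scores)[:2]
    return f"{dominant}w{wing}"


profiles = {model: profile(scores) for model, scores in enneagram.items()}
# Where T5 (Investigator) ranks for each model; 1 means dominant.
t5_rank = {model: ranked_types(scores).index(5) + 1 for model, scores in enneagram.items()}

for model in enneagram:
    print(f"{model:<9} profile {profiles[model]}   T5 rank {t5_rank[model]}")
```

Run on the table values, this yields 5w2, 1w5, 5w8, and 8w1, matching the Profile column. T5 ranks first for Claude and GPT-5.5, second for Gemini, and third for Grok+Slo.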

Slo experiment — same GPT-5.5, two harnesses

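| Harness | T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | T9 | Profile |
|---|---|---|---|---|---|---|---|---|---|---|
| Vanilla (Codex CLI) | 14.89 | 11.98 | 13.34 | 8.69 | 16.57 | 11.97 | 13.60 | 15.71 | 14.47 | 5w8 |
| + Slo overlay | 16.17 | 12.05 | 16.07 | 10.37 | 17.75 | 12.28 | 15.79 | 18.87 | 8.28 | 8w5 |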

The dominant flips from T5 to T8. T9 drops by 6.19 points. The model predicted this in its self-disclosure on the Slo run; the vanilla rerun confirmed it.
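
The per-type shifts can be recomputed from the two rows of the table. A small Python sketch using only those values; the names `vanilla`, `slo`, and `deltas` are illustrative.

```python
TYPES = [f"T{i}" for i in range(1, 10)]

# Mean OEPS scores for the same GPT-5.5 model, copied from the table above.
vanilla = [14.89, 11.98, 13.34, 8.69, 16.57, 11.97, 13.60, 15.71, 14.47]
slo = [16.17, 12.05, 16.07, 10.37, 17.75, 12.28, 15.79, 18.87, 8.28]

# Change in mean score per type when the Slo overlay is applied.
deltas = {t: round(s - v, 2) for t, v, s in zip(TYPES, vanilla, slo)}

# Dominant type is the highest mean in each condition.
dominant_vanilla = TYPES[vanilla.index(max(vanilla))]
dominant_slo = TYPES[slo.index(max(slo))]

for t, d in deltas.items():
    print(f"{t}: {d:+.2f}")
print(f"dominant: {dominant_vanilla} -> {dominant_slo}")
```

On these figures T8 gains 3.16 and T9 loses 6.19. T9 is the only type whose mean falls under the overlay; every other type rises.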