Domesticating Silicon: Why Language Models Suck at Math—and Why That’s Not the Point
Inside the Strange Limits of Language Models—and the Revolution They Accidentally Unleashed
You ask a machine built entirely out of math to do some math.
It thinks.
It calculates.
It says: 384 + 129 = 715.
Welcome to the paradox of the hallucinating calculator—a brain made of numbers that stumbles over arithmetic.
But here’s the twist:
This isn’t a bug.
Not undertraining.
Not even surprising—once you understand what these models really are.
Because here’s the secret:
They weren’t built for anything.
Not to answer trivia. Not to write code. Not even to “know” things.
They were built to test a hunch:
What happens if we scale this?
And weirdly… it worked.
1. This Was Never the Plan
The mistake is thinking LLMs are failing at something they were meant to do.
They weren’t meant to do anything.
Not math. Not logic. Not even coherence.
They weren’t born from product visions—they were born out of frustration:
Symbolic AI was brittle. Expert systems failed. Logic trees collapsed under complexity.
Then, around 2017, researchers tried something new:
Transformers.
Trained on… everything.
Then they watched what happened.
There was no roadmap.
No master plan.
Just a bet: maybe scale unlocks something.
It did.
So when people ask, “Why can’t it do math?”
The honest answer is:
Because it wasn’t built to. And that’s both its power—and its limit.
2. This Is the Dial-Up Era
Remember the internet in the ‘90s?
Dial-up modems.
14.4 kbps.
Web pages loading one image slice at a time.
It was awkward. Noisy. Kind of magical.
That’s where we are with LLMs.
This is the screech of the modem.
It’s clumsy. Sometimes wrong.
But the signal is clear: something big is coming online.
So when you hear “ChatGPT sucks at math”, imagine someone in 1997 saying,
“This internet thing is useless—it can’t even do video calls.”
Of course it can’t.
We’re still laying the cables.
3. Why LLMs Hallucinate Math
Here’s what’s actually going on:
Language models don’t calculate.
They predict.
Given a sequence of tokens, they guess what comes next.
Not based on logic. Based on likelihood.
So when you ask it for 384 + 129, it doesn’t compute.
It remembers what math looks like.
It simulates an answer.
That’s why it can explain black holes like a physics grad…
And then tell you 7 + 5 = 75.
Language is fuzzy.
Math is not.
And LLMs are linguistic engines.
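To make "prediction, not calculation" concrete, here's a minimal sketch. It assumes the open-source Hugging Face transformers library and the small GPT-2 model, which are stand-ins for illustration; the commercial models in this essay are far larger, but the principle is the same. Nothing in this code adds the numbers. It only ranks which token looks most likely to come next.

```python
# Minimal sketch of next-token prediction, assuming the Hugging Face
# `transformers` library and the small open GPT-2 model (illustrative
# stand-ins; larger chat models work on the same principle).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("384 + 129 =", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits              # one score per vocabulary token, per position

next_token_probs = logits[0, -1].softmax(dim=-1)  # distribution over the *next* token
top = next_token_probs.topk(5)

# No addition happens anywhere above. The model only ranks which token
# looks most likely to follow "384 + 129 =" in the text it was trained on.
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id.item())!r}  p={prob.item():.3f}")
```

Run it and you'll see digits near the top of the ranking, but they're chosen by statistical plausibility, not arithmetic.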
4. This Isn’t a Training Problem
“Just train it on more math!”
Sure. But it won’t fix the problem.
You can feed a model every math textbook ever written.
It’ll still hallucinate.
Why?
Because LLMs don’t reason.
They continue.
They don’t bind variables.
They don’t track state.
They don’t follow formal logic.
They dissolve structure into vector math—then try to piece it back together as plausible text.
This isn’t a content issue.
It’s a design constraint.
And that constraint?
It’s baked in.
5. But That’s Starting to Change
Here’s where it gets interesting.
Models are starting to realize: “I’m not good at this—let me ask for help.”
This is called tool calling.
When they hit a wall, they outsource:
To Python.
To Wolfram.
To calculators.
ChatGPT is starting to call functions mid-chat.
Claude is building app-like artifacts inside conversations.
New agents hand off logic to symbolic engines.
The model doesn’t need to be a calculator.
It just needs to know when to call one.
And that shift?
Changes everything.
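Stripped down, that hand-off looks something like this. It's a toy sketch: the JSON message format and the "calculator" tool are invented for illustration, not any vendor's actual API, but real function-calling loops follow the same shape. The model proposes a call, the host runs it, and the result goes back into the conversation.

```python
# Toy sketch of tool calling. The message format and "calculator" tool
# are invented for illustration; real APIs differ in detail but follow
# the same loop: model proposes a call, host executes it, result is
# fed back into the conversation for the model to use.
import json

def calculator(expression: str) -> str:
    # Deterministic arithmetic: no prediction involved.
    # (A real system would use a proper expression parser, not eval.)
    return str(eval(expression, {"__builtins__": {}}, {}))

TOOLS = {"calculator": calculator}

def handle_model_output(output: str) -> str:
    """If the model emitted a tool call, run the tool and return its result."""
    try:
        call = json.loads(output)
    except json.JSONDecodeError:
        return output                      # ordinary text, no tool needed
    return f"Tool result: {TOOLS[call['tool']](call['arguments'])}"

# Instead of guessing "715", the model emits a structured request:
model_output = '{"tool": "calculator", "arguments": "384 + 129"}'
print(handle_model_output(model_output))   # -> Tool result: 513
```

The division of labor is the point: the language model decides when arithmetic is needed, and the tool actually does it.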
6. Chain-of-Thought Is a Bridge
One workaround is called chain-of-thought prompting—basically, asking the model to “show its work.”
Like:
48 × 27
= (50 × 27) - (2 × 27)
= 1350 - 54
= 1296
It doesn’t make the model smarter—just better at simulating structure.
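In code, chain-of-thought is mostly just a prompt template. Here's a minimal sketch; `ask_model` is a hypothetical stand-in for whichever chat API you happen to be calling.

```python
# Chain-of-thought as a prompt template. `ask_model` is a hypothetical
# stand-in for whatever chat API you're using; the technique is simply
# the instruction to write intermediate steps before the final answer.
def chain_of_thought_prompt(question: str) -> str:
    return (
        "Solve the problem step by step. Write each intermediate "
        "calculation on its own line, then state the final answer.\n\n"
        f"Problem: {question}"
    )

prompt = chain_of_thought_prompt("What is 48 x 27?")
# answer = ask_model(prompt)
# A typical reply mirrors the worked example above:
#   (50 x 27) - (2 x 27) = 1350 - 54 = 1296
print(prompt)
```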
Now, new models are pushing this further.
DeepSeek-R1. OpenAI’s o1 and o3.
They allocate extra compute at inference time, pausing to run internal steps before responding.
But even that isn’t reasoning.
It’s better prediction.
Until models are integrated with real tools and engines, hallucinations will remain.
7. This Isn’t the End—It’s the Beginning
People look at LLMs and expect finished products.
Like being mad that your toaster can’t make pancakes.
But this isn’t the toaster.
This is fire.
It’s early. Raw. Unstable.
But it changes everything.
This is the first system that made language computable at scale.
Expert systems never did that.
Symbolic logic didn’t get us there.
Even deep learning, pre-transformers, didn’t generalize like this.
This is the first architecture that unlocked language as a system.
And now we’re building on it:
Making it modular. Grounded. Verifiable.
So yes—today, math is hard for LLMs.
But not because they’re broken.
Because they’re early.
The Paradox at the Heart
Everything inside the system is math.
The training?
Math.
The GPUs?
Math.
The architecture?
Math all the way down.
LLMs don’t “read”—they compute token probabilities in high-dimensional space.
They don’t “write”—they simulate likely words using linear algebra.
These are mathematical objects.
And yet… they suck at math.
That’s the paradox.
They simulate a linguistic world inside a mathematical one—then fail to replicate the very rules that built them.
It’s like a piano made of music… that can’t play a clean chord.
Because what’s missing isn’t numbers—it’s symbolic grounding.
Arithmetic doesn’t emerge just from reading about it.
It needs identity. Structure. Logic. State.
LLMs dissolve all of that to make predictions—then trip over the patterns they can’t reconstruct.
They’re not broken.
They’re doing exactly what they were built to do:
Predict what math looks like.
Not what math is.
That tension you feel when it gets a sum wrong with complete confidence?
That’s not failure.
That’s the paradox singing.
Enjoyed this?
Subscribe to Domesticating Silicon and SIG Science for more essays on what LLMs can and can’t do—and why AI today feels a lot like the early internet: clumsy, loud… and the start of something massive.