LLMs are the coolest thing happening in software right now. This software can, with comparatively trivial prompting effort, perform computations that you and I will never succeed at coding by hand nor justify developing a single-purpose ML model for.
There's no use covering that over and over. If you know, you know; if you don't, you will.
How can I participate in this event?
I can use LLMs in all the various ways. I can build systems of LLMs that solve all kinds of problems. I can do beginner-level development of a toy LLM. I can keep on top of all the ideas for improving LLMs.
I can and do and will do all of those things. I wonder if I can do any more?
If, hypothetically, I wanted to contribute in some way to the cutting edge of AI development, what would I do?
Note my blog post a few days ago about rearchitecting the way we think about programming languages. A day of research found that (counter to my initial thought) I was not creating any more opportunities for optimization, though there may be some lingering ideas there for user experience (multiple views of the same underlying AST, AST-oriented diffing, percolating compiled performance implications up to the IDE to display while coding, etc.). The fundamental contributions I have a shot at making are not in the optimization space. They're in the developer experience space. That may be worthwhile! And it is good for me to realize which of those I should really focus my attention on.
In the LLM space, I will not be training any usable foundation models, because I do not have a few billion extra dollars to spend on compute. I will not be researching the optimal 20% improvement ideas on how to multiply 16-bit floating-point numbers efficiently, nor how to squeeze another Gbps out of the GPU's memory architecture. I could design some novel orchestration structure, but honestly that looks well covered by MCP and the million companies big and small thinking about that very hard.
What can I do to push the frontier in AI?
Well, if I were to push that frontier, it would occur under these constraints:
2 hr/workday × 5 workdays + 5 hr/weekend day × 2 weekend days = 20 hours per week. I have a real job and a life.

What kind of AI contribution would occur under those constraints?
It'd have these characteristics:
Fortunately, I do have some ideas there.
Firstly, wait for my forthcoming book of philosophy.
In the meantime:
One way to make intelligent decisions is to know so much about the world (past, present, and structural) that you can accurately simulate future events under each possible decision.
We clearly do that to some degree. You can imagine a machine that knows more than we do, which would improve its simulations' accuracy. You can imagine a machine that can more faithfully simulate more possibilities than we can in a given unit of time.
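To make that concrete, here is a toy sketch of decision-by-simulation in Python. The world_model and score callables are hypothetical stand-ins for whatever knowledge the decider actually carries around; the better they are, the better the decisions.

    # Toy sketch of "intelligence as simulation": for each available decision,
    # roll a (hypothetical) world model forward many times and pick the decision
    # whose simulated futures score best on average.

    def choose_by_simulation(state, decisions, world_model, score, rollouts=100):
        best_decision, best_value = None, float("-inf")
        for decision in decisions:
            total = 0.0
            for _ in range(rollouts):
                future = world_model(state, decision)  # one sampled future
                total += score(future)                 # how much we like it
            value = total / rollouts
            if value > best_value:
                best_decision, best_value = decision, value
        return best_decision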
This is a broad vibe encapsulated by the scaling LLM paradigm: Give it so much information about everything in the world that it can accurately operate within effectively arbitrary contexts.
Another way that we sometimes imagine intelligence is to possess some compact core algorithm, or some kind of internal reasoning oracle, that can make good decisions with seemingly no (or vanishingly little) prior knowledge about that context.
Such intelligence does not know the names of all the wives of the twelfth king of France. Such intelligence has never played chess. Such intelligence has never encountered any example of a Trojan horse.
Yet, we imagine, this intelligence can beat us at chess and detect a Trojan horse situation and read or query all the historical references to find the wives of the twelfth king of France for the purpose of writing a graduate thesis on European monarchies.
I think the issue with this latter intelligence is clear: You have to know something about the world in order to do anything meaningful in it. If such a machine existed, it would be of no use until you'd fed it enough data, one way or another, that it knew enough about the world to understand what you wanted it to do and to manipulate its real-world environment sufficiently to achieve its objectives.
The former type of intelligence, the one saturated with all information in the universe, has no such problem. It can always make good decisions. It knows everything and can simulate all possible futures as well as or better than you can, and it will do pretty well most of the time.
The issue with the former intelligence is that it is very expensive to make! Expensive in terms of hardware, training time, and the sheer scarcity of data.
Real LLMs are somewhere in the middle. They have lots of data, but obviously not enough to fully simulate all possible futures. They have also clearly condensed from that data some circuits of high quality world-navigating heuristics or sub-models buried in their weights.
Humans are also somewhere in the middle. We train on way less data than the big foundation LLMs. Any challenge we still pose to LLMs is attributable to high-quality world-navigating sub-models that most of us build during our training, thanks to structures that consistently emerge from our biology combined with aspects of our senses feeding data to those structures throughout our ten- or twenty-year pre-training phase.
The open research question for what I'm calling Minimal Intelligence is: What structure and training regimen will consistently produce a machine capable of decision quality comparable to humans or LLMs (in a more limited set of real-world domains, since it just won't have as many world-facts memorized and so can't operate everywhere without further training) while training on a much smaller set of data than either?
This Minimal Intelligence almost certainly won't know the wives of the twelfth king of France. And it should not! It might not know the rules of chess. And perhaps it shouldn't! But it should be able to spend a few minutes reviewing the rules of chess, communicated to it via some language it has been trained to understand, and maybe play one practice game of chess, and then be able to beat you on most games henceforth.
That is Minimal Intelligence. That is what I want to build.
As hinted at above, the important part of Minimal Intelligence is that it has some seed structure and some system for feeding it training data such that it grows the rest of the structures of useful intelligence consistently.
You should be able to "start from scratch" and train comparable intelligences from this seed most times.
The easiest way to find that seed structure is almost certainly to first create a supermassive machine of the data-maximized LLM variety and then distill the meaningful sub-models from its ginormous structure into a much smaller core model that knows all the "being intelligent" things but none or very little of the "facts about the world" things.
You then train this seed structure with a curated set of training data sufficient to provide an interface between the core intelligence and the world it should be operating in, including your method of communicating with it.
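Here is a minimal sketch of those two phases, assuming an ordinary PyTorch teacher/student distillation setup. The models, the "reasoning-heavy" distillation batches, and the curated interface corpus are all placeholders rather than anything I've actually built.

    import torch
    import torch.nn.functional as F

    # Phase one: distill a huge open-weights "teacher" toward a tiny "seed"
    # student by matching output distributions (standard logit distillation).
    def distill_step(teacher, student, batch, optimizer, temperature=2.0):
        with torch.no_grad():
            teacher_logits = teacher(batch)                 # (batch, vocab)
        student_logits = student(batch)
        loss = F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(teacher_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    # Phase two: teach the distilled seed just enough language/interface to
    # communicate with, using a small curated corpus and plain cross-entropy.
    def interface_step(student, batch, targets, optimizer):
        loss = F.cross_entropy(student(batch), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()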
Notably: This is one thing I can do. I can download the biggest open weights models and research new methods of distilling that core intelligence from it.
Note: This is totally different from people distilling 70B parameter models into 8B parameter models. What I'm doing would make it much dumber than even those. The result of the distillation I'm describing would not even be able to chat with you, because it won't know enough English to reply to "Hello". That would have to be trained back into it.
It's worth noting for philosophical completeness that one way I could create a Minimal Intelligence is to create an evaluation framework, pick some sort of structure (10 layers, 1B weights, etc.), and randomly generate the weights. Evaluate. Randomly generate a new set of weights. I could get lucky and land on a set of weights that happens to exhibit strong intelligent characteristics!
The odds of that are unfathomably low, so this is a useless thought. But it pumps our intuition of what we're looking for: the structure.
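For intuition's sake, that hopeless baseline is easy to write down. Here, evaluate is a placeholder for whatever intelligence benchmark you trust; everything interesting lives in the structure, not the sampling.

    import torch.nn as nn

    # The (hopeless) random-search baseline: fix a structure, sample fresh random
    # weights, keep whichever sample scores best. `evaluate` is a placeholder.
    def random_weight_search(evaluate, layers=10, width=1024, samples=1000):
        best_model, best_score = None, float("-inf")
        for _ in range(samples):
            model = nn.Sequential(*[
                nn.Sequential(nn.Linear(width, width), nn.ReLU())
                for _ in range(layers)
            ])
            score = evaluate(model)   # weights are already randomly initialized
            if score > best_score:
                best_model, best_score = model, score
        return best_model, best_score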
The human brain is an intelligent machine. Unfortunately, its weights are harder to copy and distill than those of an LLM. We can certainly get some inspiration from there, but I feel like that's something a number of folks have likely worked on in some non-trivial amount, so I'm not viewing it as my advantage given my constraints.
The smartest human in the world, given 2 seconds to think, will be hard-pressed to consistently beat a median human given 10 minutes or 1 hour or 6 months to think. (After accounting for experience, within reason. Obviously a smart human with 10 years of experience programming computers can come to a conclusion about a coding problem in 2 seconds that a median human with no coding experience may genuinely take a year to figure out, because they may have to learn everything about programming first.)
Some machine learning models seem to have the following issue: The only way the model can spend time thinking is by actually doing activity in the world.
A bare LLM is like this. People used to write dumb prompts like "Consider this extremely complicated question with many parts and accurately reply with exactly one word indicating the correct answer right... now: " and then the LLM was required to answer "Geocentricity". If it said any other words before writing the correct answer, or if it spat out the wrong answer followed by a bunch of text wandering through the problem space and then spat out the correct answer later on, then people thought, "Oh no! It's wrong!"
Well, duh it was wrong! You didn't give it any time to think.
The large model makers have implemented a method to protect people from this horrible misunderstanding of how to use the model: They give the model space to write out as many words as the model wants into a private text box before beginning writing the actual reply to the human user. The model makers have dubbed this feature "reasoning".
This is a significant improvement! But it's a bit awkward. The model interacts with the world by writing text, and the only way for it to "think" is by "acting". It has to write to the world in order to think. There is no internal thinking.
That's all due to the design of the LLM. And that's fine. It doesn't really matter whether the thinking is some "internal" process of recursing through logic circuits with evolving thought embeddings before taking an external action or literally just writing to a private text file for a while before writing to a public text file; those are the same thing at their core.
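Mechanically, that "reasoning" feature amounts to something like the following sketch, where generate stands in for whatever next-token sampling loop the model actually uses:

    # "Reasoning" as write-to-a-private-buffer-first. `generate` is a placeholder
    # for the model's ordinary text-completion call.

    def answer_with_scratchpad(generate, question, think_budget=2048):
        # Phase 1: the model writes freely into text the user never sees.
        scratchpad = generate(prompt=question, max_tokens=think_budget)
        # Phase 2: conditioned on its own private text, it writes the public reply.
        final_prompt = question + "\n\n" + scratchpad + "\n\nFinal answer:"
        return generate(prompt=final_prompt, max_tokens=256)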
The Minimal Intelligence we're imagining needs some kind of way to ponder the implications of the rules of chess to itself for some time before it begins playing its first game against us.
I can talk to an LLM very easily because we speak the same language.
I can also talk to a multi-modal LLM via text, and it can respond to me with a picture, which I can read but cannot write. I cannot just produce images. It's a ton of work for me to try to make a picture representing my thought, and the result will be horrifically low fidelity unless I spend a lot of time on it.
I can also write text to a text-only LLM and then that model can write text to a multi-modal LLM, and then that model can output a picture, and that picture can be fed to an LLM, which can respond with some text, etc.
We can play this same game of telephone with videos, 3D models, different programming languages, assembly (I want to test this more), Abstract Syntax Trees (very intriguing), and anything else we can tokenize.
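In code, the telephone chain is just function composition across modalities. Every model in the sketch below is a hypothetical stub for some real text, image, or captioning model.

    # Game of telephone across modalities. Each argument is a stand-in callable:
    # text_model: str -> str, image_model: str -> image, caption_model: image -> str.

    def telephone(prompt, text_model, image_model, caption_model):
        text = text_model(prompt)        # text in, text out
        image = image_model(text)        # text in, picture out
        caption = caption_model(image)   # picture in, text out
        return text_model(caption)       # and back to text again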
Communicating with me in English requires a fairly huge amount of knowledge about the English language. Though knowing the corpus of the English language may (and apparently does) assist in imparting a prodigious amount of intelligence upon the knower, my entire premise of Minimal Intelligence is that one can imbue a machine with the structure of that intelligence without teaching it all of the English language at the same time.
But if it doesn't speak English, then I can't talk to it! I need some way to interact with my model. I need some way to communicate to it the rules of the world, whether the world in this case means chess or a programming target or just the literal world.
It may well be helpful to have stages:
The LLM purist would say, "Why bother with this Minimal Intelligence when you can have a maximal one that can talk to you natively?"
My answer to that is 🤷‍♂️. There may be huge efficiency gains to be had. We can shift lots of hard thinking onto these relatively tiny Minimal Intelligences to tackle problems that are sufficiently abstract that the immense knowledge of the foundational LLMs is merely dead weight on that particular problem.
The Minimal Intelligence is the machine learning version of an inlined SIMD instruction that a compiler puts in place of a hot for loop in some Python code. We insert it on the hot paths where it makes sense.
Our Minimal Intelligence knows basically no facts about the world. Or at least as few as possible to still be useful.
We want it to have some core concept of making good decisions given the world it finds itself in.
The most abstract way to construct a world and offer decisions is what we may call a Game.
Basically, I'm envisioning that we procedurally generate a massive sequence of Games with different rules, states, and decisions to be made that our poor knowledgeless model is shuffled through throughout training.
It'll know no facts largely because there is so little consistency between the Games it experiences that there are no facts worth remembering. Each Game is sufficiently varied to require an essentially blank slate evaluation.
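Here is a rough sketch of the kind of procedural Game generator I have in mind. The rule parameters, rewards, and interface are all invented for illustration, and the genuinely hard part (making the Games share Real-World-like structure) is exactly what this sketch glosses over.

    import random
    from dataclasses import dataclass

    # Each Game gets freshly sampled state sizes, transition rules, and scoring,
    # so there are no stable "facts" worth memorizing across Games. Everything
    # here is an illustrative placeholder, not a worked-out training environment.

    @dataclass
    class Game:
        state: list
        actions: list
        transition: callable   # (state, action) -> new state
        reward: callable       # (state) -> float

    def generate_game(rng):
        size = rng.randint(2, 20)
        weights = [rng.uniform(-1, 1) for _ in range(size)]
        initial = [rng.uniform(-1, 1) for _ in range(size)]
        actions = list(range(rng.randint(2, 8)))

        def transition(state, action):
            return [s + 0.1 * action * w for s, w in zip(state, weights)]

        def reward(state):
            return sum(s * w for s, w in zip(state, weights))

        return Game(initial, actions, transition, reward)

    def training_stream(seed=0, n_games=1_000_000):
        # Shuffle the knowledge-free model through an endless variety of Games.
        rng = random.Random(seed)
        for _ in range(n_games):
            yield generate_game(rng)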
I should say: I've read the Wikipedia page on the "No Free Lunch Theorem". I would not claim to understand it properly. But it gives me the vague sense of something that rings true intuitively: If a model has to function in literally any possible Game, then it can do no better on average than a random agent would.
If you are born into a world and shown a game, you can learn that game and get quite good at it. Then you can learn a new game and get quite good at it. But for the millionth game you are required to learn, you will either fail to learn it or you will have to overwrite your knowledge of some other game from long ago. If there are few or no shared concepts between the games, due to incredible variance in the rules, states, and decision options, then the finite number of neurons in your finitely sized brain will fill up with knowledge and you'll have to drop something in order to get good at the new Game. Given that there are (presumably) an infinite number of potential Games with irreconcilable rulesets, your average performance on any given Game is zero, or no better than random decision making.
Devoid of any shared structure between the rules of the Games, there is nothing to learn. This gets at the heart of the issue I described in "1. Minimal Intelligence".
So, our problem is harder: We need to procedurally generate a sequence of Games that vary incredibly widely in rules, state, and decision options but that share enough rule structure with some interpretation of "the Real World" that a model training on these Games has something to learn. Something to actually grab onto that is worth encoding and the encoding of which we are inclined to label "intelligence".
This is an important fact about the Real World: Something that is intelligent for the Real World, meaning intelligent for the space of Games that can be encountered in the Real World, meaning that frequently makes good decisions in the space of Games that can be encountered in the Real World... Such a machine will not be good at Games that are not in the space of Games evident in the Real World, and vice versa.
This raises the question: Is there a through line of structure that bounds the space of Games reachable in the Real World?
I don't know.
This raises another question: Is there really any "Minimal Intelligence" to be had separable from knowledge about the kinds of Games that one frequently encounters in our Real World (chess, war, particle colliders, and interpretations of Robert Frost's poetry)?
I don't know. It might be that what I'm describing is a compromise: a machine that does very well in any Game you can describe as some sort of logical or resource-optimization problem, but that cannot and will not ever help you draft your marketing plan. That's for the LLMs, which know everything there is to know about marketing, business plans, and the humans we want to sell to.