More Data and AI
There is no doubt that, all else being equal, more information processing power leads to more intelligence. But there is a prevailing belief that if we simply process more and more data with our current technologies, we will eventually get to human-level AI, and I’m not so sure about that. Not that it can’t be true in the theoretical sense, but that it may not be the best practical way to get there. I’ll go over a bit of my reasoning here.
The only currently known example of generally intelligent agents is us, humans. And clearly we do not need massive quantities of data to learn, certainly not at the level that today’s machine learning (ML) algorithms use. We do not need to see anywhere close to millions of cat pictures to be able to recognize cats effectively. It seems that we have, in our biology or genes, the encoding of some “pre-trained” structure that allows us to adapt, learn, and generalize at high levels of proficiency.
This encoding probably contains some critical information about the fundamental nature of reality, what people colloquially call “common sense”, and potentially many other types of information. And because of how little data (compared to the usual ML training data set size) we need to learn how to do new things, this biological encoding is clearly highly adept at transfer learning. We are living proof that extremely effective transfer learning is possible without ingesting and processing huge amounts of data.
So then the question becomes: did this highly effective, biologically built-in encoding take a large amount of data to create? The “training” process that produced this encoding is, in essence, the combined workings of many evolutionary processes, both biological and social. It’s not evident to me that the answer to this question is a definite “yes”. The many generations of humans, and of our biological precursors before we became modern-day Homo sapiens, can be thought of as “training cycles”. The information and experiences taken in over the lifetimes of all those past generations are the “training data”. So how large is this “dataset”?
I think arguments can be made on both sides: that it’s supermassive, or that it’s not very big compared to the volume of digital training data we have access to today. On one hand, our interactions with physical reality happen at near-infinite resolution, and they form an absolutely massive dataset, especially over generations. On the other hand, our sensory perception is generally very low resolution, and we don’t seem to retain much high-fidelity information over time.
It is precisely because of my uncertainty about the answer to this question that I have doubts about the “more data is the way to get to AI” narrative.