Initial Thoughts On GPT-4
GPT-4 was released today, through a coordinated PR launch with much fanfare from partners that utilize it in various products (like Duolingo, Khan Academy, and Morgan Stanley). It comes with improved capabilities across several major areas which I won’t repeat here. Instead I’ll share some of my initial observations and thoughts based on the released information, with no actual usage experience of the new model (API access is still behind a waitlist).
The trend of being more and more closed continues. The irony of the name OpenAI has been increasingly called out recently, and the GPT-4 release takes yet another large step away from being open. Notably, even high-level technical information about the model, such as its size, training cost, and high-level architecture, all of which was made available for its predecessor GPT-3, is now kept secret. In the paper accompanying the press release, OpenAI directly cites competition as a reason for the lack of details:
Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.
Still, we can make some educated guesses based on what they did publish.
GPT-4’s knowledge of world events has the same cutoff time as GPT-3 (i.e. up to September 2021). This suggests that the new model is substantially based on the existing GPT-3 model, with very little new corpus incorporated. The high-level architecture of the base model likely remains similar: a substantially different architecture would probably have required retraining from scratch, in which case newer data could easily have been included (at least in theory).
The technical focus of GPT-4 seems to be on increasing training compute efficiency, by developing techniques to better predict model performance from much smaller (in terms of compute, not necessarily number of parameters) models. Before the official release today, there were wild rumors about the size of GPT-4, ranging from the same as GPT-3 to ridiculously high numbers in the hundreds of trillions. Surprisingly, there is no mention of scaling the model size, unlike with GPT-3, which jumped from GPT-2’s 1.5B parameters to 175B. The only scale-related improvement explicitly mentioned is the prompt/context length, which got a major boost, from 4,000 tokens to up to 32,000.
If I were to guess, I’d say that OpenAI likely encountered some form of diminishing returns with scaling, and had to revise their technical strategy accordingly. Perhaps they learned that the performance of the model did not meaningfully improve with moderate increases in parameter count, and that substantial increases are not (yet) feasible. Maybe instead they found that model performance was very sensitive to minor or moderate architectural changes, and so decided to focus on the ability to rapidly iterate through them. The cynical take on this would be that we have hit the ceiling (or at least a plateau) of what can be achieved with transformer-based LLMs. The more optimistic view would be that GPT-4 is only a stepping stone to a much more dramatic leap in performance in GPT-5. But either way, I think this is a small negative signal for the scaling laws of LLMs.
The extensive use of standardized tests as benchmarks is probably more about generating hype than showcasing leaps in performance. These tests are probably on the easier end of “tasks that are done using language”, since they operate in fairly narrow domains with formats that perfectly suit an LLM (i.e. question and answer). After all, GPT-3 had already shown similar performance on some of these tests. Sure, being multimodal makes GPT-4 able to take a much wider range of tests (ones with graphs and diagrams), and its performance has definitely improved over its predecessor’s. But personally I am more impressed by the model’s performance on various real-life tasks. What standardized test results do better, however, is generate hype in the media. A title like “New AI Passes the Bar With Top 10% Score” is just too clickbaity to ignore. And whenever PR hype is involved, I can’t help but think that it is used to mask the lack of real, dramatic, “holy shit” levels of actual improvement.
Factual hallucination remains a problem despite noticeable improvements. The data from OpenAI’s own testing shows that GPT-4’s response accuracy under adversarial settings (i.e. when deliberately trying to trip it up) is around 70-80% (compared to 50-60% for the latest version of ChatGPT, which is based on GPT-3.5). That still leaves at least a 1 in 5 chance for GPT-4 to make a false claim in its response, making human oversight required for anything even slightly critical. Improvements here are better than what I’d call incremental, but far from a leap. Notably, the improvements in this area seem to come predominantly from more/better RLHF rather than from technical breakthroughs, thus limiting the scaling potential.
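To make the arithmetic behind that estimate concrete, here is a minimal sketch that converts the accuracy ranges quoted above into error rates. The function names are made up for illustration, and the figures are only the rough ranges cited in this post, not exact numbers from OpenAI.

```python
def error_rate(accuracy: float) -> float:
    """Share of responses containing a false claim, given factual accuracy."""
    return 1.0 - accuracy


def one_in_n(rate: float) -> float:
    """Express a rate as '1 in N' (returns N)."""
    return 1.0 / rate


# Approximate accuracy ranges quoted in the post (low end, high end).
accuracy_ranges = {
    "GPT-4": (0.70, 0.80),
    "ChatGPT (GPT-3.5)": (0.50, 0.60),
}

for model, (low_acc, high_acc) in accuracy_ranges.items():
    best = error_rate(high_acc)   # error rate at the high end of accuracy
    worst = error_rate(low_acc)   # error rate at the low end of accuracy
    print(
        f"{model}: error rate {best:.0%} to {worst:.0%} "
        f"(roughly 1 in {one_in_n(best):.1f} to 1 in {one_in_n(worst):.1f})"
    )
```

Even at the optimistic end of the range, GPT-4 would be wrong about once in every five adversarial prompts, which is why human review still seems necessary.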
The jailbreaking war escalates. Another large area of focus in the development of GPT-4 is getting it to stay within the guardrails set by OpenAI more reliably. Only time in people’s hands will tell how effective these guardrails are, but given how LLMs fundamentally work, I think it will always be possible to break them out of their policy jail. However, the techniques to do so could become so complex that they may not be accessible to most people (e.g. requiring a prompt that is longer than the publicly allowed limit).
The multimodal aspect of GPT-4 (namely its ability to accept images as input) is undoubtedly the main breakthrough, but since it’s still being tuned, we don’t know much about it for now. I am very excited to see the practical use cases that it will make possible. Of course I’m also eager to learn how the integration was technically achieved, but given the increasingly closed nature of OpenAI, I’m not expecting much directly from them.
Lastly, based on everything shown today, my AGI Doomer forecast probability went down slightly. GPT-4’s improvements are quite impressive, and no doubt bring immense practical benefits for building useful things today. But I feel that they are mostly incremental improvements on what its predecessor could do already, just more nuanced, more reliable, and less likely to go rogue. The scaling laws of LLMs seem to be taking a breather for now, which should lower the Doomer fear.