Initial Thoughts On Operator
Earlier today OpenAI announced the long-rumored computer-use agent product, Operator. It is powered by a new model trained specifically for computer use, aptly named Computer-Using Agent (CUA). The model shows significant improvements over the previous state of the art, Anthropic’s Claude 3.5 with computer use, at least on the benchmarks OpenAI ran. The demos in the announcement post are impressive, but like the Sora announcement last year, they are most likely heavily cherry-picked; the real, practical, one- or few-shot performance of the model is probably far from what’s shown.
The one somewhat surprising thing was the choice to have the model reason and act purely from visual data (i.e. screenshots), even though the technology (currently) only works with websites in a browser. Surely having access to the DOM, plus the ability to generate and execute JS, would yield a much more powerful and reliable model? But if the OpenAI team has decided to build CUA with visual data only, they must have good reasons. So let’s speculate on what those might be.
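To make the contrast concrete, here is a minimal sketch using Playwright - my choice of tooling, not anything OpenAI has said about how Operator or CUA works internally, and the URL and selectors are invented. The DOM route reads structured data and clicks a stable selector; the visual route has only pixels and raw coordinates to work with.

```ts
// Sketch only: contrasts DOM/JS-driven control with screenshot-and-coordinates
// control. Playwright is an assumed tool here; the page and selectors are made up.
import { chromium } from "playwright";

async function main() {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com/checkout"); // hypothetical page

  // DOM route: query structured content directly and act on a stable selector.
  const prices = await page.$$eval(".line-item .price", (els) =>
    els.map((el) => el.textContent?.trim())
  );
  console.log("line-item prices:", prices);
  await page.click("button#place-order");

  // Visual route (roughly what a screenshot-only agent has to work with):
  // the model sees an image and must emit raw coordinates for the same click.
  await page.screenshot({ path: "state.png" });
  await page.mouse.click(640, 480); // coordinates predicted by the model

  await browser.close();
}

main();
```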
They could be really confident about the scaling trajectory of visual-only computer use. This may be partly informed by their exclusive knowledge of how well scaling is working internally, on yet-to-be-released models. Perhaps the capability of those models is already so promising that they feel no need to rely on anything else.
It may come down to a conviction that ultimately we will not need anything other than visual data to build these models, so why bother with anything else now. This is similar to the classic LiDAR (Google/Waymo) vs. camera-based (Tesla) debate in self-driving cars. Yes, you do get more information from LiDAR, or from the page’s underlying content, but humans only need eyes to perform these tasks, so machines should be able to do the same.
Visual-only definitely has the advantage of being more adaptable. If a computer-use model is trained on website content (HTML and JS), then to interact with native desktop apps you’d have to re-train, or at least tweak, the model, since those apps have very different internal structures and syntax. In some environments, like Apple’s mobile ecosystem, inspecting an application’s UI programmatically may not even be possible. So if the goal is one model that can do everything, and mobile use is part of that, a visual-only model makes sense.
If CUA were relying on the website’s content to function, some companies might decide to make it difficult to extract structured content from the page. Maybe Ticketmaster doesn’t want an army of Operator bots grabbing all the concert tickets the second they go on sale. There are of course only limited ways they can obfuscate things - after all, the browser still needs to understand the structure to render the page - but it would create an additional hurdle for the model to overcome. With a purely visual model there is not much anyone can do, as it interacts with the page exactly the way a human does.
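As a toy illustration of that obfuscation point - entirely invented, not how Ticketmaster or anyone else actually does it - consider hashing class names on every deploy. A DOM-dependent agent’s hard-coded selector silently breaks, while the rendered button, and therefore the screenshot, looks identical.

```ts
// Sketch only: toy markup obfuscation. The class names and hashing scheme are
// invented; the point is that a selector-based agent written against one deploy
// can break on the next, while the rendered page looks exactly the same.
import { createHash } from "node:crypto";

// Per-deploy suffix, e.g. derived from a build id (hypothetical).
const deployId = "2025-01-24-build-1337";
const suffix = createHash("sha256").update(deployId).digest("hex").slice(0, 8);

// Stable, human-readable markup the site could have served...
const friendly = `<button class="buy-ticket">Buy tickets</button>`;

// ...versus what it serves after hashing class names per deploy.
const obfuscated = friendly.replace(
  /class="([^"]+)"/g,
  (_, name: string) => `class="${name}-${suffix}"`
);

console.log(obfuscated); // e.g. <button class="buy-ticket-3fa2c9d1">Buy tickets</button>
// An agent that hard-coded 'button.buy-ticket' now matches nothing,
// but a screenshot of the rendered button is unchanged.
```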
Whatever their reasons are, I do agree with the OpenAI team’s thesis that visual-only computer use is the ultimate solution. But I also think that in the interim, while the model’s capabilities are not quite there, there is still tremendous value in feeding it other forms of data. And there is no doubt that Operator is a data-mining product: every change on the webpage is captured, along with the model’s outputs and the actions performed, to further train future models.