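Discussion (11 comments)
At last, the world's first "fully general action model" (Computer Action Model) has arrived: a model that can autonomously carry out any computer operation without depending on a specific app. An AI that understands the screen the way a person does and freely operates the mouse and keyboard looks set to be a major step that rewrites the history of business automation.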
Hey guys! I’m Neel. I've been holed up in our South Park office for the past year working on model training. Excited to share our research!
This is a preview of a very different type of computer use model: we train on the internet. Specifically, we have 11 million hours of computer video stored on our storage cluster (previously shared at https://news.ycombinator.com/item?id=45438496), and the model can run at 30 FPS. Since we match the fundamental form factor of computer use, we can get our model to do CAD, browse websites, and even drive a car using arrow keys. I’m super excited to see what our model can do as we scale more. It's a fun frontier to work on (not language models :) ).
The team and I will be online responding to the comments, so drop any questions.
I rly liked the point about ctrl-c only being able to be labelled retrocausally. I do think that with enough past context you should be able to know what was copied - in some sense the past does encode the future - but also an agentic decision is precisely the kind where the future is more informative than the past for reconstructing that decision.
It does make me wonder if you should have the inverse dynamics model split into specifically retrocausal and causal. You kind of do this already with the inverse and forward dynamics model, but the idea of a model that knows only about the future training in a feedback loop with a model that knows only about the past is kind of interesting.
I think you could just do a clever masking regime in your diffusion model to achieve the same effect without a whole architecture change.
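To make the masking idea concrete, here is a minimal sketch of what "causal" versus "retrocausal" visibility over video frames could look like. The function name `temporal_mask` and the mode names are made up for illustration, and nothing here reflects how the authors' diffusion model actually applies masks; it only shows that the two regimes are a change of mask, not of architecture.

```python
def temporal_mask(num_frames, mode):
    """Return a boolean matrix where mask[i][j] is True when the
    prediction for frame i is allowed to see frame j.

    Hypothetical helper for illustration only.
    """
    rules = {
        "causal": lambda i, j: j <= i,         # past and present only
        "retrocausal": lambda i, j: j >= i,    # present and future only
        "bidirectional": lambda i, j: True,    # everything is visible
    }
    if mode not in rules:
        raise ValueError(f"unknown mode: {mode}")
    rule = rules[mode]
    return [[rule(i, j) for j in range(num_frames)] for i in range(num_frames)]


causal = temporal_mask(4, "causal")
retro = temporal_mask(4, "retrocausal")

# Frame 2 under each regime: which of frames 0..3 are visible?
print(causal[2])  # sees frames 0, 1, 2
print(retro[2])   # sees frames 2, 3
```

Under this framing, labelling a ctrl-c from what gets pasted later would use the retrocausal rows, while predicting the next action at inference time would use the causal rows of the same model.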
This looks extremely impressive, really deserves more attention here.
Are the inverse dynamics and forward dynamics models trained separately? It sounds like if the inverse dynamics model is meant to extrapolate more training data, then perhaps all that means is it takes very little data to generalize directly with the forward dynamics model assuming the right architecture.
Just wanted to say: this is mighty impressive research.
Really interesting breakdown, proper nerdsniped into this, thanks for the refreshing AI news outside of language models :)
May I suggest a driving demo in a parking lot with a mannequin instead of a real world video where it drives way too close to a pedestrian?
Otherwise, very cool and exciting!
At first glance, this looks incredible to me. The authors train one model on 40K hours of computer-use video, previously labeled by contractors with keyboard and mouse actions, then use that model, in effect, to label 11M hours of computer-use video, which they use to train the computer-action model. The key advance is in compression. Quoting from the OP:
> [previous models] burn a million tokens to understand just one minute of 30 FPS computer data. Our video encoder encodes nearly 2 hours of video in the same number of tokens—that’s 50x more token-efficient than the previous state-of-the-art and 100x more token-efficient than OpenAI’s encoder.
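As a back-of-the-envelope check on the quoted numbers, the sketch below converts both claims to tokens per frame. It assumes a budget of exactly 1,000,000 tokens and treats "nearly 2 hours" as exactly 120 minutes, so the resulting ratio is an upper bound rather than a figure from the OP.

```python
FPS = 30
TOKEN_BUDGET = 1_000_000     # "a million tokens" from the quote
BASELINE_MINUTES = 1         # previous models: one minute per budget
COMPRESSED_MINUTES = 120     # assumption: "nearly 2 hours" rounded up to 2 hours


def tokens_per_frame(budget, minutes, fps=FPS):
    """Average number of tokens spent on each video frame."""
    return budget / (minutes * 60 * fps)


baseline = tokens_per_frame(TOKEN_BUDGET, BASELINE_MINUTES)
compressed = tokens_per_frame(TOKEN_BUDGET, COMPRESSED_MINUTES)
ratio = baseline / compressed

print(f"baseline:   {baseline:.1f} tokens/frame")
print(f"compressed: {compressed:.2f} tokens/frame")
print(f"ratio:      {ratio:.0f}x")
```

With these assumptions the baseline works out to roughly 556 tokens per frame against roughly 4.6 for the new encoder, a ratio of about 120x, which is the same order of magnitude as the 50x and 100x figures in the quote.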
While I was already aware that there are people working on new, more efficient "world models," this is the first one I've seen in action. I'm a bit in shock at how good it is, quite frankly.
I've added the OP, as well as a related 2018 paper on Behavioral Cloning from Observation (BCO), to my reading list.[a] So far, I've only skimmed the 2018 paper, but it's already evident that it's well-written. I'm no expert in deep RL, and I can understand it. BTW, "Behavioral Cloning from Observation" is a really good name, with an easy-to-remember acronym.
Thank you for sharing this on HN.
[a] https://arxiv.org/abs/1805.01954
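For readers who want the two-stage labelling idea from the comment above in concrete form, here is a toy sketch. Lookup tables stand in for the neural networks, states are 1-D cursor positions, and every function name is hypothetical; it is not the authors' pipeline, only the general shape of "label a little by hand, infer labels for the rest, then clone the behaviour".

```python
from collections import Counter, defaultdict


def fit_inverse_dynamics(labeled):
    """Learn the most common action for each observed state change.

    labeled: list of (before, after, action) from the small hand-labelled set.
    """
    votes = defaultdict(Counter)
    for before, after, action in labeled:
        votes[after - before][action] += 1
    return {delta: c.most_common(1)[0][0] for delta, c in votes.items()}


def pseudo_label(idm, unlabeled):
    """Attach inferred actions to unlabelled (before, after) transitions.

    Transitions whose state change the toy IDM has never seen are dropped.
    """
    return [(b, a, idm[a - b]) for b, a in unlabeled if (a - b) in idm]


def fit_policy(pseudo_labeled):
    """Behaviour cloning: the most common inferred action for each state."""
    votes = defaultdict(Counter)
    for before, _after, action in pseudo_labeled:
        votes[before][action] += 1
    return {state: c.most_common(1)[0][0] for state, c in votes.items()}


# Small labelled set (stands in for the 40K contractor-labelled hours).
labeled = [(0, 1, "right"), (3, 2, "left"), (5, 6, "right")]
# Larger unlabelled set (stands in for the 11M hours of raw video).
unlabeled = [(0, 1), (1, 2), (2, 3), (9, 8), (8, 7), (5, 5)]

idm = fit_inverse_dynamics(labeled)
pseudo = pseudo_label(idm, unlabeled)
policy = fit_policy(pseudo)
print(policy)
```

The point of the toy is the data flow: the policy ends up with actions for states (1, 2, 8, 9) that never appeared in the hand-labelled set, because the inverse dynamics step supplied the labels.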
Nice that it can drive a car, but you could just use openpilot.
dammmmmmnnnn - lots to like here. I'm impressed with the 80,000 parallel website fuzzing desktops. And the 30hz (everything). Amazing.
This seems like really great research, and the first time I’ve seen overwhelming praise on HN. Congrats!
I wanted to comment though that your title is not doing you any favors, and I suspect that is why this is not getting more traction (which it deserves). I fully expected some half baked GitHub repo, but instead found something truly awesome.
To use your own words, Neel, “a very different type of computer use model” would have had me clicking faster. I’m not great at titles, however, and maybe there are better ideas out there.
Anyway, can’t wait to see how this develops! Especially looking forward to the CAD work.
Very impressive stuff!
Can you prompt it or is it strictly Copilot-style prediction?