I came across this last week when looking into how I'd run clawd (now Moltbot) on a Mac. I've been using their Lume CLI for a few days to sandbox Claude and it's been reasonably good so far.
Expanding the VM's disk didn't propagate to the guest; it seems to need guest tooling like VMware Tools to do that kind of magic. I backed up the important stuff (I'd been using it for less than a day) and created a new VM. That's been the only hurdle so far.
Anyway, I haven't used this, but I thought it'd be useful for someone to hear that a related tool seems to be OK after ~five days of intermittent use. I've been an ESX (and recently Proxmox) fan in the past (and tried all the others because I was curious), so I say this as someone who has kicked some VM rocks in my time.
As for locally run AI, this is the first time I've given one access to any data directly and I can't speak intelligently about any of that or this specific tool. Sorry.
The trajectory export feature is smart. Evaluation and training data collection in the same tool.
I'm curious how the benchmarks handle non-determinism. Real GUIs have loading states, animations, popups that appear sometimes but not always. Does cuabench control for that, or is variance just part of the measurement?
Also interested in what "Windows Arena" tests specifically. Windows has so many edge cases - UAC prompts, driver install dialogs, random update notifications. Those feel like the hard mode for computer-use agents.
Thanks - trajectory export was key for us since most teams want both eval and training data.
On non-determinism: we actually handle this in two ways. For our simulated environments (HTML/JS apps like the Slack/CRM clones), we control the full render state so there's no variance from animations or loading states. For native OS environments, we use explicit state verification before scoring - the reward function waits for expected elements rather than racing against UI timing. Still not perfect, but it filters out most flaky failures.
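To make "waits for expected elements rather than racing against UI timing" concrete, here's a rough sketch of the pattern in plain Python - not our actual code, and env.element_exists is just an illustrative stand-in, not a real API:

    import time

    def wait_for_state(check_state, timeout=30.0, poll_interval=0.5):
        # Poll a state predicate until it holds or the timeout expires.
        # check_state is any callable returning True once the expected
        # UI elements (e.g. a saved-confirmation toast) are present.
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if check_state():
                return True
            time.sleep(poll_interval)
        return False

    def score_task(env):
        # Verify the expected end state before scoring, instead of scoring
        # immediately and racing against loading spinners or animations.
        reached = wait_for_state(lambda: env.element_exists("confirmation_toast"))
        return 1.0 if reached else 0.0

The point is that a transient loading state only causes a failure if the expected state never shows up within the timeout, which is what filters out most of the flaky runs.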
Windows Arena specifically - we're focusing on common productivity flows (file management, browser tasks, Office workflows) rather than the edge cases you mentioned. UAC prompts and driver dialogs are exactly the hard mode scenarios that break most agents today. We're not claiming to solve those yet, but that's part of why we're open-sourcing this - want to build out more adversarial tasks with the community.
Interesting, a computer-use environment. I made a CUA benchmark too: 200 web tasks with internal code-based evaluation. You can integrate them if you want.
https://github.com/UiPath/uipath_enterprise_benchmark
https://arxiv.org/abs/2511.17131
Hey visarga - I'm the founder of Cua; we might have met at the CUA ICML workshop? The OS-agnostic VNC approach of your benchmark is smart and would make integration easy. We're open to collaborating - want to shoot me an email at f@trycua.com?
Interesting project, but the lack of any actual benchmark results on existing models/agents is disappointing.
Fair point - we just open-sourced this, so benchmark results are coming. We're already working with labs on evals, focusing on tasks that are more realistic than OSWorld/Windows Agent Arena and are curated with actual workers. If you want to run your agent on it, we'd love to include your results.