A framework to enable multimodal models to operate a computer.
Using the same inputs and outputs as a human operator, the model views the screen and decides on a series of mouse and keyboard actions to reach an objective.
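The view-then-act cycle described above can be sketched as a simple loop. This is a minimal illustration, not the framework's actual implementation: `Action`, `run_operator`, and the three callables (`capture_screen`, `choose_action`, `perform`) are hypothetical names introduced here, and the model is replaced by a scripted stub so the example runs without any external dependencies.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Action:
    """One step chosen by the model (hypothetical schema for illustration)."""
    kind: str                      # e.g. "click", "type", or "done"
    payload: dict = field(default_factory=dict)


def run_operator(
    objective: str,
    capture_screen: Callable[[], bytes],
    choose_action: Callable[[str, bytes, List[Action]], Action],
    perform: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Repeat: look at the screen, ask the model for an action, execute it.

    Stops when the model returns a "done" action or max_steps is reached.
    Returns the list of actions that were actually performed.
    """
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture_screen()                            # input: what a human would see
        action = choose_action(objective, screenshot, history)   # the model decides
        if action.kind == "done":
            break
        perform(action)                                          # output: mouse or keyboard event
        history.append(action)
    return history


if __name__ == "__main__":
    # Scripted stand-in for a multimodal model (assumption: a real model
    # would inspect the screenshot instead of replaying a fixed list).
    scripted = iter([
        Action("click", {"x": 120, "y": 48}),
        Action("type", {"text": "hello"}),
        Action("done"),
    ])
    performed = run_operator(
        objective="type a greeting",
        capture_screen=lambda: b"",              # placeholder screenshot bytes
        choose_action=lambda obj, shot, hist: next(scripted),
        perform=lambda action: print("performing", action),
    )
    print(len(performed), "actions performed")
```

Passing the screen capture, model call, and input driver in as callables keeps the loop independent of any particular model or automation library; `max_steps` guards against a model that never signals completion.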
Key Features
Compatibility: Designed for various multimodal models.