NextDesk is an intelligent desktop automation application powered by Google's Gemini AI that uses the ReAct (Reasoning + Acting) framework to understand and execute complex computer tasks through natural language commands.
⚠️ UNDER ACTIVE DEVELOPMENT: This project is currently in active development and is not ready for production use. The vision-based element detection tool (`detectElementPosition`) is particularly unreliable and not recommended for use at this time. We recommend using keyboard shortcuts (`pressKeys`) and the `getShortcuts` tool instead for more reliable automation.
This Flutter desktop application combines AI reasoning with keyboard automation and input control to automate desktop tasks. Simply describe what you want to do in natural language (e.g., "open Chrome and search for Flutter documentation"), and the AI agent will break it down into executable steps, reason about each action, and perform the automation.
| Feature | Status | Notes |
|---|---|---|
| ReAct Framework | ✅ Working | Core reasoning loop is functional |
| Keyboard Automation | ✅ Working | Reliable keyboard shortcuts via `pressKeys` |
| AI Shortcuts Tool | ✅ Working | `getShortcuts` dynamically fetches shortcuts |
| Mouse Control | ✅ Working | Basic mouse movement and clicks |
| Screenshot Capture | ✅ Working | Screen capture functionality |
| Vision Detection | ⚠️ Unreliable | Not recommended for use |
| User Interaction | ✅ Working | Agent can ask user questions via dialog |
| Task Persistence | ✅ Working | Isar database for task history |
Current Focus: Improving vision detection accuracy and reliability.
| Platform | Status | Notes |
|---|---|---|
| macOS | ✅ Supported | Fully tested and working |
| Windows | ⚠️ Not yet supported | Requires testing of the `bixat_key_mouse` plugin for proper keyboard and mouse control |
| Linux | ⚠️ Not yet supported | Requires testing of the `bixat_key_mouse` plugin for proper keyboard and mouse control |
Note: While the `bixat_key_mouse` plugin claims to support Windows and Linux, we need to thoroughly test keyboard and mouse control functionality on these platforms before officially supporting them in this application.
Main interface showing task history and quick actions
The AI's reasoning process displayed in real-time with numbered thought steps
Execution log showing all function calls and their parameters
nextdesk/
├── lib/
│   ├── main.dart                     # Application entry point
│   ├── config/
│   │   ├── app_theme.dart            # Centralized theme & design system
│   │   └── app_config.dart           # API keys and configuration
│   ├── models/
│   │   ├── task.dart                 # Task data model (Isar)
│   │   ├── detection_result.dart     # UI element detection results
│   │   └── react_agent_state.dart    # ReAct agent state
│   ├── services/
│   │   ├── gemini_service.dart       # Gemini AI model initialization
│   │   ├── vision_service.dart       # AI-powered UI element detection
│   │   ├── automation_service.dart   # All automation functions
│   │   └── shortcuts_service.dart    # AI-powered keyboard shortcuts
│   ├── providers/
│   │   └── app_state.dart            # Main state management (Provider)
│   ├── screens/
│   │   └── main_screen.dart          # Main UI with responsive layout
│   ├── widgets/
│   │   ├── task_card.dart            # Reusable task card widget
│   │   └── user_prompt_dialog.dart   # User interaction dialog
│   └── main.g.dart                   # Generated Isar database code
├── macos/                            # macOS platform-specific code
├── windows/                          # Windows platform-specific code
├── linux/                            # Linux platform-specific code
├── pubspec.yaml                      # Dependencies and project configuration
└── README.md                         # This file
The application follows separation of concerns with a clean modular architecture:
- Task: Isar database model for storing automation tasks with thoughts and steps
- DetectionResult: Model for UI element detection results with coordinates (sketched below)
- ReActAgentState: State management for the ReAct reasoning cycle
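For illustration, a minimal sketch of what the DetectionResult model could look like is shown below, assuming only the x, y, and confidence fields implied by the vision example later in this README; the actual class may differ.

```dart
/// Hypothetical sketch of DetectionResult; fields are assumed from the
/// {x, y, confidence} shape shown in the vision detection example.
class DetectionResult {
  final int x;              // Horizontal pixel coordinate of the detected element
  final int y;              // Vertical pixel coordinate of the detected element
  final double confidence;  // Detection confidence reported by the model (0.0-1.0)

  const DetectionResult({
    required this.x,
    required this.y,
    required this.confidence,
  });

  @override
  String toString() => 'DetectionResult(x: $x, y: $y, confidence: $confidence)';
}
```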
- GeminiService: Initializes and configures the Gemini AI model with function calling
- VisionService: AI-powered UI element detection using the Gemini or Qwen Vision API
- AutomationService: Wrapper for all automation capabilities (mouse, keyboard, screen)
AppState: Main state management using the Provider pattern
- Manages task execution state
- Handles ReAct agent lifecycle
- Stores execution logs and thought history
- Manages database operations
MainScreen: Primary interface with responsive layout (see the sketch after this list)
- Adaptive design (800px breakpoint)
- Side-by-side panels on large screens
- Drawer navigation on small screens
- Tabbed interface for thoughts and actions
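The 800px breakpoint behavior described above follows the common Flutter LayoutBuilder pattern. The snippet below is an illustrative sketch only, not the actual main_screen.dart:

```dart
import 'package:flutter/material.dart';

/// Illustrative sketch: switches between side-by-side panels and a drawer
/// at an 800 px breakpoint, mirroring the behavior described above.
class ResponsiveMainScreen extends StatelessWidget {
  const ResponsiveMainScreen({super.key});

  @override
  Widget build(BuildContext context) {
    return LayoutBuilder(
      builder: (context, constraints) {
        final taskPanel = const Placeholder();   // stands in for task history / quick actions
        final detailPanel = const Placeholder(); // stands in for the thoughts & actions tabs

        if (constraints.maxWidth >= 800) {
          // Large screens: side-by-side panels.
          return Scaffold(
            body: Row(
              children: [
                SizedBox(width: 320, child: taskPanel),
                Expanded(child: detailPanel),
              ],
            ),
          );
        }
        // Small screens: drawer navigation.
        return Scaffold(
          appBar: AppBar(title: const Text('NextDesk')),
          drawer: Drawer(child: taskPanel),
          body: detailPanel,
        );
      },
    );
  }
}
```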
TaskCard: Reusable task card with animations and metrics
AppTheme: Centralized design system (see the sketch after this list)
- Material Design 3 theme
- Color palette (Purple/Blue/Green)
- 8px spacing system
- Typography using Google Fonts Inter
- Shadow and border radius constants
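As a rough sketch of what such a design system can look like in code (constant names and the exact seed color here are assumptions, not the real app_theme.dart):

```dart
import 'package:flutter/material.dart';
import 'package:google_fonts/google_fonts.dart';

/// Illustrative sketch of a centralized design system; names and values
/// are assumed, not copied from app_theme.dart.
class AppThemeSketch {
  // 8 px spacing system
  static const double spacingSmall = 8;
  static const double spacingMedium = 16;
  static const double spacingLarge = 24;

  static const double cardRadius = 12;

  static ThemeData light() => ThemeData(
        useMaterial3: true,
        colorScheme: ColorScheme.fromSeed(seedColor: Colors.deepPurple),
        textTheme: GoogleFonts.interTextTheme(),
      );
}
```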
The application implements the ReAct (Reasoning + Acting) pattern, which combines reasoning and action in an iterative loop:
1. THOUGHT → 2. ACTION → 3. OBSERVATION → (repeat)
The AI agent analyzes the current state and decides what to do next:
- Understands the user's goal
- Considers what has been done so far
- Plans the next logical step
The agent executes one of the available automation functions:
- `captureScreenshot()`: Takes a screenshot to see the current state
- `detectElementPosition(description)`: Finds UI elements using AI vision
- `moveMouse(x, y)`: Moves the cursor to coordinates
- `clickMouse(button, action)`: Performs mouse clicks
- `typeText(text)`: Types text via the keyboard
- `pressKeys(keys)`: Presses keyboard shortcuts
- `wait(seconds)`: Waits for a specified duration
The agent receives feedback from the action:
- Success/failure status
- Element coordinates (for detection)
- Screenshot data
- Error messages
This cycle repeats until the task is complete or max iterations (20) is reached.
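In code, the loop boils down to something like the simplified sketch below. All helper names here are illustrative stand-ins, not the project's actual agent implementation.

```dart
/// Simplified sketch of the ReAct loop; every name here is illustrative.
class Thought {
  final bool isTaskComplete;
  final String action;              // e.g. 'pressKeys', 'typeText', 'wait'
  final Map<String, Object?> args;  // arguments for the chosen function
  Thought(this.isTaskComplete, this.action, this.args);
}

// Stubs standing in for the Gemini call and the automation layer.
Future<Thought> thinkNextStep(String goal, List<String> observations) async =>
    Thought(true, 'none', {});
Future<String> executeAction(String action, Map<String, Object?> args) async =>
    'ok';

const int maxIterations = 20; // matches the limit described above

Future<void> runReActLoop(String goal) async {
  final observations = <String>[];
  for (var i = 0; i < maxIterations; i++) {
    // 1. THOUGHT: decide the next step from the goal and prior observations.
    final thought = await thinkNextStep(goal, observations);
    if (thought.isTaskComplete) break;

    // 2. ACTION: run the chosen automation function.
    final result = await executeAction(thought.action, thought.args);

    // 3. OBSERVATION: feed the outcome back into the next reasoning step.
    observations.add(result);
  }
}
```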
The application uses Google's Gemini AI with function calling capabilities:
GenerativeModel(
model: 'gemini-2.5-flash',
apiKey: apiKey,
tools: [
captureScreenshotTool,
detectElementTool,
moveMouseTool,
clickMouseTool,
typeTextTool,
pressKeysTool,
waitTool,
],
)

The AI can:
- Understand natural language instructions
- Reason about multi-step tasks
- Call automation functions with appropriate parameters
- Process visual information from screenshots
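Each entry in that tools list is built from the package's `FunctionDeclaration` and `Schema` types. The sketch below shows roughly how a tool such as `moveMouseTool` could be declared; the descriptions are illustrative rather than copied from the codebase.

```dart
import 'package:google_generative_ai/google_generative_ai.dart';

// Illustrative declaration of one tool; descriptions are assumptions.
final moveMouseTool = Tool(functionDeclarations: [
  FunctionDeclaration(
    'moveMouse',
    'Moves the mouse cursor to the given screen coordinates.',
    Schema.object(
      properties: {
        'x': Schema.number(description: 'Horizontal position in pixels'),
        'y': Schema.number(description: 'Vertical position in pixels'),
      },
      requiredProperties: ['x', 'y'],
    ),
  ),
]);
```

When the model decides to act, it returns a function call with these parameters; the app executes it and feeds the result back as a function response on the next turn.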
The VisionService supports two vision providers for UI element detection:
- Uses Google's Gemini 2.5 Flash model
- Integrated with Google AI Studio
- Fast and reliable for most use cases
- Uses Alibaba Cloud's Qwen 2.5 VL 72B Instruct model
- OpenAI-compatible API format
- Provides image size detection and confidence scores
- Configurable resolution parameters
How it works:
- Takes a screenshot of the current screen
- Sends the image + element description to the selected vision API
- AI analyzes the image and returns pixel coordinates
- Returns a `DetectionResult` with x, y coordinates and a confidence score
Example:
final result = await VisionService.detectElementPosition(
imageBytes,
"blue Submit button",
);
// Returns: {x: 450, y: 320, confidence: 0.95}

Switching Providers:
Edit lib/config/app_config.dart:
static const String visionProvider = 'qwen'; // or 'gemini'
static const String qwenApiKey = 'sk-your-qwen-api-key';

See QWEN_INTEGRATION.md for detailed setup instructions.
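Because the Qwen provider speaks an OpenAI-compatible chat-completions API, a detection request is conceptually an HTTP POST with the screenshot embedded as a base64 data URL. The sketch below is illustrative only; the endpoint URL, model name, and prompt are assumptions and not taken from vision_service.dart.

```dart
import 'dart:convert';
import 'dart:typed_data';

import 'package:http/http.dart' as http;

/// Illustrative only: endpoint, model name, and prompt are assumptions.
Future<Map<String, dynamic>> detectWithQwen(
    Uint8List screenshot, String description, String apiKey) async {
  final response = await http.post(
    Uri.parse(
        'https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions'),
    headers: {
      'Authorization': 'Bearer $apiKey',
      'Content-Type': 'application/json',
    },
    body: jsonEncode({
      'model': 'qwen2.5-vl-72b-instruct',
      'messages': [
        {
          'role': 'user',
          'content': [
            {
              'type': 'image_url',
              'image_url': {
                'url': 'data:image/png;base64,${base64Encode(screenshot)}'
              },
            },
            {
              'type': 'text',
              'text': 'Return the pixel coordinates of: $description '
                  'as JSON {"x": ..., "y": ..., "confidence": ...}',
            },
          ],
        },
      ],
    }),
  );
  // The reply text is expected to contain the JSON coordinates to parse.
  return jsonDecode(response.body) as Map<String, dynamic>;
}
```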
Uses the bixat_key_mouse package (custom Rust-based FFI) for:
- Mouse Control: Move cursor, click, double-click, right-click
- Keyboard Control: Type text, press keys, keyboard shortcuts
- Screen Capture: Take screenshots via `screen_capturer`
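Taken together, these capabilities map onto the automation functions listed in the ReAct section. The interface below is a hedged sketch inferred from those documented names only; the real AutomationService and the underlying bixat_key_mouse calls may differ.

```dart
import 'dart:typed_data';

/// Hedged sketch of the wrapper interface; signatures are inferred from the
/// function names documented in this README, not from automation_service.dart.
abstract class AutomationServiceApi {
  Future<Uint8List> captureScreenshot();
  Future<void> moveMouse(int x, int y);
  Future<void> clickMouse(String button, String action); // e.g. 'left', 'click'
  Future<void> typeText(String text);
  Future<void> pressKeys(List<String> keys);             // e.g. ['cmd', 'space']
  Future<void> wait(int seconds);
}
```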
The AppState class manages:
- Current task execution state
- Execution logs and thought logs
- Screenshot data
- Task history from database
- ReAct agent state (iteration count, current thought, observations)
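A minimal sketch of how such a ChangeNotifier-backed state class can be organized is shown below; the field names are assumed and the real app_state.dart is more involved.

```dart
import 'package:flutter/foundation.dart';

/// Illustrative sketch of a Provider-style state class; names are assumed.
class AppStateSketch extends ChangeNotifier {
  bool isRunning = false;
  final List<String> executionLog = [];
  final List<String> thoughtLog = [];

  void addThought(String thought) {
    thoughtLog.add(thought);
    notifyListeners(); // rebuild widgets listening via Provider
  }

  void addLogEntry(String entry) {
    executionLog.add(entry);
    notifyListeners();
  }
}
```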
Tasks are stored locally using Isar (NoSQL database):
@collection
class Task {
Id id = Isar.autoIncrement;
String prompt = '';
List<String> thoughts = []; // AI reasoning steps
List<String> steps = []; // Executed actions
bool completed = false;
DateTime createdAt = DateTime.now();
}

- google_generative_ai (^0.4.3): Gemini AI integration with function calling
- bixat_key_mouse: Custom Rust-based FFI package for mouse/keyboard control
- screen_capturer (^0.2.1): Cross-platform screen capture functionality
- provider (^6.1.1): State management using ChangeNotifier pattern
- isar (^3.1.0+1): Fast, local NoSQL database for task persistence
- isar_flutter_libs (^3.1.0+1): Isar platform-specific bindings
- flutter_animate (^4.5.0): Declarative animations and transitions
- google_fonts (^6.1.0): Inter font family for typography
- Material Design 3: Modern design system with gradient themes
- path_provider (^2.1.1): Access to file system paths
- uuid (^4.2.1): Generate unique identifiers for tasks
- image (^4.5.4): Image processing and manipulation
- build_runner (^2.4.6): Code generation for Isar
- isar_generator (^3.1.0+1): Generates Isar database code
- Flutter SDK (>=3.0.0)
- Gemini API key from Google AI Studio
- macOS desktop environment (Windows and Linux not yet supported - see Platform Support)
1. Clone the repository

   git clone <repository-url>
   cd nextdesk

2. Install dependencies

   flutter pub get

3. Set up the bixat_key_mouse package

   Ensure the `bixat_key_mouse_dart` package is available at `../bixat_key_mouse_dart`.

4. Configure the API key

   Copy the example config file and add your API key:

   cp lib/config/app_config.dart.example lib/config/app_config.dart

   Then open `lib/config/app_config.dart` and replace the API key:

   static const String geminiApiKey = 'YOUR_GEMINI_API_KEY_HERE';

5. Generate the Isar database code

   flutter pub run build_runner build

6. Run the application

   flutter run -d macos # or windows/linux
Input: "Open Chrome and search for Flutter documentation"
ReAct Process:
THOUGHT: I need to open Chrome first
ACTION: pressKeys(['cmd', 'space'])
OBSERVATION: Spotlight opened
THOUGHT: Now type Chrome to search for it
ACTION: typeText('Chrome')
OBSERVATION: Text typed successfully
THOUGHT: Press Enter to open Chrome
ACTION: pressKeys(['enter'])
OBSERVATION: Chrome is opening
THOUGHT: Wait for Chrome to load
ACTION: wait(2)
OBSERVATION: Waited 2 seconds
THOUGHT: Now I need to click on the address bar
ACTION: captureScreenshot()
OBSERVATION: Screenshot captured
THOUGHT: Detect the address bar
ACTION: detectElementPosition('address bar at the top')
OBSERVATION: Found at x:500, y:100
THOUGHT: Click on the address bar
ACTION: moveMouse(500, 100)
ACTION: clickMouse('left', 'click')
OBSERVATION: Clicked successfully
THOUGHT: Type the search query
ACTION: typeText('Flutter documentation')
ACTION: pressKeys(['enter'])
OBSERVATION: Task complete
Input: "Create a new text file named 'notes.txt' on the desktop"
Input: "Take a screenshot and save it"
- ✅ Natural language task understanding
- ✅ ReAct reasoning framework (Thought → Action → Observation)
- ✅ AI-powered UI element detection using computer vision
- ✅ Mouse and keyboard automation
- ✅ Screenshot capture and analysis
- ✅ Task history and persistence (Isar database)
- ✅ Multi-step task execution with iteration control
- ✅ Real-time execution logs and thought visualization
- ✅ Responsive desktop interface
- Multi-monitor support
- Task templates and macros
- Voice command input
- Task scheduling and automation
- Error recovery and retry logic
- Performance optimization
- Plugin system for custom actions
- Cloud sync for task history
- Dark/Light theme toggle
- Export task history to JSON/CSV
The project follows a clean, modular architecture with clear separation of concerns:
- Models: Data structures for tasks, detection results, and agent state
- Services: AI integration, vision processing, and automation functions
- Providers: State management using Provider pattern
- Screens: Main UI with responsive layout
- Widgets: Reusable UI components
- Config: Centralized theme and design system
- API Key: Store your Gemini API key securely (use environment variables in production; see the sketch after this list)
- Local Processing: All automation runs locally on your machine
- Data Storage: Task history is stored locally using Isar database
- Screenshots: Temporary screenshots are kept in memory and not persisted
- No Telemetry: No data is sent to external servers except Gemini API calls
- Permissions: Requires accessibility permissions for automation (user-controlled)
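If you go the environment-variable route mentioned above, one common approach is Flutter's --dart-define mechanism; a minimal sketch (the GEMINI_API_KEY define name is an assumption):

```dart
// Pass the key at run/build time instead of hard-coding it, e.g.:
//   flutter run -d macos --dart-define=GEMINI_API_KEY=your-key-here
//
// Then read it as a compile-time constant in app_config.dart or similar:
const String geminiApiKey =
    String.fromEnvironment('GEMINI_API_KEY', defaultValue: '');
```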
The `detectElementPosition` function uses AI vision to locate UI elements, but it is currently unreliable and NOT recommended for use:
- ⚠️ Not Production Ready: This feature is experimental and under active development
- ⚠️ Accuracy Issues: Detection may be off by several pixels or fail entirely
- ⚠️ Inconsistent Results: The same element may be detected differently across runs
- ⚠️ Complex UIs: Elements in dense or overlapping layouts are very difficult to detect
- ⚠️ Similar Elements: May confuse similar-looking buttons or icons
- ⚠️ Performance: Vision API calls are slow and may time out
✅ RECOMMENDED APPROACH:
- Use keyboard shortcuts (`pressKeys`) whenever possible; they are much more reliable
- Use the `getShortcuts` tool to dynamically fetch keyboard shortcuts for applications
- Avoid vision-based detection until this feature is stabilized in future releases
This is a known limitation of the current implementation and AI vision models. We are actively working on improving this feature.
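As a concrete illustration of this recommendation, focusing Chrome's address bar is a single shortcut call, whereas the vision path needs a screenshot, a detection round-trip, and a click. The snippet reuses the hedged AutomationServiceApi sketch from earlier and is illustrative only.

```dart
// Illustrative comparison; AutomationServiceApi is the hedged sketch above.
Future<void> focusAddressBar(AutomationServiceApi automation) async {
  // Preferred: one reliable shortcut (Cmd+L focuses Chrome's address bar on macOS).
  await automation.pressKeys(['cmd', 'l']);

  // Discouraged for now: screenshot + vision detection + mouse click, where the
  // detection step is the unreliable part:
  //   final shot = await automation.captureScreenshot();
  //   final pos = await VisionService.detectElementPosition(shot, 'address bar');
  //   await automation.moveMouse(pos.x, pos.y);
  //   await automation.clickMouse('left', 'click');
}
```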
- "Failed to detect element"
  - Note: Element detection is not always precise and may fail
  - Use keyboard shortcuts instead of mouse clicks when possible
  - Ensure the element description is very clear and specific
  - Try taking a screenshot first to verify the UI state
  - Check that the element is visible on screen
  - Improve the description with more details (e.g., "blue Submit button in bottom right corner with white text")
- "API key error"
  - Verify your Gemini API key is valid
  - Check your internet connection
  - Ensure you haven't exceeded API quotas
  - Update the API key in `lib/config/app_config.dart`
- Mouse/keyboard not working
  - Grant accessibility permissions to the app (System Preferences → Security & Privacy)
  - Check that the `bixat_key_mouse` package is properly installed
  - Verify platform-specific permissions
  - Restart the application after granting permissions
Contributions are welcome! Please feel free to submit a Pull Request.
Built with ❤️ using Flutter and Google Gemini AI