NextDesk

NextDesk is an intelligent desktop automation application, powered by Google's Gemini AI, that uses the ReAct (Reasoning + Acting) framework to understand and execute complex computer tasks from natural language commands.

โš ๏ธ UNDER ACTIVE DEVELOPMENT This project is currently in active development and not ready for production use. The vision-based element detection tool (detectElementPosition) is particularly unreliable and not recommended for use at this time. We recommend using keyboard shortcuts (pressKeys) and the getShortcuts tool instead for more reliable automation.

🌟 Overview

This Flutter desktop application combines AI reasoning with keyboard automation and input control to automate desktop tasks. Simply describe what you want to do in natural language (e.g., "open Chrome and search for Flutter documentation"), and the AI agent will break it down into executable steps, reason about each action, and perform the automation.

🚧 Development Status

| Feature | Status | Notes |
|---------|--------|-------|
| ReAct Framework | ✅ Working | Core reasoning loop is functional |
| Keyboard Automation | ✅ Working | Reliable keyboard shortcuts via pressKeys |
| AI Shortcuts Tool | ✅ Working | getShortcuts dynamically fetches shortcuts |
| Mouse Control | ✅ Working | Basic mouse movement and clicks |
| Screenshot Capture | ✅ Working | Screen capture functionality |
| Vision Detection | ⚠️ NOT READY | Unreliable; not recommended for use |
| User Interaction | ✅ Working | Agent can ask user questions via dialog |
| Task Persistence | ✅ Working | Isar database for task history |

Current Focus: Improving vision detection accuracy and reliability.

๐Ÿ–ฅ๏ธ Platform Support

| Platform | Status | Notes |
|----------|--------|-------|
| macOS | ✅ Supported | Fully tested and working |
| Windows | ⚠️ NOT SUPPORTED YET | Requires testing of the bixat_key_mouse plugin for proper keyboard and mouse control |
| Linux | ⚠️ NOT SUPPORTED YET | Requires testing of the bixat_key_mouse plugin for proper keyboard and mouse control |

Note: While the bixat_key_mouse plugin claims to support Windows and Linux, we need to thoroughly test keyboard and mouse control functionality on these platforms before officially supporting them in this application.

📸 Screenshots

Home Screen

Main interface showing task history and quick actions

Dashboard & Thoughts

The AI's reasoning process displayed in real time with numbered thought steps

Actions

Execution log showing all function calls and their parameters

๐Ÿ—๏ธ Project Structure

nextdesk/
├── lib/
│   ├── main.dart                      # Application entry point
│   ├── config/
│   │   ├── app_theme.dart            # Centralized theme & design system
│   │   └── app_config.dart           # API keys and configuration
│   ├── models/
│   │   ├── task.dart                 # Task data model (Isar)
│   │   ├── detection_result.dart     # UI element detection results
│   │   └── react_agent_state.dart    # ReAct agent state
│   ├── services/
│   │   ├── gemini_service.dart       # Gemini AI model initialization
│   │   ├── vision_service.dart       # AI-powered UI element detection
│   │   ├── automation_service.dart   # All automation functions
│   │   └── shortcuts_service.dart    # AI-powered keyboard shortcuts
│   ├── providers/
│   │   └── app_state.dart            # Main state management (Provider)
│   ├── screens/
│   │   └── main_screen.dart          # Main UI with responsive layout
│   ├── widgets/
│   │   ├── task_card.dart            # Reusable task card widget
│   │   └── user_prompt_dialog.dart   # User interaction dialog
│   └── main.g.dart                   # Generated Isar database code
├── macos/                             # macOS platform-specific code
├── windows/                           # Windows platform-specific code
├── linux/                             # Linux platform-specific code
├── pubspec.yaml                       # Dependencies and project configuration
└── README.md                          # This file

Architecture Overview

The application follows a clean, modular architecture with clear separation of concerns:

1. Models (lib/models/)

  • Task: Isar database model for storing automation tasks with thoughts and steps
  • DetectionResult: Model for UI element detection results with coordinates
  • ReActAgentState: State management for the ReAct reasoning cycle

2. Services (lib/services/)

  • GeminiService: Initializes and configures the Gemini AI model with function calling
  • VisionService: AI-powered UI element detection using the Gemini or Qwen Vision API
  • AutomationService: Wrapper for all automation capabilities (mouse, keyboard, screen)
  • ShortcutsService: AI-powered lookup of keyboard shortcuts for applications (backs the getShortcuts tool)

3. Providers (lib/providers/)

  • AppState: Main state management using Provider pattern
    • Manages task execution state
    • Handles ReAct agent lifecycle
    • Stores execution logs and thought history
    • Manages database operations

4. Screens (lib/screens/)

  • MainScreen: Primary interface with responsive layout (sketched after this list)
    • Adaptive design (800px breakpoint)
    • Side-by-side panels on large screens
    • Drawer navigation on small screens
    • Tabbed interface for thoughts and actions
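
As a rough illustration, the 800px breakpoint logic can be sketched with a LayoutBuilder. The widget below is illustrative only, not the actual MainScreen implementation:

import 'package:flutter/material.dart';

// Illustrative sketch of the 800px breakpoint pattern described above;
// Placeholder stands in for the real thoughts/actions panels.
class ResponsiveShell extends StatelessWidget {
  const ResponsiveShell({super.key});

  @override
  Widget build(BuildContext context) {
    return LayoutBuilder(builder: (context, constraints) {
      if (constraints.maxWidth >= 800) {
        // Large screens: panels side by side.
        return const Row(children: [
          Expanded(child: Placeholder()), // thoughts panel
          Expanded(child: Placeholder()), // actions log
        ]);
      }
      // Small screens: drawer navigation plus a tabbed interface.
      return DefaultTabController(
        length: 2,
        child: Scaffold(
          appBar: AppBar(
            bottom: const TabBar(
              tabs: [Tab(text: 'Thoughts'), Tab(text: 'Actions')],
            ),
          ),
          drawer: const Drawer(),
          body: const TabBarView(
            children: [Placeholder(), Placeholder()],
          ),
        ),
      );
    });
  }
}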

5. Widgets (lib/widgets/)

  • TaskCard: Reusable task card with animations and metrics
  • UserPromptDialog: Dialog the agent uses to ask the user questions mid-task

6. Configuration (lib/config/)

  • AppTheme: Centralized design system
    • Material Design 3 theme
    • Color palette (Purple/Blue/Green)
    • 8px spacing system
    • Typography using Google Fonts Inter
    • Shadow and border radius constants
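
A minimal sketch of what such a design system can look like (the constant names here are illustrative, not the actual AppTheme API):

import 'package:flutter/material.dart';
import 'package:google_fonts/google_fonts.dart';

// Illustrative design-system sketch: Material 3 theme, Inter typography,
// and an 8px spacing scale. Names are assumptions, not the real AppTheme.
class AppThemeSketch {
  // 8px spacing system.
  static const double spacingSm = 8;
  static const double spacingMd = 16;
  static const double spacingLg = 24;

  static ThemeData light() => ThemeData(
        useMaterial3: true,
        colorScheme: ColorScheme.fromSeed(seedColor: Colors.deepPurple),
        textTheme: GoogleFonts.interTextTheme(),
      );
}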

🧠 How It Works: The ReAct Framework

The application implements the ReAct (Reasoning + Acting) pattern, which combines reasoning and action in an iterative loop:

ReAct Cycle

1. THOUGHT → 2. ACTION → 3. OBSERVATION → (repeat)

1. THOUGHT (Reasoning Phase)

The AI agent analyzes the current state and decides what to do next:

  • Understands the user's goal
  • Considers what has been done so far
  • Plans the next logical step

2. ACTION (Acting Phase)

The agent executes one of the available automation functions:

  • captureScreenshot(): Takes a screenshot to see the current state
  • detectElementPosition(description): Finds UI elements using AI vision
  • moveMouse(x, y): Moves cursor to coordinates
  • clickMouse(button, action): Performs mouse clicks
  • typeText(text): Types text via keyboard
  • pressKeys(keys): Presses keyboard shortcuts
  • wait(seconds): Waits for a specified duration

3. OBSERVATION (Feedback Phase)

The agent receives feedback from the action:

  • Success/failure status
  • Element coordinates (for detection)
  • Screenshot data
  • Error messages

This cycle repeats until the task is complete or max iterations (20) is reached.
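
In code, the loop looks roughly like the sketch below; Thought, Agent, and execute are illustrative placeholders rather than the app's actual classes:

// Minimal sketch of the ReAct loop with the 20-iteration cap described above.
class Thought {
  final bool isDone;
  final String action; // e.g. "pressKeys(['cmd', 'space'])"
  const Thought(this.isDone, this.action);
}

abstract class Agent {
  // THOUGHT: reason about the next step given the goal and last observation.
  Future<Thought> think(String goal, String observation);
}

// Stub executor; the real app dispatches to the automation functions.
Future<String> execute(String action) async => 'success';

Future<void> runReActLoop(Agent agent, String goal) async {
  const maxIterations = 20;
  var observation = 'Task started.';
  for (var i = 0; i < maxIterations; i++) {
    final thought = await agent.think(goal, observation); // 1. THOUGHT
    if (thought.isDone) return;                           // task complete
    observation = await execute(thought.action);          // 2. ACTION → 3. OBSERVATION
  }
}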

🔧 Technical Architecture

1. AI Integration (Gemini 2.5 Flash)

The application uses Google's Gemini AI with function calling capabilities:

// Model configured with the automation tools; each *Tool below is a
// Tool containing a FunctionDeclaration the model can call.
final model = GenerativeModel(
  model: 'gemini-2.5-flash',
  apiKey: apiKey,
  tools: [
    captureScreenshotTool, // screen capture
    detectElementTool,     // vision-based element detection (experimental)
    moveMouseTool,
    clickMouseTool,
    typeTextTool,
    pressKeysTool,
    waitTool,
  ],
);

The AI can:

  • Understand natural language instructions
  • Reason about multi-step tasks
  • Call automation functions with appropriate parameters
  • Process visual information from screenshots
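
For concreteness, here is a hedged sketch of how one such tool (pressKeys) could be declared and dispatched with the google_generative_ai package; the schema wording and result payload are assumptions, not the app's exact declarations:

import 'package:google_generative_ai/google_generative_ai.dart';

// Declare pressKeys as a callable tool for the model.
final pressKeysTool = Tool(functionDeclarations: [
  FunctionDeclaration(
    'pressKeys',
    'Presses a keyboard shortcut, e.g. ["cmd", "space"].',
    Schema.object(properties: {
      'keys': Schema.array(
        items: Schema.string(description: 'A key name such as "cmd" or "enter".'),
      ),
    }),
  ),
]);

Future<void> demo(GenerativeModel model) async {
  final chat = model.startChat();
  var response = await chat.sendMessage(Content.text('Open Spotlight'));

  // If the model chose a tool call, run it and return the result so the
  // model can keep reasoning (a real agent repeats this in a loop).
  final calls = response.functionCalls.toList();
  for (final call in calls) {
    final result = {'status': 'success'}; // run the real automation here
    response = await chat.sendMessage(
      Content.functionResponses([FunctionResponse(call.name, result)]),
    );
  }
}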

2. Computer Vision (UI Element Detection)

The VisionService supports two vision providers for UI element detection:

Gemini Vision API (Default)

  • Uses Google's Gemini 2.5 Flash model
  • Integrated with Google AI Studio
  • Fast and reliable for most use cases

Qwen Vision API (Alternative)

  • Uses Alibaba Cloud's Qwen 2.5 VL 72B Instruct model
  • OpenAI-compatible API format
  • Provides image size detection and confidence scores
  • Configurable resolution parameters

How it works:

  1. Takes a screenshot of the current screen
  2. Sends the image + element description to the selected vision API
  3. AI analyzes the image and returns pixel coordinates
  4. Returns a DetectionResult with x, y coordinates and confidence score

Example:

final result = await VisionService.detectElementPosition(
  imageBytes,
  "blue Submit button",
);
// Returns: {x: 450, y: 320, confidence: 0.95}

Switching Providers: Edit lib/config/app_config.dart:

static const String visionProvider = 'qwen';  // or 'gemini'
static const String qwenApiKey = 'sk-your-qwen-api-key';

See QWEN_INTEGRATION.md for detailed setup instructions.

3. Input Automation

Uses the bixat_key_mouse package (custom Rust-based FFI) for:

  • Mouse Control: Move cursor, click, double-click, right-click
  • Keyboard Control: Type text, press keys, keyboard shortcuts
  • Screen Capture: Take screenshots via screen_capturer
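
The automation surface exposed to the agent can be pictured as an interface like the one below; the method names mirror the tool functions listed earlier, and real implementations would delegate to bixat_key_mouse (whose actual API is not shown here):

// Sketch of the automation surface; method bodies would delegate to
// bixat_key_mouse and screen_capturer.
abstract class AutomationApi {
  Future<void> moveMouse(int x, int y);
  Future<void> clickMouse(String button, String action); // e.g. 'left', 'click'
  Future<void> typeText(String text);
  Future<void> pressKeys(List<String> keys); // e.g. ['cmd', 'space']
  Future<List<int>> captureScreenshot();     // PNG bytes via screen_capturer
}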

4. State Management (Provider)

The AppState class manages:

  • Current task execution state
  • Execution logs and thought logs
  • Screenshot data
  • Task history from database
  • ReAct agent state (iteration count, current thought, observations)
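
A stripped-down sketch of this pattern with ChangeNotifier (the field names are illustrative, not the actual AppState implementation):

import 'package:flutter/foundation.dart';

// Provider-style state sketch: mutate, then notify listening widgets.
class AppStateSketch extends ChangeNotifier {
  final List<String> thoughtLog = [];
  final List<String> executionLog = [];
  bool isRunning = false;

  void addThought(String thought) {
    thoughtLog.add(thought);
    notifyListeners(); // rebuilds widgets watching this state
  }

  void setRunning(bool value) {
    isRunning = value;
    notifyListeners();
  }
}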

5. Data Persistence (Isar Database)

Tasks are stored locally using Isar (NoSQL database):

@collection
class Task {
  Id id = Isar.autoIncrement;
  String prompt = '';
  List<String> thoughts = [];  // AI reasoning steps
  List<String> steps = [];     // Executed actions
  bool completed = false;
  DateTime createdAt = DateTime.now();
}
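
Reading and writing tasks then looks roughly like the sketch below, assuming the TaskSchema and isar.tasks accessor generated by build_runner:

import 'package:isar/isar.dart';
import 'package:path_provider/path_provider.dart';
// Also import the model file that defines Task and the generated TaskSchema.

Future<void> demoPersistence() async {
  final dir = await getApplicationDocumentsDirectory();
  final isar = await Isar.open([TaskSchema], directory: dir.path);

  // Write a completed task inside a transaction.
  final task = Task()
    ..prompt = 'Open Chrome and search for Flutter documentation'
    ..completed = true;
  await isar.writeTxn(() async => isar.tasks.put(task));

  // Read back the full history.
  final history = await isar.tasks.where().findAll();
  print('Stored ${history.length} task(s)');
}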

📦 Dependencies

Core AI & Automation

  • google_generative_ai (^0.4.3): Gemini AI integration with function calling
  • bixat_key_mouse: Custom Rust-based FFI package for mouse/keyboard control
  • screen_capturer (^0.2.1): Cross-platform screen capture functionality

State Management & Storage

  • provider (^6.1.1): State management using ChangeNotifier pattern
  • isar (^3.1.0+1): Fast, local NoSQL database for task persistence
  • isar_flutter_libs (^3.1.0+1): Isar platform-specific bindings

UI & Design

  • flutter_animate (^4.5.0): Declarative animations and transitions
  • google_fonts (^6.1.0): Inter font family for typography
  • Material Design 3: Modern design system with gradient themes

Utilities

  • path_provider (^2.1.1): Access to file system paths
  • uuid (^4.2.1): Generate unique identifiers for tasks
  • image (^4.5.4): Image processing and manipulation

Development

  • build_runner (^2.4.6): Code generation for Isar
  • isar_generator (^3.1.0+1): Generates Isar database code

🚀 Getting Started

Prerequisites

  • Flutter SDK with desktop support enabled
  • A Google Gemini API key (from Google AI Studio)
  • The bixat_key_mouse_dart package checked out next to this repository (see step 3)
  • macOS (currently the only supported platform)

Installation

  1. Clone the repository

    git clone <repository-url>
    cd nextdesk
  2. Install dependencies

    flutter pub get
  3. Set up the bixat_key_mouse package

    Ensure the bixat_key_mouse_dart package is available at:

    ../bixat_key_mouse_dart
    
  4. Configure API key

    Copy the example config file and add your API key:

    cp lib/config/app_config.dart.example lib/config/app_config.dart

    Then open lib/config/app_config.dart and replace the API key:

    static const String geminiApiKey = 'YOUR_GEMINI_API_KEY_HERE';
  5. Generate Isar database code

    flutter pub run build_runner build
  6. Run the application

    flutter run -d macos  # Windows/Linux are not supported yet

💡 Usage Examples

Example 1: Simple Web Search

Input: "Open Chrome and search for Flutter documentation"

ReAct Process:
THOUGHT: I need to open Chrome first
ACTION: pressKeys(['cmd', 'space'])
OBSERVATION: Spotlight opened

THOUGHT: Now type Chrome to search for it
ACTION: typeText('Chrome')
OBSERVATION: Text typed successfully

THOUGHT: Press Enter to open Chrome
ACTION: pressKeys(['enter'])
OBSERVATION: Chrome is opening

THOUGHT: Wait for Chrome to load
ACTION: wait(2)
OBSERVATION: Waited 2 seconds

THOUGHT: Now I need to click on the address bar
ACTION: captureScreenshot()
OBSERVATION: Screenshot captured

THOUGHT: Detect the address bar
ACTION: detectElementPosition('address bar at the top')
OBSERVATION: Found at x:500, y:100

THOUGHT: Click on the address bar
ACTION: moveMouse(500, 100)
ACTION: clickMouse('left', 'click')
OBSERVATION: Clicked successfully

THOUGHT: Type the search query
ACTION: typeText('Flutter documentation')
ACTION: pressKeys(['enter'])
OBSERVATION: Task complete

Example 2: File Operations

Input: "Create a new text file named 'notes.txt' on the desktop"

Example 3: Application Control

Input: "Take a screenshot and save it"

🎯 Key Features

✅ Implemented

  • ✅ Natural language task understanding
  • ✅ ReAct reasoning framework (Thought → Action → Observation)
  • ✅ AI-powered UI element detection using computer vision (experimental; see Known Limitations)
  • ✅ Mouse and keyboard automation
  • ✅ Screenshot capture and analysis
  • ✅ Task history and persistence (Isar database)
  • ✅ Multi-step task execution with iteration control
  • ✅ Real-time execution logs and thought visualization
  • ✅ Responsive desktop interface

🔮 Future Enhancements

  • Multi-monitor support
  • Task templates and macros
  • Voice command input
  • Task scheduling and automation
  • Error recovery and retry logic
  • Performance optimization
  • Plugin system for custom actions
  • Cloud sync for task history
  • Dark/Light theme toggle
  • Export task history to JSON/CSV

๐Ÿ›๏ธ Code Organization

The project follows a clean, modular architecture with clear separation of concerns:

  • Models: Data structures for tasks, detection results, and agent state
  • Services: AI integration, vision processing, and automation functions
  • Providers: State management using Provider pattern
  • Screens: Main UI with responsive layout
  • Widgets: Reusable UI components
  • Config: Centralized theme and design system

🔒 Security & Privacy

  • API Key: Store your Gemini API key securely (use environment variables in production)
  • Local Processing: All automation runs locally on your machine
  • Data Storage: Task history is stored locally using Isar database
  • Screenshots: Temporary screenshots are kept in memory and not persisted
  • No Telemetry: No data leaves your machine other than the Gemini (or Qwen) API calls, which include screenshots when vision detection is used
  • Permissions: Requires accessibility permissions for automation (user-controlled)

โš ๏ธ Known Limitations

โš ๏ธ Vision-Based Element Detection (NOT READY)

The detectElementPosition function uses AI vision to locate UI elements, but it is currently unreliable and NOT recommended for use:

  • โŒ Not Production Ready: This feature is experimental and under active development
  • โŒ Accuracy Issues: Detection may be off by several pixels or fail entirely
  • โŒ Inconsistent Results: Same element may be detected differently across runs
  • โŒ Complex UIs: Elements in dense or overlapping layouts are very difficult to detect
  • โŒ Similar Elements: May confuse similar-looking buttons or icons
  • โŒ Performance: Vision API calls are slow and may timeout

โœ… RECOMMENDED APPROACH:

  • Use keyboard shortcuts (pressKeys) whenever possible - much more reliable
  • Use getShortcuts tool to dynamically fetch keyboard shortcuts for applications
  • Avoid vision-based detection until this feature is stabilized in future releases

This is a known limitation of the current implementation and AI vision models. We are actively working on improving this feature.

๐Ÿ› Troubleshooting

Common Issues

  1. "Failed to detect element"

    • Note: Element detection is not always precise and may fail
    • Use keyboard shortcuts instead of mouse clicks when possible
    • Ensure the element description is very clear and specific
    • Try taking a screenshot first to verify the UI state
    • Check that the element is visible on screen
    • Improve description with more details (e.g., "blue Submit button in bottom right corner with white text")
  2. "API key error"

    • Verify your Gemini API key is valid
    • Check your internet connection
    • Ensure you haven't exceeded API quotas
    • Update the API key in lib/config/app_config.dart
  3. Mouse/keyboard not working

    • Grant accessibility permissions to the app (System Preferences → Security & Privacy)
    • Check that bixat_key_mouse package is properly installed
    • Verify platform-specific permissions
    • Restart the application after granting permissions

๐Ÿค Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📧 Contact

https://bixat.dev


Built with ❤️ using Flutter and Google Gemini AI
