Improve Observability of Environments in Curriculum #1808

@krypticmouse

Description

Summary

Implement a comprehensive monitoring system that continuously tracks the status and performance of each environment in the RL curriculum, with automated warnings for unexpected behavior patterns such as performance regression in graduated environments.

Background

The current curriculum system in Marin tracks environments through different states (locked → unlocked → graduated) and monitors performance during active training. However, once an environment graduates, there's limited visibility into its ongoing performance. This can lead to silent regressions where a previously mastered environment deteriorates without notice.

Problem Statement

  1. Limited Post-Graduation Visibility: Once an environment graduates (reaches stop_threshold and plateaus), we stop actively monitoring its performance
  2. No Regression Detection: If a graduated environment's performance degrades during subsequent training, this goes unnoticed
  3. Lack of Holistic View: No centralized view of curriculum health across all environments and their historical performance
  4. Missing Early Warning System: No proactive alerts for concerning patterns like:
    • Graduated environments showing performance drops
    • Environments stuck in training without progress
    • Unusual performance volatility

Proposed Solution

1. Continuous Performance Tracking

  • Track performance metrics for ALL environments (not just active ones) during periodic evaluations
  • Store historical performance data with timestamps (a data-structure sketch follows this list)
  • Maintain separate tracking for:
    • Training performance
    • Evaluation performance
    • Time spent in each state (locked/active/graduated)
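
For concreteness, the historical data could be stored as timestamped snapshots per environment. A minimal sketch; the PerformanceSnapshot/EnvironmentHistory names and fields are illustrative assumptions, not existing code:

from dataclasses import dataclass, field


@dataclass
class PerformanceSnapshot:
    """One measurement of a single environment at a point in time."""
    step: int                  # global training step when the measurement was taken
    timestamp: float           # wall-clock time (seconds since epoch)
    train_performance: float   # performance observed during training rollouts
    eval_performance: float    # performance from the periodic evaluation pass
    state: str                 # "locked", "active", or "graduated" at measurement time


@dataclass
class EnvironmentHistory:
    """Timestamped history for one environment, kept regardless of its current state."""
    env_id: str
    snapshots: list[PerformanceSnapshot] = field(default_factory=list)
    state_entered_at: dict[str, float] = field(default_factory=dict)  # state -> entry timestamp

    def record(self, snapshot: PerformanceSnapshot) -> None:
        self.snapshots.append(snapshot)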

2. Monitoring Dashboard/Metrics

Create a monitoring component that tracks per-environment health metrics along the lines of:

from dataclasses import dataclass


@dataclass
class EnvironmentHealthMetrics:
    """Health metrics for a single environment."""
    env_id: str
    current_state: str  # "locked", "active", "graduated"
    
    # Performance tracking
    current_train_performance: float
    current_eval_performance: float
    peak_performance: float
    performance_trend: str  # "improving", "stable", "regressing"
    
    # State duration
    time_in_current_state: int
    total_training_steps: int
    
    # Health indicators
    health_status: str  # "healthy", "warning", "critical"
    warnings: list[str]  # List of active warnings
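
Derived fields such as performance_trend would need to be computed from recent history. One possible derivation, where the window size and 2% tolerance are illustrative assumptions:

def classify_trend(eval_history: list[float], window: int = 10, tolerance: float = 0.02) -> str:
    """Return "improving", "stable", or "regressing" from the recent evaluation window."""
    if len(eval_history) < 2 * window:
        return "stable"  # not enough data to call a trend
    recent = sum(eval_history[-window:]) / window
    previous = sum(eval_history[-2 * window:-window]) / window
    if recent > previous * (1 + tolerance):
        return "improving"
    if recent < previous * (1 - tolerance):
        return "regressing"
    return "stable"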

3. Automated Warning System

Implement configurable alerts (a sketch of the warning record follows this list) for:

  • Graduated Environment Regression: Warn if eval performance drops by X% from graduation level
  • Training Stagnation: Alert if an active environment shows no progress for N steps
  • Unexpected State Transitions: Flag unusual patterns (e.g., rapid graduation followed by regression)
  • Performance Volatility: Detect and warn about unstable performance patterns
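
The checks in section 4 construct a Warning record; a minimal sketch of that record is below. Field names mirror the usage in section 4b and are otherwise assumptions; note that this name shadows Python's builtin Warning, so something like CurriculumWarning may be preferable:

from dataclasses import dataclass


@dataclass
class Warning:
    """A single curriculum health alert (note: shadows Python's builtin Warning)."""
    type: str      # e.g. "graduated_regression", "training_stagnation"
    env_id: str
    message: str
    severity: str  # e.g. "low", "medium", "high"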

4. Implementation Details

a) Extend Curriculum class

class Curriculum:
    def __init__(self, ...):
        # ... existing init ...
        self.health_monitor = CurriculumHealthMonitor(config)
    
    def update_stats(self, ...):
        # ... existing update logic ...
        # Add health monitoring
        self.health_monitor.update(lesson_id, stats, mode)
        warnings = self.health_monitor.check_warnings()
        if warnings:
            self._handle_warnings(warnings)
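
The _handle_warnings hook is not specified here; a minimal sketch, assuming warnings carry env_id/message/severity fields and that plain logging is sufficient for a first pass (escalation policy is left as an open question):

import logging

logger = logging.getLogger(__name__)


class Curriculum:
    # ... existing methods ...

    def _handle_warnings(self, warnings: list[Warning]) -> None:
        """Surface each active warning alongside training logs."""
        for warning in warnings:
            logger.warning(
                "[curriculum:%s] %s (severity=%s)",
                warning.env_id, warning.message, warning.severity,
            )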

b) Add HealthMonitor component

from collections import defaultdict
from typing import Optional


class CurriculumHealthMonitor:
    def __init__(self, config: HealthMonitorConfig):
        self.config = config
        self.historical_performance = defaultdict(list)
        self.graduation_baselines = {}
        self.active_warnings = defaultdict(list)
    
    def check_graduated_regression(self, env_id: str) -> Optional[Warning]:
        """Check if graduated environment has regressed."""
        if env_id not in self.graduation_baselines:
            return None
        
        current_perf = self.get_current_performance(env_id)
        baseline = self.graduation_baselines[env_id]
        
        if current_perf < baseline * self.config.regression_threshold:
            return Warning(
                type="graduated_regression",
                env_id=env_id,
                message=f"Graduated env {env_id} performance dropped from {baseline:.3f} to {current_perf:.3f}",
                severity="high"
            )
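
A companion check for training stagnation could follow the same pattern. The sketch below is intended as another CurriculumHealthMonitor method and assumes historical_performance holds a list of scalar performance values per environment:

    def check_training_stagnation(self, env_id: str) -> Optional[Warning]:
        """Warn if an environment has not improved within the configured window."""
        history = self.historical_performance[env_id]
        window = self.config.stagnation_window
        if len(history) <= window:
            return None

        recent_best = max(history[-window:])
        previous_best = max(history[:-window])
        if recent_best <= previous_best:
            return Warning(
                type="training_stagnation",
                env_id=env_id,
                message=f"Env {env_id} has not improved over the last {window} recorded updates",
                severity="medium",
            )
        return None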

c) Logging and Visualization

  • Log warnings to wandb/tensorboard with appropriate tags (a wandb logging sketch follows this list)
  • Create visualization dashboards showing:
    • Curriculum state diagram with performance overlays
    • Historical performance charts for each environment
    • Warning/alert timeline
    • Health status summary
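
A minimal sketch of the wandb side, assuming an active wandb run and the health-metric/warning records described above; the metric names and the log_health_metrics helper are illustrative assumptions:

import wandb


def log_health_metrics(step: int, metrics: dict[str, "EnvironmentHealthMetrics"],
                       warnings: list[Warning]) -> None:
    """Log per-environment eval performance, health status, and the active warning count."""
    status_codes = {"healthy": 0, "warning": 1, "critical": 2}
    payload = {}
    for env_id, m in metrics.items():
        payload[f"curriculum/{env_id}/eval_performance"] = m.current_eval_performance
        payload[f"curriculum/{env_id}/health_status"] = status_codes[m.health_status]
    payload["curriculum/active_warnings"] = len(warnings)
    wandb.log(payload, step=step)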

5. Configuration

Add configuration options:

health_monitoring:
  enabled: true
  regression_threshold: 0.85  # Warn if performance drops below 85% of graduation level
  stagnation_window: 100  # Steps without progress before warning
  evaluation_frequency: 50  # How often to evaluate all environments
  warning_cooldown: 200  # Steps before re-issuing same warning
  metrics_retention_days: 30  # How long to keep historical data
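
The HealthMonitorConfig referenced in section 4b could simply mirror these keys. A minimal sketch, with defaults taken from the YAML above:

from dataclasses import dataclass


@dataclass
class HealthMonitorConfig:
    """Configuration for curriculum health monitoring; defaults mirror the YAML above."""
    enabled: bool = True
    regression_threshold: float = 0.85  # warn below this fraction of the graduation baseline
    stagnation_window: int = 100        # steps without progress before warning
    evaluation_frequency: int = 50      # how often (in steps) to evaluate all environments
    warning_cooldown: int = 200         # steps before re-issuing the same warning
    metrics_retention_days: int = 30    # how long to keep historical data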

Benefits

  1. Early Problem Detection: Catch performance regressions before they become critical
  2. Better Training Stability: Maintain consistent performance across all learned tasks
  3. Improved Debugging: Clear visibility into curriculum dynamics and problem areas
  4. Confidence in Deployment: Ensure model maintains competence across all trained environments

Success Criteria

  • All environments are continuously monitored regardless of state
  • Warnings are generated for configurable regression thresholds
  • Historical performance data is tracked and accessible
  • Integration with existing logging/visualization tools (wandb, tensorboard)
  • Minimal performance overhead (<5% additional compute)

Open Questions

  1. Should we automatically intervene when warnings are detected (e.g., re-activate graduated environments)?
  2. What's the appropriate balance between monitoring frequency and computational cost?
  3. Should health metrics influence curriculum sampling weights?
  4. How long should we retain historical performance data?

Related Issues/Context

  • Current curriculum implementation: src/marin/rl/curriculum.py
  • Environment evaluation: src/marin/rl/evaluate_environment.py
  • Consider integration with existing rollout stats tracking
