Improve Observability of Environments in Curriculum #1808

@krypticmouse

Description

Summary

Implement a comprehensive monitoring system that continuously tracks the status and performance of each environment in the RL curriculum, with automated warnings for unexpected behavior patterns such as performance regression in graduated environments.

Background

The current curriculum system in Marin tracks environments through different states (locked → unlocked → graduated) and monitors performance during active training. However, once an environment graduates, there's limited visibility into its ongoing performance. This can lead to silent regressions where a previously mastered environment deteriorates without notice.

Problem Statement

  1. Limited Post-Graduation Visibility: Once an environment graduates (reaches stop_threshold and plateaus), we stop actively monitoring its performance
  2. No Regression Detection: If a graduated environment's performance degrades during subsequent training, this goes unnoticed
  3. Lack of Holistic View: No centralized view of curriculum health across all environments and their historical performance
  4. Missing Early Warning System: No proactive alerts for concerning patterns like:
    • Graduated environments showing performance drops
    • Environments stuck in training without progress
    • Unusual performance volatility

Proposed Solution

1. Continuous Performance Tracking

  • Track performance metrics for ALL environments (not just active ones) during periodic evaluations
  • Store historical performance data with timestamps (a data-structure sketch follows this list)
  • Maintain separate tracking for:
    • Training performance
    • Evaluation performance
    • Time spent in each state (locked/active/graduated)
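
For concreteness, the historical data could be stored as timestamped snapshots per environment. A minimal sketch; the PerformanceSnapshot/EnvironmentHistory names and fields are illustrative assumptions, not existing code:

from dataclasses import dataclass, field


@dataclass
class PerformanceSnapshot:
    """One measurement of a single environment at a point in time."""
    step: int                  # global training step when the measurement was taken
    timestamp: float           # wall-clock time (seconds since epoch)
    train_performance: float   # performance observed during training rollouts
    eval_performance: float    # performance from the periodic evaluation pass
    state: str                 # "locked", "active", or "graduated" at measurement time


@dataclass
class EnvironmentHistory:
    """Timestamped history for one environment, kept regardless of its current state."""
    env_id: str
    snapshots: list[PerformanceSnapshot] = field(default_factory=list)
    state_entered_at: dict[str, float] = field(default_factory=dict)  # state -> entry timestamp

    def record(self, snapshot: PerformanceSnapshot) -> None:
        self.snapshots.append(snapshot)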

2. Monitoring Dashboard/Metrics

Create a monitoring component that tracks per-environment health metrics along the lines of:

from dataclasses import dataclass


@dataclass
class EnvironmentHealthMetrics:
    """Health metrics for a single environment."""
    env_id: str
    current_state: str  # "locked", "active", "graduated"
    
    # Performance tracking
    current_train_performance: float
    current_eval_performance: float
    peak_performance: float
    performance_trend: str  # "improving", "stable", "regressing"
    
    # State duration
    time_in_current_state: int
    total_training_steps: int
    
    # Health indicators
    health_status: str  # "healthy", "warning", "critical"
    warnings: list[str]  # List of active warnings
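
Derived fields such as performance_trend would need to be computed from recent history. One possible derivation, where the window size and 2% tolerance are illustrative assumptions:

def classify_trend(eval_history: list[float], window: int = 10, tolerance: float = 0.02) -> str:
    """Return "improving", "stable", or "regressing" from the recent evaluation window."""
    if len(eval_history) < 2 * window:
        return "stable"  # not enough data to call a trend
    recent = sum(eval_history[-window:]) / window
    previous = sum(eval_history[-2 * window:-window]) / window
    if recent > previous * (1 + tolerance):
        return "improving"
    if recent < previous * (1 - tolerance):
        return "regressing"
    return "stable"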

3. Automated Warning System

Implement configurable alerts (a sketch of the warning record follows this list) for:

  • Graduated Environment Regression: Warn if eval performance drops by X% from graduation level
  • Training Stagnation: Alert if an active environment shows no progress for N steps
  • Unexpected State Transitions: Flag unusual patterns (e.g., rapid graduation followed by regression)
  • Performance Volatility: Detect and warn about unstable performance patterns
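
The checks in section 4 construct a Warning record; a minimal sketch of that record is below. Field names mirror the usage in section 4b and are otherwise assumptions; note that this name shadows Python's builtin Warning, so something like CurriculumWarning may be preferable:

from dataclasses import dataclass


@dataclass
class Warning:
    """A single curriculum health alert (note: shadows Python's builtin Warning)."""
    type: str      # e.g. "graduated_regression", "training_stagnation"
    env_id: str
    message: str
    severity: str  # e.g. "low", "medium", "high"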

4. Implementation Details

a) Extend Curriculum class

class Curriculum:
    def __init__(self, ...):
        # ... existing init ...
        self.health_monitor = CurriculumHealthMonitor(config)
    
    def update_stats(self, ...):
        # ... existing update logic ...
        # Add health monitoring
        self.health_monitor.update(lesson_id, stats, mode)
        warnings = self.health_monitor.check_warnings()
        if warnings:
            self._handle_warnings(warnings)
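
The _handle_warnings hook is not specified here; a minimal sketch, assuming warnings carry env_id/message/severity fields and that plain logging is sufficient for a first pass (escalation policy is left as an open question):

import logging

logger = logging.getLogger(__name__)


class Curriculum:
    # ... existing methods ...

    def _handle_warnings(self, warnings: list[Warning]) -> None:
        """Surface each active warning alongside training logs."""
        for warning in warnings:
            logger.warning(
                "[curriculum:%s] %s (severity=%s)",
                warning.env_id, warning.message, warning.severity,
            )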

b) Add HealthMonitor component

from collections import defaultdict
from typing import Optional


class CurriculumHealthMonitor:
    def __init__(self, config: HealthMonitorConfig):
        self.config = config
        self.historical_performance = defaultdict(list)
        self.graduation_baselines = {}
        self.active_warnings = defaultdict(list)
    
    def check_graduated_regression(self, env_id: str) -> Optional[Warning]:
        """Check if graduated environment has regressed."""
        if env_id not in self.graduation_baselines:
            return None
        
        current_perf = self.get_current_performance(env_id)
        baseline = self.graduation_baselines[env_id]
        
        if current_perf < baseline * self.config.regression_threshold:
            return Warning(
                type="graduated_regression",
                env_id=env_id,
                message=f"Graduated env {env_id} performance dropped from {baseline:.3f} to {current_perf:.3f}",
                severity="high"
            )
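
A companion check for training stagnation could follow the same pattern. The sketch below is intended as another CurriculumHealthMonitor method and assumes historical_performance holds a list of scalar performance values per environment:

    def check_training_stagnation(self, env_id: str) -> Optional[Warning]:
        """Warn if an environment has not improved within the configured window."""
        history = self.historical_performance[env_id]
        window = self.config.stagnation_window
        if len(history) <= window:
            return None

        recent_best = max(history[-window:])
        previous_best = max(history[:-window])
        if recent_best <= previous_best:
            return Warning(
                type="training_stagnation",
                env_id=env_id,
                message=f"Env {env_id} has not improved over the last {window} recorded updates",
                severity="medium",
            )
        return None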

c) Logging and Visualization

  • Log warnings to wandb/tensorboard with appropriate tags (a wandb logging sketch follows this list)
  • Create visualization dashboards showing:
    • Curriculum state diagram with performance overlays
    • Historical performance charts for each environment
    • Warning/alert timeline
    • Health status summary
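
A minimal sketch of the wandb side, assuming an active wandb run and the health-metric/warning records described above; the metric names and the log_health_metrics helper are illustrative assumptions:

import wandb


def log_health_metrics(step: int, metrics: dict[str, "EnvironmentHealthMetrics"],
                       warnings: list[Warning]) -> None:
    """Log per-environment eval performance, health status, and the active warning count."""
    status_codes = {"healthy": 0, "warning": 1, "critical": 2}
    payload = {}
    for env_id, m in metrics.items():
        payload[f"curriculum/{env_id}/eval_performance"] = m.current_eval_performance
        payload[f"curriculum/{env_id}/health_status"] = status_codes[m.health_status]
    payload["curriculum/active_warnings"] = len(warnings)
    wandb.log(payload, step=step)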

5. Configuration

Add configuration options:

health_monitoring:
  enabled: true
  regression_threshold: 0.85  # Warn if performance drops below 85% of graduation level
  stagnation_window: 100  # Steps without progress before warning
  evaluation_frequency: 50  # How often to evaluate all environments
  warning_cooldown: 200  # Steps before re-issuing same warning
  metrics_retention_days: 30  # How long to keep historical data
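
The HealthMonitorConfig referenced in section 4b could simply mirror these keys. A minimal sketch, with defaults taken from the YAML above:

from dataclasses import dataclass


@dataclass
class HealthMonitorConfig:
    """Configuration for curriculum health monitoring; defaults mirror the YAML above."""
    enabled: bool = True
    regression_threshold: float = 0.85  # warn below this fraction of the graduation baseline
    stagnation_window: int = 100        # steps without progress before warning
    evaluation_frequency: int = 50      # how often (in steps) to evaluate all environments
    warning_cooldown: int = 200         # steps before re-issuing the same warning
    metrics_retention_days: int = 30    # how long to keep historical data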

Benefits

  1. Early Problem Detection: Catch performance regressions before they become critical
  2. Better Training Stability: Maintain consistent performance across all learned tasks
  3. Improved Debugging: Clear visibility into curriculum dynamics and problem areas
  4. Confidence in Deployment: Ensure model maintains competence across all trained environments

Success Criteria

  • All environments are continuously monitored regardless of state
  • Warnings are generated for configurable regression thresholds
  • Historical performance data is tracked and accessible
  • Integration with existing logging/visualization tools (wandb, tensorboard)
  • Minimal performance overhead (<5% additional compute)

Open Questions

  1. Should we automatically intervene when warnings are detected (e.g., re-activate graduated environments)?
  2. What's the appropriate balance between monitoring frequency and computational cost?
  3. Should health metrics influence curriculum sampling weights?
  4. How long should we retain historical performance data?

Related Issues/Context

  • Current curriculum implementation: src/marin/rl/curriculum.py
  • Environment evaluation: src/marin/rl/evaluate_environment.py
  • Consider integration with existing rollout stats tracking
