206 changes: 206 additions & 0 deletions adr/ADR-0011-java-to-python-service-migration.md
---
num: 11
title: Java to Python Service Migration
status: "Draft"
authors:
- "ruivieira"
tags:
- "service"
- "python"
- "java"
- "migration"
---

## Title

TrustyAI Service Migration from Java to Python

## Context and Problem Statement

The current TrustyAI service uses Java with Quarkus and provides explainability, fairness and drift monitoring for machine learning models in OpenShift/Kubernetes environments.
However, the machine learning ecosystem is predominantly Python-based, with most AI safety libraries, explainability tools, data science toolkits, and modelling frameworks built in Python.
This includes traditional ML explainability and fairness metrics, plus generative AI safety tools, multi-modal assessment frameworks, and AI governance solutions.
The Java version creates barriers for integration with Python-native libraries and limits our ability to use the broader Python AI environment.

Additionally, maintaining both Java core and Python bindings creates complexity and extra maintenance work.

## Goals

* Provide a drop-in replacement Python service that maintains API compatibility with the existing Java version
* Use native Python AI tools for AI safety, including explainability, fairness metrics, generative AI safety, and multi-modal assessment
* Reduce maintenance work by removing the need for Java-Python bindings
* Keep compatibility with existing deployment patterns and integrations
* Enable easier extension and customisation of AI safety algorithms across traditional ML and generative AI
* Keep existing OpenShift/Kubernetes deployment features
* Remove the Java dependency

## Non-goals


* Change the external API contract or deployment model
* Build endpoints that are not actively used in production or downstream deployments (some endpoints may be omitted if they are not used in practice)
* Provide feature parity for all Java-specific internal code

## Current situation

The Java TrustyAI service provides a complete REST API with the following main endpoint categories:

**Implemented Endpoints:**
- Data Upload/Download (`/data/upload`, `/data/download`)
- Drift Metrics (`/metrics/drift/*` - ApproxKSTest, FourierMMD, KSTest, Meanshift)
- Fairness Metrics (`/metrics/group/fairness/*` - DIR, SPD)
- Explainers (`/explainers/global/*`, `/explainers/local/*` - LIME, SHAP, Counterfactual, PDP, TSSaliency)
- Identity Metrics (`/metrics/identity/*`)
- Consumer Endpoints (`/consumer/kserve/v2`)
- Service Metadata (`/info/*`)
- Legacy Endpoints (`/metrics/dir`, `/metrics/spd`)

**Python components:**
**Review comment (@dahlem):** Do we need to identify which upstream components provide the implementations that give us parity with the status quo?

**Reply (@ruivieira, Member Author, Jun 23, 2025):** @dahlem good point. IMO the plan is to (later on) replace any algorithm with TrustyAI code/SDK providers, e.g. we would have in these endpoints something like:

```python
from trustyai.providers import FairnessProvider

# /metrics/spd
spd = FairnessProvider(provider="<your_favourite_library>", ...).spd(data)
```

with the concrete metrics backend selected by the user via configuration, but I agree it's a good idea to specify which providers will be the default (e.g. scikit-learn, AIF360, etc.).
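
For illustration only, here is a minimal sketch of the configurable-provider idea described in the reply above. The `FairnessProvider` name is taken from the comment, but the backend registry, the `spd` signature, and the pure-NumPy fallback are assumptions for this sketch, not an existing TrustyAI API.

```python
import numpy as np


def _spd_numpy(privileged_favorable: np.ndarray, unprivileged_favorable: np.ndarray) -> float:
    """Statistical Parity Difference: P(favorable | unprivileged) - P(favorable | privileged)."""
    return float(unprivileged_favorable.mean() - privileged_favorable.mean())


class FairnessProvider:
    """Hypothetical dispatcher that selects a fairness backend by name (e.g. from configuration)."""

    _BACKENDS = {"numpy": _spd_numpy}  # real providers could register scikit-learn, AIF360, etc.

    def __init__(self, provider: str = "numpy"):
        if provider not in self._BACKENDS:
            raise ValueError(f"Unknown fairness provider: {provider}")
        self._spd = self._BACKENDS[provider]

    def spd(self, privileged_favorable, unprivileged_favorable) -> float:
        return self._spd(np.asarray(privileged_favorable), np.asarray(unprivileged_favorable))


# Example usage (favorable-outcome indicators per group):
# FairnessProvider(provider="numpy").spd([1, 1, 0, 1], [1, 0, 0, 1])
```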

The Python service will be built using FastAPI and will include:
- Basic service structure with health checks and Prometheus metrics
- Partial fairness metrics (DIR, SPD)
- Basic drift metrics
- Explainer endpoints (global and local)
- Data upload/download features
- Consumer endpoint for KServe v2 payloads
- Service metadata endpoints
- Complete algorithmic code for all metrics and explainers
- Full data storage and retrieval
- Complete KServe v2 payload processing
- Scheduled metrics computation
- Additional endpoints as needed for production use cases

## Proposal

### Migration Strategy

1. **Incremental Migration**: We will complete the Python service to achieve full API compatibility
2. **Drop-in Replacement**: We will ensure the Python service can be deployed as a direct replacement for the Java version
3. **Algorithmic Preservation**: We will maintain the same algorithmic behaviour for consistency
4. **Selective Focus**: We will focus on endpoints that are actively used in downstream deployments

### Technical Implementation

**Framework**: We will use FastAPI as it provides:
- Automatic OpenAPI documentation generation
- High performance async features
- Type hints and validation
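
A minimal sketch of what one such endpoint could look like under FastAPI, illustrating the automatic OpenAPI documentation and Pydantic-based validation. The route path mirrors the existing SPD endpoint, but the request fields shown here are simplified assumptions rather than the exact Java request schema.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="TrustyAI Service")


class GroupMetricRequest(BaseModel):
    # Simplified request model for illustration; the real schema follows the Java service.
    modelId: str
    protectedAttribute: str
    outcomeName: str
    privilegedAttribute: int | str
    unprivilegedAttribute: int | str
    favorableOutcome: int | str


@app.post("/metrics/group/fairness/spd")
def compute_spd(request: GroupMetricRequest) -> dict:
    # Validation and the OpenAPI schema come from the Pydantic model;
    # the metric itself would be computed over stored inference data.
    value = 0.0  # placeholder for the actual SPD computation
    return {"type": "metric", "name": "SPD", "value": value}
```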

**Endpoint Coverage**: We will build all endpoints documented in the [TrustyAI Service API Reference](https://trustyai-explainability.github.io/trustyai-site/main/trustyai-service-api-reference.html) except where otherwise noted, including:

- **DataUpload**: `POST /data/upload`
- **DownloadEndpoint**: `POST /data/download`
- **DriftMetrics**: Complete ApproxKSTest, FourierMMD, KSTest, and Meanshift
- **FairnessMetrics**: Full DIR and SPD with group fairness features
- **ExplainersGlobal**: LIME and PDP global explainers
- **ExplainersLocal**: LIME, SHAP, Counterfactual, and TSSaliency local explainers
- **IdentityEndpoint**: Identity metrics computation
- **ServiceMetadata**: Complete service information and metadata endpoints
- **MetricsInformation**: Metrics scheduling and information endpoints
- **LegacyEndpoints**: Keep backward compatibility with legacy metric endpoints

**Data Storage**: We will build equivalent data persistence using:
- HDF5 for data storage
- Pandas for data manipulation
- Prometheus client for metrics exposure
- MariaDB for relational database storage

**Review comment (@sheltoncyril):** Hey Rui 👋, quick question, does it have to be MariaDB or could we just say any SQL-based DB provider?

**Reply (@ruivieira, Member Author, Jun 23, 2025):** Hi @sheltoncyril, good question. It does not have to be MariaDB exclusively, but for each supported storage backend we would need to write an interface for it. As a starting point, I think we should do MariaDB first, since that's the supported one for the Java service. But in theory, we could have either TrustyAI (or community) interfaces for any storage type in the future.
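
As an illustration of the storage layer listed above, here is a minimal sketch using pandas' HDF5 support (which relies on PyTables); the file path and per-model key scheme are assumptions for the sketch.

```python
import pandas as pd


class InferenceStore:
    """Minimal HDF5-backed store for model inference data (illustrative only)."""

    def __init__(self, path: str = "/data/inferences.h5"):
        self._path = path

    def append(self, model_id: str, frame: pd.DataFrame) -> None:
        # One HDF5 key per model; new inference rows are appended in table format.
        with pd.HDFStore(self._path) as store:
            store.append(model_id, frame, format="table")

    def read(self, model_id: str) -> pd.DataFrame:
        with pd.HDFStore(self._path) as store:
            return store.get(model_id)
```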


**Deployment Compatibility**: We will keep compatibility with:
- OpenShift/Kubernetes deployments
- TrustyAI Operator management
- Existing service discovery and network patterns
- Health check endpoints (`/q/health/ready`, `/q/health/live`)
- Prometheus metrics endpoint (`/q/metrics`)
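
A sketch of how the Quarkus-style health and metrics paths could be preserved in FastAPI so that existing probes and Prometheus scrape configurations keep working unchanged; the response bodies shown are simplified assumptions rather than the exact Quarkus health payloads.

```python
from fastapi import FastAPI, Response
from prometheus_client import CONTENT_TYPE_LATEST, generate_latest

app = FastAPI()


@app.get("/q/health/live")
def liveness() -> dict:
    # Simplified liveness response; a fuller payload could mirror Quarkus health checks.
    return {"status": "UP"}


@app.get("/q/health/ready")
def readiness() -> dict:
    return {"status": "UP"}


@app.get("/q/metrics")
def metrics() -> Response:
    # Expose the default Prometheus registry in text exposition format.
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)
```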

### Algorithmic Implementation

**Native Python Libraries**: We will use existing Python libraries where appropriate:
- scikit-learn for standard ML metrics
- SHAP library for SHAP explanations
- LIME library for LIME explanations
- NumPy/SciPy for statistical computations
- AIF360 library for fairness
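
For example, a drift metric such as the KS test could lean directly on SciPy rather than a bespoke implementation. The helper below is a sketch; the significance threshold is an illustrative default, not the service's configured value.

```python
import numpy as np
from scipy import stats


def ks_drift(reference: np.ndarray, current: np.ndarray, threshold: float = 0.05) -> dict:
    """Two-sample Kolmogorov-Smirnov test between training-time and live data for one column."""
    statistic, p_value = stats.ks_2samp(reference, current)
    return {
        "statistic": float(statistic),
        "pValue": float(p_value),
        "driftDetected": bool(p_value < threshold),
    }


# Example: ks_drift(np.random.normal(0, 1, 1000), np.random.normal(0.5, 1, 1000))
```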

**Custom Code**: We will build TrustyAI-specific algorithms to keep consistency with the Java version

**Testing**: We will ensure algorithmic parity through complete testing against Java results
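
One way such parity testing could look: reference values exported from the Java service are replayed against the Python implementation on the same inputs. The fixture file name, field names, and tolerance below are assumptions for the sketch.

```python
import json

import numpy as np
import pytest


@pytest.mark.parametrize("case", json.load(open("tests/fixtures/java_spd_reference.json")))
def test_spd_matches_java(case):
    privileged = np.array(case["privileged_favorable"])
    unprivileged = np.array(case["unprivileged_favorable"])
    expected = case["java_spd"]

    # SPD definition: P(favorable | unprivileged) - P(favorable | privileged)
    actual = unprivileged.mean() - privileged.mean()

    assert actual == pytest.approx(expected, abs=1e-9)
```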

## Alternatives Considered / Rejected

### Keep Dual Implementation
Continuing to maintain both Java and Python versions was rejected due to:
- Increased maintenance work
- Complexity in keeping code synchronised
- Limited ability to use Python-native libraries in the Java service

### Hybrid Approach
Using JPype or similar Java-Python bridges was rejected due to:
- Complexity in deployment and container management
- Performance overhead from language bridges
- Limited ability to extend with pure Python libraries

### Gradual Feature Migration
A gradual migration with feature flags was rejected due to:
- Complexity in managing two codebases
- Potential for inconsistent behaviour
- Difficulty in keeping API compatibility

## Challenges

### Technical Challenges
- **Algorithmic Parity**: Ensuring Python code produces identical results to Java code
- **Performance**: Keeping comparable performance characteristics
- **Memory Management**: Handling large datasets efficiently in Python

### Operational Challenges
- **Deployment Transition**: Ensuring smooth transition for existing deployments
- **Documentation**: Updating all documentation and examples
- **Testing**: Complete testing to ensure feature parity
- **Monitoring**: Keeping observability and monitoring features

### Integration Challenges
- **Operator Compatibility**: Ensuring TrustyAI Operator continues to work with the Python service
- **Third-party Integrations**: Keeping compatibility with KServe, ModelMesh, and other integrations
- **API Compatibility**: Ensuring exact API compatibility for existing clients

## Dependencies

### Development Dependencies
- Completion of Python service algorithmic code
- Complete test suite development
- Performance benchmarking and optimisation
- Documentation updates

### Deployment Dependencies
- Container image updates
- Operator compatibility testing
- Downstream integration testing
- Migration documentation and tooling

### Environment Dependencies
- Python ML library environment stability
- FastAPI framework continued support
- Container base image security and maintenance

## Consequences if not completed

### Negative Consequences
- **Limited Extensibility**: Continued inability to easily integrate Python-native ML libraries
- **Maintenance Burden**: Ongoing complexity of keeping Java-Python bindings
- **Innovation Constraints**: Reduced ability to use Python ML research innovations
- **Community Barriers**: Difficulty for Python-focused contributors to contribute to the project

### Missed Opportunities
- **Environment Integration**: Lost opportunities to integrate with Python ML environment
- **Simplified Architecture**: Continued complexity from dual-language setup
- **Developer Experience**: Reduced developer productivity due to language barriers

## Success Criteria

The migration will be considered successful when:
1. **API Compatibility**: All documented endpoints are built with identical behaviour
2. **Performance Parity**: Performance characteristics match the Java version
3. **Deployment Compatibility**: Drop-in replacement capability is verified
4. **Algorithmic Consistency**: Results are equivalent to Java version
5. **Integration Testing**: All downstream integrations work without modification
6. **Documentation**: Complete documentation and migration guides are available
3 changes: 2 additions & 1 deletion adr/README.md
# Approved ADRs

* [ADR-0001: TrustyAI external library integration](ADR-0001-trustyai-external-library-integration.md)
* [ADR-0002: Metrics and XAI namespaces](ADR-0002-metrics-and-xai-namespaces.md)
* [ADR-0011: Java to Python Service Migration](ADR-0011-java-to-python-service-migration.md)