Conversation

@hbelmiro (Contributor) commented Oct 27, 2025

This PR implements HTTP streaming for KFP artifacts, preventing out-of-memory (OOM) errors when downloading large files through the API server. The implementation streams artifacts directly from blob storage to the HTTP response without loading them into memory.

Description of your changes:

1. Migrated to Cloud-Native Blob Storage (gocloud.dev/blob)

  • Replaced the MinIO-specific implementation with the provider-agnostic gocloud.dev/blob
  • Supports S3, GCS, Azure Blob, and MinIO through a unified interface
  • Enables true streaming with the io.Copy pattern (see the sketch below)
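
A minimal sketch of the provider-agnostic pattern (illustrative, not the PR's exact code; the bucket URLs are placeholders):

import (
	"context"

	"gocloud.dev/blob"
	_ "gocloud.dev/blob/azureblob" // registers the azblob:// driver
	_ "gocloud.dev/blob/gcsblob"   // registers the gs:// driver
	_ "gocloud.dev/blob/s3blob"    // registers the s3:// driver (also used for MinIO)
)

func openBucket(ctx context.Context, url string) (*blob.Bucket, error) {
	// The URL scheme selects the provider, e.g. "s3://my-bucket?region=us-east-1",
	// "gs://my-bucket", or "azblob://my-container".
	return blob.OpenBucket(ctx, url)
}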

2. Implemented Streaming Throughout the Stack

  • Storage Layer: BlobObjectStore.GetFileReader() returns an io.ReadCloser for streaming
  • Resource Manager: Uses io.Copy(httpWriter, reader) to stream in chunks
  • HTTP Endpoint: Streams the response without buffering the entire content (sketched below)
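
A hedged sketch of that flow (function and parameter names approximate the PR's description, not its exact code):

// streamArtifact copies an artifact from blob storage to the HTTP response
// in small chunks; the full object is never held in memory.
func streamArtifact(ctx context.Context, bucket *blob.Bucket, key string, w http.ResponseWriter) error {
	reader, err := bucket.NewReader(ctx, key, nil) // streaming io.ReadCloser over the object
	if err != nil {
		return err
	}
	defer reader.Close()

	// io.Copy streams through a small internal buffer (32 KiB by default),
	// so memory use stays constant regardless of artifact size.
	_, err = io.Copy(w, reader)
	return err
}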

3. Refactored Client Management

  • Consolidated blob storage initialization in client_manager.go
  • Improved credential handling for both environment variables and Kubernetes secrets
  • Added support for MinIO path-style URLs via use_path_style=true (example below)
  • Reused the existing Kubernetes client to avoid redundant connections
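
For illustration, a MinIO bucket URL for the s3blob driver might look like the following; the endpoint host and bucket name are placeholders, and use_path_style=true forces the path-style addressing MinIO expects:

const bucketURL = "s3://mlpipeline?endpoint=http://minio-service.kubeflow:9000&region=us-east-1&use_path_style=true"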

4. Improved Code Organization

  • Extracted TokenRefresher into a dedicated package
  • Introduced artifact.Client for better abstraction (a hypothetical sketch follows this list)
  • Removed the deprecated ReadArtifact functions
  • Cleaned up the redundant MinIO implementation
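
The PR summary doesn't reproduce the artifact.Client interface; as a purely hypothetical sketch, such an abstraction could be as small as:

// Hypothetical method set, for illustration only; not the PR's actual interface.
type Client interface {
	// GetFileReader returns a streaming reader for the artifact at key;
	// the caller is responsible for closing it.
	GetFileReader(ctx context.Context, key string) (io.ReadCloser, error)
}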

Key Changes

Core Implementation

  • backend/src/apiserver/storage/blob_object_store.go: New streaming-capable storage implementation
  • backend/src/apiserver/resource/resource_manager.go: Updated StreamArtifact to use streaming reader
  • backend/src/apiserver/server/run_artifact_server.go: HTTP endpoint now streams responses
  • backend/src/apiserver/client_manager/client_manager.go: Refactored blob storage initialization

Testing

  • Added streaming tests (see the sketch after this list):
    • TestBlobObjectStore_GetFileReader_ChunkedStreaming: Validates chunked reading mechanics
    • TestBlobObjectStore_Streaming_1GB_File: Proves 1 GB files stream without OOM
    • TestStreamArtifactV1_ChunkedResponse: Validates that the HTTP endpoint streams in chunks
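
As a rough sketch of how a chunked-streaming test can be structured, using gocloud's in-memory memblob driver (this is not the PR's actual test code; the test name and sizes are illustrative):

import (
	"bytes"
	"context"
	"io"
	"testing"

	"gocloud.dev/blob/memblob"
)

func TestStreaming_ChunkedRead(t *testing.T) {
	ctx := context.Background()
	bucket := memblob.OpenBucket(nil)
	defer bucket.Close()

	// Write a payload larger than any single read buffer.
	payload := bytes.Repeat([]byte("x"), 1<<20) // 1 MiB
	if err := bucket.WriteAll(ctx, "artifact", payload, nil); err != nil {
		t.Fatal(err)
	}

	reader, err := bucket.NewReader(ctx, "artifact", nil)
	if err != nil {
		t.Fatal(err)
	}
	defer reader.Close()

	// Read in fixed 64 KiB chunks, never buffering the whole payload.
	buf := make([]byte, 64*1024)
	total := 0
	for {
		n, err := reader.Read(buf)
		total += n
		if err == io.EOF {
			break
		}
		if err != nil {
			t.Fatal(err)
		}
	}
	if total != len(payload) {
		t.Fatalf("read %d bytes, want %d", total, len(payload))
	}
}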

Testing Instructions

Run Unit Tests

# Storage layer streaming tests
go test -v -run "TestBlobObjectStore.*Streaming" ./backend/src/apiserver/storage/

# HTTP endpoint streaming test  
go test -v -run TestStreamArtifactV1_ChunkedResponse ./backend/src/apiserver/server/


@google-oss-prow (bot):

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@google-oss-prow (bot):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from hbelmiro. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot requested a review from HumairAK on October 27, 2025 at 20:44
@hbelmiro hbelmiro force-pushed the large-objects branch 11 times, most recently from bfa78f5 to 7bdea1b, on November 3, 2025 at 20:40
@hbelmiro hbelmiro changed the title from "WIP - memory optimization" to "WIP - Implemented HTTP Artifact Streaming to Prevent OOM Errors with Large Files" on Nov 4, 2025
@hbelmiro hbelmiro force-pushed the large-objects branch 3 times, most recently from f8e1d02 to 41597e0, on November 4, 2025 at 14:54
@hbelmiro hbelmiro marked this pull request as ready for review on November 4, 2025 at 16:59
@hbelmiro hbelmiro changed the title from "WIP - Implemented HTTP Artifact Streaming to Prevent OOM Errors with Large Files" to "Implemented HTTP Artifact Streaming to Prevent OOM Errors with Large Files" on Nov 4, 2025
@hbelmiro (Contributor, Author):

/retest

Comment on lines 960 to 964
if shouldEnsureObjectBucket(bucketName, config) {
	if err := ensureMinioBucketExists(ctx, config, bucketName, blobConfig.accessKey, blobConfig.secretKey, k8sClient); err != nil {
		glog.Warningf("Failed to ensure MinIO bucket exists (may already exist): %v", err)
	}
}
Collaborator:

Most gocloud blob drivers (s3, gcs, azure) will throw an error at the blob.OpenBucket() stage, so I think this check is unnecessary.

Contributor (Author):

Removed.

Collaborator:

ensureBucketExists still exists; I think we can remove this too. Again, OpenBucket should hit an error if the bucket doesn't exist on most common drivers.

Contributor (Author):

Oh, what I removed was actually shouldEnsureObjectBucket.
But anyway, blob.OpenBucket() succeeds even when the bucket doesn't exist. The error only appears when writing.
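
For context, a minimal illustration of that lazy behavior (not from the PR; the bucket URL is a placeholder):

// OpenBucket is lazy for most drivers: it returns without contacting the
// backend, so a missing bucket is only reported on first I/O.
bucket, err := blob.OpenBucket(ctx, "s3://nonexistent-bucket?region=us-east-1")
// err is typically nil here even if the bucket does not exist.
err = bucket.WriteAll(ctx, "key", []byte("data"), nil)
// The "bucket not found" error surfaces here, at write time.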

Simplified and modularized logic by moving bucket existence check (for MinIO and S3) into a dedicated `shouldEnsureObjectBucket` function.

Signed-off-by: Helber Belmiro <[email protected]>
…s for clarity

Simplified handling of the default region by directly setting "us-east-1" where needed. Updated comments to improve readability and provide context.

Signed-off-by: Helber Belmiro <[email protected]>
Added JSON error response validation and introduced cases for missing all parameters in `StreamArtifactV1` tests.

Signed-off-by: Helber Belmiro <[email protected]>
…sed `useDirectBucket` logic

Signed-off-by: Helber Belmiro <[email protected]>
…bernetes client dependency for MinIO config

Signed-off-by: Helber Belmiro <[email protected]>
…` package to `artifactclient`

Signed-off-by: Helber Belmiro <[email protected]>
Comment on lines +1005 to +1010
if err := os.Setenv("AWS_ACCESS_KEY_ID", accessKey); err != nil {
	return nil, fmt.Errorf("failed to set AWS_ACCESS_KEY_ID: %w", err)
}
if err := os.Setenv("AWS_SECRET_ACCESS_KEY", secretKey); err != nil {
	return nil, fmt.Errorf("failed to set AWS_SECRET_ACCESS_KEY: %w", err)
}
Collaborator:

This is a bit too passive; can we use a more declarative approach instead of passing credentials through the environment and passively consuming them?

For example, you can do something like the following (AI-generated, untested):

// Create AWS config with explicit credentials (no environment variables).
// Imports assumed:
//   awsv2 "github.com/aws/aws-sdk-go-v2/aws"
//   awsv2cfg "github.com/aws/aws-sdk-go-v2/config"
//   "github.com/aws/aws-sdk-go-v2/credentials"
//   "github.com/aws/aws-sdk-go-v2/service/s3"
//   "gocloud.dev/blob/s3blob"
cfg, err := awsv2cfg.LoadDefaultConfig(ctx,
	awsv2cfg.WithRegion(config.region),
	awsv2cfg.WithCredentialsProvider(credentials.NewStaticCredentialsProvider(
		config.accessKey,
		config.secretKey,
		"", // session token (empty for static credentials)
	)),
)
if err != nil {
	return fmt.Errorf("failed to create AWS config: %w", err)
}

// Create an S3 client with a custom endpoint.
endpointWithProtocol := ensureProtocol(config.endpoint, config.secure)
s3Client := s3.NewFromConfig(cfg, func(o *s3.Options) {
	o.BaseEndpoint = awsv2.String(endpointWithProtocol)
	o.UsePathStyle = true // important for MinIO and other S3-compatible storage
})

// Open the bucket using s3blob with the explicit client.
bucket, err = s3blob.OpenBucketV2(ctx, s3Client, config.bucketName, nil)
return err

We already import the AWS SDK in go.mod, so it's available to use.

Comment on lines +372 to +374
runArtifactServer := server.NewRunArtifactServer(resourceManager)
topMux.HandleFunc("/apis/v1beta1/runs/{run_id}/nodes/{node_id}/artifacts/{artifact_name}:read", runArtifactServer.ReadArtifactV1)
topMux.HandleFunc("/apis/v2beta1/runs/{run_id}/nodes/{node_id}/artifacts/{artifact_name}:read", runArtifactServer.ReadArtifact)
Collaborator:

The original gRPC service supported only the GET method, but right now this endpoint looks like it's served on all methods. Can you explicitly restrict it to GET?

// Artifact reading endpoints (implemented with streaming for memory efficiency)
runArtifactServer := server.NewRunArtifactServer(resourceManager)
topMux.HandleFunc("/apis/v1beta1/runs/{run_id}/nodes/{node_id}/artifacts/{artifact_name}:read", runArtifactServer.ReadArtifactV1).Methods("GET")
topMux.HandleFunc("/apis/v2beta1/runs/{run_id}/nodes/{node_id}/artifacts/{artifact_name}:read", runArtifactServer.ReadArtifact).Methods("GET")

Collaborator:

Can you also add an integration test to ensure this endpoint continues to work as intended? I think a new root context for "Verifying Read Artifact" would be good. This will also be useful in helping us maintain this endpoint for the long-term support of 2.5 - wdyt?

cc @nsingla would this be the right place for it?

Contributor:

I think I need more context: what is this endpoint used for, and where does it get used?
Any idea if the UI is using it?

And our e2e tests are at a higher level, so we won't be able to test the endpoint directly; for direct testing, we will have to add a new test to the API tests. If no front-end test exists, then we should at least verify that manually (and if it has an impact on the Dashboard UI in ODH/RHOAI, then we need to create a ticket for the dashboard team to address it, with a timeline coinciding with the rebase of upstream to midstream).
