
Commit 5c98832

yyassi-heartex, robot-ci-heartex, and makseq authored

feat: BROS-353: Databricks storage integration (#8245) (#8542)

Co-authored-by: robot-ci-heartex <[email protected]>
Co-authored-by: makseq <[email protected]>
1 parent 0a157ae commit 5c98832

File tree

8 files changed: +300 -162 lines changed

.cursor/rules/storage-provider.mdc

Lines changed: 124 additions & 131 deletions
Large diffs are not rendered by default.

docs/source/guide/storage.md

Lines changed: 83 additions & 1 deletion
@@ -19,6 +19,8 @@ Set up the following cloud and other storage systems with Label Studio:
- [Microsoft Azure Blob storage](#Microsoft-Azure-Blob-storage)
- [Redis database](#Redis-database)
- [Local storage](#Local-storage) <div class="enterprise-only">(for On-prem only)</div>
+- [Databricks Files (UC Volumes)](#Databricks-Files-UC-Volumes)
+


## Troubleshooting

@@ -43,6 +45,7 @@ For more troubleshooting information, see [Troubleshooting Import, Export, & Sto

</div>

+
## How external storage connections and sync work

You can add source storage connections to sync data from an external source to a Label Studio project, and add target storage connections to sync annotations from Label Studio to external storage. Each source and target storage setup is project-specific. You can connect multiple buckets, containers, databases, or directories as source or target storage for a project.
@@ -1483,7 +1486,86 @@ You can also create a storage connection using the Label Studio API.
If you're using Label Studio in Docker, you need to mount the local directory that you want to access as a volume when you start the Docker container. See [Run Label Studio on Docker and use local storage](https://labelstud.io/guide/start#Run-Label-Studio-on-Docker-and-use-Local-Storage).


-### Troubleshooting cloud storage
+
+## Databricks Files (UC Volumes)
+
+<div class="enterprise-only">
+
+Connect Label Studio Enterprise to Databricks Unity Catalog (UC) Volumes to import files as tasks and export annotations as JSON back to your volumes. This connector uses the Databricks Files API and operates only in proxy mode (Databricks does not support presigned URLs).
+
+### Prerequisites
+- A Databricks workspace URL (Workspace Host), for example `https://adb-12345678901234.1.databricks.com` (or the equivalent Azure Databricks domain)
+- A Databricks Personal Access Token (PAT) with permission to access the Files API
+- A UC Volume path under `/Volumes/<catalog>/<schema>/<volume>` with files you want to label
+
+References:
+- Databricks workspace: https://docs.databricks.com/en/getting-started/index.html
+- Personal access tokens: https://docs.databricks.com/en/dev-tools/auth/pat.html
+- Unity Catalog and Volumes: https://docs.databricks.com/en/files/volumes.html
+
+### Set up connection in the Label Studio UI
+1. Open Label Studio → project → **Settings > Cloud Storage**.
+2. Click **Add Source Storage** and select **Databricks Files (UC Volumes)**.
+3. Configure the connection:
+   - Workspace Host: your Databricks workspace base URL (no trailing slash)
+   - Access Token: your PAT
+   - Catalog / Schema / Volume: your Unity Catalog coordinates
+   - Click **Next** to open Import Settings & Preview
+4. Import Settings & Preview:
+   - Bucket Prefix (optional): a relative subpath under the volume (e.g., `images/train`)
+   - File Name Filter (optional): a regex to filter files (e.g., `.*\.json$`)
+   - Scan all sub-folders: enable for recursive listing; disable to list only the current folder
+   - Click **Load preview** to verify files
+5. Click **Save** (or **Save & Sync**) to create the connection and sync tasks.
+
+### Target storage (export)
+1. Open **Settings > Cloud Storage** → **Add Target Storage** → **Databricks Files (UC Volumes)**.
+2. Use the same Workspace Host/Token and UC coordinates.
+3. Set an Export Prefix (e.g., `exports/${project_id}`).
+4. Click **Save** and then **Sync** to push annotations as JSON files to your volume.
+
+!!! note "URI scheme"
+    To reference Databricks files directly in task JSON (without using an Import Storage), use Label Studio's Databricks URI scheme:
+
+    `dbx://Volumes/<catalog>/<schema>/<volume>/<path>`
+
+    Example:
+
+    ```
+    { "image": "dbx://Volumes/main/default/dataset/images/1.jpg" }
+    ```
+
+!!! note "Troubleshooting"
+    - If listing returns zero files, verify the path under `/Volumes/<catalog>/<schema>/<volume>/<prefix?>` and your PAT permissions.
+    - Ensure the Workspace Host has no trailing slash and matches your workspace domain.
+    - If previews work but media fails to load, confirm that proxy mode is allowed for your organization in Label Studio and that network egress allows Label Studio to reach Databricks.
+
+!!! warning "Proxy and security"
+    This connector streams data **through the Label Studio backend** with HTTP Range support. Because Databricks does not support presigned URLs, presigned delivery is not available for this storage type.
+
+</div>
+
+<div class="opensource-only">
+
+### Use Databricks Files in Label Studio Enterprise
+
+Databricks Unity Catalog (UC) Volumes integration is available in Label Studio Enterprise. It lets you:
+
+- Import files directly from UC Volumes under `/Volumes/<catalog>/<schema>/<volume>`
+- Stream media securely via the platform proxy (no presigned URLs)
+- Export annotations back to your Databricks Volume as JSON
+
+Learn more and see the full setup guide in the Enterprise documentation: [Databricks Files (UC Volumes)](https://docs.humansignal.com/guide/storage#Databricks-Files-UC-Volumes). If your organization needs governed access to Databricks data with Unity Catalog, consider [Label Studio Enterprise](https://humansignal.com/).
+
+</div>
+
+
+## Troubleshooting cloud storage

<div class="opensource-only">
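The section above notes that storage connections can also be created via the Label Studio API. As a rough sketch only, assuming the Databricks connector follows the `/api/storages/<provider>` pattern used by the other providers (the endpoint name and payload field names below are assumptions, not confirmed by this diff):

```python
# Hypothetical sketch: create a Databricks Files source storage via the API.
# Endpoint name and payload fields are assumptions modeled on the other
# providers (e.g. /api/storages/s3); check the API reference for the
# actual contract.
import requests

LS_URL = "http://localhost:8080"            # your Label Studio instance
API_TOKEN = "your-label-studio-api-token"

payload = {
    "project": 1,
    "title": "UC Volume imports",
    "host": "https://adb-12345678901234.1.databricks.com",  # Workspace Host, no trailing slash
    "token": "dapi-...",                                    # Databricks PAT
    "catalog": "main",
    "schema": "default",
    "volume": "dataset",
    "prefix": "images/train",      # optional subpath under the volume
    "regex_filter": r".*\.jpg$",   # optional file name filter
}

resp = requests.post(
    f"{LS_URL}/api/storages/databricks",  # assumed endpoint name
    headers={"Authorization": f"Token {API_TOKEN}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())
```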

label_studio/io_storages/README.md

Lines changed: 83 additions & 27 deletions
@@ -1,21 +1,21 @@
# Cloud Storages

-There are 3 basic types of cloud storages:
+Cloud storage is used for importing tasks and exporting annotations in Label Studio. There are 2 basic types of cloud storages:

1. Import Storages (aka Source Cloud Storages)
2. Export Storages (aka Target Cloud Storages)
-3. Dataset Storages (available in enterprise)

Label Studio also has Persistent storage, where it stores export files, user avatars, and UI uploads. Do not confuse `Cloud Storages` with `Persistent Storage`: they have completely different codebases and tasks. Cloud Storages are implemented in `io_storages`, while Persistent Storage uses django-storages and is configured through Django settings environment variables (see `base.py`).

+Note: Dataset Storages were implemented in the enterprise codebase only. They are **deprecated and not used**.

+## Basic hierarchy

+This section uses GCS storage as an example; the same logic applies to the other storages.

-## Basic hierarchy
-
-### Import and Dataset Storages
+### Import Storages

-This diagram is based on Google Cloud Storage (GCS) and other storages are implemented the same way.
+This storage type is designed for importing tasks FROM cloud storage to Label Studio. This diagram is based on Google Cloud Storage (GCS), and other storages are implemented in the same way:

```mermaid
graph TD;
@@ -28,7 +28,7 @@ This diagram is based on Google Cloud Storage (GCS) and other storages are imple
GCSImportStorageBase-->GCSImportStorage;
GCSImportStorageBase-->GCSDatasetStorage;

-DatasetStorageMixin-->GCSDatasetStorage;
+GCSImportStorageLink-->ImportStorageLink

subgraph Google Cloud Storage
GCSImportStorage;
@@ -37,7 +37,52 @@ This diagram is based on Google Cloud Storage (GCS) and other storages are imple
end
```

+- **Storage** (`label_studio/io_storages/base_models.py`): Abstract base for all storages. Inherits status/progress from `StorageInfo`. Defines the `validate_connection()` contract and common metadata fields.
+
+- **ImportStorage** (`label_studio/io_storages/base_models.py`): Abstract base for source storages. Defines the core contracts used by sync and proxy:
+  - `iter_objects()`, `iter_keys()` to enumerate objects
+  - `get_unified_metadata(obj)` to normalize provider metadata
+  - `get_data(key)` to produce `StorageObject`(s) for task creation
+  - `generate_http_url(url)` to resolve a provider URL -> HTTP URL (presigned or direct)
+  - `resolve_uri(...)` and `can_resolve_url(...)` used by the Storage Proxy
+  - `scan_and_create_links()` to create `ImportStorageLink`s for tasks
+
+- **ImportStorageLink** (`label_studio/io_storages/base_models.py`): Link model created per task for imported objects. Fields: `task` (1:1), `key` (external key), `row_group`/`row_index` (parquet/JSONL indices), `object_exists`, timestamps. Helpers: `n_tasks_linked(key, storage)` and `create(task, key, storage, row_index=None, row_group=None)`.
+
+- **ProjectStorageMixin** (`label_studio/io_storages/base_models.py`): Adds a `project` FK and permission checks. Used by project-scoped storages (e.g., `GCSImportStorage`).
+
+- **GCSImportStorageBase** (`label_studio/io_storages/gcs/models.py`): GCS-specific import base. Sets `url_scheme='gs'`, implements listing (`iter_objects`/`iter_keys`), data loading (`get_data`), URL generation (`generate_http_url`), URL resolution checks, and metadata helpers. Reused by both project imports and enterprise datasets.
+
+- **GCSImportStorage** (`label_studio/io_storages/gcs/models.py`): Concrete project-scoped GCS import storage combining `ProjectStorageMixin` + `GCSImportStorageBase`.

+- **GCSImportStorageLink** (`label_studio/io_storages/gcs/models.py`): Provider-specific `ImportStorageLink` with a `storage` FK to `GCSImportStorage`. Created during sync to associate a task with the original GCS object key.
+
+### Export Storages
+
+This storage type is designed for exporting tasks or annotations FROM Label Studio to cloud storage.
+
+```mermaid
+graph TD;
+
+Storage-->ExportStorage;
+
+ProjectStorageMixin-->ExportStorage;
+ExportStorage-->GCSExportStorage;
+GCSStorageMixin-->GCSExportStorage;
+
+ExportStorageLink-->GCSExportStorageLink;
+```
+
+- **ExportStorage** (`label_studio/io_storages/base_models.py`): Abstract base for target storages. Project-scoped; orchestrates export jobs and progress. Key methods:
+  - `save_annotation(annotation)`: provider-specific write
+  - `save_annotations(queryset)`, `save_all_annotations()`, `save_only_new_annotations()`: helpers
+  - `sync(save_only_new_annotations=False)`: background export via RQ
+
+- **GCSExportStorage** (`label_studio/io_storages/gcs/models.py`): Concrete target storage for GCS. Serializes data via `_get_serialized_data(...)`, computes the object key via `GCSExportStorageLink.get_key(...)`, and uploads to GCS; can auto-export on annotation save when configured.
+
+- **ExportStorageLink** (`label_studio/io_storages/base_models.py`): Base link model connecting exported objects to `Annotation`s. Provides the `get_key(annotation)` logic (task-based or annotation-based via feature flag) and a `create(...)` helper.
+
+- **GCSExportStorageLink** (`label_studio/io_storages/gcs/models.py`): Provider-specific link model holding a FK to `GCSExportStorage`.
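
Tying the classes above together, a hypothetical provider skeleton (method signatures are simplified sketches of the contracts listed above, and the import path is assumed from the file layout; this is not the real base-class code):

```python
# Hypothetical provider skeleton for the hierarchy described above.
# Simplified signatures; import path assumed from label_studio/io_storages/.
from io_storages.base_models import (
    ExportStorage,
    ImportStorage,
    ProjectStorageMixin,
)

class MyProviderImportStorage(ProjectStorageMixin, ImportStorage):
    url_scheme = 'myp'  # tasks reference media as myp://bucket/key

    def iter_objects(self):
        # enumerate provider objects under the configured prefix
        raise NotImplementedError

    def get_data(self, key):
        # return StorageObject(s) used to create tasks from this key
        raise NotImplementedError

    def generate_http_url(self, url):
        # resolve 'myp://bucket/key' to an HTTP URL (presigned or direct)
        raise NotImplementedError

class MyProviderExportStorage(ExportStorage):
    def save_annotation(self, annotation):
        # serialize one annotation and write it to the provider
        raise NotImplementedError
```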

## How validate_connection() works

@@ -50,32 +95,43 @@ Run this command with try/except:
Target storages use the same validate_connection() function, but without any prefix.


-## Google Cloud Storage (GCS)
+## Key Storage Insights
+
+### 1. **Primary Methods**
+- **Import storages**: `iter_objects()`, `get_data()`
+- **Export storages**: `save_annotation()`, `save_annotations()`
+
+### 2. **Automatic vs Manual Operation**
+- **Import storages**: Require manual sync via API calls or the UI
+- **Export storages**: Manual sync via API/UI, plus automatic export via Django signals when annotations are submitted or updated

-### Credentials
+### 3. **Connection Validation Differences**
+- **Import storages**: Must validate that the prefix contains files during `validate_connection()`
+- **Export storages**: Only validate bucket access, NOT the prefix (prefixes are created automatically)

-There are two methods for setting GCS credentials:
-1. Through the Project => Cloud Storage settings in the Label Studio user interface.
-2. Through Google Application Default Credentials (ADC). This involves the following steps:
+### 4. **Data Serialization**
+Export storages use `_get_serialized_data()`, which returns different formats based on feature flags:
+- **Default**: Only annotation data (backward compatibility)
+- **With `fflag_feat_optic_650_target_storage_task_format_long` or `FUTURE_SAVE_TASK_TO_STORAGE`**: Full task + annotations data instead of one annotation per output file

-2.1. Leave the Google Application Credentials field in the Label Studio UI blank.
-
-2.2. Set an environment variable which will apply to all Cloud Storages. This can be done using the following command:
-```bash
-export GOOGLE_APPLICATION_CREDENTIALS=google_credentials.json
-```
-2.3. Alternatively, use the following command:
-```bash
-gcloud auth application-default login
-```
-2.4. Another option is to use credentials provided by the Google App Engine or Google Compute Engine metadata server, if the code is running on either GAE or GCE.
+### 5. **Built-in Threading**
+- Export storages inherit `save_annotations()` with built-in parallel processing
+- Uses ThreadPoolExecutor with configurable `max_workers` (default: min(8, cpu_count * 4))
+- Includes progress tracking and automatic batch processing

-Note: If Cloud Storage credentials are set in the Label Studio UI, these will take precedence over other methods.
+### 6. **Storage Links & Key Generation**
+- **Import links**: Track task imports with custom keys
+- **Export links**: Track annotation exports with keys based on feature flags:
+  - Default: `annotation.id`
+  - With feature flag: `task.id` + optional `.json` extension
+
+### 7. **Optional Deletion Support**
+- Export storages can implement `delete_annotation()`
+- Controlled by the `can_delete_objects` field
+- Automatically called when annotations are deleted from Label Studio


-## Storage statuses and how they are processed
+## StorageInfo statuses and how they are processed

Storage (Import and Export) have different statuses of synchronization (see `class StorageInfo.Status`):
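Item 5 in the Key Storage Insights above is compact; here is an illustrative sketch of the parallel export pattern it describes (not the actual `save_annotations()` implementation):

```python
# Illustrative sketch of the threading pattern from "5. Built-in Threading"
# above; not the actual save_annotations() code.
import os
from concurrent.futures import ThreadPoolExecutor

def save_annotations_parallel(storage, annotations, max_workers=None):
    if max_workers is None:
        # the documented default: min(8, cpu_count * 4)
        max_workers = min(8, (os.cpu_count() or 1) * 4)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # each annotation is serialized and uploaded independently; the real
        # implementation also tracks progress and processes in batches
        list(pool.map(storage.save_annotation, annotations))
```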

@@ -94,7 +150,7 @@ Storage (Import and Export) have different statuses of synchronization (see `cla
InProgress-->Completed;
```

-Additionally, StorageInfo contains counters and debug information that will be displayed in storages:
+Additionally, class **StorageInfo** contains counters and debug information that will be displayed in storages:

* last_sync - time of the last successful sync
* last_sync_count - number of objects that were successfully synced
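
To make the counters concrete, a rough sketch of a sync job finishing successfully; the field names come from the list above, while the `Status` enum access and transition handling are assumptions:

```python
# Illustrative only: a sync job completing and updating StorageInfo fields.
# last_sync / last_sync_count are the counters listed above; the Status
# attribute access is an assumption about the real model.
from django.utils import timezone

def finish_sync(storage, synced_count):
    storage.status = storage.Status.COMPLETED   # InProgress -> Completed, per the diagram
    storage.last_sync = timezone.now()          # time of the last successful sync
    storage.last_sync_count = synced_count      # number of objects successfully synced
    storage.save(update_fields=['status', 'last_sync', 'last_sync_count'])
```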

label_studio/io_storages/api.py

Lines changed: 1 addition & 1 deletion
@@ -174,7 +174,7 @@ def create(self, request, *args, **kwargs):
from .functions import validate_storage_instance

instance = validate_storage_instance(request, self.serializer_class)
-limit = request.data.get('limit', settings.DEFAULT_STORAGE_LIST_LIMIT)
+limit = int(request.data.get('limit', settings.DEFAULT_STORAGE_LIST_LIMIT))

try:
    files = []
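
The cast matters because DRF's `request.data` can deliver `limit` as a string (for example, from a form-encoded payload), which breaks numeric use downstream. A standalone illustration of the fixed behavior:

```python
# Standalone illustration of the int() cast above; DEFAULT_STORAGE_LIST_LIMIT
# here is a stand-in for settings.DEFAULT_STORAGE_LIST_LIMIT.
DEFAULT_STORAGE_LIST_LIMIT = 100

def get_limit(data):
    return int(data.get('limit', DEFAULT_STORAGE_LIST_LIMIT))

assert get_limit({'limit': '30'}) == 30           # string payload is coerced
assert get_limit({'limit': 30}) == 30             # ints pass through unchanged
assert get_limit({}) == DEFAULT_STORAGE_LIST_LIMIT
```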

label_studio/io_storages/proxy_api.py

Lines changed: 6 additions & 1 deletion
@@ -199,7 +199,12 @@ def prepare_headers(self, response, metadata, request, project):
if metadata.get('ContentRange'):
    response.headers['Content-Range'] = metadata['ContentRange']
if metadata.get('LastModified'):
-    response.headers['Last-Modified'] = metadata['LastModified'].strftime('%a, %d %b %Y %H:%M:%S GMT')
+    last_mod = metadata['LastModified']
+    # Accept either a datetime-like value (has strftime) or a preformatted string
+    if hasattr(last_mod, 'strftime'):
+        response.headers['Last-Modified'] = last_mod.strftime('%a, %d %b %Y %H:%M:%S GMT')
+    else:
+        response.headers['Last-Modified'] = str(last_mod)

# Always enable range requests
response.headers['Accept-Ranges'] = 'bytes'
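
The duck-typing in this hunk, shown in isolation: datetime-like values are formatted as HTTP dates, while strings already formatted by the provider pass through unchanged.

```python
# Standalone version of the Last-Modified handling above.
from datetime import datetime, timezone

def http_last_modified(value):
    if hasattr(value, 'strftime'):  # datetime-like
        return value.strftime('%a, %d %b %Y %H:%M:%S GMT')
    return str(value)               # already a preformatted string

print(http_last_modified(datetime(2025, 1, 2, tzinfo=timezone.utc)))
# Thu, 02 Jan 2025 00:00:00 GMT
print(http_last_modified('Thu, 02 Jan 2025 00:00:00 GMT'))  # unchanged
```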

web/libs/app-common/src/blocks/StorageProviderForm/hooks/useStorageApi.ts

Lines changed: 1 addition & 1 deletion
@@ -187,11 +187,11 @@ export const useStorageApi = ({ target, storage, project, onSubmit, onClose }: U
if (isDefined(storage?.id)) {
  body.id = storage.id;
+  body.limit = 30;
}

return api.callApi<{ files: any[] }>("storageFiles", {
  params: {
-    limit: 10,
    target,
    type: previewData.provider,
  },
Lines changed: 1 addition & 0 deletions

web/libs/ui/src/assets/icons/index.ts

Lines changed: 1 addition & 0 deletions
@@ -261,3 +261,4 @@ export { ReactComponent as IconCloudProviderS3 } from "./cloud-provider-s3.svg";
export { ReactComponent as IconCloudProviderRedis } from "./cloud-provider-redis.svg";
export { ReactComponent as IconCloudProviderGCS } from "./cloud-provider-gcs.svg";
export { ReactComponent as IconCloudProviderAzure } from "./cloud-provider-azure.svg";
+export { ReactComponent as IconCloudProviderDatabricks } from "./cloud-provider-databricks.svg";
