@@ -43,6 +45,7 @@ For more troubleshooting information, see [Troubleshooting Import, Export, & Sto
</div>
## How external storage connections and sync work
You can add source storage connections to sync data from an external source to a Label Studio project, and add target storage connections to sync annotations from Label Studio to external storage. Each source and target storage setup is project-specific. You can connect multiple buckets, containers, databases, or directories as source or target storage for a project.
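A storage connection can also be created programmatically through the Label Studio REST API. The sketch below builds the request body for an S3 source storage; the endpoint paths follow the public API, while the URL, token, project ID, and bucket name are placeholders you would replace with your own:

```python
import json

# Placeholders -- replace with your own deployment details.
LABEL_STUDIO_URL = "http://localhost:8080"
API_TOKEN = "your-api-token"

def build_s3_source_payload(project_id, bucket, prefix=""):
    """Build the request body for POST /api/storages/s3 (S3 source storage)."""
    return {
        "project": project_id,
        "bucket": bucket,
        "prefix": prefix,
        "use_blob_urls": True,  # treat each object as a task data URL
        "title": f"s3://{bucket}/{prefix}",
    }

payload = build_s3_source_payload(1, "my-bucket", "images/")

# To actually create and then sync the storage (requires the `requests` package):
# import requests
# headers = {"Authorization": f"Token {API_TOKEN}"}
# resp = requests.post(f"{LABEL_STUDIO_URL}/api/storages/s3", headers=headers, json=payload)
# storage_id = resp.json()["id"]
# requests.post(f"{LABEL_STUDIO_URL}/api/storages/s3/{storage_id}/sync", headers=headers)

print(json.dumps(payload))
```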
@@ -1483,7 +1486,86 @@ You can also create a storage connection using the Label Studio API.
If you're using Label Studio in Docker, you need to mount the local directory that you want to access as a volume when you start the Docker container. See [Run Label Studio on Docker and use local storage](https://labelstud.io/guide/start#Run-Label-Studio-on-Docker-and-use-Local-Storage).
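A typical invocation looks like the following; the host path and image tag are examples to adjust for your setup:

```shell
# Mount a host directory into the container and enable local file serving.
docker run -it -p 8080:8080 \
  -v "$(pwd)/mydata:/label-studio/data" \
  -e LABEL_STUDIO_LOCAL_FILES_SERVING_ENABLED=true \
  -e LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT=/label-studio/data \
  heartexlabs/label-studio:latest
```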
## Databricks Files (UC Volumes)
<div class="enterprise-only">
Connect Label Studio Enterprise to Databricks Unity Catalog (UC) Volumes to import files as tasks and export annotations as JSON back to your volumes. This connector uses the Databricks Files API and operates only in proxy mode (no presigned URLs are supported by Databricks).
### Prerequisites
- A Databricks workspace URL (Workspace Host), for example `https://adb-12345678901234.1.databricks.com` (or Azure domain)
- A Databricks Personal Access Token (PAT) with permission to access the Files API
- A UC Volume path under `/Volumes/<catalog>/<schema>/<volume>` with files you want to label
### Troubleshooting

- If listing returns zero files, verify the path under `/Volumes/<catalog>/<schema>/<volume>/<prefix?>` and your PAT permissions.
- Ensure the Workspace Host has no trailing slash and matches your workspace domain.
- If previews work but media fails to load, confirm proxy mode is allowed for your organization in Label Studio and network egress allows Label Studio to reach Databricks.
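To rule out path or permission problems, you can query the Databricks Files API directly. This sketch assumes the standard directory-listing endpoint (`GET /api/2.0/fs/directories{path}`); the host, token, and volume path are placeholders:

```python
# Placeholders -- substitute your workspace host, PAT, and volume path.
WORKSPACE_HOST = "https://adb-12345678901234.1.databricks.com"  # no trailing slash
VOLUME_PATH = "/Volumes/my_catalog/my_schema/my_volume"

def build_list_request(host, token, directory_path):
    """Build the URL and headers for a Files API directory listing."""
    url = f"{host.rstrip('/')}/api/2.0/fs/directories{directory_path}"
    headers = {"Authorization": f"Bearer {token}"}
    return url, headers

url, headers = build_list_request(WORKSPACE_HOST, "dapi-example-token", VOLUME_PATH)

# With `requests` installed, list the directory to verify access:
# import requests
# resp = requests.get(url, headers=headers)
# resp.raise_for_status()
# print([entry["path"] for entry in resp.json().get("contents", [])])
```

An empty `contents` list with a 200 response usually means the prefix is wrong; a 403 points at PAT permissions.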
!!! warning "Proxy and security"
    This connector streams data **through the Label Studio backend** with HTTP Range support. Because Databricks does not support presigned URLs, presigned-URL delivery is not available for this connector in Label Studio.
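To illustrate what proxying with Range support means, the backend forwards byte-range requests like the following. This is a simplified sketch, not the actual proxy code:

```python
def range_header(start, end=None):
    """Build an HTTP Range header for a partial-content request."""
    return {"Range": f"bytes={start}-{'' if end is None else end}"}

# Request the first megabyte of a media file; a server that honors Range
# replies with status 206 (Partial Content) and a Content-Range header.
headers = range_header(0, 1024 * 1024 - 1)
```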
</div>
<div class="opensource-only">
### Use Databricks Files in Label Studio Enterprise
Databricks Unity Catalog (UC) Volumes integration is available in Label Studio Enterprise. It lets you:
- Import files directly from UC Volumes under `/Volumes/<catalog>/<schema>/<volume>`
- Stream media securely via the platform proxy (no presigned URLs)
- Export annotations back to your Databricks Volume as JSON
Learn more and see the full setup guide in the Enterprise documentation: [Databricks Files (UC Volumes)](https://docs.humansignal.com/guide/storage#Databricks-Files-UC-Volumes). If your organization needs governed access to Databricks data with Unity Catalog, consider [Label Studio Enterprise](https://humansignal.com/).
`label_studio/io_storages/README.md` (83 additions, 27 deletions)
@@ -1,21 +1,21 @@
# Cloud Storages
Cloud storage is used for importing tasks and exporting annotations in Label Studio. There are 2 basic types of cloud storages:
1. Import Storages (aka Source Cloud Storages)
2. Export Storages (aka Target Cloud Storages)
Label Studio also has Persistent Storage, where it stores export files, user avatars, and UI uploads. Do not confuse `Cloud Storages` with `Persistent Storage`: they have completely different codebases and purposes. Cloud Storages are implemented in `io_storages`, while Persistent Storage uses django-storages and is configured through environment variables in the Django settings (see `base.py`).
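For example, persistent storage backed by S3 is configured with environment variables along these lines (variable names follow the Label Studio persistent-storage docs; all values are placeholders):

```shell
# Example: back persistent storage (exports, avatars, uploads) with S3.
export STORAGE_TYPE=s3
export STORAGE_AWS_ACCESS_KEY_ID="your-access-key"
export STORAGE_AWS_SECRET_ACCESS_KEY="your-secret-key"
export STORAGE_AWS_BUCKET_NAME="your-bucket"
export STORAGE_AWS_REGION_NAME="us-east-1"
```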
Note: Dataset Storages were implemented in the enterprise codebase only. They are **deprecated and not used**.
## Basic hierarchy
This section uses GCS storage as an example, and the same logic can be applied to other storages.
### Import Storages
This storage type is designed for importing tasks FROM cloud storage to Label Studio. This diagram is based on Google Cloud Storage (GCS), and other storages are implemented in the same way:
```mermaid
graph TD;
@@ -28,7 +28,7 @@ This diagram is based on Google Cloud Storage (GCS) and other storages are imple
GCSImportStorageBase-->GCSImportStorage;
GCSImportStorageBase-->GCSDatasetStorage;
GCSImportStorageLink-->ImportStorageLink;
subgraph Google Cloud Storage
GCSImportStorage;
@@ -37,7 +37,52 @@ This diagram is based on Google Cloud Storage (GCS) and other storages are imple
end
```
- **Storage** (`label_studio/io_storages/base_models.py`): Abstract base for all storages. Inherits status/progress from `StorageInfo`. Defines the `validate_connection()` contract and common metadata fields.
- **ImportStorage** (`label_studio/io_storages/base_models.py`): Abstract base for source storages. Defines core contracts used by sync and proxy:
  - `iter_objects()`, `iter_keys()` to enumerate objects
  - `get_unified_metadata(obj)` to normalize provider metadata
  - `get_data(key)` to produce `StorageObject`(s) for task creation
  - `generate_http_url(url)` to resolve a provider URL to an HTTP URL (presigned or direct)
  - `resolve_uri(...)` and `can_resolve_url(...)` used by the Storage Proxy
  - `scan_and_create_links()` to create `ImportStorageLink`s for tasks
- **ImportStorageLink** (`label_studio/io_storages/base_models.py`): Link model created per task for imported objects. Fields: `task` (1:1), `key` (external key), `row_group`/`row_index` (parquet/JSONL indices), `object_exists`, timestamps. Helpers: `n_tasks_linked(key, storage)` and `create(task, key, storage, row_index=None, row_group=None)`.
- **ProjectStorageMixin** (`label_studio/io_storages/base_models.py`): Adds a `project` FK and permission checks. Used by project-scoped storages (e.g., `GCSImportStorage`).
- **GCSImportStorageBase** (`label_studio/io_storages/gcs/models.py`): GCS-specific import base. Sets `url_scheme='gs'`, implements listing (`iter_objects`/`iter_keys`), data loading (`get_data`), URL generation (`generate_http_url`), URL resolution checks, and metadata helpers. Reused by both project imports and enterprise datasets.
- **GCSImportStorageLink** (`label_studio/io_storages/gcs/models.py`): Provider-specific `ImportStorageLink` with a `storage` FK to `GCSImportStorage`. Created during sync to associate a task with the original GCS object key.
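The contracts above can be sketched schematically. These are toy stand-ins for illustration only, not the real classes; the actual implementations live in `base_models.py` and `gcs/models.py`:

```python
from abc import ABC, abstractmethod

class ImportStorage(ABC):
    """Simplified stand-in for io_storages.base_models.ImportStorage."""

    @abstractmethod
    def iter_keys(self):
        """Enumerate object keys in the external storage."""

    @abstractmethod
    def get_data(self, key):
        """Load one object and return task data for it."""

    def scan_and_create_links(self):
        """Create one link (task association) per discovered key."""
        return [ImportStorageLink(key=k, task=self.get_data(k)) for k in self.iter_keys()]

class ImportStorageLink:
    """Toy per-task link, standing in for the ImportStorageLink model."""
    def __init__(self, key, task):
        self.key, self.task = key, task

class DictImportStorage(ImportStorage):
    """Toy 'bucket' backed by a dict, standing in for GCSImportStorage."""
    def __init__(self, objects):
        self.objects = objects
    def iter_keys(self):
        return iter(self.objects)
    def get_data(self, key):
        return {"data": {"image": f"gs://bucket/{key}"}}

links = DictImportStorage({"a.jpg": b"...", "b.jpg": b"..."}).scan_and_create_links()
```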
### Export Storages
This storage type is designed for exporting tasks or annotations FROM Label Studio to cloud storage.
```mermaid
graph TD;
Storage-->ExportStorage;
ProjectStorageMixin-->ExportStorage;
ExportStorage-->GCSExportStorage;
GCSStorageMixin-->GCSExportStorage;
ExportStorageLink-->GCSExportStorageLink;
```
- **ExportStorage** (`label_studio/io_storages/base_models.py`): Abstract base for target storages. Project-scoped; orchestrates export jobs and progress. Key methods:
  - `sync(save_only_new_annotations=False)`: background export via RQ
- **GCSExportStorage** (`label_studio/io_storages/gcs/models.py`): Concrete target storage for GCS. Serializes data via `_get_serialized_data(...)`, computes the key via `GCSExportStorageLink.get_key(...)`, and uploads to GCS; can auto-export on annotation save when configured.
- **ExportStorageLink** (`label_studio/io_storages/base_models.py`): Base link model connecting exported objects to `Annotation`s. Provides the `get_key(annotation)` logic (task-based or annotation-based via feature flag) and a `create(...)` helper.
- **GCSExportStorageLink** (`label_studio/io_storages/gcs/models.py`): Provider-specific link model holding a FK to `GCSExportStorage`.
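A minimal sketch of that export flow, using toy objects in place of the real models (the feature flag, key scheme, and upload target are illustrative):

```python
import json

FFLAG_ANNOTATION_BASED_KEY = False  # stand-in for the real feature flag

def get_key(annotation):
    """Toy version of ExportStorageLink.get_key(): task-based or
    annotation-based depending on a feature flag."""
    return str(annotation["id"] if FFLAG_ANNOTATION_BASED_KEY else annotation["task_id"])

def export_annotation(bucket, prefix, annotation):
    """Serialize an annotation and 'upload' it under prefix + key."""
    key = f"{prefix}{get_key(annotation)}"
    bucket[key] = json.dumps(annotation)  # dict write stands in for the GCS upload
    return key

bucket = {}
key = export_annotation(bucket, "annotations/", {"id": 7, "task_id": 42, "result": []})
```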
## How validate_connection() works
@@ -50,32 +95,43 @@ Run this command with try/except:
Target storages use the same `validate_connection()` function, but without any prefix.
- **Import storages**: Require manual sync via API calls or the UI
- **Export storages**: Support manual sync via API/UI, plus automatic sync via Django signals when annotations are submitted or updated
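The automatic path can be pictured as a signal hookup. In the real codebase this uses Django's `post_save` signals; the sketch below imitates that mechanism without Django:

```python
class Signal:
    """Minimal stand-in for a Django signal."""
    def __init__(self):
        self.receivers = []
    def connect(self, receiver):
        self.receivers.append(receiver)
    def send(self, instance):
        for receiver in self.receivers:
            receiver(instance)

annotation_saved = Signal()  # stands in for post_save on Annotation
exported = []

def auto_export(annotation):
    """Receiver: push the annotation to target storage on every save."""
    exported.append(annotation["id"])

annotation_saved.connect(auto_export)
annotation_saved.send({"id": 1})  # submitting an annotation triggers export
annotation_saved.send({"id": 2})  # updating one triggers export again
```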
### 3. **Connection Validation Differences**

- **Import storages**: Must validate that the prefix contains files during `validate_connection()`
- **Export storages**: Only validate bucket access, NOT the prefix (prefixes are created automatically)
### 4. **Data Serialization**

Export storages use `_get_serialized_data()`, which returns different formats based on feature flags:

- **Default**: Only annotation data (backward compatibility)
- **With `fflag_feat_optic_650_target_storage_task_format_long` or `FUTURE_SAVE_TASK_TO_STORAGE`**: Full task + annotations data instead of one annotation per file
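The two output shapes can be illustrated as follows; the field names are illustrative, and the real switching logic lives in `_get_serialized_data()`:

```python
def get_serialized_data(annotation, task, full_task_format=False):
    """Return annotation-only output (default) or full task + annotations,
    mirroring the feature-flag switch described above."""
    if full_task_format:
        return {"id": task["id"], "data": task["data"], "annotations": [annotation]}
    return annotation  # backward-compatible: one annotation per file

task = {"id": 42, "data": {"image": "gs://bucket/a.jpg"}}
ann = {"id": 7, "result": []}

legacy = get_serialized_data(ann, task)
full = get_serialized_data(ann, task, full_task_format=True)
```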
### 5. **Built-in Threading**
- Export storages inherit `save_annotations()` with built-in parallel processing
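A thread-pool version of that parallel upload loop might look like this (a sketch only; the real `save_annotations()` also manages batching and progress reporting):

```python
from concurrent.futures import ThreadPoolExecutor

def save_annotations(annotations, upload, max_workers=4):
    """Upload annotations in parallel; returns the uploaded keys in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(upload, annotations))

store = {}

def upload(ann):
    key = f"annotations/{ann['id']}"
    store[key] = ann  # dict write stands in for the cloud upload
    return key

keys = save_annotations([{"id": i} for i in range(5)], upload)
```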