Commit be82e81

Improve the errors
1 parent 2495195 commit be82e81

39 files changed: +1682, -4983 lines

README.md

Lines changed: 40 additions & 74 deletions
@@ -20,22 +20,26 @@ Kaggle Datasets for DuckDB
 
 Gaggle is a DuckDB extension that allows you to work with Kaggle datasets directly in SQL queries, as if
 they were DuckDB tables.
-It is written in Rust and uses the official Kaggle API to search, download, and manage datasets.
+It is written in Rust and uses the Kaggle API to search, download, and manage datasets.
 
-Kaggle hosts a large collection of very useful datasets for data science and machine learning work.
-Accessing these datasets typically involves multiple steps including manually downloading a dataset (as a ZIP file),
+Kaggle hosts a large collection of very useful datasets for data science and machine learning.
+Accessing these datasets typically involves manually downloading a dataset (as a ZIP file),
 extracting it, loading the files in the dataset into your data science environment, and managing storage and dataset
 updates, etc.
-This workflow can be a at time.
-Gaggle tries to help simplify this process by integrating Kaggle dataset access directly into DuckDB.
+This workflow can become complex, especially when working with multiple datasets or when datasets are updated
+frequently.
+Gaggle tries to help simplify this process by hiding the complexity and letting you work with datasets directly inside
+an analytical database like DuckDB that can handle fast queries.
+In essence, Gaggle makes DuckDB into a SQL-enabled frontend for Kaggle datasets.
 
 ### Features
 
-- Has a simple API (just a handful of SQL functions)
-- Allows you search, download, update, and delete Kaggle datasets directly from DuckDB
-- Supports datasets made of CSV, JSON, and Parquet files
+- Has a simple API to interact with Kaggle datasets from DuckDB
+- Allows you to search, download, and read datasets from Kaggle
+- Supports datasets that contain CSV, Parquet, JSON, and XLSX files (XLSX requires DuckDB's Excel reader to be available in your DuckDB build)
 - Configurable and has built-in caching support
-- Thread-safe and memory-efficient
+- Thread-safe, fast, and has a low memory footprint
+- Supports dataset versioning and update checks
 
 See the [ROADMAP.md](ROADMAP.md) for planned features and the [docs](docs) folder for detailed documentation.
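
The XLSX support noted in the features hunk above depends on DuckDB's Excel reader being present in the build. A rough sketch of what such a read would look like through the `kaggle:` replacement scan (the dataset and file name are hypothetical placeholders, and `-unsigned` assumes a locally built, unsigned extension):

```bash
# Hypothetical dataset/file; requires a DuckDB build that ships the Excel reader.
duckdb -unsigned -c "
load 'build/release/extension/gaggle/gaggle.duckdb_extension';
select * from 'kaggle:owner/some-dataset/sheet.xlsx' limit 5;
"
```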

@@ -88,105 +92,54 @@ make release
 #### Trying Gaggle
 
 ```sql
--- Load the Gaggle extension
-load 'build/release/extension/gaggle/gaggle.duckdb_extension';
+-- Load the Gaggle extension (only needed if you built from source)
+--load 'build/release/extension/gaggle/gaggle.duckdb_extension';
 
--- Set your Kaggle credentials (or use `~/.kaggle/kaggle.json`)
+-- Manually set your Kaggle credentials (or use `~/.kaggle/kaggle.json`)
 select gaggle_set_credentials('your-username', 'your-api-key');
 
 -- Get extension version
 select gaggle_version();
 
--- Prime cache by downloading the dataset locally (optional, but improves first-time performance)
-select gaggle_download('habedi/flickr-8k-dataset-clean');
-
 -- List files in the downloaded dataset
+-- (Note that if the dataset is not downloaded yet, it will be downloaded and cached first)
 select *
 from gaggle_ls('habedi/flickr-8k-dataset-clean') limit 5;
 
--- Read a Parquet file from local cache using a prepared statement (no subquery in function args)
+-- Read a Parquet file from local cache using a prepared statement
+-- (Note that DuckDB doesn't support subqueries in function arguments, so we use a prepared statement)
 prepare rp as select * from read_parquet(?) limit 10;
-execute rp(gaggle_file_paths('habedi/flickr-8k-dataset-clean', 'flickr8k.parquet'));
+execute rp(gaggle_file_path('habedi/flickr-8k-dataset-clean', 'flickr8k.parquet'));
 
--- Use the replacement scan to read directly via kaggle: URL
+-- Alternatively, we can use a replacement scan to read directly via the `kaggle:` prefix
 select count(*)
 from 'kaggle:habedi/flickr-8k-dataset-clean/flickr8k.parquet';
 
 -- Or glob Parquet files in a dataset directory
 select count(*)
 from 'kaggle:habedi/flickr-8k-dataset-clean/*.parquet';
 
--- Optionally, check cache info
+-- Optionally, we can check cache info
 select gaggle_cache_info();
 
--- Enforce cache size limit manually (automatic with soft limit by default)
+-- Clear cache and enforce cache size limit manually
+select gaggle_clear_cache();
 select gaggle_enforce_cache_limit();
 
--- Check if cached dataset is current
+-- Check if the cached dataset is current (i.e., the newest version)
 select gaggle_is_current('habedi/flickr-8k-dataset-clean');
 
 -- Force update to latest version if needed
--- select gaggle_update_dataset('habedi/flickr-8k-dataset-clean');
+--select gaggle_update_dataset('habedi/flickr-8k-dataset-clean');
 
 -- Download specific version (version pinning for reproducibility)
--- select gaggle_download('habedi/flickr-8k-dataset-clean@v2');
+--select gaggle_download('habedi/flickr-8k-dataset-clean@v2');
 ```
 
 [![Simple Demo 1](https://asciinema.org/a/745806.svg)](https://asciinema.org/a/745806)
 
 ---
 
-### Configuration
-
-Gaggle can be configured using environment variables:
-
-#### Cache Management
-
-```bash
-# Set cache size limit (default: 100GB = 102400 MB)
-export GAGGLE_CACHE_SIZE_LIMIT_MB=51200 # 50GB
-
-# Set unlimited cache
-export GAGGLE_CACHE_SIZE_LIMIT_MB=unlimited
-
-# Set cache directory (default: ~/.cache/gaggle_cache or platform-specific)
-export GAGGLE_CACHE_DIR=/path/to/cache
-
-# Enable hard limit mode (prevents downloads when limit reached, default: soft limit)
-export GAGGLE_CACHE_HARD_LIMIT=true
-```
-
-#### Network Configuration
-
-```bash
-# HTTP timeout in seconds (default: 30)
-export GAGGLE_HTTP_TIMEOUT=60
-
-# HTTP retry attempts (default: 3)
-export GAGGLE_HTTP_RETRY_ATTEMPTS=5
-
-# HTTP retry delay in milliseconds (default: 1000)
-export GAGGLE_HTTP_RETRY_DELAY_MS=500
-
-# HTTP retry max delay in milliseconds (default: 30000)
-export GAGGLE_HTTP_RETRY_MAX_DELAY_MS=60000
-```
-
-#### Authentication
-
-```bash
-# Kaggle API credentials (alternative to ~/.kaggle/kaggle.json)
-export KAGGLE_USERNAME=your-username
-export KAGGLE_KEY=your-api-key
-```
-
-> [!TIP]
-> **Soft Limit (Default):** Downloads complete even if they exceed the cache limit, then oldest datasets are automatically evicted using LRU (Least Recently Used) policy until under limit.
->
-> **Hard Limit:** Would prevent downloads when limit is reached (not yet fully implemented).
-
----
-
 ### Documentation
 
 Check out the [docs](docs/README.md) directory for the API documentation, how to build Gaggle from source, and more.
@@ -197,6 +150,19 @@ Check out the [examples](docs/examples) directory for SQL scripts that show how
 
 ---
 
+### Configuration
+
+See [CONFIGURATION.md](docs/CONFIGURATION.md) for full details. Main environment variables:
+
+- `GAGGLE_CACHE_DIR` — cache directory path (default: `~/.cache/gaggle`)
+- `GAGGLE_HTTP_TIMEOUT` — HTTP timeout (in seconds)
+- `GAGGLE_HTTP_RETRY_ATTEMPTS` — retry attempts after the initial try
+- `GAGGLE_HTTP_RETRY_DELAY_MS` — initial backoff delay (in milliseconds)
+- `GAGGLE_HTTP_RETRY_MAX_DELAY_MS` — maximum backoff delay cap (in milliseconds)
+- `GAGGLE_LOG_LEVEL` — structured log level for the Rust core (like `INFO` or `DEBUG`)
+- `GAGGLE_OFFLINE` — disable network; only use cached data (downloads fail fast if not cached)
+- `KAGGLE_USERNAME`, `KAGGLE_KEY` — Kaggle credentials (alternative to the SQL call)
+
 ### Contributing
 
 See [CONTRIBUTING.md](CONTRIBUTING.md) for details on how to make a contribution.
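
For example, the configuration variables listed in the hunk above can be set in the shell before starting DuckDB. A minimal sketch with illustrative values (see [CONFIGURATION.md](docs/CONFIGURATION.md) for defaults and accepted formats):

```bash
# Illustrative values; consult docs/CONFIGURATION.md for the authoritative list.
export GAGGLE_CACHE_DIR=/path/to/cache       # cache directory (default: ~/.cache/gaggle)
export GAGGLE_HTTP_TIMEOUT=60                # HTTP timeout, in seconds
export GAGGLE_HTTP_RETRY_ATTEMPTS=5          # retries after the initial try
export GAGGLE_HTTP_RETRY_DELAY_MS=500        # initial backoff delay, in milliseconds
export GAGGLE_HTTP_RETRY_MAX_DELAY_MS=60000  # maximum backoff delay cap, in milliseconds
export GAGGLE_LOG_LEVEL=INFO                 # structured log level for the Rust core

# Credentials, as an alternative to calling gaggle_set_credentials() in SQL
export KAGGLE_USERNAME=your-username
export KAGGLE_KEY=your-api-key

duckdb  # Gaggle reads these variables from the environment
```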

ROADMAP.md

Lines changed: 10 additions & 5 deletions
@@ -39,7 +39,7 @@ It outlines features to be implemented and their current status.
 * [x] CSV and TSV file reading.
 * [x] Parquet file reading.
 * [x] JSON file reading.
-* [ ] Excel and XLSX file reading.
+* [ ] Excel (XLSX) file reading. (Available when DuckDB is built with the Excel reader; replacement scan routes `.xlsx` to `read_excel`.)
 * **Querying Datasets**
 * [x] Replacement scan for `kaggle:` URLs.
 * [ ] Virtual table support for lazy loading.
@@ -49,7 +49,7 @@ It outlines features to be implemented and their current status.
 * **Concurrency Control**
 * [x] Thread-safe credential storage.
 * [x] Thread-safe cache access.
-* [x] Concurrent dataset downloads (with per-dataset serialization to prevent race conditions).
+* [x] Concurrent dataset downloads.
 * **Network Optimization**
 * [x] Configurable HTTP timeouts.
 * [x] Retry logic with backoff for failed requests.
@@ -63,11 +63,11 @@ It outlines features to be implemented and their current status.
 * [x] Clear error messages for invalid credentials.
 * [x] Clear error messages for missing datasets.
 * [x] Clear error messages for `NULL` inputs.
-* [ ] Detailed error codes for programmatic error handling.
+* [x] Detailed error codes for programmatic error handling.
 * **Resilience**
 * [x] Automatic retry on network failures.
 * [ ] Graceful degradation when Kaggle API is unavailable.
-* [ ] Local-only mode for cached datasets.
+* [x] Local-only mode for cached datasets (via `GAGGLE_OFFLINE`).
 
 ### 6. Documentation and Distribution
 
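
The error-code item above is marked done by this commit, but the diff does not show the codes' shape, so the sketch below only demonstrates where such an error would surface (the dataset name is a deliberately non-existent placeholder):

```bash
# Expect a clear "missing dataset" error here; per the checklist above, the
# message now carries a detailed error code (exact format not shown in this diff).
duckdb -unsigned -c "
load 'build/release/extension/gaggle/gaggle.duckdb_extension';
select * from gaggle_ls('no-such-user/no-such-dataset');
"
```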

@@ -80,8 +80,13 @@ It outlines features to be implemented and their current status.
 * **Testing**
 * [x] Unit tests for core modules (Rust).
 * [x] SQL integration tests (DuckDB shell).
-* [ ] End-to-end integration tests with mocked HTTP.
+* [x] End-to-end integration tests with mocked HTTP (basic coverage).
 * [ ] Performance benchmarks.
 * **Distribution**
 * [ ] Pre-compiled extension binaries for Linux, macOS, and Windows.
 * [ ] Submission to the DuckDB Community Extensions repository.
+
+### 7. Observability
+
+* **Logging**
+* [x] Structured logging via `tracing` with `GAGGLE_LOG_LEVEL`.
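
The local-only mode and structured logging items above can be combined for working without network access. A minimal sketch, assuming `GAGGLE_OFFLINE` accepts a truthy value like `1` and that the dataset was cached in an earlier session:

```bash
export GAGGLE_OFFLINE=1        # assumed truthy spelling; disables network, serves cached data only
export GAGGLE_LOG_LEVEL=DEBUG  # verbose structured logs from the Rust core (via tracing)

# Reads are served from the local cache; uncached datasets fail fast instead of downloading.
duckdb -unsigned -c "
load 'build/release/extension/gaggle/gaggle.duckdb_extension';
select count(*) from 'kaggle:habedi/flickr-8k-dataset-clean/flickr8k.parquet';
"
```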
