Commit f0b7e2b

Simplify the project
1 parent 694987b commit f0b7e2b

43 files changed, +3242 -4695 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -95,3 +95,4 @@ _rust.h
 uv.lock
 tests/temp_models/
 *.cast
+*.proptest-regressions

Makefile

Lines changed: 1 addition & 6 deletions
@@ -51,7 +51,7 @@ rust-format: ## Format Rust files
 .PHONY: rust-test
 rust-test: rust-format ## Run tests
 @echo "Running the unit tests for Gaggle..."
-@cargo test --manifest-path gaggle/Cargo.toml --all-targets --features "expose_internal" -- --nocapture
+@cargo test --manifest-path gaggle/Cargo.toml --all-targets -- --nocapture
 
 .PHONY: rust-coverage
 rust-coverage: ## Generate code coverage report for Gaggle crate
@@ -151,8 +151,3 @@ examples: ## Run SQL examples for Gaggle extension
 ./build/release/duckdb < $$sql_file; \
 echo "============================================================================"; \
 done
-
-.PHONY: itest
-itest: release ## Run Python integration test against built DuckDB shell
-@echo "Running integration tests..."
-@uv run -q python3 test/integration/test_duckdb_integration.py

README.md

Lines changed: 5 additions & 4 deletions
@@ -9,7 +9,7 @@
 [![Tests](https://img.shields.io/github/actions/workflow/status/CogitatorTech/gaggle/tests.yml?label=tests&style=flat&labelColor=282c34&logo=github)](https://github.com/CogitatorTech/gaggle/actions/workflows/tests.yml)
 [![Code Quality](https://img.shields.io/codefactor/grade/github/CogitatorTech/gaggle?label=quality&style=flat&labelColor=282c34&logo=codefactor)](https://www.codefactor.io/repository/github/CogitatorTech/gaggle)
 [![Examples](https://img.shields.io/badge/examples-view-green?style=flat&labelColor=282c34&logo=github)](https://github.com/CogitatorTech/gaggle/tree/main/docs/examples)
-[![Docs](https://img.shields.io/badge/docs-view-blue?style=flat&labelColor=282c34&logo=read-the-docs)](https://github.com/CogitatorTech/gaggle/tree/main/docs)
+[![Docs](https://img.shields.io/badge/docs-read-blue?style=flat&labelColor=282c34&logo=read-the-docs)](https://github.com/CogitatorTech/gaggle/tree/main/docs)
 [![License](https://img.shields.io/badge/license-MIT%2FApache--2.0-007ec6?style=flat&labelColor=282c34&logo=open-source-initiative)](https://github.com/CogitatorTech/gaggle)
 
 Kaggle Datasets for DuckDB
@@ -18,19 +18,20 @@ Kaggle Datasets for DuckDB
 
 ---
 
-Gaggle is a DuckDB extension that allows you to read and write Kaggle datasets directly in SQL queries, as if
+Gaggle is a DuckDB extension that allows you to work with Kaggle datasets directly in SQL queries, as if
 they were DuckDB tables.
 It is written in Rust and uses the official Kaggle API to search, download, and manage datasets.
 
 Kaggle hosts a large collection of very useful datasets for data science and machine learning work.
 Accessing these datasets typically involves multiple steps including manually downloading a dataset (as a ZIP file),
 extracting it, loading the files in the dataset into your data science environment, and managing storage and dataset
 updates, etc.
-This workflow can be simplified and optimized by bringing the datasets directly into DuckDB.
+This workflow can be a at time.
+Gaggle tries to help simplify this process by integrating Kaggle dataset access directly into DuckDB.
 
 ### Features
 
-- Has a simple API (just a few SQL functions)
+- Has a simple API (just a handful of SQL functions)
 - Allows you search, download, update, and delete Kaggle datasets directly from DuckDB
 - Supports datasets made of CSV, JSON, and Parquet files
 - Configurable and has built-in caching support

ROADMAP.md

Lines changed: 9 additions & 14 deletions
@@ -10,15 +10,14 @@ It outlines features to be implemented and their current status.
 
 * **Authentication**
 * [x] Set Kaggle API credentials programmatically.
-* [x] Support environment variables (using `KAGGLE_USERNAME` and `KAGGLE_KEY`).
-* [x] Support `~/.kaggle/kaggle.json file`.
+* [x] Support environment variables for authentication (`KAGGLE_USERNAME` and `KAGGLE_KEY`).
+* [x] Support reading credentials from `~/.kaggle/kaggle.json file`.
 * **Dataset Operations**
-* [x] Search for datasets.
+* [x] Search for datasets on Kaggle.
 * [x] Download datasets from Kaggle.
 * [x] List files in a dataset.
 * [x] Get dataset metadata.
-* [ ] Upload datasets to Kaggle.
-* [ ] Delete datasets from Kaggle.
+* [ ] Upload DuckDB tables to Kaggle.
 
 ### 2. Caching and Storage
 
@@ -28,7 +27,6 @@ It outlines features to be implemented and their current status.
 * [x] Get cache information (size and storage location).
 * [ ] Set cache size limit.
 * [ ] Cache expiration policies.
-* [ ] Support for partial file downloads and resumes.
 * **Storage**
 * [x] Store datasets in configurable directory.
 * [ ] Support for cloud storage backends (S3, GCS, and Azure).
@@ -37,24 +35,22 @@ It outlines features to be implemented and their current status.
 
 * **File Format Support**
 * [x] CSV and TSV file reading.
-* [x] JSON file reading.
 * [x] Parquet file reading.
+* [x] JSON file reading.
 * [ ] Excel and XLSX file reading.
-* **Direct Query Integration**
+* **Querying Datasets**
 * [x] Replacement scan for `kaggle:` URLs.
-* [ ] Direct SQL queries on remote datasets without full download (true streaming).
-* [ ] Streaming data from Kaggle without caching.
 * [ ] Virtual table support for lazy loading.
 
 ### 4. Performance and Concurrency
 
 * **Concurrency Control**
 * [x] Thread-safe credential storage.
 * [x] Thread-safe cache access.
-* [ ] Concurrent dataset downloads.
+* [x] Concurrent dataset downloads (with per-dataset serialization to prevent race conditions).
 * **Network Optimization**
 * [x] Configurable HTTP timeouts.
-* [ ] Retry logic with backoff (configurable attempts/delay; planned).
+* [x] Retry logic with backoff for failed requests.
 * **Caching Strategy**
 * [ ] Incremental cache updates.
 * [ ] Background cache synchronization.
@@ -67,7 +63,7 @@ It outlines features to be implemented and their current status.
 * [x] Clear error messages for `NULL` inputs.
 * [ ] Detailed error codes for programmatic error handling.
 * **Resilience**
-* [ ] Automatic retry on network failures (planned with backoff settings).
+* [x] Automatic retry on network failures.
 * [ ] Graceful degradation when Kaggle API is unavailable.
 * [ ] Local-only mode for cached datasets.
 
@@ -87,4 +83,3 @@ It outlines features to be implemented and their current status.
 * **Distribution**
 * [ ] Pre-compiled extension binaries for Linux, macOS, and Windows.
 * [ ] Submission to the DuckDB Community Extensions repository.
-* [ ] Docker image with Gaggle pre-installed.
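
Note on the ROADMAP.md change: the hunks mark "Retry logic with backoff for failed requests" as done, but the Rust code behind it is not among the files shown here. As a rough illustration only, and not Gaggle's actual implementation, a retry loop with exponential backoff might look like the sketch below. The name `retry_with_backoff`, its parameters, and the delay-doubling schedule are all assumptions made for this example.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical helper (not taken from the Gaggle codebase).
/// Calls `op` until it succeeds or `max_attempts` calls have been made,
/// sleeping between failed attempts and doubling the delay each time.
/// `op` receives the 1-based attempt number and is always called at least once.
fn retry_with_backoff<T, E, F>(
    max_attempts: u32,
    initial_delay: Duration,
    mut op: F,
) -> Result<T, E>
where
    F: FnMut(u32) -> Result<T, E>,
{
    let mut delay = initial_delay;
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                sleep(delay);
                // Double the delay, saturating instead of overflowing.
                delay = delay.saturating_mul(2);
                attempt += 1;
            }
        }
    }
}
```

A caller would wrap a single fallible request in the closure, for example `retry_with_backoff(3, Duration::from_millis(200), |_| fetch())`, where `fetch` is a placeholder for whatever performs the HTTP call. A real implementation would likely also distinguish retryable errors (such as timeouts) from permanent ones (such as authentication failures), which this sketch does not attempt.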
