Reorganize raw2json and QA code location #2006
Conversation
Pull Request Overview
This PR reorganizes the raw2json code by moving it from its separate location into the transform directory, refactors it to use the zephyr library for data processing, and adds end-to-end tests for the transformation pipeline.
- Moves raw2json code to marin.transform.huggingface.dataset_to_eval
- Refactors the implementation to use zephyr's Dataset API instead of fsspec for file I/O
- Adds comprehensive tests for both evaluation and decontamination output formats
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| lib/marin/src/marin/transform/huggingface/dataset_to_eval.py | Refactored to use zephyr for data processing; extracted transformation logic into separate functions; renamed main function to hf_dataset_to_jsonl with raw2json as backward-compatible alias |
| tests/transform/test_huggingface_dataset_to_eval.py | Added new end-to-end tests for evaluation and decontamination format transformations with mock datasets |
| run_test.py | Added simple test runner script for running the new tests without pytest |
| experiments/train_test_overlap/eval_datasets_overlap.py | Updated import statement to reference new module location |
| experiments/exp412_download_and_raw2json_hf_qa.py | Updated import statement to reference new module location |
| experiments/exp1342_gemstones_scaling_law.py | Updated import statement to reference new module location |
| experiments/eval_datasets.py | Updated import statement to reference new module location |
| data_browser/src/ExperimentPage.js | Added step renderers for both old (raw2json) and new (hf_dataset_to_jsonl) function names |
| .pyrefly-baseline.json | Updated file path in baseline configuration |
Comments suppressed due to low confidence (4)

lib/marin/src/marin/transform/huggingface/dataset_to_eval.py:463
- The comment says "infer answer_label (e.g. A) from answer_label" but should say "infer answer_label (e.g. A) from answer_idx", since the code uses answer_idx to index into answer_labels.

lib/marin/src/marin/transform/huggingface/dataset_to_eval.py:482
- Bug: the condition if answer_idx: evaluates to False when answer_idx is 0, which is a valid index. This means that when the correct answer is the first option (index 0), the metadata won't be set. The condition should be if answer_idx is not None: to correctly handle the case where answer_idx is 0.

lib/marin/src/marin/transform/huggingface/dataset_to_eval.py:35
- [nitpick] The example usage shows running with uv run zephyr --backend=sync, but the script can also be run directly with uv run python. Consider clarifying when to use the zephyr CLI versus direct execution, or update the example to show the standard way to run this script (e.g., python -m marin.transform.huggingface.dataset_to_eval).

lib/marin/src/marin/transform/huggingface/dataset_to_eval.py:101
- Spelling error: "decontaminaton" should be "decontamination" (missing 'i').
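The truthiness bug flagged at line 482 is easy to reproduce in isolation. The sketch below uses hypothetical helper names (label_buggy, label_fixed) purely to illustrate why `if answer_idx:` silently drops index 0:

```python
# Illustration of the truthiness pitfall flagged above: answer_idx = 0 is a
# valid list index, but `if answer_idx:` treats 0 as falsy and skips it.
answer_labels = ["A", "B", "C", "D"]

def label_buggy(answer_idx):
    # Buggy: skips the first option because 0 is falsy
    if answer_idx:
        return answer_labels[answer_idx]
    return None

def label_fixed(answer_idx):
    # Fixed: only skips when answer_idx is actually missing
    if answer_idx is not None:
        return answer_labels[answer_idx]
    return None

print(label_buggy(0))    # None (wrong: a valid first-option answer is dropped)
print(label_fixed(0))    # A
print(label_fixed(None)) # None
```

The same pitfall applies to any optional integer field where 0 is meaningful; `is not None` is the idiomatic guard.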
# Mock load_datasets to return our mock dataset
import marin.transform.huggingface.dataset_to_eval as module
Copilot AI · Nov 14, 2025
Module 'marin.transform.huggingface.dataset_to_eval' is imported with both 'import' and 'import from'.
Move marin/raw2json/huggingface/qa/raw2json.py to marin/transform/huggingface/dataset_to_eval.py to:

1. Follow the established codebase pattern of organizing transforms by data source (wikipedia, stackexchange, lingoly, etc.)
2. Remove the misleading "raw2json" naming: this code converts HuggingFace datasets to evaluation/decontamination formats, not "raw" to JSON
3. Improve discoverability by placing HuggingFace transforms near other transform modules

Updated all import statements across:
- experiments/eval_datasets.py
- experiments/exp412_download_and_raw2json_hf_qa.py
- experiments/exp1342_gemstones_scaling_law.py
- experiments/train_test_overlap/eval_datasets_overlap.py
- data_browser/src/ExperimentPage.js
- .pyrefly-baseline.json

No functional changes to the code itself.
Changes:

1. Renamed the main function from raw2json to hf_dataset_to_jsonl for clarity
2. Refactored to use zephyr's Dataset API for the transformation pipeline:
   - Added a transform_example_to_qa() function following the zephyr pattern
   - Uses Dataset.from_iter() -> .map() -> .filter() -> .write_jsonl()
   - Leverages zephyr's built-in gzip compression support
3. Kept raw2json as a backward-compatibility alias
4. Updated ExperimentPage.js to support both old and new function names
5. Added a comprehensive module docstring with usage examples

Benefits:
- Consistent with other transform modules (lavita_to_dolma, lingoly/to_dolma)
- Better separation of concerns (transform logic in a dedicated function)
- More maintainable and testable code structure
- Leverages zephyr's efficient parallel processing capabilities
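The from_iter -> map -> filter -> write_jsonl pattern the commit describes can be sketched without zephyr at all. The Dataset class below is a toy stand-in written for this illustration, not zephyr's actual API; only the shape of the chain is taken from the commit message:

```python
# Toy stand-in for a lazy map/filter pipeline ending in gzipped JSONL output.
import gzip
import json

class Dataset:
    def __init__(self, items):
        self._items = items

    @classmethod
    def from_iter(cls, iterable):
        return cls(iter(iterable))

    def map(self, fn):
        # Lazily transform each item
        return Dataset(fn(x) for x in self._items)

    def filter(self, pred):
        # Lazily drop items failing the predicate
        return Dataset(x for x in self._items if pred(x))

    def write_jsonl(self, path):
        # Gzip when the extension asks for it, mirroring the built-in
        # compression support mentioned above
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "wt") as f:
            for item in self._items:
                f.write(json.dumps(item) + "\n")

examples = [{"question": "2+2?", "answer": "4"}, {"question": "", "answer": "?"}]
(Dataset.from_iter(examples)
    .map(lambda ex: {"prompt": ex["question"], "response": ex["answer"]})
    .filter(lambda ex: bool(ex["prompt"]))
    .write_jsonl("out.jsonl.gz"))
```

The real zephyr version adds parallel execution behind the same chained interface, which is what the commit means by "efficient parallel processing capabilities".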
Added test file: tests/transform/test_huggingface_dataset_to_eval.py

Test coverage:
- Helper functions:
  * get_nested_item: nested dictionary key extraction
  * is_kv_list: key-value list detection
  * standardize_options: options standardization (dict/list/kv formats)
  * format_prompt_response: prompt/response formatting
- Transform function:
  * transform_example_to_qa for the evaluation format
  * transform_example_to_qa for the decontamination format
  * Handling answer labels vs. indices
  * Nested key extraction
- End-to-end tests:
  * Complete pipeline with a mock HuggingFace dataset
  * Both evaluation and decontamination output formats
  * Gzipped JSONL output verification

Test results (using uv run python test_runner.py):
✓ 10 tests passed
✓ All core transformation logic verified
✓ Both output formats tested
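For readers unfamiliar with the helpers listed above, a nested-key extractor like get_nested_item typically walks dotted paths through a record. The sketch below is a hypothetical reimplementation; the real function in marin may have a different signature:

```python
# Hypothetical sketch of a dotted-path extractor in the spirit of
# get_nested_item; the signature is assumed, not marin's actual one.
def get_nested_item(record, dotted_key, default=None):
    """Walk a dict through '.'-separated keys, e.g. 'meta.source.name'."""
    current = record
    for key in dotted_key.split("."):
        if isinstance(current, dict) and key in current:
            current = current[key]
        else:
            return default
    return current

example = {"meta": {"source": {"name": "mmlu"}}, "answer": "B"}
print(get_nested_item(example, "meta.source.name"))            # mmlu
print(get_nested_item(example, "meta.missing", default="n/a")) # n/a
```

Helpers like this let dataset configs reference deeply nested HuggingFace fields with a single string key.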
Removed helper function tests, keeping only: - test_hf_dataset_to_jsonl_evaluation_format - test_hf_dataset_to_jsonl_decontamination_format Both tests are now top-level functions that verify the complete transformation pipeline.
… transform
- Split transform logic: transform_example_to_qa (core) and wrap_transform (error handling)
- Use dict format for enumerate output: {"idx": ..., "example": ...} for clarity
- Add {shard:05d} placeholder to output pattern for multi-shard support
- Update tests to handle sharded output files using glob patterns
- All tests passing (2/2)
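The {shard:05d} placeholder mentioned above expands via standard Python string formatting, and the sharded files can then be matched with a glob pattern in tests. The paths below are illustrative, not the PR's actual output layout:

```python
# How a {shard:05d} placeholder in an output pattern expands, and how tests
# can match the resulting shard files with a glob-style pattern.
import fnmatch

pattern = "output/eval-{shard:05d}.jsonl.gz"
paths = [pattern.format(shard=i) for i in range(3)]
print(paths)
# ['output/eval-00000.jsonl.gz', 'output/eval-00001.jsonl.gz', 'output/eval-00002.jsonl.gz']

# Tests match sharded output with a wildcard instead of a fixed filename
glob_pattern = "output/eval-*.jsonl.gz"
assert all(fnmatch.fnmatch(p, glob_pattern) for p in paths)
```

Zero-padding the shard index keeps lexicographic and numeric file ordering in agreement, which matters when shards are listed by a filesystem glob.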
Tests now run successfully with uv run pytest.
- Remove the backward-compatibility alias raw2json
- Update data_browser to only use hf_dataset_to_jsonl
- Remove implementation details from docstrings
- Refactor tests to use proper mocking with context managers
- Fix linting issues: unused imports, loop variable binding
- Format code with black
Update all experiments importing raw2json to use the new function name hf_dataset_to_jsonl. The function was renamed during the recent refactor, but some experiment files were not updated.

Fixes import errors in:
- exp412_download_and_raw2json_hf_qa.py
- exp1342_gemstones_scaling_law.py
- eval_datasets.py
- train_test_overlap/eval_datasets_overlap.py

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Force-pushed ba22f34 to 4414b11
The raw2json code was a bit of an oddity, living separately from the other transformation code.
This moves it into the transform directory, switches the data processing to zephyr, and adds a simple test.