Reorganize raw2json and QA code location #2006
Conversation
Pull Request Overview
This PR reorganizes the raw2json code by moving it from its separate location into the transform directory, refactors it to use the zephyr library for data processing, and adds end-to-end tests for the transformation pipeline.
- Moves raw2json code to marin.transform.huggingface.dataset_to_eval
- Refactors the implementation to use zephyr's Dataset API instead of fsspec for file I/O
- Adds comprehensive tests for both evaluation and decontamination output formats
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| lib/marin/src/marin/transform/huggingface/dataset_to_eval.py | Refactored to use zephyr for data processing; extracted transformation logic into separate functions; renamed main function to hf_dataset_to_jsonl with raw2json as backward-compatible alias |
| tests/transform/test_huggingface_dataset_to_eval.py | Added new end-to-end tests for evaluation and decontamination format transformations with mock datasets |
| run_test.py | Added simple test runner script for running the new tests without pytest |
| experiments/train_test_overlap/eval_datasets_overlap.py | Updated import statement to reference new module location |
| experiments/exp412_download_and_raw2json_hf_qa.py | Updated import statement to reference new module location |
| experiments/exp1342_gemstones_scaling_law.py | Updated import statement to reference new module location |
| experiments/eval_datasets.py | Updated import statement to reference new module location |
| data_browser/src/ExperimentPage.js | Added step renderers for both old (raw2json) and new (hf_dataset_to_jsonl) function names |
| .pyrefly-baseline.json | Updated file path in baseline configuration |
Comments suppressed due to low confidence (4)

lib/marin/src/marin/transform/huggingface/dataset_to_eval.py:463
- The comment says "infer answer_label (e.g. A) from answer_label" but should say "infer answer_label (e.g. A) from answer_idx", since the code uses answer_idx to index into answer_labels.

lib/marin/src/marin/transform/huggingface/dataset_to_eval.py:482
- Bug: the condition if answer_idx: evaluates to False when answer_idx is 0, which is a valid index. This means that when the correct answer is the first option (index 0), the metadata won't be set. The condition should be if answer_idx is not None: to correctly handle the case where answer_idx is 0.

lib/marin/src/marin/transform/huggingface/dataset_to_eval.py:35
- [nitpick] The example usage shows running with uv run zephyr --backend=sync, but the script can also be run directly with uv run python. Consider clarifying when to use the zephyr CLI versus direct execution, or update the example to show the standard way to run this script (e.g., python -m marin.transform.huggingface.dataset_to_eval).

lib/marin/src/marin/transform/huggingface/dataset_to_eval.py:101
- Spelling error: "decontaminaton" should be "decontamination" (missing 'i').
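The truthiness bug flagged at line 482 is easy to reproduce in isolation. The sketch below uses hypothetical helper names (label_buggy, label_fixed) purely to illustrate why `if answer_idx:` silently drops index 0:

```python
# Illustration of the truthiness pitfall flagged above: answer_idx = 0 is a
# valid list index, but `if answer_idx:` treats 0 as falsy and skips it.
answer_labels = ["A", "B", "C", "D"]

def label_buggy(answer_idx):
    # Buggy: skips the first option because 0 is falsy
    if answer_idx:
        return answer_labels[answer_idx]
    return None

def label_fixed(answer_idx):
    # Fixed: only skips when answer_idx is actually missing
    if answer_idx is not None:
        return answer_labels[answer_idx]
    return None

print(label_buggy(0))    # None (wrong: a valid first-option answer is dropped)
print(label_fixed(0))    # A
print(label_fixed(None)) # None
```

The same pitfall applies to any optional integer field where 0 is meaningful; `is not None` is the idiomatic guard.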
# Mock load_datasets to return our mock dataset
import marin.transform.huggingface.dataset_to_eval as module
Copilot AI · Nov 14, 2025
Module 'marin.transform.huggingface.dataset_to_eval' is imported with both 'import' and 'import from'.
Move marin/raw2json/huggingface/qa/raw2json.py to marin/transform/huggingface/dataset_to_eval.py to:

1. Follow the established codebase pattern of organizing transforms by data source (wikipedia, stackexchange, lingoly, etc.)
2. Remove the misleading "raw2json" naming: this code converts HuggingFace datasets to evaluation/decontamination formats, not "raw" to JSON
3. Improve discoverability by placing HuggingFace transforms near other transform modules

Updated all import statements across:
- experiments/eval_datasets.py
- experiments/exp412_download_and_raw2json_hf_qa.py
- experiments/exp1342_gemstones_scaling_law.py
- experiments/train_test_overlap/eval_datasets_overlap.py
- data_browser/src/ExperimentPage.js
- .pyrefly-baseline.json

No functional changes to the code itself.
Changes:

1. Renamed the main function from raw2json to hf_dataset_to_jsonl for clarity
2. Refactored to use zephyr's Dataset API for the transformation pipeline:
   - Added a transform_example_to_qa() function following the zephyr pattern
   - Uses Dataset.from_iter() -> .map() -> .filter() -> .write_jsonl()
   - Leverages zephyr's built-in gzip compression support
3. Kept raw2json as a backward-compatibility alias
4. Updated ExperimentPage.js to support both old and new function names
5. Added a comprehensive module docstring with usage examples

Benefits:
- Consistent with other transform modules (lavita_to_dolma, lingoly/to_dolma)
- Better separation of concerns (transform logic in a dedicated function)
- More maintainable and testable code structure
- Leverages zephyr's efficient parallel processing capabilities
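The from_iter -> map -> filter -> write_jsonl pattern the commit describes can be sketched without zephyr at all. The Dataset class below is a toy stand-in written for this illustration, not zephyr's actual API; only the shape of the chain is taken from the commit message:

```python
# Toy stand-in for a lazy map/filter pipeline ending in gzipped JSONL output.
import gzip
import json

class Dataset:
    def __init__(self, items):
        self._items = items

    @classmethod
    def from_iter(cls, iterable):
        return cls(iter(iterable))

    def map(self, fn):
        # Lazily transform each item
        return Dataset(fn(x) for x in self._items)

    def filter(self, pred):
        # Lazily drop items failing the predicate
        return Dataset(x for x in self._items if pred(x))

    def write_jsonl(self, path):
        # Gzip when the extension asks for it, mirroring the built-in
        # compression support mentioned above
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "wt") as f:
            for item in self._items:
                f.write(json.dumps(item) + "\n")

examples = [{"question": "2+2?", "answer": "4"}, {"question": "", "answer": "?"}]
(Dataset.from_iter(examples)
    .map(lambda ex: {"prompt": ex["question"], "response": ex["answer"]})
    .filter(lambda ex: bool(ex["prompt"]))
    .write_jsonl("out.jsonl.gz"))
```

The real zephyr version adds parallel execution behind the same chained interface, which is what the commit means by "efficient parallel processing capabilities".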
Added test file: tests/transform/test_huggingface_dataset_to_eval.py

Test coverage:
- Helper functions:
  * get_nested_item: nested dictionary key extraction
  * is_kv_list: key-value list detection
  * standardize_options: options standardization (dict/list/kv formats)
  * format_prompt_response: prompt/response formatting
- Transform function:
  * transform_example_to_qa for the evaluation format
  * transform_example_to_qa for the decontamination format
  * Handling answer labels vs. indices
  * Nested key extraction
- End-to-end tests:
  * Complete pipeline with a mock HuggingFace dataset
  * Both evaluation and decontamination output formats
  * Gzipped JSONL output verification

Test results (using uv run python test_runner.py):
✓ 10 tests passed
✓ All core transformation logic verified
✓ Both output formats tested
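For readers unfamiliar with the helpers listed above, a nested-key extractor like get_nested_item typically walks dotted paths through a record. The sketch below is a hypothetical reimplementation; the real function in marin may have a different signature:

```python
# Hypothetical sketch of a dotted-path extractor in the spirit of
# get_nested_item; the signature is assumed, not marin's actual one.
def get_nested_item(record, dotted_key, default=None):
    """Walk a dict through '.'-separated keys, e.g. 'meta.source.name'."""
    current = record
    for key in dotted_key.split("."):
        if isinstance(current, dict) and key in current:
            current = current[key]
        else:
            return default
    return current

example = {"meta": {"source": {"name": "mmlu"}}, "answer": "B"}
print(get_nested_item(example, "meta.source.name"))            # mmlu
print(get_nested_item(example, "meta.missing", default="n/a")) # n/a
```

Helpers like this let dataset configs reference deeply nested HuggingFace fields with a single string key.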
Removed helper function tests, keeping only: - test_hf_dataset_to_jsonl_evaluation_format - test_hf_dataset_to_jsonl_decontamination_format Both tests are now top-level functions that verify the complete transformation pipeline.
… transform
- Split transform logic: transform_example_to_qa (core) and wrap_transform (error handling)
- Use dict format for enumerate output: {"idx": ..., "example": ...} for clarity
- Add {shard:05d} placeholder to output pattern for multi-shard support
- Update tests to handle sharded output files using glob patterns
- All tests passing (2/2)
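The {shard:05d} placeholder mentioned above expands via standard Python string formatting, and the sharded files can then be matched with a glob pattern in tests. The paths below are illustrative, not the PR's actual output layout:

```python
# How a {shard:05d} placeholder in an output pattern expands, and how tests
# can match the resulting shard files with a glob-style pattern.
import fnmatch

pattern = "output/eval-{shard:05d}.jsonl.gz"
paths = [pattern.format(shard=i) for i in range(3)]
print(paths)
# ['output/eval-00000.jsonl.gz', 'output/eval-00001.jsonl.gz', 'output/eval-00002.jsonl.gz']

# Tests match sharded output with a wildcard instead of a fixed filename
glob_pattern = "output/eval-*.jsonl.gz"
assert all(fnmatch.fnmatch(p, glob_pattern) for p in paths)
```

Zero-padding the shard index keeps lexicographic and numeric file ordering in agreement, which matters when shards are listed by a filesystem glob.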
Tests now run successfully with uv run pytest.
- Remove the backward-compatibility alias raw2json
- Update data_browser to only use hf_dataset_to_jsonl
- Remove implementation details from docstrings
- Refactor tests to use proper mocking with context managers
- Fix linting issues: unused imports, loop variable binding
- Format code with black
Update all experiments importing raw2json to use the new function name hf_dataset_to_jsonl. The function was renamed during the recent refactor, but some experiment files were not updated.

Fixes import errors in:
- exp412_download_and_raw2json_hf_qa.py
- exp1342_gemstones_scaling_law.py
- eval_datasets.py
- train_test_overlap/eval_datasets_overlap.py

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Force-pushed ba22f34 to 4414b11
The raw2json code was a bit of an oddity, living separately from the other transformation code.
This moves it into the transform directory, switches the data processing to zephyr, and adds a simple test.