Skip to content

Conversation

@rjpower
Copy link
Collaborator

@rjpower rjpower commented Nov 14, 2025

The raw2json code was a bit of an oddity, living separately from the other transformation code.

This moves it to transform, uses zephyr for the data processing and adds a simple test.

@rjpower rjpower requested review from Helw150 and Copilot November 14, 2025 19:03
Copilot finished reviewing on behalf of rjpower November 14, 2025 19:08
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR reorganizes the raw2json code by moving it from its separate location into the transform directory, refactors it to use the zephyr library for data processing, and adds end-to-end tests for the transformation pipeline.

  • Moves raw2json code to marin.transform.huggingface.dataset_to_eval
  • Refactors implementation to use zephyr's Dataset API instead of fsspec for file I/O
  • Adds comprehensive tests for both evaluation and decontamination output formats

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
lib/marin/src/marin/transform/huggingface/dataset_to_eval.py Refactored to use zephyr for data processing; extracted transformation logic into separate functions; renamed main function to hf_dataset_to_jsonl with raw2json as backward-compatible alias
tests/transform/test_huggingface_dataset_to_eval.py Added new end-to-end tests for evaluation and decontamination format transformations with mock datasets
run_test.py Added simple test runner script for running the new tests without pytest
experiments/train_test_overlap/eval_datasets_overlap.py Updated import statement to reference new module location
experiments/exp412_download_and_raw2json_hf_qa.py Updated import statement to reference new module location
experiments/exp1342_gemstones_scaling_law.py Updated import statement to reference new module location
experiments/eval_datasets.py Updated import statement to reference new module location
data_browser/src/ExperimentPage.js Added step renderers for both old (raw2json) and new (hf_dataset_to_jsonl) function names
.pyrefly-baseline.json Updated file path in baseline configuration
Comments suppressed due to low confidence (4)

lib/marin/src/marin/transform/huggingface/dataset_to_eval.py:463

  • The comment says "infer answer_label (e.g. A) from answer_label" but should say "infer answer_label (e.g. A) from answer_idx" since it's using answer_idx to index into answer_labels.
    lib/marin/src/marin/transform/huggingface/dataset_to_eval.py:482
  • Bug: This condition checks if answer_idx: which will evaluate to False when answer_idx is 0 (a valid index). This means when the correct answer is the first option (index 0), the metadata won't be set. The condition should be if answer_idx is not None: to correctly handle the case where answer_idx is 0.
    lib/marin/src/marin/transform/huggingface/dataset_to_eval.py:35
  • [nitpick] The example usage shows running with uv run zephyr --backend=sync but the script can also be run directly with uv run python. Consider clarifying when to use zephyr CLI vs direct execution, or update the example to show the standard way to run this script (e.g., python -m marin.transform.huggingface.dataset_to_eval).
    lib/marin/src/marin/transform/huggingface/dataset_to_eval.py:101
  • Spelling error: "decontaminaton" should be "decontamination" (missing 'i').

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

)

# Mock load_datasets to return our mock dataset
import marin.transform.huggingface.dataset_to_eval as module
Copy link

Copilot AI Nov 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Module 'marin.transform.huggingface.dataset_to_eval' is imported with both 'import' and 'import from'.

Copilot uses AI. Check for mistakes.
@rjpower rjpower requested a review from dlwh November 15, 2025 00:34
claude and others added 8 commits November 20, 2025 10:56
Move marin/raw2json/huggingface/qa/raw2json.py to
marin/transform/huggingface/dataset_to_eval.py to:

1. Follow established codebase pattern of organizing transforms by data source
   (wikipedia, stackexchange, lingoly, etc.)
2. Remove misleading "raw2json" naming - this code converts HuggingFace
   datasets to evaluation/decontamination formats, not "raw" to JSON
3. Improve discoverability by placing HuggingFace transforms near other
   transform modules

Updated all import statements across:
- experiments/eval_datasets.py
- experiments/exp412_download_and_raw2json_hf_qa.py
- experiments/exp1342_gemstones_scaling_law.py
- experiments/train_test_overlap/eval_datasets_overlap.py
- data_browser/src/ExperimentPage.js
- .pyrefly-baseline.json

No functional changes to the code itself.
Changes:
1. Renamed main function from raw2json to hf_dataset_to_jsonl for clarity
2. Refactored to use zephyr's Dataset API for transformation pipeline:
   - Added transform_example_to_qa() function following zephyr pattern
   - Uses Dataset.from_iter() -> .map() -> .filter() -> .write_jsonl()
   - Leverages zephyr's built-in gzip compression support
3. Kept raw2json as backward compatibility alias
4. Updated ExperimentPage.js to support both old and new function names
5. Added comprehensive module docstring with usage examples

Benefits:
- Consistent with other transform modules (lavita_to_dolma, lingoly/to_dolma)
- Better separation of concerns (transform logic in dedicated function)
- More maintainable and testable code structure
- Leverages zephyr's efficient parallel processing capabilities
Added test file: tests/transform/test_huggingface_dataset_to_eval.py

Test coverage:
- Helper functions:
  * get_nested_item: nested dictionary key extraction
  * is_kv_list: key-value list detection
  * standardize_options: options standardization (dict/list/kv formats)
  * format_prompt_response: prompt/response formatting

- Transform function:
  * transform_example_to_qa for evaluation format
  * transform_example_to_qa for decontamination format
  * Handling answer labels vs indices
  * Nested key extraction

- End-to-end tests:
  * Complete pipeline with mock HuggingFace dataset
  * Both evaluation and decontamination output formats
  * Gzipped JSONL output verification

Test results (using uv run python test_runner.py):
✓ 10 tests passed
✓ All core transformation logic verified
✓ Both output formats tested
Removed helper function tests, keeping only:
- test_hf_dataset_to_jsonl_evaluation_format
- test_hf_dataset_to_jsonl_decontamination_format

Both tests are now top-level functions that verify the complete transformation pipeline.
… transform

- Split transform logic: transform_example_to_qa (core) and wrap_transform (error handling)
- Use dict format for enumerate output: {"idx": ..., "example": ...} for clarity
- Add {shard:05d} placeholder to output pattern for multi-shard support
- Update tests to handle sharded output files using glob patterns
- All tests passing (2/2)
Tests now run successfully with uv run pytest.
- Remove backward compatibility alias raw2json
- Update data_browser to only use hf_dataset_to_jsonl
- Remove implementation details from docstrings
- Refactor tests to use proper mocking with context managers
- Fix linting issues: unused imports, loop variable binding
- Format code with black
Update all experiments importing raw2json to use the new function name
hf_dataset_to_jsonl. The function was renamed during the recent refactor
but some experiment files were not updated.

Fixes import errors in:
- exp412_download_and_raw2json_hf_qa.py
- exp1342_gemstones_scaling_law.py
- eval_datasets.py
- train_test_overlap/eval_datasets_overlap.py

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@rjpower rjpower force-pushed the claude/refactor-raw2json-huggingface-qa-01JBFxavUzHAHMQXFiRHx3jK branch from ba22f34 to 4414b11 Compare November 20, 2025 18:56
@rjpower rjpower merged commit 841aa35 into main Nov 20, 2025
9 checks passed
@rjpower rjpower deleted the claude/refactor-raw2json-huggingface-qa-01JBFxavUzHAHMQXFiRHx3jK branch November 20, 2025 19:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants