@rjpower rjpower commented Nov 14, 2025

This refactoring consolidates all HTML/Markdown conversion functionality into a
single `marin.convert` package. It removes the circular dependency between the
`web`, `markdown`, and `schemas` modules and gives conversion code one import
location.

## Changes

### New Structure
- Created `marin/convert/` package with clean public API
  - `config.py` - Extraction configs (ExtractionConfig, TrafilaturaConfig, etc.)
  - `html.py` - HTML conversion functions (merged from web/convert.py + web/utils.py)
  - `markdown.py` - Markdown conversion utilities
  - `_code_detection.py` - Code language detection (internal)
  - `data/` - Supporting files (model, xsl, languages.json)

### Removed
- `marin/schemas/web/` - Moved to convert/config.py
- `marin/web/convert.py` - Merged into convert/html.py
- `marin/web/utils.py` - Merged into convert/html.py
- `marin/markdown/` - Moved to convert/markdown.py

### Cleaned Up
- `marin/web/` now only contains actual web utilities (rpv2.py, lookup_cc.py)
- Removed circular dependency between web, markdown, and schemas modules
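
To make the circular-dependency claim checkable, a minimal sketch of a cycle check over a module import graph follows. The edges in both example graphs are hypothetical: this description does not say which module imported which, so they only illustrate the shape of the problem. `find_cycle` is an illustrative helper, not part of `marin`.

```python
from typing import Dict, List, Optional


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one import cycle as a list of module names, or None if acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color: Dict[str, int] = {node: WHITE for node in graph}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GRAY
        path.append(node)
        for dep in graph.get(node, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so we closed a loop.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(graph):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Hypothetical edges, for illustration only.
before = {
    "marin.web": ["marin.markdown"],
    "marin.markdown": ["marin.schemas"],
    "marin.schemas": ["marin.web"],
}
after = {
    "marin.convert": [],
    "marin.web": ["marin.convert"],
}

print(find_cycle(before))
print(find_cycle(after))
```

A real check would build the graph from the repository's actual import statements rather than from hand-written edges.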

### Updated Imports
Updated 23 files across:
- Transform modules (6 files)
- Tests (11 files)
- Experiments (5 files)
- Utilities (1 file)

All imports now use: `from marin.convert import ...`
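
One way to confirm that no file still imports from a removed location is a small `ast`-based scan. This is a sketch, not tooling shipped with this PR: `find_stale_imports` is a hypothetical helper, and the prefix list is copied from the "Removed" section above.

```python
import ast
from typing import List, Tuple

# Module prefixes removed by this PR (from the "Removed" list).
REMOVED_PREFIXES = (
    "marin.schemas.web",
    "marin.web.convert",
    "marin.web.utils",
    "marin.markdown",
)


def _is_removed(module: str) -> bool:
    """True if module is a removed module or a submodule of one."""
    return any(module == p or module.startswith(p + ".") for p in REMOVED_PREFIXES)


def find_stale_imports(source: str) -> List[Tuple[int, str]]:
    """Return (line number, module) for every absolute import of a removed module."""
    stale: List[Tuple[int, str]] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            if _is_removed(node.module):
                stale.append((node.lineno, node.module))
        elif isinstance(node, ast.Import):
            for alias in node.names:
                if _is_removed(alias.name):
                    stale.append((node.lineno, alias.name))
    return sorted(stale)


sample = (
    "from marin.schemas.web.convert import HtmlToMarkdownConfig\n"
    "from marin.convert import convert_page\n"
    "import marin.markdown\n"
    "from marin.web import rpv2\n"
)
print(find_stale_imports(sample))
```

The scan only looks at the module path, so forms such as `from marin.web import convert` or relative imports are not flagged and would need an extra check.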

## Benefits

1. **Resolved circular dependencies** - All conversion code in one package
2. **Improved discoverability** - Single import location for all conversion
3. **Better organization** - Logical grouping of related functionality
4. **Cleaner web/ package** - Only contains actual web utilities
5. **Future-proof** - Easy to extend with new converters (PDF, DOCX, etc.)

## Migration Guide

Old imports:
```python
from marin.schemas.web.convert import HtmlToMarkdownConfig
from marin.web.convert import convert_page
from marin.markdown import to_markdown
```

New imports:
```python
from marin.convert import HtmlToMarkdownConfig, convert_page, to_markdown
```
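
For downstream code that still uses the old paths, the rewrite can be scripted. The sketch below covers only the three old modules shown above and assumes the imported names are re-exported unchanged from `marin.convert`, as the example indicates. `rewrite_imports` is a hypothetical helper, not part of this PR.

```python
import re

# Old module paths from the migration example above.
OLD_MODULES = (
    "marin.schemas.web.convert",
    "marin.web.convert",
    "marin.markdown",
)

# Match "from <old module> import " at the start of a line, keeping indentation.
_OLD_IMPORT = re.compile(
    r"^([ \t]*)from[ \t]+(?:"
    + "|".join(re.escape(m) for m in OLD_MODULES)
    + r")[ \t]+import[ \t]+",
    re.MULTILINE,
)


def rewrite_imports(source: str) -> str:
    """Point old-style `from ... import ...` lines at marin.convert."""
    return _OLD_IMPORT.sub(r"\1from marin.convert import ", source)


old = (
    "from marin.schemas.web.convert import HtmlToMarkdownConfig\n"
    "from marin.web.convert import convert_page\n"
    "from marin.markdown import to_markdown\n"
)
print(rewrite_imports(old))
```

Each line is rewritten in place rather than merged into a single import statement, so a formatter or linter pass afterwards can combine them.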

Resolves discussion about marin/schemas organization and consolidates
scattered conversion functionality.