@manikyarathore
Contributor

Overview

Fixes #689

This expands the HTML parser test suite by adding new unit tests for modern attributes commonly used for lazy loading and responsive images. These attributes are widely adopted across the web, and ensuring Heritrix extracts URLs correctly is essential for consistent crawling.

What’s Included

This update adds dedicated tests for URL extraction from:

  • data-src
  • data-full-src
  • data-lazy-srcset
  • srcset (additional coverage)

Each test verifies that Heritrix’s ExtractorHTML module correctly identifies and normalizes URLs from these attributes.
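
As a rough illustration of what such tests exercise, the sketch below pulls URLs out of the four attributes listed above using plain `java.util.regex`. It is not Heritrix's ExtractorHTML and does not use its API; the class name `LazyImageUrlSketch` and the method `extract` are hypothetical, and the comma-based `srcset` split is a simplification that would mishandle URLs containing commas.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, NOT Heritrix's ExtractorHTML implementation.
class LazyImageUrlSketch {
    // Matches the double-quoted attributes named in the PR description.
    private static final Pattern ATTR = Pattern.compile(
        "\\b(data-src|data-full-src|data-lazy-srcset|srcset)\\s*=\\s*\"([^\"]*)\"",
        Pattern.CASE_INSENSITIVE);

    static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = ATTR.matcher(html);
        while (m.find()) {
            String name = m.group(1).toLowerCase(Locale.ROOT);
            String value = m.group(2).trim();
            if (name.endsWith("srcset")) {
                // srcset-style value: "url descriptor, url descriptor, ..."
                // Naive split on commas; keep only the URL part of each candidate.
                for (String candidate : value.split(",")) {
                    String c = candidate.trim();
                    if (!c.isEmpty()) {
                        urls.add(c.split("\\s+")[0]);
                    }
                }
            } else if (!value.isEmpty()) {
                // Single-URL attribute such as data-src or data-full-src.
                urls.add(value);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<img data-src=\"a.jpg\" data-lazy-srcset=\"b.jpg 1x, c.jpg 2x\">";
        System.out.println(extract(html)); // prints [a.jpg, b.jpg, c.jpg]
    }
}
```

A real test in ExtractorHTMLTest would instead feed the HTML to the extractor and assert on the resulting outlinks; the sketch only shows the attribute shapes involved.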

Related Issue

Fixes #689

Collaborator

@ato ato left a comment

Looks like this might be missing an import:

Error:  Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.14.1:testCompile (default-testCompile) on project heritrix-modules: Compilation failure: Compilation failure: 
Error:  /home/runner/work/heritrix3/heritrix3/modules/src/test/java/org/archive/modules/extractor/ExtractorHTMLTest.java:[685,48] cannot find symbol
Error:    symbol:   class IOException
Error:    location: class org.archive.modules.extractor.ExtractorHTMLTest
Error:  /home/runner/work/heritrix3/heritrix3/modules/src/test/java/org/archive/modules/extractor/ExtractorHTMLTest.java:[693,52] cannot find symbol
Error:    symbol:   class IOException
Error:    location: class org.archive.modules.extractor.ExtractorHTMLTest
Error:  /home/runner/work/heritrix3/heritrix3/modules/src/test/java/org/archive/modules/extractor/ExtractorHTMLTest.java:[701,55] cannot find symbol
Error:    symbol:   class IOException
Error:    location: class org.archive.modules.extractor.ExtractorHTMLTest
Error:  /home/runner/work/heritrix3/heritrix3/modules/src/test/java/org/archive/modules/extractor/ExtractorHTMLTest.java:[710,57] cannot find symbol
Error:    symbol:   class IOException
Error:    location: class org.archive.modules.extractor.ExtractorHTMLTest
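
The "cannot find symbol ... class IOException" errors above usually mean a method names `IOException` (for example in a `throws` clause) without `java.io.IOException` being imported. A minimal sketch of the pattern follows; the class `IOExceptionImportSketch` and the method `firstLine` are hypothetical stand-ins, not code from ExtractorHTMLTest.

```java
import java.io.BufferedReader;
import java.io.IOException; // without this import, "throws IOException" does not compile
import java.io.StringReader;

// Hypothetical stand-in for a test class whose methods declare "throws IOException".
class IOExceptionImportSketch {
    // Reads the first line of the given text; readLine() may throw IOException.
    static String firstLine(String text) throws IOException {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(firstLine("<img data-src=\"a.jpg\">\nrest"));
    }
}
```

In the failing file, the likely fix is adding the same `import java.io.IOException;` line near the other imports, assuming nothing else is missing.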

@ato
Collaborator

ato commented Dec 5, 2025

I'm a bit confused by this PR; it almost seems corrupted in some way. There seem to be a lot of unrelated commits with identical messages that aren't showing in the full diff, and GitHub is still showing it as having test failures. I'm not going to risk merging it in this state, so if you'd like this change merged, please open a new clean PR with just the intended change. :-)

@ato ato closed this Dec 5, 2025

Development

Successfully merging this pull request may close these issues.

Fixes on unit test on HTML parser
