Description
When fetching a site that contains code blocks, the formatting of these blocks (indentation, line breaks) is lost in the final output file. This makes the code difficult to read and use.
Steps to Reproduce
- Run the following command:
npx sitefetch https://google.github.io/adk-docs/ -o adk_complete_docs.md
- Open the output file adk_complete_docs.md.
- Observe that the code examples are not formatted correctly.
Expected Behavior
Code blocks should be preserved in the output file with their original formatting, enclosed in Markdown code fences (```).
Cause of the Issue
The issue appears to be in how the content is extracted from the HTML. The tool currently uses article.textContent, which strips all HTML tags and does not preserve the formatting of <pre> and <code> elements.
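To illustrate the cause described above, here is a minimal, dependency-free sketch. The stripTags helper is hypothetical and only approximates what reading textContent does (a real DOM also decodes entities); it is not the code sitefetch runs. It shows that once the tags are gone, nothing marks where the code block starts or ends, so a Markdown renderer would treat the code as ordinary prose.

```typescript
// Rough stand-in for reading textContent: drop every tag, keep only the text.
// Hypothetical helper for illustration; not taken from sitefetch.
function stripTags(html: string): string {
  return html.replace(/<[^>]*>/g, "");
}

// A tiny page fragment: prose, a code block, then more prose.
const html =
  "<p>Install it:</p>" +
  "<pre><code>def greet():\n    print('hi')\n</code></pre>" +
  "<p>Then run it.</p>";

const text = stripTags(html);

// The prose and the code now run together with no fence or other marker.
console.log(text);
```

The <pre> and <code> boundaries are exactly the information a Markdown converter needs in order to emit code fences, which is why the fix has to start from the HTML rather than from the extracted text.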
Suggested Solution
To fix this, we can use a library like turndown to convert the HTML content to Markdown. This will correctly handle code blocks and other formatting.
Here's a proposed change in the fetchPage function in src/index.ts:
import { JSDOM } from "jsdom";
import { Readability } from "@mozilla/readability";
import TurndownService from "turndown";
// ...
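// Turndown defaults to indented code blocks; request fenced ones to match the expected output.
const turndownService = new TurndownService({ codeBlockStyle: "fenced" });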
// In the fetchPage function, replace 'content: article.textContent' with:
const content = turndownService.turndown(article.content);
// ...
This change will ensure that the HTML is converted to Markdown, preserving the code block formatting.
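As a quick way to check the behavior outside of sitefetch, the following sketch runs Turndown on a small HTML string. It assumes the turndown package is installed; the htmlToMarkdown helper and the sample HTML are made up for illustration. Note that Turndown's codeBlockStyle option defaults to "indented", so the sketch sets it to "fenced" to get the fences described under Expected Behavior.

```typescript
import TurndownService from "turndown";

// Assumption: fenced output is wanted, so override Turndown's default
// "indented" code block style.
const turndownService = new TurndownService({ codeBlockStyle: "fenced" });

// Hypothetical helper name; not part of sitefetch.
function htmlToMarkdown(html: string): string {
  return turndownService.turndown(html);
}

// Sample input: a paragraph followed by a <pre><code> block with a
// language class, similar to what many documentation sites emit.
const sample =
  "<p>Example:</p>" +
  '<pre><code class="language-python">def f():\n    return 1\n</code></pre>';

// Prints the paragraph followed by a fenced block that keeps the indentation.
console.log(htmlToMarkdown(sample));
```

One caveat worth testing against the target site: Turndown's built-in fenced rule applies to a <pre> whose first child is a <code> element. If a site wraps its code differently, a custom rule registered with turndownService.addRule may be needed.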