Bug report: Code block formatting is lost during site fetching #26

@rajeshkanaka

Description

When fetching a site that contains code blocks, the formatting of these blocks (indentation, line breaks) is lost in the final output file. This makes the code difficult to read and use.

Steps to Reproduce

  1. Run the following command:
    npx sitefetch https://google.github.io/adk-docs/ -o adk_complete_docs.md
  2. Open the output file adk_complete_docs.md.
  3. Observe that the code examples are not formatted correctly.

Expected Behavior

Code blocks should be preserved in the output file with their original formatting, enclosed in Markdown code fences (```).

Cause of the Issue

The issue appears to be in how the content is extracted from the HTML. The tool currently uses article.textContent, which strips all HTML tags and does not preserve the formatting of <pre> and <code> elements.
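To illustrate the difference, here is a minimal, dependency-free sketch. It is not sitefetch's code and not how turndown works internally: it runs regular expressions over a hard-coded HTML string, and the helper names (`decodeEntities`, `stripTags`, `preToFenced`) and the sample input are hypothetical. It assumes code blocks are marked up as `<pre><code class="language-…">`, which may not hold for every site.

```typescript
// Hypothetical sample input: a paragraph followed by a highlighted code block.
const html =
  "<p>Install it:</p>" +
  '<pre><code class="language-python">def greet(name):\n    print(name)\n</code></pre>';

// Decode the handful of entities that commonly appear inside code samples.
function decodeEntities(text: string): string {
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, "&");
}

// Rough stand-in for text-only extraction: drop every tag, keep the text.
function stripTags(input: string): string {
  return decodeEntities(input.replace(/<[^>]+>/g, ""));
}

// Wrap <pre><code> blocks in Markdown fences first, then drop the other tags.
function preToFenced(input: string): string {
  const fence = "`".repeat(3);
  const withFences = input.replace(
    /<pre><code(?: class="language-([\w-]+)")?>([\s\S]*?)<\/code><\/pre>/g,
    (_match: string, lang: string | undefined, code: string) =>
      `\n\n${fence}${lang ?? ""}\n${code.replace(/\n$/, "")}\n${fence}\n\n`
  );
  return stripTags(withFences).trim();
}

// Text-only extraction: the code runs straight on from the paragraph, unfenced.
console.log(stripTags(html));

// Fence-aware extraction: the code block is delimited and keeps its language tag.
console.log(preToFenced(html));
```

In the first output nothing marks where the code starts or ends. The second keeps the block boundary and language hint, which is what a proper HTML-to-Markdown converter should provide more robustly.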

Suggested Solution

To fix this, we can use a library like turndown to convert the HTML content to Markdown. This will correctly handle code blocks and other formatting.

Here's a proposed change in the fetchPage function in src/index.ts:

import { JSDOM } from "jsdom";
import { Readability } from "@mozilla/readability";
import TurndownService from "turndown";

// ...

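// Turndown indents code blocks by default; request fenced blocks instead.
const turndownService = new TurndownService({ codeBlockStyle: "fenced" });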

// In the fetchPage function, replace 'content: article.textContent' with:
const content = turndownService.turndown(article.content);

// ...

This change will ensure that the HTML is converted to Markdown, preserving the code block formatting.
