garbacciojp/python-sitemap-scraper

python-sitemap-scraper

A robust, simple Python scraper that reads a list of sitemap URLs from an input CSV and writes another CSV listing each sitemap alongside every URL scraped from it. Only the <loc> values of each sitemap are extracted.
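As a rough illustration of what "only the <loc> values" means, here is a minimal sketch using only the standard library. It is not taken from the repository's script: the function name `extract_locs` and the sample sitemap are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def extract_locs(xml_text):
    """Return the text of every <loc> element in a sitemap document.

    Hypothetical helper for illustration; the real script may parse differently.
    """
    root = ET.fromstring(xml_text)
    locs = []
    for element in root.iter():
        # Namespaced tags look like "{namespace}loc", so compare the local name only.
        # Caveat: this also matches <loc> elements from sitemap extensions (e.g. images).
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "loc" and element.text:
            locs.append(element.text.strip())
    return locs


# Made-up sitemap content, used so the sketch runs without network access.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2023-11-15</lastmod></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

print(extract_locs(sample))
```

Everything other than <loc>, such as <lastmod>, is ignored, which matches the behaviour described above.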

Steps To Use:

Note: Python 3 is required to run this script, along with every library and package the script imports. As of 15 Nov 2023, the repository has no requirements.txt file, so the packages must be installed manually.

  1. Download the repository folder to your computer.
  2. Open the CSV file named 'input-urls.csv' and paste in the sitemap URLs that you'd like to scrape.
  3. Run the script. Using an IDE such as VS Code is recommended.
  4. Wait for the script to finish executing.
  5. Check the script folder, where you'll find an output CSV file containing all the scraped data.
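The flow behind the steps above can be sketched roughly as follows. This is an assumption-laden illustration, not the repository's code: the function `scrape_to_csv`, the `fetch_locs` callback, the output column names and the one-URL-per-row input layout with no header are all hypothetical. In-memory files and a stand-in fetcher keep the sketch runnable offline.

```python
import csv
import io


def scrape_to_csv(input_file, output_file, fetch_locs):
    """Read sitemap URLs from input_file and write (sitemap, url) rows to output_file.

    fetch_locs is any callable that maps a sitemap URL to a list of page URLs.
    """
    writer = csv.writer(output_file)
    writer.writerow(["sitemap", "url"])  # hypothetical header names
    rows_written = 0
    for row in csv.reader(input_file):
        if not row or not row[0].strip():
            continue  # skip blank lines
        sitemap_url = row[0].strip()
        for url in fetch_locs(sitemap_url):
            writer.writerow([sitemap_url, url])
            rows_written += 1
    return rows_written


def fake_fetch(sitemap_url):
    """Stand-in for the real download-and-parse step; returns made-up URLs."""
    return [
        sitemap_url.replace("sitemap.xml", "page-1"),
        sitemap_url.replace("sitemap.xml", "page-2"),
    ]


input_csv = io.StringIO(
    "https://example.com/sitemap.xml\nhttps://example.org/sitemap.xml\n"
)
output_csv = io.StringIO()
count = scrape_to_csv(input_csv, output_csv, fake_fetch)
print(count)
print(output_csv.getvalue())
```

In real use, the two `io.StringIO` objects would be replaced by files opened with `open(..., newline="")`, and `fake_fetch` by a function that downloads each sitemap and extracts its <loc> values.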

Simple as that! If you run into any issues, contact me at https://www.linkedin.com/in/jpgarbaccio/.
