Fork of scrapy-selenium that works with recent versions of Selenium 4.
All settings except SELENIUM_DRIVER_NAME are now optional. The middleware
should still work with existing Scrapy projects integrating the upstream
package.
Tested with Python 3.12. You will need a Selenium 4-compatible browser.
With Poetry:

```shell
poetry add git+https://github.com/jirpok/scrapy-selenium4.git
```

(Edge and Safari are also supported.)
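Since all settings except SELENIUM_DRIVER_NAME are optional, a minimal configuration can be quite small; a sketch of a bare-bones `settings.py`, assuming Firefox:

```python
# Minimal settings.py sketch: only SELENIUM_DRIVER_NAME is required;
# the downloader middleware still has to be enabled.
SELENIUM_DRIVER_NAME = "firefox"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_selenium4.SeleniumMiddleware": 800,
}
```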
```python
SELENIUM_DRIVER_NAME = "firefox"
SELENIUM_DRIVER_ARGUMENTS = ["-headless"]

SELENIUM_BROWSER_FF_PREFS = {
    "javascript.enabled": False,      # disable JavaScript
    "permissions.default.image": 2,   # block all images from loading
}
```

SOCKS proxy:

```python
SELENIUM_BROWSER_FF_PREFS = {
    "network.proxy.type": 1,
    "network.proxy.socks_remote_dns": True,
    "network.proxy.socks": "<HOST>",
    "network.proxy.socks_port": <PORT>,
}
```

HTTP/HTTPS proxy:

```python
SELENIUM_BROWSER_FF_PREFS = {
    "network.proxy.type": 1,
    "network.proxy.http": "<HOST>",
    "network.proxy.http_port": <PORT>,
    "network.proxy.ssl": "<HOST>",
    "network.proxy.ssl_port": <PORT>,
}
```

Chrome:

```python
SELENIUM_DRIVER_NAME = "chrome"
SELENIUM_DRIVER_ARGUMENTS = ["--headless=new"]
```

To use a specific browser executable:

```python
SELENIUM_BROWSER_EXECUTABLE_PATH = "path/to/browser/executable"
```

Selenium requires a driver (GeckoDriver, ChromeDriver, …) to interface with the chosen browser. Recent versions of Selenium 4 ship with Selenium Manager, which handles these dependencies automatically. To point to a specific driver executable instead:

```python
SELENIUM_DRIVER_EXECUTABLE_PATH = "path/to/driver/executable"
```

To connect to a remote Selenium server:

```python
SELENIUM_COMMAND_EXECUTOR = "http://localhost:4444/wd/hub"
```

(Do not set SELENIUM_DRIVER_EXECUTABLE_PATH along with SELENIUM_COMMAND_EXECUTOR.)
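Since SELENIUM_BROWSER_FF_PREFS is a plain Python dict, the content-blocking and proxy preferences above can be combined with dict unpacking; a sketch with made-up host and port values:

```python
# Content-blocking preferences (from the example above).
BLOCKING_PREFS = {
    "javascript.enabled": False,
    "permissions.default.image": 2,
}

# SOCKS proxy preferences; host and port are illustrative values.
SOCKS_PREFS = {
    "network.proxy.type": 1,
    "network.proxy.socks_remote_dns": True,
    "network.proxy.socks": "127.0.0.1",
    "network.proxy.socks_port": 9050,
}

# Later entries win on key conflicts, so the proxy settings would
# take precedence over any overlapping blocking keys.
SELENIUM_BROWSER_FF_PREFS = {**BLOCKING_PREFS, **SOCKS_PREFS}
```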
Enable the middleware:

```python
DOWNLOADER_MIDDLEWARES = {
    "scrapy_selenium4.SeleniumMiddleware": 800,
}
```

Use scrapy_selenium4.SeleniumRequest instead of the scrapy built-in Request:
```python
from scrapy_selenium4 import SeleniumRequest

yield SeleniumRequest(url=url, callback=self.parse)
```

The request will have an additional meta key `driver` containing the Selenium driver.
```python
def parse(self, response):
    print(response.request.meta["driver"].title)
```

Explicit wait before returning the response to the spider:
```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

yield SeleniumRequest(
    url=url,
    callback=self.parse,
    wait_time=10,
    wait_until=EC.element_to_be_clickable((By.ID, "some_id")),
)
```

Take a screenshot of the page and add the binary data of the captured .png to the response meta:
```python
yield SeleniumRequest(
    url=url,
    callback=self.parse,
    screenshot=True,
)

def parse(self, response):
    with open("image.png", "wb") as image_file:
        image_file.write(response.meta["screenshot"])
```

Execute custom JavaScript code:
```python
yield SeleniumRequest(
    url=url,
    callback=self.parse,
    script="window.scrollTo(0, document.body.scrollHeight);",
)
```

Run tests:

```shell
pytest
```