Fork of scrapy-selenium that works with recent versions of Selenium 4.
All settings except SELENIUM_DRIVER_NAME are now optional. The middleware
should still work with existing Scrapy projects integrating the upstream
package.
Tested with Python 3.12. You will need a Selenium 4-compatible browser.
With Poetry:

```shell
poetry add git+https://github.com/jirpok/scrapy-selenium4.git
```

(Edge and Safari are also supported.)
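Since all settings except SELENIUM_DRIVER_NAME are optional, a minimal configuration can be quite small; a sketch of a bare-bones `settings.py`, assuming Firefox:

```python
# Minimal settings.py sketch: only SELENIUM_DRIVER_NAME is required;
# the downloader middleware still has to be enabled.
SELENIUM_DRIVER_NAME = "firefox"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_selenium4.SeleniumMiddleware": 800,
}
```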
```python
SELENIUM_DRIVER_NAME = "firefox"
SELENIUM_DRIVER_ARGUMENTS = ["-headless"]

SELENIUM_BROWSER_FF_PREFS = {
    "javascript.enabled": False,      # disable JavaScript
    "permissions.default.image": 2,   # block all images from loading
}
```

SOCKS proxy:

```python
SELENIUM_BROWSER_FF_PREFS = {
    "network.proxy.type": 1,
    "network.proxy.socks_remote_dns": True,
    "network.proxy.socks": "<HOST>",
    "network.proxy.socks_port": <PORT>,
}
```

HTTP/HTTPS proxy:

```python
SELENIUM_BROWSER_FF_PREFS = {
    "network.proxy.type": 1,
    "network.proxy.http": "<HOST>",
    "network.proxy.http_port": <PORT>,
    "network.proxy.ssl": "<HOST>",
    "network.proxy.ssl_port": <PORT>,
}
```

Chrome:

```python
SELENIUM_DRIVER_NAME = "chrome"
SELENIUM_DRIVER_ARGUMENTS = ["--headless=new"]
```

To use a specific browser executable:

```python
SELENIUM_BROWSER_EXECUTABLE_PATH = "path/to/browser/executable"
```

Selenium requires a driver (GeckoDriver, ChromeDriver, …) to interface with the chosen browser. Recent versions of Selenium 4 ship with Selenium Manager, which handles these dependencies automatically. To point to a specific driver executable instead:

```python
SELENIUM_DRIVER_EXECUTABLE_PATH = "path/to/driver/executable"
```

To connect to a remote Selenium server:

```python
SELENIUM_COMMAND_EXECUTOR = "http://localhost:4444/wd/hub"
```

(Do not set SELENIUM_DRIVER_EXECUTABLE_PATH along with SELENIUM_COMMAND_EXECUTOR.)
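Since SELENIUM_BROWSER_FF_PREFS is a plain Python dict, the content-blocking and proxy preferences above can be combined with dict unpacking; a sketch with made-up host and port values:

```python
# Content-blocking preferences (from the example above).
BLOCKING_PREFS = {
    "javascript.enabled": False,
    "permissions.default.image": 2,
}

# SOCKS proxy preferences; host and port are illustrative values.
SOCKS_PREFS = {
    "network.proxy.type": 1,
    "network.proxy.socks_remote_dns": True,
    "network.proxy.socks": "127.0.0.1",
    "network.proxy.socks_port": 9050,
}

# Later entries win on key conflicts, so the proxy settings would
# take precedence over any overlapping blocking keys.
SELENIUM_BROWSER_FF_PREFS = {**BLOCKING_PREFS, **SOCKS_PREFS}
```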
Enable the middleware:

```python
DOWNLOADER_MIDDLEWARES = {
    "scrapy_selenium4.SeleniumMiddleware": 800,
}
```

Use scrapy_selenium4.SeleniumRequest instead of the scrapy built-in Request:
```python
from scrapy_selenium4 import SeleniumRequest

yield SeleniumRequest(url=url, callback=self.parse)
```

The request will have an additional meta key `driver` containing the Selenium driver.
```python
def parse(self, response):
    print(response.request.meta["driver"].title)
```

Explicit wait before returning the response to the spider:
```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

yield SeleniumRequest(
    url=url,
    callback=self.parse,
    wait_time=10,
    wait_until=EC.element_to_be_clickable((By.ID, "some_id")),
)
```

Take a screenshot of the page and add the binary data of the captured .png to the response meta:
```python
yield SeleniumRequest(
    url=url,
    callback=self.parse,
    screenshot=True,
)

def parse(self, response):
    with open("image.png", "wb") as image_file:
        image_file.write(response.meta["screenshot"])
```

Execute custom JavaScript code:
```python
yield SeleniumRequest(
    url=url,
    callback=self.parse,
    script="window.scrollTo(0, document.body.scrollHeight);",
)
```

Run tests:

```shell
pytest
```