Selenium Firefox Python - Geckodriver – this combination of tools is one of the most potent out there for serious web scraping and automation jobs. This stack is definitely not the most lightweight way to do web scraping or automation, but what it lacks in lightness it makes up for in robustness. In my years of web scraping experience, the Selenium Firefox Python - Geckodriver stack has come the closest to mimicking the behavior of a real human user in scripts meant to behave as one.
Let’s look at a simple example to learn how to build a Selenium Firefox Python - Geckodriver based scraping script.
The first two things you need to scrape the web using this approach are a browser and a driver for that browser – in our case Firefox and geckodriver, respectively. Firefox, as you probably know, is a powerful modern web browser with support for the latest web specifications you might encounter on a modern web app. Geckodriver is a driver for Firefox, meaning it exposes an API through which other programs can launch and control Firefox. It acts as a bridge between our program and the Firefox browser, allowing two-way communication.
The next thing you need is Selenium, a browser automation tool providing a rich set of APIs for commonly encountered browser automation tasks, such as clicking an element, finding an element with an XPath or CSS selector query, waiting for an element to load before interacting with it, and so on.
The next thing we need, of course, is a programming language to encode what we actually want the browser to do, and Python is an excellent choice for this kind of task for multiple reasons. One major benefit is that we can plug directly into the interpreter – the standard Python CLI, IPython or a Jupyter notebook – and run individual commands to confirm the program behaves exactly the way we want. Without this we would need to fire up the program many times, wasting a lot of time and hitting rate limits, just to fine-tune its behavior based on how the website responds to our browser operations.
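For instance, a candidate link filter like the hypothetical `filter_image_links` below can be pasted into IPython and refined against hrefs copied out of a live browser session (the sample URLs here are made up):

```python
from typing import List

IMG_LIMIT = 10

def filter_image_links(hrefs: List[str]) -> List[str]:
    """Keep only direct .jpg links hosted on i.redd.it, capped at IMG_LIMIT."""
    return [h for h in hrefs
            if h.startswith("https://i.redd.it/") and h.endswith(".jpg")][:IMG_LIMIT]

# Quick interactive check against hand-copied sample hrefs:
sample = [
    "https://i.redd.it/abc123.jpg",
    "https://i.redd.it/def456.gif",       # a GIF – should be filtered out
    "https://old.reddit.com/r/wallpapers/",
]
print(filter_image_links(sample))  # → ['https://i.redd.it/abc123.jpg']
```

Iterating on small pure functions like this in a REPL, against data pulled from a live driver, is exactly the workflow the paragraph above describes.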
For this example we will build a simple scraper which goes to r/wallpapers on Reddit, searches for “cat” using the search widget in the sidebar and downloads the first ten images from the search results.
First we need to ensure Firefox is installed on the machine we wish to run this script on, and make a note of which version it is – you can find it in the “About Firefox” section. Then head over to the geckodriver releases page and pick the latest version that supports the installed Firefox version. Download the geckodriver archive, extract it and place the geckodriver binary somewhere convenient; in this walkthrough let’s assume it’s kept in the project directory.
With these in place we can move to the familiar world of Python. We will need to install the Python Selenium library and a library called requests (for downloading static image files) by running
pip install selenium requests
Now we can get to actually writing some code to see our Selenium Firefox Python - Geckodriver stack in action.
```python
from typing import List
from os import path, mkdir
import time

import requests
from selenium import webdriver
from selenium.webdriver.remote.webelement import WebElement

IMG_LIMIT = 10


def get_driver() -> webdriver.Firefox:
    return webdriver.Firefox(executable_path=path.abspath(path.join(
        path.dirname(__file__), "geckodriver")))


def visit_r_wallpapers(d: webdriver.Firefox) -> None:
    d.get("https://old.reddit.com/r/wallpapers/")


def get_search_widget(d: webdriver.Firefox) -> WebElement:
    return d.find_element_by_xpath("//input[@placeholder='search']")


def submit_search(d: webdriver.Firefox, search_element: WebElement) -> None:
    search_element.send_keys("cat")
    time.sleep(0.6)
    d.find_element_by_xpath("//input[@name='restrict_sr']").click()
    time.sleep(0.6)
    search_element.submit()


def extract_image_links(d: webdriver.Firefox) -> List[str]:
    time.sleep(2)
    links: List[WebElement] = d.find_elements_by_css_selector("a")
    results: List[str] = []
    for l in links:
        href: str = l.get_attribute("href") or ""
        if href.startswith("https://i.redd.it/") and href.endswith(".jpg"):
            results.append(href)
    return results[:IMG_LIMIT]


def download_images(links: List[str], dir_path="img") -> None:
    if not path.exists(dir_path):
        mkdir(dir_path)
    for i, l in enumerate(links):
        print("downloading link:", l)
        res = requests.get(l)
        with open(path.join(dir_path, "cat_pic_%d.jpg" % i), "wb") as f:
            f.write(res.content)


def main() -> None:
    with get_driver() as d:
        print("visiting r/wallpapers")
        visit_r_wallpapers(d)
        ele = get_search_widget(d)
        submit_search(d, ele)
        print("visiting search results page")
        links = extract_image_links(d)
        print("Links:\n" + "\n> ".join(links))
        download_images(links)


if __name__ == "__main__":
    main()
```
The code is quite simple. We create a new Selenium webdriver instance for Firefox using the geckodriver binary. Then we visit the r/wallpapers website, locate the search widget, type our search query into it, check the “limit my search to /r/wallpapers” checkbox and submit the search form. On the results page, the images we are looking for are hosted on the i.redd.it domain, so we simply collect the links on the page which start with that domain and end in “.jpg”, since we only want images and not GIFs or other files. After extracting ten relevant image links we use the requests library to download the images at those URLs. And we could do all this in just about fifty lines of code thanks to our awesome Selenium Firefox Python - Geckodriver stack!
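If you want the download step to be a bit more defensive, a sketch like the one below skips failed responses instead of writing error pages to disk. Note that `download_image` is a hypothetical helper, not part of the script above:

```python
import requests

def download_image(url: str, dest: str, timeout: float = 30) -> bool:
    """Fetch one image over HTTP; return False on a non-200 response."""
    res = requests.get(url, timeout=timeout)
    if res.status_code != 200:
        print("skipping", url, "- status:", res.status_code)
        return False
    with open(dest, "wb") as f:
        f.write(res.content)
    return True
```

Passing a timeout and checking the status code costs two lines and saves you from mysterious zero-byte or HTML-filled “.jpg” files later.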
It makes sense to employ the Selenium Firefox Python - Geckodriver stack in certain scenarios:
1) The target website isn’t known to be friendly to bots and uses advanced tech, but your use case demands scraping it – think Twitter, Instagram and so on. These websites don’t want bots operating on their platforms and have sophisticated ways to detect and weed out bot activity.
2) You only need one or a handful of scraping/automation tasks running at once, so RAM usage and speed aren’t major concerns. Instead of starting out with an HTTP client and HTML parser and discovering midway that they fall short, you can start off with the slower and heavier but more reliable browser based technique, which also handles a lot of things by itself, and call it a day.
You shouldn’t use Selenium Firefox Python - Geckodriver when you are looking to run a large number of scraping/automation jobs simultaneously, because running a real Firefox or Chrome browser instance hogs a lot of memory and is much slower than a programmatic HTTP client. Likewise, if you have memory constraints you will find it difficult to run multiple, or even a single, instance of the Selenium Firefox Python - Geckodriver stack. For example, on the basic $5 VPS provided by Digital Ocean you get 1 GB of RAM, and if you have some other processes running it can be difficult to squeeze in even a single Firefox instance.
Another issue to keep in mind with this browser based approach is display access. Selenium Firefox Python - Geckodriver can run in headless mode, which on recent Firefox versions needs no display at all; but if you need the browser to actually render to a screen on a display-less server, you will have to set up a virtual display such as Xvfb. It’s not too much work, but it can be frustrating in the beginning if you are new to this area and unfamiliar with the Linux ecosystem.
A similar issue arises when you want to distribute your Selenium Firefox Python - Geckodriver based program to other people as a desktop application. A program using a programmatic HTTP client is self-contained, whereas the browser based approach requires properly provisioning the Firefox browser, geckodriver and so on on your users’ desktops. This adds an additional layer of complexity.