Web Scraping With Selenium
Locating data on a website is one of the main use cases for Selenium, either for a test suite (making sure that a specific element is present/absent on the page) or to extract data and save it for further analysis (web scraping). There are many methods available in the Selenium API to select elements on the page.
In part 1 of this blog post series we covered the most common approach to web scraping and its issues. We also built a small example of how to start web scraping with C#, Selenium and QueryStorm in Excel. Now we’ll expand on the example from part 1 and create a more useful web scraper.
As with every “web scraping with Selenium” tutorial, you have to download the appropriate driver to interface with the browser you’re going to use for scraping. Since we’re using Chrome, download the driver that matches your version of Chrome. The next step is to add it to the system path.
Navigating to and scraping paginated items
It’s time to kick the web scraping up a notch. For instance, let’s scrape the names and prices of the top items on the home page, navigate to the laptops category and scrape all of the laptops as well.
Preparing the table
We should delete the current table rows as they are irrelevant. We can use ResultsTable.Clear() to delete all current table entries instead of deleting them by hand.
In addition, we should also edit the ResultsTable by renaming the Results column to Product Name and by adding a new column named Price.
Getting the price
To get the price along with the title of the top items, our script needs only minor modifications. First, we find all of the items by their CSS selector (div.thumbnail). Then we find the name and the price of each item by finding their respective elements (name CSS selector – h4 > a, price CSS selector – div.caption > h4.pull-right.price) inside of the parent item element.
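The lookup described above can be sketched roughly as follows. This is a minimal sketch, not the tutorial's exact script: the class and method names (ItemScraper, ScrapeItems) are hypothetical, and the results are collected into a plain list rather than written to the ResultsTable, whose row API is not shown here.

```csharp
using System.Collections.Generic;
using OpenQA.Selenium;

public static class ItemScraper
{
    // Hypothetical helper: returns (name, price) pairs for every product card on the current page.
    public static List<(string Name, string Price)> ScrapeItems(IWebDriver driver)
    {
        var results = new List<(string Name, string Price)>();

        // Each product card is matched by the div.thumbnail selector.
        foreach (IWebElement item in driver.FindElements(By.CssSelector("div.thumbnail")))
        {
            // Search inside the card rather than the whole page so each name stays paired with its price.
            string name = item.FindElement(By.CssSelector("h4 > a")).Text;
            string price = item.FindElement(By.CssSelector("div.caption > h4.pull-right.price")).Text;
            results.Add((name, price));
        }

        return results;
    }
}
```

Calling FindElement on the item, not on the driver, is the important part: it scopes the search to that one card.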
Preventing CSS selector issues
Just a heads up – without any changes to the default driver initializer, the browser will open as a small window. That means that there’s a chance that the page will have a mobile/tablet layout so your CSS selectors (that are copied from the DevTools of a maximized browser window) will be invalid. To prevent this issue, we start the driver with some options where we specify that the browser should start maximized.
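One way to start Chrome maximized is to pass the --start-maximized switch through ChromeOptions. The sketch below assumes the Selenium.WebDriver NuGet package and a matching ChromeDriver on the path; the DriverFactory and CreateMaximizedChrome names are made up for illustration.

```csharp
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

public static class DriverFactory
{
    // Hypothetical helper: creates a Chrome driver whose window opens maximized,
    // so the page renders with its desktop layout.
    public static IWebDriver CreateMaximizedChrome()
    {
        var options = new ChromeOptions();
        options.AddArgument("--start-maximized");
        return new ChromeDriver(options);
    }
}
```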
Page navigation
The next step is navigating – first to the Computers page and then to the Laptops page.
We can navigate by clicking on the Computers menu item and waiting for the Computers page to load. Subsequently, we should click on the Laptops menu item and wait for the Laptops page to load.
Note: We could just navigate to the URL https://webscraper.io/test-sites/e-commerce/ajax/computers/laptops instead of clicking on the side menu items, but I feel it’s better to demonstrate how to click and wait for the page to load as it is a pretty common problem in web scraping.
Clicking the button
Clicking is easy – we find the element and call its Click method.
Waiting for a page to load
Waiting itself is not an issue as we can use the WebDriverWait class, which lets us wait up to a certain amount of time until an arbitrary condition is met. However, defining this condition can prove to be a problem.
In our case, the condition is that the new page has loaded. To check that, we need to determine exactly when the old page has unloaded and the new page has loaded. The most robust way to achieve this is to wait for an element on the old page to go “stale” (no longer attached to the DOM) and also for an element on the new page to be displayed.
As a helping hand, we could install and use the DotNetSeleniumExtras.WaitHelpers NuGet package to check if a new page has loaded. However, the project is no longer maintained and the relevant code isn’t complicated, so we can write the code for the conditions ourselves.
The WebDriverWait’s Until method has a parameter of type Func<IWebDriver, TResult>. Therefore, we have to create a NewPageLoaded method that returns the specified delegate to the Until method. The code can look something like this…
To complete the NewPageLoaded method, we need to replace the dots with concrete staleness and visibility checks. These checks can also return a delegate so they can be used both as regular methods and by the Until method. So, let’s define the methods to check for staleness and visibility.
Element staleness
An element is stale if any of these conditions are met:
- The element is disabled
- The element is missing (null)
- Accessing the element throws a StaleElementReferenceException
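The three conditions above can be turned into a delegate along these lines. This is a sketch: the class and method names (StalenessCheck, ElementIsStale) are assumptions, and the logic simply mirrors the list, including treating a disabled element as stale.

```csharp
using System;
using OpenQA.Selenium;

public static class StalenessCheck
{
    // Hypothetical helper: returns a condition that is true once the given element is considered stale.
    public static Func<IWebDriver, bool> ElementIsStale(IWebElement element)
    {
        return driver =>
        {
            try
            {
                // Reading any property forces Selenium to check that the element is still attached.
                return element == null || !element.Enabled;
            }
            catch (StaleElementReferenceException)
            {
                // The element is no longer attached to the DOM.
                return true;
            }
        };
    }
}
```

Because the method returns a Func<IWebDriver, bool>, the result can be passed to Until or invoked directly with a driver instance.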
Element visibility
Also, an element is visible if:
- The driver can find the element
- The element is displayed
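A matching visibility check might look like the sketch below. The VisibilityCheck and ElementIsVisible names are assumptions. A missing element, or one that goes stale between lookup and inspection, is reported as not visible instead of throwing.

```csharp
using System;
using OpenQA.Selenium;

public static class VisibilityCheck
{
    // Hypothetical helper: returns a condition that is true when the located element exists and is displayed.
    public static Func<IWebDriver, bool> ElementIsVisible(By locator)
    {
        return driver =>
        {
            try
            {
                return driver.FindElement(locator).Displayed;
            }
            catch (NoSuchElementException)
            {
                // The driver cannot find the element yet.
                return false;
            }
            catch (StaleElementReferenceException)
            {
                // The element disappeared between lookup and inspection.
                return false;
            }
        };
    }
}
```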
Page loaded condition and navigation
Finally, the NewPageLoaded method looks like this:
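One possible shape for such a method is sketched here. The signature is an assumption: to keep the example self-contained, the staleness and visibility checks are passed in as delegates and combined, rather than being created inside the method.

```csharp
using System;
using OpenQA.Selenium;

public static class PageConditions
{
    // Hypothetical composition: the new page counts as loaded only when the old page's
    // element has gone stale AND the new page's element is visible.
    public static Func<IWebDriver, bool> NewPageLoaded(
        Func<IWebDriver, bool> oldElementIsStale,
        Func<IWebDriver, bool> newElementIsVisible)
    {
        return driver => oldElementIsStale(driver) && newElementIsVisible(driver);
    }
}
```

The returned delegate has the Func<IWebDriver, bool> type that WebDriverWait's Until method accepts, so it can be handed to Until as is.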
And once we decide what elements on the pages we’re going to use to identify if a new page has loaded, we’re ready to navigate to the Computers page and the Laptops page. I chose the following:
Now we can finally perform navigation:
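A click-and-wait step could be sketched as follows. The method name and all three locators are hypothetical parameters: the menu item to click, an element that exists only on the old page, and an element that exists only on the new page. The concrete selectors are left to the caller.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;

public static class Navigation
{
    // Hypothetical helper: clicks a menu item, then blocks until the old page is gone
    // and the new page is visible (or the timeout expires).
    public static void ClickAndWaitForNewPage(
        IWebDriver driver, By menuItem, By oldPageElement, By newPageElement, TimeSpan timeout)
    {
        // Take the reference BEFORE clicking, so it can go stale afterwards.
        IWebElement oldElement = driver.FindElement(oldPageElement);
        driver.FindElement(menuItem).Click();

        var wait = new WebDriverWait(driver, timeout);
        wait.Until(d =>
        {
            try
            {
                // Still attached and enabled: the old page has not unloaded yet.
                if (oldElement.Enabled) return false;
            }
            catch (StaleElementReferenceException)
            {
                // The old page is gone; fall through to the visibility check.
            }

            // WebDriverWait ignores NotFoundException by default, so a missing
            // element here simply causes another polling attempt.
            return d.FindElement(newPageElement).Displayed;
        });
    }
}
```

Calling this once for the Computers menu item and once for the Laptops menu item covers both navigation steps.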
Scraping paginated laptop items
Since we’ve navigated to the Laptops page, we can now scrape the laptop items. We need a couple of things to do that.
First of all, we need a reference to the “Next” button element – by clicking it we can load the items, page by page (button CSS selector – button.btn.btn-default.next).
The second thing to have in mind is that we have to wait until the next page of items is loaded. Luckily, we’ve made a method to check the staleness of elements, so we can infer that a new page of items has loaded when the laptop items from the current page go stale.
And lastly, we should check whether the “Next” button is enabled or disabled, so we know if we’ve reached the last items page or not.
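Putting the three points together, a pagination loop might look like this sketch. The class and method names are hypothetical, results go into a plain list instead of the ResultsTable, and the code assumes the “Next” button reports itself as disabled on the last page.

```csharp
using System;
using System.Collections.Generic;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;

public static class PaginatedScraper
{
    // Hypothetical helper: scrapes every page of items by clicking "Next" until it is disabled.
    public static List<(string Name, string Price)> ScrapeAllPages(IWebDriver driver, TimeSpan timeout)
    {
        var results = new List<(string Name, string Price)>();
        var wait = new WebDriverWait(driver, timeout);

        while (true)
        {
            var items = driver.FindElements(By.CssSelector("div.thumbnail"));
            foreach (IWebElement item in items)
            {
                string name = item.FindElement(By.CssSelector("h4 > a")).Text;
                string price = item.FindElement(By.CssSelector("div.caption > h4.pull-right.price")).Text;
                results.Add((name, price));
            }

            IWebElement next = driver.FindElement(By.CssSelector("button.btn.btn-default.next"));

            // A disabled "Next" button means this was the last page; an empty page leaves nothing to wait on.
            if (!next.Enabled || items.Count == 0) break;

            IWebElement firstItem = items[0];
            next.Click();

            // The next page has loaded once an item from the current page goes stale.
            wait.Until(d =>
            {
                try { return !firstItem.Enabled; }
                catch (StaleElementReferenceException) { return true; }
            });
        }

        return results;
    }
}
```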
We are almost done with our scraper! Let’s run the script with F5 and wait a couple of seconds. As a result of running the script, we can see 120 scraped products in our table. However, we should do one more thing – refactor the code a bit.
Finishing steps
First of all, the code for saving home items and laptop items is the same. Therefore, we can extract a method for saving items.
We can also extract a method for page navigation. We just need to pass different CSS selectors when calling the method.
And lastly, to keep the main part of the script nice and readable, we can do two things. We can create a new method just for scraping laptop items. Also, we can create a new method for direct navigation to the Laptops page.
We’re done!
Finally, here’s the full code for the tutorial:
In the next and final part of this web scraping tutorial, we’ll turn our script into a shareable workbook-application that any user with the QueryStorm Runtime can execute.