SCRAPING WITH GEB
By Sergio del Amo Caballero, OCI Grails Team
OCTOBER 2017
INTRODUCTION
Geb is a browser automation solution. It brings together the power of WebDriver, the elegance of jQuery content selection, the robustness of Page Object modeling, and the expressiveness of the Groovy language.
Geb is often used as a functional/web/acceptance testing solution via integration with testing frameworks such as Spock, JUnit, and TestNG. In this article, we show how Geb can also be used for screen scraping.
The following steps describe how to create a webbot that scrapes the OCI Training website.
Project configuration
Create a Groovy Library with Gradle:
- $ mkdir scraper
- $ cd scraper
- $ gradle init --type groovy-library
Replace the content of build.gradle with:
- apply plugin: 'groovy'
-
- repositories {
- jcenter()
- }
-
- version projectVersion // expects projectVersion to be defined elsewhere, e.g. in gradle.properties
- group "com.objectcomputing"
-
- dependencies {
- compile 'org.codehaus.groovy:groovy-all:2.4.12'
- testCompile 'org.spockframework:spock-core:1.0-groovy-2.4'
- compile "org.gebish:geb-core:1.1.1"
- compile "org.seleniumhq.selenium:selenium-firefox-driver:2.53.1"
- compile "org.seleniumhq.selenium:selenium-support:2.53.1"
- compile "net.sourceforge.htmlunit:htmlunit:2.18"
- compile "org.seleniumhq.selenium:selenium-htmlunit-driver:2.47.1"
- }
-
- test {
- systemProperties System.properties
- }
Geb builds on the WebDriver browser automation library, which means that Geb can work with any browser that WebDriver can. The previous build.gradle includes dependencies for Geb, Selenium, and the two browsers (HTMLUnit and Firefox) used in this webbot.
Geb attempts to load a ConfigSlurper script named GebConfig.groovy from the default package (in other words, in the root of a directory that is on the classpath).
- import org.openqa.selenium.firefox.FirefoxDriver
- import org.openqa.selenium.htmlunit.HtmlUnitDriver
-
- environments {
-
- htmlUnit {
- driver = { new HtmlUnitDriver() }
- }
-
- firefox {
- driver = { new FirefoxDriver() }
- }
- }
The Groovy ConfigSlurper mechanism has built-in support for environment-sensitive configuration, and Geb leverages this by using the geb.env system property to determine the environment to use. An effective use of this mechanism is to configure different drivers based on the designated Geb "environment."
To use the Firefox driver, supply the system property geb.env with the value firefox. To use the HTMLUnit driver (a GUI-less browser), supply the system property geb.env with the value htmlUnit.
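The environment switch can be tried out without Geb or a browser. The following is a minimal sketch using plain ConfigSlurper. The driverName key and its string values are illustrative stand-ins for the driver closures in GebConfig.groovy, and passing the environment name to the constructor stands in for what Geb does with the geb.env system property.

```groovy
// Illustrative config: 'driverName' is a stand-in key, not a Geb setting.
String script = '''
driverName = 'none'

environments {
    htmlUnit {
        driverName = 'htmlunit'
    }
    firefox {
        driverName = 'firefox'
    }
}
'''

// The constructor argument selects which block under 'environments' applies.
ConfigObject firefoxConfig = new ConfigSlurper('firefox').parse(script)
ConfigObject htmlUnitConfig = new ConfigSlurper('htmlUnit').parse(script)
ConfigObject defaultConfig = new ConfigSlurper().parse(script)

assert firefoxConfig.driverName == 'firefox'
assert htmlUnitConfig.driverName == 'htmlunit'
// With no environment selected, the top-level value is left untouched.
assert defaultConfig.driverName == 'none'
```

Only the block whose name matches the selected environment is evaluated, which is why a single GebConfig.groovy can hold the settings for both drivers.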
Understanding the page
If you visit https://objectcomputing.com/services/training/schedule/ and inspect the Track selector, you can see the ID of each track.
If you visit https://objectcomputing.com/services/training/schedule?track=11, the training offerings are filtered by Grails Training. The track parameter takes the track ID as a value.
Clicking a training offering opens a modal window. That modal window includes important information, such as whether the course is sold out.
Clicking a training offering is equivalent to visiting https://objectcomputing.com//training/schedule/#schedule-offering-48. The number at the end of that URL is the offering ID.
Map Information to a Model
Create the following classes to model OCI's training offerings:
- package com.objectcomputing.model
-
- import groovy.transform.CompileStatic
- import groovy.transform.ToString
-
- @ToString
- @CompileStatic
- class Offering {
- Long id
- String course
- String dates
- String time
- String instructors
- String hours
- Track track
-
- String getEnrollmentLink() {
- "https://objectcomputing.com/index.php/training/register/offering/$id"
- }
- boolean soldOut
- }
- package com.objectcomputing.model
-
- import groovy.transform.CompileStatic
- import groovy.transform.ToString
-
- @ToString
- @CompileStatic
- class Track {
- Long id
- String name
- }
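As a quick check of how the model behaves, the following sketch builds an offering with Groovy's named-argument constructor and reads the derived enrollment link. The classes are trimmed copies of the ones above so that the snippet runs on its own, and the sample values are made up for illustration.

```groovy
import groovy.transform.CompileStatic
import groovy.transform.ToString

// Trimmed copies of the model classes above so the snippet is self-contained.
@ToString
@CompileStatic
class Track {
    Long id
    String name
}

@ToString
@CompileStatic
class Offering {
    Long id
    String course
    Track track
    boolean soldOut

    String getEnrollmentLink() {
        "https://objectcomputing.com/index.php/training/register/offering/$id"
    }
}

// Sample values are invented for illustration.
Track track = new Track(id: 11L, name: 'Grails Training')
Offering offering = new Offering(id: 48L, course: 'Sample Course', track: track)

// getEnrollmentLink() can be read like a property.
assert offering.enrollmentLink ==
        'https://objectcomputing.com/index.php/training/register/offering/48'
// A boolean property defaults to false until the scraper sets it.
assert !offering.soldOut
println offering
```

Because the enrollment link is computed from the id, the scraper only needs to capture the id to know where a visitor can register.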
Encapsulate different areas with Pages
The Page Object Pattern gives us a common-sense way to model content in a reusable and maintainable way. From the WebDriver wiki page on the Page Object Pattern:
"Within your web app’s UI there are areas that your tests interact with. A Page Object simply models these as objects within the test code. This reduces the amount of duplicated code and means that if the UI changes, the fix need only be applied in one place."
Creating a Geb Page is as simple as creating a class that extends geb.Page.
Geb features a DSL for defining page content in a templated fashion, which allows very concise yet flexible page definitions. Pages define a static closure property called content that describes the page content.
- package com.objectcomputing.geb
-
- import com.objectcomputing.model.Track
- import geb.Page
-
- class TrackSelectorPage extends Page {
-
- static url = '/training/schedule/'
-
- static content = {
- trackSelector { $("select", name: 'track') }
- trackOptions { trackSelector.$('option') }
- }
-
- Set<Track> tracks() {
- trackOptions.collect {
- Long id
- try {
- id = it.getAttribute('value') as Long
- } catch(NumberFormatException e) {
- // Options without a numeric value leave id as null
- }
- new Track(id: id, name: it.text())
- } as Set<Track>
- }
-
- }
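The try/catch in tracks() leaves the id as null when an option value is not numeric. An alternative, sketched below, is to test the value before converting it. The helper name parseTrackId is hypothetical and not part of the original code.

```groovy
// Hypothetical helper: returns the numeric id, or null when the value is
// missing or not a number.
Long parseTrackId(String value) {
    value?.isLong() ? value.toLong() : null
}

assert parseTrackId('11') == 11L
assert parseTrackId('') == null
assert parseTrackId('all') == null
assert parseTrackId(null) == null
```

Both approaches produce the same result for the cases shown. The check-first version avoids an empty catch block at the cost of one extra helper.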
The next page traverses the training offerings table.
- package com.objectcomputing.geb
-
- import com.objectcomputing.model.Offering
- import geb.Page
-
- class TrainingSchedulePage extends Page {
- public static final String INTERNAL_LINK = '#schedule-offering-'
- public static final String WINDOW_LOCATION = "window.location = '$INTERNAL_LINK"
-
- static url = '/training/schedule'
-
- @Override
- String convertToPath(Object[] args) {
- if ( args.size() > 0 ) {
- return "?track=${args[0]}"
- }
- }
-
- static content = {
- offeringRows { $('table.offerings tbody tr') }
- }
-
- Set<Offering> offerings() {
- Set<Offering> offerings = []
- for ( int i = 0; i < offeringRows.size(); i++ ) {
- def offeringRow = offeringRows.getAt(i)
- def offeringId = offeringRow.getAttribute('onclick')
- .replaceAll(WINDOW_LOCATION, '')
- .replaceAll('\';', '') as Long
-
- Offering offering = new Offering()
- offering.with {
- id = offeringId
- course = offeringRow.$('td', 0).text()
- dates = offeringRow.$('td', 1).text()
- time = offeringRow.$('td', 2).text()
- instructors = offeringRow.$('td', 3).text()
- hours = offeringRow.$('td', 4).text()
- }
- offerings << offering
- }
- offerings
- }
- }
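The offering id is recovered from each row's onclick attribute by stripping the text around the number. The same extraction can be sketched with a single regular expression capture. The helper name offeringIdFrom is hypothetical, and the shape of the onclick value is inferred from the constants in TrainingSchedulePage rather than taken from the live page.

```groovy
import java.util.regex.Matcher

// Hypothetical helper: pulls the numeric id out of an onclick value shaped
// like "window.location = '#schedule-offering-48';" and returns null when
// no offering id is present.
Long offeringIdFrom(String onclick) {
    Matcher matcher = onclick =~ /#schedule-offering-(\d+)/
    matcher.find() ? Long.valueOf(matcher.group(1)) : null
}

assert offeringIdFrom("window.location = '#schedule-offering-48';") == 48L
assert offeringIdFrom('') == null
```

Capturing the digits directly tolerates small changes in the surrounding JavaScript, whereas the replaceAll chain expects the attribute to match the constants exactly.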
The next page checks if the text "Sold Out" appears in the modal window content.
- package com.objectcomputing.geb
-
- import geb.Page
-
- class TrainingScheduleModalPage extends Page {
-
- static url = '/training/schedule'
-
- static content = {
- modalWindow(required: false) { $('.ws-modal-dialog', 0) }
- }
-
- @Override
- String convertToPath(Object[] args) {
- if ( args.size() > 1 ) {
- return "?track=${args[0]}#schedule-offering-${args[1]}"
- }
- }
-
- boolean isSoldOut() {
- if ( !modalWindow.empty ) {
- return modalWindow.text().contains('Sold Out')
- }
- false
- }
- }
The previous two pages override the method convertToPath, which allows building dynamic URLs.
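To see which addresses these overrides lead to, the sketch below re-creates both as standalone functions and resolves the result against the base URL. It assumes that the final address is the base URL combined with the page url and the generated path. java.net.URI is used here only to illustrate that combination and is not a claim about Geb's internals. The function names are hypothetical.

```groovy
// Standalone versions of the two convertToPath overrides (hypothetical names).
String schedulePath(Object... args) {
    args.size() > 0 ? "?track=${args[0]}" : ''
}

String modalPath(Object... args) {
    args.size() > 1 ? "?track=${args[0]}#schedule-offering-${args[1]}" : ''
}

// Assumption: base URL + page url + generated path gives the visited address.
URI base = new URI('https://objectcomputing.com')
String pageUrl = '/training/schedule'

assert base.resolve(pageUrl + schedulePath(11)).toString() ==
        'https://objectcomputing.com/training/schedule?track=11'
assert base.resolve(pageUrl + modalPath(11, 48)).toString() ==
        'https://objectcomputing.com/training/schedule?track=11#schedule-offering-48'
// Without arguments, only the page url itself is visited.
assert base.resolve(pageUrl + schedulePath()).toString() ==
        'https://objectcomputing.com/training/schedule'
```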
You could have created a single page which encapsulated all the functionality described in the previous three pages. But having smaller pages makes the code easier to follow and maintain.
Orchestrate navigation
The next class organizes the navigation while capturing the training information.
- package com.objectcomputing.geb
-
- import com.objectcomputing.model.Offering
- import com.objectcomputing.model.Track
- import geb.Browser
-
- class TrainingScheduleBrowser {
-
- static Set<Offering> offerings() {
- Browser browser = new Browser()
- browser.baseUrl = 'https://objectcomputing.com'
-
- TrackSelectorPage selectorPage = browser.to TrackSelectorPage
- Set<Track> tracks = selectorPage.tracks().findAll { it.name != 'All Tracks' }
-
- Set<Offering> offerings = []
- for (Track track : tracks ) {
- Set<Offering> trackOfferings = fetchTrackOfferings(browser, track)
- trackOfferings.each { Offering offering ->
- populateOfferingSoldout(browser, track, offering)
- }
- offerings += trackOfferings
- }
- offerings
- }
-
- static Set<Offering> fetchTrackOfferings(Browser browser, Track track) {
- TrainingSchedulePage page = browser.to TrainingSchedulePage, track.id
- Set<Offering> offerings = page.offerings()
- offerings.each { it.track = track }
- offerings
- }
-
- static void populateOfferingSoldout(Browser browser, Track track, Offering offering) {
- TrainingScheduleModalPage page = browser.to TrainingScheduleModalPage, track.id, offering.id
- offering.soldOut = page.isSoldOut()
- }
- }
If you execute Set<Offering> offerings = TrainingScheduleBrowser.offerings() and supply -Dgeb.env=firefox, you will see a browser window open and navigate through the OCI training offerings, as displayed in the next video.
Next Steps
The next logical step would be to output the scraped information. We have developed a Grails Plugin which encapsulates this library. It executes the scraper each hour to get the latest training information. The scraped information is cached and exposed as a JSON API.
Each Grails Guide displays up-to-date training information thanks to this scraper, which has transformed a static HTML page into a JSON API.
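As a rough idea of what outputting the scraped information could look like, the following sketch serializes offering data to JSON with Groovy's built-in JsonOutput. It is not the Grails plugin's implementation. The sample data is made up, and plain maps stand in for the Offering and Track classes to keep the snippet self-contained.

```groovy
import groovy.json.JsonOutput
import groovy.json.JsonSlurper

// Made-up sample data shaped like the Offering model.
List<Map> offerings = [
        [id: 48L, course: 'Sample Course', soldOut: false,
         track: [id: 11L, name: 'Grails Training']]
]

String json = JsonOutput.prettyPrint(JsonOutput.toJson(offerings))
println json

// Round-trip the JSON to confirm the structure survives serialization.
def parsed = new JsonSlurper().parseText(json)
assert parsed[0].course == 'Sample Course'
assert parsed[0].track.id == 11
assert parsed[0].soldOut == false
```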
Happy scraping with Geb!
Software Engineering Tech Trends (SETT) is a regular publication featuring emerging trends in software engineering.