Scraping with Geb

By Sergio del Amo Caballero, OCI Grails Team

OCTOBER 2017

INTRODUCTION

Geb is a browser automation solution. It brings together the power of WebDriver, the elegance of jQuery content selection, the robustness of Page Object modeling, and the expressiveness of the Groovy language.

Geb is often used as a functional/web/acceptance testing solution via integration with testing frameworks such as Spock, JUnit, and TestNG. In this article, we show how Geb can also be used for screen scraping.

The following steps describe how to create a webbot that scrapes the OCI training website.

Project configuration

Create a Groovy Library with Gradle:

    $ mkdir scraper
    $ cd scraper
    $ gradle init --type groovy-library

Replace the content of build.gradle with:

build.gradle

    apply plugin: 'groovy'

    repositories {
        jcenter()
    }

    // projectVersion is not defined in this file; it is assumed to come from gradle.properties
    version projectVersion
    group "com.objectcomputing"

    dependencies {
        compile 'org.codehaus.groovy:groovy-all:2.4.12'
        testCompile 'org.spockframework:spock-core:1.0-groovy-2.4'
        compile "org.gebish:geb-core:1.1.1"
        compile "org.seleniumhq.selenium:selenium-firefox-driver:2.53.1"
        compile "org.seleniumhq.selenium:selenium-support:2.53.1"
        compile "net.sourceforge.htmlunit:htmlunit:2.18"
        compile "org.seleniumhq.selenium:selenium-htmlunit-driver:2.47.1"
    }

    test {
        // Forward JVM system properties (such as geb.env) to the test JVM
        systemProperties System.properties
    }

Geb builds on the WebDriver browser automation library, which means that Geb can work with any browser that WebDriver can. The previous build.gradle includes dependencies for Geb, Selenium, and the two browsers (HtmlUnit and Firefox) used in this webbot.

Geb attempts to load a ConfigSlurper script named GebConfig.groovy from the default package (in other words, in the root of a directory that is on the classpath).

src/main/groovy/GebConfig.groovy

    import org.openqa.selenium.firefox.FirefoxDriver
    import org.openqa.selenium.htmlunit.HtmlUnitDriver

    environments {

        htmlUnit {
            driver = { new HtmlUnitDriver() }
        }

        firefox {
            driver = { new FirefoxDriver() }
        }
    }

The Groovy ConfigSlurper mechanism has built-in support for environment-sensitive configuration, and Geb leverages this by using the geb.env system property to determine the environment to use. An effective use of this mechanism is to configure different drivers based on the designated Geb “environment.”

To use the Firefox driver, set the system property geb.env to firefox. To use the HtmlUnit driver (HtmlUnit is a GUI-less browser), set geb.env to htmlUnit.
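To see how an environments block is resolved, the following minimal sketch parses a similar script with ConfigSlurper directly. Plain strings stand in for the driver closures (an assumption made so the snippet runs without Geb or Selenium), and the value passed to the ConfigSlurper constructor plays the role that geb.env plays for Geb.

```groovy
import groovy.util.ConfigSlurper

// Sketch only: plain strings replace the real driver closures,
// so no Geb or Selenium classes are needed on the classpath.
String script = '''
environments {
    htmlUnit {
        driver = 'htmlunit'
    }
    firefox {
        driver = 'firefox'
    }
}
'''

// The constructor argument selects which environment block is applied.
def htmlUnitConfig = new ConfigSlurper('htmlUnit').parse(script)
def firefoxConfig = new ConfigSlurper('firefox').parse(script)

assert htmlUnitConfig.driver == 'htmlunit'
assert firefoxConfig.driver == 'firefox'
```

Only the block whose name matches the selected environment contributes its settings, which is why a single GebConfig.groovy can describe several drivers.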

Understanding the page

If you visit https://objectcomputing.com/services/training/schedule/ and inspect the Track selector, you can see the ID of each track.

If you visit https://objectcomputing.com/services/training/schedule?track=11, the training offerings are filtered by Grails Training. The track parameter takes the track ID as a value.

Clicking a training offering opens a modal window. That modal window includes important information, such as whether the course is sold out.

Clicking a training offering is equivalent to visiting https://objectcomputing.com//training/schedule/#schedule-offering-48. The number at the end of that URL is the offering ID.

Map Information to a Model

Create the following classes to model OCI’s training offerings:

src/main/groovy/com/objectcomputing/model/Offering.groovy

    package com.objectcomputing.model

    import groovy.transform.CompileStatic
    import groovy.transform.ToString

    @ToString
    @CompileStatic
    class Offering {
        Long id
        String course
        String dates
        String time
        String instructors
        String hours
        Track track
        boolean soldOut

        String getEnrollmentLink() {
            "https://objectcomputing.com/index.php/training/register/offering/$id"
        }
    }

src/main/groovy/com/objectcomputing/model/Track.groovy

    package com.objectcomputing.model

    import groovy.transform.CompileStatic
    import groovy.transform.ToString

    @ToString
    @CompileStatic
    class Track {
        Long id
        String name
    }

Encapsulate different areas with Pages

The Page Object Pattern gives us a common-sense way to model content in a reusable and maintainable way. From the WebDriver wiki page on the Page Object Pattern:

"Within your web app’s UI there are areas that your tests interact with. A Page Object simply models these as objects within the test code. This reduces the amount of duplicated code and means that if the UI changes, the fix need only be applied in one place."

Creating a Geb Page is as simple as creating a class that extends geb.Page.

Geb features a DSL for defining page content in a templated fashion, which allows very concise yet flexible page definitions. Pages define a static closure property called content that describes the page content.

src/main/groovy/com/objectcomputing/geb/TrackSelectorPage.groovy

    package com.objectcomputing.geb

    import com.objectcomputing.model.Track
    import geb.Page

    class TrackSelectorPage extends Page {

        static url = '/training/schedule/'

        static content = {
            trackSelector { $("select", name: 'track') }
            trackOptions { trackSelector.$('option') }
        }

        Set<Track> tracks() {
            trackOptions.collect {
                Long id
                try {
                    id = it.getAttribute('value') as Long
                } catch(NumberFormatException e) {
                    // Options without a numeric value keep a null id
                }
                new Track(id: id, name: it.text())
            } as Set<Track>
        }

    }
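The conversion inside tracks() can be tried in isolation. The sketch below uses a hypothetical helper, parseTrackId, that mirrors the try/catch above. The sample option values are made up for illustration and are not taken from the real page.

```groovy
// Hypothetical helper mirroring the try/catch in tracks().
Long parseTrackId(String value) {
    try {
        return value as Long
    } catch (NumberFormatException ignored) {
        // Non-numeric option values yield no id.
        return null
    }
}

assert parseTrackId('11') == 11L
assert parseTrackId('') == null
assert parseTrackId('all') == null
```

A track with a null id is still returned by tracks(), so the caller decides which tracks to skip.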

The next page traverses the training offerings table.

src/main/groovy/com/objectcomputing/geb/TrainingSchedulePage.groovy

    package com.objectcomputing.geb

    import com.objectcomputing.model.Offering
    import geb.Page

    class TrainingSchedulePage extends Page {
        public static final String INTERNAL_LINK = '#schedule-offering-'
        public static final String WINDOW_LOCATION = "window.location = '$INTERNAL_LINK"

        static url = '/training/schedule'

        // Appends the track id as a query parameter, e.g. ?track=11
        @Override
        String convertToPath(Object[] args) {
            if ( args.size() > 0 ) {
                return "?track=${args[0]}"
            }
        }

        static content = {
            offeringRows { $('table.offerings tbody tr') }
        }

        Set<Offering> offerings() {
            Set<Offering> offerings = []
            for ( int i = 0; i < offeringRows.size(); i++ ) {
                def offeringRow = offeringRows.getAt(i)
                // Strip everything around the numeric id in the row's onclick attribute
                def offeringId = offeringRow.getAttribute('onclick')
                        .replaceAll(WINDOW_LOCATION, '')
                        .replaceAll('\';', '') as Long

                Offering offering = new Offering()
                offering.with {
                    id = offeringId
                    course = offeringRow.$('td', 0).text()
                    dates = offeringRow.$('td', 1).text()
                    time = offeringRow.$('td', 2).text()
                    instructors = offeringRow.$('td', 3).text()
                    hours = offeringRow.$('td', 4).text()
                }
                offerings << offering
            }
            offerings
        }
    }
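The id extraction in offerings() relies on the exact text of each row's onclick attribute. As a hedged alternative (a swapped-in technique, not the approach used above), the sketch below pulls the id out with a regular expression, which tolerates small changes in the surrounding text. The helper name extractOfferingId and the sample attribute value are assumptions for illustration.

```groovy
// Hypothetical helper: finds the number that follows '#schedule-offering-'.
Long extractOfferingId(String onclick) {
    def matcher = onclick =~ /#schedule-offering-(\d+)/
    matcher.find() ? (matcher.group(1) as Long) : null
}

// Assumed shape of the attribute, based on the constants in TrainingSchedulePage.
String sample = "window.location = '#schedule-offering-48';"

assert extractOfferingId(sample) == 48L
assert extractOfferingId('no offering here') == null
```

Returning null for rows without a matching attribute lets the caller skip them instead of failing on the Long conversion.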

The next page checks if the text "Sold Out" appears in the modal window content.

src/main/groovy/com/objectcomputing/geb/TrainingScheduleModalPage.groovy

    package com.objectcomputing.geb

    import geb.Page

    class TrainingScheduleModalPage extends Page {

        static url = '/training/schedule'

        static content = {
            // required: false, because the modal window may be absent from the page
            modalWindow(required: false) { $('.ws-modal-dialog', 0) }
        }

        // Expects two arguments: the track id and the offering id
        @Override
        String convertToPath(Object[] args) {
            if ( args.size() > 1 ) {
                return "?track=${args[0]}#schedule-offering-${args[1]}"
            }
        }

        boolean isSoldOut() {
            if ( !modalWindow.empty ) {
                return modalWindow.text().contains('Sold Out')
            }
            false
        }
    }

The previous two pages override the convertToPath method, which makes it possible to build dynamic URLs from the arguments passed to browser.to.
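As a rough illustration of the resulting URLs, the sketch below reproduces the string building in plain Groovy, without Geb. It assumes that the requested URL is the base URL followed by the page's url and the value returned by convertToPath. The helper names are hypothetical, and the sample IDs (track 11, offering 48) are borrowed from the URLs shown earlier.

```groovy
// Same logic as TrainingSchedulePage.convertToPath (hypothetical standalone copy).
String schedulePath(Object... args) {
    args.size() > 0 ? "?track=${args[0]}" : ''
}

// Same logic as TrainingScheduleModalPage.convertToPath (hypothetical standalone copy).
String modalPath(Object... args) {
    args.size() > 1 ? "?track=${args[0]}#schedule-offering-${args[1]}" : ''
}

String baseUrl = 'https://objectcomputing.com'
String pageUrl = '/training/schedule'

// Assumption: base URL + page url + converted path gives the final address.
String scheduleUrl = baseUrl + pageUrl + schedulePath(11)
String modalUrl = baseUrl + pageUrl + modalPath(11, 48)

assert scheduleUrl == 'https://objectcomputing.com/training/schedule?track=11'
assert modalUrl == 'https://objectcomputing.com/training/schedule?track=11#schedule-offering-48'
```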

You could create a single page that encapsulates all the functionality of the previous three pages, but smaller pages make the code easier to follow and maintain.

Orchestrate navigation

The next class organizes the navigation while capturing the training information.

src/main/groovy/com/objectcomputing/geb/TrainingScheduleBrowser.groovy

    package com.objectcomputing.geb

    import com.objectcomputing.model.Offering
    import com.objectcomputing.model.Track
    import geb.Browser

    class TrainingScheduleBrowser {

        static Set<Offering> offerings() {
            Browser browser = new Browser()
            browser.baseUrl = 'https://objectcomputing.com'

            // Collect every track except the catch-all "All Tracks" option
            TrackSelectorPage selectorPage = browser.to TrackSelectorPage
            Set<Track> tracks = selectorPage.tracks().findAll { it.name != 'All Tracks' }

            Set<Offering> offerings = []
            for (Track track : tracks ) {
                Set<Offering> trackOfferings = fetchTrackOfferings(browser, track)
                // Visit each offering's modal window to find out whether it is sold out
                trackOfferings.each { Offering offering ->
                    populateOfferingSoldout(browser, track, offering)
                }
                offerings += trackOfferings
            }
            offerings
        }

        static Set<Offering> fetchTrackOfferings(Browser browser, Track track) {
            TrainingSchedulePage page = browser.to TrainingSchedulePage, track.id
            Set<Offering> offerings = page.offerings()
            offerings.each { it.track = track }
            offerings
        }

        static void populateOfferingSoldout(Browser browser, Track track, Offering offering) {
            TrainingScheduleModalPage page = browser.to TrainingScheduleModalPage, track.id, offering.id
            offering.soldOut = page.isSoldOut()
        }
    }

If you execute Set<Offering> offerings = TrainingScheduleBrowser.offerings() and supply the system property -Dgeb.env=firefox, a Firefox window opens and navigates through the OCI training offerings.

Next Steps

The next logical step would be to output the scraped information. We have developed a Grails Plugin which encapsulates this library. It executes the scraper every hour to get the latest training information. The scraped information is cached and exposed as a JSON API.

Each Grails Guide displays up-to-date training information thanks to this scraper, which has transformed a static HTML page into a JSON API.
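For a sense of what a JSON output might look like, here is a minimal sketch that uses Groovy's built-in JsonOutput. It is not the plugin's implementation. The offering below is made-up sample data shaped like the Offering model, and plain maps are used so that the snippet runs on its own.

```groovy
import groovy.json.JsonOutput
import groovy.json.JsonSlurper

// Made-up sample data; real values would come from TrainingScheduleBrowser.offerings().
List<Map> offerings = [
    [id: 48, course: 'Sample Course', track: [id: 11, name: 'Grails Training'], soldOut: false]
]

// Serialize to JSON and pretty-print it for readability.
String json = JsonOutput.prettyPrint(JsonOutput.toJson(offerings))
println json

// Parse it back to confirm the structure survives the round trip.
def parsed = new JsonSlurper().parseText(json)
assert parsed[0].id == 48
assert parsed[0].track.name == 'Grails Training'
assert parsed[0].soldOut == false
```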

Happy scraping with Geb!

Software Engineering Tech Trends (SETT) is a regular publication featuring emerging trends in software engineering.