
Ruby Web Scraping

In today’s digital age, vast amounts of information are freely available on the web, but much of it is unstructured and hard to analyze. Web scraping solves this by automatically collecting and structuring data from web pages. Doing this well requires an understanding of HTTP, web server behavior, ethics, and a solid grasp of Ruby and its ecosystem (RubyGems, Bundler, and scraping libraries).

Understanding Web Scraping

Web scraping is the automated extraction of structured data from web pages. A script fetches a page (usually HTML), parses it, and pulls out specific pieces of information such as tables, lists, or profiles that are publicly accessible. Ruby is commonly used for this because it offers concise syntax and powerful parsing libraries.

Ethics Surrounding Web Scraping

Scraping must respect both legal and ethical boundaries:

  - Check the site’s robots.txt and terms of service before scraping.
  - Throttle your requests so you don’t overload the server.
  - Avoid collecting personal or copyrighted data without permission.
  - Prefer an official API when one is available.

Uses in Data Gathering

Ruby-based web scraping supports:

  - price and product monitoring,
  - market and competitor research,
  - content aggregation (news, job listings, reviews), and
  - building datasets for research and analysis.

Concept of HTTP Requests & Server Responses

When you visit a site, your browser sends an HTTP request; the server replies with a response, usually containing HTML. A scraper does the same thing programmatically.
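A minimal sketch using Ruby’s built-in Net::HTTP might look like this (example.com stands in for a real target):

require 'net/http'
require 'uri'

uri = URI('https://example.com')          # the page we want to fetch
response = Net::HTTP.get_response(uri)    # send a GET request

puts response.code                        # status code, e.g. "200"
puts response['content-type']             # one of the response headers
puts response.body[0, 200]                # the start of the returned HTML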

Knowing how to adjust requests (e.g., headers, query params) is crucial for getting consistent, correct responses.

Ruby and Web Scraping

Ruby’s succinct syntax and strong library support make it well-suited for scraping. Libraries like open-uri, Net::HTTP, and especially Nokogiri handle requests and HTML parsing. With CSS or XPath selectors, you can precisely target the content you need.

Introduction to Ruby

Understanding Ruby Language and Syntax

Ruby is a dynamic, object-oriented language designed for readability and productivity. Key building blocks include:

  - variables and constants,
  - methods and blocks,
  - classes and modules, and
  - collections such as arrays and hashes.

Ruby favors clear naming (snake_case for variables and methods) and minimal boilerplate, which keeps scraping scripts easy to read.
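A tiny illustration of that style (the names here are arbitrary):

# A snake_case method that uses a block to tidy up scraped strings
def clean_titles(raw_titles)
  raw_titles.map { |title| title.strip.downcase }
end

puts clean_titles(["  First Post ", "Second POST"])
# => first post
#    second post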

RubyGems and Bundler: Tools to Manage Ruby Packages

RubyGems is Ruby’s package manager. You install libraries (gems) with:

gem install gem_name

Bundler manages dependencies per project. You list gems in a Gemfile:

# Gemfile
source "https://rubygems.org"

gem "nokogiri"
gem "httparty"

Then install and lock versions:

bundle install

This ensures every environment runs the same gem versions (recorded in Gemfile.lock), keeping your scraper stable over time.
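When you run your scraper, go through Bundler so the locked versions are the ones loaded (the script name is just an example):

bundle exec ruby scraper.rb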

Web Scraping with Ruby

For scraping, Ruby commonly uses gems like nokogiri for parsing and httparty or open-uri for HTTP. Typical steps are:

  1. Fetch the page.
  2. Parse the HTML.
  3. Use CSS or XPath selectors to extract data.
  4. Store or process the results.

Always add delays between requests and follow site rules.
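Putting those steps together, a rough sketch might look like this (the URLs and the .title selector are placeholders; the one-second pause keeps the crawl polite):

require 'nokogiri'
require 'open-uri'

pages = ['https://example.com/page/1', 'https://example.com/page/2']

pages.each do |url|
  doc = Nokogiri::HTML(URI.open(url))       # 1. fetch and 2. parse the page
  titles = doc.css('.title').map(&:text)    # 3. extract data with a CSS selector
  puts titles                               # 4. process or store the results
  sleep 1                                   # stay polite between requests
end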

Ruby Libraries for Web Scraping

Ruby provides several libraries to simplify scraping tasks.

Nokogiri

Nokogiri is the standard Ruby library for parsing HTML and XML. It offers fast parsing and flexible ways to search a document (CSS, XPath, DOM traversal).

Install:

gem install nokogiri

Basic usage:

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(URI.open("http://www.example.com"))
doc.css('h1').each do |heading|
  puts heading.text
end

This fetches the page, parses it, and prints the text of all <h1> elements.
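The same document can also be searched with XPath; for instance, collecting every link target (the selector is just an illustration):

doc.xpath('//a/@href').each do |href|
  puts href.value
end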

Watir

Watir (“Web Application Testing in Ruby”) drives a real browser. It’s useful when:

  - content is rendered by JavaScript after the initial page load,
  - you need to click buttons, fill in forms, or log in, or
  - you want the page exactly as a real user sees it.

Install:

gem install watir

Example:

require 'watir'

browser = Watir::Browser.new
browser.goto 'http://example.com'
puts browser.title
browser.close

Watir is slower than pure HTTP+Nokogiri scraping but can handle complex, dynamic pages.

Other Libraries

Depending on your needs, other gems are worth a look:

  - Mechanize: automates form submission and link following while managing cookies for you.
  - open-uri: part of the standard library; the simplest way to fetch a page.
  - HTTParty: a friendly HTTP client for plain requests and APIs.
  - Selenium WebDriver: full browser automation, similar to Watir.
  - Kimurai: a scraping framework that combines Nokogiri with headless browsers.

Each has trade-offs in speed, simplicity, and ability to handle dynamic content.
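As one example, a minimal Mechanize session (the URL is a placeholder) fetches a page and lists its links:

require 'mechanize'

agent = Mechanize.new
page  = agent.get('http://example.com')

puts page.title
page.links.each { |link| puts link.href }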

Creating a Simple Web Scraper

Installing Required Libraries

For a basic static-page scraper, you’ll typically use nokogiri and httparty:

gem install nokogiri
gem install httparty

Setting Up Your Scraper

Require the libraries in your Ruby script:

require 'nokogiri'
require 'httparty'

Making an HTTP Request and Parsing the Response

Use HTTParty to fetch the page and Nokogiri to parse it:

def scraper
  url = 'https://[your-target-url]'
  response = HTTParty.get(url)
  Nokogiri::HTML(response.body)
end
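Before parsing, it is worth checking that the request actually succeeded; HTTParty exposes the status on the response:

response = HTTParty.get('https://[your-target-url]')
raise "Request failed with status #{response.code}" unless response.success?

doc = Nokogiri::HTML(response.body)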

Finding Data and Extracting It

Use CSS selectors to grab the elements you care about:

def scraper
  url = 'https://[your-target-url]'
  doc = Nokogiri::HTML(HTTParty.get(url).body)
  paragraphs = doc.css('p').map(&:text)
  paragraphs
end

Storing the Scraped Data

You can return arrays or hashes, save to CSV or a database, or send results to an API. For example, collecting blog posts into hashes:

def scraper
  url = 'https://[your-target-url]'
  doc = Nokogiri::HTML(HTTParty.get(url).body)

  doc.css('.blog-post').map do |post|
    {
      title:   post.css('h1').text.strip,
      content: post.css('p').map(&:text).join(" ").strip
    }
  end
end
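To persist those hashes, one option is Ruby’s standard csv library; a minimal sketch (the file name is arbitrary):

require 'csv'

posts = scraper   # the array of { title:, content: } hashes built above

CSV.open('posts.csv', 'w') do |csv|
  csv << ['title', 'content']                                  # header row
  posts.each { |post| csv << [post[:title], post[:content]] }  # one row per post
end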

Always confirm that scraping is permitted and that your use of the data complies with laws and the site’s terms.

Handling Complex Web Scraping Tasks

Some sites are more complex, requiring sessions, cookies, or JavaScript execution.

Managing Cookies and Session Information

When a site uses sessions (e.g., login), you may need to send cookies with each request. HTTParty supports this:

cookies = HTTParty::CookieHash.new
cookies.add_cookies("user_session_id" => "1234")

response = HTTParty.get('http://example.com', cookies: cookies)

Reusing the same cookie hash across requests keeps your session active.

Scraping Dynamic Websites

Many modern sites load data via JavaScript after the initial HTML is delivered. Plain HTTP + Nokogiri only sees the original HTML, not the rendered page.

In those cases, use a browser automation tool like Watir or Selenium WebDriver:

require 'watir'

browser = Watir::Browser.new
browser.goto 'http://example.com'
puts browser.text    # text after JavaScript runs
browser.close

These tools are slower but simulate a real user, executing JavaScript and handling interactions.
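A common follow-up pattern is to let the browser finish rendering and then hand the generated HTML to Nokogiri for extraction (the .result selector is a placeholder):

require 'watir'
require 'nokogiri'

browser = Watir::Browser.new
browser.goto 'http://example.com'

doc = Nokogiri::HTML(browser.html)    # HTML after JavaScript has run
puts doc.css('.result').map(&:text)

browser.close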

Using APIs as Alternative Data Sources

If a site offers an API, it’s often better than scraping:

  - the data arrives already structured (usually JSON), so there’s no HTML to parse,
  - the interface is documented and more stable than a page’s markup, and
  - access is explicitly permitted, with clear authentication and rate limits.

For example, using Twitter’s API via the twitter gem:

require 'twitter'

client = Twitter::REST::Client.new do |config|
  config.consumer_key        = "YOUR_CONSUMER_KEY"
  config.consumer_secret     = "YOUR_CONSUMER_SECRET"
  config.access_token        = "YOUR_ACCESS_TOKEN"
  config.access_token_secret = "YOUR_ACCESS_SECRET"
end

tweets = client.user_timeline("twitter_handle", count: 10)
tweets.each { |tweet| puts tweet.text }

APIs usually require authentication and enforce rate limits, so plan accordingly.

In the realm of data acquisition, web scraping is a key technique. With Ruby’s concise syntax and libraries like Nokogiri, HTTParty, and Watir, you can build scrapers from simple HTML parsers to full-featured automation scripts. Combined with careful attention to ethics, legality, and alternatives like official APIs, this skill lets you responsibly tap into the vast data resources of the modern web.
