Web Scraping with Ruby: a 12-Step Guide

Scrape any site with Ruby & Nokogiri

What is web scraping?

Web scraping is the process of programmatically gathering data from the Internet. If you have ever copy-pasted from a site, you’ve technically scraped the web, albeit on a small scale.

Why would I want to web scrape?

Well, you can extract any piece of information you want from the web.

For example, you could scrape job boards for the latest postings in your field, monitor Amazon prices on hand sanitizer to ensure you aren’t getting ripped off too badly, or scrape Craigslist for new apartments in your area under your budget. Your options are limited only by your imagination.

Or in this example… scraping the ADP (Average Draft Position) from a few fantasy football sites to get a better sense of how the First Round may shake out in my upcoming draft.

How do I get started?

APIs make extracting data from sites relatively painless, but not every site offers one publicly. So what happens when a site doesn’t have an API? Enter web scraping, and in Ruby’s case, the Nokogiri gem.

We’ll need the following libraries installed to get going:

Open-URI: to make our HTTP requests (this one ships with Ruby’s standard library, so no install needed)

Nokogiri: to parse our HTML and collect our data

Pry: for debugging & testing purposes

Nokogiri — Ruby’s Secret Weapon

Nokogiri describes itself as an “HTML, XML, SAX and Reader parser […] with the ability to search documents via XPath or CSS selectors”

In layman’s terms… this enables us to select items from our parsed document using their CSS selector. This really is where the magic happens.

Step-by-Step Guide

Step 1: Ensure you have the aforementioned gems installed (Nokogiri, Open-URI & Pry).

Step 2: After creating a new Ruby file, we’ll need to require all three of our libraries.

Step 3: Let’s create a new class within our scraper.rb file. For simplicity’s sake (and a lack of creativity) we’ll call it Scraper.

Step 4: Within our Scraper class, let’s pass the URL we’re looking to scrape (in this case, the CBS Sports ADP page) into open-uri’s “open” method and save the result to a variable.

Step 5: Next, we need to parse our HTML. We accomplish this by passing our html variable into Nokogiri, which returns an object we can interact with.

Step 6: Identify the data we’d like to scrape and find the corresponding CSS selector. The easiest way to do this is to right-click the relevant field and inspect it.

This will open up Chrome’s Developer Tools.

In this example, I see that the field we’d like to scrape is held within an <a> tag whose parent <span> element has a class of .CellPlayerName--long.

Step 7: From here we need to pass our CSS selector into Nokogiri’s .css method.

Step 8: Jumping into our code with a ‘binding.pry’ placed after our players variable, we can call players.first to take a look at the Nokogiri object we’ll need to iterate over.

We see that the field we want, “Christian McCaffrey”, is housed within the element’s children array as a Nokogiri::XML::Text node.

(Sidebar — if you’ve been lucky enough to snag your league’s Number 1 pick, don’t overthink it… this is your guy)

Step 9: Lucky for us, we can call another Nokogiri method (.text) to extract our player name here.

Step 10: Now, we need to iterate over the entirety of our players object, call that same .text method and collect those names. We can leverage Ruby’s map/collect method here.

Let’s validate that our players array is returning the full list of names by jumping into another ‘pry’ session.

Taking it to the Next Level

Now that we’ve successfully created our first scraper, let’s make it a bit more dynamic and user-friendly.

Step 11: What if I wanted to only print out the Top 10 players and not all 222 players that CBS has deemed worthy of a roster spot?

In order to do this we’ll need a way to limit our result by passing a variable into our newly built scraper. Let’s encapsulate all our functionality into a method (cbs_scraper) and pass in a variable (num_of_players) to limit our players array to our liking. (I know, I know… my creativity surprises me too sometimes)

Ruby’s take method works perfectly here.
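For anyone unfamiliar with it, Array#take simply returns the first n elements, which is exactly the limiting behavior we want (player names here are illustrative):

```ruby
players = ['Christian McCaffrey', 'Saquon Barkley', 'Ezekiel Elliott', 'Dalvin Cook']

# Array#take returns the first n elements without mutating the array.
top_two = players.take(2)
# top_two => ["Christian McCaffrey", "Saquon Barkley"]
```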

Step 12: The last step is to call our cbs_scraper method. To do this we’ll need to create a new Scraper instance and call the method on it.

And finally, after running our file we have our Top 10.

From here, we can rinse and repeat for the remaining sites using the same process outlined above.

Hopefully now, strapped with your newfound skills and boundless confidence, your wheels are turning on some web scraping projects of your own.