Earlier this week Alex Balk, a co-founder of The Awl, tweeted:
Of all the alt-texts that are about to disappear this may be my favorite: https://t.co/8HqTCyuXsC
— Alex Balk (@AlexBalk) May 4, 2016
For those who don’t know, The Awl is “the last weblog” on the internet. It was started in 2009 by Balk and Choire Sicha. I started reading it in college; I specifically remember that this review of the movie 2012, titled “Flicked Off: ‘2012’ is Awesome and Haters Can Suck It,” gave me a refreshing example of how much fun you could have writing.
One of the little fun secret things about The Awl is that the writers would often hide text in the alt text of images or links. It didn’t take long for this “secret” to be appreciated, and when I remember to, I place my mouse on images on the site and patiently wait the required number of seconds before the alt text pops up.
What is alt text?
On images, this text is stored in the alt
attribute of the HTML <img>
tag. Here’s how w3schools defines the attribute:
The required alt attribute specifies an alternate text for an image, if the image cannot be displayed.
The alt attribute provides alternative information for an image if a user for some reason cannot view it (because of slow connection, an error in the src attribute, or if the user uses a screen reader).
A Penn State website on accessibility adds that “The term ‘ALT tag’ is a common shorthand term used to refer to the ALT attribute within the IMG tag.” Alt text is important enough that Markdown allows for it.
The w3schools site adds that “To create a tooltip for an image, use the title attribute!” which “specifies extra information about an element.”
I was vaguely aware of the distinction between the <img>
tags’ alt
attribute and the more global title
attribute, but given that both Balk’s mournful tweet and Bakes’ 2010 Tumblr post refer to text stored in the alt
attribute, I proceeded with the assumption that, at least as far as images go, the fun stuff on The Awl was stored there.
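To see the two attributes side by side, here’s a quick sketch (mine, not from The Awl or the w3schools page) that you could paste into a browser’s JavaScript console to print each image’s alt and title values:

// Print the alt and title attributes of every image on the current page.
// The alt attribute is what screen readers announce (and what appears if the
// image fails to load); the title attribute is what browsers show as a tooltip.
var images = document.getElementsByTagName("img");
for (var i = 0; i < images.length; i++) {
  console.log("alt:", images[i].getAttribute("alt"), "| title:", images[i].getAttribute("title"));
}

It also happens to be a faster way to surface the hidden text than hovering and waiting for the tooltip.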
What I tried
When I saw Balk’s tweet I assumed that, due to some change on the backend of the site, the alt text for images would be somehow removed or deleted. As of this writing the alt
attributes are still there, and I don’t know if they’ll be deleted or just if a new CMS won’t let the authors add them going forward (which seems strange given the progressive nature of the attribute…). Either way, my assignment was clear: scrape all of the alt
text and store it in some useful way.
So that night I started looking into ways to pull down the data before it was too late(!). At first I tried parsing the RSS/XML feed that the AwlTags Twitter bot uses (full Github repo).
I’m apparently not great at accessing or parsing XML with Ruby, and I couldn’t figure out how to go back more than about 70 posts, but I had some fun:
"your organic masculinity" pic.twitter.com/DfuHe4BHqR
— Sam Schlinkert (@sts10) May 5, 2016
Then I figured I’d pull down every tweet from The Awl’s Twitter account and extract the post URLs that way, but it turns out you can only go back roughly 3,200 tweets in a given user’s timeline. Cue the big, red “Denied” message on the hacker montage that was my Wednesday night.
What I ended up doing
So finally I confronted the most straightforward, but also dirtiest, solution: scraping the site directly using Nokogiri. This ended up working great; here’s my Github repo. The Awl’s pagination is nice and simple (perhaps a Wordpress standard?): the URL for 3 pages back is simply http://theawl.com/page/3
. With some guessing and checking I found that the blog, as of when I ran the scraper, went back to page 2707.
Basically the code visits each page, pulls the desired elements for each of the posts it finds on that page, and pushes the post_url
, image_src
, and image_alt
to an array.
base_url = "http://www.theawl.com/page/"
all_posts = []
# 2707 is last page as of today
total_number_of_pages_to_scrape = 2705
time_to_sleep_between_page_scrapes = 4
total_number_of_pages_to_scrape.times do |i|
i = i + 1
this_page_url = base_url + i.to_s
this_page = Page.new(this_page_url)
all_posts = all_posts + this_page.posts
puts "Have scraped #{i} pages so far."
sleep time_to_sleep_between_page_scrapes
end
The Post object:
class Post
attr_reader :image_src, :image_alt, :post_url
def initialize(post)
post_image = post.css("div.post__body div p:first img:first")
@image_src = post_image.attr("src")&.value
@image_alt = post_image.attr("alt")&.value
@post_url = post.css('h2 a').attr('href')&.value
end
end
Note that the scraper ignores posts that do not have an image in the first p tag, or that lack an a tag in the h2.
def make_posts
@doc.search("div.reverse-chron__post").each do |post|
if !post.css("div.post__body div p:first img:first").empty? && !post.css('h2 a').empty?
this_post = Post.new(post)
@posts << this_post
end
end
end
The above snippets are slight simplifications of the code in the runner.rb file, if you want to read more.
Storing the scraped text and URLs
I wanted to store the scraped data in a nice, easy, and universal format, so I chose a comma-separated values file (a CSV), which is basically a minimalist spreadsheet (you can open them with Excel). To be more thorough, I made the scraper write two CSV files: one with every post that has an image, and one with only the posts whose images have alt text.
That’s where I was Wednesday night. I set the time_to_sleep_between_page_scrapes
to 2 seconds, started it, dimmed the monitor, and went to sleep a little after midnight.
When I woke up there was an error and my internet was out. In my groggy state I spent a second worried I had somehow been penalized for accessing too many pages too quickly, but now I think what happened was that I forgot to change the setting that tells my MacBook never to go to sleep.
And when it did go to sleep, maybe the open internet request somehow freaked the router out?
Anyway, I unplugged and plugged in my router, and after a shower it was working again. Phew. I set “Computer sleep” to never and started up the scraper again, then left for work. When I got home Thursday evening I had two nice CSVs waiting for me. I gleefully tweeted a link to the data, but nobody seemed to care. That was fine, because next came the fun part.
Front end (ugh)
On the subway ride home from work Thursday night, assuming the scraping had gone well, I started to imagine ways that I would use this data stored in the CSV files. Here’s what I came up with (Github) after an hour or two.
Update: Unfortunately, since I created this site, The Awl has taken down or moved its hosted images, breaking this particular front-end implementation. Bummer!
The site pulls in the CSV data from Github. Each row of the CSV contains an image URL, the image’s alt text, and the URL of the Awl post that the image came from. The JavaScript on the site chooses a random CSV row, then displays the alt text as a large caption in the bottom-left corner of the image, on a yellow background, kind of like a comic book.
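Roughly, that logic looks something like the sketch below (simplified, with a hypothetical CSV path and element IDs rather than the repo’s exact code):

// Simplified sketch: fetch the CSV, pick a random row, and display it.
// The CSV path and element IDs here are hypothetical.
var CSV_URL = "data/awl_alt_text.csv"; // hypothetical path

var request = new XMLHttpRequest();
request.open("GET", CSV_URL);
request.onload = function () {
  // Each line is: image URL, alt text, post URL.
  // (A real CSV parser is needed here, since alt text can contain commas;
  // the naive split is just for illustration.)
  var rows = request.responseText.trim().split("\n").map(function (line) {
    return line.split(",");
  });
  var row = rows[Math.floor(Math.random() * rows.length)];

  document.getElementById("awl-image").src = row[0];
  document.getElementById("caption").textContent = row[1];
  document.getElementById("post-link").href = row[2];
};
request.send();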
I was tired enough to tweet something mildly sincere.
I made something silly because I really love The @Awl https://t.co/sH08wfLFEJ pic.twitter.com/gU3vXrdqaN
— Sam Schlinkert (@sts10) May 6, 2016
I mentioned @Awl hoping to catch Balk monitoring the account, and just before I fell asleep I got this reply:
@sts10 You poor thing.
— The Awl (@Awl) May 6, 2016
“Fuck him,” I thought. Silvia Killingsworth, their new editor from The New Yorker, will like it.
Sure enough, the next morning Silvia tweeted this high praise
Just...wow. @sts10 made an alt-tag site https://t.co/akDqt2VGM1
— Silvia Killingsworth (@silviakillings) May 6, 2016
along with a series of screenshots from the site. Woohoo!
Fun with URL Parameters
Today I added some more JavaScript to the site so that there’s effectively a URL parameter containing the URL of the Awl post the image came from. So as you’re clicking through the images, the URL on my site actually changes. That way, if you find one you like, you can share the URL (something like http://samschlinkert.com/awl-alt-tags/?http://www.theawl.com/2010/05/the-awl-in-your-internet-mailbox
) on social media or email or whatever, and others going to that URL will get the image and alt text that you intended to send them (rather than a random one).
Code-wise there are two parts to this: (1) give the site the ability to read a post URL from the URL’s parameters and display that post’s image, and (2) change the site’s URL whenever a new image is served.
From index.html, here’s the start of part 1:
var baseURL = window.location.toString();
if (baseURL.split("?")[1] !== undefined && baseURL.split("?")[1] !== ""){
var givenURL = baseURL.split("?")[1];
}
And the end of part 2:
// 4. write the post_url into the address bar
var baseURL = window.location.toString().split("?")[0];
history.replaceState({}, document.title, baseURL + "?" + post_url);
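In between those two parts, the page just needs to prefer the post named in the query string when one is present. Here’s a hypothetical sketch of that step (the pickRow helper and the [image URL, alt text, post URL] row shape are assumptions, not the repo’s code):

// Hypothetical sketch: prefer the post named in the query string, if any.
// Assumes each parsed row is [image URL, alt text, post URL].
function pickRow(rows) {
  if (typeof givenURL !== "undefined") {
    for (var j = 0; j < rows.length; j++) {
      if (rows[j][2] === givenURL) {
        return rows[j]; // the specific post someone shared
      }
    }
  }
  // No parameter (or no match): fall back to a random row.
  return rows[Math.floor(Math.random() * rows.length)];
}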
This is a technique I first used on my GIF rank project, and I think it’s pretty sweet. I’ve also written about the idea of storing non-sensitive, user-specific data in URL parameters before.
Epilogue: Headlines with Node.js
Separately, I’d been playing around with Node.js (a JavaScript runtime) this week. So on Friday, for a challenge (yolo), I figured I’d build a new scraper with Node to grab all the headlines from The Awl that contained exactly two words.
Why exactly two words? Because, similar to the alt text thing, The Awl sometimes uses a humorous device of writing headlines that follow a noun + adjective or noun + verb construction (ugh it feels like explaining a joke but OK). Also similar to the alt text thing, others had noticed and chronicled it a bit. A sampling: “Earth Pretty”, “Man Sweaty”, “Accomplishments Transitory”, “Goat Vexed”, etc.
Since I already knew the best way to scrape the data and what HTML to target, this task was more about the coding and learning how to use Node (I’m very new to it). Just getting Node installed was a bit of a trick for me, since I had haphazardly installed io.js on my machine a few months ago and struggled to uninstall it.
For future reference, or for anyone else facing this problem, I first consulted this Stack Overflow answer and ran all of the code therein to get rid of my previous io.js installation. Then I installed NVM (Node Version Manager), which seems to work very much like RVM, and ran nvm install node
. Now node -v
gives me v6.0.0
.
To scrape the HTML I used Node’s built-in http module and its get method. To parse the HTML I used a package called Cheerio. To write to a CSV file, I used a package called ya-csv, thanks to this helpful blog post, which notes, “While there seemed to be good Node packages available [for writing to CSVs] they lacked very good documentation.”
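Pieced together, those three parts work roughly like the sketch below for a single page (the selector, the two-word filter, and the output filename are illustrative assumptions, not the repo’s exact code):

var http = require("http");
var cheerio = require("cheerio");
var csv = require("ya-csv");

// Append matching headlines to a CSV file as pages are scraped (hypothetical filename).
var writer = csv.createCsvFileWriter("two_word_headlines.csv");

function getPage(pageNumber) {
  http.get("http://www.theawl.com/page/" + pageNumber, function (response) {
    var body = "";
    response.on("data", function (chunk) {
      body += chunk; // the HTML arrives in chunks
    });
    response.on("end", function () {
      var $ = cheerio.load(body);
      // Illustrative selector: grab each post's headline text.
      $("h2 a").each(function () {
        var headline = $(this).text().trim();
        if (headline.split(" ").length === 2) {
          writer.writeRecord([headline]);
        }
      });
    });
  });
}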
I’m more comfortable in Ruby than in JavaScript at this point, so some simple things took me a while. The stickiest part was figuring out how to make the scraper wait a second or two between calls to avoid a timeout. I had run into problems with asynchronous code before: the asynchronous capabilities of Node are both a reason I’m interested in it and, apparently, a conceptual headache for me. Anyway, after a good amount of trial and error I got it working with setInterval
. Here’s that bit from app.js:
var i = 1;
var totalPagesToScrape = 2705;
var interval = setInterval(function(pageToScrape){
getPage(i);
console.log("ran the interval for the " + i + " time.");
i = i + 1;
if (i == totalPagesToScrape){
clearInterval(interval);
}
}, 1000, i);
I still don’t know why I never needed to refer to pageToScrape in the anonymous function… maybe because I made i global and just used that? (I think that’s it: the extra argument passed to setInterval is fixed at whatever i was when the interval was created, so it’s the closure over the outer, incrementing i that actually advances the page number.) In fact there’s a good amount of that code block I’d love to go over with someone who knows their stuff, but it worked!
I also don’t love how much code I have in the response.on('end', function(){...}) callback. But that’s the only place where I know I’ve got a new page scraped and ready, so I guess that’s how it goes with asynchronous code.
The front end for the two-word headlines project (Github) is similar to the alt text one; if anything it’s simpler. I decided to allow the user to randomly swap out either the first word or the second word of the headline (or both).
$('#both-button').on("click", function(){
newHeadline(data);
});
$('#first-button').on("click", function(){
newWord(data, 1);
});
$('#second-button').on("click", function(){
newWord(data, 2);
});
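The swap helpers themselves might look roughly like this (a hypothetical sketch: the #headline element and the shape of data, an array of two-word headline strings, are assumptions; the real implementations are in the repo):

// Hypothetical sketch: `data` is an array of two-word headline strings.
function randomHeadline(data) {
  return data[Math.floor(Math.random() * data.length)];
}

function newHeadline(data) {
  $("#headline").text(randomHeadline(data));
}

function newWord(data, position) {
  var words = $("#headline").text().split(" ");
  var donor = randomHeadline(data).split(" ");
  words[position - 1] = donor[position - 1]; // swap just the first or second word
  $("#headline").text(words.join(" "));
}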
Just the Links
Alt text scraper:
Two-word headline scraper: