SquawkBot Part 2: Extracting URLs

In my first post about SquawkBot (public GitHub repo), I went over how the app connects to the Twitter REST API. In this post, I’ll be discussing the second main part of the app: extracting the URLs from the tweets.

In part one we left off in the Timeline model, where we were just about to push our custom Tweet objects into an instance variable called @tweets. Again, we’re not shoveling in the Tweet objects that Twitter gives us directly. We’re pushing in our own Tweet objects. Let’s review this part for a second.

At the end of the make_tweets instance method in the Timeline model, we have this each loop, where timeline is the array of all the Tweet objects we got from Twitter:

timeline.each do |tweet_obj|
  @tweets << Tweet.new(tweet_obj)
end

However, these Tweet objects from Twitter didn’t really suit our needs, so we decided to make our own. To make this process as easy as possible, we had our Tweet objects accept Twitter’s Tweet object on initialization, as you can see in the above each loop.

Making our Own Tweet Object

Here is our Tweet model in its entirety:

class Tweet 
  attr_reader :tweet_id
  attr_reader :text
  attr_reader :user_name
  attr_reader :user_handle
  attr_reader :tweet_url
  attr_reader :created_at

  def initialize(tweet_obj)
    @tweet_id = tweet_obj.id
    @text = tweet_obj.text
    @user_name = tweet_obj.user.name
    @user_handle = tweet_obj.user.handle
    @tweet_url = tweet_obj.url
    @created_at = tweet_obj.created_at
    @expanded_urls = tweet_obj.urls.map { |url| url.attrs[:expanded_url] }    
  end

  def expanded_urls 
    @expanded_urls
  end
  
end

Remember: The idea here is to take everything we’re going to need from Twitter’s Tweet object (and nothing more) and make that information as easy to access as possible.

As we saw in the Timeline model, our Tweet object accepts a Twitter Tweet object, tweet_obj, on initialization. So let’s start by looking at the initialize method.

The first line declares an instance variable called @tweet_id and assigns it the value of tweet_obj.id. Twitter’s Tweet object has an instance method called id that returns the id of that particular tweet. Since we’re going to need that later, we save it as @tweet_id. We also declare an attr_reader for tweet_id so we can read it later.

We follow the same procedure for @text, @user_name, @user_handle, @tweet_url (the URL of the Tweet itself), and @created_at. These are the variables we’ll need to display the Tweet in the view.

The only one that is different is @expanded_urls. This is the most important move in this method. Digging into Twitter’s Tweet object, we eventually found an array of all the URLs contained in the text of that Tweet (some tweets have more than one URL in them). Obviously this is very important to us given the URL-collecting nature of our app.

We choose the expanded URL as opposed to the shortened or display URL because the same link can be shortened in a variety of ways, but it will usually have the same expanded URL. This makes it easy to compare them later.

Finally, since @expanded_urls is an array, I wrote out a reader for it below the initialize method.
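To make the expanded-URL step a little more concrete, here’s a minimal sketch that runs without the Twitter gem. FakeUrlEntity and FakeTweet are hypothetical stand-ins I made up for illustration; they only mimic the urls and attrs calls used in the initialize method above, and the example.com addresses are placeholders.

```ruby
# Hypothetical stand-ins that mimic only the calls used above
# (tweet_obj.urls and url.attrs); they are not the Twitter gem's classes.
FakeUrlEntity = Struct.new(:attrs)
FakeTweet = Struct.new(:urls)

tweet_obj = FakeTweet.new([
  FakeUrlEntity.new({ url: "http://t.co/abc", expanded_url: "http://example.com/one" }),
  FakeUrlEntity.new({ url: "http://t.co/def", expanded_url: "http://example.com/two" })
])

# Same map as in Tweet#initialize: keep only the expanded form of each URL.
expanded_urls = tweet_obj.urls.map { |url| url.attrs[:expanded_url] }

puts expanded_urls.inspect
```

Two different shortened links that point at the same article would produce the same string here, which is what lets us compare them with a plain == later on.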

Making URL Objects

Now that @tweets is an array loaded up with our custom-made Tweet objects, we’re ready to get to the fun part: making the URL objects.

What’s a URL object? A URL object is an object that represents one row of Tweets on the SquawkBot results page. We’re going to comb through every tweet we got from the user’s timeline and every time we see a URL, we’re either going to give it a “plus one” appearance, or, if we haven’t seen that URL before, we’re going to make a new URL object. Thus, one URL object may contain multiple tweets.

So let’s say this link to this article: http://www.theawl.com/2014/04/in-defense-of-explaining-things gets tweeted by 5 people who I follow. There will be a URL object with an address of http://www.theawl.com/2014/04/in-defense-of-explaining-things and its appearances will be 5. The URL object will also have an array of the Tweet objects that contain that URL.

Here is the Url model:

class Url
  attr_accessor :address, :appearances

  def initialize 
    @tweet_objs = []
    @appearances = 1
  end 

  def tweet_objs 
    @tweet_objs
  end 

  def add_tweet_obj(tweet_obj)
    @tweet_objs << tweet_obj
  end 
end 

I call the Tweet objects @tweet_objs just so we know what they are. I also defined a method add_tweet_obj that adds a Tweet to a URL. The other thing you may notice is that on initialization I set @appearances = 1, which makes sense: a URL object only gets created once we’ve seen its URL for the first time.
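Here’s a quick sketch of how a Url object behaves when used by hand. The class is repeated so the snippet runs on its own, and plain strings stand in for Tweet objects purely for illustration.

```ruby
# The Url model from above, repeated so this snippet runs on its own.
class Url
  attr_accessor :address, :appearances

  def initialize
    @tweet_objs = []
    @appearances = 1
  end

  def tweet_objs
    @tweet_objs
  end

  def add_tweet_obj(tweet_obj)
    @tweet_objs << tweet_obj
  end
end

url = Url.new
url.address = "http://www.theawl.com/2014/04/in-defense-of-explaining-things"
url.add_tweet_obj("first tweet")   # plain strings stand in for Tweet objects

# A second tweet mentions the same link.
url.appearances += 1
url.add_tweet_obj("second tweet")

puts url.appearances       # 2
puts url.tweet_objs.size   # 2
```

Note that add_tweet_obj doesn’t touch appearances, so it’s up to the caller to keep the counter and the array in step.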

OK, now we’re ready to return to the Timeline model and see the all-important make_url_objs method.

def make_url_objs
  self.make_tweets

  @tweets.each do |tweet|
    tweet.expanded_urls.each do |url|
      # Use a distinct block variable so it doesn't shadow the outer url_obj.
      url_obj = @url_objs.detect { |existing| existing.address == url }
    
      if url_obj
        url_obj.appearances = url_obj.appearances + 1
        url_obj.add_tweet_obj(tweet)
      else
        new_url_obj = Url.new
        new_url_obj.address = url
        new_url_obj.add_tweet_obj(tweet)
        @url_objs << new_url_obj
      end
    end
  end

  # more code here...

end

Since make_url_objs is the only method we call in the show action of the Timelines controller (besides get_max_appearances, which isn’t super important), we need all the magic to happen here. Thus the first line of the method, which calls the make_tweets method on self. As we’ve gone over, that method basically loads up the @tweets instance variable with our custom Tweet objects.

Next we have nested each loops that go through each URL in each tweet (remember: some tweets have more than one URL in them). Now that we have a particular url, we want to check to see if we’ve ever seen it before.

We assign the result of that little detect method to a new local variable called url_obj. If there was a match, url_obj will be the matching URL object. If there was no match, url_obj will equal nil.
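If you haven’t used detect before, here’s a tiny standalone sketch of the two outcomes. It searches an array of plain strings (placeholder addresses) rather than Url objects, just to keep it short.

```ruby
addresses = ["http://example.com/one", "http://example.com/two"]

# detect returns the first element for which the block is true...
match = addresses.detect { |address| address == "http://example.com/two" }

# ...or nil when nothing matches.
miss = addresses.detect { |address| address == "http://example.com/three" }

puts match.inspect  # "http://example.com/two"
puts miss.inspect   # nil
```

Because nil is falsy in Ruby, the result can go straight into an if, which is exactly what make_url_objs does.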

So if url_obj exists, we know there was a match. Thus we want to increase the appearances of that URL by one and we want to add the Tweet we found the URL in to the tweet_objs array of that url_obj. Basically we just found another tweet that mentions that article.

If there’s no match, we have found a new URL and we want to make a new Url object. We give it the address of the url we’re currently on, add the tweet to its tweet_objs array, and add the Url object itself to @url_objs, so it will return a match next time we see it.

Coming out of those two each loops we’ll have an array of URL objects in @url_objs.
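To see the whole loop in action without hitting Twitter, here’s a self-contained sketch. SketchTweet and SketchUrl are hypothetical stand-ins for the Tweet and Url models with just enough behavior to run, and the handles and addresses are made up.

```ruby
# Hypothetical stand-ins for the Tweet and Url models, just enough to run the loop.
SketchTweet = Struct.new(:user_handle, :expanded_urls)

class SketchUrl
  attr_accessor :address, :appearances
  attr_reader :tweet_objs

  def initialize
    @tweet_objs = []
    @appearances = 1
  end

  def add_tweet_obj(tweet_obj)
    @tweet_objs << tweet_obj
  end
end

tweets = [
  SketchTweet.new("alice", ["http://example.com/a"]),
  SketchTweet.new("bob",   ["http://example.com/a", "http://example.com/b"]),
  SketchTweet.new("carol", ["http://example.com/a"])
]

url_objs = []

tweets.each do |tweet|
  tweet.expanded_urls.each do |url|
    # Have we already made a URL object for this address?
    existing = url_objs.detect { |candidate| candidate.address == url }

    if existing
      existing.appearances += 1
      existing.add_tweet_obj(tweet)
    else
      new_url_obj = SketchUrl.new
      new_url_obj.address = url
      new_url_obj.add_tweet_obj(tweet)
      url_objs << new_url_obj
    end
  end
end

url_objs.each { |u| puts "#{u.address}: #{u.appearances}" }
```

With this made-up data, the first address ends up with 3 appearances and the second with 1. One design note: detect scans the whole array for every URL, which is fine for a single timeline’s worth of tweets; a Hash keyed by address would be an alternative if the list ever got large.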

Next we’ll run a filtering method on this array. The method, called filter_url_objs, keeps only URLs with appearances > 1, and it also attempts to weed out multiple tweets by the same user. The use case here is when CNN tweets a link to the same article 4 times in a short amount of time: we decided that we didn’t want this to constitute a legit ‘squawk’ on its own.

We then sort the array by appearances (most appearances first) and we’re ready to send @url_objs back to the Timelines controller and on to the show view.
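The filtering code isn’t shown in this post, so the following is only a guess at one way to handle the CNN case and the sort, not the app’s actual filter_url_objs. It assumes we count distinct user handles rather than raw tweets; SquawkTweet, SquawkUrl, and distinct_users are hypothetical names.

```ruby
# Hypothetical stand-ins; the real filter_url_objs is not shown in this post.
SquawkTweet = Struct.new(:user_handle)

SquawkUrl = Struct.new(:address, :tweet_objs) do
  # Count each user once, no matter how many times they tweeted the link.
  def distinct_users
    tweet_objs.map(&:user_handle).uniq.size
  end
end

url_objs = [
  SquawkUrl.new("http://example.com/a", [SquawkTweet.new("alice"), SquawkTweet.new("bob")]),
  SquawkUrl.new("http://example.com/b", [SquawkTweet.new("cnn")] * 4),
  SquawkUrl.new("http://example.com/c", [SquawkTweet.new("alice"), SquawkTweet.new("bob"), SquawkTweet.new("carol")])
]

# Keep links shared by at least two different users, most-shared first.
squawks = url_objs
  .select { |url_obj| url_obj.distinct_users > 1 }
  .sort_by { |url_obj| -url_obj.distinct_users }

squawks.each { |u| puts "#{u.address} (#{u.distinct_users} users)" }
```

In this sketch the link tweeted four times by a single account drops out, and the remaining links come back with the most widely shared one first.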

Phew! I know that ran a little long but I hope it was helpful to some of you. The rest of the app is all about making the Tweets display nicely; the methods for that live in the tweets_helper and come from the Twitter-Text Ruby gem (namely its auto_link method). @reply me on Twitter if you have any further questions about SquawkBot.