Blog Database Creation Process

From DML

Overview

We created a large database of blog entries using the unofficial Google Reader API. The database included almost 20,000,000 entries from over 150,000 blogs for a one-year period, from July 1, 2006 to June 30, 2007. We determined which blogs to retrieve entries from by following the links in the blog entries, beginning with Robert Scoble’s blog. We began with Scoble because of the large amount and variety of content available on his blog. We assumed that, because of this, we would find a wide range of social networks within a few levels of Scoble.

We followed a three-step process to retrieve each level of blog entries for the database. First, using the pyrfeed Google Reader interface, we extracted the entries for all blogs on a feed level. Second, we extracted all links from the blog entry content. Third, we determined whether the URL in each link pointed to another feed by checking the HTTP headers for a content-type that indicated a feed. If the content-type in the HTTP headers was text/html, we parsed the HTML header to see whether it contained a link HTML tag that specified a feed. If we could not find a feed for the URL using either of these two methods, we assumed that the link was to some other type of content and did not consider it in our analysis. Following this pattern and beginning with Robert Scoble’s blog, we retrieved all entries for feeds located within 3 levels of Scoble.
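
The level-by-level process described above can be sketched as a breadth-first traversal. This is a minimal illustration, not the code used in the study: crawl_levels and the three callables passed to it (fetch_entries, extract_links, find_feed_url) are hypothetical names standing in for the pyrfeed retrieval, link extraction and feed detection steps.

```python
def crawl_levels(seed_feed, max_level, fetch_entries, extract_links, find_feed_url):
    """Breadth-first discovery of feeds, returning {feed_url: feed_level}.

    The three callables are hypothetical stand-ins for the steps in the text:
      fetch_entries(feed_url)   -> iterable of entry HTML strings (step 1)
      extract_links(entry_html) -> iterable of URLs                (step 2)
      find_feed_url(url)        -> feed URL, or None if no feed    (step 3)
    """
    levels = {seed_feed: 0}
    current = [seed_feed]
    for level in range(max_level):
        next_level = []
        for feed in current:
            for entry in fetch_entries(feed):
                for url in extract_links(entry):
                    target = find_feed_url(url)
                    # Skip non-feed links and feeds already assigned a level.
                    if target is not None and target not in levels:
                        levels[target] = level + 1
                        next_level.append(target)
        current = next_level
    return levels
```

The sketch only assigns feed levels. Entries for the feeds on the last level would still need to be retrieved, without following their links any further.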

Discovering Methods

When we first started to collect blog entries, we retrieved them directly from the blogs as they were posted. This caused several problems, including

  1. It took a long time to get the content for the database,
  2. If the server went down and you missed a day then you would have holes in your database, and
  3. You had to know which blogs you were retrieving entries from right from the start of the study.

We wanted a way to retrieve entries posted in the past, and thought that we would try to use one of the Google APIs. Initially it didn’t appear that Google Reader had an API, so we looked at using the Blogger Data API or the Google AJAX Feed API, but neither completely met our needs. The Blogger API would only work to analyze networks among Blogger sites, and the Google AJAX Feed API didn’t seem to have a way to retrieve a set number of entries and had other limitations in its usage; for instance, it could only be accessed using JavaScript.

Google Reader API

We discovered a post about the Google Reader API at http://www.niallkennedy.com/blog/archives/2005/12/google_reader_a.html. From the post and its comments it seemed that Google was in the process of producing an API to access the data underneath the Google Reader interface. It also seemed that they didn’t mind if others used the unreleased API described by Niall.

Pyrfeed

We experimented a little with the API using HTTP requests and ran into some problems with Java and with parsing the feeds that were returned. A little more probing led us to http://code.google.com/p/pyrfeed/wiki/GoogleReaderAPI, which explained the Google Reader API in greater detail. It also included a program called pyrfeed, an interface for using the Google Reader API. Given your Google login credentials, it could send requests to Google to retrieve the entries from a specific feed URL. It provided access to Google’s database, which we assumed contained the majority of real feeds that we would be interested in. It also allowed you to specify how many entries you wanted to retrieve. Pyrfeed was also set up to store the entries in an SQL database using SQLite3. We tested it on a few feeds and it seemed to retrieve full entry content for most feeds. Pyrfeed was written in Python and seemed to run at a reasonable speed.

Using Pyrfeed

We needed to modify the pyrfeed code a little so that we could use it with our MySQL database. We changed the connection type and a little bit of the SQL code so that the syntax matched MySQL instead of SQLite3. We also added some support to the inserts so that it could match the system of id numbers that we were going to use in our database and so that we could also keep track of the authors.

Where to Begin

We decided that, to determine which blogs to retrieve entries from, we would follow the links in the entries to other blogs and retrieve the entries from those blogs. The original goal was to do this for several levels, providing enough data to find useful explicit and implicit networks. We decided to begin building our network at http://scobleizer.com/feed/, which is a feed written by Robert Scoble. We chose Scoble because he is a well-known name in the blogosphere. We also decided to download only the entries for the period from July 1st 2006 to July 1st 2007. This would limit the number of entries we would need to deal with and allow us to do some work with timegraphs using the explicit links of a blog. For that time period Scoble had about 1,870 entries.

Because it was already past July 1st 2007 when we began to extract the data, and the pyrfeed interface and Google Reader API did not allow you to select a date range to retrieve entries from, we needed to retrieve a large number of entries for each feed to make sure we got all the entries for that year. We chose to retrieve 2500 entries for each feed, which we could then filter ourselves according to the date range. This seemed a reasonable number because we assumed that Scoble posts more than an average blogger, so 2500 entries would cover the date range in almost all cases. We later found that many bloggers either had only recently begun to blog or did not post much, so in the majority of cases fewer than the limit of 2500 entries were retrieved.
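
The filtering step can be illustrated with a short sketch. It assumes each retrieved entry is a dict with a "published" datetime, which is a hypothetical representation and not pyrfeed's actual data structure. Treating the end of the range as exclusive is also an assumption. The may_be_truncated helper is not described in the study; it only spells out the reasoning behind the 2500-entry limit.

```python
from datetime import datetime

# Date range from the text; the exclusive upper bound is an assumption.
RANGE_START = datetime(2006, 7, 1)
RANGE_END = datetime(2007, 7, 1)


def filter_by_date(entries, start=RANGE_START, end=RANGE_END):
    """Keep entries whose 'published' datetime satisfies start <= published < end."""
    return [e for e in entries if start <= e["published"] < end]


def may_be_truncated(retrieved_count, oldest_published, limit=2500, start=RANGE_START):
    """Heuristic check (hypothetical, not from the study).

    If the retrieval limit was reached and the oldest entry retrieved is still
    inside the date range, older in-range entries may have been missed.
    """
    return retrieved_count >= limit and oldest_published >= start
```

For a feed like Scoble's, with about 1,870 entries in the period, the limit is not reached by in-range entries alone, so the check would only flag feeds that post considerably more often.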

Creating the Database

The first part of the database that we created was the tables for the entries, feeds and authors. The blog_url table includes a blog_url_id key for each feed, the URL for that feed and the feed level of that feed. The feed level is the step of the process on which a blog’s entries were retrieved. For example, Scoble had a feed level of 0, the blogs he linked to had a feed level of 1 and the blogs they linked to had a feed level of 2. The blog_entries table has the blog_url_id of each entry, a unique blog_entry_id and all of the information that pyrfeed retrieved for each entry, including its content, entry title, publish date, URL, and author. The authors table includes a unique author id key, the blog_url_id of the blog that the author writes for and the name of the author.
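
A rough sketch of these three tables follows. The study used MySQL; SQLite is used here only so that the example is self-contained. The table names and blog_url_id / blog_entry_id come from the description above, while the remaining column names and all column types are assumptions.

```python
import sqlite3

# Column names other than blog_url_id and blog_entry_id are assumed,
# as are all of the column types.
SCHEMA = """
CREATE TABLE blog_url (
    blog_url_id INTEGER PRIMARY KEY,
    url         TEXT NOT NULL,
    feed_level  INTEGER NOT NULL
);
CREATE TABLE blog_entries (
    blog_entry_id INTEGER PRIMARY KEY,
    blog_url_id   INTEGER NOT NULL REFERENCES blog_url(blog_url_id),
    title         TEXT,
    content       TEXT,
    publish_date  TEXT,
    url           TEXT,
    author        TEXT
);
CREATE TABLE authors (
    author_id   INTEGER PRIMARY KEY,
    blog_url_id INTEGER NOT NULL REFERENCES blog_url(blog_url_id),
    name        TEXT
);
"""


def create_schema(conn):
    """Create the three tables on an open sqlite3 connection."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
# The seed feed sits at feed level 0.
conn.execute(
    "INSERT INTO blog_url (url, feed_level) VALUES (?, ?)",
    ("http://scobleizer.com/feed/", 0),
)
```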

Extracting Links

After creating the beginnings of the database we downloaded the first entries from Scoble and used an HTML parser to extract the links from the entries retrieved. The parser would look for the <a href> tags and extract all the links from a particular entry. It then added those links to another table called explicit_links. The explicit_links table contained the blog_entry_id that the link was extracted from, the URL of the link and the date the link was posted by the author. The parser and interface were implemented in Python using its HTMLParser library.
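
A minimal sketch of such a link extractor is shown below. It uses html.parser from Python 3, the successor to the Python 2 HTMLParser module mentioned above. The names LinkExtractor and extract_links are illustrative and are not taken from the original code.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a piece of HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(entry_html):
    """Return the list of link URLs found in one entry's HTML content."""
    parser = LinkExtractor()
    parser.feed(entry_html)
    parser.close()
    return parser.links
```

Each returned URL, together with the blog_entry_id and the entry's publish date, would make up one row of the explicit_links table.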

Determining Which Links Were Feeds

Once we had the links, the problem was determining which of those links were to other blogs or feeds. We could not simply plug the URL into pyrfeed or the Google Reader API because they required the actual URL of the feed in order to retrieve entries. Because we assumed that most or all of the links would be to other bloggers’ entries rather than to their feed URLs, we needed a way to determine whether a URL had a feed connected to it.

Technorati

We first experimented with using the Technorati API to help us determine whether a URL had a feed connected to it. We faced several problems. One was that Technorati limited the number of queries that could be run per day and did not respond to our emails inquiring about the possibility of performing more queries each day. The second was that even Technorati’s data seemed to be incomplete, and a lot of entries didn’t have information connecting them to the actual feed URL. We decided that using Technorati’s API wasn’t an option.

Parsing the Webfile

After some investigation, the best option we found was to request the webpage for each URL and use the information in the headers to determine whether we could connect it to a feed. We used the Python urllib2 library to request the webfile and first checked whether the content-type header held one of the content types that specify a feed. If so, the link must point directly to the feed. If the content type was not a feed but was an HTML file, we pulled down the webfile and parsed it using another HTML parser. The parser only parses between the <head></head> tags and looks for a <link> tag. When it finds a link tag, it checks whether the tag has a “rel” attribute with the value “alternate” and a “type” attribute with a value such as “application/rss+xml”; if so, we have found a feed URL for that link. If it parses through the header and does not find such a link tag, we assume that there is no feed URL for that link.
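
The two checks can be sketched as follows. To keep the example self-contained it takes the content-type header and page body as arguments rather than fetching them. The set of feed content types is an assumption, since the text names only “application/rss+xml”. The names find_feed_url and FeedLinkFinder are illustrative, and resolving a relative href against the page URL is an addition not described in the text.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed list; the text names only "application/rss+xml" explicitly.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


def media_type(content_type_header):
    """Strip parameters such as '; charset=utf-8' and normalise case."""
    return content_type_header.split(";", 1)[0].strip().lower()


class FeedLinkFinder(HTMLParser):
    """Collect hrefs of <link rel="alternate" type=<feed type>> inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.feed_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "link" and self.in_head:
            a = dict(attrs)
            rel = (a.get("rel") or "").lower().split()
            link_type = (a.get("type") or "").lower()
            if "alternate" in rel and link_type in FEED_TYPES and a.get("href"):
                self.feed_urls.append(a["href"])

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def find_feed_url(url, content_type_header, body):
    """Return a feed URL for the given link, or None if neither check finds one."""
    ctype = media_type(content_type_header)
    if ctype in FEED_TYPES:
        return url  # the link points directly at a feed
    if ctype == "text/html":
        finder = FeedLinkFinder()
        finder.feed(body)
        finder.close()
        if finder.feed_urls:
            # Resolve a relative href against the page URL (an addition).
            return urljoin(url, finder.feed_urls[0])
    return None
```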

Retrieving Entries

Once we had a list of the URLs linked to by Scoble, we could begin to retrieve entries from Google Reader using pyrfeed. We modified the code again to update the database if we found that Google Reader didn’t have any entries for a feed. This usually meant that it was not a real feed or that something was wrong with the URL. If Google Reader had no entries for a feed, we removed it from our blog_urls table, modified the url_lookup table and removed all explicit links to that blog. When we were on the second level we realized it took too long to remove the explicit links as we retrieved entries, so we left them in the explicit_links table. They could be pruned later by matching them up to a -1 in the url_lookup table. If Google Reader had entries for a feed, the code added them to our database and moved on to the next feed URL.
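
The deferred pruning could look roughly like the sketch below. The layout of the url_lookup table is not described in the text, so the columns used here (url and blog_url_id, with -1 marking a URL that has no usable feed) are assumptions, as is the function name prune_dead_links. SQLite again stands in for the MySQL database used in the study.

```python
import sqlite3


def prune_dead_links(conn):
    """Delete explicit links whose URL is marked with -1 in url_lookup.

    Assumes url_lookup(url, blog_url_id) and explicit_links(..., url, ...);
    both column layouts are hypothetical. Returns the number of rows removed.
    """
    cur = conn.execute(
        "DELETE FROM explicit_links "
        "WHERE url IN (SELECT url FROM url_lookup WHERE blog_url_id = -1)"
    )
    conn.commit()
    return cur.rowcount
```

Running this once after a level has been retrieved avoids the cost of one delete per failed feed during retrieval.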
