BlogGrabber

From DML

Download

Latest Version: Download BlogGrabber-1.0.1.zip

Documentation

BlogGrabber was written to extract blog data and to build networks of blogs based on the hypertext links between them. Additional information about the first experiment we conducted using this methodology can be found here.

Description of Blog Retrieval and Extraction Cycle

First, entries were retrieved at the feed level for all blogs using the pyrfeed Google Reader interface. Second, all links were extracted from the content of each blog entry. Third, we determined whether the URL in each link pointed to another feed by checking the HTTP headers for a content type that implied a feed (for example, application/atom+xml or application/rss+xml). If the content type in the HTTP headers was text/html, we parsed the HTML head to see whether it contained a link tag that specified a feed (for example, if you go to datamininglab.blogspot.com and view the source, there is a <link rel="alternate" type="application/atom+xml" title="Data Mining Lab - Atom" href="http://datamininglab.blogspot.com/feeds/posts/default" /> tag that specifies the feed associated with the page). If we could not find a feed for the URL using either of these two methods, we assumed that the link pointed to some other type of content and did not consider it in our analysis.
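
The feed check described in the third step can be sketched roughly as follows. This is a minimal illustration, not BlogGrabber's actual code: the names FEED_TYPES, FeedLinkFinder and resolve_feed are hypothetical, the caller is assumed to have already fetched the Content-Type header and page body, and relative href values are returned as found rather than resolved.

```python
from html.parser import HTMLParser

# The two feed content types named in the text (assumed to be the full list).
FEED_TYPES = {"application/atom+xml", "application/rss+xml"}


def media_type(content_type):
    """Strip parameters such as '; charset=UTF-8' from a Content-Type value."""
    return content_type.split(";")[0].strip().lower()


class FeedLinkFinder(HTMLParser):
    """Collects href values of <link rel="alternate"> tags with a feed type."""

    def __init__(self):
        super().__init__()
        self.feed_urls = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <link ... />.
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        link_type = (attr.get("type") or "").lower()
        if "alternate" in rel and link_type in FEED_TYPES and attr.get("href"):
            self.feed_urls.append(attr["href"])


def resolve_feed(url, content_type, body):
    """Return the feed URL for a link target, or None if no feed is found."""
    kind = media_type(content_type)
    if kind in FEED_TYPES:
        return url  # the link already points at a feed
    if kind == "text/html":
        finder = FeedLinkFinder()
        finder.feed(body)
        finder.close()
        if finder.feed_urls:
            return finder.feed_urls[0]  # first advertised feed
    return None  # some other type of content; ignored in the analysis
```

For simplicity the sketch scans the whole document for link tags rather than only the head section.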

Basic Command Line Format

BlogGrabber performs different functions depending on the runtime parameters passed to BlogGrabber.py or the parameter values set in the config.dat file. The basic format of parameters on the command line is as follows.

python BlogGrabber.py -key1=value1 -key2=value2 -key3=value3 etc.

For example, the command

python BlogGrabber.py -dbhost=localhost -dbname=blogdatabase

would run BlogGrabber against the MySQL server on localhost, using a database named blogdatabase.
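
To make the -key=value format concrete, here is a minimal sketch of how such arguments could be split into a dictionary. It is not BlogGrabber's own parser, and the function name parse_params is hypothetical. It assumes that a flag without a value, such as -save, maps to an empty string.

```python
def parse_params(argv):
    """Split arguments like '-dbhost=localhost' into a {key: value} dict."""
    params = {}
    for arg in argv:
        if not arg.startswith("-"):
            raise ValueError("parameter must start with '-': %r" % arg)
        # Split on the first '=' only, so values containing '=' (for
        # example URLs with query strings) are kept intact.
        key, _, value = arg[1:].partition("=")
        if not key:
            raise ValueError("missing parameter name: %r" % arg)
        params[key] = value
    return params
```

Calling parse_params(sys.argv[1:]) on the example command above would yield a dictionary with the keys dbhost and dbname.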

How To Start

To begin, you will need a blog or two to build your network around, so you will most likely run a command similar to:

python BlogGrabber.py -run=all -googlename=yourgoogleusername -googlepassword=yourgooglepassword -url=theurlofthefeed -feedlevel=1

In addition, you will want to set your dbhost, dbname and dbpassword, either in the config file or as command line parameters. You may also want to change the value of numentries to the number of entries you want to retrieve for each blog. Note that the url entered on the command line must be the URL of the feed, not the URL of the blog's home page. For example, for the blog found at http://sheeptoss.blogspot.com, the URL used to retrieve feed content is http://feeds.feedburner.com/TheSheepToss, because that is where the feed for the blog is located. If you enter the URL of the blog's home page instead, Google Reader will not return any content.

After the first command has run, you should see that a few tables have been created with the content from that blog. By then changing the feedlevel parameter to 2 and running the command again with run=all, you can retrieve the blog content at the next degree of separation, and so on. For these later runs, make sure that the url parameter, whether passed on the command line or set in the config.dat file, is empty. When url is empty, BlogGrabber retrieves entries from all feeds at that feedlevel; otherwise it retrieves blog entries only from the URL passed in as a parameter.
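
The level-by-level procedure can be sketched as a list of commands to run in order. This is only an illustration under stated assumptions: the helper names build_command and crawl_plan are hypothetical, the Google and database settings are assumed to be stored in config.dat already, and leaving -url off the command line is assumed to leave url empty as long as config.dat holds no saved url.

```python
def build_command(feedlevel, url=""):
    """Build the argument list for one run=all pass over a feedlevel."""
    args = ["python", "BlogGrabber.py", "-run=all", "-feedlevel=%d" % feedlevel]
    if url:
        # Only the seed run names a specific feed URL.
        args.append("-url=%s" % url)
    return args


def crawl_plan(seed_url, depth):
    """Return the commands for feedlevels 1..depth, seed feed first."""
    plan = [build_command(1, seed_url)]
    for level in range(2, depth + 1):
        # Later levels omit url so that every feed on the level is processed.
        plan.append(build_command(level))
    return plan
```

Each argument list could then be passed to subprocess.run, one at a time, waiting for each level to finish before starting the next.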

Description of Parameters

There are a number of parameters that can be passed to the BlogGrabber to modify the functions it will perform. Examples of each parameter follow.

For each example, the value after the equals sign is a possible value that could be passed in for the parameter. For the run parameter, the possible values (retrieve, extract, feedurl, esn, all) are listed underneath the example.

  • -dbhost=localhost - dbhost is the host of the MySQL server you are connecting to, for example localhost or dml.cs.byu.edu.
  • -dbname=mydatabase - dbname is the name of the database on the MySQL server. The database must already have been created on the server or BlogGrabber will not be able to connect.
  • -dbuser=myusername - dbuser is the user name used to connect to the database on the server.
  • -dbpassword=mypassword - dbpassword is the password for the user name used to access the database on the server.
  • -googlename=mygoogleid - googlename is the username for your Google account. It is used only to log in to Google so that Google Reader can be accessed to retrieve feeds.
  • -googlepassword=mygooglepassword - googlepassword is the password for your Google account.
  • -feedlevel=2 - feedlevel represents a level of blogs in the network. The first level of your blog network should be 1. The blogs that are linked to by blogs in the first level are then located at level 2. IMPORTANT: You must change the feedlevel after each cycle in order for BlogGrabber to work correctly, because it performs each step of the cycle on a whole feedlevel at a time. The feedlevel parameter must be an integer.
  • -url=http://bogusfeed.com/rss - url is the URL of a feed to be retrieved and added to the specified feedlevel. IMPORTANT: The url must be set to the empty string if BlogGrabber is to retrieve all entries from a feedlevel. If the url is set to anything other than the empty string, BlogGrabber assumes that you simply want to add that URL's feed content to the database. This parameter is mainly intended for cases where a specific feed was accidentally removed or its content was not fully retrieved. It can also be used to begin a blog network around a specific blog by retrieving the entries from that blog and adding them to the first feedlevel.
  • -numentries=20 - numentries specifies the maximum number of entries to be retrieved for each blog. Keep in mind that retrieving a large number of entries may take significantly more time, because Google Reader is accessed by only a single thread to avoid flooding the Google Reader server with requests.
  • -run=retrieve - run specifies which portion of the retrieval and extraction cycle is to be run. For BlogGrabber to function correctly, the commands that retrieve and extract all information from a feedlevel should be run in the order retrieve => extract => feedurl. Then, after the entries have been retrieved from the next feedlevel of blogs, you can run the esn command on the previous feedlevel to extract all blog-to-blog links on that feedlevel to the esn_links table. Possible parameter values include the following.
    • retrieve - Retrieves all the entries from a feedlevel or from a specific URL, depending on the value of the url parameter, and adds the entries to the database.
    • extract - Extracts all hypertext links from the entries in a feedlevel and adds them to the database.
    • feedurl - Checks all links extracted from a feedlevel to see whether they are really feeds. If they are, it adds them to the blog_urls table with a feedlevel one greater than the current feedlevel. This represents the degrees of separation of a blog from the initial blogs in the network.
    • esn - Extracts information from the explicit_links table for all URLs on a feedlevel and places it into the esn_links table.
    • all - Runs the retrieve, extract and feedurl options for a feedlevel.
  • -save - Saves the config parameters from the command line in the config.dat file. This is most useful for parameters such as the database and user names and passwords that do not need to be changed from one operation to the next.
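
As an illustration of what the extract step involves, the following minimal sketch pulls the href values out of the anchor tags in an entry's HTML content. It is not BlogGrabber's actual implementation. The names LinkExtractor and extract_links are hypothetical, and the sketch uses Python's standard html.parser module.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in a piece of HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors without a target, e.g. <a name="x">
                self.links.append(href)


def extract_links(entry_html):
    """Return the link URLs found in one blog entry, in document order."""
    parser = LinkExtractor()
    parser.feed(entry_html)
    parser.close()
    return parser.links
```

Each returned URL would then be a candidate for the feedurl check, which decides whether it leads to another feed.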

Description of MySQL Tables

  • blog_urls - Contains information about the blogs
    • blog_url_id - The index of the blog
    • feedlevel - The degree of separation between the blog and the blogs where you began your network
    • url - The url where the blog can be found
    • added_on - The date the blog was added to the database
  • blog_entries - Contains information about the blog entries that have been retrieved
    • blog_entry_id - The index of the blog entry
    • blog_url_id - The index of the blog to which the entry belongs
    • title - The title of the blog entry
    • published - The date when the blog entry was published
    • updated - The date of the latest update to the blog entry
    • content - The actual html content of the blog entry
    • author_id - The index of the author who composed the entry
    • link - The url location where the blog entry can be found
    • original_id and google_id - Information used by Google to identify the entry
    • crawled - The time when the entry was retrieved
  • blog_authors - Contains information about the authors of blog entries
    • author_id - The index of an author
    • blog_url_id - The blog which the author writes for
    • name - The name of the author
  • explicit_links - Contains information about the links found in the blog entry content
    • blog_url_id - The index of the blog from which the link originated
    • blog_entry_id - The index of the blog entry from which the link originated
    • url - The url of the link found in the <a href=""> </a> tags of the html content of a blog entry
    • date - The date and time the link was posted
  • url_lookup - Contains information about the feeds connected to specific urls
    • url - The url to be evaluated for its associated feed
    • blog_url_id - The index of the feed found to be associated with the url, either from the content type of the response or from parsing the link tags found between the <head></head> tags for the URL of a feed
    • date - The date the url was checked for an associated feed
  • esn_links - Contains information about the explicit links found in the network
    • blog_url_id1 - The index of the blog from which an explicit link originated
    • blog_url_id2 - The index of the blog which was the target of the explicit link
    • pubdate - The time that the link was posted
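
To show how these tables could fit together, the sketch below builds a reduced version of the schema and derives esn_links rows for one feedlevel. Everything beyond the table and column names is an assumption: it uses an in-memory SQLite database in place of MySQL so that it runs anywhere, the column types are guesses, the sample rows are invented, and the join through url_lookup is inferred from the column descriptions rather than taken from BlogGrabber's code. The function name build_esn is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database
conn.executescript("""
CREATE TABLE blog_urls (blog_url_id INTEGER PRIMARY KEY, feedlevel INTEGER,
                        url TEXT, added_on TEXT);
CREATE TABLE explicit_links (blog_url_id INTEGER, blog_entry_id INTEGER,
                             url TEXT, date TEXT);
CREATE TABLE url_lookup (url TEXT, blog_url_id INTEGER, date TEXT);
CREATE TABLE esn_links (blog_url_id1 INTEGER, blog_url_id2 INTEGER,
                        pubdate TEXT);

-- Invented sample data: blog 1 (level 1) links to blog 2 and to an image.
INSERT INTO blog_urls VALUES (1, 1, 'http://a.example/feed', '2008-05-01');
INSERT INTO blog_urls VALUES (2, 2, 'http://b.example/feed', '2008-05-02');
INSERT INTO explicit_links VALUES
    (1, 10, 'http://b.example/post', '2008-05-01 12:00:00'),
    (1, 10, 'http://c.example/pic.png', '2008-05-01 12:00:00');
INSERT INTO url_lookup VALUES ('http://b.example/post', 2, '2008-05-02');
""")


def build_esn(conn, feedlevel):
    """Copy blog-to-blog links that originate on one feedlevel into esn_links."""
    conn.execute(
        """
        INSERT INTO esn_links (blog_url_id1, blog_url_id2, pubdate)
        SELECT e.blog_url_id, u.blog_url_id, e.date
        FROM explicit_links e
        JOIN blog_urls b ON b.blog_url_id = e.blog_url_id
        JOIN url_lookup u ON u.url = e.url
        WHERE b.feedlevel = ? AND u.blog_url_id IS NOT NULL
        """,
        (feedlevel,),
    )
    return conn.execute("SELECT * FROM esn_links").fetchall()


rows = build_esn(conn, 1)
```

In this sketch only the link that resolves to a known feed ends up in esn_links. The image link has no url_lookup entry and is dropped.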