Friday 30 April 2010

We Feel Fine

by Jonathan Harris and Sep Kamvar

At the core of We Feel Fine is a data collection engine that automatically scours the Internet every ten minutes, harvesting human feelings from a large number of blogs. Blog data comes from a variety of online sources, including LiveJournal, MSN Spaces, MySpace, Blogger, Flickr, Technorati, Feedster, Ice Rocket, and Google.

We Feel Fine scans blog posts for occurrences of the phrases "I feel" and "I am feeling". This is an approach that was inspired by techniques used in Listening Post, a wonderful project by Ben Rubin and Mark Hansen.

Once a sentence containing "I feel" or "I am feeling" is found, the system looks backward to the beginning of the sentence, and forward to the end of the sentence, and then saves the full sentence in a database.
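
The sentence-expansion step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `extract_feeling_sentences`, the case-insensitive matching, and the choice of `.`, `!` and `?` as sentence boundaries are all assumptions made for the example.

```python
import re

# Assumption: the trigger phrases are matched case-insensitively.
PHRASE = re.compile(r"\bI (?:feel|am feeling)\b", re.IGNORECASE)
# Assumption: a sentence ends at one of these characters.
SENTENCE_END = ".!?"

def extract_feeling_sentences(text):
    """Return each full sentence that contains a trigger phrase."""
    sentences = []
    for match in PHRASE.finditer(text):
        # Look backward to the beginning of the sentence.
        start = match.start()
        while start > 0 and text[start - 1] not in SENTENCE_END:
            start -= 1
        # Look forward to the end of the sentence.
        end = match.end()
        while end < len(text) and text[end] not in SENTENCE_END:
            end += 1
        sentence = text[start:end + 1].strip()
        # Skip duplicates when one sentence contains the phrase twice.
        if sentence not in sentences:
            sentences.append(sentence)
    return sentences
```

Calling `extract_feeling_sentences(post_text)` on the text of a post would return a list of candidate sentences ready to be saved. A real crawler would likely need more careful boundary handling (abbreviations, ellipses, HTML markup), which this sketch ignores.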

Once saved, the sentence is scanned to see if it includes one of about 5,000 pre-identified "feelings". This list of valid feelings was constructed by hand and consists mostly of adjectives and some adverbs. The full list of valid feelings, along with the total count of each feeling, and the color assigned to each feeling, is here.

If a valid feeling is found, the sentence is said to represent one person who feels that way.
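
As a rough sketch of this matching step, the saved sentence can be split into words and checked against the list of valid feelings. The four-word `FEELINGS` set below is a hypothetical stand-in for the real list of about 5,000 entries, and returning the first match is an assumption; the description above does not say how a sentence containing several feelings is handled.

```python
import re

# Hypothetical stand-in for the hand-built list of about 5,000 feelings.
FEELINGS = {"happy", "sad", "tired", "lonely"}

def find_feeling(sentence, feelings=FEELINGS):
    """Return the first valid feeling word in the sentence, or None."""
    # Lowercase the sentence and pull out runs of letters and apostrophes.
    for word in re.findall(r"[a-z']+", sentence.lower()):
        if word in feelings:
            return word
    return None
```

For example, `find_feeling("I feel so happy today.")` yields `"happy"`, while a sentence such as "I feel like going home." contains no listed feeling and yields `None`.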

If an image is found in the post, the image is saved along with the sentence, and the image is said to represent one person who feels the feeling expressed in the sentence.

Because a high percentage of all blogs are hosted by one of several large blogging companies (Blogger, MySpace, MSN Spaces, LiveJournal, etc.), the URL format of many blog posts can be used to extract the username of the post's author. Given the author's username, we can automatically traverse the given blogging site to find that user's profile page. From the profile page, we can often extract the age, gender, country, state, and city of the blog's owner. Given the country, state, and city, we can then retrieve the local weather conditions for that city at the time the post was written. We extract and save as much of this information as we can, along with the post.
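
To illustrate the first step, here is a small sketch of pulling a username out of a post URL. It assumes, purely for the example, that the username appears as a subdomain (as in `username.livejournal.com` or `username.blogspot.com`); the `HOST_SUFFIXES` tuple and the `extract_username` function are hypothetical, and each real blogging site would need its own rule.

```python
from urllib.parse import urlparse

# Assumption: on these hosts the username is the subdomain.
HOST_SUFFIXES = (".livejournal.com", ".blogspot.com")

def extract_username(post_url):
    """Return the author's username from a post URL, or None if unknown."""
    host = urlparse(post_url).hostname or ""
    for suffix in HOST_SUFFIXES:
        if host.endswith(suffix):
            username = host[: -len(suffix)]
            # "www" is the site itself, not a user.
            if username and username != "www":
                return username
    return None
```

Given a username, the profile page lookup and the weather lookup described above would follow as separate steps, which are not sketched here because they depend on each site's page layout.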

This process is repeated automatically every ten minutes, generally identifying and saving between 15,000 and 20,000 feelings per day.
