My master's capstone project is how I've spent a great many evenings over the past year. It's been a chance to explore some new technologies and new problem-solving techniques like machine learning. Here's what I built.
Collaborative recommender systems (like the software that drives Netflix's movie recommendations) require a lot of rating data before they become useful. This presents a problem for new systems. To provide quality suggestions, a system needs a large set of users actively rating content - but attracting those initial users is difficult because new systems are of low value until they can meaningfully recommend content.
My project demonstrates one approach to solving the problem. The software I built bootstraps a recommender system by harvesting publicly available micro-blogging data (specifically Twitter). The software uses machine learning to build a sentiment analysis classifier that allows it to decide if movie-related posts express positive or negative feeling. It then uses the classified data to construct a recommendation system.
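As a rough illustration of the classification idea (not the project's actual code, which is written in Groovy), here is a minimal sketch of a sentiment classifier in Python. It assumes a multinomial naive Bayes model with add-one smoothing; the algorithm choice, the tiny hand-labeled training set, and every name in the block are assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace; a real system would do far more.
    return text.lower().split()


class NaiveBayesSentiment:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {"pos": Counter(), "neg": Counter()}
        self.doc_counts = Counter()
        self.vocabulary = set()

    def train(self, labeled_posts):
        # labeled_posts is an iterable of (text, label) pairs.
        for text, label in labeled_posts:
            self.doc_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            # Start from the log prior for this label.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(counts.values())
            for word in tokenize(text):
                # Add-one smoothing keeps unseen words from zeroing the score.
                smoothed = counts[word] + 1
                score += math.log(smoothed / (total_words + len(self.vocabulary)))
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical hand-labeled posts, invented for this example.
TRAINING = [
    ("loved this movie great acting", "pos"),
    ("great fun really enjoyed it", "pos"),
    ("terrible plot hated it", "neg"),
    ("boring and awful waste of time", "neg"),
]

classifier = NaiveBayesSentiment()
classifier.train(TRAINING)
print(classifier.classify("great movie loved it"))  # pos
print(classifier.classify("awful boring plot"))     # neg
```

With only four training posts this is a toy, but it shows the shape of the problem: labeled posts go in, and a function that maps new text to a positive or negative label comes out.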
The system is large enough that breaking it into a set of discrete components made for a much cleaner design. This also allowed me to work on the individual components in isolation. The component responsibilities are:
- Data Access Layer
- At the bottom of the system is the Data Access Layer. This component is responsible for storing and retrieving data. These classes present an abstraction between the final data storage tool and the rest of the software.
- Data Acquisition
- Movie Tweets pulls data from several external sources: Twitter, Topsy, and Rotten Tomatoes. The Twitter data is also harvested in different ways from different API endpoints. The classes in this component connect to these external services and gather data.
- Classification
- The classification component is responsible for determining whether a tweet is positive or negative. Other areas of the software interact with this component only through a simple interface, but internally it is complex.
- Data Analysis
- The Data Analysis component transforms raw tweets into a form usable by the system. It identifies language, determines if a tweet contains sentiment, and cleans up tweets by replacing terms such as @twitter_user with a placeholder. The primary output of this component is a relationship between users and movies.
- Recommender
- The recommender component is responsible for generating a list of movies that a user might like through collaborative filtering.
- Assessment
- Assessment is a small component that builds administrative reports over the tweet data.
- User Interface
- The final component is the user interface. The user interface depends on all the other components and presents their final output.
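To make two of the steps above more concrete, here is a small Python sketch (again, not the project's Groovy code) of the tweet cleanup done in Data Analysis and of the collaborative filtering done by the recommender. It assumes a positive tweet counts as a rating of +1 and a negative tweet as -1, and it uses user-based filtering with cosine similarity. The placeholder token, the rating scheme, the similarity measure, and the sample data are all assumptions made for the example.

```python
import math
import re

MENTION = re.compile(r"@\w+")


def clean_tweet(text):
    # Replace @mentions with a placeholder token (the token name is an assumption).
    return MENTION.sub("USER_MENTION", text)


def cosine_similarity(a, b):
    # a and b map movie title -> rating for one user each.
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[movie] * b[movie] for movie in shared)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b)


def recommend(target, ratings, limit=3):
    # Score each unseen movie by the similarity-weighted ratings of other users.
    scores = {}
    for other, other_ratings in ratings.items():
        if other == target:
            continue
        similarity = cosine_similarity(ratings[target], other_ratings)
        if similarity <= 0:
            continue  # ignore users with no overlap or opposite taste
        for movie, rating in other_ratings.items():
            if movie not in ratings[target]:
                scores[movie] = scores.get(movie, 0.0) + similarity * rating
    # Keep only movies with a positive score, best first, ties broken by title.
    ranked = sorted((m for m in scores if scores[m] > 0),
                    key=lambda m: (-scores[m], m))
    return ranked[:limit]


# Hypothetical user -> movie relationships: +1 positive tweet, -1 negative tweet.
RATINGS = {
    "alice": {"Inception": 1, "Up": 1, "Twilight": -1},
    "bob": {"Inception": 1, "Up": 1, "Moon": 1},
    "carol": {"Inception": -1, "Twilight": 1, "Saw": 1},
}

print(clean_tweet("@bob Inception was amazing"))  # USER_MENTION Inception was amazing
print(recommend("alice", RATINGS))                # ['Moon']
```

In this sample, bob agrees with alice on two movies, so his positive tweet about Moon becomes a recommendation for her. carol disagrees with alice on both movies they share, so her ratings are ignored.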
The Groovy language was my choice for the project. It's a language I've been playing with for a couple years now. I've used it in the context of building some small Grails sites. I've also started writing Groovy tests for our Java software at work; using Groovy for Java tests is a reasonably common use case. The Groovy console is something I often keep up during Java development so I can quickly test some regular expression or a Java API. This project was a chance to finally use the language on a large scale.
Grails is the tool I chose for the user interface. I've used it a couple of times before for small personal projects and have always been happy with the results. I've tried quite a few web frameworks for the JVM, and Grails is the one I've been most productive with. Grails is a Groovy-based framework, so it also made a lot of sense to use it on a Groovy-based project. I didn't want to simply build a piece of Grails software, though, so the user interface is the only component that depends on Grails. The other pieces are pure Groovy.
I also took this opportunity to explore a few other technologies I hadn't found the time to experiment with yet. I used Gradle for the build system. The project uses the Spock specification/testing framework (this is a really brilliant piece of software). I included Weka for some machine learning algorithms, though in the final product Weka is simply a redundant algorithm implementation used to check the software's own calculations. Finally, I used Guice for dependency injection.