NoSQL Experiments
by Matt Cholick
- There are no real relationships; the different pieces of data can be stored in unrelated tables.
- Transactions aren't important because the data is either read or written in single large operations spanning the entire dataset. It is not randomly accessed from multiple threads of execution.
- Scale seemed important because I'm pulling data from Twitter's Firehose, which can produce large volumes of data that I need to process quickly.
I moved my data from PostgreSQL to Cassandra. One attractive aspect of Cassandra is that it's a Java project; working in Groovy means that Java-focused tools are very easy to integrate. Cassandra was developed by Facebook and became an Apache project, so I know it's software with a solid history as well as current backing. Finally, write-ups of the tool noted that its write performance is quite good.
I added Cassandra and, as much as I possibly could, tried to keep my algorithms constant. I also made sure to include indexes and to slice the data into similar chunks for processing. I experimented with two of the longer-running pieces of my program: the algorithm to clean up and post-process Tweets, and building a Bayes classifier from the cleaned data. The cleanup operation is very write intensive, while the training operation only reads data. Here are the results.
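To make the two workloads concrete, here is a minimal sketch of what the write-heavy cleanup pass looks like. The class and the regex rules are illustrative assumptions, not the project's actual code: it normalizes raw tweet text, and in the real job each cleaned row would be written back to the store (one write per tweet, which is where Cassandra's write performance pays off).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of the write-heavy cleanup pass: strip links and
// @-mentions, normalize whitespace and case. Each cleaned tweet would be
// written back to the database in the real job.
public class TweetCleanup {
    static String clean(String raw) {
        return raw
            .replaceAll("https?://\\S+", "")  // drop links
            .replaceAll("@\\w+", "")          // drop @-mentions
            .replaceAll("[^a-zA-Z#\\s]", "")  // keep words and hashtags
            .replaceAll("\\s+", " ")          // collapse whitespace
            .trim()
            .toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        List<String> raw = List.of(
            "Check this out http://t.co/abc @friend!!",
            "NoSQL is FUN #cassandra");
        List<String> cleaned = new ArrayList<>();
        for (String t : raw) {
            cleaned.add(clean(t)); // one write per tweet in the real pipeline
        }
        System.out.println(cleaned);
    }
}
```

The point is the shape of the workload: a stream of small, independent writes with no cross-row coordination, which is exactly the case the bullet points above describe.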
Cassandra did show improved write performance: running the cleanup operation took half the time. Read-heavy operations, on the other hand, did not perform as well. It's likely that I could have made changes to the algorithm to improve performance in the Bayes implementation, but the same could be said for the PostgreSQL version. This was not a scientific experiment; it was simply dropping a different back-end behind the same implementation, and the database operation involved in training is a simple series of reads pulling a slice of data. The improved write performance in cleanup is nice, but unfortunately that's an algorithm I rarely run. Training and modifying the classifier is where I've spent most of my development time.
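For contrast, the read-heavy training step looks roughly like this. Again, this is a hedged sketch rather than the project's code: cleaned tweets are read out of the store in slices and folded into per-label word counts, the core of a naive Bayes classifier. The class, method names, and add-one smoothing are my assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the read-heavy training step: stream cleaned,
// labeled tweets out of the store and accumulate per-label word counts.
public class BayesTrainer {
    final Map<String, Map<String, Integer>> counts = new HashMap<>();
    final Map<String, Integer> labelTotals = new HashMap<>();

    // Each call corresponds to one row read from the database.
    void train(String label, String text) {
        Map<String, Integer> wordCounts =
            counts.computeIfAbsent(label, k -> new HashMap<>());
        for (String w : text.split("\\s+")) {
            wordCounts.merge(w, 1, Integer::sum);
            labelTotals.merge(label, 1, Integer::sum);
        }
    }

    // P(word | label) with add-one smoothing over a given vocabulary size.
    double wordProb(String word, String label, int vocabSize) {
        int c = counts.getOrDefault(label, Map.of()).getOrDefault(word, 0);
        int total = labelTotals.getOrDefault(label, 0);
        return (c + 1.0) / (total + vocabSize);
    }

    public static void main(String[] args) {
        BayesTrainer t = new BayesTrainer();
        t.train("positive", "great tool great performance");
        t.train("negative", "slow reads");
        System.out.println(t.wordProb("great", "positive", 10));
    }
}
```

The database does almost nothing interesting here: it just feeds slices of rows into the counting loop, which is why a back-end with slower reads dominated the training time.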
These kinds of tools fill a real need, but they're simply not a drop-in database replacement. In this project, where my own time is the scarcest resource, a model that's familiar enough to implement quickly makes the most sense. The experimentation and exploration come from the machine learning, the recommender, and the large-scale Groovy implementation.
My biggest takeaway: this type of technology isn't something to just drop in as a database replacement. My mental models still need adjusting.