Reflections and Closing Remarks

Working with Big Data
In this project, we got a first real taste of what working with big(ish) data feels like. We scraped over 700 MB of reviews and movie data, which translated into very long scraping and processing times. Working with data at this scale forced us to think carefully about both time and memory. For example, when we first loaded the entire set of IMDB reviews into our notebook and then tried to perform actions on it, we would often run out of memory. To overcome this, we adopted the strategy of saving almost everything to disk, a pattern that became ubiquitous in our code. Saving this data to disk (along with frequent calls to Python's gc.collect() function) allowed us to conserve our computer's resources and cut down on the time required to regenerate data objects: instead of rerunning the code every time (to scrape the data, for instance), we would simply load the pre-formatted data from disk.
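A minimal sketch of this save-to-disk pattern, assuming a generic pickle-based cache (the function and file names here are illustrative, not taken from our project code):

```python
import gc
import os
import pickle

def load_or_build(cache_path, build_fn):
    """Load a previously saved object from disk, or build it once and cache it.

    cache_path: where the pickled object lives on disk.
    build_fn:   zero-argument function that produces the object
                (e.g. a scraping or preprocessing routine).
    """
    if os.path.exists(cache_path):
        # Cheap path: load the pre-formatted data instead of rebuilding it.
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    obj = build_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(obj, f)
    # Reclaim memory held by intermediate objects before moving on.
    gc.collect()
    return obj
```

On subsequent runs the expensive `build_fn` is never called; the notebook just reads the pickle, which is what made re-running our analysis tolerable.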
All in all, working with big data has led us to appreciate things like algorithmic complexity of code, and how companies like Google and Facebook have to work with even bigger data on a day-to-day basis.

Future Directions
With an open-ended problem like predicting movie Oscar wins and nominations, there are many directions our project could have taken. We did revise and re-revise our model herein, but there is always more we could do, such as experimenting with different models, investigating new ways to combine the different components, and, finally, accounting more properly for bias, something we did not address very strongly in our models. One question to investigate here: when combining reviews from different sources, did each source carry a different bias? If so, how could we effectively normalize for this so that a combination of them could be properly used? Finally, Oscar data over a 10-year period proved too little to train our models well, and resulted in our models over-predicting failures. To overcome this, we could train our models on Oscar data from a greater number of years. With this, though, comes the added cost of gathering more movie reviews and working with even larger data.
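One simple way to approach the cross-source bias question above would be to standardize scores within each source before combining them, so that a source that systematically scores high or low does not dominate. A minimal sketch under that assumption (the helper name is hypothetical, not from our code):

```python
from statistics import mean, stdev

def normalize_by_source(scores_by_source):
    """Z-score review scores within each source so sources with different
    baselines can be combined on a common scale.

    scores_by_source: dict mapping source name -> list of raw scores.
    Returns a dict with the same keys and standardized scores.
    """
    normalized = {}
    for source, scores in scores_by_source.items():
        mu = mean(scores)
        # Guard against single-review sources and zero variance.
        sigma = stdev(scores) if len(scores) > 1 else 1.0
        sigma = sigma or 1.0
        normalized[source] = [(s - mu) / sigma for s in scores]
    return normalized
```

After this transformation, a movie's combined score could be an average of its standardized scores across sources, rather than an average of raw scores on incompatible scales.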

Final Remarks
Herein we demonstrated how movie review data and scores can be combined with quantitative data describing a movie to form a predictive model of its relative success or failure at the Oscar awards. We showed that such models do a relatively good job of predicting Oscar failures, yet can still be improved upon for predicting Oscar success. Ultimately, we generated a list of 2013 movies that our model predicts will win Oscar awards and another list of movies it predicts will receive Oscar nominations. It will be interesting to see how well our model performs for the 2014 Oscar Awards.

Presented By: Nick Perkons, Mike Rizzo, Julia Careaga, & Ibrahim Khan