Getting the Data

Getting the Movies

To run our analysis, we need a dataset of movies for each year that are candidates for an Oscar. For this purpose, we extracted a subset of the movies that Box Office Mojo lists as the top movies of a given year. Assuming that the movies most likely to win an Oscar are among those with the highest gross revenue in that year, we selected the top 200 movies from each year as candidates for an Oscar (awarded in the following year).
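The selection step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name select_top_grossing, the field names (title, year, gross), and the sample rows are all hypothetical, and it assumes the yearly box office data has already been loaded into a list of dictionaries.

```python
from collections import defaultdict


def select_top_grossing(movies, n=200):
    """Return {year: the n highest-grossing movies released that year}."""
    by_year = defaultdict(list)
    for movie in movies:
        by_year[movie["year"]].append(movie)
    # Sort each year's movies by gross revenue, highest first, and keep n.
    return {
        year: sorted(group, key=lambda m: m["gross"], reverse=True)[:n]
        for year, group in by_year.items()
    }


# Illustrative rows only: titles and grosses are made up.
sample = [
    {"title": "A", "year": 2012, "gross": 300.0},
    {"title": "B", "year": 2012, "gross": 100.0},
    {"title": "C", "year": 2012, "gross": 200.0},
    {"title": "D", "year": 2013, "gross": 50.0},
]
top_two = select_top_grossing(sample, n=2)
```

Movies selected for a given release year would then be treated as candidates for the Oscars awarded in the following year.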

The data collection via get_top_movies by year found the IMDBid (the reference ID used across the different segments of this project) for approximately 90% of the top 200 movies in each year. The remaining 10% of IMDBids were identified manually, and these manual determinations were recorded in two new files: 'revised_id_data.csv' and 'completedata.csv'.
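One way to combine the automatically found IDs with the manually identified ones is sketched below. The layout of the two CSV files is not described here, so this assumes the manual IDs have been loaded into a dictionary keyed by (title, year). The function name, field names, and placeholder IDs are hypothetical.

```python
def fill_missing_ids(movies, manual_ids):
    """Return copies of the rows, filling missing IDs from manual_ids.

    A missing ID is represented here by Python's None (an assumption).
    """
    filled = []
    for movie in movies:
        row = dict(movie)  # copy so the input rows are left untouched
        if row.get("imdb_id") is None:
            row["imdb_id"] = manual_ids.get((row["title"], row["year"]))
        filled.append(row)
    return filled


# Placeholder data: these are not real IMDB IDs.
found = [
    {"title": "A", "year": 2012, "imdb_id": "tt0000001"},
    {"title": "B", "year": 2012, "imdb_id": None},
]
manual = {("B", 2012): "tt0000002"}
complete = fill_missing_ids(found, manual)
```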

Of the top 200 movies from each year, approximately 5% were excluded from further analysis. The top reasons for movie exclusion included:

  • Re-releases of prior movies (often occurs around the holidays)
  • Re-releases of movies with visual enhancements (a very common procedure for Disney movies)
  • Exclusive release of movies in IMAX theaters
  • One-time limited showings of events such as concerts (see Fathom Events)

When a movie was excluded from analysis, its IMDBid was set to 'None'. These rows of the dataframe were then removed from analysis. In addition, all dollar values were converted to 2013 dollars, accounting for inflation as reported by usinflationrates.org.
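The exclusion and inflation steps might look something like the sketch below. It assumes inflation is expressed as a price index per year, so that an amount is scaled by the ratio of the 2013 index to the index of its own year. The index values shown are placeholders, not figures from usinflationrates.org, and the function and field names are hypothetical.

```python
def to_2013_dollars(amount, year, price_index, base_year=2013):
    """Scale an amount from `year` dollars to `base_year` dollars."""
    return amount * price_index[base_year] / price_index[year]


def clean_movies(movies, price_index):
    """Drop excluded rows and add an inflation-adjusted gross column."""
    cleaned = []
    for movie in movies:
        if movie["imdb_id"] == "None":  # marker for excluded movies
            continue
        row = dict(movie)
        row["gross_2013"] = to_2013_dollars(row["gross"], row["year"], price_index)
        cleaned.append(row)
    return cleaned


# Placeholder index values and rows, for illustration only.
price_index = {2012: 100.0, 2013: 102.0}
movies = [
    {"imdb_id": "tt0000001", "year": 2012, "gross": 50.0},
    {"imdb_id": "None", "year": 2012, "gross": 10.0},
    {"imdb_id": "tt0000002", "year": 2013, "gross": 20.0},
]
cleaned = clean_movies(movies, price_index)
```

With these placeholder numbers, a 2012 gross is scaled up by 2% while a 2013 gross is left unchanged.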


Getting the Movie Reviews

We needed to construct DataFrames for our movie reviews and rating data, so first we needed to pull this data from the web. We chose to use the following sources:

  • Rotten Tomatoes
  • Metacritic
  • IMDB

When pulling the data, we were careful to take only the reviews and ratings that are specific to the site in question. For example, for IMDB we took only the IMDB user reviews, because the critic reviews are linked from other sites such as Rotten Tomatoes and Metacritic.

From these sources, we scraped our review data and stored it in DataFrames of the following structure:

  • Critic: Name of the critic
  • Normalized Score: Score normalized to the maximum score for this source
  • Quote: The text of the review
  • ID: IMDB ID
  • Title: Title of the movie
  • Source: Source of the review
  • Overall Score: The overall score for this movie on this site (if there is one)
  • Year: Year that the movie is from
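As a rough illustration of the structure above, a single review row and the score normalization could be sketched as follows. This assumes normalization means dividing a raw score by the maximum possible score for its source. The class name, field names, helper function, and sample values are hypothetical, and treating the scores as optional is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Review:
    critic: str                        # Critic
    normalized_score: Optional[float]  # Normalized Score, in the range 0 to 1
    quote: str                         # Quote
    imdb_id: str                       # ID
    title: str                         # Title
    source: str                        # Source
    overall_score: Optional[float]     # Overall Score, if the site has one
    year: int                          # Year


def normalize_score(raw_score, max_score):
    """Divide a raw score by the source's maximum; pass None through."""
    if raw_score is None:
        return None
    return raw_score / max_score


# Made-up example: a user rating of 7 on a 10-point scale.
review = Review(
    critic="example_user",
    normalized_score=normalize_score(7, 10),
    quote="An example review.",
    imdb_id="tt0000001",
    title="A",
    source="IMDB",
    overall_score=None,
    year=2012,
)
```

Putting every score on a common 0 to 1 scale makes reviews from sources with different rating scales directly comparable.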


Presented By: Nick Perkons, Mike Rizzo, Julia Careaga, & Ibrahim Khan