And the Oscar goes to...

To run our analysis, we need a dataset of the movies that are candidates for an Oscar in a given year. For this purpose, we extracted a subset of the movies that Box Office Mojo lists as the top movies of each year. Assuming that the movies most likely to win an Oscar are among those with the highest gross revenue in a given year, we selected the top 200 movies from each year as candidates for an Oscar (awarded in the following year).

The data collected via get_top_movies by year yielded the IMDBid (the reference id used across the different segments of this project) for approximately 90% of the top 200 movies in each given year. The remaining 10% of IMDBids, which this procedure did not find, were identified manually. The manually determined IMDBids were recorded in two new files: 'revised_id_data.csv' and 'completedata.csv'.
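
The manual fill-in step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function `fill_missing_ids`, the record fields (`title`, `year`, `imdb_id`), and the CSV column names are all assumptions, and the in-memory CSV only stands in for a manual lookup file whose real layout is not shown here.

```python
import csv
import io


def fill_missing_ids(movies, manual_ids):
    """Return copies of the movie records with empty IMDB ids filled from manual_ids.

    movies: list of dicts with hypothetical keys 'title', 'year', 'imdb_id'.
    manual_ids: dict mapping (title, year) to a manually determined IMDB id.
    """
    filled = []
    for movie in movies:
        record = dict(movie)  # copy so the input records are left untouched
        if not record.get("imdb_id"):
            # Fall back to the manual lookup; stays None if still unknown.
            record["imdb_id"] = manual_ids.get((record["title"], record["year"]))
        filled.append(record)
    return filled


# Stand-in for a manual-lookup CSV; the column names are assumptions.
manual_csv = io.StringIO("title,year,imdb_id\nMovie B,2012,tt0000002\n")
manual_ids = {
    (row["title"], int(row["year"])): row["imdb_id"]
    for row in csv.DictReader(manual_csv)
}

movies = [
    {"title": "Movie A", "year": 2012, "imdb_id": "tt0000001"},
    {"title": "Movie B", "year": 2012, "imdb_id": None},
]
completed = fill_missing_ids(movies, manual_ids)
```

Keying the lookup on both title and year is a design choice made for this sketch, since the same title can reappear in different years (for example, as a re-release).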

Of the top 200 movies from each year, approximately 5% were excluded from further analysis. The top reasons for movie exclusion included:

Re-releases of prior movies (often occurs around the holidays)
Re-releases of movies with visual enhancements (a very common procedure for Disney movies)
Exclusive release of movies in IMAX theaters
One-time limited showings of events such as concerts (see Fathom Events)

When a movie was excluded from analysis, its IMDBid was set to 'None'. These rows of the dataframe were then removed from analysis. In addition, all dollar values were converted to 2013 dollars, accounting for inflation as reported by usinflationrates.org.
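
The two clean-up steps above (dropping excluded rows and converting to 2013 dollars) might look like the following with pandas. This is a sketch under stated assumptions: the column names (`imdb_id`, `year`, `gross`), the function `clean_and_adjust`, and especially the inflation multipliers are hypothetical placeholders, not the figures taken from usinflationrates.org.

```python
import pandas as pd

# Placeholder multipliers to 2013 dollars, for illustration only.
# These are NOT the values reported by usinflationrates.org.
TO_2013_DOLLARS = {2011: 1.04, 2012: 1.02, 2013: 1.00}


def clean_and_adjust(movies, factors):
    """Drop excluded rows (IMDBid == 'None') and add an inflation-adjusted gross."""
    # Excluded movies carry the string 'None' as their id.
    kept = movies[movies["imdb_id"] != "None"].copy()
    # Multiply each gross by the factor for its release year.
    kept["gross_2013"] = kept["gross"] * kept["year"].map(factors)
    return kept.reset_index(drop=True)


movies = pd.DataFrame({
    "title": ["Movie A", "Re-release B", "Movie C"],
    "imdb_id": ["tt0000001", "None", "tt0000003"],
    "year": [2011, 2012, 2013],
    "gross": [100.0, 50.0, 200.0],
})
cleaned = clean_and_adjust(movies, TO_2013_DOLLARS)
```

Keeping the original `gross` column alongside `gross_2013` makes it easy to check the conversion against the source figures.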

We needed to construct DataFrames for our movie reviews and rating data, so first we needed to pull this data from the web. We chose to use the following sources:

Rotten Tomatoes
Metacritic
IMDB

When pulling the data, we made sure to take only the reviews and ratings that are specific to the site in question (for example, for IMDB, we took only the IMDB user reviews, because the critic reviews are linked from other sites like Rotten Tomatoes and Metacritic).

From these sources, we scraped our review data and stored it in DataFrames of the following structure:

Critic: Name of the critic
Normalized Score: Score normalized to the maximum possible score for this source
Quote: The text of the review
ID: IMDB ID
Title: Title of the movie
Source: Source of the review
Overall Score: The overall score for this movie on this site (if there is one)
Year: Year that the movie is from
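
As a rough illustration of how one row of this structure could be built, the sketch below divides a raw score by the source's maximum to get the normalized score. The helper names `normalize_score` and `make_review_row`, the 0-to-1 output range, and the sample values are assumptions made for this example; the scraping code itself may compute the normalized score differently.

```python
def normalize_score(score, max_score):
    """Scale a raw score by the source's maximum (assumed to yield a 0-1 value)."""
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    return score / max_score


def make_review_row(critic, score, max_score, quote, imdb_id,
                    title, source, overall_score, year):
    """Build one review record using the column names listed above."""
    return {
        "Critic": critic,
        "Normalized Score": normalize_score(score, max_score),
        "Quote": quote,
        "ID": imdb_id,
        "Title": title,
        "Source": source,
        "Overall Score": overall_score,  # None when the site has no overall score
        "Year": year,
    }


# Made-up sample review, assuming a 10-point user rating scale.
row = make_review_row(
    critic="example_user",
    score=8,
    max_score=10,
    quote="A sample review string.",
    imdb_id="tt0000001",
    title="Movie A",
    source="IMDB",
    overall_score=None,
    year=2012,
)
```

A list of such records can be passed directly to a DataFrame constructor, which keeps the column names in one place.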

Getting the Data

Getting the Movies

Getting the Movie Reviews