To run our analysis, we need a dataset of the movies that are Oscar candidates in a given year. For this purpose, we extract a subset of the movies that Box Office Mojo lists as the top movies of each year. Assuming that the movies most likely to win an Oscar are among those with the highest gross revenue in their year, we decided to select the top 200 movies from each year as candidates for an Oscar (awarded in the following year).
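The selection step can be sketched as follows. This is a minimal illustration, assuming the Box Office Mojo data has already been loaded into a pandas DataFrame; the column names `title`, `year`, and `gross`, the helper `top_n_per_year`, and the sample values are all hypothetical placeholders, not the project's actual code or data.

```python
import pandas as pd


def top_n_per_year(movies: pd.DataFrame, n: int = 200) -> pd.DataFrame:
    """Keep the n highest-grossing movies of each year (hypothetical helper)."""
    # Sort by year, then by gross revenue from highest to lowest.
    ranked = movies.sort_values(["year", "gross"], ascending=[True, False])
    # head(n) keeps the first n rows of each year group, in sorted order.
    return ranked.groupby("year").head(n).reset_index(drop=True)


# Placeholder rows for illustration only (not real box-office figures).
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "year": [2000, 2000, 2000, 2001, 2001],
    "gross": [10.0, 30.0, 20.0, 5.0, 15.0],
})

candidates = top_n_per_year(movies, n=2)
```

With `n=200`, the same call would return the candidate set described above, provided each year has at least 200 rows.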
The data collected via get_top_movies by year yielded the IMDBid (the reference ID used across the different segments of this project) for approximately 90% of the top 200 movies in each year. The remaining 10% of IMDBids, which this procedure did not find, were identified manually. The manually determined IMDBids were recorded in two new files: 'revised_id_data.csv' and 'completedata.csv'.
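One way to combine automatically matched and manually identified IDs is sketched below. It assumes both tables share `title` and `year` columns and that the manual table has one row per movie; these column names, the helper `fill_missing_ids`, and the placeholder IDs are assumptions for illustration, and the actual layout of the CSV files may differ.

```python
import pandas as pd


def fill_missing_ids(scraped: pd.DataFrame, manual: pd.DataFrame) -> pd.DataFrame:
    """Fill missing imdb_id values from a manually curated table (hypothetical helper)."""
    # A left merge keeps every scraped row and attaches the manual ID where one exists.
    merged = scraped.merge(
        manual, on=["title", "year"], how="left", suffixes=("", "_manual")
    )
    # Prefer the automatically found ID; fall back to the manual one.
    merged["imdb_id"] = merged["imdb_id"].fillna(merged["imdb_id_manual"])
    return merged.drop(columns="imdb_id_manual")


# Placeholder data: the second movie has no automatically matched ID.
scraped = pd.DataFrame({
    "title": ["A", "B", "C"],
    "year": [2000, 2000, 2001],
    "imdb_id": ["tt0000001", None, "tt0000003"],
})
manual = pd.DataFrame({
    "title": ["B"],
    "year": [2000],
    "imdb_id": ["tt0000002"],
})

match_rate = scraped["imdb_id"].notna().mean()  # share of IDs found automatically
filled = fill_missing_ids(scraped, manual)
```

Computing `match_rate` before filling gives a quick check on how many IDs the automatic lookup found in a given year.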
Of the top 200 movies from each year, approximately 5% were excluded from further analysis. The top reasons for movie exclusion included:
To construct DataFrames for our movie review and rating data, we first needed to pull this data from the web. We chose to use the following sources:
From these sources, we scraped our review data and stored it in DataFrames of the following structure:
Critic | Normalized Score | Quote | ID | Title | Source | Overall Score | Year |
---|---|---|---|---|---|---|---|
Name of the critic | Score normalized by the maximum possible score for this source | The text of the review | IMDB ID | Title of the movie | Source of the review | The overall score for this movie on this site (if there is one) | Year that the movie is from |
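A DataFrame with the columns described above could be assembled roughly as follows. This is a sketch under stated assumptions: the input record keys, the helper `build_review_frame`, the source name, and the sample review are hypothetical, and normalization is assumed to mean dividing a critic's score by the maximum score the source allows.

```python
import pandas as pd

COLUMNS = [
    "Critic", "Normalized Score", "Quote", "ID",
    "Title", "Source", "Overall Score", "Year",
]


def build_review_frame(raw_reviews, source, max_score):
    """Turn scraped review records into the table layout above (hypothetical helper)."""
    rows = []
    for review in raw_reviews:
        rows.append({
            "Critic": review["critic"],
            # Assumption: normalize by the source's maximum score, giving a 0-1 value.
            "Normalized Score": review["score"] / max_score,
            "Quote": review["quote"],
            "ID": review["imdb_id"],
            "Title": review["title"],
            "Source": source,
            # Not every site publishes an overall score, so this may be None.
            "Overall Score": review.get("overall_score"),
            "Year": review["year"],
        })
    return pd.DataFrame(rows, columns=COLUMNS)


# Placeholder record for illustration only.
raw_reviews = [{
    "critic": "Jane Doe",
    "score": 3.0,
    "quote": "A solid film.",
    "imdb_id": "tt0000001",
    "title": "A",
    "year": 2000,
}]

reviews = build_review_frame(raw_reviews, source="Example Site", max_score=4.0)
```

Dividing by the source's maximum puts scores from sites with different scales (for example, 4 stars versus 100 points) on a common 0 to 1 range.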