Machine Learning Newsletter

IMDB Top 100K Movies Analysis in Depth Part 2


The data comes from IMDB and covers the 100,042 most popularly voted movies. This is the second post in the series; see the first post. I got some great feedback from HN and decided to dig further into movie categories. In this post, I first look at the number of movies per category over time and at the ratings of the categories. Second, I compare popularly voted directors with other directors and give the best directors for popularly voted categories. Finally, I look at the correlation between the categories and run PCA on the movies. As in the first post, I let the data speak for itself rather than explaining every single graph.

Are old movies better than contemporary ones?

Mean Score of Ratings over Year

In the first post, we observed that old movies are consistently rated higher than contemporary ones. Looking at the mean rating of movies per year (ignoring the count) shows this much more clearly. But wait a second: maybe a lot of outliers drag the ratings down for the movies released in the 90s and 00s.
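
The notebook code behind this graph is not shown here, so the following is only a minimal sketch of how a mean rating per year could be computed, using the Python standard library. The rows and the function name `mean_rating_per_year` are hypothetical, not the real IMDB data or the original code.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy rows: (title, year, rating). Not real IMDB data.
movies = [
    ("A", 1950, 8.0),
    ("B", 1950, 7.0),
    ("C", 2000, 6.0),
    ("D", 2000, 7.0),
    ("E", 2000, 2.0),  # one very low rating pulls the 2000 mean down
]

def mean_rating_per_year(rows):
    """Return {year: mean rating}, ignoring how many movies each year has."""
    ratings_by_year = defaultdict(list)
    for _title, year, rating in rows:
        ratings_by_year[year].append(rating)
    return {year: mean(vals) for year, vals in sorted(ratings_by_year.items())}

yearly_mean = mean_rating_per_year(movies)
print(yearly_mean)
```

In the toy rows, a single low-rated movie is enough to move the mean for its year, which is the kind of outlier effect questioned above.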

Median Score of Ratings over Year

Nope, they are still low. The median ratings are not as low as the mean ratings, but they are still low. They are improving, though not to the level of old movies. However, this view is too coarse: there are a number of different categories, and we take the median rating over all of them at once. Maybe some of the categories behave quite differently from movies as a whole.
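
To make the mean versus median comparison concrete, here is a small sketch under the same assumptions as before: standard-library Python, hypothetical toy rows, and a made-up function name `rating_summary_per_year`. It is not the code that produced the graphs.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical toy rows: (title, year, rating). Not real IMDB data.
movies = [
    ("A", 1950, 8.0),
    ("B", 1950, 7.0),
    ("C", 2000, 6.0),
    ("D", 2000, 7.0),
    ("E", 2000, 2.0),  # outlier
]

def rating_summary_per_year(rows):
    """Return {year: {"mean": ..., "median": ...}} for the ratings of each year."""
    ratings_by_year = defaultdict(list)
    for _title, year, rating in rows:
        ratings_by_year[year].append(rating)
    return {
        year: {"mean": mean(vals), "median": median(vals)}
        for year, vals in sorted(ratings_by_year.items())
    }

summary = rating_summary_per_year(movies)
for year, stats in summary.items():
    print(year, stats)
```

For the toy year 2000 the median sits above the mean because the single outlier affects only the mean. A year can still have a low median when most of its movies are rated low, which matches what the graph shows.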

Categories over Time

In this section, I separate the 23 categories into 4 sections (6, 6, 6, 5) based on the category counts. My aim is to look first at the count per category and then track the median rating per category, to get a better understanding of how the movies in each category get rated on IMDB. I gave the total count per category in the first post.
The categories in the following four sections are sorted by total number of movies. The graphs are also sorted by category count. The largest category in each section is at the bottom.
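
The grouping described above can be sketched as follows. This is an illustration under assumptions, not the original notebook code: the rows are hypothetical, a movie may carry several categories, ties in the count are broken alphabetically (my choice), and the helper names `split_categories` and `median_by_category_year` are made up.

```python
from collections import Counter, defaultdict
from statistics import median

# Hypothetical toy rows: (year, rating, categories). Not real IMDB data.
movies = [
    (1950, 8.0, ["Drama", "War"]),
    (1950, 7.0, ["Drama"]),
    (2000, 6.0, ["Drama", "Comedy"]),
    (2000, 5.0, ["Comedy"]),
    (2000, 7.0, ["Comedy", "News"]),
]

def split_categories(rows, section_sizes):
    """Sort categories by movie count (descending) and cut them into sections."""
    counts = Counter(cat for _year, _rating, cats in rows for cat in cats)
    # Assumption: ties are broken alphabetically to keep the order deterministic.
    ordered = sorted(counts, key=lambda cat: (-counts[cat], cat))
    sections, start = [], 0
    for size in section_sizes:
        sections.append(ordered[start:start + size])
        start += size
    return sections

def median_by_category_year(rows):
    """Return {(category, year): median rating}."""
    ratings = defaultdict(list)
    for year, rating, cats in rows:
        for cat in cats:
            ratings[(cat, year)].append(rating)
    return {key: median(vals) for key, vals in ratings.items()}

# The post uses section sizes (6, 6, 6, 5); the toy data only has 4 categories.
sections = split_categories(movies, (2, 2))
medians = median_by_category_year(movies)
print(sections)
print(medians[("Drama", 1950)])
```

With the full data set, `split_categories(movies, (6, 6, 6, 5))` would give four sections in the shape used below.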

First Section

  • Drama
  • Comedy
  • Action
  • Romance
  • Crime
  • Thriller

Second Section

  • Horror
  • Adventure
  • Family
  • Sci-Fi
  • Fantasy
  • Mystery

Third Section

  • Musical
  • War
  • History
  • Animation
  • Western
  • Biography

Fourth Section

  • Music
  • Sport
  • Film-Noir
  • Adult
  • News


In this section, I look at the directors whose movies have a median rating above 7 and a median vote count above 200,000. Although the thresholds are arbitrary, they yield good directors, if not the best. Surprisingly, some directors I consider good are not on the list because they do not meet these requirements. For example, Woody Allen is not on the list because his median vote count is about 24,000, even though his median rating is above 7. I did not take the number of movies into account; maybe I should, but if a director's movies are good and watched (read: voted on), the director should make the list. Ben Affleck is one such director, among others.
In the distribution graphs, turquoise is the mean distribution over all of the directors and purple is the distribution of the director named on the subplot.
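
The selection rule can be sketched in a few lines. The two thresholds come from the text above; everything else (the rows, the director names, and the function `select_directors`) is hypothetical and only illustrates the idea.

```python
from collections import defaultdict
from statistics import median

# Thresholds taken from the text above.
MIN_MEDIAN_RATING = 7
MIN_MEDIAN_VOTES = 200000

# Hypothetical toy rows: (director, rating, votes). Not real IMDB data.
films = [
    ("Director A", 8.5, 900000),
    ("Director A", 8.0, 600000),
    ("Director A", 3.5, 250000),  # one badly rated movie
    ("Director B", 7.5, 24000),   # well rated, but few votes
    ("Director B", 7.8, 30000),
    ("Director C", 6.0, 500000),  # many votes, but rated too low
]

def select_directors(rows, min_rating=MIN_MEDIAN_RATING, min_votes=MIN_MEDIAN_VOTES):
    """Return directors whose median rating and median votes both exceed the thresholds."""
    ratings, votes = defaultdict(list), defaultdict(list)
    for director, rating, n_votes in rows:
        ratings[director].append(rating)
        votes[director].append(n_votes)
    return sorted(
        d for d in ratings
        if median(ratings[d]) > min_rating and median(votes[d]) > min_votes
    )

selected = select_directors(films)
print(selected)
```

In the toy rows, "Director A" passes despite one badly rated movie because the median ignores it, while "Director B" fails on votes alone.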

Rating of Directors

Christopher Nolan is consistently rated high across all of his movies. Surprisingly, Ben Affleck is on the list. The median rating tolerates one bad movie: Guy Ritchie has one movie rated below 4, Swept Away.

Number of Movies over Year

Generally, directors make movies for a decade or two. Steven Spielberg, Tim Burton, George Lucas and James Cameron are the biggest exceptions.
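
As a rough sketch of the quantities behind this observation, the snippet below counts movies per director and year and measures how long each director has been active. The rows and the helper names `movies_per_year` and `active_span` are hypothetical, and "active span" (last release year minus first) is my own simplification.

```python
from collections import Counter, defaultdict

# Hypothetical toy rows: (director, release year). Not real IMDB data.
films = [
    ("Director A", 1975),
    ("Director A", 1993),
    ("Director A", 1993),
    ("Director A", 2011),
    ("Director B", 1998),
    ("Director B", 2005),
]

def movies_per_year(rows):
    """Count movies per (director, year) pair."""
    return Counter(rows)

def active_span(rows):
    """Return {director: years between the first and the last movie}."""
    years = defaultdict(list)
    for director, year in rows:
        years[director].append(year)
    return {director: max(ys) - min(ys) for director, ys in years.items()}

counts = movies_per_year(films)
spans = active_span(films)
print(counts[("Director A", 1993)])
print(spans)
```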

Runtime of Movies
