Machine Learning Newsletter

IMDB Top 100K Movies Analysis in Depth Part 3

This post is the third in the series. See the first and second posts.

Data

IMDB does not provide box office revenue for all movies, so I pulled this information from several sources, Freebase and Wikipedia to name a few. Conflicts in movie name and movie year are generally resolved in favor of IMDB, as it is more reliable. If the IMDB information is missing, I used Wikipedia to resolve the conflict. For box office revenue, Freebase provides only the domestic (strictly USA) gross, so it is not very useful because I wanted to look at worldwide gross revenue. For this, I depend on Wikipedia, IMDB, blogs and some other websites that periodically publish the worldwide gross revenue of movies. However, unlike the data of previous weeks, this data may occasionally contain incorrect values, as the sources are scattered and not very reliable. It is also generally hard to validate box office revenue in a timely manner, since movies may keep generating money through different channels. When sources disagree on revenue, I take the maximum value, in the hope that the maximum is the most up-to-date figure for the movie. By combining the different sources, I could get worldwide gross revenue for only 6672 movies. 95% of the movies in this list are among the 12000 most-voted movies on IMDB. Therefore, the revenue distribution graphs are biased towards movies that have a high number of votes on IMDB.
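The "take the maximum when sources disagree" rule can be sketched in a few lines. This is only an illustration, not the code used for the post: the record layout, the function name `merge_max_revenue`, and all titles and figures below are made up.

```python
# Hypothetical records: (source, title, year, worldwide gross in USD).
# Titles and figures are invented for illustration only.
records = [
    ("wikipedia", "Movie A", 2001, 120_000_000),
    ("blog", "Movie A", 2001, 135_000_000),
    ("imdb", "Movie A", 2001, None),  # this source has no revenue figure
    ("wikipedia", "Movie B", 1995, 40_000_000),
]


def merge_max_revenue(rows):
    """Keep the largest reported revenue for each (title, year) key."""
    best = {}
    for _source, title, year, revenue in rows:
        if revenue is None:
            continue  # skip sources without a figure
        key = (title, year)
        best[key] = max(best.get(key, 0), revenue)
    return best


merged = merge_max_revenue(records)
for (title, year), revenue in sorted(merged.items()):
    print(f"{title} ({year}): ${revenue:,}")
```

Keying on (title, year) assumes name and year conflicts have already been resolved as described above.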

Freebase provides the age and height of actors and actresses at the time the movie was released. There are 450669 characters in total, but only 45264 of them have complete information. That said, the ratio gets higher when we look only at height and age. I took this information and looked at the age and height distributions of actresses and actors, as well as the number of movies per year whose main characters are actresses and actors.

Gender Distribution for Movies

In [89]:
 

Apparently, there are a lot of actresses and actors aged 25 and 50! As one can imagine, the age distribution looks quite like a Gaussian distribution. Let's fit a Gaussian kernel to smooth the distribution and remove the outliers (25 and 50).
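As a rough idea of what Gaussian kernel smoothing does to such spikes, here is a minimal sketch using only the standard library. The counts are invented, and the bandwidth of 2 years is an assumption; the post does not state which bandwidth or library was used.

```python
import math


def gaussian_smooth(counts, bandwidth=2.0):
    """Smooth {age: count} with a Gaussian kernel and return {age: density}."""
    total = sum(counts.values())
    norm = 1.0 / (total * bandwidth * math.sqrt(2 * math.pi))
    smoothed = {}
    for x in range(min(counts), max(counts) + 1):
        weight = sum(
            c * math.exp(-0.5 * ((x - age) / bandwidth) ** 2)
            for age, c in counts.items()
        )
        smoothed[x] = norm * weight
    return smoothed


# Made-up counts: flat at 10 per age, with artificial spikes at 25 and 50.
counts = {age: 10 for age in range(20, 56)}
counts[25] = 50
counts[50] = 50

smoothed = gaussian_smooth(counts)
raw_ratio = counts[25] / counts[24]
smooth_ratio = smoothed[25] / smoothed[24]
print(f"raw 25/24 ratio: {raw_ratio:.2f}, smoothed: {smooth_ratio:.2f}")
```

A larger bandwidth flattens the spikes further, at the cost of blurring real features such as the jump between 17 and 18.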

In [90]:
 

This is much better: we do not have outliers, the kernel looks quite smooth, its peak is around 30, and the biggest jump occurs from age 17 to 18. But let's look at the gender-specific distributions to see who these outliers are.

In [91]:
 

25 must be a magic number: both actresses and actors peak at 25. There is something wrong about the peaks for actors, though. If you remove those outliers, the two distributions look like two different Gaussian distributions and, as we did before, we could fit a Gaussian kernel to smooth them a little bit.

In [92]:
 

This is much nicer. Actors in their late 30's played in a lot of movies compared to other ages, whereas for actresses the same holds in their early 20's. Note that the distribution does not reveal the absolute number of movies. As we will see later, the number of movies whose main character is an actor is almost twice the number of movies whose main character is an actress. Also note that the distribution is much more widespread for actors, whereas the actresses' distribution is quite skewed. We could take the difference of the kernels to see the difference better.
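Taking the difference of two kernels amounts to evaluating both density estimates on the same age grid and subtracting them point by point. A minimal sketch, with invented ages and an assumed bandwidth of 3 years (the helper `kde` is hypothetical, not the function used for the plots):

```python
import math


def kde(samples, grid, bandwidth=3.0):
    """Gaussian kernel density estimate of `samples`, evaluated on `grid`."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        for x in grid
    ]


# Invented ages at release time, for illustration only.
actress_ages = [21, 22, 23, 24, 25, 26, 28, 30, 33, 40]
actor_ages = [28, 32, 35, 36, 38, 39, 41, 45, 52, 60]

grid = list(range(15, 81))
actress_density = kde(actress_ages, grid)
actor_density = kde(actor_ages, grid)

# Positive values: actresses are relatively more frequent at that age.
diff = {x: f - m for x, f, m in zip(grid, actress_density, actor_density)}
print(f"age 22: {diff[22]:+.4f}, age 45: {diff[45]:+.4f}")
```

Because each density is normalized by its own sample size, the difference compares shares within each group, not absolute movie counts.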

In [93]:
 

In this one, before age 32 the share of actresses (relative to the total number of actresses) is larger than the corresponding share of actors, whereas from 32 to the late 70's actors take roles in movies more often than actresses. After the late 70's, I guess we could only say that women live longer than men.

In [94]:
 

Actors are unsurprisingly taller than actresses and, surprisingly, the height distribution is not Gaussian, which is quite interesting.

In [95]:
 

The kernels are not good either, as expected.

In [98]:
 

The total number of movies whose main characters are actors is almost twice as high as the number of movies whose main characters are actresses. Although the ratio stays more or less the same, the absolute difference is getting larger over the years. Note that if a movie has more than one main character, all of them are counted as main characters for the movie.
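The counting rule for movies with several main characters can be sketched as follows. The input layout and the data are made up, and the sketch assumes every lead adds one to the count for its gender, which is one reading of the rule above.

```python
from collections import Counter

# Hypothetical input: (release year, genders of the main characters).
movies = [
    (2000, ["M"]),
    (2000, ["M", "F"]),  # two leads: both are counted
    (2001, ["F"]),
    (2001, ["M", "M"]),
]

counts = Counter()
for year, genders in movies:
    for gender in genders:
        counts[(year, gender)] += 1

for (year, gender), n in sorted(counts.items()):
    print(year, gender, n)
```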

In [105]:
 

The revenues of movies are clearly getting higher over the years. We can also see a lot of movies that generate close to one billion dollars.

In [106]:
 

The median revenue is quite conservative, whereas the mean revenue is getting larger and larger over the years, because very successful movies in terms of revenue have been made in recent years, as seen in the previous scatter plot.
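The gap between mean and median is what one would expect when a few blockbusters sit far above the rest. A small sketch with invented revenues shows the effect, along with the per-year standard deviation:

```python
from collections import defaultdict
from statistics import mean, median, stdev

# Invented (year, worldwide gross in USD) pairs; 2010 contains one blockbuster.
movies = [
    (1990, 4e6), (1990, 15e6), (1990, 25e6),
    (2010, 5e6), (2010, 20e6), (2010, 30e6), (2010, 1e9),
]

by_year = defaultdict(list)
for year, revenue in movies:
    by_year[year].append(revenue)

summary = {
    year: {"mean": mean(v), "median": median(v), "std": stdev(v)}
    for year, v in sorted(by_year.items())
}

for year, stats in summary.items():
    print(year, {name: f"{value:,.0f}" for name, value in stats.items()})
```

In this toy data a single one-billion-dollar movie pulls the 2010 mean above ten times the median, while the median barely moves.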

In [107]:
 

The standard deviation is also getting larger over the years.

In [109]:
 

Votes and revenue are, unsurprisingly, quite correlated.

In [121]:
 

The correlation between rating and revenue is not as strong as the one between revenue and votes. However, one thing to note is that for ratings higher than 5, there is almost a rectangle between 5 and 8 in rating and between $\$1$M and $\$200$M in revenue, which looks almost like a two-dimensional uniform distribution.

In [111]:
 

If we fit a line to votes and revenue per decade, we get the above graph. Recent decades have a higher correlation between revenue and votes compared to older decades.
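A per-decade fit of this kind can be sketched with ordinary least squares and the Pearson correlation coefficient. The helper `linear_fit` and all numbers below are invented for illustration; they are not the post's data, and the post does not say which fitting routine was used.

```python
import math
from collections import defaultdict


def linear_fit(xs, ys):
    """Least-squares line y = slope * x + intercept, plus Pearson r."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Invented (year, votes, revenue in USD) rows.
movies = [
    (1982, 1000, 10e6), (1985, 2000, 12e6), (1988, 3000, 11e6),
    (2001, 1000, 10e6), (2005, 2000, 20e6), (2008, 3000, 30e6),
]

by_decade = defaultdict(lambda: ([], []))
for year, votes, revenue in movies:
    decade = year // 10 * 10
    by_decade[decade][0].append(votes)
    by_decade[decade][1].append(revenue)

fits = {decade: linear_fit(xs, ys) for decade, (xs, ys) in sorted(by_decade.items())}
for decade, (slope, intercept, r) in fits.items():
    print(f"{decade}s: slope={slope:.1f} USD per vote, r={r:.2f}")
```

Revenue spans several orders of magnitude, so fitting on a log scale may be worth trying as well; that is a suggestion, not something the post does.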

In [126]: