Cosine Similarity Matrix of Science Fiction TV Shows

Do you lie awake at night thinking of Star Trek or Stargate or Farscape, or maybe Babylon 5? Well, I do, and so I wanted to see if there is any real similarity between the episodes of these TV shows.
To get a real statistical comparison between all these shows, we can take the synopsis of each episode and compare them as term vectors using cosine similarity. This allows us to create a cosine similarity matrix for all the science fiction TV shows. First, we get the synopses of all the Star Trek, Stargate and Farscape episodes off of Wikipedia and stuff them into a nice normalized MySQL database. Now we can compute the cosine similarity between every pair of episodes using Python's sklearn library. With this library we can compute the tf–idf weight. This weight, short for term frequency–inverse document frequency, tells us how important a word is to one document as compared to the entire collection (the corpus). Besides single words, we also look at the importance of n-grams, which are sequences of n consecutive words. The sklearn library lets you choose which n-gram sizes to include. For this exercise, we are using 2-grams and 3-grams.
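As a rough illustration of the idea, here is a minimal sketch using sklearn's `TfidfVectorizer` and `cosine_similarity`. The three synopses are made up for the example, and the settings `ngram_range=(1, 3)` and `stop_words="english"` are assumptions, not necessarily what was used for the real matrix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up one-line synopses, for illustration only (not the Wikipedia text).
synopses = [
    "The crew discovers a wormhole and meets an alien fleet.",
    "An alien fleet attacks the station near the wormhole.",
    "The captain is put on trial for mutiny.",
]

# Assumption: single words plus 2-grams and 3-grams, English stop words removed.
vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
tfidf = vectorizer.fit_transform(synopses)  # sparse matrix, one row per synopsis

# Entry [i, j] is the cosine similarity between synopsis i and synopsis j.
similarity = cosine_similarity(tfidf)
print(similarity.round(2))
```

The diagonal is 1 (every synopsis is identical to itself), and the first two synopses score higher against each other than against the third because they share terms such as "wormhole" and "alien fleet".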

The results can be found by clicking on the ‘Synopsis’ link on any episode at this page: Science Fiction Episodes.

An example

A good place to start would be by comparing these Star Trek TOS episodes:

‘The Cage’ and ‘The Menagerie’ were basically the same show with small differences. And of course everyone knows William Shatner was not in ‘The Cage’ but he was in ‘The Menagerie’.

Cosine Similarity Matrix Creation

To create the actual cosine similarity matrix, we query the data from a MySQL database and then compute the cosine similarity of the tf-idf vectors for every pair of episodes. The code for doing that can be found here. Be forewarned, if you run this code at home, it will take some time. I have an Intel Core i5 with 16GB of RAM and it took almost an hour.
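The overall flow might look something like the sketch below. It is not the linked code: an in-memory `sqlite3` database stands in for MySQL so the example is self-contained, and the `episodes` table, its columns, and the sample rows are all hypothetical.

```python
import sqlite3

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# sqlite3 stands in for MySQL here; table, columns and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE episodes (id INTEGER PRIMARY KEY, title TEXT, synopsis TEXT)"
)
conn.executemany(
    "INSERT INTO episodes (title, synopsis) VALUES (?, ?)",
    [
        ("Episode A", "Aliens hold the captain captive on a distant planet."),
        ("Episode B", "The captain is held captive by aliens; a trial follows."),
        ("Episode C", "A gate opens to a desert world ruled by a false god."),
    ],
)
rows = conn.execute("SELECT title, synopsis FROM episodes ORDER BY id").fetchall()
titles = [row[0] for row in rows]
synopses = [row[1] for row in rows]

# Assumption: words plus 2-grams and 3-grams.
tfidf = TfidfVectorizer(ngram_range=(1, 3)).fit_transform(synopses)
matrix = cosine_similarity(tfidf)  # dense n x n array

# For each episode, find its closest match, ignoring the self-match on the diagonal.
ranked = matrix.copy()
np.fill_diagonal(ranked, 0.0)
best = ranked.argmax(axis=1)
for i, j in enumerate(best):
    print(f"{titles[i]} is most similar to {titles[j]} ({matrix[i, j]:.2f})")
```

Because the matrix holds one entry for every pair of episodes, its size grows with the square of the number of episodes.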
