Cosine Similarity Mysql Table Creation

To create a cosine similarity Mysql table, we’ll need these items:

• Database table with unique IDs and synopses for each episode.
• Another table that will store the actual IDs and scores.
• Python with the usual libraries installed: re, math
• Decent computer with enough RAM (I’m using a Intel(R) Core(TM) i3 CPU 530 @ 2.93GHz with 8GB RAM.

To get all the episodes is a bit of a chore. You could visit wikipedia and do a lot of copying and pasting, or you could use something like Perl’s Mechanize library. I opted to use the Beautiful Soup library for Python, which is actually quite fun to use.

Ok, so you’ve got the data in your database and you’re ready to create another database table that will hold all these scores. Lots of the good code comes from stackoverflow. Here is the code to do it.

import MySQLdb
import numpy as np
import re, math
CONNECTION = MySQLdb.connect("localhost", "imauser", "mypass", "mydb")
CURSOR = CONNECTION.cursor()
WORD = re.compile(r'\w+')

GETDATA = """select episode_id, trim(synopsis) from episode where synopsis is not null
order by episode_id"""

def insert_record(x, y, score):
ADDMATCH = """insert into matching
(episode_x_id, episode_y_id, score)
values (%s, %s, %s)"""
try:
CURSOR.execute(ADDMATCH, [x, y, score])
CONNECTION.commit()
except:
print "Something went wrong:"
CONNECTION.rollback()

def get_cosine(vec1, vec2):
intersection = set(vec1.keys()) & set(vec2.keys())
numerator = sum([vec1[x] * vec2[x] for x in intersection])

sum1 = sum([vec1[x]**2 for x in vec1.keys()])
sum2 = sum([vec2[x]**2 for x in vec2.keys()])
denominator = math.sqrt(sum1) * math.sqrt(sum2)

if not denominator:
return 0.0
else:
return float(numerator) / denominator

def text_to_vector(text):
words = WORD.findall(text)
return Counter(words)

def main():
""" run the main query, create the dictionary,
process, dictionary, sort it and run the inserts"""
episodes = {}
CURSOR.execute(GETDATA)
rows = CURSOR.fetchall()
_u = lambda t: t.decode('UTF-8', 'replace') if isinstance(t, str) else t

for row in rows:
episode_id = row
text = _u(row)
episodes[episode_id] = "".join(text)

for x,v1 in episodes.iteritems():
for y,v2 in episodes.iteritems():
if (x != y):
vector1 = text_to_vector(v1)
vector2 = text_to_vector(v2)
score = get_cosine(vector1, vector2)
score = "%10.5f" % score
print x, y, score
insert_record(x, y, score)

if __name__ == '__main__':
main()