alt In this post I'll outline how to use a Neo4j graph database to generate user recommendations for a data set consisting of users, products, and user ratings for those products.

For my data set I'm using a database of 30,000 different beers (pulled from brewDB's open API), and 100 users (I asked facebook friends to rate some beers).

For the recommendation engine, I'm going to explain how to implement a user based collaborative filtering algorithm using Neo4j's Cypher query language.

The approach is very straightforward. We will first calculate the similarity(distance) between each user, based on each user's individual beer preferences. Then, for a given user, we will find the N most similar(closest) users to them, and we will aggregate those users' beer preferences in order to calculate beer recommendations for the current user. If this seems cloudy right now, bare with me, we will take it step by step.

Step 1: Define database structure

To keep things simple, our graph database is going to contain user nodes and beer nodes. User nodes will be connected to beer nodes through 'rating' relationships. Every time a user rates a beer, we draw a new 'rating' relationship between that user and the beer they rated. Users can rate beers on a scale of 1-5 stars, so the 'rating' relationship will have a 'rating' property on it equal to the number of stars the user rated the beer.

Step 2: Calculate user similarities

The next thing we're going to do is calculate a similarity index between each user. In Neo4j we do this by creating a 'similarity' relationship between each user node. As I mentioned before, relationships in Neo4j can have properties, so each similarity relationship will have a 'similarity' property which will store the similarity index (a number between 0-1 that represents how similar two users are to each other). That index can be calculated using any kind of similarity/distance metric you choose. We are going to use Euclidean distance.

A little bit of math first

Conceptually, the euclidean distance between two points represents the length of the line segment connecting those two points. If we envision two users to be represented by two points (vectors) in multi-dimensional space, we can use euclidean distance as a measure of how similar (close) they are to each other.

For our purposes we can think of two users as being two points (vectors) in euclidean n-space, where each user's coordinates are determined by their beer ratings. For example, if user p gave three beers 2, 5, and 4 stars, respectively, that user could be represented as a cartesian coordinate p=(2,5,4). Likewise, if user q gave those same three beers 5, 4, and 3 stars, respectively, then user q could be represented as a cartesian coordinate q=(5,4,3). Both points exist as vectors in multi-dimensional space, and the distance between them can be calculated by using the pythagorean formula:

alt

Translating this to Cypher queries

Here's what we're going to do. We're going to use the pythagorean formula to define similarity ratings between every user in our database. We'll do that by looking at each user one by one. For each user, it will be a five step process:

1) Find all the users who have rated at least one beer that the current user has also rated. We will call these users neighbors.

2) For each neighbor, find all the beers they've rated that the current user has also rated.

3) For the set of n beers, n being the number of beers that both users rated, use the pythagorean formula to determine euclidean distance between the two users. p1, p2...pn will represent the current user's beer ratings, and q1, q2...qn will represent the other user's beer preferences for the same beers.
*note we're going to alter the formula slightly by dividing each (pn-qn)^2 by n, this way we will normalize for differences in dimension.

4) Divide the normalized euclidean distance (which we just calculated in step 3) by 4 so that we get a clean number indexed to be between 0-1.

5) Create a similarity relationship between the neighbor and the current user, and set the similarity property to be equal to the new similarity index.

This is what the code will look like in Cypher:
//Step 1: 

MATCH (u1:User {username: ({username})})                               -[x:Rating]-> (b:Beer)<-[y:Rating]-(u2:User)

//Step 2: 

WITH count(b) AS commonBeers, u1.username AS user1, u2.username AS user2, u1, u2,

//Steps 3 & 4 & 5: 

collect((x.rating-y.rating)^2) AS ratings,
collect(b.name) AS beers
WITH commonBeers, beers, u1, u2, ratings
MERGE (u1)-[s:Similarity]->(u2) SET s.similarity =                  1-(SQRT(reduce(total=0.0, k in extract(i in ratings | i/commonBeers) | total+k))/4)

Step 3: Generate Recommendations

Now that we have our user similarities drawn, in order to generate recommendations for a given user we just have to traverse the graph to find all the users with the highest similarity relationships, find the beers that they like, that our current user hasn't liked yet, and then aggregate the ratings among all of the similar users to provide recommendations to the current user. This will be another 5 step process:

1) Find all of the users who share a similarity relationship with the current user, who have a similarity index greater than 0.6

2) Find all of the beers that any of the highly similar users have liked (rating of 4 or 5), that the current user hasn't rated yet.

3) For each of those beers, aggregate all of the ratings from all of the users who rated that beer, to get an average rating. Also aggregate all of the user similarity scores of the users who rated that beer to get an average user similarity score.

4) Order all of the beers, first by the number of users who rated it, then by the average rating, then by the average user similarity score.

5) Provide the list of beers back to the user as recommendations. Use average rating as the predicted number of stars for the user.

This is what the code will look like in Cypher:
//Step 1:

MATCH (u1:User)-[r:Likes]->(b:Beer),
(u1)<-[s:Similarity]-(u2:User {username:({username})})
WHERE NOT ((u2)-[:Likes]->(b))
WITH b, r.rating AS rating,s.similarity as similarity
WHERE similarity > 0.6

//Step 2:

WITH b, COLLECT(similarity) AS similarities, COLLECT(rating) AS ratings

//Step 3:

WITH REDUCE(x = 0, i IN similarities | x+i)*1.0 / LENGTH(similarities) AS avgSimilarity,
b,similarities, REDUCE(x = 0, i IN ratings | x+i)*1.0 / LENGTH(similarities) AS avgRating, ratings

//Step 4:

ORDER BY LENGTH(ratings) DESC, avgRating DESC, avgSimilarity DESC
WHERE avgRating >3

//Step 5:
RETURN b AS Beer, avgRating AS Recommendation

That's it. We've finished. Our users will now be able to get individualized recommendations based on their own beer preferences and those of other users that are similar to them.

comments powered by Disqus