AgensGraph Tutorial
A Simple Recommendation System

Before Getting Started!

The objectives of this tutorial are to show how easy and natural it is to use the graph query language Cypher to analyze data, with a specific example of a recommendation system.

In order to load and use data as guided in this tutorial, AgensGraph should be installed first. You can install/use our visualization tool, AgensBrowser, for easier and quicker visual analysis of the sample data.


The motivation for the recommendation system is that it has been demonstrated across many diverse disciplines such as eCommerce, dating, social media and entertainment that recommendations can keep the customer more engaged, stay for longer and ultimately purchase more goods and services or be exposed to more advertising.

Even with a minimum of data, the recommendation system is of value. It is also the easiest move in the field of data analytics that goes from merely descriptive or diagnostic analytics to predictive and prescriptive analytics, where we can influence and direct customer behavior. It is also surprisingly easy to implement a basic recommendation system.

Contents of this tutorial

  1. Import CSV to AgensGraph
  2. Build Ratings Edges
  3. Build Similarity Edges
  4. Build Recommendations Edges

01. Import CSV to AgensGraph

The data is first exported as a .csv file (named purchase.csv) from, for example, excel or some other source. The data is very simple and in fact contains only 3 columns named cust, bought, and product. The first three rows of the data are shown below. The first row indicates that Customer C1 has bought 4 units of Product P1.

Cust Bought Product
C1 4 P1
C2 7 P2
C3 3 P3
... ... ...

To import the data we first create an SQL table called purchases to store the data. Of course, if we already have the data in PostgreSQL this step is not necessary.

CREATE TABLE purchases (
                           cust VARCHAR(20), qty INT, prod VARCHAR(20)
                        );

Then a single command loads the data from the file into the SQL table.

COPY purchases FROM 'C:\purchase.csv' DELIMITER ',' CSV HEADER;

Now we create a graph to store the data.

-- CREATE graph
                                CREATE GRAPH basic_recommend;
                                SET GRAPH_PATH = basic_recommend;

The graph schema will contain vertices (i.e. nodes) for customers, vertices for products and edges between them containing the quantity of each product that each customer has bought. The quantity is stored as a property in the edge.

-- CREATE 2 vertex and 1 edge labels
                            CREATE VLABEL cust; CREATE VLABEL prod; CREATE ELABEL buys;

We will refer to properties in the vertices or edges by adding a colon to the property name and will refer to labels by preceding the label with a colon. So q: means property q and :cust means label cust.

Creating the graph consists of looking at each row in the table in turn. For each new cust value, it will create a vertex of label :cust with a property c: containing the customer name. For each new prod value, it will create a vertex of label :prod with a property p: containing the product name. Then it will join those 2 vertices with an edge :buys with property q: containing the quantity. There is no duplication of :cust or :prod vertices. All properties are enclosed in curly brackets {} inside either a vertex designated by round brackets (), or inside an edge designated with square brackets [].

-- CREATE vertices
                            LOAD FROM purchases AS row -- read each row from the purchases TABLE
                            MERGE (x :cust { c: row.cust } ) -- CREATE cust vertices
                            MERGE (y :prod { p: row.prod } ) -- create product vertices 
                            CREATE (x) - [ :buys { q: row.qty } ] -> (y); -- CREATE edges [:buys { q: 4}]
                            
                            /* Sample query. The sql TABLE and the graph now have the same data. Display both
                            SELECT * FROM purchases;
                            MATCH (c)-[b]->(p) return c.c AS Cust, b.q AS Qty, p.p AS Prod order by Cust, Prod;
                            
                            -- Sample query. For each customer, find the total qty bought
                            MATCH (c)-[b:buys]->(p)return c.c, sum(b.q) AS total_qty;
                            */
                            

Below is the resulting graph, where :cust vertices are red, :buys edges are green and :prod vertices are blue.

02. Build Ratings Edges

We will now analyze the data and create a recommendation engine, which will recommend products to customers based on the purchases of other customers who made similar choices. This is called collaborative-based filtering. It relies on a principle of human behavior; that people who agreed in the past will agree in the future. If one member of the group is presented with a product that the other members have already purchased, then he/she should view the product favorably, with a higher probability of purchase.

Collaborative-based filtering begins with a matrix of ‘ratings’ of products by customers. These can be actual ratings (like 1 to 5 stars) or inferred ratings based on customer behavior. For example, if the products are consumables like batteries or food, that would normally be bought repeatedly, the ratings can capture how often a customer ‘repeats buys’ a product. In our data, we only have how many of each product each customer bought and so we will create the rating for each customer by simply dividing the quantity of each product bought by the maximum of products purchased by that customer. This will give a rating of 0 to 1, where the rating 1 is assigned to the customer’s ‘favourite’ product.

We will store the rating as a property r: in an edge that connects each customer to each product purchased.

CREATE ELABEL rates;

                            MATCH (c :cust)-[b :buys]->(p :prod)
                            WITH c, max(b.q) AS max_rating
                            MATCH (c)-[b :buys]->(p)
                            WITH c, p, b.q * 1.0/max_rating AS normalised_rating
                            MERGE (c)-[:rates {r: normalised_rating}]->(p);
                        

03. Build Similarity Edges

Now we have to somehow measure how similar customers are to each other. We will not form distinct groups but rather, for each person identify those most similar. The generally accepted method is called, ‘cosine similarity’. This method considers each person as a vector in a space whose coordinates are the ratings of each rated product. The vector points in some direction and other people who bought roughly the same products will point roughly in the same direction. If we consider the angle between these two vectors then a small angle means similar and a larger angle means less similar. We can actually evaluate the cosine of this angle and call it the similarity between the 2 customers. Remembering that cos(0)=1, we can see that similar customers will have cosine similarity value close to 1. In contrast, cos(90 degrees)=0 so other very different customers will have a similarity close to 0.

We calculate the cosine similarity from the dot product relationship

x·y = |x||y|Cos(θ)

|x| means the length or norm of vector x and is the usual Euclidean distance; the square root of the sum of the squares of the components. We first store this value for each customer as a property in the vertex called length. The dot product is the sum of the products of the components of each vector taken one at a time

x·y = ∑xiyi

The dot product divided by the products of the lengths gives us the cosine of the angle between the vectors, i.e. the similarity score.

We will create an edge :cos_sim between every pair of customers and store the similarity in a property s:


                        MATCH (c)-[r :rates]->(p)
                        WITH c, p, r.r^2 AS r_sq
                        WITH c, sum(r_sq) AS sum_r_sq
                        WITH c, sqrt(sum_r_sq) AS sqrt_sum_r_sq
                        SET c.length = sqrt_sum_r_sq; -- for each cust store their rating length in the node
                        
                        CREATE ELABEL cos_sim;
                        MATCH (x :cust), (y :cust)
                        WHERE id(x) < id(y) -- find 2 different customers, also if we do it for C1 and C2 don'	t do C2, C1 later. 
                        MATCH (x)-[xx :rates]-(p) MATCH (y)-[yy :rates]-(p) -- find the products they both rated
                        WITH x, y, p, xx.r * yy.r AS term_dot_product
                        WITH x, y, sum(term_dot_product) AS numerator
                        WITH x, y, numerator, x.length * y.length AS denominator
                        WITH x, y, numerator/denominator AS similarity
                        MERGE (x)-[:cos_sim {s: similarity  }]->(y);
                        
                        -- Show the similarity edges
                        MATCH (x)-[sim :cos_sim]->(y)
                        RETURN x.c as "Cust x", round(sim.s, 2) as cos_sim, y.c as "Cust y" order by "Cust x", "Cust y";

                        

04. Build Recommendations Edges

We will now create a recommendation score between each customer and each product. For a given customer and product, we will consider how other customers rated that product. But we want to favor the opinion of similar customers and diminish the opinion of dissimilar customers. That is, we want to weight the rating with the similarity. If we want to calculate the recommendation score for customer Ci and product Pj, we take every other customer’s rating for Pj and multiply by Ci’s similarity to that customer. We can then just take the weighted average as the recommendation score of the given product for the given customer.

We create yet a new edge :rec into which the recommendation score will be stored as property rs:.


                        /* CREATE recommend edge */
                        CREATE ELABEL rec;
                        MATCH (x :cust), (y :cust), (p :prod) WHERE id(x) < id(y)
                        MATCH (x)-[xy :cos_sim]-(y)
                        MATCH (y)-[r :rates]->(p)
                        WITH xy, x, p, x.c AS Custa, y.c AS Custb, p.p AS Prod, xy.s*r.r AS crec
                        WITH x, p, Custa, sum(crec) AS rec, sum(xy.s) as Sim_sum
                        WITH x, p, rec/sim_sum as wrec
                        MERGE (x)-[:rec {rs: wrec}]->(p);
                        

Finally, we can show the recommended products for each customer, with the most recommended product on the top.


                        MATCH (x)-[r :rec]->(p)
                        WHERE NOT EXISTS ((x)-[:buys]->(p))
                        RETURN x.c AS cust, round(r.rs,3) AS rec, p.p AS prod ORDER BY cust, rec DESC, prod;
                        

05. Refinement suggestions

There are many ways to improve the recommendation engine. You can, for instance, score each product by the margin made on the product so as to favor higher-margin sales. This involves adjusting each recommendation by the margin score. Other factors to add can be a quality score for each product by for example counting the number of returns, or promotional score for products of vendors you wish to promote.

Collaborative filtering may be used alone, or together with content-based filtering which considers similarity among products (rather than similarity among people). More sophisticated recommendation systems can factor search and browsing history or classification methods based on other information explicit or implied about the customers (age, gender, etc).

06. Conclusions

This tutorial has shown how easy loading data from an external source into PostgreSQL is and that modeling and loading the said data into a graph can also be done in a single query.

Using a graph database is a perfect way to store data because it is capable of connecting data and storing relationships between data. For example, between customers and products we created edge :buys, :rates and finally the recommendation edge :rec. We also created edges between customers :cos_sim that captured the similarity score between customers. Notice that for sparse arrays, like capturing similarity, graphs are a perfect and natural solution. No tables need to be created or schema defined. This is a really natural way to store relationships between objects. Notice also the simplicity of the ASCII art way Cypher represents relationships as ()-[]->(), in the example of (vertex)-[edge]->(vertex).

For queries or further information please contact joe.fagan@bitnine.net