One of the advantages of graph databases is that they are very visual by nature. Graphs help to illustrate the relationships between entities through nodes, drawing connections between people and objects.
Benefits of importing IMDB to AgensGraph
Here at AgensGraph, we have written a script that ports IMDB data into AgensGraph, so that it can be presented in a more visual format. In this post, we will explore how easy and natural importing the IMDB into AgensGraph is by using a very flexible importing method that combines SQL with LOAD clause in Cypher. In contrast to other graph databases which support importing from only CSV files and inflexible methods, AgensGraph has a powerful importing method that uses SQL to form vertexes and edges. Also, AgensGraph can handle JSON format properties. Hence, when forming properties such as IMDB, it is very easy to use nested structures and arrays.
For install instructions to the importing scripts, go here.
Relational Schema for IMDB
Picture courtesy of user ofthelit on Stackoverflow
Relational Schema of IMDB Info
Because IMDB’s file format is not easy to read and parse, rather than implementing the file directly, we use an additional step to load it to relational tables. For this project, we used IMDBpy (LINK) to load relational data into AgensGraph in relational form. The above figure is the relational schema which IMDBpy created. This schema is somewhat complicated, but essentially, there are 4 basic entries: Production, Person, Company and Keyword. Because there are many N-to-N relationships between these entities the relational schema has more tables than the number of entities. This makes the schema harder to understand. For example, a person can be related to many movies and a movie can have many characters.
Concise Graph Modeling
From there, we developed our own graph schema, with Production, Person, Company and Keyword as our nodes, or end data points.
Productions lie at the “center” of the graph, with everything leading to them. Keywords describe Productions, Persons and Companies are credited for their contributions to Productions. Productions are linked to other productions as well.
Simplified Graph Database Schema
With the data in graph form, one can easily see the connections between all of the nodes. The data can be visualized as a network and querying the data with Cypher allows users to explore the connections between entities.
Compared to the relational schema of IMDB, the graph schema is much simpler to understand. By merging related information for the main entities into nodes, we are able to access all relevant information to that node through that node, rather than having to match IDs across tables to get information that we need. If we want to examine how a node relates to another node, we can query its edges to see the connections it forms. Being able to visually “draw a connection” from one node to another helps to illustrate how they are connected.
Furthermore, the labels of the edges describe how the nodes are connected. Edge labels in the IMDB Graph describe what kind of connection is formed, and pertinent information may be stored in attributes in the edges. For example, for the connections ACTOR_IN and ACTRESS_IN, we store role data, such as character name and character id.
To make vertexes’ and edges’ properties we use Views which joins related tables. The data is migrated into a graph format by querying the relational data using selects and joins into a single table with all of the necessary information for creating each node.
For example, here is the SQL query used to create the jsonb_keyword view:
CREATE VIEW jsonb_keyword AS SELECT row_to_json(row(keyword)) AS data FROM keyword;
We use a view to make importing queries simpler. Once this view is created, it’s content can be migrated into the graph. After the graph is created, the graph_path is set, and the VLABEL is created, we can use the convenient LOAD keyword to load the json values from the relational table into the graph:
LOAD FROM jsonb_keyword AS keywords CREATE (a:Keyword = data(keywords) );
Note that here LOAD is used to load data in from a relational table, but LOAD can also be used to load data from external sources as well.
Creating edges is a similar process. We load edges from the tables that store id tuples of the between the entities after creating their ELABELs:
LOAD FROM movie_keyword AS rel_key_movie MATCH (a:Keyword), (b:Production) WHERE a.id::int = (rel_key_movie).keyword_id AND b.id::int = (rel_key_movie).movie_id CREATE (a)-[:KEYWORD_OF]->(b);
As you can see, AgensGraph is not restricted to the CSV format when importing data. We are able to import relational data into its graph portion using the LOAD feature and SQL statements to refine our data sets.
How is information stored?
Most of the pertinent information is held in the nodes (vertexes). Nodes are labelled either as Productions, Persons, Companies or Keywords, and their relative information is stored as JSONs. Since IMDB information is often constantly being added to, many fields for certain entities are left incomplete. JSON is semi-structured, so if an entity does not have a certain piece of information, rather than having a field and marking it as NULL, the field will not exist at all.
We also use nested JSON arrays to store data that may have multiple fields, such as quotes that persons may have said, or alternate titles to productions. This makes it possible to store “duplicate” fields in each node.
How can this information be used?
In the graph IMDB database, querying between entities is very easy to get a hang of. Using the Cypher Query Language, a user can find things such as all actors that acted in a certain production, or all productions that a person has worked on, or all other companies that have worked with a certain company on any production. Graph database strength is the simplicity of visualizing the data. There are many ways you can query a graph database to find what you need!
Find the name of all actors that acted in Night at the Museum:
MATCH (a:Person)-[:ACTOR_IN]->(b:Production) WHERE b.title = 'Night at the Museum' RETURN a.name,b.title;
name | title -----------------------+--------------------- Asprinio, Stephen | Night at the Museum Blais, Richard | Night at the Museum Bougere, Teagle F. | Night at the Museum Bourdain, Anthony | Night at the Museum Cherry, Jake | Night at the Museum Cheng, Paul Chih-Ping | Night at the Museum ... (56 rows)
Find all productions that Ben Stiller has worked on:
MATCH (a:Person)-[b]->(c:Production) WHERE a.name = 'Stiller, Ben' RETURN a.name,label(b),c.title;
name | label | title -------------+-------------+----------------------------------------------- ... Stiller, Ben | actor_in | The Heartbreak Kid: The Egg Toss Stiller, Ben | producer_of | The Hardy Men Stiller, Ben | actor_in | The Heartbreak Kid: Ben & Jerry Stiller, Ben | producer_of | The Polka King Stiller, Ben | actor_in | The Heartbreak Kid Stiller, Ben | actor_in | The Watch Stiller, Ben | actor_in | The History of 'Walter Mitty' Stiller, Ben | producer_of | The Making of 'The Pick of Destiny' Stiller, Ben | actor_in | The Making of 'The Pick of Destiny' ... (901 rows)
Find all actresses that have worked with Sarah Jessica Parker:
MATCH (a:Person)-[b:ACTRESS_IN]->(c:Production)<-[d:ACTRESS_IN]-(e:Person) WHERE a.name = 'Parker, Sarah Jessica' RETURN DISTINCT e.name;
name --------------------------------- Aaliyah Aaron, Caroline Aaron, Kelly Abascal, Nati Abbott, Diane Abdul, Paula ... (3524 rows)
Conclusion and New Developments
The most powerful aspects of graph databases are flexibility and visualization capabilities. Relationships in the IMDB database are inherently visual; seeing how things are connected grants us the ability to better understand the connections underneath. By importing IMDB data into AgensGraph as graph data, the schema is simplified by grouping together related data points into nodes and drawing connections between these nodes. We can then query these data points on attributes in the nodes and edges. This gives us visual insight into the data.
In the future, we plan to implement a one-step importing script. Currently the importing script is two-phased: the first step is to load into relational tables and the second step is to load into the graph. Additionally, we are working on AgensGraph’s Gephi Connector. The Gephi Connector will allow for graph visualization and analysis.
BITNINE GLOBAL INC., THE COMPANY SPECIALIZING IN GRAPH DATABASE
비트나인, 그래프 데이터베이스 전문 기업