Machine Learning Newsletter

Graphs, Databases and Graphlab

I will talk about graphs, graph databases and mainly the paper that powers Graphlab. At the end of the post, I will go over briefly basic capabilities of Graphlab as well.

Graph Definition

Generally speaking, graphs that are relevant to graph theory and graph databases are defined as following(most general definition):

Graph is an ordered pair $G = (V, E)$ where the $V$ represents the vertices(nodes) of graph and $E$ represents the edges(relations) that connect the nodes. If edges do not have any orientations, then graph is composed of unordered pairs and graph is called undirected graph.

Components

It has two simple components:

  • Node (Vertex)
  • Edge [Relationship]

where both structures have properties defined on them. Node properties are likely to define the nodes and the properties on the edges define the relationship between nodes.

Graphs

Graphs are ubiquitous. Apart from giant network that we use daily basis(read World Wide Web), almost all of social websites(Facebook, Twitter, Linkedin and so on) have inherently graphical representations for their networks. Not only they are good for representing networks, graphs have their own subsection in machine learning, namely graphical models. It is a combination of statistics, machine learning and graph theory(all the good parts, too). Further, they have also a variety of applications in machine learning, most predominantly recommender systems thanks to bipartite graphs. In social sciences, it is important to be able to identify different communities and how they interact each other in order to understand both intra-community and inter-community behaviors. Some social scientists may also be interested in the people who are more likely to influence others for making decisions and how an event propagates across time and different communities. They all need data mining, machine learning but more importantly a scalable, efficient and natural representation of graphs to do the computation as graphs grow exponentially with respect to number of edges and nodes.

Graph Databases

Graph Databases Introduction

  • Apply graph theory in the storage of information about the relationship between entries

Why not Relational Databases?

Relations, we have relations! Let's represent our graphs with old-trusted relational databases. This is actually possible, but for one second think about the SQL query of the social network that you will end up if you want to get the 4th degree of people in a graph. Imagine the joins that you will do, imagine how many times you will join the same table over and over. Let's go one further step ahead, getting the full social network of a single individual person. Do you imagine how much time that recursive join operation requires and what does SQL query look like. Why is this such an important shortcoming, are relational databases supposed to be very good at representing the relations after all? Relational databases are good at representing one-to-many relationships where one table are connected multiple tables. And in this scenario, if you want to get your relation structure, you could get a way with a few joins. However, most of the natural graphs are inherently many-to-many relations in which you need to break down the many-to-many relations into many one-to-many relations. This results in a huge number of tables in your database. Not only that, whenever you want to do query something, you need to pay the price of the join of all the tables. (Note that one-to-one relations could be put into the same table.) Moreover, the portion that you want to explore in the graph is also dependent on the amount of total data that you have. This is one of the biggest issue of RDMS's when it comes to scale for graph data.

Why Graph databases?

  • Data is getting more and more connected everyday. Think about Internet of things, social media, smartphones and so on. The data or digital print that you put on web, on physical devices will be eventually connected and graph databases are very well suited this type of networks.
  • Individualization and decentralization of content, so-called Web 2.0 allowed people to create content and contribute to a common space. This already created huge social networks and there is no reason for other similar networks do not appear in the future.
  • They can represent graphs and networks with performance with native data structures which could represent the graphs naturally. Icing on the cake, they would support some graph theory operations to do computation more elegantly.
  • They are not as rigid as SQL databases, schemas can be changed over time. Some of them also allow semi-structured data.
  • They are faster than relational databases in graph representation. Unlike relational databases, the time that query takes is not proportional to the data that you have in the database but how much you want to explore the graph in the network.

Stand shoulders of giants

  • Graph theory is well established field and luckily a number of well-known and knowledgeable mathematicians(Euler(1707-1783) to be the most prominent) worked on different graph theory problems which resulted in a theoretically sound field. Compare this field with the Relational Model theory(at most 50 years-old). Relational Model Theory is not in its infancy but definitely does not have the maturity and depth of the graph theory has.

Graphlab Powergraph

Graphlab Powergraph is scalable, computation efficient graph processing framework. It builds on Graphlab(early version). Although I talked about Graph databases earlier, this is not graph database but rather scalable and efficient graph processing library.
(What I like about company and framework is that they are purely research-based(they have two papers and two frameworks; Graphlab and GraphChi). People like it and they commercialized it. You have your business right there.)
Early work from Google Pregel presents a message passing algorithm along with parent steps where parent steps can be run in parallel without interfering each other. By doing so, graph gets broken down to a number of steps that can be run in parallel which improves the computation time. Graphlab Powergraph shows that they are faster and more scalable than Pregel both theoretically and empirically.

Original paper is as follows:

Problem Statement

Natural graphs are derived from natural phenomenon and from various websites across world wide web. There is a need for scalable and efficient computation in natural graphs. Due to power-law degree distribution of natural graphs, graphs do not have balanced cuts. Traditional graph-partitioning algorithms are unsuccessful in natural graphs.

Method

They introduce an algorithm Powergraph which builds on top of Graphlab algorithm they introduced earlier. The algorithm has different components but let's look at first to the state of art before this paper.

Graph-Parallel Abstraction

  • Constrains interaction along edges.
  • Use messages and share states.
  • Parallel: multiple-vertex programs run simultaneously.

High-degree vertices is hard(Surprise!)

  • Sequentially process edges.(Very Naive!)
  • Synchronous and mutliple messages(Pregel)
  • Asynchronous and partitioning(Graphlab) => requires locking.

Downside

  • In Pregel case, communication overhead. Needs to send the same message to different vertices(read machines)!
  • Graphlab creates ghost vertices in order to handle this issue but that creates a large amount of network traffic.

What do we do?

  • Graph partitioning

    • Minimize communication
    • Trade-off between computation and storage.
  • Pregel and Graphlab both use random partitioning on natural graphs. However, this results in low quality partitioning. High degree vertices still create problems for communication overhead. Let's introduce new algorithm: Powergraph.

Powergraph

  • More computation to data. Moving data and especially "big" data is costly operation and especially if you do not want to store the end result but only do computation on top of it. Even if you are doing ETL, you should always carry the computation to the data in order to scale better.
  • Parallelize high-degree vertices with the method that intoduce GAS(Gather, Apply, Scatter).
  • Partitioning vertex in order to effectively distribute large power-law graphs(natural graphs).

GAS decomposed

  • Gather: Accumulate information about neighborhood.
  • Apply: Apply the accumulated value to center vertex.
  • Scatter: Update adjacent vertices and edges, update edge data and activate neighbors.

Communitcation minimized

  • Percolation theory suggests that power law graphs have good vertex cuts: For any edge-cut, we can directly construct a vertex-cut which requires strictly less communication and storage.

In order to do so:

  • Evenly assign edges to machines. Minimize machines spanned by each vertex.
  • Assign each edge as it is loaded. Touch each edge only once.

Three Distributed Approaches

Random Edge Placement
  • Place edges on machines which already have the vertices in that edge.
  • De-randomization: greedily minimizes the expected number of machines spanned.
Coordinated Greedy Edge Placement
  • Requires coordination to place each edge
  • Slower but it yields higher quality cuts.
Oblivious Greedy Edge Placement
  • Approximately greed objective without coordination.
  • Faster but it yields lower quality cuts.

The paper talks about performance evaluation and gives promising comparison of Graphlab with other graph databases. It also implements a number of algorithms; Singular Value Decomposition, Non-Negative Matrix Factorization, Stochastic Gradient Descent, Pagerank, Triangle Counting, graph coloring and Latent Dirichlet Allocation.

They recently released a Python wrapper on top of it called GraphLab Create. In the following section, I will be mainly using that.

Why Graphlab Create?

  • Easy to setup, easy to learn, easy to use. If you know Pandas and Python in general, you are good to go at least for Graplab Create.
  • Source code of the Graphlab Powergraph is open source, not graphlab create, though.
  • Data is not stored in memory but harddisk. Therefore, you could process large datasets with a moderate RAM.
  • Model is to be planned to be stored in the harddisk as well in the future(right now it is stored in memory, though.)
  • Documentation coverage seems to be OK.

Disadvantages

  • Python wrapper is not feature rich. It lacks analytics and clustering algorithms that the original package has.
  • SFrames do not have all of the functionalities of dataframes. I was expecting that they will build on top of dataframes. Even if they introduce new functionalities, some of them are not there. As you you could use dataframes for algorithms, this is not a huge disadvantage. It would be great to enjoy dataframes in harddisk, though.

SFrame

  • SFrame is the basic data structure that Graphlab uses under the hood. If you are familiar with Pandas, what dataframe to Pandas is what SFrame to Graphlab.
  • Its columns SArray's are immutable whereas SFrame is mutable. SArray's are also stored in harddisk to allow large scale computation.
  • Methods of SFrame are generally similar to Pandas dataframe but does not support some of the capabilities of dataframe, e.g groupby operation.

Dataset

The data that I will use is from earlier work that I did some time ago. In order to understand, see the posts 1, 2, 3 and 4. Do come back, though.

In [39]:
import os
import csv
import itertools

import graphlab as gl
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import scipy

# I can haz large graphs ?
mpl.rcParams['figure.figsize'] = (16, 12)
mpl.rcParams['axes.labelsize'] = 22
mpl.rcParams['xtick.labelsize'] = 18
mpl.rcParams['ytick.labelsize'] = 18
mpl.rcParams['xtick.direction'] = 'out'
mpl.rcParams['ytick.direction'] = 'out'

DATA_PATH = os.path.join('data', 'movies.csv')
In [81]:
sf = gl.SFrame.read_csv(DATA_PATH);
In [82]:
sf.num_rows(), sf.num_cols()
Out[82]:
(100042, 34)
In [83]:
sf.shape
Out[83]:
(100042, 34)
In [84]:
sf.head(5)
Out[84]:
title year rating runtime votes genres outline director actors certificate Action Adult Adventure Animation Biography Comedy Crime Drama Family
0 0 The Shawshank Redemption 1994 9.3 142 1157455 [Crime, Drama] Two imprisoned men bond over a number of years... Frank Darabont [Tim Robbins, Morgan Freeman, Bob Gunton] R False False False False False False True True False ...
1 1 The Dark Knight 2008 9.0 152 1125214 [Action, Crime, Drama, Thriller] When Batman, Gordon and Harvey Dent launch an ... Christopher Nolan [Christian Bale, Heath Ledger, Aaron Eckhart] PG_13 True False False False False False True True False ...
2 2 Inception 2010 8.8 148 920345 [Action, Adventure, Mystery, Sci-Fi, Thriller] A skilled extractor is offered a chance to reg... Christopher Nolan [Leonardo DiCaprio, Joseph Gordon-Levitt, Elle... PG_13 True False True False False False False False False ...
3 3 Pulp Fiction 1994 9.0 154 893719 [Crime, Drama, Thriller] The lives of two mob hit men, a boxer, a gangs... Quentin Tarantino [John Travolta, Uma Thurman, Samuel L. Jackson] R False False False False False False True True False ...
4 4 Fight Club 1999 8.9 139 881325 [Drama] An insomniac office worker looking for a way t... David Fincher [Brad Pitt, Edward Norton, Helena Bonham Carter] R False False False False False False False True False ...

5 rows × 34 columns

In [85]:
sf['year'] = sf['year'].apply(lambda x: int(x))
sf['votes'] = sf['votes'].apply(lambda x: int(x))
sf['rating'] = sf['rating'].apply(lambda x: float(x))

def make_integer(x):
    if x == '':
        return np.nan
    else:
        try:
            x = int(x)
            return x
        except:
            return np.nan
sf['runtime'] = sf['runtime'].apply(make_integer)
In [86]:
print('Mean Rating: {}'.format(sf['rating'].mean()))
print('Variance of Rating: {}'.format(sf['rating'].var()))
print('Standard deviation of Rating: {}'.format(sf['rating'].std()))
print('Minimum rating: {}'.format(sf['rating'].min()))
print('Maximum rating: {}'.format(sf['rating'].max()))
print('Number of non-zero entries: {}'.format(sf['rating'].nnz()))
Mean Rating: 6.05980388237
Variance of Rating: 1.6072088658
Standard deviation of Rating: 1.26775741599
Minimum rating: 1.0
Maximum rating: 9.4
Number of non-zero entries: 100042
In [87]:
first_part, second_part = sf.random_split(0.8, seed = 1)

Divide the SFrame 4 by 1. Nice utility function by the way!

Exploratory Graph Analysis

In [17]:
sf = gl.SFrame.read_csv('data/perf.csv', column_type_hints={'year': int, 'weight': float, 'id': int})
In [20]:
sf['actor'] = sf['actor_name'] 
del sf['actor_name']
sf['title'] = sf['film_name']
del sf['film_name']

actors = np.unique(sf['actor'])
films = np.unique(sf['title'])

g = gl.Graph()
g = g.add_edges(sf, src_field='actor', dst_field='title')
g = g.add_edges(sf, src_field='title', dst_field='actor')
In [21]:
g.summary()
Out[21]:
{'num_edges': 312730, 'num_vertices': 97723}
In [47]:
dark_knights = ['The Dark Knight', 'The Dark Knight Rises',] 
top_grass_graph = gl.Graph()
top_grass_graph = top_grass_graph.add_edges(g.get_edges(dst_ids=dark_knights), src_field='__src_id', dst_field='__dst_id')
top_grass_graph.show(vlabel='id', highlight=dark_knights)
In [53]:
def plot_graph_actor(actor_name):
    movies = g.get_edges(src_ids=[actor_name])
    graph_ = gl.Graph()
    graph_ = graph_.add_edges(movies, src_field='__src_id',
                              dst_field='__dst_id')
    graph_.show(vlabel='id', elabel='character', highlight=[actor_name])
plot_graph_actor('Morgan Freeman')

Morgan Freeman has quite a number of movies. Edges are the character names as you might imagine.

In [54]:
plot_graph_actor('Christian Bale')

Who is the Batman?

In [61]:
plot_grap_actor('Kate Winslet')

We could also display the movies that the actors/actresses played

In [67]:
def plot_costar_movies(actor_name):
    movies = g.get_edges(src_ids=[actor_name])
    costar_graph = gl.Graph()
    for ii in movies['__dst_id']:
        costar_graph = costar_graph.add_edges(g.get_edges(src_ids=[ii], dst_ids=None),
                                  src_field='__src_id', dst_field='__dst_id')
    costar_graph.show(highlight=list(movies['__dst_id']))
    
plot_costar_movies('Morgan Freeman')
In [68]:
plot_costar_movies('Kate Winslet')