We’re overhauling Dgraph’s docs to make them clearer and more approachable. If you notice any issues during this transition or have suggestions, please let us know.
Compared to RAM or SSD access, network calls are slow, so Dgraph is built from the ground up to minimize them. For graph databases which store sub-graphs on different shards, this is difficult or impossible, but predicate-based (relationship-based) sharding allows fast distributed query with Dgraph.
Dgraph is unique in its use of predicate-based sharding, which allows complex and deep distributed queries to run without incurring high network overhead and associated delays.
Rather than store and shard by putting different nodes (aka
entities*) on different servers, Dgraph stores predicates or triples
of the form <node1> <predicateRelation> <node2>
. The nodes are therefore
implicit in the predicate storage, rather than vice versa.
This makes querying much different and particularly allows network optimizations in a distributed database.
To explain how this works, let’s use an example query:
Find all posts liked by friends of friends of mine over the last year, written by a popular author A.
In a distributed SQL database or (non-graph) NoSQL database, this query requires retrieval of a lot of data. Consider two approaches:
Approach 1:
Approach 2:
result set 1
).result set 2
).result set 1
with result set 2
.Both approaches wouild result in a lot of data moving back and forth between database and app; would be slow to execute, and may require running an offline job.
This is how it would run in Dgraph:
Sharding assumptions (which predicates live where):
friends
representing all friend
relations.posts_liked
representing who likes
each post.author
representing all who authored
each post.title
representing the uid->string
title property of posts.Algorithm:
friends
and retrieve a list of my
friends as a list of uids.result set myFOF
.result set postsMyFOFLiked
.result set authoredByA
.result set postsMyFOFLiked
intersect
result set authoredByA
. Call this result set postsMyFOFLikedByA
result set postsMyFOFLikedByA
to Server W which holds the title
predicate (1 RPC).result set postUidsAndTitles
result set postUidsAndTitles
.In at most 4 RPCs, we have figured out all the posts liked by friends of friends, written by popular author X, with titles. Typically, all four predicates will not live on four different Servers, so this is a worst-case scenario. Dgraph network activity is limited to the level of query join depth, rather than increasing arbitrarily according to the number of nodes in the graph, and how they are broken up across servers.
There is no way we are aware of that a node-based sharding database can avoid high network RPC counts during arbitrary queries because “node-hopping” does not mix well with a graph that is segmented across servers.
* Throughout this note, we call entities in a graph “nodes” which is a standard terminology when talking about nodes and predicates. These may be confused with Raft or Kubernetes nodes in some contexts, but generally we mean nodes in a graph.
We’re overhauling Dgraph’s docs to make them clearer and more approachable. If you notice any issues during this transition or have suggestions, please let us know.
Compared to RAM or SSD access, network calls are slow, so Dgraph is built from the ground up to minimize them. For graph databases which store sub-graphs on different shards, this is difficult or impossible, but predicate-based (relationship-based) sharding allows fast distributed query with Dgraph.
Dgraph is unique in its use of predicate-based sharding, which allows complex and deep distributed queries to run without incurring high network overhead and associated delays.
Rather than store and shard by putting different nodes (aka
entities*) on different servers, Dgraph stores predicates or triples
of the form <node1> <predicateRelation> <node2>
. The nodes are therefore
implicit in the predicate storage, rather than vice versa.
This makes querying much different and particularly allows network optimizations in a distributed database.
To explain how this works, let’s use an example query:
Find all posts liked by friends of friends of mine over the last year, written by a popular author A.
In a distributed SQL database or (non-graph) NoSQL database, this query requires retrieval of a lot of data. Consider two approaches:
Approach 1:
Approach 2:
result set 1
).result set 2
).result set 1
with result set 2
.Both approaches wouild result in a lot of data moving back and forth between database and app; would be slow to execute, and may require running an offline job.
This is how it would run in Dgraph:
Sharding assumptions (which predicates live where):
friends
representing all friend
relations.posts_liked
representing who likes
each post.author
representing all who authored
each post.title
representing the uid->string
title property of posts.Algorithm:
friends
and retrieve a list of my
friends as a list of uids.result set myFOF
.result set postsMyFOFLiked
.result set authoredByA
.result set postsMyFOFLiked
intersect
result set authoredByA
. Call this result set postsMyFOFLikedByA
result set postsMyFOFLikedByA
to Server W which holds the title
predicate (1 RPC).result set postUidsAndTitles
result set postUidsAndTitles
.In at most 4 RPCs, we have figured out all the posts liked by friends of friends, written by popular author X, with titles. Typically, all four predicates will not live on four different Servers, so this is a worst-case scenario. Dgraph network activity is limited to the level of query join depth, rather than increasing arbitrarily according to the number of nodes in the graph, and how they are broken up across servers.
There is no way we are aware of that a node-based sharding database can avoid high network RPC counts during arbitrary queries because “node-hopping” does not mix well with a graph that is segmented across servers.
* Throughout this note, we call entities in a graph “nodes” which is a standard terminology when talking about nodes and predicates. These may be confused with Raft or Kubernetes nodes in some contexts, but generally we mean nodes in a graph.