Skip to content

Compact representations

shensiduanxing007 edited this page Nov 13, 2013 · 4 revisions

NetflixGraph’s focus is on maintaining sets of connections between nodes. In order to maximize the utility of each and every byte of RAM, the NFCompressedGraph stores all of the graph’s data into a single byte array. NetflixGraph uses one of three different representations to encode individual sets of connections:

Compact connection sets

Compact connection sets are represented with variable-length delta encoding. This is the default representation described in the original NetflixGraph tech blog post.

Variable-byte encoding is a way to represent integers in a variable number of bytes, whereby smaller values are represented in fewer bytes. An excellent explanation is available here on Wikipedia. Because smaller values can be represented in fewer bytes, we benefit significantly if we can represent our connected ordinals with smaller values.

This is the reason for the “delta” part of “variable-length delta encoding”. If we sort the ordinals for some connection set in ascending order, we might represent each connection not by it’s actual value, but by the difference between it’s value and the previous value in the set. For example, if we had some ordinals [1, 2, 3, 5, 7, 11, 13], we would represent this with the values [1, 1, 1, 2, 2, 4, 2].

In this way, many larger values can be represented with fewer bytes. In fact, the less sparse a set is, the more compact the representation for each element is likely to be.

Hashed connection sets

With Compact connection sets, determining whether or not a given node is contained in the set is an O(n) operation. This is due to the delta encoding in compact connection sets — to determine any given value, we must know the previous value. Determining set membership therefore requires a complete iteration of the entire set.

While this is acceptable for some use cases, others may require the sacrifice of a small amount of memory for better runtime performance. In these cases, hashed connection sets are the answer. Hashed connection sets provide O(1) membership determination time, while still providing strong memory-efficiency.

Hashed connection sets are a “variable-byte hashed integer array” representation. Underlying each of these connection sets is, of course, a byte array. Each connected ordinal is hashed into the byte array, then represented with between one and five bytes per connection, in a variant of the way Compact connection sets are represented. The byte array can be thought of as a open-addressed hash table with linear probing, with each byte representing a single bucket. Because values may be represented in more than one byte, single values may spill over into multiple buckets. The beginning of each value is indicated by an unset sign bit, and will be located at or after the bucket to which it is hashed. If the value’s first byte is not located at the hashed position, it will be located in a position after the bucket with no empty buckets in between.

Empty buckets are represented with a “0” byte. The variable-byte integer representation for Hashed connection sets has been designed to make “0” an illegal value.

Bit sets

The compact connection sets and hashed connection sets are efficient representations of connection sets in most circumstances. However, there are times when it may be more compact to allocate a single bit for each node in the type to which a connection set points. When the number of bits required to represent a set of connections via the default representation is greater than the number of total possible connections, NetflixGraph will automatically represent that set with a bit set.

For example, if we have some property which connects node “a” to node “b”, and there are only 16 total “b” nodes, then any connection set which requires more than two bytes can be more efficiently represented as a bit set. If there are three connections in a set, then two bytes will be used, and three of the bits in those bytes will be set to “1”. The indices of the “1” bits will correspond to the ordinals of the connected nodes in the set.

Membership determination in a bit set is O(1), because we need only check the status of a bit at a known location.

Clone this wiki locally