Target of column based data organize
- Only load needed data
- Organize data storage based on needs(query, filter)
For flat data
id | column_1 | column_2 | column_3 |
---|---|---|---|
1 | 111 | 222 | 333 |
2 | - | - | 444 |
3 | 555 | 666 | 777 |
Easy to encoding and decoding. For none-exist
field, we could use some marker field
to record it.
But when we talk about nested data
, take protobuf
as example, it has
required: exactly one occurrence
optional: 0 or 1 occurrence
repeated: 0 or more occurrences
For example
message AddressBook {
required string owner;
repeated string ownerPhoneNumbers;
repeated group contacts {
required string name;
optional string phoneNumber;
}
}
How can we re-construct original structure with just value(plus some marker
)
Definition Level
To support nested records we need to store the level for which the field is null
. This is what the definition level is for: from 0 at the root of the schema up to the maximum level for this column. When a field is defined then all its parents are defined too, but when it is null
we need to record the level at which it started being null to be able to reconstruct the record.
Repetition levels
To support repeated fields we need to store when new lists are starting in a column of values. This is what repetition level is for: it is the level at which we have to create a new list for the current value. In other words, the repetition level can be seen as a marker of when to start a new list and at which level.
create one column per primitive type cell shown in blue
Column | Max Definition Level | Reason | Max Repetition Level | Reason |
---|---|---|---|---|
DocID | 0 | required | 0 | no repetition |
forward | 2 | 1 | ||
backward | 2 | 1 | ||
url | 2 | 1 | ||
code | 2 | required, use its parent level, or understand as when there is language there must exists code |
2 | |
Country | 3 | 2 |
Explanation for different value of Repetition level
, take Country
as example
- 0 marks every new record and implies creating a new
Name
andLanguage
List - 1 marks every new record and implies creating a new
Language
List - 2 marks every new element in Language
Explanation of NULL
- In
Name.Url
, totally there are 4 names in data, r = 1 means a new entry in the Name at Level 1, d = 1 means Name is defined but not URL, so just create an empty URL - In
Links.Backward
, r = 0 means a new record and creates new list, d = 1 meansLinks
is defined but notBackward
- For
Code
andCountry
, there are 5 field has language due to that first name has 2 language.
01302021思考
Repetition Level:
- 記錄該列的值是在哪一個級別上重複的
- Name.Language.Country as example
- 0 marks every new record and implies creating a new
Name
andLanguage
List. 每个新纪录(r0, r1, r2)的第一个总是会赋0 - Name是1级,Langauage是2级,Country是3级
- 0-2-1-1-0的解释:
- 0表示r1 record的开始,因为us出现在第一个Name的第一个Language的第一个Country里面。
- 第二个是NULL(另一个Name里面是没有Country的值),但它属于name中重复的language,所以设为2
- 第三个也是NULL(是下一个Name里面没有Language.Country), 只有Name是重复元素,所以设为1
- 第三个是1,
gb
出现在第三个名字里,它是属于这个Name的第一个值,只有Name是重复的,所以设为1
- 0 marks every new record and implies creating a new
demention level:
- 只对非required以及非NULL值有意义。用来表示一个值是否是想象出来的
- 比如在例子中, 诸如DocID的d一定是固定值
- 对于Name.Language.Code, 非NULL值他们的d也是固定值, 比如en-us, en, en-gb都是2
- 而对于第一个NULL, 我们需要知道有意义的数据到哪一层,比如第一个NULL,是从Name.Language中来的,是第二层数据, 而第二个NULL,它的上一层有意义的结构是Name,层级为1