Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* Added support for dynamic fast field. See README for more information.
* Apply suggestions from code review

Co-authored-by: PSeitz <[email protected]>
1 parent: 1afa5bf
commit 4f9efe6
Showing 16 changed files with 2,256 additions and 3 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
[package] | ||
name = "tantivy-columnar" | ||
version = "0.1.0" | ||
edition = "2021" | ||
license = "MIT" | ||
|
||
[dependencies] | ||
stacker = { path = "../stacker", package="tantivy-stacker"} | ||
serde_json = "1" | ||
thiserror = "1" | ||
fnv = "1" | ||
sstable = { path = "../sstable", package = "tantivy-sstable" } | ||
common = { path = "../common", package = "tantivy-common" } | ||
fastfield_codecs = { path = "../fastfield_codecs"} | ||
itertools = "0.10" | ||
|
||
[dev-dependencies] | ||
proptest = "1" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,67 @@ | ||
# Columnar format | ||
|
||
This crate describes the columnar format used in tantivy. | ||
|
||
## Goals | ||
|
||
This format is special in the following ways: | ||
- it needs to be compact | ||
- it does not need to be loaded in memory. | ||
- it is designed to fit well with quickwit's strange constraint: | ||
we need to be able to load columns rapidly. | ||
- columns of several types can be associated with the same column name. | ||
- it needs to support columns with different types `(str, u64, i64, f64)` | ||
and different cardinality `(required, optional, multivalued)`. | ||
- columns, once loaded, offer cheap random access. | ||
|
||
# Coercion rules | ||
|
||
Users can create a columnar by inserting rows into a `ColumnarWriter`, | ||
and serializing it into a `Write` object. | ||
Nothing prevents a user from recording values of different types under the same `column_name`. | ||
|
||
In that case, `tantivy-columnar`'s behavior is as follows: | ||
- JsonValues are grouped into 3 types (String, Number, bool). | ||
Values that correspond to different groups are mapped to different columns. For instance, String values are treated independently | ||
from Number or boolean values. `tantivy-columnar` will simply emit several columns associated with a given `column_name`. | ||
- Only one column for a given json value type is emitted. If number values with different number types are recorded (e.g. u64, i64, f64), | ||
`tantivy-columnar` will pick the first type that can represent the set of appended values, with the following priority order (`i64`, `u64`, `f64`). | ||
`i64` is picked over `u64` as it is likely to yield fewer changes of type. Most use cases that strictly require `u64` exceed the | ||
`i64` range on about 50% of their values (e.g. a 64-bit hash). On the other hand, many use cases only show a rare negative value. | ||
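
The priority order described above can be sketched as follows. This is only an illustration, not the crate's implementation: `NumericalValue`, `NumericalType` and `coerce_numerical_type` are hypothetical stand-ins defined for this example, and the sketch assumes that a recorded float never coerces to an integer type.

```rust
/// A numerical value as recorded by the user (hypothetical stand-in type).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum NumericalValue {
    I64(i64),
    U64(u64),
    F64(f64),
}

/// The single type chosen for the numerical column (hypothetical stand-in type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum NumericalType {
    I64,
    U64,
    F64,
}

/// Picks the first type, in the priority order (i64, u64, f64),
/// that can represent every recorded value.
pub fn coerce_numerical_type(values: &[NumericalValue]) -> NumericalType {
    let mut all_fit_i64 = true;
    let mut all_fit_u64 = true;
    for value in values {
        match *value {
            // A negative integer rules out u64.
            NumericalValue::I64(v) => all_fit_u64 &= v >= 0,
            // An integer above i64::MAX rules out i64.
            NumericalValue::U64(v) => all_fit_i64 &= v <= i64::MAX as u64,
            // Assumption of this sketch: floats are never coerced to integers.
            NumericalValue::F64(_) => {
                all_fit_i64 = false;
                all_fit_u64 = false;
            }
        }
    }
    if all_fit_i64 {
        NumericalType::I64
    } else if all_fit_u64 {
        NumericalType::U64
    } else {
        NumericalType::F64
    }
}
```

With this rule, a column holding only small `u64` values and one rare negative value stays `i64`, whereas a column holding both a negative value and a value above `i64::MAX` falls back to `f64`.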
|
||
# Columnar format | ||
|
||
This columnar format may have more than one column (with different types) associated with the same `column_name` (see [Coercion rules](#coercion-rules) above). | ||
The `(column_name, column_type)` couple however uniquely identifies a column. | ||
That couple is serialized as a `column_key`. The format of that key is: | ||
`[column_name][ZERO_BYTE][column_type_header: u8]` | ||
|
||
``` | ||
COLUMNAR:= | ||
[COLUMNAR_DATA] | ||
[COLUMNAR_KEY_TO_DATA_INDEX] | ||
[COLUMNAR_FOOTER]; | ||
# Columns are sorted by their column key. | ||
COLUMNAR_DATA:= | ||
[COLUMN_DATA]+; | ||
COLUMNAR_FOOTER := [RANGE_SSTABLE_BYTES_LEN: 8 bytes little endian] | ||
``` | ||
|
||
The columnar file starts with the actual column data, concatenated one after the other, | ||
sorted by column key. | ||
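
As an illustration of the layout above, a reader could locate the sections from the end of the file. This is a minimal sketch, assuming that `RANGE_SSTABLE_BYTES_LEN` is the byte length of the `COLUMNAR_KEY_TO_DATA_INDEX` section; the function name `split_columnar` is hypothetical and not part of the crate.

```rust
/// Splits a columnar file into `(COLUMNAR_DATA, COLUMNAR_KEY_TO_DATA_INDEX)`.
///
/// Returns `None` if the file is too short or if the footer announces
/// an index that is longer than the file itself.
pub fn split_columnar(bytes: &[u8]) -> Option<(&[u8], &[u8])> {
    // The footer is a single 8-byte little endian integer.
    const FOOTER_LEN: usize = 8;
    let footer_start = bytes.len().checked_sub(FOOTER_LEN)?;
    let (body, footer) = bytes.split_at(footer_start);
    let mut footer_bytes = [0u8; FOOTER_LEN];
    footer_bytes.copy_from_slice(footer);
    // Assumption: the footer stores the byte length of the index section.
    let index_len = u64::from_le_bytes(footer_bytes);
    if index_len > body.len() as u64 {
        return None;
    }
    let index_start = body.len() - index_len as usize;
    // Everything before the index is the concatenated column data.
    Some(body.split_at(index_start))
}
```

Reading from the end means the column data itself never has to be scanned in order to find the index.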
|
||
An sstable associates | ||
`(column_name, column_cardinality, column_type)` to a range of bytes. | ||
|
||
Column name may not contain the zero byte `\0`. | ||
|
||
Listing all columns associated to `column_name` can therefore | ||
be done by listing all keys prefixed by | ||
`[column_name][ZERO_BYTE]` | ||
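
The key format and the prefix listing can be sketched with a sorted map standing in for the sstable. This is an assumption-laden illustration: `column_key` and `list_columns` are hypothetical helpers, and a `BTreeMap` only mimics the sorted-key behavior of the real index.

```rust
use std::collections::BTreeMap;
use std::ops::Range;

/// Builds `[column_name][ZERO_BYTE][column_type_header: u8]`.
pub fn column_key(column_name: &str, column_type_header: u8) -> Vec<u8> {
    assert!(
        !column_name.as_bytes().contains(&0u8),
        "column name may not contain the zero byte"
    );
    let mut key = Vec::with_capacity(column_name.len() + 2);
    key.extend_from_slice(column_name.as_bytes());
    key.push(0u8);
    key.push(column_type_header);
    key
}

/// Lists `(column_type_header, byte range)` for every column named `column_name`.
pub fn list_columns<'a>(
    index: &'a BTreeMap<Vec<u8>, Range<u64>>,
    column_name: &str,
) -> Vec<(u8, &'a Range<u64>)> {
    let mut prefix = column_name.as_bytes().to_vec();
    prefix.push(0u8);
    index
        // Keys are sorted, so all keys sharing the prefix are contiguous.
        .range(prefix.clone()..)
        .take_while(|(key, _)| key.starts_with(&prefix))
        // The byte right after the prefix is the column type header.
        .filter_map(|(key, range)| key.get(prefix.len()).map(|header| (*header, range)))
        .collect()
}
```

Because names cannot contain `\0`, the prefix for `a` never matches a key for a column named `ab`: the zero byte acts as an unambiguous terminator.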
|
||
The associated range of bytes refer to a range of bytes | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,201 @@ | ||
use crate::utils::{place_bits, select_bits}; | ||
use crate::value::NumericalType; | ||
use crate::InvalidData; | ||
|
||
/// Enum describing the number of values that can exist per document | ||
/// (or per row if you will). | ||
/// | ||
/// The cardinality must fit on 2 bits. | ||
#[derive(Clone, Copy, Hash, Default, Debug, PartialEq, Eq, PartialOrd, Ord)] | ||
#[repr(u8)] | ||
pub enum Cardinality { | ||
/// All documents contain exactly one value. | ||
/// Required is the default for auto-detecting the Cardinality, since it is the most strict. | ||
#[default] | ||
Required = 0, | ||
/// All documents contain at most one value. | ||
Optional = 1, | ||
/// All documents may contain any number of values. | ||
Multivalued = 2, | ||
} | ||
|
||
impl Cardinality { | ||
pub(crate) fn to_code(self) -> u8 { | ||
self as u8 | ||
} | ||
|
||
pub(crate) fn try_from_code(code: u8) -> Result<Cardinality, InvalidData> { | ||
match code { | ||
0 => Ok(Cardinality::Required), | ||
1 => Ok(Cardinality::Optional), | ||
2 => Ok(Cardinality::Multivalued), | ||
_ => Err(InvalidData), | ||
} | ||
} | ||
} | ||
|
||
/// The column type represents the column type and can fit on 6-bits. | ||
/// | ||
/// - bits[0..3]: Column category type. | ||
/// - bits[3..6]: Numerical type if necessary. | ||
#[derive(Hash, Eq, PartialEq, Debug, Clone, Copy)] | ||
pub enum ColumnType { | ||
Bytes, | ||
Numerical(NumericalType), | ||
Bool, | ||
} | ||
|
||
impl ColumnType { | ||
/// Encoded over 6 bits. | ||
pub(crate) fn to_code(self) -> u8 { | ||
let column_type_category; | ||
let numerical_type_code: u8; | ||
match self { | ||
ColumnType::Bytes => { | ||
column_type_category = ColumnTypeCategory::Str; | ||
numerical_type_code = 0u8; | ||
} | ||
ColumnType::Numerical(numerical_type) => { | ||
column_type_category = ColumnTypeCategory::Numerical; | ||
numerical_type_code = numerical_type.to_code(); | ||
} | ||
ColumnType::Bool => { | ||
column_type_category = ColumnTypeCategory::Bool; | ||
numerical_type_code = 0u8; | ||
} | ||
} | ||
place_bits::<0, 3>(column_type_category.to_code()) | place_bits::<3, 6>(numerical_type_code) | ||
} | ||
|
||
pub(crate) fn try_from_code(code: u8) -> Result<ColumnType, InvalidData> { | ||
if select_bits::<6, 8>(code) != 0u8 { | ||
return Err(InvalidData); | ||
} | ||
let column_type_category_code = select_bits::<0, 3>(code); | ||
let numerical_type_code = select_bits::<3, 6>(code); | ||
let column_type_category = ColumnTypeCategory::try_from_code(column_type_category_code)?; | ||
match column_type_category { | ||
ColumnTypeCategory::Bool => { | ||
if numerical_type_code != 0u8 { | ||
return Err(InvalidData); | ||
} | ||
Ok(ColumnType::Bool) | ||
} | ||
ColumnTypeCategory::Str => { | ||
if numerical_type_code != 0u8 { | ||
return Err(InvalidData); | ||
} | ||
Ok(ColumnType::Bytes) | ||
} | ||
ColumnTypeCategory::Numerical => { | ||
let numerical_type = NumericalType::try_from_code(numerical_type_code)?; | ||
Ok(ColumnType::Numerical(numerical_type)) | ||
} | ||
} | ||
} | ||
} | ||
|
||
/// Column types are grouped into different categories that | ||
/// correspond to the different `JsonValue` types. | ||
/// | ||
/// The columnar writer will apply coercion rules to make sure that | ||
/// at most one column exists per `ColumnTypeCategory`. | ||
/// | ||
/// See also [README.md]. | ||
#[derive(Copy, Clone, Ord, PartialOrd, Eq, PartialEq, Debug)] | ||
#[repr(u8)] | ||
pub(crate) enum ColumnTypeCategory { | ||
Bool = 0u8, | ||
Str = 1u8, | ||
Numerical = 2u8, | ||
} | ||
|
||
impl ColumnTypeCategory { | ||
pub fn to_code(self) -> u8 { | ||
self as u8 | ||
} | ||
|
||
pub fn try_from_code(code: u8) -> Result<Self, InvalidData> { | ||
match code { | ||
0u8 => Ok(Self::Bool), | ||
1u8 => Ok(Self::Str), | ||
2u8 => Ok(Self::Numerical), | ||
_ => Err(InvalidData), | ||
} | ||
} | ||
} | ||
|
||
/// Represents the type and cardinality of a column. | ||
/// This is encoded over one-byte and added to a column key in the | ||
/// columnar sstable. | ||
/// | ||
/// - [0..6] bits: encodes the column type | ||
/// - [6..8] bits: encodes the cardinality | ||
#[derive(Eq, Hash, PartialEq, Debug, Copy, Clone)] | ||
pub struct ColumnTypeAndCardinality { | ||
pub typ: ColumnType, | ||
pub cardinality: Cardinality, | ||
} | ||
|
||
impl ColumnTypeAndCardinality { | ||
pub fn to_code(self) -> u8 { | ||
place_bits::<0, 6>(self.typ.to_code()) | place_bits::<6, 8>(self.cardinality.to_code()) | ||
} | ||
|
||
pub fn try_from_code(code: u8) -> Result<ColumnTypeAndCardinality, InvalidData> { | ||
let typ_code = select_bits::<0, 6>(code); | ||
let cardinality_code = select_bits::<6, 8>(code); | ||
let cardinality = Cardinality::try_from_code(cardinality_code)?; | ||
let typ = ColumnType::try_from_code(typ_code)?; | ||
assert_eq!(typ.to_code(), typ_code); | ||
Ok(ColumnTypeAndCardinality { cardinality, typ }) | ||
} | ||
} | ||
|
||
#[cfg(test)] | ||
mod tests { | ||
use std::collections::HashSet; | ||
|
||
use super::ColumnTypeAndCardinality; | ||
use crate::column_type_header::{Cardinality, ColumnType}; | ||
|
||
#[test] | ||
fn test_column_type_header_to_code() { | ||
let mut column_type_header_set: HashSet<ColumnTypeAndCardinality> = HashSet::new(); | ||
for code in u8::MIN..=u8::MAX { | ||
if let Ok(column_type_header) = ColumnTypeAndCardinality::try_from_code(code) { | ||
assert_eq!(column_type_header.to_code(), code); | ||
assert!(column_type_header_set.insert(column_type_header)); | ||
} | ||
} | ||
assert_eq!( | ||
column_type_header_set.len(), | ||
3 /* cardinality */ * | ||
(1 + 1 + 3) // column_types (str, bool, numerical x 3) | ||
); | ||
} | ||
|
||
#[test] | ||
fn test_column_type_to_code() { | ||
let mut column_type_set: HashSet<ColumnType> = HashSet::new(); | ||
for code in u8::MIN..=u8::MAX { | ||
if let Ok(column_type) = ColumnType::try_from_code(code) { | ||
assert_eq!(column_type.to_code(), code); | ||
assert!(column_type_set.insert(column_type)); | ||
} | ||
} | ||
assert_eq!(column_type_set.len(), 2 + 3); | ||
} | ||
|
||
#[test] | ||
fn test_cardinality_to_code() { | ||
let mut num_cardinality = 0; | ||
for code in u8::MIN..=u8::MAX { | ||
if let Ok(cardinality) = Cardinality::try_from_code(code) { | ||
assert_eq!(cardinality.to_code(), code); | ||
num_cardinality += 1; | ||
} | ||
} | ||
assert_eq!(num_cardinality, 3); | ||
} | ||
} |
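
The file above relies on `place_bits` and `select_bits` from `crate::utils`, which are not shown in this excerpt. The following is a hypothetical sketch of what such helpers might look like, inferred only from how they are called; the actual signatures and checks in the crate may differ. `header_code` is an illustrative helper that packs raw codes the same way `ColumnTypeAndCardinality::to_code` does.

```rust
/// Mask with the `num_bits` lowest bits set.
const fn mask(num_bits: u8) -> u8 {
    if num_bits >= 8 {
        u8::MAX
    } else {
        (1u8 << num_bits) - 1
    }
}

/// Extracts bits `[START..END)` of `code`, shifted down to the lowest bits.
pub fn select_bits<const START: u8, const END: u8>(code: u8) -> u8 {
    assert!(START < END && END <= 8);
    (code >> START) & mask(END - START)
}

/// Moves `code` into bits `[START..END)`. Panics if `code` does not fit.
pub fn place_bits<const START: u8, const END: u8>(code: u8) -> u8 {
    assert!(START < END && END <= 8);
    assert_eq!(code & mask(END - START), code, "value does not fit in the bit range");
    code << START
}

/// Packs raw codes into one header byte:
/// bits [0..3] category, bits [3..6] numerical type, bits [6..8] cardinality.
pub fn header_code(category: u8, numerical_type: u8, cardinality: u8) -> u8 {
    let typ = place_bits::<0, 3>(category) | place_bits::<3, 6>(numerical_type);
    place_bits::<0, 6>(typ) | place_bits::<6, 8>(cardinality)
}
```

For example, a `Str` category (code 1) with `Optional` cardinality (code 1) packs to `1 | (1 << 6) = 65`, and `select_bits::<6, 8>(65)` recovers the cardinality code 1.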
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,84 @@ | ||
use std::io; | ||
|
||
use fnv::FnvHashMap; | ||
use sstable::SSTable; | ||
|
||
pub(crate) struct TermIdMapping { | ||
unordered_to_ord: Vec<OrderedId>, | ||
} | ||
|
||
impl TermIdMapping { | ||
pub fn to_ord(&self, unordered: UnorderedId) -> OrderedId { | ||
self.unordered_to_ord[unordered.0 as usize] | ||
} | ||
} | ||
|
||
/// When we add values, we cannot know their ordered id yet. | ||
/// For this reason, we temporarily assign them a `UnorderedId` | ||
/// that will be mapped to an `OrderedId` upon serialization. | ||
#[derive(Clone, Copy, Debug, Hash, PartialEq, Eq)] | ||
pub struct UnorderedId(pub u32); | ||
|
||
#[derive(Clone, Copy, Hash, PartialEq, Eq, Debug)] | ||
pub struct OrderedId(pub u32); | ||
|
||
/// `DictionaryBuilder` for dictionary encoding. | ||
/// | ||
/// It stores the different terms encountered and assigns them a temporary value | ||
/// we call unordered id. | ||
/// | ||
/// Upon serialization, we will sort the ids and hence build a `UnorderedId -> Term ordinal` | ||
/// mapping. | ||
#[derive(Default)] | ||
pub(crate) struct DictionaryBuilder { | ||
dict: FnvHashMap<Vec<u8>, UnorderedId>, | ||
} | ||
|
||
impl DictionaryBuilder { | ||
/// Get or allocate an unordered id. | ||
/// (This ID is simply an auto-incremented id.) | ||
pub fn get_or_allocate_id(&mut self, term: &[u8]) -> UnorderedId { | ||
if let Some(term_id) = self.dict.get(term) { | ||
return *term_id; | ||
} | ||
let new_id = UnorderedId(self.dict.len() as u32); | ||
self.dict.insert(term.to_vec(), new_id); | ||
new_id | ||
} | ||
|
||
/// Serializes the dictionary into an sstable, and returns the | ||
/// `UnorderedId -> TermOrdinal` map. | ||
pub fn serialize<'a, W: io::Write + 'a>(&self, wrt: &mut W) -> io::Result<TermIdMapping> { | ||
let mut terms: Vec<(&[u8], UnorderedId)> = | ||
self.dict.iter().map(|(k, v)| (k.as_slice(), *v)).collect(); | ||
terms.sort_unstable_by_key(|(key, _)| *key); | ||
// TODO Remove the allocation. | ||
let mut unordered_to_ord: Vec<OrderedId> = vec![OrderedId(0u32); terms.len()]; | ||
let mut sstable_builder = sstable::VoidSSTable::writer(wrt); | ||
for (ord, (key, unordered_id)) in terms.into_iter().enumerate() { | ||
let ordered_id = OrderedId(ord as u32); | ||
sstable_builder.insert(key, &())?; | ||
unordered_to_ord[unordered_id.0 as usize] = ordered_id; | ||
} | ||
sstable_builder.finish()?; | ||
Ok(TermIdMapping { unordered_to_ord }) | ||
} | ||
} | ||
|
||
#[cfg(test)] | ||
mod tests { | ||
use super::*; | ||
|
||
#[test] | ||
fn test_dictionary_builder() { | ||
let mut dictionary_builder = DictionaryBuilder::default(); | ||
let hello_uid = dictionary_builder.get_or_allocate_id(b"hello"); | ||
let happy_uid = dictionary_builder.get_or_allocate_id(b"happy"); | ||
let tax_uid = dictionary_builder.get_or_allocate_id(b"tax"); | ||
let mut buffer = Vec::new(); | ||
let id_mapping = dictionary_builder.serialize(&mut buffer).unwrap(); | ||
assert_eq!(id_mapping.to_ord(hello_uid), OrderedId(1)); | ||
assert_eq!(id_mapping.to_ord(happy_uid), OrderedId(0)); | ||
assert_eq!(id_mapping.to_ord(tax_uid), OrderedId(2)); | ||
} | ||
} |
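
The two-phase id assignment used by `DictionaryBuilder` can be illustrated without the `sstable` dependency. This is a simplified sketch using only the standard library: `SketchDictionaryBuilder` and `finish` are hypothetical names, plain `u32` values replace `UnorderedId` and `OrderedId`, and the sorted terms are returned in a `Vec` instead of being written to an sstable.

```rust
use std::collections::HashMap;

/// Stand-in for `DictionaryBuilder`, built on the standard `HashMap`.
#[derive(Default)]
pub struct SketchDictionaryBuilder {
    dict: HashMap<Vec<u8>, u32>,
}

impl SketchDictionaryBuilder {
    /// Returns the existing unordered id of `term`, or allocates the next one.
    pub fn get_or_allocate_id(&mut self, term: &[u8]) -> u32 {
        if let Some(id) = self.dict.get(term) {
            return *id;
        }
        // Ids are handed out in insertion order: 0, 1, 2, ...
        let new_id = self.dict.len() as u32;
        self.dict.insert(term.to_vec(), new_id);
        new_id
    }

    /// Returns the sorted terms and the `unordered id -> ordered id` mapping.
    pub fn finish(&self) -> (Vec<&[u8]>, Vec<u32>) {
        let mut terms: Vec<(&[u8], u32)> = self
            .dict
            .iter()
            .map(|(term, id)| (term.as_slice(), *id))
            .collect();
        terms.sort_unstable_by_key(|(term, _)| *term);
        let mut unordered_to_ord = vec![0u32; terms.len()];
        let mut sorted_terms = Vec::with_capacity(terms.len());
        for (ord, (term, unordered_id)) in terms.into_iter().enumerate() {
            // The position in sorted order is the ordered id of the term.
            unordered_to_ord[unordered_id as usize] = ord as u32;
            sorted_terms.push(term);
        }
        (sorted_terms, unordered_to_ord)
    }
}
```

Values recorded before serialization only carry the cheap insertion-order id; the mapping returned by `finish` is what allows them to be rewritten into sorted term ordinals afterwards.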