feat: support parsing for parquet writer option #4938

Merged
merged 5 commits into from
Oct 18, 2023
Changes from 2 commits
103 changes: 103 additions & 0 deletions parquet/src/basic.rs
@@ -18,6 +18,7 @@
//! Contains Rust mappings for Thrift definition.
//! Refer to [`parquet.thrift`](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift) file to see raw definitions.

use std::str::FromStr;
use std::{fmt, str};

pub use crate::compression::{BrotliLevel, GzipLevel, ZstdLevel};
@@ -278,6 +279,25 @@ pub enum Encoding {
    BYTE_STREAM_SPLIT,
}

impl FromStr for Encoding {
    type Err = ParquetError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_owned().to_uppercase().as_str() {
Contributor

I wonder if we should be case sensitive and leave it to the caller to force to upper case if that is what they want?

It seems strange that we would support parsing things like PlAiN

Contributor Author

> It seems strange that we would support parsing things like PlAiN

We can't really observe how users will spell these values, so I also find it odd to support parsing input like PlAiN, but it seems that both upper and lower case should be supported?

Contributor

I think we should just support case sensitive parsing and the user can opt in to case insensitive parsing by converting the input to uppercase should they want this behaviour

Contributor Author

> I think we should just support case sensitive parsing and the user can opt in to case insensitive parsing by converting the input to uppercase should they want this behaviour

Done. I think it's a good suggestion.

            "PLAIN" => Ok(Encoding::PLAIN),
            "PLAIN_DICTIONARY" => Ok(Encoding::PLAIN_DICTIONARY),
            "RLE" => Ok(Encoding::RLE),
            "BIT_PACKED" => Ok(Encoding::BIT_PACKED),
            "DELTA_BINARY_PACKED" => Ok(Encoding::DELTA_BINARY_PACKED),
            "DELTA_LENGTH_BYTE_ARRAY" => Ok(Encoding::DELTA_LENGTH_BYTE_ARRAY),
            "DELTA_BYTE_ARRAY" => Ok(Encoding::DELTA_BYTE_ARRAY),
            "RLE_DICTIONARY" => Ok(Encoding::RLE_DICTIONARY),
            "BYTE_STREAM_SPLIT" => Ok(Encoding::BYTE_STREAM_SPLIT),
            _ => Err(general_err!("unknown encoding: {}", s)),
        }
    }
}
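
The outcome of the review thread above (case-sensitive parsing, with the caller opting in to case-insensitive input) could look roughly like the sketch below. It is not the merged code: the `Encoding` enum here is a trimmed, hypothetical stand-in with three variants, the error type is a plain `String` instead of `ParquetError`, and `parse_encoding_ignore_case` is an assumed caller-side helper.

```rust
use std::str::FromStr;

// Hypothetical stand-in for `parquet::basic::Encoding`, trimmed to three variants.
#[allow(non_camel_case_types)]
#[derive(Debug, PartialEq)]
enum Encoding {
    PLAIN,
    RLE,
    BYTE_STREAM_SPLIT,
}

impl FromStr for Encoding {
    // Assumption: a plain String error keeps the sketch free of the parquet crate.
    type Err = String;

    // Case-sensitive: only the exact upper-case names are accepted.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "PLAIN" => Ok(Encoding::PLAIN),
            "RLE" => Ok(Encoding::RLE),
            "BYTE_STREAM_SPLIT" => Ok(Encoding::BYTE_STREAM_SPLIT),
            _ => Err(format!("unknown encoding: {s}")),
        }
    }
}

// Hypothetical caller-side helper: opt in to case-insensitive parsing
// by upper-casing the input before handing it to `FromStr`.
fn parse_encoding_ignore_case(s: &str) -> Result<Encoding, String> {
    s.to_uppercase().parse()
}
```

With this split, `"PlAiN".parse::<Encoding>()` is rejected, while `parse_encoding_ignore_case("PlAiN")` is accepted because the caller asked for that behaviour explicitly.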

// ----------------------------------------------------------------------
// Mirrors `parquet::CompressionCodec`

@@ -295,6 +315,89 @@ pub enum Compression {
    LZ4_RAW,
}

fn split_compression_string(
    str_setting: &str,
) -> Result<(&str, Option<u32>), ParquetError> {
    let split_setting = str_setting.split_once('(');

    match split_setting {
        Some((codec, level_str)) => {
            let level =
                &level_str[..level_str.len() - 1]
                    .parse::<u32>()
                    .map_err(|_| {
                        ParquetError::General(format!(
                            "invalid compression level: {}",
                            level_str
                        ))
                    })?;
            Ok((codec, Some(*level)))
        }
        None => Ok((str_setting, None)),
    }
}
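
One thing worth checking in `split_compression_string`: it slices off the last byte of `level_str` without confirming that it is `)`. As far as I can tell, an input that ends right after the `(` (for example `gzip(`) would make `level_str.len() - 1` underflow and panic, and an input such as `gzip(61` would silently parse as level 6. Below is a minimal sketch of a stricter variant, assuming a `String` error instead of `ParquetError`; the name `split_codec_and_level` is hypothetical and not part of the PR.

```rust
// Hypothetical stricter variant of `split_compression_string`.
// Errors are plain Strings so the sketch compiles without the parquet crate.
fn split_codec_and_level(setting: &str) -> Result<(&str, Option<u32>), String> {
    match setting.split_once('(') {
        Some((codec, rest)) => {
            // Require the closing parenthesis instead of dropping the last byte.
            let level_str = rest
                .strip_suffix(')')
                .ok_or_else(|| format!("missing ')' in compression setting: {setting}"))?;
            // Only a non-negative integer is accepted as a level.
            let level = level_str
                .parse::<u32>()
                .map_err(|_| format!("invalid compression level: {level_str}"))?;
            Ok((codec, Some(level)))
        }
        // No '(' at all: the whole string is the codec name and there is no level.
        None => Ok((setting, None)),
    }
}
```

Using `strip_suffix` returns an error for malformed input rather than panicking, and the error message reports only the digits rather than the digits plus the trailing parenthesis.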

fn check_level_is_none(level: &Option<u32>) -> Result<(), ParquetError> {
    if level.is_some() {
        return Err(ParquetError::General("level is not support".to_string()));
    }

    Ok(())
}

fn require_level(codec: &String, level: Option<u32>) -> Result<u32, ParquetError> {
    level.ok_or(ParquetError::General(format!("{} require level", codec)))
}

impl FromStr for Compression {
    type Err = ParquetError;

    fn from_str(s: &str) -> std::result::Result<Self, Self::Err> {
        let (codec, level) = split_compression_string(s)?;
        let codec = codec.to_uppercase();

        let c = match codec.as_str() {
            "UNCOMPRESSED" => {
                check_level_is_none(&level)?;
                Compression::UNCOMPRESSED
            }
            "SNAPPY" => {
                check_level_is_none(&level)?;
                Compression::SNAPPY
            }
            "GZIP" => {
                let level = require_level(&codec, level)?;
                Compression::GZIP(GzipLevel::try_new(level)?)
            }
            "LZO" => {
                check_level_is_none(&level)?;
                Compression::LZO
            }
            "BROTLI" => {
                let level = require_level(&codec, level)?;
                Compression::BROTLI(BrotliLevel::try_new(level)?)
            }
            "LZ4" => {
                check_level_is_none(&level)?;
                Compression::LZ4
            }
            "ZSTD" => {
                let level = require_level(&codec, level)?;
                Compression::ZSTD(ZstdLevel::try_new(level as i32)?)
            }
            "LZ4_RAW" => {
                check_level_is_none(&level)?;
                Compression::LZ4_RAW
            }
            _ => {
                return Err(ParquetError::General(format!("unsupport {codec}")));
            }
        };

        Ok(c)
    }
}
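
The accept/reject rules encoded by the match above can be summarised in a small table-like function. This is only a sketch of the decision logic: `check_level` is a hypothetical name, it returns `String` errors, and it does not model the range validation done by `GzipLevel::try_new`, `BrotliLevel::try_new`, or `ZstdLevel::try_new`.

```rust
// Hypothetical summary of which codec/level combinations the diff accepts.
// It mirrors the match arms only; level range checks are not modelled here.
fn check_level(codec: &str, level: Option<u32>) -> Result<(), String> {
    // The codec name is upper-cased first, as in the diff.
    let codec = codec.to_uppercase();
    match (codec.as_str(), level) {
        // Codecs without a level: a level must NOT be given.
        ("UNCOMPRESSED" | "SNAPPY" | "LZO" | "LZ4" | "LZ4_RAW", None) => Ok(()),
        ("UNCOMPRESSED" | "SNAPPY" | "LZO" | "LZ4" | "LZ4_RAW", Some(_)) => {
            Err(format!("{codec} does not take a level"))
        }
        // Codecs with a level: a level MUST be given.
        ("GZIP" | "BROTLI" | "ZSTD", Some(_)) => Ok(()),
        ("GZIP" | "BROTLI" | "ZSTD", None) => Err(format!("{codec} requires a level")),
        // Anything else is an unknown codec name.
        _ => Err(format!("unsupported codec: {codec}")),
    }
}
```

Read this way, `snappy` and `ZSTD(3)` are accepted, while `snappy(1)` and a bare `zstd` are rejected.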

// ----------------------------------------------------------------------
// Mirrors `parquet::PageType`

26 changes: 26 additions & 0 deletions parquet/src/file/properties.rs
@@ -16,6 +16,7 @@
// under the License.

//! Configuration via [`WriterProperties`] and [`ReaderProperties`]
use std::str::FromStr;
use std::{collections::HashMap, sync::Arc};

use crate::basic::{Compression, Encoding};
@@ -72,6 +73,18 @@ impl WriterVersion {
    }
}

impl FromStr for WriterVersion {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_owned().to_uppercase().as_str() {
            "PARQUET_1_0" => Ok(WriterVersion::PARQUET_1_0),
            "PARQUET_2_0" => Ok(WriterVersion::PARQUET_2_0),
            _ => Err(format!("Invalid writer version: {}", s)),
        }
    }
}

/// Reference counted writer properties.
pub type WriterPropertiesPtr = Arc<WriterProperties>;

@@ -655,6 +668,19 @@ pub enum EnabledStatistics {
    Page,
}

impl FromStr for EnabledStatistics {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_owned().to_uppercase().as_str() {
            "NONE" => Ok(EnabledStatistics::None),
            "CHUNK" => Ok(EnabledStatistics::Chunk),
            "PAGE" => Ok(EnabledStatistics::Page),
            _ => Err(format!("Invalid statistics arg: {}", s)),
        }
    }
}
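
Because these types implement `FromStr`, a caller can parse textual writer options with `str::parse`. The sketch below shows one way that might be used: `parse_option` is a hypothetical generic helper, the `EnabledStatistics` enum is a local stand-in that copies the impl from the diff, and the option keys (`statistics`, `max_row_group_size`) are illustrative names rather than anything defined by this PR.

```rust
use std::fmt::Display;
use std::str::FromStr;

// Local stand-in for `parquet::file::properties::EnabledStatistics`.
#[derive(Debug, PartialEq)]
enum EnabledStatistics {
    None,
    Chunk,
    Page,
}

impl FromStr for EnabledStatistics {
    type Err = String;

    // Same matching as in the diff: the input is upper-cased before comparison.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_uppercase().as_str() {
            "NONE" => Ok(EnabledStatistics::None),
            "CHUNK" => Ok(EnabledStatistics::Chunk),
            "PAGE" => Ok(EnabledStatistics::Page),
            _ => Err(format!("Invalid statistics arg: {}", s)),
        }
    }
}

// Hypothetical helper: parse any `FromStr` option value and prefix
// the error with the option key so the user can tell which setting failed.
fn parse_option<T>(key: &str, value: &str) -> Result<T, String>
where
    T: FromStr,
    T::Err: Display,
{
    value
        .parse::<T>()
        .map_err(|e| format!("option `{key}`: {e}"))
}
```

The same helper works for built-in types such as `u32`, so a configuration layer can treat enum-valued and numeric options uniformly.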

impl Default for EnabledStatistics {
    fn default() -> Self {
        DEFAULT_STATISTICS_ENABLED