Format control/JSON/Improvements #28

Merged 26 commits on Aug 4, 2022
1 change: 1 addition & 0 deletions .gitignore
Expand Up @@ -17,6 +17,7 @@

# Python cruft
*.pyc
.python-version

# C extensions
*.so
Expand Down
27 changes: 24 additions & 3 deletions CHANGELOG.md
@@ -1,12 +1,33 @@
## ClickHouse Connect ChangeLog

### Release 0.2.0, 2022-08-04

#### Deprecation warning

* In the next release the row_binary option for ClickHouse serialization will be removed. Its performance is significantly lower than Native format, and maintaining the option adds complexity with no corresponding benefit.

#### Improvements

* Support (experimental) JSON/Object datatype. ClickHouse Connect will take advantage of the fast orjson library if available. Note that inserts for JSON columns require ClickHouse server version 22.6.1 or later
* Standardize read format handling and allow specifying a return data format per column or per query.
* Added convenience min_version method to client to see if the server is at least the requested level
* Increase default HTTP timeout to 300 seconds to match ClickHouse server default
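
A check like the `min_version` convenience method above comes down to comparing dotted version strings numerically rather than as text. The following is a minimal sketch of that comparison only. The names `parse_version` and `at_least` are hypothetical and are not the client's API, and the actual method may be implemented differently.

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted version such as '22.6.1.1985' into a tuple of ints."""
    parts = []
    for piece in version.split('.'):
        # Keep only the digits of each piece; a piece with no digits counts as 0
        digits = ''.join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def at_least(server_version: str, required: str) -> bool:
    """Return True when server_version is at or above required (hypothetical helper)."""
    server = parse_version(server_version)
    wanted = parse_version(required)
    # Pad the shorter tuple with zeros so '22.6' compares equal to '22.6.0'
    width = max(len(server), len(wanted))
    server += (0,) * (width - len(server))
    wanted += (0,) * (width - len(wanted))
    return server >= wanted
```

Comparing tuples of integers avoids the string-comparison trap where `'22.10'` would sort before `'22.6'`.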

#### Bug Fixes
* Fixed multiple issues with SQL comments that would cause some queries to fail
* Fixed problem with SQLAlchemy literal binds that would cause an error in Superset filters
* Fixed issue with parameter
* Named Tuples were not supported and would result in throwing an exception. This has been fixed.
* The client query_arrow function would return incomplete results if the query result exceeded the ClickHouse max_block_size. This has been fixed. As part of the fix, the query_arrow method now returns a PyArrow Table object. While this is a breaking change in the API, it should be easy to work around.


### Release 0.1.6, 2022-07-06

#### Improvements

* Support Nested data types
* Support Nested data types.

#### Bug Fixes

* Fix issue with native reads of Nullable(LowCardinality) numeric and date types
* Empty inserts will now just log a debug message instead of throwing an IndexError
* Fix issue with native reads of Nullable(LowCardinality) numeric and date types.
* Empty inserts will now just log a debug message instead of throwing an IndexError.
11 changes: 8 additions & 3 deletions README.md
Expand Up @@ -6,10 +6,15 @@ ClickHouse HTTP interface.


### Installation

```
pip install clickhouse-connect
```

ClickHouse Connect requires Python 3.7 or higher. The `cython` package must be installed prior to installing
`clickhouse_connect` to build and install the optional Cython/C extensions used for improving read and write
performance using the ClickHouse Native format. After installing cython if desired, clone this repository and
run `python setup.py install` from the project directory.
run `python setup.py install` from the project directory.

### Getting Started

Expand Down Expand Up @@ -104,8 +109,8 @@ Create a ClickHouse client using the `clickhouse_connect.driver.create_client(..
Native format is preferred for performance reasons
* `query_limit:int` LIMIT value added to all queries.
Defaults to 5,000 rows. Unlimited queries are not supported to prevent crashing the driver
* `connect_timeout:int` HTTP connection timeout in seconds
* `send_receive_timeout:int` HTTP read timeout in seconds
* `connect_timeout:int` HTTP connection timeout in seconds. Default 10 seconds.
* `send_receive_timeout:int` HTTP read timeout in seconds. Default 300 seconds.
* `client_name:str` HTTP User-Agent header. Defaults to `clickhouse-connect`
* `verify:bool` For HTTPS connections, validate the ClickHouse server TLS certificate, including
matching hostname, expiration, and signed by a trusted Certificate Authority. Defaults to True.
Expand Down
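
As an illustration of the timeout parameters listed above, here is a small sketch that merges caller overrides onto the documented defaults (10 and 300 seconds) before handing them to `create_client`. The helpers `client_kwargs` and `connect` are hypothetical names introduced for this example. Only `create_client`, `connect_timeout`, and `send_receive_timeout` come from the README, and `connect` needs the package installed and a reachable server.

```python
# Defaults as documented in the README parameter list
DOCUMENTED_DEFAULTS = {
    'connect_timeout': 10,        # HTTP connection timeout in seconds
    'send_receive_timeout': 300,  # HTTP read timeout in seconds
}


def client_kwargs(**overrides) -> dict:
    """Merge overrides onto the documented defaults and reject non-positive timeouts."""
    settings = {**DOCUMENTED_DEFAULTS, **overrides}
    for key in ('connect_timeout', 'send_receive_timeout'):
        if settings[key] <= 0:
            raise ValueError(f'{key} must be positive, got {settings[key]}')
    return settings


def connect(**overrides):
    """Hypothetical wrapper around create_client; not part of the library."""
    # Imported lazily so the helper above can be used without the package installed
    from clickhouse_connect.driver import create_client
    return create_client(**client_kwargs(**overrides))
```

A long-running report query might, for example, call `connect(send_receive_timeout=900)` while leaving the connection timeout at its default.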
38 changes: 25 additions & 13 deletions clickhouse_connect/cc_sqlalchemy/datatypes/base.py
Expand Up @@ -5,14 +5,15 @@

from clickhouse_connect.datatypes.base import ClickHouseType, TypeDef, EMPTY_TYPE_DEF
from clickhouse_connect.datatypes.registry import parse_name, type_map
from clickhouse_connect.driver.query import format_query_value

logger = logging.getLogger(__name__)


class ChSqlaType:
    """
    A SQLAlchemy TypeEngine that wraps a ClickHouseType. We don't extend TypeEngine directly, instead all concrete subclasses
    will inherit from TypeEngine
    A SQLAlchemy TypeEngine that wraps a ClickHouseType. We don't extend TypeEngine directly, instead all concrete
    subclasses will inherit from TypeEngine.
    """
    ch_type: ClickHouseType = None
    generic_type: None
Expand All @@ -22,7 +23,8 @@ class ChSqlaType:

    def __init_subclass__(cls):
        """
        Registers ChSqla type in the type map and sets the underlying ClickHouseType class to use to initialize ChSqlaType instances
        Registers ChSqla type in the type map and sets the underlying ClickHouseType class to use to initialize
        ChSqlaType instances
        """
        base = cls.__name__
        if not cls._ch_type_cls:
Expand All @@ -47,10 +49,10 @@ def build(cls, type_def: TypeDef):
    def __init__(self, type_def: TypeDef = EMPTY_TYPE_DEF):
        """
        Basic constructor that does nothing but set the wrapped ClickHouseType. It is overridden in some cases
        to add specific SqlAlchemy behavior when constructing subclasses "by hand", in which case the type_def parameter is
        normally set to None and other keyword parameters used for construction
        :param type_def: TypeDef tuple used to build the underlying ClickHouseType. This is normally populated by the parse_name
        function
        to add specific SqlAlchemy behavior when constructing subclasses "by hand", in which case the type_def
        parameter is normally set to None and other keyword parameters used for construction
        :param type_def: TypeDef tuple used to build the underlying ClickHouseType. This is normally populated by the
        parse_name function
        """
        self.type_def = type_def
        self.ch_type = self._ch_type_cls.build(type_def)
Expand All @@ -74,23 +76,33 @@ def low_card(self):
    @staticmethod
    def result_processor():
        """
        Override for the SqlAlchemy TypeEngine result_processor method, which is used to convert row values to the correct Python type
        The core driver handles this automatically, so we always return None
        Override for the SqlAlchemy TypeEngine result_processor method, which is used to convert row values to the
        correct Python type. The core driver handles this automatically, so we always return None.
        """
        return None

    @staticmethod
    def _cached_result_processor(*_):
        """
        Override for the SqlAlchemy TypeEngine _cached_result_processor method to prevent weird behavior when SQLAlchemy tries to cache
        Override for the SqlAlchemy TypeEngine _cached_result_processor method to prevent weird behavior
        when SQLAlchemy tries to cache.
        """
        return None

    @staticmethod
    def _cached_literal_processor(*_):
        """
        Override for the SqlAlchemy TypeEngine _cached_literal_processor. We delegate to the driver format_query_value
        method and should be able to ignore literal_processor definitions in the dialect, which are verbose and
        confusing.
        """
        return format_query_value

    def _compiler_dispatch(self, _visitor, **_):
        """
        Override for the SqlAlchemy TypeEngine _compiler_dispatch method to sidestep unnecessary layers and complexity when generating
        the type name. The underlying ClickHouseType generates the correct name
        :return: Name generated by the underlying driver
        Override for the SqlAlchemy TypeEngine _compiler_dispatch method to sidestep unnecessary layers and complexity
        when generating the type name. The underlying ClickHouseType generates the correct name
        :return: Name generated by the underlying driver.
        """
        return self.name

Expand Down
9 changes: 1 addition & 8 deletions clickhouse_connect/cc_sqlalchemy/sql/__init__.py
@@ -1,17 +1,10 @@
import re
from typing import Optional

from sqlalchemy import Table
from sqlalchemy.sql.compiler import RESERVED_WORDS

reserved_words = RESERVED_WORDS | set('index')
identifier_re = re.compile(r'^[a-zA-Z_][0-9a-zA-Z_]*$')


def quote_id(v: str) -> str:
    if v in reserved_words or not identifier_re.match(v):
        return f'`{v}`'
    return v
    return f'`{v}`'


def full_table(table_name: str, schema: Optional[str] = None) -> str:
Expand Down
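
One detail in the removed lines is worth spelling out: `set('index')` iterates over the string, so it produces a set of single characters rather than a set containing the word `index`. The sketch below contrasts conditional quoting in the style of the removed code with the unconditional quoting that replaces it. The function names `quote_if_needed` and `quote_always` are hypothetical, and the `reserved` argument stands in for the module-level `reserved_words`. In the real module the outcome for `index` would also depend on whether SQLAlchemy's `RESERVED_WORDS` already contains it.

```python
import re

# set('index') iterates the string, so it yields single characters
chars = set('index')   # the characters i, n, d, e, x
word = {'index'}       # a set containing the word itself

identifier_re = re.compile(r'^[a-zA-Z_][0-9a-zA-Z_]*$')


def quote_if_needed(v: str, reserved: set) -> str:
    """Conditional quoting in the style of the removed code."""
    if v in reserved or not identifier_re.match(v):
        return f'`{v}`'
    return v


def quote_always(v: str) -> str:
    """Unconditional quoting, matching the single line that replaces it."""
    return f'`{v}`'
```

Always quoting removes the dependency on a reserved-word list entirely. Neither variant escapes backticks embedded in the identifier itself.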
8 changes: 7 additions & 1 deletion clickhouse_connect/cc_sqlalchemy/sql/preparer.py
@@ -1,5 +1,11 @@
from sqlalchemy.sql.compiler import IdentifierPreparer

from clickhouse_connect.cc_sqlalchemy.sql import quote_id


class ChIdentifierPreparer(IdentifierPreparer):
    pass

    quote_identifier = staticmethod(quote_id)

    def _requires_quotes(self, _value):
        return True
10 changes: 5 additions & 5 deletions clickhouse_connect/cc_superset/datatypes.py
Expand Up @@ -2,7 +2,7 @@

from superset.utils.core import GenericDataType
from clickhouse_connect.cc_sqlalchemy.datatypes.base import sqla_type_map
from clickhouse_connect.datatypes import fixed_string_format, uint64_format, ip_format, uuid_format
from clickhouse_connect.datatypes.format import set_default_formats

type_mapping = (
    (r'^(FLOAT|DECIMAL|INT|UINT)', GenericDataType.NUMERIC),
Expand All @@ -16,10 +16,10 @@ def configure_types():
    Monkey patch the Superset generic_type onto the clickhouse type, also set defaults for certain type formatting to be
    better compatible with superset
    """
    fixed_string_format('string', 'utf8')
    uint64_format('signed')
    ip_format('string')
    uuid_format('string')
    set_default_formats(FixedString='string',
                        IPv4='string',
                        UInt64='signed',
                        UUID='string')
    compiled = [(re.compile(pattern, re.IGNORECASE), gen_type) for pattern, gen_type in type_mapping]
    for name, sqla_type in sqla_type_map.items():
        for pattern, gen_type in compiled:
Expand Down
47 changes: 0 additions & 47 deletions clickhouse_connect/datatypes/__init__.py
Expand Up @@ -8,7 +8,6 @@
import clickhouse_connect.datatypes.temporal
import clickhouse_connect.datatypes.registry

from clickhouse_connect.driver.exceptions import ProgrammingError

logger = logging.getLogger(__name__)

Expand All @@ -21,49 +20,3 @@
    dt_string.FixedString._read_native_bytes = creaders.read_fixed_string_bytes
except ImportError:
    logger.warning('Unable to connect optimized C driver functions, falling back to pure Python', exc_info=True)


def fixed_string_format(fmt: str, encoding: str = 'utf8'):
    if fmt == 'string':
        dt_string.FixedString.format = 'string'
        dt_string.FixedString.encoding = encoding
    elif fmt == 'bytes':
        dt_string.FixedString.format = 'bytes'
        dt_string.FixedString.encoding = 'utf8'
    else:
        raise ProgrammingError(f'Unrecognized fixed string default format {fmt}')


def big_int_format(fmt: str):
    if fmt in ('string', 'int'):
        dt_numeric.BigInt.format = fmt
    else:
        raise ProgrammingError(f'Unrecognized Big Integer default format {fmt}')


def uint64_format(fmt: str):
    if fmt == 'unsigned':
        dt_numeric.UInt64.format = 'unsigned'
        dt_numeric.UInt64._array_type = 'Q'
        dt_numeric.UInt64.np_format = 'u8'
    elif fmt == 'signed':
        dt_numeric.UInt64.format = 'signed'
        dt_numeric.UInt64._array_type = 'q'
        dt_numeric.UInt64.np_format = 'i8'
    else:
        raise ProgrammingError(f'Unrecognized UInt64 default format {fmt}')


def uuid_format(fmt: str):
    if fmt in ('uuid', 'string'):
        dt_special.UUID.format = fmt
    else:
        raise ProgrammingError(f'Unrecognized UUID default format {fmt}')


def ip_format(fmt: str):
    if fmt in ('string', 'ip'):
        dt_network.IPv4.format = fmt
        dt_network.IPv6.format = fmt
    else:
        raise ProgrammingError(f'Unrecognized IPv4/IPv6 default format {fmt}')
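
The five per-type helpers removed above all follow the same pattern: check the requested format against a short list, then set a class attribute. One way to fold that pattern into a single keyword-driven call is sketched below. This is an illustration only, not the `set_default_formats` function this PR imports from `clickhouse_connect.datatypes.format`, whose implementation is not shown in this diff. `VALID_FORMATS`, `default_formats`, `FormatError`, and `set_formats` are hypothetical names, and the allowed values are copied from the removed helpers.

```python
# Allowed default formats per type name, copied from the removed helpers
VALID_FORMATS = {
    'FixedString': ('string', 'bytes'),
    'BigInt': ('string', 'int'),
    'UInt64': ('unsigned', 'signed'),
    'UUID': ('uuid', 'string'),
    'IPv4': ('string', 'ip'),
    'IPv6': ('string', 'ip'),
}

# Stand-in for the per-class `format` attributes a real driver would set
default_formats = {}


class FormatError(ValueError):
    """Hypothetical stand-in for the driver's ProgrammingError."""


def set_formats(**formats: str) -> None:
    """Validate every requested format first, then apply them all at once."""
    for type_name, fmt in formats.items():
        allowed = VALID_FORMATS.get(type_name)
        if allowed is None:
            raise FormatError(f'Unrecognized type {type_name}')
        if fmt not in allowed:
            raise FormatError(f'Unrecognized {type_name} default format {fmt}')
    # Only reached when every entry is valid, so a bad call changes nothing
    default_formats.update(formats)
```

Validating before applying means a call with one bad entry leaves the existing defaults untouched, which the one-function-per-type approach could not guarantee across several calls.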