ResultSet executeQuery(Map<ClickHouseQueryParam, String> additionalDBParams, List<ClickHouseExternalData> externalData) throws SQLException; #1370

Closed

rhy429081492 opened this issue May 18, 2023 · 3 comments

@rhy429081492
Describe the bug

I occasionally get an error when I use this JDBC method to pass external data in batches.
Each error complains about the format of a data sample, and the reported sample position differs between runs.
When I reduce the amount of data, the query succeeds.
I captured the traffic with Wireshark and found that the request packets the server receives on a failing request are identical to those it receives on a successful one.

Code example

try (PreparedStatement statement = prepareStatement(dataSource, sql)) {
    List<ClickHouseExternalData> useData = new ArrayList<>();
    for (int i = 0; i < externalData.size(); i++) {
        ClickHouseExternalData tmpData = externalData.get(i);
        useData.add(new ClickHouseExternalData(tmpData.getName(),
                new ByteArrayInputStream(input.get(i).getBytes(StandardCharsets.UTF_8)))
                .withStructure(tmpData.getStructure()).withFormat(tmpData.getFormat()));
    }
    ClickHousePreparedStatement clickHousePreparedStatement = (ClickHousePreparedStatement) statement;
    try (ResultSet rs = clickHousePreparedStatement.executeQuery(null, useData)) {
        LOGGER.debug("ExecuteShardQuery: {}, Use DataSource Url: [{}]", exeMethod,
                dataSource.getAllClickhouseUrls());
        return callBack.call(rs);
    }
}
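The server's error hint ("maybe it have unescaped backslash in value before tab") points at one common cause of this class of failure: when serializing values into TabSeparated yourself, raw tabs, newlines, and backslashes inside values must be backslash-escaped. This is a minimal sketch of such escaping, not code from the driver; the class and method names here are mine:

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

public class TsvEscaper {
    // Escape one value per ClickHouse TabSeparated rules:
    // backslash, tab, LF and CR must be backslash-escaped.
    static String escapeTsv(String value) {
        StringBuilder sb = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '\t': sb.append("\\t"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // Serialize rows (e.g. tagname/tagvalue pairs) into TabSeparated bytes
    // suitable for wrapping in a ByteArrayInputStream.
    static byte[] toTsvBytes(List<String[]> rows) {
        StringBuilder sb = new StringBuilder();
        for (String[] row : rows) {
            for (int i = 0; i < row.length; i++) {
                if (i > 0) sb.append('\t');
                sb.append(escapeTsv(row[i]));
            }
            sb.append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }
}
```

If the values in `input.get(i)` are already escaped this way, then the intermittent parse error is more likely on the driver/transport side.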

Error log

First error log
2023.05.18 10:19:55.190332 [ 27874 ] {} DynamicQueryHandler: Code: 27, e.displayText() = DB::ParsingException: Cannot parse input: expected '\t' before: '\nsi\tslot1202\nsi\tslot1203\nsi\tslot1204\nsi\tslot1205\nsi\tslot1206\nsi\tslot1207\nsi\tslot1208\nsi\tslot1209\nsi\tslot1210\nsi\tslot1211\nsi\tslot1212\nsi\tslot1213\nsi\tslot1214\nsi\t':
Row 1203:
Column 0, name: tagname, type: String, parsed text: "si"
Column 1, name: tagvalue, type: String, parsed text: "slot1201"

Row 1204:
Column 0, name: tagname, type: String, parsed text: "201"
ERROR: Line feed found where tab is expected. It's like your file has less columns than expected.
And if your file have right number of columns, maybe it have unescaped backslash in value before tab, which cause tab has escaped.

: While executing SourceFromInputStream: (at row 1204)
, Stack trace (when copying this message, always include the lines below):

  1. DB::throwAtAssertionFailed(char const*, DB::ReadBuffer&) @ 0x8667d25 in /usr/bin/clickhouse
  2. DB::TabSeparatedRowInputFormat::readRow(std::__1::vector<COW<DB::IColumn>::mutable_ptr<DB::IColumn>, std::__1::allocator<COW<DB::IColumn>::mutable_ptr<DB::IColumn>>>&, DB::RowReadExtension&) @ 0xfa43b52 in /usr/bin/clickhouse
  3. DB::IRowInputFormat::generate() @ 0xf970c58 in /usr/bin/clickhouse
  4. DB::ISource::tryGenerate() @ 0xf8ff205 in /usr/bin/clickhouse
  5. DB::ISource::work() @ 0xf8fedfa in /usr/bin/clickhouse
  6. DB::ParallelParsingInputFormat::InternalParser::getChunk() @ 0xf9c7aae in /usr/bin/clickhouse
  7. DB::ParallelParsingInputFormat::parserThreadFunction(std::__1::shared_ptr<DB::ThreadGroupStatus>, unsigned long) @ 0xf9c710e in /usr/bin/clickhouse
  8. ThreadPoolImpl<ThreadFromGlobalPool>::worker(std::__1::__list_iterator<ThreadFromGlobalPool, void*>) @ 0x8654088 in /usr/bin/clickhouse
  9. ThreadFromGlobalPool::ThreadFromGlobalPool<void ThreadPoolImpl::scheduleImpl(std::__1::function<void ()>, int, std::__1::optional)::'lambda1'()>(void&&, void ThreadPoolImpl::scheduleImpl(std::__1::function<void ()>, int, std::__1::optional)::'lambda1'()&&...)::'lambda'()::operator()() @ 0x865603f in /usr/bin/clickhouse
  10. ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) @ 0x865165f in /usr/bin/clickhouse
  11. ? @ 0x86550f3 in /usr/bin/clickhouse
  12. start_thread @ 0x7dc5 in /usr/lib64/libpthread-2.17.so
  13. __clone @ 0xf9d1d in /usr/lib64/libc-2.17.so
    (version 21.3.4.25 (official build))

Second error log
2023.05.18 10:26:41.058246 [ 2668 ] {} void DB::ParallelParsingInputFormat::onBackgroundException(size_t): Code: 27, e.displayText() = DB::ParsingException: Cannot parse input: expected ',' before: '\nsi,slot1706\nsi,slot1707\nsi,slot1708\nsi,slot1709\nsi,slot1710\nsi,slot1711\nsi,slot1712\nsi,slot1713\nsi,slot1714\nsi,slot1715\nsi,slot1716\nsi,slot1717\nsi,slot1718\nsi,':
Row 1707:
Column 0, name: tagname, type: String, parsed text: "si"
Column 1, name: tagvalue, type: String, parsed text: "slot1705"

Row 1708:
Column 0, name: tagname, type: String, parsed text: "5"
ERROR: Line feed found where delimiter (,) is expected. It's like your file has less columns than expected.
And if your file have right number of columns, maybe it have unescaped quotes in values.

: (at row 1708)
, Stack trace (when copying this message, always include the lines below):

  1. DB::throwAtAssertionFailed(char const*, DB::ReadBuffer&) @ 0x8667d25 in /usr/bin/clickhouse
  2. ? @ 0xf991adb in /usr/bin/clickhouse
  3. DB::CSVRowInputFormat::readRow(std::__1::vector<COW<DB::IColumn>::mutable_ptr<DB::IColumn>, std::__1::allocator<COW<DB::IColumn>::mutable_ptr<DB::IColumn>>>&, DB::RowReadExtension&) @ 0xf991c2e in /usr/bin/clickhouse
  4. DB::IRowInputFormat::generate() @ 0xf970c58 in /usr/bin/clickhouse
  5. DB::ISource::tryGenerate() @ 0xf8ff205 in /usr/bin/clickhouse
  6. DB::ISource::work() @ 0xf8fedfa in /usr/bin/clickhouse
  7. DB::ParallelParsingInputFormat::InternalParser::getChunk() @ 0xf9c7aae in /usr/bin/clickhouse
  8. DB::ParallelParsingInputFormat::parserThreadFunction(std::__1::shared_ptr<DB::ThreadGroupStatus>, unsigned long) @ 0xf9c710e in /usr/bin/clickhouse
  9. ThreadPoolImpl<ThreadFromGlobalPool>::worker(std::__1::__list_iterator<ThreadFromGlobalPool, void*>) @ 0x8654088 in /usr/bin/clickhouse
  10. ThreadFromGlobalPool::ThreadFromGlobalPool<void ThreadPoolImpl::scheduleImpl(std::__1::function<void ()>, int, std::__1::optional)::'lambda1'()>(void&&, void ThreadPoolImpl::scheduleImpl(std::__1::function<void ()>, int, std::__1::optional)::'lambda1'()&&...)::'lambda'()::operator()() @ 0x865603f in /usr/bin/clickhouse
  11. ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) @ 0x865165f in /usr/bin/clickhouse
  12. ? @ 0x86550f3 in /usr/bin/clickhouse
  13. start_thread @ 0x7dc5 in /usr/lib64/libpthread-2.17.so
  14. __clone @ 0xf9d1d in /usr/lib64/libc-2.17.so
    (version 21.3.4.25 (official build))

Configuration

Environment

  • Client version: ru.yandex.clickhouse clickhouse-jdbc 0.3.2
  • Language version: jdk 1.8
  • OS: win10

ClickHouse server

  • ClickHouse Server version: 21.3.4.25
  • ClickHouse Server non-default settings, if any:
  • CREATE TABLE statements for tables involved:
  • Sample data for all these tables, use clickhouse-obfuscator if necessary
@zhicwu
Contributor

zhicwu commented May 18, 2023

Hi @rhy429081492, have you validated the data captured by Wireshark? Taking the first error log as an example, did you see si\tslot1201\n201\nsi\tslot1202 in the Wireshark capture? Is that the same as what you passed in input.get(i)? If you're certain that you passed valid data to the JDBC driver, then it could be a bug.

Lastly, please note that you're using an obsolete driver which is no longer supported. I'd suggest you upgrade to a more recent version like 0.4.6.

@zhicwu zhicwu added question and removed bug labels May 18, 2023
@rhy429081492
Author

> Hi @rhy429081492, have you validated the data captured by Wireshark? Taking the first error log as an example, did you see si\tslot1201\n201\nsi\tslot1202 in the Wireshark capture? Is that the same as what you passed in input.get(i)? If you're certain that you passed valid data to the JDBC driver, then it could be a bug.
>
> Lastly, please note that you're using an obsolete driver which is no longer supported. I'd suggest you upgrade to a more recent version like 0.4.6.
I confirm that the packets captured by Wireshark contain this data, and the data is valid. The problem occurs only when the data volume increases.
My original goal was to reduce the latency caused by putting too many values into an IN clause.
In the latest version of the JDBC client, the interface I was using is no longer available. The new interface solves this by creating temporary tables, but the way temporary tables are created is not ideal for performance; it does not perform as well as the interface this issue is about.
When creating a temporary table, the new interface communicates with the server multiple times, and its packets are completely different from those of the interface I was using.
I'd rather do something similar to what the ClickHouse docs describe: https://clickhouse.com/docs/en/engines/table-engines/special/external-data
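Since the reporter notes the failure appears only once the data volume grows, one pragmatic workaround (until the driver is upgraded or the root cause found) is to split the IN-replacement data into smaller external-data batches and issue one query per chunk. A minimal, driver-agnostic partitioning sketch; the class and method names are mine, not part of any ClickHouse API:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchSplitter {
    // Split a list into consecutive chunks of at most chunkSize elements.
    static <T> List<List<T>> partition(List<T> items, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < items.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, items.size());
            // Copy the view so later mutation of items cannot affect a chunk.
            chunks.add(new ArrayList<>(items.subList(start, end)));
        }
        return chunks;
    }
}
```

Each chunk would then be wrapped in its own ClickHouseExternalData and queried separately, with the per-chunk result sets merged client-side; this trades one large request for several smaller ones.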

@zhicwu
Contributor

zhicwu commented May 23, 2023

> The new interface solves this by creating temporary tables, but the way temporary tables are created is not ideal for performance; it does not perform as well as the interface this issue is about.

Are you referring to the following example? The name might indeed be somewhat confusing: internally, it utilizes an external table within the same HTTP request. Moreover, you can invoke the ClickHouseExternalData.asTempTable() function if you prefer to employ a temporary table instead of an external table. This can be particularly useful for handling complex data types, as mentioned in ClickHouse/ClickHouse#32171.

static String externalTables(String url) throws SQLException {
    String sql = "select a.name as n1, b.name as n2 from {tt 'table1'} a inner join {tt 'table2'} b on a.id=b.id";
    try (Connection conn = getConnection(url); PreparedStatement ps = conn.prepareStatement(sql)) {
        ps.setObject(1,
                ClickHouseExternalTable.builder().name("table1").columns("id Int32, name Nullable(String)")
                        .format(ClickHouseFormat.CSV)
                        .content(newInputStream("1,a\n2,b")).build());
        ps.setObject(2,
                ClickHouseExternalTable.builder().name("table2").columns("id Int32, name String")
                        .format(ClickHouseFormat.JSONEachRow)
                        .content(newInputStream("{\"id\":3,\"name\":\"c\"}\n{\"id\":1,\"name\":\"d\"}")).build());
        try (ResultSet rs = ps.executeQuery()) {
            if (!rs.next()) {
                throw new IllegalStateException("Should have at least one record");
            }
            // n1=a, n2=d
            return String.format("n1=%s, n2=%s", rs.getString(1), rs.getString(2));
        }
    }
}

In conclusion, if your application demands high performance, I strongly recommend upgrading to 0.3.2-patch11 (use com.clickhouse.jdbc.ClickHouseDriver, since both old and new drivers co-exist in 0.3.2). This version offers a significant speed increase compared to its predecessors, as detailed in #768. Additionally, version 0.4 doubles the performance of nested type serialization. Furthermore, you might consider utilizing the Java client, which provides a lower-level API with minimal overhead, typically resulting in superior performance.
