-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Set header and footer skip line count for CSV tables #1090
Conversation
c18e425
to
ba1c84f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also add a test where data for csv file spans across two splits to see if data read by Presto is correct.
// Textfile specific property | ||
String textSkipHeaderCount = table.get().getParameters().get(TEXT_SKIP_HEADER_COUNT_KEY); | ||
// Textfile and CSV specific property | ||
String textSkipHeaderCount = table.get().getParameters().get(SKIP_HEADER_COUNT_KEY); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename variable to skipHeaderCount
} | ||
String textSkipFooterCount = table.get().getParameters().get(TEXT_SKIP_FOOTER_COUNT_KEY); | ||
String textSkipFooterCount = table.get().getParameters().get(SKIP_FOOTER_COUNT_KEY); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
if (headerSkipCount > 0) { | ||
checkFormatForProperty(hiveStorageFormat, HiveStorageFormat.TEXTFILE, TEXTFILE_SKIP_HEADER_LINE_COUNT); | ||
tableProperties.put(TEXT_SKIP_HEADER_COUNT_KEY, String.valueOf(headerSkipCount)); | ||
checkFormatForProperty(hiveStorageFormat, expectedFormats, SKIP_HEADER_LINE_COUNT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe:
checkFormatForProperty(hiveStorageFormat, HiveStorageFormat.TEXTFILE, TEXTFILE_SKIP_HEADER_LINE_COUNT);
checkFormatForProperty(hiveStorageFormat, HiveStorageFormat.CSV, TEXTFILE_SKIP_HEADER_LINE_COUNT);
```?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But a csv format would fail in first statement and similarly for text format.
// Textfile specific properties | ||
getTextHeaderSkipCount(tableMetadata.getProperties()).ifPresent(headerSkipCount -> { | ||
// Textfile and CSV specific properties | ||
ImmutableSet<HiveStorageFormat> expectedFormats = ImmutableSet.of(HiveStorageFormat.TEXTFILE, HiveStorageFormat.CSV); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expectedFormats
-> csvAndTextFile
if (tableStorageFormat == HiveStorageFormat.TEXTFILE) { | ||
if (table.getParameters().containsKey(TEXT_SKIP_HEADER_COUNT_KEY)) { | ||
throw new PrestoException(NOT_SUPPORTED, format("Inserting into Hive table with %s property not supported", TEXT_SKIP_HEADER_COUNT_KEY)); | ||
if (tableStorageFormat == HiveStorageFormat.TEXTFILE || tableStorageFormat == HiveStorageFormat.CSV) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this wrapping if
block is needed here.
@@ -0,0 +1,3 @@ | |||
header1,header2,header3 | |||
value1,value2,value3 | |||
footer1,footer2,footer3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no new line at the end
onHive().executeQuery("DROP TABLE test_csvfile_skip_header"); | ||
|
||
onHive().executeQuery("DROP TABLE IF EXISTS test_csvfile_skip_footer"); | ||
onHive().executeQuery("" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you divide this test into two tests
@@ -164,6 +192,87 @@ private void testCreatePartitionedCsvTableAs(String tableName, String additional | |||
query("DROP TABLE " + tableName); | |||
} | |||
|
|||
@Test | |||
public void testCreateTextFileSkipHeaderFooter() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please migrate this test to convention tests. See:
cox@Grzegorzs-MacBook-Pro:~/presto/presto-product-tests$ find src -name csv*
src/main/resources/sql-tests/datasets/csv_table.ddl
src/main/resources/sql-tests/datasets/csv_table_with_custom_parameters.ddl
src/main/resources/sql-tests/datasets/csv_table_with_custom_parameters.data
src/main/resources/sql-tests/datasets/csv_table.data
src/main/resources/sql-tests/testcases/csv_table_with_custom_parameters.sql
src/main/resources/sql-tests/testcases/csv_table.sql
" textfile_skip_footer_line_count = 1,\n" + | ||
" textfile_skip_header_line_count = 1\n" + | ||
" skip_footer_line_count = 1,\n" + | ||
" skip_header_line_count = 1\n" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please add tests for csv with skip_header_line_count
and skip_footer_line_count
here?
@Praveen2112, any updates on this? |
@martint Yes on the final leg of testing... |
ba1c84f
to
c5fc261
Compare
a16fa5c
to
ef52942
Compare
@@ -0,0 +1,3 @@ | |||
header1,header2,header3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is very small data test... maybe let's inline this into the java file?
@@ -0,0 +1,3 @@ | |||
header1,header2,header3 | |||
value1,value2,value3 | |||
footer1,footer2,footer3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
footer and header does not need to have same number of columns. It is a bit misleading because it looks like data.
#### HEADER #######
value1,value2,value3
#### FOOTER #######
?
HiveSplitSource hiveSplitSource = hiveSplitSource(backgroundHiveSplitLoader); | ||
backgroundHiveSplitLoader.start(hiveSplitSource); | ||
|
||
assertEquals(drainSplits(hiveSplitSource).size(), 33); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
Maybe we can now port tests that were using ./src/main/resources/sql-tests/datasets/table_with_header_and_footer.data-generator?
@@ -129,8 +129,8 @@ public HiveTableProperties(TypeManager typeManager, HiveConfig config) | |||
false), | |||
integerProperty(BUCKET_COUNT_PROPERTY, "Number of buckets", 0, false), | |||
stringProperty(AVRO_SCHEMA_URL, "URI pointing to Avro schema for the table", null, false), | |||
integerProperty(TEXTFILE_SKIP_HEADER_LINE_COUNT, "Number of header lines", null, false), | |||
integerProperty(TEXTFILE_SKIP_FOOTER_LINE_COUNT, "Number of footer lines", null, false), | |||
integerProperty(SKIP_HEADER_LINE_COUNT, "Number of header lines", null, false), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ef52942
to
6babaf2
Compare
@Inject | ||
private HdfsClient hdfsClient; | ||
|
||
@BeforeTestWithContext |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This executes before each test.
The best would be to use test requirements here. So you could define HiveTableDefinition
that would automatically handle data upload. See io.prestosql.tests.hive.TestHivePartitionsTable#getRequirements
and usage of io.prestosql.tempto.fulfillment.table.hive.InlineDataSource#createStringDataSource
That way you don't need to clean this up, because tempto will take care of this.
Also IIRC tables that are defined in resources/sql-tests/datasets/
should be also available here via ImmutableTablesState
so you would need to setup anything here.
@Test | ||
public void testCreateCsvFileSkipHeader() | ||
{ | ||
onHive().executeQuery("DROP TABLE IF EXISTS test_create_csvfile_skip_header"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why on Hive? let's use Presto whenever possible
} | ||
|
||
@Test | ||
public void testInsertCsvSkipHeader() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you merge test inserts with create table tests. Tempto tests are a bit heavy.
"WITH ( " + | ||
" format = 'CSV', " + | ||
" external_location = 'hdfs://hadoop-master:9000/user/hive/warehouse/TestCsvFileHiveTable/csv_table_with_header', " + | ||
" csv_separator= ','," + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you test different separator like %
? So we could see that it works with different separators.
LOCATION '%LOCATION%' | ||
TBLPROPERTIES ( | ||
'skip.header.line.count' = '1', | ||
'skip.footer.line.count' = '1') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see tests that are using these tables. See:
./presto-product-tests/src/main/resources/sql-tests/testcases/csv_table_with_custom_parameters.sql
./presto-product-tests/src/main/resources/sql-tests/testcases/csv_table.sql
If you provide convention tests like the ones mentioned above then you can remove all select tests from TestCsv
and leave there only tests for inserts. Or even you could remove tests for insert from there because that could be easily tested in TestHiveIntegrationSmokeTest
6babaf2
to
d202a9d
Compare
2a9412a
to
336d1cc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just few minor comments otherwise look good.
@@ -164,6 +172,24 @@ private void testCreatePartitionedCsvTableAs(String tableName, String additional | |||
query("DROP TABLE " + tableName); | |||
} | |||
|
|||
@Test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Test(groups = {STORAGE_FORMATS})
@Test | ||
public void testCreateCsvFileSkipHeader() | ||
{ | ||
assertThat(query("SELECT * FROM csv_with_header")).containsOnly(row("10", "value2")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be easily refactored to tempto convention tests.
Have you seen:./src/main/resources/sql-tests/testcases/tables_with_header_and_footer.sql?
This way it is able to change tests and then execute them without recompiling.
HiveSplitSource hiveSplitSource = hiveSplitSource(backgroundHiveSplitLoader); | ||
backgroundHiveSplitLoader.start(hiveSplitSource); | ||
|
||
assertEquals(drainSplits(hiveSplitSource).size(), 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please extract method like:
assertSplitsCount(CSV, ImmutableMap.of("skip.header.line.count", "1", "skip.footer.line.count", "1"), 1 /** splits count */);
I see that all the tests you added looked very similar.
@@ -49,6 +49,7 @@ | |||
import java.util.ArrayList; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would merge this commit with the previous one.
336d1cc
to
91603e2
Compare
@@ -0,0 +1,13 @@ | |||
-- database: presto; tables: csv_with_header, csv_with_footer, csv_with_header_and_footer; groups: hive, hive_file_header; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use groups: storage_formats
172583f
to
ead33aa
Compare
ead33aa
to
16c6784
Compare
Fixes #1085 . Here we replace
text_skip_header_line_count
andtext_skip_footer_line_count
toskip_header_line_count
andskip_footer_line_count
which would be common for both csv and text file.