Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set header and footer skip line count for CSV tables #1090

Merged
merged 2 commits into from
Oct 9, 2019

Conversation

Praveen2112
Copy link
Member

Fixes #1085 . Here we replace text_skip_header_line_count and text_skip_footer_line_count to skip_header_line_count and skip_footer_line_count which would be common for both csv and text file.

@cla-bot cla-bot bot added the cla-signed label Jul 6, 2019
@Praveen2112 Praveen2112 force-pushed the header_footer_csv branch 5 times, most recently from c18e425 to ba1c84f Compare July 7, 2019 14:22
Copy link
Member

@kokosing kokosing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also add a test where data for csv file spans across two splits to see if data read by Presto is correct.

// Textfile specific property
String textSkipHeaderCount = table.get().getParameters().get(TEXT_SKIP_HEADER_COUNT_KEY);
// Textfile and CSV specific property
String textSkipHeaderCount = table.get().getParameters().get(SKIP_HEADER_COUNT_KEY);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename variable to skipHeaderCount

}
String textSkipFooterCount = table.get().getParameters().get(TEXT_SKIP_FOOTER_COUNT_KEY);
String textSkipFooterCount = table.get().getParameters().get(SKIP_FOOTER_COUNT_KEY);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

if (headerSkipCount > 0) {
checkFormatForProperty(hiveStorageFormat, HiveStorageFormat.TEXTFILE, TEXTFILE_SKIP_HEADER_LINE_COUNT);
tableProperties.put(TEXT_SKIP_HEADER_COUNT_KEY, String.valueOf(headerSkipCount));
checkFormatForProperty(hiveStorageFormat, expectedFormats, SKIP_HEADER_LINE_COUNT);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe:

checkFormatForProperty(hiveStorageFormat, HiveStorageFormat.TEXTFILE, TEXTFILE_SKIP_HEADER_LINE_COUNT);
checkFormatForProperty(hiveStorageFormat, HiveStorageFormat.CSV, TEXTFILE_SKIP_HEADER_LINE_COUNT);
```?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But a csv format would fail in first statement and similarly for text format.

// Textfile specific properties
getTextHeaderSkipCount(tableMetadata.getProperties()).ifPresent(headerSkipCount -> {
// Textfile and CSV specific properties
ImmutableSet<HiveStorageFormat> expectedFormats = ImmutableSet.of(HiveStorageFormat.TEXTFILE, HiveStorageFormat.CSV);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expectedFormats -> csvAndTextFile

if (tableStorageFormat == HiveStorageFormat.TEXTFILE) {
if (table.getParameters().containsKey(TEXT_SKIP_HEADER_COUNT_KEY)) {
throw new PrestoException(NOT_SUPPORTED, format("Inserting into Hive table with %s property not supported", TEXT_SKIP_HEADER_COUNT_KEY));
if (tableStorageFormat == HiveStorageFormat.TEXTFILE || tableStorageFormat == HiveStorageFormat.CSV) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this wrapping if block is needed here.

@@ -0,0 +1,3 @@
header1,header2,header3
value1,value2,value3
footer1,footer2,footer3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no new line at the end

onHive().executeQuery("DROP TABLE test_csvfile_skip_header");

onHive().executeQuery("DROP TABLE IF EXISTS test_csvfile_skip_footer");
onHive().executeQuery("" +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you divide this test into two tests

@@ -164,6 +192,87 @@ private void testCreatePartitionedCsvTableAs(String tableName, String additional
query("DROP TABLE " + tableName);
}

@Test
public void testCreateTextFileSkipHeaderFooter()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please migrate this test to convention tests. See:

cox@Grzegorzs-MacBook-Pro:~/presto/presto-product-tests$ find src -name csv*
src/main/resources/sql-tests/datasets/csv_table.ddl
src/main/resources/sql-tests/datasets/csv_table_with_custom_parameters.ddl
src/main/resources/sql-tests/datasets/csv_table_with_custom_parameters.data
src/main/resources/sql-tests/datasets/csv_table.data
src/main/resources/sql-tests/testcases/csv_table_with_custom_parameters.sql
src/main/resources/sql-tests/testcases/csv_table.sql

" textfile_skip_footer_line_count = 1,\n" +
" textfile_skip_header_line_count = 1\n" +
" skip_footer_line_count = 1,\n" +
" skip_header_line_count = 1\n" +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please add tests for csv with skip_header_line_count and skip_footer_line_count here?

@martint
Copy link
Member

martint commented Aug 16, 2019

@Praveen2112, any updates on this?

@Praveen2112
Copy link
Member Author

@martint Yes on the final leg of testing...

@Praveen2112
Copy link
Member Author

@kokosing , @martint Have applied the comments

@Praveen2112 Praveen2112 force-pushed the header_footer_csv branch 3 times, most recently from a16fa5c to ef52942 Compare August 19, 2019 04:47
@@ -0,0 +1,3 @@
header1,header2,header3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is very small data test... maybe let's inline this into the java file?

@@ -0,0 +1,3 @@
header1,header2,header3
value1,value2,value3
footer1,footer2,footer3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

footer and header does not need to have same number of columns. It is a bit misleading because it looks like data.

#### HEADER #######
value1,value2,value3
#### FOOTER #######

?

HiveSplitSource hiveSplitSource = hiveSplitSource(backgroundHiveSplitLoader);
backgroundHiveSplitLoader.start(hiveSplitSource);

assertEquals(drainSplits(hiveSplitSource).size(), 33);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

Maybe we can now port tests that were using ./src/main/resources/sql-tests/datasets/table_with_header_and_footer.data-generator?

@@ -129,8 +129,8 @@ public HiveTableProperties(TypeManager typeManager, HiveConfig config)
false),
integerProperty(BUCKET_COUNT_PROPERTY, "Number of buckets", 0, false),
stringProperty(AVRO_SCHEMA_URL, "URI pointing to Avro schema for the table", null, false),
integerProperty(TEXTFILE_SKIP_HEADER_LINE_COUNT, "Number of header lines", null, false),
integerProperty(TEXTFILE_SKIP_FOOTER_LINE_COUNT, "Number of footer lines", null, false),
integerProperty(SKIP_HEADER_LINE_COUNT, "Number of header lines", null, false),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@electrum @findepi do you guys like his rename of textfile_skip_header_line_count to skip_footer_line_count hive table property?

@Inject
private HdfsClient hdfsClient;

@BeforeTestWithContext
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This executes before each test.

The best would be to use test requirements here. So you could define HiveTableDefinition that would automatically handle data upload. See io.prestosql.tests.hive.TestHivePartitionsTable#getRequirements and usage of io.prestosql.tempto.fulfillment.table.hive.InlineDataSource#createStringDataSource

That way you don't need to clean this up, because tempto will take care of this.

Also IIRC tables that are defined in resources/sql-tests/datasets/ should be also available here via ImmutableTablesState so you would need to setup anything here.

@Test
public void testCreateCsvFileSkipHeader()
{
onHive().executeQuery("DROP TABLE IF EXISTS test_create_csvfile_skip_header");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why on Hive? let's use Presto whenever possible

}

@Test
public void testInsertCsvSkipHeader()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you merge test inserts with create table tests. Tempto tests are a bit heavy.

"WITH ( " +
" format = 'CSV', " +
" external_location = 'hdfs://hadoop-master:9000/user/hive/warehouse/TestCsvFileHiveTable/csv_table_with_header', " +
" csv_separator= ','," +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you test different separator like %? So we could see that it works with different separators.

LOCATION '%LOCATION%'
TBLPROPERTIES (
'skip.header.line.count' = '1',
'skip.footer.line.count' = '1')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see tests that are using these tables. See:

./presto-product-tests/src/main/resources/sql-tests/testcases/csv_table_with_custom_parameters.sql
./presto-product-tests/src/main/resources/sql-tests/testcases/csv_table.sql

If you provide convention tests like the ones mentioned above then you can remove all select tests from TestCsv and leave there only tests for inserts. Or even you could remove tests for insert from there because that could be easily tested in TestHiveIntegrationSmokeTest

@Praveen2112 Praveen2112 force-pushed the header_footer_csv branch 3 times, most recently from 2a9412a to 336d1cc Compare October 8, 2019 17:03
Copy link
Member

@kokosing kokosing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just few minor comments otherwise look good.

@@ -164,6 +172,24 @@ private void testCreatePartitionedCsvTableAs(String tableName, String additional
query("DROP TABLE " + tableName);
}

@Test
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Test(groups = {STORAGE_FORMATS})

@Test
public void testCreateCsvFileSkipHeader()
{
assertThat(query("SELECT * FROM csv_with_header")).containsOnly(row("10", "value2"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be easily refactored to tempto convention tests.

Have you seen:./src/main/resources/sql-tests/testcases/tables_with_header_and_footer.sql?

This way it is able to change tests and then execute them without recompiling.

HiveSplitSource hiveSplitSource = hiveSplitSource(backgroundHiveSplitLoader);
backgroundHiveSplitLoader.start(hiveSplitSource);

assertEquals(drainSplits(hiveSplitSource).size(), 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please extract method like:

assertSplitsCount(CSV, ImmutableMap.of("skip.header.line.count", "1", "skip.footer.line.count", "1"), 1 /** splits count */);

I see that all the tests you added looked very similar.

@@ -49,6 +49,7 @@
import java.util.ArrayList;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would merge this commit with the previous one.

@@ -0,0 +1,13 @@
-- database: presto; tables: csv_with_header, csv_with_footer, csv_with_header_and_footer; groups: hive, hive_file_header;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use groups: storage_formats

@Praveen2112 Praveen2112 force-pushed the header_footer_csv branch 2 times, most recently from 172583f to ead33aa Compare October 9, 2019 08:10
@kokosing kokosing merged commit 4514c6f into trinodb:master Oct 9, 2019
@kokosing kokosing mentioned this pull request Oct 9, 2019
6 tasks
@martint martint added this to the 321 milestone Oct 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

Allow to set header and footer skip line count for CSV tables
3 participants