The instructions here describe how to run Presto (Trino) with data stored in GCS. The setup has been adapted from this repository: Hive connector over MinIO file storage. Tables can be created in two flavors:
- Hive
- Iceberg
The setup uses the following components:
- Trino (Presto)
- Hive GCS connector
- Iceberg connector
- Hive Metastore with MariaDB persistence
Setup
- Download the GCS connector and place the jar file in the lib directory.
- Obtain a GCP service-account credentials JSON, with the service account having admin permissions on GCS. Place this file in the creds directory.
- Set the GCP project-id against fs.gs.project.id in metastore-site.xml (a sketch of the relevant entries follows the startup output below).
- Run docker-compose up -d. It should start instances of the database, the metastore, and Trino:
Creating network "trino-hive_trino-network" with driver "bridge"
Creating trino-hive_mariadb_1 ... done
Creating trino-hive_trino-coordinator_1 ... done
Creating trino-hive_hive-metastore_1 ... done
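For reference, the metastore-site.xml entries mentioned in the setup steps look roughly like the following sketch. The property names come from the Hadoop GCS connector and can differ between connector versions; the project id and key-file path are placeholders:

<property>
  <name>fs.gs.project.id</name>
  <value>my-gcp-project-id</value>
</property>
<property>
  <name>google.cloud.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>google.cloud.auth.service.account.json.keyfile</name>
  <value>/opt/creds/gcs-service-account.json</value>
</property>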
Connect to the Trino CLI
docker container exec -it trino-hive_trino-coordinator_1 trino
# The following prompt should appear
trino>
View Catalogs
trino> show catalogs;
Catalog
---------
hive
iceberg
system
tpcds
tpch
(5 rows)
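For context, the hive and iceberg catalogs listed above are defined by catalog properties files on the coordinator. A hive catalog pointing at this metastore and GCS would look roughly like the sketch below; the metastore host, key-file path, and exact property names are assumptions that depend on the docker-compose setup and the Trino version:

connector.name=hive
hive.metastore.uri=thrift://hive-metastore:9083
hive.gcs.json-key-file-path=/etc/trino/creds/gcs-service-account.json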
Create a Schema in the Hive Catalog
trino> CREATE SCHEMA hive.hive_gcs WITH (location = 'gs://bucket-test-tj-1/');
CREATE SCHEMA
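To confirm the schema was created, you can list the schemas in the catalog:

trino> SHOW SCHEMAS FROM hive;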
Create some data in GCS. Ensure that partitions like partition_name=foo are present in the directory structure and that each file has 5 columns with headers col_1, col_2, col_3, col_4, col_5. Example:
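A layout along these lines would work; the bucket and path match the table definition below, while the partition values and file names are illustrative:

gs://bucket-test-tj-1/ping/xing=bar/fing=io/data_1.csv
gs://bucket-test-tj-1/ping/xing=foo/fing=us/data_2.csv

Each CSV file starts with a header row followed by data rows, for example:

col_1,col_2,col_3,col_4,col_5
91,92,93,94,95
81,82,83,84,85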
Create a partitioned Hive Table
USE hive.hive_gcs;
CREATE TABLE sample_table1 (
col_1 varchar, col_2 varchar, col_3 varchar, col_4 varchar, col_5 varchar,
xing varchar, fing varchar
)
WITH (
format = 'CSV',
partitioned_by = ARRAY['xing','fing'],
external_location = 'gs://bucket-test-tj-1/ping',
skip_header_line_count = 1
);
Update partitions
CALL system.sync_partition_metadata(schema_name=>'hive_gcs', table_name=>'sample_table1', mode=>'FULL');
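This call is needed because the files were written to GCS outside of Trino, so the metastore does not yet know about the partition directories. As a quick sanity check (the $partitions hidden table is provided by the Hive connector), the registered partitions can be listed with:

SELECT * FROM "sample_table1$partitions";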
Read Data
trino:hive_gcs> select * from sample_table1 where xing = 'bar' and fing = 'io';
col_1 | col_2 | col_3 | col_4 | col_5 | xing | fing
-------+-------+-------+-------+-------+------+------
91 | 92 | 93 | 94 | 95 | bar | io
81 | 82 | 83 | 84 | 85 | bar | io
Create a Schema in the Iceberg Catalog
trino> CREATE SCHEMA iceberg.iceberg_gcs WITH (location = 'gs://bucket-test-tj-1/');
CREATE SCHEMA
Create a partitioned Iceberg Table
USE iceberg.iceberg_gcs;
CREATE TABLE sample_table (
id bigint, name varchar, known varchar, country varchar, fact varchar
)
WITH (
format = 'PARQUET',
partitioning = ARRAY['country', 'fact']
);
Insert some data
INSERT INTO sample_table values
(25, 'James Bond', 'Agent', 'American', 'NYC'),
(40, 'George Bush', 'President', 'American', 'Texas'),
(82, 'A1', 'Football Player', 'Austria', 'Vienna')
;
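As a quick check after the insert (the $snapshots metadata table is provided by the Iceberg connector), the table's snapshots can be inspected with:

SELECT * FROM "sample_table$snapshots";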
A sample mechanism for writing to an Iceberg table is available in this repository
Read Data
trino:iceberg_gcs> select * from sample_table where country = 'American';
id | name | known | country | fact
----+-------------+-----------+----------+-------
40 | George Bush | President | American | Texas
25 | James Bond | Agent | American | NYC
Maven dependency
<dependency>
<groupId>io.trino</groupId>
<artifactId>trino-jdbc</artifactId>
<version>382</version>
</dependency>
Java Code
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TrinoJdbcExample {
    public static void main(String[] args) throws Exception {
        // URL format: jdbc:trino://<Trino_Coordinator>/<Catalog>/<Schema>
        try (Connection conn = DriverManager.getConnection(
                "jdbc:trino://localhost:9080/hive/hive_gcs?user=anything")) {
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT * FROM sample_table1")) {
                // Iterate over the result set and print the first two columns
                while (rs.next()) {
                    String col1 = rs.getString("col_1");
                    String col2 = rs.getString("col_2");
                    System.out.println(String.format("col_1=%s and col_2=%s", col1, col2));
                }
            }
        }
    }
}
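The same program can read the Iceberg table by pointing the URL at the iceberg catalog and schema (and querying sample_table instead), for example:

jdbc:trino://localhost:9080/iceberg/iceberg_gcs?user=anything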
References
- Hadoop GCS Connector
- Intro to Hive Connector
- Introduction to Apache Iceberg
- Iceberg Connector
- Trino GCS Configuration
- What is Presto connection?
A note on MinIO
MinIO is a multi-cloud object storage abstraction that works with any cloud. It offers S3-like semantics and is compatible with most clouds, so it can be a good object-storage alternative when a richer abstraction is needed. A sample experiment.