More https://trino.io/resources.html #3472

tooptoop4 · 2020-04-18T13:48:27Z

https://prestosql.io/resources.html is very thin right now.
Proposing an 'unverified' section:

Unmerged Connectors (data sources):

Adding the lists below for SEO purposes...

Merged connectors (data sources):

Microsoft SQL Server
Oracle
Hive
PostgreSQL
MySQL
Amazon Redshift
MongoDB
Cassandra
BigQuery
Elasticsearch
Apache Kafka
ClickHouse
Druid
JMX
Kinesis
Phoenix
Google Sheets
Kudu
Redis
Thrift
Prometheus
LinkedIn Pinot
Black Hole
Accumulo
Local File
Memory
MemSQL
System
TPCDS
TPCH
See https://prestosql.io/docs/current/connector.html for an up-to-date list.

Supported file types in Hive Connector:
ORC
Parquet
Uber Hudi/hoodie (already in Hive connector)
Netflix Iceberg (already in Hive connector)
Avro
JSON (using org.apache.hive.hcatalog.data.JsonSerDe)
CSV (using org.apache.hadoop.hive.serde2.OpenCSVSerde)
TextFile
RCText (RCFile using ColumnarSerDe)
RCBinary (RCFile using LazyBinaryColumnarSerDe)
SequenceFile

Supported filesystem storage for Hive Connector:
AWS S3 ("s3", "s3a")
Google Cloud Storage aka GCS ("gs")
Hadoop File System ("hdfs")
Windows Azure Storage Blob (WASB) ("wasb", "wasbs")
Azure Data Lake Storage (ADLS) ("adl")
Azure ADLS Gen2 - Azure Blob File System ("abfs", "abfss")
Aliyun Object Storage Service (OSS) - https://hadoop.apache.org/docs/current/hadoop-aliyun/tools/hadoop-aliyun/index.html
Tencent Cloud Object Storage (COSN) - #4978
IBM Cloud Object Storage (COS) - CODAIT/stocator#218 (comment) / https://docs.starburstdata.com/latest/connector/starburst-hive-ibm-cos.html
??? Oracle Cloud Infrastructure Object Storage (OCI) - oracle/oci-hdfs-connector#51
MinIO
Ceph
Dell EMC
Cloudian?
OpenIO?
SwiftStack?

Caching:
Data: Alluxio or Qubole Rubix
Directory/file listing
Metastore partitions

Performance tips:
Partition on low-cardinality VARCHAR columns that appear in the WHERE clause of SELECT queries
Run ANALYZE to gather statistics: https://prestosql.io/docs/current/optimizer/statistics.html
Use compressed ORC/Parquet files between 32MB and 1GB in size, with the columns inside the files sorted so that min/max ranges don't overlap; this reduces network IO and improves predicate pushdown (https://www.slideshare.net/databricks/the-parquet-format-and-performance-optimization-opportunities)
There are no primary keys, foreign keys, or indexes
https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
High CPU is important for parallelism/concurrency
High memory is important for joins/GROUP BY/DISTINCT
task.max-worker-threads
task.concurrency
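
To illustrate the tip about sorted columns and non-overlapping min/max ranges, here is a minimal Python sketch. It is not how Presto implements pruning; the FileStats class, the file names, and the value ranges are all hypothetical. It only shows why a range filter can skip files when each file covers a distinct range, and cannot when every file spans nearly the full range.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical per-file statistics for one column."""
    name: str
    min_value: int
    max_value: int


def files_to_read(files, lo, hi):
    """Keep only files whose [min, max] range can contain values in [lo, hi]."""
    return [f.name for f in files if f.max_value >= lo and f.min_value <= hi]


# Data sorted before writing: each file covers a distinct range.
sorted_layout = [
    FileStats("part-0", 1, 100),
    FileStats("part-1", 101, 200),
    FileStats("part-2", 201, 300),
]

# Same data written unsorted: every file spans almost the whole range.
unsorted_layout = [
    FileStats("part-0", 1, 290),
    FileStats("part-1", 5, 300),
    FileStats("part-2", 2, 295),
]

# Filter equivalent to: WHERE col BETWEEN 120 AND 150
sorted_hits = files_to_read(sorted_layout, 120, 150)
unsorted_hits = files_to_read(unsorted_layout, 120, 150)
```

With the sorted layout only one file has to be opened; with the unsorted layout all three do, even though the matching rows are the same.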

Security:
LDAP/Kerberos/JWT/Cert
TLS
Apache Ranger for fine-grained access control: column/table/schema/catalog-level authorisation, column masking, and row-level filtering

Get the SQL of all views:
from_utf8(from_base64(substr(view_original_text,17,length(view_original_text) - 19))) view_sql
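
The offsets in that expression (start at character 17, drop 19 characters in total) suggest the stored text wraps a base64 payload in a 16-character prefix and a 3-character suffix. The sketch below mirrors the same decoding in Python, assuming the wrapper is "/* Presto View: " and " */"; that wrapper, the function name, and the sample payload are assumptions for illustration, not taken from the issue.

```python
import base64

# Assumed wrapper around the base64 payload in view_original_text.
PREFIX = "/* Presto View: "  # 16 characters, hence substr(..., 17, ...)
SUFFIX = " */"               # 3 characters, hence length - 19 in total


def decode_view_text(view_original_text: str) -> str:
    """Mirror from_utf8(from_base64(substr(text, 17, length(text) - 19)))."""
    if not (view_original_text.startswith(PREFIX)
            and view_original_text.endswith(SUFFIX)):
        raise ValueError("text is not in the assumed Presto view format")
    payload = view_original_text[len(PREFIX):-len(SUFFIX)]
    return base64.b64decode(payload).decode("utf-8")


# Round trip with a made-up payload (the real payload layout may differ).
sample_payload = '{"originalSql": "SELECT 1"}'
stored = (PREFIX
          + base64.b64encode(sample_payload.encode("utf-8")).decode("ascii")
          + SUFFIX)
decoded = decode_view_text(stored)
```

Rows that do not match the assumed wrapper raise an error instead of being decoded into garbage.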

A very useful feature is joining data across any of the connectors.
Disclaimer: while Hive sources are suitable for very large data (i.e. petabytes), JDBC sources are only performant on small tables. For example:
you have an 8 billion row txns table that is indexed/partitioned in Oracle.
The query below takes 38 seconds when run in Oracle directly but 70 minutes when run in Presto:

select custid, count(1) numtxns from oracle.txns
where txnmonth in (to_date('202002','YYYYMM'),to_date('201902','YYYYMM'))
group by custid
having count(1) > 4

Missing in all JDBC connectors:
Aggregate pushdown #6613
Join pushdown #6620
Complex filter pushdown #7994 / #402
ORDER BY pushdown #8093
DISTINCT pushdown #4324
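
A rough way to see why missing aggregate pushdown hurts: without it, every row that passes the filter has to leave the source database before it can be grouped. The sketch below uses an in-memory SQLite table as a stand-in for the remote JDBC source. The table contents are made up, txnmonth is stored as text rather than a date, and it assumes the simple IN filter is still applied at the source.

```python
import sqlite3
from collections import Counter

# In-memory SQLite stands in for the remote JDBC source (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txns (custid INTEGER, txnmonth TEXT)")
rows = [
    (custid, month)
    for custid in range(100)
    for month in ("201902", "202002")
    for _ in range(custid % 7)  # 0 to 6 transactions per customer per month
]
conn.executemany("INSERT INTO txns VALUES (?, ?)", rows)

# Without aggregate pushdown: only the filter runs at the source, every
# matching row is transferred, and grouping happens on the engine side.
transferred = conn.execute(
    "SELECT custid FROM txns WHERE txnmonth IN ('201902', '202002')"
).fetchall()
counts = Counter(custid for (custid,) in transferred)
local_result = sorted((c, n) for c, n in counts.items() if n > 4)

# With aggregate pushdown: the source groups and returns only the result.
pushed_result = conn.execute(
    "SELECT custid, count(1) FROM txns "
    "WHERE txnmonth IN ('201902', '202002') "
    "GROUP BY custid HAVING count(1) > 4 ORDER BY custid"
).fetchall()

rows_without_pushdown = len(transferred)
rows_with_pushdown = len(pushed_result)
```

Both paths produce the same answer; the difference is how many rows cross the connection, which grows with the size of the table rather than with the size of the result.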

tooptoop4 closed this as completed Apr 18, 2020

tooptoop4 mentioned this issue May 28, 2021

Add Apache Ignite connector #8098

Closed

ebyhr changed the title ~~More https://prestosql.io/resources.html~~ More https://trino.io/resources.html Aug 15, 2021
