[Swift] Spooled client protocol extension #22662
Comments
Really excited about this enhancement.
Yes. It will be possible to retrieve the same segment a couple of times.
Thanks for the update. A few more thoughts.
Below are some ideas that I thought could benefit client/coordinator communication.
Thoughts?
Would be interested in looking at a client implementation that utilises this, especially for distributed ML frameworks that need to load a lot of data into distributed workers. Of particular interest would be a Spark connector and a client for Ray.
Other ideas
@nickalexander053 this is one of the use-cases we have in mind :)
Thanks @wendigo for the details and I really appreciate the quick response. This makes sense.
@sajjoseph the spooling manager is a pluggable SPI. One of the implementations is based on native filesystem APIs which allow for spooling to s3/gcs/abfs out of the box. Please file an issue for the second item. We will discuss it with the trino-gateway group. |
@sajjoseph "Encoding and spooling happens on worker nodes" means that worker nodes are spooling to the external storage :) |
Thanks @wendigo for the updates. I am super excited about the possibilities and how this feature can improve customer experience. Thank you! About the cluster identifier feature, I will file an issue later. |
Good stuff @wendigo! A couple of questions:
And a comment:
@martint there is an explicit HTTP DELETE that clients can do in order to remove a segment if it's no longer needed. Besides that, we want to implement a configurable TTL for segments that will render them unusable once they expire (the client won't be able to retrieve them anymore), and a GC process that will clean up expired segments from the storage. The data is spooled only if nextUri is called by the client; the difference is that the coordinator is not getting an actual page of data but the URI to the data. If the client has abandoned calling nextUri, spooling won't progress. As for the latency part, the spooled extension allows both inline and spooled segments, so it's up to the engine whether the data should be inlined or spooled. I was thinking that the client implementation could actually hint the engine whether it prefers the throughput of the data transfer over initial latency.
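The TTL behaviour described in this comment can be sketched as a simple check (a hypothetical helper, not Trino's actual implementation): a segment past its TTL is no longer retrievable and becomes eligible for garbage collection.

```python
import time

def is_expired(created_at: float, ttl_seconds: float, now: float = None) -> bool:
    """Return True when a spooled segment is past its TTL.

    An expired segment can no longer be retrieved by the client and
    becomes eligible for GC cleanup from the spooling storage.
    """
    now = time.time() if now is None else now
    return now - created_at > ttl_seconds
```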
@martint here's how we plugged it into the execution on the high level: |
In the case when the SpoolingOperator is spooling, how does pushback happen? Today, it's based on task output buffers filling up. |
@martint right now the spooling operator holds only a single spooled segment that wasn't consumed by the coordinator (and thus the client). It won't request/accept more pages until the client retrieves the segment, so it's similar to being blocked with a full buffer, but the buffer is represented as the identifier of the spooled segment rather than actual data in memory.
How does it know it hasn't been consumed by the coordinator? The coordinator interacts with the Task Output Buffer. The Driver will attempt to fetch from the tip of the operator pipeline as long as the buffer is not full. Does this require that the output buffer understand the internals of these special pages produced by the spooling operator? |
@martint I see what you mean now. I think that we would need to improve that area.
There is no pushback for spooling. We assume that storage is effectively infinite, or large enough to store the entire result set. This is true for the expected use case of object storage for spooling. Requiring that clients consume the results in a specific order makes this more difficult to use and can result in deadlocks. This doesn't seem useful given that storage is typically unlimited. Consider using this with a system like MapReduce or Spark where the URIs are scheduled independently as worker tasks. There are no guarantees as to when those tasks will run, and these systems often need all the URIs up front. We could implement a best-effort limit on the total size (IIRC we have something similar for scan size) if needed. You can think of this as similar to CTAS, which writes the full result up front and doesn't have any limit on size.
@wendigo is there any PoC branch with the new approach, or is it still in the design phase?
@sopel39 not yet, why are you asking? :) |
This change introduces spooled protocol extension as described in the #22662. Co-authored-by: Jan Waś <[email protected]>
The initial work is done. We will now follow up with better spooling and some experiments around formats.
@wendigo - is there ever a case when the coordinator needs to do more than just pulling records from workers and returning them to the client? For example, a query plan that requires the coordinator to do some kind of post-aggregation or filtering? Or is that work already done on the worker nodes as part of the DAG execution? Curious to know whether the spooled client protocol works for all query shapes.
@samarthjain final aggregation/sorting happens on workers, so no. This means that spooling also happens on workers; the question is whether this will be a single worker or many.
@nickalexander053 do you have any pointers how to implement support for it in Ray? :) |
Spooled Trino protocol (protocol v1+)
Important
This document is a design proposal and can change in the future. This is part of the ongoing effort to improve performance of the existing client protocol (#22271)
Background
The existing Trino client protocol is over 10 years old and served a different purpose at its inception than the use cases we want to support today. These new requirements include:
To address the aforementioned requirements, we are going to introduce an extension to the existing protocol, called the spooled protocol, which extends its semantics in a backward-compatible fashion.
Existing client protocol
The current protocol consists of two endpoints that are used to submit queries and retrieve partial results:
Both endpoints share the same QueryResults definition:
Note
Some of the query results fields were omitted for brevity.
The important fields related to result set retrieval are:

- `data` - contains encoded JSON data as a list of row values. Individual values are encoded as a result of the `Type.getObjectValue` call (important: these are low-level, raw representations that are interpreted on the client side, see `AbstractTrinoResultSet`),
- `columns` - contains a list of column and type descriptors. It's worth noting that the `data` field can't have a non-null value without the `column` information present,
- `nextUri` - used to retrieve the next partial result set.

Spooled protocol extension
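As a hedged illustration of the fields above (all values are invented, not taken from a real server), a classic `QueryResults` response carrying inline JSON data might look like:

```json
{
  "id": "20240101_000000_00001_abcde",
  "nextUri": "https://coordinator:8443/v1/statement/executing/20240101_000000_00001_abcde/1",
  "columns": [
    {"name": "id", "type": "bigint"},
    {"name": "name", "type": "varchar"}
  ],
  "data": [[1, "a"], [2, "b"]]
}
```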
To express partial result sets with a different encoding, the spooled protocol extension is introduced, which contains backward- and forward-compatible changes to the existing protocol. These changes are:
Trino-Query-Data-Encoding
The header is used by the client libraries to opt into the new spooled protocol extension and contains the list of encodings in order of preference (comma-separated). If any of the encodings is supported by the server, the client can expect `QueryResults.data` to be returned in the new format. If none is supported, the server falls back to the existing behaviour for compatibility.

Note

Once the encoding is negotiated between the client and the server, it stays the same for the duration of the query, which means that the client can expect results in a single format: either the existing one or the extended one.
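A minimal sketch of this negotiation (a hypothetical helper, not the actual server code): the server picks the first client-preferred encoding it supports, otherwise it falls back to the plain v1 protocol.

```python
def negotiate_encoding(header_value, server_supported):
    """Pick the first client-preferred encoding the server supports.

    header_value: contents of the Trino-Query-Data-Encoding header,
    a comma-separated list in client preference order.
    Returns the chosen encoding, or None to fall back to plain v1.
    """
    if not header_value:
        return None
    for encoding in (e.strip() for e in header_value.split(",")):
        if encoding in server_supported:
            return encoding
    return None
```

For example, a client sending `Trino-Query-Data-Encoding: json+zstd,json` to a server that only supports `json` would get `json` results for the whole query.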
EncodedQueryData
`EncodedQueryData` is the extension of the `QueryResults.data` field. The meaning of its fields is the following:

- `encoding` - the string identifier of the encoding format as negotiated between the client and the server, chosen from the client-requested encodings sent in the `Trino-Query-Data-Encoding` header. These are known and shared by both the client and the server and are part of the spooled protocol extension definition,
- `metadata` - a `Map<String, Object>` that represents metadata of the partial result set, e.g. a decryption key used to decrypt encoded data,
- `segments` - a list of data segments representing the partial result set. A single `QueryResults` object can contain an arbitrary number of segments, which are always ordered.

DataSegment
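A hedged sketch of the on-the-wire shape of `EncodedQueryData` (field names are from this proposal; the concrete values, URIs and header names are invented placeholders):

```json
{
  "encoding": "json+zstd",
  "metadata": {"decryptionKey": "base64-encoded-key"},
  "segments": [
    {
      "type": "inline",
      "data": "W1sxLCAiYSJdXQ==",
      "metadata": {"rowOffset": 0, "rowsCount": 1, "segmentSize": 10}
    },
    {
      "type": "spooled",
      "uri": "https://storage.example.com/segments/2",
      "ackUri": "https://coordinator:8443/v1/spooled/ack/2",
      "headers": {"x-example-signature": ["opaque-signature-value"]},
      "metadata": {"rowOffset": 1, "rowsCount": 100, "segmentSize": 65536}
    }
  ]
}
```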
`DataSegment` is a representation of the encoded data and has two distinct types, `inline` and `spooled`, with the following semantics:

- an `inline` segment holds encoded, partial result set data in the base64-encoded `data` field,
- a `spooled` segment points to the encoded partial result set spooled in the configured storage location. The location designated by the `URI uri` field value, along with the HTTP headers stored in the `Map<String, List<String>> headers` field, is used to retrieve the spooled `byte[]` data from the spooling storage. `uri` and `headers` are opaque and contain all of the required authentication information, which means that a client implementation can retrieve spooled segment data with an ordinary HTTP `GET` call without any processing of the URI or the headers. It's worth noting that the URI can point to an arbitrary location, including endpoints exposed on the coordinator or the storage used for spooling (e.g. presigned URIs on S3). This depends on the actual implementation of the spooling manager and the server configuration. `ackUri` is used along with the HTTP `GET` method to explicitly acknowledge that a segment was retrieved, which allows the coordinator to eagerly remove it from the storage rather than waiting for the TTL to pass.

Note
Data segments are returned in the order specified by the query semantics. Data is spooled when the client library requests the next batch of result set data by following `nextUri`. It's up to the client library to decide whether all segments will be spooled at once, buffered and then read.

Important
In order to support the spooled protocol, client implementations need to support both inline and spooled representations, as the server can use these interchangeably.
Caution
Client implementations must not send any additional information when retrieving a spooled data segment, particularly the authentication headers used to authenticate to Trino.
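A sketch of client-side segment handling under these rules (hypothetical helper names; `http_get` stands in for an ordinary HTTP GET): inline segments are base64-decoded locally, while spooled segments are fetched using only the opaque `uri` and `headers` from the segment itself.

```python
import base64

def decode_segment(segment, http_get):
    """Return the raw encoded bytes of one data segment.

    segment: dict with either an inline "data" field or spooled
    "uri"/"headers" fields, as described by the protocol extension.
    http_get: callable (uri, headers) -> bytes performing a plain GET.
    """
    if segment["type"] == "inline":
        # Inline segments carry base64-encoded result data directly.
        return base64.b64decode(segment["data"])
    # Spooled segments: uri and headers are opaque and already contain
    # all required authentication. Do NOT attach the client's own Trino
    # authentication headers to this request.
    return http_get(segment["uri"], segment.get("headers", {}))
```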
`DataSegment` contains a `metadata` field of type `Map<String, Object>` with attributes describing a data segment. The following metadata attributes are always present:

- `rowOffset` - offset of the data segment in relation to the whole result set (`long`),
- `rowsCount` - number of rows in the data segment (`long`),
- `segmentSize` - size of the encoded data segment (`long`).

Optional metadata attributes can be a part of the encoding definition shared between the client and server implementations (to be discussed).
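Since `rowOffset` and `rowsCount` are always present, a client can verify that the segments it received form an ordered, gap-free result set. A hypothetical validation sketch:

```python
def check_contiguous(segments):
    """Verify segments are ordered and contiguous using the required
    metadata attributes rowOffset and rowsCount."""
    expected_offset = 0
    for segment in segments:
        meta = segment["metadata"]
        if meta["rowOffset"] != expected_offset:
            return False
        expected_offset += meta["rowsCount"]
    return True
```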
Implementation considerations
Compatibility
The implementation of this design must take backward and forward compatibility into consideration, which means two things:
As this is an extension of the protocol, it's an opt-in feature that requires a new client and a new server with additional configuration (spooling storage).
Encodings
Encoding describes the serialization format (like JSON) and other information required to both write (encode) and read (decode) a result set as data segments. Examples of encodings are:

- `json+zstd`, which reads as JSON serialization with ZSTD compression,
- `parquet+snappy`, which reads as Parquet encoding with Snappy compression.

The definition and meaning of each encoding is a contract between the client and the server and will be specified separately in the future.
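Under the naming scheme above, an encoding identifier composes a serialization format with an optional compression codec. A hypothetical parsing sketch (not part of the proposal itself):

```python
def parse_encoding(encoding_id):
    """Split an encoding identifier such as "json+zstd" into its
    serialization format and optional compression codec."""
    fmt, _, compression = encoding_id.partition("+")
    return fmt, compression or None
```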
Spooling
Spooling by definition means buffering data in a separate location by the server and retrieving it later by the client. In order to support the spooled protocol extension, servers (coordinator and workers) are required to be configured to use spooling. In the initial implementation we plan to add a plugin SPI for the `SpoolingManager` and an implementation based on the native file-system API. The `SpoolingManager` is responsible for both storing and retrieving `spooled` data segments from the external storage.

Segments
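To make the `SpoolingManager` responsibilities concrete, here is a toy filesystem-backed stand-in (a sketch only; the real SPI is Java and its method names may differ): it stores encoded segments, retrieves them by identifier, and removes them eagerly on acknowledgement.

```python
import os
import tempfile
import uuid

class FileSystemSpooler:
    """Toy stand-in for a SpoolingManager implementation: spools
    encoded segments to a local directory and removes them on ack."""

    def __init__(self, directory=None):
        self.directory = directory or tempfile.mkdtemp(prefix="spool-")

    def spool(self, data: bytes) -> str:
        """Store one encoded segment; return its opaque identifier."""
        segment_id = uuid.uuid4().hex
        with open(os.path.join(self.directory, segment_id), "wb") as f:
            f.write(data)
        return segment_id

    def unspool(self, segment_id: str) -> bytes:
        """Retrieve a previously spooled segment."""
        with open(os.path.join(self.directory, segment_id), "rb") as f:
            return f.read()

    def ack(self, segment_id: str) -> None:
        # Explicit acknowledgement lets the server remove the segment
        # eagerly instead of waiting for a TTL-based cleanup.
        os.remove(os.path.join(self.directory, segment_id))
```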
It's up to the server implementation whether the partial result set will be `inlined`, `spooled`, or both. It's required that a client requesting the spooled protocol extension supports both types of data segments.

Pluggability
An encoding is a shared definition between the client and the server, therefore it can't be a Plugin SPI interface. In the initial implementation, we do not plan to provide an external SPI for adding new encodings; these will be shipped as part of the implementation of the client and server libraries. The spooling process on the server side is pluggable with the new `SpoolingManager` SPI.

Backward compatibility
Above all, the new spooled protocol can't break the existing one, as it's an opt-in client/server feature. At any given moment, the client or the server can fall back to the original, unmodified v1 protocol to maintain backward compatibility.
Performance
For small result sets, spooling to storage adds overhead that can be avoided by inlining the data in the response. For larger ones, spooling allows faster, parallel data retrieval: the result set can be fully spooled so that the query finishes as fast as possible, and the results are then retrieved, decoded and processed by the client after the query has already completed.
We plan to implement spooling on the worker side, which means that when the result set is partially or fully spooled, the coordinator node is not involved in encoding the data to the requested format and spooling it to storage. According to our benchmarks, JSON encoding in the existing format accounts for a significant amount of CPU time during output processing. Initially we plan to support the existing JSON format, but the encoding will be moved to the worker nodes, which will reduce the load on the coordinator.
Security
The protocol supports server-side encryption of data. We've settled on server-side encryption, rather than client-side, for the following reasons:
Proposed API/SPI
Server-side
Used to encode the data in the specified output format.
Used to manage, spool and unspool data segments.
Client-side
Used to decode the data segment.