[META] Search streams using Apache Arrow and Flight #16679

Closed
14 tasks
rishabhmaurya opened this issue Nov 19, 2024 · 0 comments · Fixed by #16691

rishabhmaurya commented Nov 19, 2024

Please describe the end goal of this project

  • In-memory columnar representation of any intermediate results from search
    • Data adjacency for sequential access (scans)
    • O(1) (constant-time) random access
    • SIMD and vectorization-friendly
    • Relocatable without “pointer swizzling”, allowing for true zero-copy access in shared memory.
  • Interoperable representation of columnar data that can be shared across engines, for example between OpenSearch and DataFusion, a Rust-based query engine.
  • RPC using bidirectional streams: use gRPC bidirectional streams to handle backpressure from the client in real time and to produce batches of records on demand. These streams are used both for internode communication (between data nodes and the coordinator) and for communication with the end client.
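The columnar properties listed above can be illustrated without Arrow itself. The following is a minimal sketch using only Python's standard `array` module as a stand-in for Arrow vectors; the field names `doc_id` and `score` and the helper functions are hypothetical and are not part of OpenSearch or Arrow.

```python
from array import array

# Row-oriented: one dict per hit; values of a given field are scattered across objects.
rows = [
    {"doc_id": 0, "score": 1.5},
    {"doc_id": 1, "score": 0.25},
    {"doc_id": 2, "score": 3.0},
]

# Column-oriented: one contiguous, fixed-width buffer per field (hypothetical fields).
columns = {
    "doc_id": array("q", (r["doc_id"] for r in rows)),  # 64-bit signed integers
    "score": array("d", (r["score"] for r in rows)),    # 64-bit floats
}


def value_at(column, i):
    """Constant-time random access: element i sits at byte offset i * itemsize."""
    return column[i]


def column_sum(column):
    """Sequential scan over adjacent values of a single field."""
    return sum(column)


# The buffer holds plain values and no internal pointers, so its raw bytes can be
# handed to another reader and reinterpreted in place, with no deserialization step.
raw = columns["score"].tobytes()
view = memoryview(raw).cast("d")
```

Here `view` reads the same bytes as a sequence of doubles without building new per-value objects up front, which is the property that makes a pointer-free layout suitable for shared memory. Arrow adds a standardized specification on top of this idea (validity bitmaps, variable-width and nested types) so that different engines agree on the layout.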

Use cases

  • Reduce memory overhead and CPU utilization, and improve performance, for:
    • Search pagination API
    • Aggregations (more details to follow).

Apache Arrow will serve as the library for the in-memory columnar representation of any transient results used for retrieval in these use cases. Arrow Flight will be used for stream RPC.
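To make "producing batches of records on demand" concrete, here is a minimal sketch of pull-based batching in plain Python. It does not use Arrow Flight or gRPC; a generator stands in for the stream, and the names `stream_batches` and `produced` are hypothetical. The point it illustrates is that the producer does no work for a batch until the consumer asks for it, which is the essence of consumer-driven backpressure.

```python
from typing import Iterator, List

# Records the size of every batch the producer has actually built so far.
produced: List[int] = []


def stream_batches(hits: List[int], batch_size: int) -> Iterator[List[int]]:
    """Lazily yield fixed-size batches; each batch is built only when requested."""
    for start in range(0, len(hits), batch_size):
        batch = hits[start:start + batch_size]
        produced.append(len(batch))
        yield batch


# The consumer controls the pace: creating the stream builds nothing,
# and each next() call triggers exactly one batch.
stream = stream_batches(list(range(10)), batch_size=4)
first = next(stream)
```

After the single `next()` call only one batch has been materialized; the remaining two are built only if the consumer keeps pulling. A slow client therefore bounds the producer's memory use to the batches it has actually requested, in contrast to an approach that materializes the full result set before sending it.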

Supporting References

JOINs RFC making use of this integration - #15185

Issues

  • Library changes containing POJOs and Arrow vector APIs - Library changes for Apache Arrow integration #16691
  • OpenSearch basic Arrow Flight server and client implementation as a separate module.
  • General purpose FlightProducer supporting basic getStream() and getFlightInfo() APIs.
  • ProxyStreamProducer acting as a proxy stream connecting the right data node holding the stream for a given ticket to the client.
  • Integration with Tasks API for all stream producers.
  • Integration with server module via plugin interface
  • TLS integration for Flight server and client
  • Stream cancellation, error handling and renewal.
  • Metrics & Troubleshooting
  • Admission control
  • Integration tests
  • Benchmark - compare against performance of scroll API.
  • Revisit defaults for server, client and allocator configs after benchmarks.
  • Documentation

Related component

Search

@rishabhmaurya rishabhmaurya added Meta Meta issue, not directly linked to a PR untriaged labels Nov 19, 2024
@rishabhmaurya rishabhmaurya self-assigned this Nov 19, 2024
@rishabhmaurya rishabhmaurya moved this from New to In Progress in OpenSearch Roadmap Nov 19, 2024
@rishabhmaurya rishabhmaurya changed the title [META] Apache Arrow and Flight integration [META] Search streams using Apache Arrow and Flight Nov 25, 2024
@rishabhmaurya rishabhmaurya moved this from Todo to In Progress in Performance Roadmap Nov 25, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Performance Roadmap Nov 29, 2024