From ab71673ce0955798645ae9178018f562a82ed7f2 Mon Sep 17 00:00:00 2001
From: Will Jones
Date: Tue, 20 Sep 2022 01:48:40 -0700
Subject: [PATCH] ARROW-13454: [C++][Docs] Tables vs Record Batches (#14008)

Adds a little more explanation of the difference between tables and record
batches, as well as a diagram representation.

Authored-by: Will Jones
Signed-off-by: Antoine Pitrou
---
 .../cpp/tables-versus-record-batches.svg | 102 ++++++++++++++++++
 docs/source/cpp/tables.rst               |  12 +++
 docs/source/format/Glossary.rst          |   8 +-
 3 files changed, 120 insertions(+), 2 deletions(-)
 create mode 100644 docs/source/cpp/tables-versus-record-batches.svg

diff --git a/docs/source/cpp/tables-versus-record-batches.svg b/docs/source/cpp/tables-versus-record-batches.svg
new file mode 100644
index 0000000000000..d793b1de2bf7e
--- /dev/null
+++ b/docs/source/cpp/tables-versus-record-batches.svg
@@ -0,0 +1,102 @@
+Arrow Table versus Record Batch
+Arrow Table
+Schema
+Field
+Chunked
+Array
+Array
+A Table is a C++ data structure,
+allowing for a mixed chunking structure and very large arrays.
+Arrow Record Batch
+Schema
+Field
+Array
+A Record Batch is a common Arrow data structure which is recognized by all implementations.
\ No newline at end of file
diff --git a/docs/source/cpp/tables.rst b/docs/source/cpp/tables.rst
index ea9198771cfac..b28a9fc1e13a5 100644
--- a/docs/source/cpp/tables.rst
+++ b/docs/source/cpp/tables.rst
@@ -77,6 +77,18 @@ has a schema which must match its arrays' datatypes.
 Record batches are a convenient unit of work for various serialization
 and computation functions, possibly incremental.
 
+.. image:: tables-versus-record-batches.svg
+   :alt: A graphical representation of an Arrow Table and a Record Batch, with
+         structure as described in text above.
+
+Record batches can be sent between implementations, such as via
+:ref:`IPC ` or
+via the :doc:`C Data Interface <../format/CDataInterface>`. Tables and
+chunked arrays, on the other hand, are concepts in the C++ implementation,
+not in the Arrow format itself, so they aren't directly portable.
+
+However, a table can be converted to and built from a sequence of record
+batches easily without needing to copy the underlying array buffers.
 A table can be streamed as an arbitrary number of record batches using
 a :class:`arrow::TableBatchReader`. Conversely, a logical sequence of
 record batches can be assembled to form a table using one of the
diff --git a/docs/source/format/Glossary.rst b/docs/source/format/Glossary.rst
index 423ebf85783f6..5944d7c18cffe 100644
--- a/docs/source/format/Glossary.rst
+++ b/docs/source/format/Glossary.rst
@@ -196,7 +196,11 @@ Glossary
        different buffers for different indices.
 
        Not part of the columnar format; this term is specific to
-       certain language implementations of Arrow (primarily C++ and
-       its bindings).
+       certain language implementations of Arrow (for example C++ and
+       its bindings, and Go).
+
+       .. image:: ../cpp/tables-versus-record-batches.svg
+          :alt: A graphical representation of an Arrow Table and a
+                Record Batch, with structure as described in text above.
 
        .. seealso:: :term:`chunked array`, :term:`record batch`