
DOC-REFACTOR: 1/n Refactoring the documentation #2094

Closed
wants to merge 122 commits
122 commits
e647d26
Move logic of `sort_values` into the query compiler (#1754)
devin-petersohn Jul 21, 2020
b78a628
Improve performance of slice indexing (#1753)
devin-petersohn Jul 21, 2020
55d05c8
[FIX] Fix #1683 - losing index names in pd.concat (#1684)
dchigarev Jul 22, 2020
00a4e51
TEST-#1759: Add commitlint check on pull requests (#1760)
devin-petersohn Jul 22, 2020
371e533
REFACTOR-#1741: Use low-level api for kurt function implementation wi…
amyskov Jul 22, 2020
3a1d929
[FIX] Fix of inconsistent indices #1726 #1731 #1732 #1734 (#1727)
dchigarev Jul 23, 2020
2ef9c43
FEAT-#1781: implement ClusterError.__str__ (#1782)
anmyachev Jul 23, 2020
ee223d6
FEAT-#1194 #1283 #1138: Add `Series.rolling`, `DataFrame.rolling`
prutskov Jul 23, 2020
b95ab6c
REFACTOR-#1763: Move logic of `merge` (#1764)
YarShev Jul 23, 2020
5936db1
TEST-#1785: Change size of commit message header (#1786)
YarShev Jul 23, 2020
72b6ba1
FIX-#1775: Fix support for callable in loc/iloc (#1776)
devin-petersohn Jul 24, 2020
222b4a9
FEAT-#1598: Update iterator implemetion to `iloc` (#1599)
devin-petersohn Jul 24, 2020
aad9ba2
FIX-#1556: Fix support for nested assignment with `loc`/`iloc` (#1788)
devin-petersohn Jul 24, 2020
4696eeb
FEAT-#1205: melt implementation (#1689)
dchigarev Jul 24, 2020
c33d420
FIX-#1610: Fix support for `loc` with MultiIndex parameter (#1789)
devin-petersohn Jul 24, 2020
b19265f
FIX-#1700: Fix metadata for concat and mask when `axis=1` (#1797)
devin-petersohn Jul 24, 2020
789ece3
FIX-#1774: Fix unlimited column printing for smaller dataframes (#1799)
devin-petersohn Jul 24, 2020
dd81084
FIX-#1705: Fix visual bug with repr on smaller dataframes (#1798)
devin-petersohn Jul 24, 2020
d793d12
FIX-#1467: Fix support for cummax and cummin across int and float (#1…
devin-petersohn Jul 24, 2020
20e3136
FIX-#1631: Fix support for dictionary in `pd.concat` (#1795)
devin-petersohn Jul 24, 2020
c3f27d5
TEST-#1119: Re-enable skipped Windows tests for `DataFrame.prod` (#1801)
devin-petersohn Jul 24, 2020
66821e0
Revert "TEST-#1119: Re-enable skipped Windows tests for `DataFrame.pr…
devin-petersohn Jul 26, 2020
afb3e0e
DOCS-#1809 missing links in the architecture page (#1810)
ikedaosushi Jul 26, 2020
1a8d30b
TEST-#1779: Limit object store to 1GB during CI tests (#1744)
vnlitvinov Jul 26, 2020
853b55a
FIX-#1803: Change the order of execution engine change callbacks (#1805)
vnlitvinov Jul 27, 2020
7d8531e
FIX-#1807: Fix passing zone in rayscale, add ability to override imag…
vnlitvinov Jul 27, 2020
735edbe
FIX-#1457: Series.reset_index considering 'name' fix (#1820)
dchigarev Jul 27, 2020
4418f0f
FIX-#1459: 'to_pandas' of nested objects added (#1828)
dchigarev Jul 28, 2020
d62042b
FIX-1708: Don't sort indexes in Series functions with level parameter…
anmyachev Jul 28, 2020
12bf4c1
FEAT-#1701: Enable running Modin via remote Ray on spawned cluster (#…
vnlitvinov Jul 28, 2020
b89205b
FIX-#1729: Fix result of `Series.dt.components/freq/tz` (#1730)
prutskov Jul 28, 2020
bebb062
DOCS-#1835: add runner of taxi benchmark as example (#1836)
anmyachev Jul 28, 2020
c494f04
DOCS-#1816: Add notes about using MODIN_SOCKS_PROXY variable (#1817)
anmyachev Jul 28, 2020
276daa0
FEAT-#1821: Speed up RPyC connection (#1833)
vnlitvinov Jul 28, 2020
d4f85e4
Implement numpy autowrapping in remote context (#1834)
vnlitvinov Jul 28, 2020
f5ecc5d
FIX-#1426: Groupby on categories fixed (#1802)
dchigarev Jul 29, 2020
9e8fd4b
FIX-#1464: product/sum incorrect behavior of 'min_count' fixed (#1827)
dchigarev Jul 29, 2020
cb04f3c
FIX-#1849: Fix AttributeError in ClusterError class (#1850)
anmyachev Jul 29, 2020
7966a86
FEAT-#1831: Add optional RPyC tracing and a simple trace parser (#1843)
vnlitvinov Jul 29, 2020
a777ecf
FIX-#1770: Support for groupby() with original Series in by list.
itamarst Jul 29, 2020
046f20f
FIX-#1852: Add stub engines to known ones during test-internals (#1853)
vnlitvinov Jul 29, 2020
45b1bb8
DOCS-#1855: add runner of h2o benchmark as example (#1856)
anmyachev Jul 29, 2020
df3f1d4
FIX-#1857: make 'sort_index' consider axis parameter (#1858)
dchigarev Jul 29, 2020
0e826ac
FIX-#1154: properly process UDFs (#1845)
dchigarev Jul 29, 2020
e9a0636
Bump version to 0.8.0 (#1864)
devin-petersohn Jul 29, 2020
f51d3fa
FEAT-#1861: Use cloudpickle library for experimental.cloud features (…
anmyachev Jul 30, 2020
65d9217
FIX-#911: Pin Dask Dependency for Python 3.8 compatiblity (#1846)
devin-petersohn Jul 30, 2020
7225f2c
Fix access to special attributes in experimental mode (#1875)
vnlitvinov Jul 31, 2020
aa0360f
FIX-#1647: Support repr() on empty Series. (#1859)
itamarst Jul 31, 2020
77fde17
Fix recursion in experimental mode in some cases (#1874)
vnlitvinov Jul 31, 2020
e9964e4
TEST-#1876: Add tests running under experimental (#1877)
vnlitvinov Aug 3, 2020
2fc06d3
FIX-#1867: establish CI (#1868)
anmyachev Aug 3, 2020
ea12d95
FIX-#1887: fix versions (#1888)
anmyachev Aug 4, 2020
8739fd9
FIX-#1674: Series.apply and DataFrame.apply (#1718)
amyskov Aug 4, 2020
7fd2476
FIX-#1869: index sort for count(level=...) (#1870)
amyskov Aug 5, 2020
5dd4047
TEST-#1865: Add RPyC library in requirements (#1866)
anmyachev Aug 5, 2020
d320341
FEAT-#1881: add scale-out feature dependencies (#1892)
anmyachev Aug 6, 2020
b04916d
FIX-#1497: Don't sort in concat() when sort=False (#1889)
itamarst Aug 6, 2020
956cb4e
FIX-#1904: CI fix (#1905)
amyskov Aug 10, 2020
0c52efa
Revert "FIX-#1904: CI fix" (#1907)
devin-petersohn Aug 10, 2020
82b6793
FIX-#1854: groupby() with arbitrary series (#1886)
itamarst Aug 11, 2020
61247eb
REFACTOR-#1917: move part of reset_index code to backend (#1920)
ienkovich Aug 18, 2020
3f9f6f8
FEAT-#1925: add import from arrow table (#1913)
ienkovich Aug 18, 2020
bed0bcc
REFACTOR-#1928: move columnarization to backend (#1914)
ienkovich Aug 18, 2020
6f0da28
REFACTOR-#1929: avoid unnecessary index access in groupby (#1910)
ienkovich Aug 18, 2020
f402caa
FEAT-#1911: support cat methods (#1912)
ienkovich Aug 19, 2020
1745349
REFACTOR-#1938: avoid index access in a simple column reference (#1939)
ienkovich Aug 21, 2020
6fc51a8
FEAT-#1936: avoid empty frame checks for lazy backend (#1937)
ienkovich Aug 22, 2020
a293dae
FIX-#1898: Support DataFrame.__setitem__ with boolean mask. (#1899)
itamarst Aug 24, 2020
1e1a773
FIX-#1784: removed columns sort from 'df_equals' (#1896)
dchigarev Aug 24, 2020
49ed9cc
REFACTOR-#1941: move index access from DataFrame.insert to backend (#…
ienkovich Aug 25, 2020
2c29f85
TEST-#1955: speed up TestDataFrameDefault test (#1956)
anmyachev Aug 26, 2020
818644f
TEST-#1950: improve TestDataFrameIter test time (#1947)
anmyachev Aug 26, 2020
963e756
FEAT-#1196: Support `replace` method for `DataFrame` and `Series` (#1…
YarShev Aug 26, 2020
12ed9d9
FEAT-#1922: Support cases for `DataFrame.join` (#1923)
YarShev Aug 26, 2020
6373569
REFACTOR-#1973: Refactor code in accordance with (#1974)
YarShev Aug 27, 2020
ad7a7f2
FEAT-#1838: Lazy map evaluation at Pandas backend (#1940)
dchigarev Aug 28, 2020
035a29e
REFACTOR-#1934: move MultiIndex checks to backend (#1935)
ienkovich Aug 28, 2020
70c1278
FIX-#1679: Assignment df.loc["row"]["col"] fixed in case of one row (…
dchigarev Aug 28, 2020
605deb4
FEAT-#1291 #1187: Add `DataFrame.unstack`, `Series.unstack` (#1649)
prutskov Aug 28, 2020
8133153
FIX-#1646: Fix representation of `Series` with datetimelike index (#1…
prutskov Aug 28, 2020
0f007c5
TEST-#1961: speed up TestDataFrameReduction_* test (#1962)
anmyachev Aug 31, 2020
b868c2d
FEAT-#1189: Add `DataFrame.stack` (#1673)
prutskov Aug 31, 2020
00bf1e1
FEAT-#1201: pivot implementation via unstack (#1645)
dchigarev Aug 31, 2020
4ef1634
TEST-#1985: optimize common test dataset (#1984)
anmyachev Sep 1, 2020
9b782ae
FEAT-#1958: pin ray to 0.8.7 (#1966)
binarycrayon Sep 1, 2020
51f1577
REFACTOR-#1527: split test_dataframe.py into several others (#1995)
abykovsk Sep 1, 2020
45922ae
REFACTOR-#1879: Move logic for `groupby.agg` into query compiler (#1885)
devin-petersohn Sep 3, 2020
e8a5398
FIX-#1998: Unpin msgpack to satisfy Ray requirements (#1999)
gshimansky Sep 3, 2020
88af24f
FIX-#1953: Fix computing of reduced indices (#1960)
YarShev Sep 3, 2020
4c5d951
FEAT-#1944: Added ability to broadcast multi-partitioned frames (#1945)
dchigarev Sep 4, 2020
494ae5d
REFACTOR-#2009: avoid index access in is_scalar calls (#2010)
ienkovich Sep 4, 2020
81a299f
FIX-#1284: Series.searchsorted (#1668)
amyskov Sep 4, 2020
07fbd71
FEAT-#1222: Implement DataFrame.asof() without Pandas fallback (#1989)
itamarst Sep 4, 2020
05b014f
FEAT-#1523: 'nrows' support to 'read_csv' added (#1894)
dchigarev Sep 4, 2020
088431c
DOCS-#2015: update supported docs (#2016)
anmyachev Sep 7, 2020
ce9ba4e
TEST-2022: speed up prepare-cache job (#2023)
anmyachev Sep 7, 2020
29829ab
TEST-#2024: remove test_dataframe.py (#2025)
anmyachev Sep 7, 2020
61601e0
TEST-#2020: decrease parallel tests on Ubuntu (#2021)
anmyachev Sep 7, 2020
fcc86bf
TEST-#2030: speed up cache; decrease parallel jobs in push.yml (#2031)
anmyachev Sep 7, 2020
90897d0
TEST-#2028: speed up window tests (#2029)
anmyachev Sep 7, 2020
24a9b47
TEST-#2026: speed up test_join_sort.py (#2027)
anmyachev Sep 7, 2020
c49b9bc
FIX-#1959 #1987: Fix `duplicated` and `drop_duplicates` functions (#…
prutskov Sep 7, 2020
cfac9dd
TEST-#2037: speed up test_binary with refactor dataframe.py (#2038)
anmyachev Sep 7, 2020
f4b68ea
TEST-#2044: speed up iter tests (#2045)
anmyachev Sep 7, 2020
4f88ac5
TEST-#2042: speed up udf tests (#2043)
anmyachev Sep 7, 2020
d4d42b4
TEST-#2033: speed up test_series.py (#2034)
anmyachev Sep 8, 2020
86464d4
TEST-#2039: speed up default tests (#2040)
anmyachev Sep 8, 2020
df9aaa5
TEST-#2050: decrease number of parallel jobs on windows Ci (#2051)
anmyachev Sep 8, 2020
6c8596d
Conda recipe for Modin (#1986)
anton-malakhov Sep 8, 2020
b8049eb
REFACTOR-#2011: move default_to_pandas in groupby to backend (#2041)
ienkovich Sep 8, 2020
1e6bbde
FEAT-#1285: Add `sem` implementation for `Series` and `DataFrame` (#2…
YarShev Sep 9, 2020
90e403e
FIX-#2054: Moved non-dependent on modin.DataFrame utils to modin/util…
dchigarev Sep 9, 2020
a9a58b0
TEST-#1891: use conda instead of pip (#2056)
anmyachev Sep 9, 2020
d694aa7
FIX-#2052: fix spawning of remote cluster (#2053)
anmyachev Sep 10, 2020
0763d53
FEAT-#2058: Improve how remote factories are defined (#2060)
vnlitvinov Sep 10, 2020
98292b5
FIX-#1918: fix core dumped issue (#2000)
amyskov Sep 14, 2020
872da76
FIX-#1386: Fix `read_csv` for incorrect csv data (#2076)
prutskov Sep 15, 2020
af5216d
REFACTOR-#2035: move getitem_array to the backend (#2036)
ienkovich Sep 15, 2020
f1778e6
REFACTOR-#2083: Rename LISCENSE_HEADER to LICENSE_HEADER. (#2082)
heuermh Sep 16, 2020
40aa690
Merge pull request #2 from modin-project/master
aregm Sep 16, 2020
16b7457
DOC-REFACTOR: 1/n Refactoring the documentation - added first stab fo…
aregm Jul 21, 2020
20 changes: 16 additions & 4 deletions docs/conf.py
@@ -10,7 +10,7 @@
 import modin
 
 project = u"Modin"
-copyright = u"2018, Modin"
+copyright = u"2018-2020, Modin"
 author = u"Modin contributors"
 
 # The short X.Y version
@@ -28,7 +28,17 @@
 # Add any Sphinx extension module names here, as strings. They can be
 # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
 # ones.
-extensions = []
+extensions = [
+    "sphinx.ext.autodoc",
+    "sphinx.ext.intersphinx",
+    'sphinx.ext.todo',
+    'sphinx.ext.mathjax',
+    'sphinx.ext.githubpages',
+    'sphinx.ext.graphviz',
+    'sphinxcontrib.plantuml',
+    "sphinx_issues",
+]
+
 
 # Add any paths that contain templates here, relative to this directory.
 templates_path = ["_templates"]
@@ -67,15 +77,15 @@
 # The theme to use for HTML and HTML Help pages. See the documentation for
 # a list of builtin themes.
 #
-html_theme = "nature"
+html_theme = "sphinx_rtd_theme"
 
 html_logo = "img/MODIN_ver2.png"
 
 # Theme options are theme-specific and customize the look and feel of a theme
 # further. For a list of options available for each theme, see the
 # documentation.
 #
-html_theme_options = {"sidebarwidth": 270}
+html_theme_options = {"sidebarwidth": 270, 'collapse_navigation': False}
 
 # Custom sidebar templates, must be a dictionary that maps document names
 # to template names.
@@ -88,3 +98,5 @@
 html_sidebars = {
     "**": ["globaltoc.html", "relations.html", "sourcelink.html", "searchbox.html"]
 }
+
+issues_github_path = "modin-project/modin"
260 changes: 260 additions & 0 deletions docs/developer/architecture.rst
@@ -0,0 +1,260 @@
System Architecture
===================

In this section, we lay out the overall system architecture of Modin and go
into detail about its component design and implementation. This document also
contains reference information for those interested in contributing new
functionality, bugfixes, and enhancements.

High-Level Architectural View
-----------------------------
The diagram below outlines the general layered view of the components of Modin,
followed by a short description of each major section of the documentation.


.. image:: /img/modin_architecture.png
:align: center

Modin is logically separated into different layers that represent the hierarchy of a
typical Database Management System. Abstracting out each component allows us to
individually optimize and swap out components without affecting the rest of the system.
For example, we can implement new compute kernels that are optimized for a certain type
of data and simply plug them into the existing infrastructure by implementing a small
interface. The new kernels can still be distributed by our compute engine of choice,
since the distribution logic is handled internally.

System View
---------------------------
If we look at the overall class structure of the Modin system from the very top, it
looks something like this:

.. image:: /img/10000_meter.png
:align: center

The user (a data scientist) interacts with the Modin system by sending interactive or
batch commands through the API, and Modin executes them using various backend execution
engines: Ray, Dask, and MPI are currently supported.

Subsystem/Container View
------------------------
If we drill down to the next level of detail, we see that inside Modin the layered
architecture is implemented using several interacting components:

.. image:: /img/component_view.png
:align: center

For simplicity, the other backend systems (Dask and MPI) are omitted and only the Ray backend is shown.

* The Dataframe subsystem is the backbone of dataframe storage and query compilation. It is
  responsible for dispatching ingress/egress to the appropriate module, receiving the pandas
  API calls, and invoking the query compiler to convert those calls to the internal
  intermediate Dataframe Algebra.
* The Data Ingress/Egress module works in conjunction with the Dataframe and Partitions
  subsystems to read data, split it into partitions, and send each partition to the
  appropriate node for storage.
* The Query Planner is the subsystem that translates the pandas API into an intermediate
  Dataframe Algebra DAG and performs an initial set of optimizations.
* The Query Executor is responsible for taking the Dataframe Algebra DAG, performing further
  optimizations based on the selected backend execution subsystem, and mapping or compiling
  the Dataframe Algebra DAG to an actual execution sequence.
* The Backends module is responsible for mapping each abstract operation to an actual
  executor call, e.g. pandas, PyArrow, or a custom backend.
* The Orchestration subsystem is responsible for spawning and controlling the actual
  execution environment for the selected backend. It spawns the nodes, starts the execution
  environment (e.g. Ray), monitors the state of the executors, and provides telemetry.

Component View
--------------
Coming soon...

Module/Class View
-----------------
Coming soon...

DataFrame Partitioning
----------------------

The Modin DataFrame architecture follows in the footsteps of modern architectures for
database and high performance matrix systems. We chose a partitioning schema that
partitions along both columns and rows because it gives Modin flexibility and
scalability in both the number of columns and the number of rows supported. The
following figure illustrates this concept.

.. image:: /img/block_partitions_diagram.png
:align: center

Currently, each partition's memory format is a `pandas DataFrame`_. In the future, we will
support additional in-memory formats for the backend, namely `Arrow tables`_.
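
For illustration only, the sketch below splits a pandas DataFrame into a 2D grid of
smaller pandas DataFrames, which is the core idea behind Modin's block partitioning.
It is plain pandas/NumPy and not Modin's actual partitioning code:

.. code-block:: python

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.arange(64).reshape(8, 8))

    # Compute row/column position ranges for a 2x2 grid of block partitions.
    row_bounds = np.array_split(np.arange(len(df)), 2)
    col_bounds = np.array_split(np.arange(len(df.columns)), 2)

    # Each cell is an independent pandas DataFrame that could live on a
    # separate worker, giving scalability along both axes.
    blocks = [[df.iloc[rows, cols] for cols in col_bounds] for rows in row_bounds]

    print(blocks[0][0].shape, blocks[1][1].shape)  # (4, 4) (4, 4)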

Index
"""""

We currently use the ``pandas.Index`` object for indexing both columns and rows. In the
future, we will implement a distributed, pandas-compatible Index object in order to remove
this scaling limitation from the system. This does not start to become a problem until you
are operating on more than tens of billions of columns or rows, so most workloads will
not be affected by this scalability limit. **Important note**: If you are using the
default index (``pandas.RangeIndex``), there is a fixed memory overhead (~200 bytes) and
there will be no scalability issues with the index.
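
As a purely illustrative aside (plain pandas, not Modin code), the default
``pandas.RangeIndex`` stores only start/stop/step, so its memory footprint is a small
constant, while an array-backed index grows linearly with the number of rows:

.. code-block:: python

    import numpy as np
    import pandas as pd

    lazy = pd.RangeIndex(10**8)                # default index: start/stop/step only
    materialized = pd.Index(np.arange(10**6))  # array-backed integer index

    print(lazy.memory_usage())           # small constant, independent of length
    print(materialized.memory_usage())   # ~8 MB here; grows with length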


API
"""

The API is the outermost layer that faces users. The majority of our current effort is
spent implementing the components of the pandas API. We have implemented a toy example
of a SQLite API as a proof of concept, but it is not yet ready for use or testing. There
are also plans to expose the Modin DataFrame API as a reduced API set that encompasses
the entire pandas/dataframe API.
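
In practice, the user-facing pandas API layer is used as a drop-in import; a minimal
usage sketch (assuming Modin and a supported engine such as Ray or Dask are installed):

.. code-block:: python

    import modin.pandas as pd

    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
    print(df.sum())    # dispatched through the API layer described above
    print(df.head(2))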

Query Compiler
""""""""""""""

The Query Compiler receives queries from the pandas API layer. The API layer's
responsibility is to ensure clean input to the Query Compiler. The Query Compiler must
have knowledge of the compute kernels/in-memory format of the data in order to
efficiently compile the queries.

The Query Compiler is responsible for sending the compiled query to the Modin DataFrame.
In this design, the Query Compiler does not have information about where or when the
query will be executed, and gives the control of the partition layout to the Modin
DataFrame.

In the interest of reducing the pandas API, the Query Compiler layer closely follows the
pandas API, but cuts out a large majority of the repetition.
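
The sketch below is a toy model of this layering with hypothetical class names (it is
not Modin's real code): the user-facing layer cleans its input and delegates to a query
compiler, which compiles the call against a frame object without deciding where or when
partitions execute.

.. code-block:: python

    import pandas as pd

    class ToyFrame:
        """Stands in for the Modin DataFrame layer; here it simply wraps pandas."""
        def __init__(self, pandas_df):
            self._df = pandas_df

        def reduce(self, func, axis):
            return ToyFrame(getattr(self._df, func)(axis=axis).to_frame())

    class ToyQueryCompiler:
        """Compiles API calls into operations on the frame layer."""
        def __init__(self, frame):
            self._frame = frame

        def sum(self, axis):
            return ToyQueryCompiler(self._frame.reduce("sum", axis))

        def to_pandas(self):
            return self._frame._df

    class ToyDataFrame:
        """User-facing layer: validates input, then delegates to the compiler."""
        def __init__(self, data=None, query_compiler=None):
            if query_compiler is None:
                query_compiler = ToyQueryCompiler(ToyFrame(pd.DataFrame(data)))
            self._query_compiler = query_compiler

        def sum(self, axis=0):
            axis = 0 if axis in (0, "index") else 1  # ensure clean input
            return ToyDataFrame(query_compiler=self._query_compiler.sum(axis))

    result = ToyDataFrame({"a": [1, 2], "b": [3, 4]}).sum()
    print(result._query_compiler.to_pandas())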

Modin DataFrame
"""""""""""""""

At this layer, operations can be performed lazily. Currently, Modin executes most
operations eagerly in an attempt to behave as pandas does. Some operations, e.g.
``transpose``, are expensive and create full copies of the data in memory. In these
cases, we can wait until another operation triggers computation. In the future, we plan
to add additional query planning and laziness to Modin to ensure that queries are
performed efficiently.

The structure of the Modin DataFrame is extensible, such that any operation that could
be better optimized for a given backend can be overridden and optimized in that way.

This layer has a significantly reduced API compared to the QueryCompiler and the user-facing
API. Each of these APIs represents a single way of performing a given operation or
behavior. Some of these are expanded for convenience/understanding. The API abstractions
are as follows:

Modin DataFrame API
'''''''''''''''''''

* ``mask``: Indexing/masking/selecting on the data (by label or by integer index).
* ``copy``: Create a copy of the data.
* ``mapreduce``: Reduce the dimension of the data.
* ``foldreduce``: Reduce the dimension of the data, but entire column/row information is needed.
* ``map``: Perform a map.
* ``fold``: Perform a fold.
* ``apply_<type>``: Apply a function that may or may not change the shape of the data.

  * ``full_axis``: Apply a function that requires knowledge of the entire axis.
  * ``full_axis_select_indices``: Apply a function to a subset of the data that requires knowledge of the entire axis.
  * ``select_indices``: Apply a function to a subset of the data. This is mainly used for indexing.

* ``binary_op``: Perform a function between two dataframes.
* ``concat``: Append one or more dataframes to either axis of this dataframe.
* ``transpose``: Swap the axes (columns become rows, rows become columns).
* ``groupby``:

  * ``groupby_reduce``: Perform a reduction on each group.
  * ``groupby_apply``: Apply a function to each group.

* take functions

  * ``head``: Take the first ``n`` rows.
  * ``tail``: Take the last ``n`` rows.
  * ``front``: Take the first ``n`` columns.
  * ``back``: Take the last ``n`` columns.

* import/export functions

  * ``from_pandas``: Convert a pandas dataframe to a Modin dataframe.
  * ``to_pandas``: Convert a Modin dataframe to a pandas dataframe.
  * ``to_numpy``: Convert a Modin dataframe to a numpy array.
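
To make the distinction between some of these abstractions concrete, the following
sketch uses plain pandas on two toy blocks (it does not call Modin internals): a
``map``-style operation can run on each partition independently, while a ``full_axis``
operation needs the entire axis to be correct.

.. code-block:: python

    import numpy as np
    import pandas as pd

    block_a = pd.DataFrame({"x": [1.0, np.nan]})
    block_b = pd.DataFrame({"x": [3.0, 4.0]})

    # map-style: each row's result depends only on that row, so blocks can be
    # processed independently and the results concatenated.
    mapped = pd.concat([block_a.isna(), block_b.isna()])

    # full_axis-style: a cumulative sum needs the whole column; doing it per
    # block and concatenating gives the wrong answer.
    wrong = pd.concat([block_a.cumsum(), block_b.cumsum()])
    right = pd.concat([block_a, block_b]).cumsum()
    print(wrong["x"].tolist())  # [1.0, nan, 3.0, 7.0] -- per block, incorrect
    print(right["x"].tolist())  # [1.0, nan, 4.0, 8.0] -- whole axis, correct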

More documentation can be found internally in the code_. This API is not complete, but
represents an overwhelming majority of operations and behaviors.

This API can be implemented by other distributed/parallel DataFrame libraries and
plugged in to Modin as well. Create an issue_ or discuss on our Discourse_ for more
information!

The Modin DataFrame is responsible for the data layout and shuffling, partitioning,
and serializing the tasks that get sent to each partition. Other implementations of the
Modin DataFrame interface will have to handle these as well.

Execution Engine/Framework
""""""""""""""""""""""""""

This layer is what Modin uses to perform computation on a partition of the data. The
Modin DataFrame is designed to work with `task parallel`_ frameworks, but with some
effort, a data parallel framework is possible.

Internal abstractions
"""""""""""""""""""""

These abstractions are not included in the above architecture, but are important to the
internals of Modin.

Partition Manager
'''''''''''''''''

The Partition Manager can change the size and shape of the partitions based on the type
of operation. For example, certain operations are complex and require access to an
entire column or row. The Partition Manager can convert the block partitions to row
partitions or column partitions. This gives Modin the flexibility to perform operations
that are difficult in row-only or column-only partitioning schemas.

Another important component of the Partition Manager is the serialization and shipment
of compiled queries to the Partitions. It maintains metadata for the length and width of
each partition, so when operations only need to operate on or extract a subset of the
data, it can ship those queries directly to the correct partition. This is particularly
important for some operations in pandas which can accept different arguments and
operations for different columns, e.g. ``fillna`` with a dictionary.

This abstraction separates the actual data movement and function application from the
DataFrame layer to keep the DataFrame API small and separately optimize the data
movement and metadata management.
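
A simplified, hypothetical sketch of this bookkeeping (not the real Partition Manager
API): column-partition metadata lets a query that touches only some columns, such as
``fillna`` with a dictionary, be shipped only to the partitions that own those columns.

.. code-block:: python

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.arange(24).reshape(4, 6), columns=list("abcdef"))
    col_splits = [list("abc"), list("def")]       # two column partitions
    blocks = [df[cols] for cols in col_splits]    # one row partition for brevity
    widths = [len(cols) for cols in col_splits]   # metadata the manager would track

    def apply_to_columns(func, columns):
        """Route ``func`` only to the partitions that own ``columns``."""
        out = []
        for cols, block in zip(col_splits, blocks):
            hit = [c for c in columns if c in cols]
            if hit:
                block = block.copy()
                block[hit] = func(block[hit])
            out.append(block)
        return pd.concat(out, axis=1)

    # Only the partitions holding columns "b" and "e" receive the function.
    print(apply_to_columns(lambda part: part * 10, ["b", "e"]))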

Partition
'''''''''

Partitions are responsible for managing a subset of the DataFrame. As mentioned
above, the DataFrame is partitioned both row-wise and column-wise. This gives Modin
scalability in both directions and flexibility in data layout. There are a number of
optimizations in Modin that are implemented in the partitions. Partitions are specific
to the execution framework and in-memory format of the data. This allows Modin to
exploit potential optimizations across both of these. These optimizations are explained
further on the pages specific to the execution framework.
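
A minimal sketch of the partition idea on the Ray backend (illustrative class and
function names, not Modin's actual partition classes): each partition holds a reference
to a pandas DataFrame in Ray's object store and applies functions to it, shipping the
function to the data rather than moving the data.

.. code-block:: python

    import pandas as pd
    import ray

    ray.init(ignore_reinit_error=True)

    @ray.remote
    def _apply(partition_df, func):
        return func(partition_df)

    class ToyRayPartition:
        def __init__(self, object_ref):
            self._ref = object_ref  # reference into Ray's object store

        @classmethod
        def put(cls, pandas_df):
            return cls(ray.put(pandas_df))

        def apply(self, func):
            # Schedule the function where the data lives; returns a new partition.
            return ToyRayPartition(_apply.remote(self._ref, func))

        def get(self):
            return ray.get(self._ref)

    part = ToyRayPartition.put(pd.DataFrame({"a": [1, 2, 3]}))
    print(part.apply(lambda df: df * 2).get())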

Supported Execution Frameworks and Memory Formats
"""""""""""""""""""""""""""""""""""""""""""""""""

This is the list of execution frameworks and memory formats supported in Modin. If you
would like to contribute a new execution framework or memory format, please see the
documentation page on Contributing_.

- `Pandas on Ray`_

  - Uses the Ray_ execution framework.
  - The compute kernel/in-memory format is a pandas DataFrame.

- `Pandas on Dask`_

  - Uses the `Dask Futures`_ execution framework.
  - The compute kernel/in-memory format is a pandas DataFrame.

- `Pyarrow on Ray`_ (experimental)

  - Uses the Ray_ execution framework.
  - The compute kernel/in-memory format is a pyarrow Table.
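
In day-to-day use, the execution framework is selected before importing
``modin.pandas``, for example via the ``MODIN_ENGINE`` environment variable (the chosen
engine must be installed; the CSV path below is hypothetical):

.. code-block:: python

    import os
    os.environ["MODIN_ENGINE"] = "dask"  # or "ray"

    import modin.pandas as pd

    df = pd.read_csv("example.csv")  # executed on the selected backend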

.. _pandas Dataframe: https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.html
.. _Arrow tables: https://arrow.apache.org/docs/python/generated/pyarrow.Table.html
.. _Ray: https://github.com/ray-project/ray
.. _code: https://github.com/modin-project/modin/blob/master/modin/engines/base/frame/data.py
.. _Contributing: contributing.html
.. _Pandas on Ray: UsingPandasonRay/optimizations.html
.. _Pandas on Dask: UsingPandasonDask/optimizations.html
.. _Dask Futures: https://docs.dask.org/en/latest/futures.html
.. _issue: https://github.com/modin-project/modin/issues
.. _Discourse: https://discuss.modin.org
.. _task parallel: https://en.wikipedia.org/wiki/Task_parallelism
.. _Pyarrow on Ray: UsingPyarrowonRay/index.html