Docs: Improve Landing Page (#1520)

* storage access docs moved to their own section at the bottom
* code formatting now copy/paste friendly
* rewrite of getting started
* section on transactions added

Co-authored-by: Nick Clarke <[email protected]>
DrNickClarke and Nick Clarke authored Apr 30, 2024

File changed: `docs/mkdocs/docs/index.md`

![](images/FullWithBorder.png)

## What is ArcticDB?

ArcticDB is a serverless DataFrame database engine designed for the Python Data Science ecosystem.

ArcticDB enables you to store, retrieve and process DataFrames at scale, backed by commodity object storage (S3-compatible storages and Azure Blob Storage).

ArcticDB requires *zero additional infrastructure* beyond a running Python environment and access to object storage and can be **installed in seconds.**

ArcticDB is:

- **Fast**
    - Process up to 100 million rows per second for a single consumer
    - Process a billion rows per second across all consumers
    - Quick and easy to install: `pip install arcticdb`
- **Flexible**
    - Data schemas are not required
    - Supports streaming data ingestion
    - Bitemporal - stores all previous versions of stored data
    - Easy to set up both locally and in the cloud
    - Scales from dev/research to production environments
- **Familiar**
    - ArcticDB is the world's simplest shareable database
    - Easy to learn for anyone with Python and Pandas experience
    - Just you and your data - the cognitive overhead is very low

## Getting Started

This section will cover installation, setup and basic usage. More details on basics and advanced features can be found in the [tutorials](tutorials/fundamentals.md) section.

### Installation

ArcticDB supports Python 3.6 - 3.11. Python 3.7 is the earliest supported version on Windows.

To install, simply run:

```python
pip install arcticdb
```

### Setup

ArcticDB is a storage engine designed for object storage and also supports local-disk storage using LMDB.

!!! Storage Compatibility

    ArcticDB supports any S3 API compatible storage, including AWS and Azure, and storage appliances like [VAST Universal Storage](https://vastdata.com/) and [Pure Storage](https://purestorage.com/).

    ArcticDB also supports LMDB for local/file based storage - to use LMDB, pass an LMDB path as the URI: `adb.Arctic('lmdb://path/to/desired/database')`.

To get started, we can import ArcticDB and instantiate it:

```python
import arcticdb as adb
# this will set up the storage using the local file system
uri = "lmdb://tmp/arcticdb_intro"
ac = adb.Arctic(uri)
```

For more information on how to correctly format the `uri` string for other storages, please view the docstring ([`help(Arctic)`](api/arctic.md#arcticdb.Arctic)) or read the [storage access](#storage-access) section below.

### Library Setup

ArcticDB is geared towards storing many (potentially millions of) tables. Individual tables (DataFrames) are called _symbols_ and are stored in collections called _libraries_. A single _library_ can store many symbols.

_Libraries_ must first be initialized prior to use:

```python
ac.create_library('intro') # static schema - see note below
ac.list_libraries()
```
output
```python
['intro']
```

The library must then be instantiated in the code ready to read/write data:

```python
library = ac['intro']
```

Sometimes it is more convenient to combine library creation and instantiation. The form below automatically creates the library if it does not already exist, saving you the check:

```python
library = ac.get_library('intro', create_if_missing=True)
```

!!! info "ArcticDB Static & Dynamic Schemas"

    ArcticDB does not need data schemas, unlike many other databases. You can write any DataFrame and read it back later. If the shape of the data is changed and then written again, it will all just work. Nice and simple.

    The one exception where schemas are needed is in the case of functions that modify existing symbols: `update` and `append`. When modifying a symbol, the new data must have the same schema as the existing data. The schema here means the index type and the name, order, and type of each column in the DataFrame. In other words, when you are appending new rows they must look like the existing rows. This is the default option and is called `static schema`.

    However, if you need to add, remove or change the type of columns via `update` or `append`, then you can do that. You simply need to create the library with the `dynamic_schema` option set. See the `library_options` parameter of the [`create_library`](api/arctic/#arcticdb.Arctic.create_library) method.

    So you have the best of both worlds - you can choose to either enforce a static schema on your data, so it cannot be changed by modifying operations, or allow it to be flexible.

    The choice to use static or dynamic schemas must be set at library creation time.

    In this section we are using `static schema`, just to be clear.
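
If you do want the flexibility to change the set of columns later, the sketch below shows one way to create a dynamic-schema library. It assumes the `ac` instance from the Setup section above; the library name `intro_dynamic` is purely illustrative.

```python
import arcticdb as adb

ac = adb.Arctic("lmdb://tmp/arcticdb_intro")

# The schema behaviour is fixed at library creation time.
ac.create_library(
    'intro_dynamic',  # hypothetical library name, used only for this example
    library_options=adb.LibraryOptions(dynamic_schema=True),
)
dynamic_library = ac['intro_dynamic']
```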

### Reading And Writing Data

Now that we have a library set up, we can get to reading and writing data. ArcticDB has a set of simple functions for DataFrame storage.

Let's write a DataFrame to storage.

First create the data:

```python
# 50 columns, 25 rows, random data, datetime indexed.
import pandas as pd
import numpy as np
from datetime import datetime
cols = ['COL_%d' % i for i in range(50)]
df = pd.DataFrame(np.random.randint(0, 50, size=(25, 50)), columns=cols)
df.index = pd.date_range(datetime(2000, 1, 1, 5), periods=25, freq="H")
df.head(5)
```
_output (the first 5 rows of the data)_
```
COL_0 COL_1 COL_2 COL_3 COL_4 COL_5 COL_6 COL_7 ...
2000-01-01 05:00:00 18 48 10 16 38 34 25 44 ...
2000-01-01 06:00:00 48 10 24 45 22 36 30 19 ...
2000-01-01 07:00:00 25 16 36 29 25 9 48 2 ...
2000-01-01 08:00:00 38 21 2 1 37 6 31 31 ...
2000-01-01 09:00:00 45 17 39 47 47 11 33 31 ...
```

Then write to the library:

```python
library.write('test_frame', df)
```
_output (information about what was written)_
```
VersionedItem(symbol=test_frame,library=intro,data=n/a,version=0,metadata=None,host=<host>)
```

The `'test_frame'` DataFrame will be used for the remainder of this guide.

!!! info "ArcticDB index"

    When writing Pandas DataFrames, ArcticDB supports the following index types:

    * `pandas.Index` containing `int64` (or the corresponding dedicated types `Int64Index`, `UInt64Index`)
    * `RangeIndex`
    * `DatetimeIndex`
    * `MultiIndex` composed of the above supported types

    The "row" concept in `head()`/`tail()` refers to the row number ('iloc'), not the value in the `pandas.Index` ('loc').

Read the data back from storage:

```Python
from_storage_df = library.read('test_frame').data
from_storage_df.head(5)
```
_output (the first 5 rows but read from the database)_
```
COL_0 COL_1 COL_2 COL_3 COL_4 COL_5 COL_6 COL_7 ...
2000-01-01 05:00:00 18 48 10 16 38 34 25 44 ...
2000-01-01 06:00:00 48 10 24 45 22 36 30 19 ...
2000-01-01 07:00:00 25 16 36 29 25 9 48 2 ...
2000-01-01 08:00:00 38 21 2 1 37 6 31 31 ...
2000-01-01 09:00:00 45 17 39 47 47 11 33 31 ...
```

The data read matches the original data, of course.


### Slicing and Filtering

ArcticDB enables you to slice by row and by column.

!!! info "ArcticDB indexing"

    ArcticDB will construct a full index for _ordered numerical and timeseries (e.g. DatetimeIndex) Pandas indexes_. This will enable optimised slicing across index entries. If the index is unsorted or not numeric your data can still be stored, but row-slicing will be slower.

#### Row-slicing

```Python
library.read('test_frame', date_range=(df.index[5], df.index[8])).data
```
_output (the rows in the data range requested)_
```
COL_0 COL_1 COL_2 COL_3 COL_4 COL_5 COL_6 COL_7 ...
2000-01-01 10:00:00 23 39 0 45 15 28 10 17 ...
2000-01-01 11:00:00 36 28 22 43 23 6 10 1 ...
2000-01-01 12:00:00 18 42 1 15 19 36 41 36 ...
2000-01-01 13:00:00 28 32 47 37 17 44 29 24 ...
```

#### Column slicing

```Python
_range = (df.index[5], df.index[8])
_columns = ['COL_30', 'COL_31']
library.read('test_frame', date_range=_range, columns=_columns).data
```
_output (the rows in the date range and columns requested)_
```
COL_30 COL_31
2000-01-01 10:00:00 31 2
2000-01-01 11:00:00 3 34
2000-01-01 12:00:00 24 43
2000-01-01 13:00:00 18 8
```

#### Filtering

ArcticDB uses a Pandas-_like_ syntax to describe how to filter data. For more details, including the limitations, please view the docstring ([`help(QueryBuilder)`](api/query_builder)).

!!! info "ArcticDB Filtering Philosophy & Restrictions"

    In most cases this is more memory efficient and performant than the equivalent Pandas operation, as the processing happens within the C++ storage engine and is parallelized over multiple threads of execution.

```python
_range = (df.index[5], df.index[8])
_cols = ['COL_30', 'COL_31']
import arcticdb as adb
q = adb.QueryBuilder()
q = q[(q["COL_30"] > 10) & (q["COL_31"] < 40)]
library.read('test_frame', date_range=_range, columns=_cols, query_builder=q).data
```
_output (the data filtered by date range, columns and the query which filters based on the data values)_
```
COL_30 COL_31
2000-01-01 10:00:00 31 2
2000-01-01 13:00:00 18 8
```

### Modifications, Versioning (aka Time Travel)

ArcticDB fully supports modifying stored data via two primitives: _update_ and _append_.

These operations are atomic but do not lock the symbol. Please see the section on [transactions](#transactions) for more on this.

#### Append

Let's append data to the end of the timeseries.

To start, we will take a look at the last few records of the data (before it gets modified):

```python
library.tail('test_frame', 4).data
```
_output_
```
COL_0 COL_1 COL_2 COL_3 COL_4 COL_5 COL_6 COL_7 ...
2000-01-02 02:00:00 46 12 38 47 4 31 1 42 ...
2000-01-02 03:00:00 46 20 5 42 8 35 12 2 ...
2000-01-02 04:00:00 17 48 36 43 6 46 5 8 ...
2000-01-02 05:00:00 20 19 24 44 29 32 2 19 ...
```
Then create 3 new rows to append. For `append` to work, the new data must have its first `datetime` starting after the end of the existing data.

```python
random_data = np.random.randint(0, 50, size=(3, 50))
df_append = pd.DataFrame(random_data, columns=['COL_%d' % i for i in range(50)])
df_append.index = pd.date_range(datetime(2000, 1, 2, 7), periods=3, freq="H")
df_append
```
_output_
```
COL_0 COL_1 COL_2 COL_3 COL_4 COL_5 COL_6 COL_7 ...
2000-01-02 07:00:00 9 15 4 48 48 35 34 49 ...
2000-01-02 08:00:00 35 4 12 30 30 12 38 25 ...
2000-01-02 09:00:00 25 17 3 1 1 15 33 49 ...
```

Now _append_ that DataFrame to what was written previously:

```Python
library.append('test_frame', df_append)
```
_output_
```
VersionedItem(symbol=test_frame,library=intro,data=n/a,version=1,metadata=None,host=<host>)
```
Then look at the final 5 rows to see what happened:

```python
library.tail('test_frame', 5).data
```
_output_
```
COL_0 COL_1 COL_2 COL_3 COL_4 COL_5 COL_6 COL_7 ...
2000-01-02 04:00:00 17 48 36 43 6 46 5 8 ...
2000-01-02 05:00:00 20 19 24 44 29 32 2 19 ...
2000-01-02 07:00:00 9 15 4 48 48 35 34 49 ...
2000-01-02 08:00:00 35 4 12 30 30 12 38 25 ...
2000-01-02 09:00:00 25 17 3 1 1 15 33 49 ...
```

The final 5 rows consist of the last two rows written previously followed by the 3 new rows that we have just appended.

Append is very useful for adding new data to the end of a large timeseries.

#### Update

The update primitive enables you to overwrite a contiguous chunk of data. This results in modifying some rows and deleting others, as we will see in the example below.

Here we create a new DataFrame for the update, with only 2 rows that are 2 hours apart:

```python
random_data = np.random.randint(0, 50, size=(2, 50))
df = pd.DataFrame(random_data, columns=['COL_%d' % i for i in range(50)])
df.index = pd.date_range(datetime(2000, 1, 1, 5), periods=2, freq="2H")
df
```
_output (the 2 rows to be used for the update)_
```
COL_0 COL_1 COL_2 COL_3 COL_4 COL_5 COL_6 COL_7 ...
2000-01-01 05:00:00 47 49 15 6 22 48 45 22 ...
2000-01-01 07:00:00 46 10 2 49 24 49 8 0 ...
```
Now update the symbol:
```python
library.update('test_frame', df)
```
_output (information about the update)_
```
VersionedItem(symbol=test_frame,library=intro,data=n/a,version=2,metadata=None,host=<host>)
```

Now let's look at the first 4 rows in the symbol:

```python
library.head('test_frame', 4).data # head/tail are similar to the equivalent Pandas operations
```
_output_
```
COL_0 COL_1 COL_2 COL_3 COL_4 COL_5 COL_6 COL_7 ...
2000-01-01 05:00:00 47 49 15 6 22 48 45 22 ...
2000-01-01 07:00:00 46 10 2 49 24 49 8 0 ...
2000-01-01 08:00:00 38 21 2 1 37 6 31 31 ...
2000-01-01 09:00:00 45 17 39 47 47 11 33 31 ...
```

!!! info "ArcticDB Schemas & the Dynamic Schema library option"
Let's unpack how we end up with that result. The update has

* replaced the data in the symbol with the new data where the index matched (in this case the 05:00 and 07:00 rows)
* removed any rows within the date range of the new data that are not in the index of the new data (in this case the 06:00 row)
* kept the rest of the data the same (in this case 08:00 onwards)

Logically, this corresponds to replacing the complete date range of the old data with the new data, which is what you would expect from an update.
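
As an illustration only, the same result can be reconstructed by hand in Pandas: keep the stored rows strictly before and strictly after the date range of the new data, and splice the new data in between. This sketch assumes `df` is still the 2-row update DataFrame from above and uses `as_of=1` to fetch the pre-update state.

```python
import pandas as pd

before_update = library.read('test_frame', as_of=1).data  # state prior to the update

# Rows strictly before the new data's range, then the new data,
# then rows strictly after the new data's range.
reconstructed = pd.concat([
    before_update.loc[:df.index[0] - pd.Timedelta('1ns')],
    df,
    before_update.loc[df.index[-1] + pd.Timedelta('1ns'):],
])
```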

#### Versioning

You might have noticed that `read` calls do not return the data directly, but instead return a `VersionedItem` structure. You may also have noticed that modification operations (`write`, `append` and `update`) increment the version number. ArcticDB versions all modifications, which means you can retrieve earlier versions of data - it is a bitemporal database:

```Python
library.tail('test_frame', 7, as_of=0).data
```

_output_
```
COL_0 COL_1 COL_2 COL_3 COL_4 COL_5 COL_6 COL_7 ...
2000-01-01 23:00:00 16 46 3 45 43 14 10 27 ...
2000-01-02 00:00:00 37 37 20 3 49 38 23 46 ...
2000-01-02 01:00:00 42 47 40 27 49 41 11 26 ...
2000-01-02 02:00:00 46 12 38 47 4 31 1 42 ...
2000-01-02 03:00:00 46 20 5 42 8 35 12 2 ...
2000-01-02 04:00:00 17 48 36 43 6 46 5 8 ...
2000-01-02 05:00:00 20 19 24 44 29 32 2 19 ...
```

Note the timestamps - we've read the data prior to the `append` operation. Please note that you can also pass a `datetime` into any `as_of` argument, which will result in reading the last version earlier than the `datetime` passed.
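
For example, a time-based read might look like the sketch below; the timestamp is arbitrary and only for illustration - use one that is meaningful for your own data.

```python
from datetime import datetime

# Read the most recent version of 'test_frame' created before this point in time.
library.read('test_frame', as_of=datetime(2024, 4, 30)).data
```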

!!! info "ArcticDB index"
!!! note "Versioning, Prune Previous & Snapshots"

When writing Pandas DataFrames, ArcticDB supports the following index types:

* `pandas.Index` containing `int64` or `float64` (or the corresponding dedicated types `Int64Index`, `UInt64Index` and `Float64Index`)
* `RangeIndex` with the restrictions noted below
* `DatetimeIndex`
* `MultiIndex` composed of above supported types
By default, `write`, `append`, and `update` operations will **not** remove the previous versions. Please be aware that this will consume more space.

Currently, ArcticDB allows `append()`-ing to a `RangeIndex` only with a continuing `RangeIndex` (i.e. the appending `RangeIndex.start` == `RangeIndex.stop` of the existing data and they have the same `RangeIndex.step`). If a DataFrame with a non-continuing `RangeIndex` is passed to `append()`, ArcticDB does _not_ convert it `Int64Index` like Pandas and will produce an error.
This behaviour can be can be controlled via the `prune_previous_versions` keyword argument. Space will be saved but the previous versions will then not be available.

The "row" concept in `head()/tail()` refers to the row number, not the value in the `pandas.Index`.
A compromise can be achieved by using snapshots, which allow states of the library to be saved and read back later. This allows certain versions to be protected from deletion, they will be deleted when the snapshot is deleted. See [snapshot documentation](api/library/#arcticdb.version_store.library.Library.snapshot) for details.
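
As a minimal sketch of these options (the snapshot name `demo_snapshot` is purely illustrative):

```python
# Capture the current state of every symbol in the library.
library.snapshot('demo_snapshot')

# Write a new version of 'test_frame' and discard older, unprotected versions to save space.
library.write('test_frame', df, prune_previous_versions=True)

# Versions captured by the snapshot remain readable by name.
library.read('test_frame', as_of='demo_snapshot').data
```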

### Storage Access

#### S3 configuration

!!! info "ArcticDB indexing"
There are two methods to configure S3 access. If you happen to know the access and secret key, simply connect as follows:

```python
import arcticdb as adb
ac = adb.Arctic('s3://ENDPOINT:BUCKET?region=blah&access=ABCD&secret=DCBA')
```

Otherwise, you can delegate authentication to the AWS SDK (obeys standard [AWS configuration options](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html)):

```python
ac = adb.Arctic('s3://ENDPOINT:BUCKET?aws_auth=true')
```

Same as above, but using HTTPS:

```python
ac = adb.Arctic('s3s://ENDPOINT:BUCKET?aws_auth=true')
```

!!! s3 vs s3s

    Use `s3s` if your S3 endpoint uses HTTPS.

!!! info "ArcticDB Filtering Philosphy & Restrictions"
##### Connecting to a defined storage endpoint

Connect to local storage (not AWS - HTTP endpoint of s3.local) with a pre-defined access and secret key:

```python
ac = adb.Arctic('s3://s3.local:arcticdb-test-bucket?access=EFGH&secret=HGFE')
```

##### Connecting to AWS

Connecting to AWS with a pre-defined region:

```python
ac = adb.Arctic('s3s://s3.eu-west-2.amazonaws.com:arcticdb-test-bucket?aws_auth=true')
```

Note that no explicit credential parameters are given. When `aws_auth` is passed, authentication is delegated to the AWS SDK which is responsible for locating the appropriate credentials in the `.config` file or
in environment variables. You can manually configure which profile is being used by setting the `AWS_PROFILE` environment variable as described in the
[AWS Documentation](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html).
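
For example, the profile can be selected from Python before the connection is created (the profile name `research` is hypothetical and assumed to exist in your AWS config):

```python
import os
import arcticdb as adb

# Set before creating the Arctic instance so the AWS SDK picks up the profile.
os.environ['AWS_PROFILE'] = 'research'

ac = adb.Arctic('s3s://s3.eu-west-2.amazonaws.com:arcticdb-test-bucket?aws_auth=true')
```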

#### Using a specific path within a bucket

You may want to restrict access for the ArcticDB library to a specific path within the bucket. To do this, you can use the `path_prefix` parameter:

```python
ac = adb.Arctic('s3s://s3.eu-west-2.amazonaws.com:arcticdb-test-bucket?path_prefix=test&aws_auth=true')
```

#### Azure

ArcticDB uses the [Azure connection string](https://learn.microsoft.com/en-us/azure/storage/common/storage-configure-connection-string) to define the connection:

```python
import arcticdb as adb
ac = adb.Arctic('azure://AccountName=ABCD;AccountKey=EFGH;BlobEndpoint=ENDPOINT;Container=CONTAINER')
```

For example:

```python
import arcticdb as adb
ac = adb.Arctic("azure://CA_cert_path=/etc/ssl/certs/ca-certificates.crt;BlobEndpoint=https://arctic.blob.core.windows.net;Container=acblob;SharedAccessSignature=sp=awd&st=2001-01-01T00:00:00Z&se=2002-01-01T00:00:00Z&spr=https&rf=g&sig=awd%3D")
```

For more information, [see the Arctic class reference](api/arctic.md#arcticdb.Arctic).

#### LMDB

LMDB supports configuring its map size. See its [documentation](http://www.lmdb.tech/doc/group__mdb.html#gaa2506ec8dab3d969b0e609cd82e619e5).

You may need to tweak it on Windows, whereas on Linux the default is much larger and should suffice. This is because Windows allocates physical
space for the map file eagerly, whereas on Linux the map size is an upper bound to the physical space that will be used.

You can set a map size in the connection string:

```python
import arcticdb as adb
ac = adb.Arctic('lmdb://path/to/desired/database?map_size=2GB')
```

The default on Windows is 2GiB. Errors with `lmdb error code -30792` indicate that the map is getting full and that you should increase its size. This will happen if you are doing large writes.

In each Python process, you should ensure that you only have one Arctic instance open over a given LMDB database.

LMDB does not work with remote filesystems.

#### In-memory configuration

An in-memory backend is provided mainly for testing and experimentation. It could be useful when creating files with LMDB is not desired.

There are no configuration parameters, and the memory is owned solely by the Arctic instance.

For example:

```python
import arcticdb as adb
ac = adb.Arctic('mem://')
```

For concurrent access to a local backend, we recommend LMDB connected to tmpfs; see the [LMDB and In-Memory Tutorial](tutorials/lmdb_and_in_memory.md).
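
A minimal sketch, assuming a tmpfs mount at `/dev/shm` (common on Linux) - adjust the path for your system:

```python
import arcticdb as adb

# The LMDB files live on tmpfs, so access is at memory speed while the
# database remains shareable between processes on the same machine.
ac = adb.Arctic('lmdb:///dev/shm/arcticdb_example')
```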

### Transactions

- Transactions can be very useful but are often expensive and slow
- If we unpack ACID: Atomicity, Consistency and Durability are useful, Isolation less so
- Most analytical workflows can be constructed to run without needing transactions at all
- So why pay the cost of transactions when they are often not needed?
- ArcticDB doesn't have transactions because it is designed for high throughput analytical workloads

!!! note "Versioning & Prune Previous"

By default, `write`, `append`, and `update` operations will **not** remove the previous versions. Please be aware that this will consume more space.

This behaviour can be can be controlled via the `prune_previous_versions` keyword argument.
