Commit

update processing
florian-vuillemot committed Jan 27, 2025
1 parent ad273bf commit e3ce912
Showing 4 changed files with 64 additions and 94 deletions.
148 changes: 56 additions & 92 deletions README.md
@@ -1,13 +1,6 @@
# Cloud Shelve
`Cloud Shelve (cshelve)` is a Python package that provides a seamless way to store and manage data in the cloud using the familiar [Python Shelve interface](https://docs.python.org/3/library/shelve.html). It is designed for efficient and scalable storage solutions, allowing you to leverage cloud providers for persistent storage while keeping the simplicity of the `shelve` API.

## Features

- Supports large file storage in the cloud
- Secure data in-transit encryption when using cloud storage
- Fully compatible with Python's `shelve` API
- Cross-platform compatibility for local and remote storage

## Installation

Install `cshelve` via pip:
@@ -18,135 +11,109 @@ pip install cshelve

## Usage

The `cshelve` module strictly follows the official `shelve` API. Consequently, you can refer to the [Python official documentation](https://docs.python.org/3/library/shelve.html) for general usage examples. Simply replace the `shelve` import with `cshelve`, and you're good to go.
The `cshelve` module provides a simple key-value interface for storing data in the cloud.

### Local Storage
### Quick Start Example

Here is an example, adapted from the [official shelve documentation](https://docs.python.org/3/library/shelve.html#example), demonstrating local storage usage. Just replace `shelve` with `cshelve`:
Here is a quick example demonstrating how to store and retrieve data using `cshelve`:

```python
import cshelve

d = cshelve.open('local.db') # Open the local database file

key = 'key'
data = 'data'

d[key] = data # Store data at the key (overwrites existing data)
data = d[key] # Retrieve a copy of data (raises KeyError if not found)
del d[key] # Delete data at the key (raises KeyError if not found)

flag = key in d # Check if the key exists in the database
klist = list(d.keys()) # List all existing keys (could be slow for large datasets)
# Open a local database file
d = cshelve.open('local.db')

# Note: Since writeback=True is not used, handle data carefully:
d['xx'] = [0, 1, 2] # Store a list
d['xx'].append(3) # This won't persist since writeback=True is not used
# Store data
d['my_key'] = 'my_data'

# Correct approach:
temp = d['xx'] # Extract the stored list
temp.append(5) # Modify the list
d['xx'] = temp # Store it back to persist changes
# Retrieve data
print(d['my_key']) # Output: my_data

d.close() # Close the database
# Close the database
d.close()
```
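Because `cshelve` follows standard `shelve` semantics, the usual `writeback` caveat applies: mutating a stored object in place is not persisted unless the value is written back. A minimal sketch using the standard library's `shelve` (the same behaviour applies to `cshelve`):

```python
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo")

with shelve.open(path) as d:
    d["xx"] = [0, 1, 2]
    d["xx"].append(3)             # mutates a temporary copy; not persisted
    assert d["xx"] == [0, 1, 2]

    temp = d["xx"]                # read, modify, then write back
    temp.append(3)
    d["xx"] = temp
    assert d["xx"] == [0, 1, 2, 3]
```

Passing `writeback=True` to `open` avoids the explicit write-back, at the cost of extra memory and a slower `close()`.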

### Debug/test Storage
### Cloud Storage Example (e.g., Azure)

For testing purposes, it is possible to use an in-memory provider that can:
- Persist the data for the duration of the program's execution.
- Remove the data when the database object is deleted.
To configure remote cloud storage, you need to provide an INI file containing your cloud provider's configuration. The file should have a `.ini` extension. Remote storage also requires the installation of optional dependencies for the cloud provider you want to use.

#### Example Azure Blob Configuration

Here is a configuration example:
First, install the Azure Blob Storage provider:
```bash
$ cat in-memory.ini
[default]
provider = in-memory
# If set, opening the same database twice during the program's execution yields the same underlying database.
persist-key = standard
pip install cshelve[azure-blob]
```

A common use case for this provider is to simplify mocking.

Example:
Then, create an INI file with the following configuration:
```bash
$ cat persist.ini
[default]
provider = in-memory
# If set, opening the same database twice during the program's execution yields the same underlying database.
persist-key = my-db

$ cat do-not-persist.ini
$ cat azure-blob.ini
[default]
provider = in-memory
provider = azure-blob
account_url = https://myaccount.blob.core.windows.net
auth_type = passwordless
container_name = mycontainer
```

Once the INI file is ready, you can interact with remote storage the same way as with local storage. Here's an example using Azure:

```python
import cshelve

with cshelve.open('persist.ini') as db:
db["Asterix"] = "Gaulois"
# Open using the remote storage configuration
d = cshelve.open('azure-blob.ini')

with cshelve.open('persist.ini') as db:
assert db["Asterix"] == "Gaulois"
# Store data
d['my_key'] = 'my_data'

with cshelve.open('do-not-persist.ini') as db:
db["Obelix"] = "Gaulois"
# Retrieve data
print(d['my_key']) # Output: my_data

with cshelve.open('do-not-persist.ini') as db:
assert "Obelix" not in db
# Close the connection to the remote storage
d.close()
```

### Remote Storage (e.g., Azure)
### Advanced Scenario: Storing DataFrames in the Cloud

To configure remote cloud storage, you need to provide an INI file containing your cloud provider's configuration. The file should have a `.ini` extension. Remote storage also requires the installation of optional dependencies for the cloud provider you want to use.
In this advanced example, we will demonstrate how to store and retrieve a Pandas DataFrame using `cshelve` with Azure Blob Storage.

#### Example Azure Blob Configuration

First, install the Azure Blob Storage provider:
First, install the required dependencies:
```bash
pip install cshelve[azure-blob]
pip install cshelve[azure-blob] pandas
```

Then, create an INI file with the following configuration:
Create an INI file with the Azure Blob Storage configuration:
```bash
$ cat azure-blob.ini
[default]
provider = azure-blob
account_url = https://myaccount.blob.core.windows.net
# Note: The auth_type can be access_key, passwordless, connection_string, or anonymous.
# The passwordless authentication method is recommended, but the Azure CLI must be installed (https://learn.microsoft.com/en-us/cli/azure/install-azure-cli).
auth_type = passwordless
container_name = mycontainer
```

Once the INI file is ready, you can interact with remote storage the same way as with local storage. Here's an example using Azure:
Here's the code to store and retrieve a DataFrame:

```python
import cshelve
import pandas as pd

d = cshelve.open('azure-blob.ini') # Open using the remote storage configuration

key = 'key'
data = 'data'

d[key] = data # Store data at the key on the remote storage
data = d[key] # Retrieve the data from the remote storage
del d[key] # Delete the data from the remote storage
# Create a sample DataFrame
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, 30, 35],
'city': ['New York', 'Los Angeles', 'Chicago']
})

flag = key in d # Check if the key exists in the cloud storage
klist = list(d.keys()) # List all keys present in the remote storage
# Open the remote storage using the Azure Blob configuration
with cshelve.open('azure-blob.ini') as db:
# Store the DataFrame
db['my_dataframe'] = df

# Note: Since writeback=True is not used, handle data carefully:
d['xx'] = [0, 1, 2] # Store a list on the remote storage
d['xx'].append(3) # This won't persist since writeback=True is not used
# Retrieve the DataFrame
with cshelve.open('azure-blob.ini') as db:
retrieved_df = db['my_dataframe']

# Correct approach:
temp = d['xx'] # Extract the stored list from the remote storage
temp.append(5) # Modify the list locally
d['xx'] = temp # Store it back on the remote storage to persist changes

d.close() # Close the connection to the remote storage
print(retrieved_df)
```

More configuration examples for other cloud providers can be found [here](./tests/configurations/).
@@ -158,14 +125,13 @@ More configuration examples for other cloud providers can be found [here](./test
Provider: `in-memory`
Installation: No additional installation required.

The In Memory provider uses an in-memory data structure to simulate storage. This is useful for testing and development purposes.
The In-Memory provider uses an in-memory data structure to simulate storage. This is useful for testing and development purposes.

| Option | Description | Required | Default Value |
|----------------|------------------------------------------------------------------------------|----------|---------------|
| `persist-key`  | If set, the data stored under this key is kept and reused across opens during the program's execution. | :x: | None |
| `exists` | If True, the database exists; otherwise, it will be created. | :x: | False |


#### Azure Blob

Provider: `azure-blob`
@@ -189,7 +155,6 @@ Depending on the `open` flag, the permissions required by `cshelve` for blob sto
| `c` | Open a blob storage container for reading and writing, creating it if it doesn't exist. | [Storage Blob Data Contributor](https://learn.microsoft.com/en-us/azure/role-based-access-control/built-in-roles#storage-blob-data-contributor) |
| `n` | Purge the blob storage container before using it. | [Storage Blob Data Contributor](https://learn.microsoft.com/en-us/azure/role-based-access-control/built-in-roles#storage-blob-data-contributor) |
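The `open` flags behave like those of the standard library's `shelve`/`dbm`; a quick local sketch of their semantics, which is what the permission mapping above mirrors:

```python
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), "flags-demo")

with shelve.open(path, flag="c") as db:   # "c": read/write, create if missing
    db["k"] = 1

with shelve.open(path, flag="r") as db:   # "r": read-only access
    assert db["k"] == 1

with shelve.open(path, flag="n") as db:   # "n": always start from an empty database
    assert "k" not in db
```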


Authentication type supported:

| Auth Type | Description | Advantage | Disadvantage | Example Configuration |
@@ -199,7 +164,6 @@ Authentication type supported:
| Connection String | Uses a connection string for authentication. Credentials are provided directly in the string. | Fast startup as no additional credential retrieval is needed. | Credentials need to be securely managed and provided. | [Example](./tests/configurations/azure-integration/connection-string.ini) |
| Passwordless | Uses passwordless authentication methods such as Managed Identity. | Recommended for better security and easier credential management. | May impact startup time due to the need to retrieve authentication credentials. | [Example](./tests/configurations/azure-integration/standard.ini) |


## Contributing

We welcome contributions from the community! Have a look at our [issues](https://github.com/Standard-Cloud/cshelve/issues).
@@ -210,4 +174,4 @@ This project is licensed under the MIT License. See the [LICENSE](LICENSE) file

## Contact

If you have any questions, issues, or feedback, feel free to [open an issue]https://github.com/Standard-Cloud/cshelve/issues).
If you have any questions, issues, or feedback, feel free to [open an issue](https://github.com/Standard-Cloud/cshelve/issues).
7 changes: 6 additions & 1 deletion cshelve/__init__.py
@@ -53,6 +53,11 @@
]


# CShelve uses the following pickle protocol instead of the default one used by shelve to support
# very large objects and improve performance (https://docs.python.org/3/library/pickle.html#data-stream-format).
DEFAULT_PICKLE_PROTOCOL = 5


class CloudShelf(shelve.Shelf):
"""
A cloud shelf is a shelf that is stored in the cloud. It is a subclass of `shelve.Shelf` and is used to store data in the cloud.
@@ -97,7 +102,7 @@ def __init__(
def open(
filename,
flag="c",
protocol=None,
protocol=DEFAULT_PICKLE_PROTOCOL,
writeback=False,
config_loader=_config_loader,
factory=_factory,
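The protocol change above can be illustrated with the standard library alone. Protocol 5 (PEP 574) is what `cshelve` now forwards to `shelve.Shelf` by default; its main benefit is out-of-band buffer handling for very large objects:

```python
import pickle

payload = {"rows": list(range(1000)), "blob": b"\x00" * 4096}

# Round-trip with protocol 5, the new cshelve default.
dumped = pickle.dumps(payload, protocol=5)
assert pickle.loads(dumped) == payload

# Protocol 5 can move large buffers out of the pickle stream (PEP 574),
# avoiding an extra copy of the data.
buf = pickle.PickleBuffer(bytearray(b"\x01" * 4096))
buffers = []
stream = pickle.dumps(buf, protocol=5, buffer_callback=buffers.append)
assert len(buffers) == 1                  # the 4 KiB payload travelled out-of-band
assert bytes(pickle.loads(stream, buffers=buffers)) == b"\x01" * 4096
```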
2 changes: 1 addition & 1 deletion tests/end-to-end/test_building_data_processing.py
@@ -31,7 +31,7 @@ def test_encryption():
"""
Ensure the data is encrypted.
"""
wrapper_size = 17 # Database Record + Data Processing Metadata
wrapper_size = 10 # Database Record + Data Processing Metadata
standard_configuration = "tests/configurations/in-memory/not-persisted.ini"
encryption_configuration = "tests/configurations/in-memory/encryption.ini"
key_pattern = unique_key + "test_encryption"
1 change: 1 addition & 0 deletions tests/end-to-end/test_large.py
@@ -46,6 +46,7 @@ def test_large(config_file):
assert new_df.equals(df)


@pytest.mark.azure
@pytest.mark.parametrize(
"config_file",
CONFIG_FILES,
