# Updating rest of the notebooks in the docs #1383

Merged 2 commits on Nov 24, 2023.
**docs/hr/content/use_cases/mnist_torch.md** (133 changes: 85 additions & 48 deletions)

# MNIST in Database

## Training and Managing MNIST Predictions in SuperDuperDB

This notebook guides you through the implementation of a classic machine learning task: MNIST handwritten digit recognition. The twist? We perform the task directly in a database using SuperDuperDB.

This example makes it easy to connect any of your image recognition models directly to your database in real time. With SuperDuperDB, you can skip complicated MLOps pipelines. It's a new, straightforward way to integrate your AI model with your data, ensuring simplicity, efficiency, and speed.

## Prerequisites

Before diving into the implementation, ensure that you have the necessary libraries installed by running the following commands:

```python
!pip install superduperdb
!pip install torch torchvision matplotlib
```

## Connect to datastore

First, we need to establish a connection to a MongoDB datastore via SuperDuperDB. You can configure the `MONGODB_URI` based on your specific setup.

Here are some examples of MongoDB URIs:

- For testing (default connection): `mongomock://test`
- Local MongoDB instance: `mongodb://localhost:27017`
- MongoDB with authentication: `mongodb://superduper:superduper@mongodb:27017/documents`
- MongoDB Atlas: `mongodb+srv://<username>:<password>@<atlas_cluster>/<database>`
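
If you are running against a real MongoDB instance, you can set the environment variable before connecting — a minimal sketch, assuming a local instance on the default port (the default `mongomock://test` connection needs no setup):

```python
import os

# Optional: point the notebook at a local MongoDB instance.
# The default `mongomock://test` in-memory connection needs no setup.
os.environ["MONGODB_URI"] = "mongodb://localhost:27017"
```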

```python
from superduperdb import superduper
from superduperdb.backends.mongodb import Collection
import os

mongodb_uri = os.getenv("MONGODB_URI", "mongomock://test")

# SuperDuperDB now handles your MongoDB database
# It just super dupers your database
db = superduper(mongodb_uri)

# Create a collection for MNIST
mnist_collection = Collection('mnist')
```


## Load Dataset

After establishing a connection to MongoDB, the next step is to load the MNIST dataset. SuperDuperDB's strength lies in handling diverse data types, especially those that are challenging. To achieve this, we use an `Encoder` in conjunction with `Document` wrappers. These components allow Python dictionaries containing non-JSONable or bytes objects to be seamlessly inserted into the underlying data infrastructure.

```python
import torchvision
from superduperdb import Document
from superduperdb.ext.pillow import pil_image  # import paths per the v0.1 examples; may differ in other versions
from superduperdb.backends.mongodb import Collection
import random

# Load MNIST images as Python objects using the Python Imaging Library.
# Each MNIST item is a tuple (image, label)
mnist_data = list(torchvision.datasets.MNIST(root='./data', download=True))

# Create a list of Document instances from the MNIST data
# Each Document has an 'img' field (encoded using the Pillow library) and a 'class' field
document_list = [Document({'img': pil_image(x[0]), 'class': x[1]}) for x in mnist_data]

# Shuffle the data and select a subset of 1000 documents
random.shuffle(document_list)
data = document_list[:1000]

# Insert the selected data into 'mnist_collection', created above with Collection('mnist')
db.execute(
    mnist_collection.insert_many(data[:-100]),  # Insert all but the last 100 documents
    encoders=(pil_image,)  # Encode images using the Pillow library
)
```

Now that the images and their classes are inserted into the database, we can query the data in its original format. In particular, we can use the `PIL.Image` instances to inspect the data.


```python
# Get and display one of the images
r = db.execute(mnist_collection.find_one())
# Unpack the document and display the image
r.unpack()['img']
```

## Build Model

Next, we build our machine learning model. SuperDuperDB conveniently supports various frameworks, and for this example, we opt for PyTorch, a suitable choice for computer vision tasks. In this instance, we combine `torch` with `torchvision`.

To facilitate communication with the SuperDuperDB `Datalayer`, we design `postprocess` and `preprocess` functions. These functions, together with the model, are then wrapped into a native SuperDuperDB handler.

```python
import torch

# Define the LeNet-5 architecture for image classification
class LeNet5(torch.nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        # Layer 1: convolution + batch norm + ReLU + max pooling
        self.layer1 = torch.nn.Sequential(
            torch.nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=0),
            torch.nn.BatchNorm2d(6),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(kernel_size=2, stride=2))
        # Layer 2: convolution + batch norm + ReLU + max pooling
        self.layer2 = torch.nn.Sequential(
            torch.nn.Conv2d(6, 16, kernel_size=5, stride=1, padding=0),
            torch.nn.BatchNorm2d(16),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(kernel_size=2, stride=2))
        # Fully connected layers
        self.fc = torch.nn.Linear(400, 120)
        self.relu = torch.nn.ReLU()
        self.fc1 = torch.nn.Linear(120, 84)
        self.relu1 = torch.nn.ReLU()
        self.fc2 = torch.nn.Linear(84, num_classes)

    # Forward pass through the network
    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        out = self.relu(out)
        out = self.fc1(out)
        out = self.relu1(out)
        out = self.fc2(out)
        return out


# Postprocess function: convert the network output to a predicted class label
def postprocess(x):
    return int(x.topk(1)[1].item())


# Preprocess function: resize, convert to tensor, and normalize input images
def preprocess(x):
    return torchvision.transforms.Compose([
        torchvision.transforms.Resize((32, 32)),
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize(mean=(0.1307,), std=(0.3081,)),
    ])(x)

# Create an instance of the LeNet-5 model
lenet_model = LeNet5(10)

# Wrap the LeNet-5 instance with SuperDuperDB, attaching the preprocess and postprocess functions
# 'preferred_devices' is set to ('cpu',) to indicate a CPU preference
model = superduper(lenet_model, preprocess=preprocess, postprocess=postprocess, preferred_devices=('cpu',))

# Register the wrapped model in the database
db.add(model)
```

## Train Model

Now we are ready to "train" or "fit" the model. Trainable models in SuperDuperDB come with a sklearn-like `.fit` method.

```python
from torch.nn.functional import cross_entropy
# These import paths follow the SuperDuperDB v0.1 examples and may differ in other versions
from superduperdb import Metric, Dataset
from superduperdb.ext.torch import TorchTrainerConfiguration

# Fit the model to the training data; model.fit returns a training job
job = model.fit(
    X='img',    # Feature matrix used as input data
    y='class',  # Target variable for training
    db=db,      # Database used for data retrieval
    select=mnist_collection.find(),  # Select the dataset from the 'mnist' collection
    configuration=TorchTrainerConfiguration(
        identifier='my_configuration',     # Unique identifier for the training configuration
        objective=cross_entropy,           # The objective function (cross-entropy in this case)
        loader_kwargs={'batch_size': 10},  # DataLoader keyword arguments; batch size is set to 10
        max_iterations=10,                 # Maximum number of training iterations
        validation_interval=5,             # Interval for validation during training
    ),
    # A custom accuracy metric for evaluation during training
    metrics=[Metric(identifier='acc', object=lambda x, y: sum([xx == yy for xx, yy in zip(x, y)]) / len(x))],
    validation_sets=[
        # Define a validation dataset using the subset of data with '_fold' equal to 'valid'
        Dataset(
            identifier='my_valid',
            select=Collection('mnist').find({'_fold': 'valid'}),
        )
    ],
    distributed=False,  # Set to True if distributed training is enabled
)
```

## Monitoring Training Efficiency

You can monitor the training efficiency with visualization tools like Matplotlib:

```python
from matplotlib import pyplot as plt
# Plot the validation accuracy recorded during training
plt.plot(model.metric_values['my_valid/acc'])
plt.show()
```


## On-the-fly Predictions

After training the model, you can use it to continuously predict on new data as it arrives. By activating a `listener` for the database, the model makes predictions on incoming data changes without loading all the data client-side. The `listen` toggle triggers the model to predict based on updates in the incoming data.

```python
model.predict(
    X='img',                         # Input field to predict from
    db=db,
    select=mnist_collection.find(),  # Monitor this query for incoming data
    listen=True,                     # Activate the listener; argument names assume the v0.1 API
)
```

We can see that predictions are available in `_outputs.img.lenet5`.


```python
# Execute find_one() to retrieve a single document from the 'mnist_collection'.
r = db.execute(mnist_collection.find_one({'_fold': 'valid'}))

# Unpack the document and extract its content
r.unpack()
```
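
To read the prediction itself, index into the `_outputs` field of the unpacked document — a quick sketch, assuming the `_outputs.img.lenet5` key layout mentioned above:

```python
# Extract the predicted digit; the nested keys follow the
# `_outputs.img.lenet5` path described above (assumed layout).
prediction = r.unpack()['_outputs']['img']['lenet5']
print(f"Predicted digit: {prediction}")
```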

## Verification

The models "activated" can be seen here:


```python
# Show the status of the listener
db.show('listener')
```
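
Other registered components can be listed the same way — a quick sketch, assuming `db.show` accepts a component-type string as in the call above:

```python
# List the models registered with the Datalayer,
# mirroring the 'listener' query above
db.show('model')
```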

We can verify that the model is activated by inserting the rest of the data:

```python
# Iterate over the last 100 elements in the 'data' list
for r in data[-100:]:
    # Update the 'update' field to True for each document
    r['update'] = True

# Insert the updated documents (with 'update' set to True) into the 'mnist_collection'
db.execute(mnist_collection.insert_many(data[-100:]))
```

You can see that the inserted data are now also populated with predictions:

```python
# Retrieve the '_outputs' of a sample document from 'mnist_collection'
# where the 'update' field is True
sample_outputs = db.execute(mnist_collection.find_one({'update': True}))['_outputs']

# Display the predictions attached to the sample document
print(sample_outputs)
```
**docs/hr/docusaurus.config.js** (13 changes: 11 additions & 2 deletions)

```js
// @ts-check
// Note: type annotations allow type checking and IDEs autocompletion
const lightCodeTheme = require('prism-react-renderer').themes.github;
const darkCodeTheme = require('prism-react-renderer').themes.vsDark;

/** @type {import('@docusaurus/types').Config} */
const config = {
// ...
        routeBasePath: 'docs',
        path: 'content',
        sidebarPath: require.resolve('./sidebars.js'),
        // sidebarCollapsible: true,
        // Please change this to your repo.
        // Remove this to remove the "edit this page" links.
        editUrl:
// ...
        },
        {
          label: 'Use cases',
          to: '/docs/use_cases',
        },
        {
          label: 'Blog',
// ...
          content: 'https://docs.superduperdb.com/img/superDuperDB_img.png',
        },
      ],
      announcementBar: {
        id: 'support_us',
        content:
          '🔮 We are officially launching SuperDuperDB with the release of v0.1 on December 5th on Github! 🔮',
        backgroundColor: '#7628f8',
        textColor: '#fff',
        isCloseable: true,
      },
    }),
};

```
**docs/hr/sidebars.js** (2 changes: 1 addition & 1 deletion)

```js
const sidebars = {
// ...
      type: 'generated-index',
      description:
        'Common and useful use-cases implemented in SuperDuperDB with a walkthrough',
      // slug: 'use-cases',
    },
  },
  // {
// ...
```
**docs/hr/src/css/custom.css** (8 changes: 7 additions & 1 deletion)

```css
/* ... */
  --docusaurus-highlighted-code-line-bg: rgba(0, 0, 0, 0.1);
  --aa-primary-color-rgb: #7628f8 !important;
  --aa-muted-color-rgb: #7628f8 !important;
  /* --ifm-code-background: #f5f5f5; */
  --prism-background-color: #f5f5f5 !important;
}

style attribute {
  --prism-background-color: #f5f5f5;
}

/* For readability concerns, you should choose a lighter palette in dark mode. */
/* ... */

main-wrapper {
  /* ... */
}

pre code {
  background-color: var(--ifm-pre-background);
}
```