feat: 0.6 and copy improvements (#11)
timonv authored Jul 12, 2024
1 parent 61c4d27 commit e81f832
Showing 15 changed files with 67 additions and 51 deletions.
8 changes: 8 additions & 0 deletions .markdownlint.yaml
@@ -0,0 +1,8 @@
# configuration for https://github.com/DavidAnson/markdownlint

first-line-heading: false
no-inline-html: false
line-length: false

# to support repeated headers in the changelog
no-duplicate-heading: false
2 changes: 1 addition & 1 deletion src/assets/rag-dark.svg
10 changes: 5 additions & 5 deletions src/content/docs/examples/hello-world.md
@@ -1,11 +1,11 @@
---
title: Hello World
description: A simple example of an ingestion pipeline
description: A simple example of an indexing pipeline
---

## Ingesting code into Qdrant
## Indexing code with Qdrant

This example demonstrates how to ingest the Swiftide codebase itself.
This example demonstrates how to index the Swiftide codebase itself.
Note that for it to work correctly, you need to have OPENAI_API_KEY set, and Redis and Qdrant
running.

@@ -25,7 +25,7 @@ with lots of small chunks, consider the rate limits of the API.
```rust

use swiftide::{
ingestion,
indexing,
integrations::{self, qdrant::Qdrant, redis::Redis},
loaders::FileLoader,
transformers::{ChunkCode, Embed, MetadataQACode},
@@ -50,7 +50,7 @@ async fn main() -> Result<(), Box<dyn std::error::Error>> {
.unwrap_or("http://localhost:6334")
.to_owned();

ingestion::IngestionPipeline::from_loader(FileLoader::new(".").with_extensions(&["rs"]))
indexing::Pipeline::from_loader(FileLoader::new(".").with_extensions(&["rs"]))
.filter_cached(Redis::try_from_url(redis_url, "swiftide-examples")?)
.then(MetadataQACode::new(openai_client.clone()))
.then_chunk(ChunkCode::try_for_language_and_chunk_size(
```
14 changes: 7 additions & 7 deletions src/content/docs/getting-started/architecture-and-design.mdx
@@ -18,20 +18,20 @@ description: The architecture and design principles of the Swiftide project.

## The-things-we-talk-about

- **IngestionPipeline**: The main struct that holds the pipeline. It is a stream of IngestionNodes.
- **IngestionNode**: The main struct that holds the data. It has a path, chunk and metadata.
- **IngestionStream**: The internal stream of IngestionNodes in the pipeline.
- **Loader**: The starting point of the stream, creates and emits IngestionNodes.
- **Transformers**: Some behaviour that modifies the IngestionNodes.
- **Pipeline**: The main struct that holds the pipeline. It is a stream of Nodes.
- **Node**: The main struct that holds the data. It has a path, chunk and metadata.
- **IndexingStream**: The internal stream of Nodes in the pipeline.
- **Loader**: The starting point of the stream, creates and emits Nodes.
- **Transformers**: Some behaviour that modifies the Nodes.
- **BatchTransformers**: Transformers that transform multiple nodes.
- **Chunkers**: Transformers that split a node into multiple nodes.
- **Storages**: Persist the IngestionNodes.
- **Storages**: Persist the Nodes.
- **NodeCache**: Filters cached nodes.
- **Integrations**: External libraries that can be used with the pipeline.

### Pipeline structure and traits

- from_loader (impl Loader) starting point of the stream, creates and emits IngestionNodes
- from_loader (impl Loader) starting point of the stream, creates and emits Nodes
- filter_cached (impl NodeCache) filters cached nodes
- then (impl Transformer) transforms the node and puts it on the stream
- then_in_batch (impl BatchTransformer) transforms multiple nodes and puts them on the stream
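
Taken together, a hedged sketch of how these chain (constructors mirror the hello-world example elsewhere in this commit; `redis_url` and `openai_client` are assumed to be set up as there):

```rust
// A sketch composing the traits listed above, not a verbatim API reference.
// `redis_url` and `openai_client` are assumed to be configured already.
indexing::Pipeline::from_loader(FileLoader::new(".").with_extensions(&["rs"]))
    .filter_cached(Redis::try_from_url(redis_url, "swiftide-examples")?)
    .then(MetadataQACode::new(openai_client.clone()))
    .then_in_batch(10, Embed::new(openai_client.clone()))
    .run()
    .await?;
```
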
4 changes: 2 additions & 2 deletions src/content/docs/in-depth/caching-and-filtering.md
@@ -13,8 +13,8 @@ Which is defined as follows:

```rust
pub trait NodeCache: Send + Sync + Debug {
    async fn get(&self, node: &IngestionNode) -> bool;
    async fn set(&self, node: &IngestionNode);
    async fn get(&self, node: &Node) -> bool;
    async fn set(&self, node: &Node);
}
```
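
To make the trait concrete, a minimal sketch of an in-memory cache — `MemoryCache` is hypothetical, and the `async_trait` attribute and the `Node` fields (`path`, `chunk`) are assumptions for illustration:

```rust
use std::collections::HashSet;
use std::sync::Mutex;

use async_trait::async_trait;

// Hypothetical in-memory NodeCache. Assumes `Node` exposes `path` and
// `chunk`, and that implementations use the `async_trait` crate.
#[derive(Debug, Default)]
struct MemoryCache {
    seen: Mutex<HashSet<String>>,
}

#[async_trait]
impl NodeCache for MemoryCache {
    // True means "seen before", so the pipeline filters the node out.
    async fn get(&self, node: &Node) -> bool {
        let key = format!("{}:{}", node.path.display(), node.chunk);
        self.seen.lock().unwrap().contains(&key)
    }

    // Mark the node as processed.
    async fn set(&self, node: &Node) {
        let key = format!("{}:{}", node.path.display(), node.chunk);
        self.seen.lock().unwrap().insert(key);
    }
}
```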

2 changes: 1 addition & 1 deletion src/content/docs/in-depth/chunking.md
@@ -13,7 +13,7 @@ Which is defined as follows:

```rust
pub trait ChunkerTransformer: Send + Sync + Debug {
    async fn transform_node(&self, node: IngestionNode) -> IngestionStream;
    async fn transform_node(&self, node: Node) -> IndexingStream;

    /// Overrides the default concurrency of the pipeline
    fn concurrency(&self) -> Option<usize> {
        None
    }
}
```
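
As a hedged sketch of an implementation — `ParagraphChunker` is hypothetical, and it assumes `async_trait`, `anyhow::Result`, a cloneable `Node` with a `chunk` field, and the `Vec<Result<Node>>` conversion shown in the streaming chapter:

```rust
use anyhow::Result;
use async_trait::async_trait;

// Hypothetical chunker emitting one node per blank-line-separated
// paragraph; path and metadata are carried over via `clone`.
#[derive(Debug)]
struct ParagraphChunker;

#[async_trait]
impl ChunkerTransformer for ParagraphChunker {
    async fn transform_node(&self, node: Node) -> IndexingStream {
        let chunks: Vec<Result<Node>> = node
            .chunk
            .split("\n\n")
            .map(|paragraph| {
                let mut child = node.clone();
                child.chunk = paragraph.to_string();
                Ok(child)
            })
            .collect();
        chunks.into()
    }
}
```
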
20 changes: 10 additions & 10 deletions src/content/docs/in-depth/introducing-step-by-step.mdx
@@ -1,11 +1,11 @@
---
title: Step-by-step Introduction
description: A step-by-step introduction on how to use swiftide as a data ingestion pipeline in your project.
description: A step-by-step introduction on how to use swiftide as a data indexing pipeline in your project.
sidebar:
order: 0
---

Swiftide provides a pipeline model. Troughout a pipeline, `IngestionNodes` are transformed and ultimately persisted. Every step with a pipeline returns the same, _owned_ pipeline.
Swiftide provides a pipeline model. Throughout a pipeline, `Nodes` are transformed and ultimately persisted. Every step in a pipeline returns the same, _owned_ pipeline.

import { Steps } from "@astrojs/starlight/components";

@@ -16,18 +16,18 @@ import { Steps } from "@astrojs/starlight/components";
1. The pipeline starts with a loader:

```rust
let pipeline = IngestionPipeline::from_loader(FileLoader::new("./"));
let pipeline = Pipeline::from_loader(FileLoader::new("./"));
```

A loader implements the `Loader` trait which yields `IngestionNodes` to a stream.
A loader implements the `Loader` trait which yields `Nodes` to a stream.

2. Nodes can then be transformed with an existing transformer:

```rust
pipeline.then(MetadataQACode::new(openai_client.clone()));
```

Any transformer has to implement the `Transformer` trait, which takes an owned `IngestionNode` and outputs a `Result<IngestionNode>`. Closures also implement this trait!
Any transformer has to implement the `Transformer` trait, which takes an owned `Node` and outputs a `Result<Node>`. Closures also implement this trait!

3. ... so you can also do this:

@@ -44,9 +44,9 @@ import { Steps } from "@astrojs/starlight/components";
pipeline.then_in_batch(10, Embed::new(FastEmbed::try_default()?));
```

Batchable transformers implement the `BatchableTransformer` trait, which takes a vector of `IngestionNodes` and outputs an `IngestionStream`.
Batchable transformers implement the `BatchableTransformer` trait, which takes a vector of `Nodes` and outputs an `IndexingStream`.

5. Nodes can be filtered using a NodeCache at any stage, based on a cache key the node cache defines. Redis uses a prefix and the hash of an `IngestionNode`, based on the path and text, by default.
5. Nodes can be filtered using a NodeCache at any stage, based on a cache key the node cache defines. Redis uses a prefix and the hash of a `Node`, based on the path and text, by default.

```rust
pipeline.filter_cached(Redis::try_from_url(
@@ -55,7 +55,7 @@ import { Steps } from "@astrojs/starlight/components";
)?);
```

Node caches implement the `NodeCache` trait, which defines a `get` and `set` method, taking an `IngestionNode` as input.
Node caches implement the `NodeCache` trait, which defines a `get` and `set` method, taking a `Node` as input.

6. At any point in the pipeline, nodes can be chunked into smaller parts:

@@ -66,7 +66,7 @@ import { Steps } from "@astrojs/starlight/components";
)?);
```

Chunkers implement the ChunkerTransformer trait, which take an `IngestionNode` and return an `IngestionStream`. By default metadata is copied over to each node.
Chunkers implement the `ChunkerTransformer` trait, which takes a `Node` and returns an `IndexingStream`. By default, metadata is copied over to each node.

7. Also, nodes can be persisted (multiple times!) to storage:

@@ -80,7 +80,7 @@ import { Steps } from "@astrojs/starlight/components";
)
```

Storages implement the `Storage` trait, which define `setup`, `store`, `batch_store` and `batch_size` methods. They also provide ways to convert an `IngestionNode` to something that can be stored.
Storages implement the `Storage` trait, which defines `setup`, `store`, `batch_store` and `batch_size` methods. They also provide ways to convert a `Node` to something that can be stored.

8. Finally, the pipeline can be run as follows:

4 changes: 2 additions & 2 deletions src/content/docs/in-depth/loading-data.md
@@ -13,11 +13,11 @@ Which is defined as follows:

```rust
pub trait Loader {
    fn into_stream(self) -> IngestionStream;
    fn into_stream(self) -> IndexingStream;
}
```

Or in human language: "I can be turned into a stream". The assumption under the hood is that Loaders will yield the data they load as a stream of `IngestionNodes`. These can be files, messages, webpages and so on.
Or in human language: "I can be turned into a stream". The assumption under the hood is that Loaders will yield the data they load as a stream of `Nodes`. These can be files, messages, webpages and so on.
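
For instance, a minimal sketch of a loader that streams a list of strings — `StringLoader` is hypothetical and assumes `Node::default()` plus the `IndexingStream::iter` constructor shown in the streaming chapter:

```rust
// Hypothetical loader yielding one Node per string; every other field
// keeps its default value.
struct StringLoader {
    items: Vec<String>,
}

impl Loader for StringLoader {
    fn into_stream(self) -> IndexingStream {
        IndexingStream::iter(self.items.into_iter().map(|text| {
            let mut node = Node::default();
            node.chunk = text;
            Ok(node)
        }))
    }
}
```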

## Built in loaders

4 changes: 2 additions & 2 deletions src/content/docs/in-depth/storing-results.md
@@ -14,8 +14,8 @@ Which is defined as follows:
```rust
pub trait Persist: Debug + Send + Sync {
    async fn setup(&self) -> Result<()>;
    async fn store(&self, node: IngestionNode) -> Result<IngestionNode>;
    async fn batch_store(&self, nodes: Vec<IngestionNode>) -> IngestionStream;
    async fn store(&self, node: Node) -> Result<Node>;
    async fn batch_store(&self, nodes: Vec<Node>) -> IndexingStream;
    fn batch_size(&self) -> Option<usize> {
        None
    }
}
```
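
As a hedged sketch, a persister that only logs what it stores — `DebugStorage` is hypothetical and assumes `async_trait`, `anyhow::Result`, and the `Vec<Result<Node>>` stream conversion from the streaming chapter:

```rust
use anyhow::Result;
use async_trait::async_trait;

// Hypothetical storage backend that prints instead of persisting; the
// shape matches what a database-backed implementation would need.
#[derive(Debug)]
struct DebugStorage;

#[async_trait]
impl Persist for DebugStorage {
    async fn setup(&self) -> Result<()> {
        // Create tables, collections, or indices here.
        Ok(())
    }

    async fn store(&self, node: Node) -> Result<Node> {
        println!("storing a chunk of {} bytes", node.chunk.len());
        Ok(node)
    }

    async fn batch_store(&self, nodes: Vec<Node>) -> IndexingStream {
        let mut results = Vec::with_capacity(nodes.len());
        for node in nodes {
            results.push(self.store(node).await);
        }
        results.into()
    }
}
```
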
22 changes: 11 additions & 11 deletions src/content/docs/in-depth/streaming-and-concurrency.mdx
@@ -1,11 +1,11 @@
---
title: Streaming and Concurrency
description: How the ingestion pipeline handles streaming and concurrency.
description: How the indexing pipeline handles streaming and concurrency.
sidebar:
order: 6
---

The ingestion pipeline is streaming, asynchronous, unordered and concurrent.
The indexing pipeline is streaming, asynchronous, unordered and concurrent.

## Concurrency

@@ -30,30 +30,30 @@ import { Aside } from "@astrojs/starlight/components";
(short exponential) by default.
</Aside>

## Ingestion Stream
## Indexing Stream

You might have seen the `IngestionStream` type mentioned a few times. It is the internal stream that is being passed around, build on top of the Rust `Stream` and `StreamExt`. By wrapping it we have more control and less boilerplate when dealing with streams.
You might have seen the `IndexingStream` type mentioned a few times. It is the internal stream that is being passed around, built on top of the Rust `Stream` and `StreamExt`. By wrapping it, we have more control and less boilerplate when dealing with streams.

When building batch transformers, storage or chunkers, you will need to return a `IngestionStream`. We've tried to make that as easy as possible and there are multiple ways.
When building batch transformers, storage or chunkers, you will need to return an `IndexingStream`. We've tried to make that as easy as possible, and there are multiple ways.

### Using `Into`

From a list of `IngestionNodes` using `Into`:
From a list of `Nodes` using `Into`:

```rust
let nodes: Vec<Result<IngestionNode>>> = vec![Ok(IngestionNode::default())];
let stream: IngestionStream = nodes.into();
let nodes: Vec<Result<Node>> = vec![Ok(Node::default())];
let stream: IndexingStream = nodes.into();
```

There is also an implementation of `Into` for Rust streams.

### Converting an iterator

You can also convert an `Iterator` into an `IngestionStream` directly. This is great, as the iterator itself will stream it's results, instead of having to collect it first.
You can also convert an `Iterator` into an `IndexingStream` directly. This is great, as the iterator itself will stream its results, instead of having to collect them first.

```rust
let nodes: Vec<Result<IngestionNode>>> = vec![IngestionNode::default()];
let stream: IngestionStream = IngestionStream::iter(nodes.into_iter().map(|node| {
let nodes: Vec<Node> = vec![Node::default()];
let stream: IndexingStream = IndexingStream::iter(nodes.into_iter().map(|mut node| {
    node.metadata.insert("foo".to_string(), "bar".to_string());
    Ok(node)
}));
```
4 changes: 2 additions & 2 deletions src/content/docs/in-depth/transforming-and-enriching.mdx
@@ -5,7 +5,7 @@ sidebar:
order: 1
---

Transformers are the bread and butter of an ingestion pipeline. They can transform the chunk, extract, modify and add metadata, adding vectors, and probably a whole lot more that we haven't thought of.
Transformers are the bread and butter of an indexing pipeline. They can transform the chunk; extract, modify, and add metadata; add vectors; and probably a whole lot more that we haven't thought of.

There are two ways to apply a transformer: per node or in batch.

@@ -15,7 +15,7 @@ The `Transformer` trait is very straightforward:

```rust
pub trait Transformer: Send + Sync {
    async fn transform_node(&self, node: IngestionNode) -> Result<IngestionNode>;
    async fn transform_node(&self, node: Node) -> Result<Node>;

    fn concurrency(&self) -> Option<usize> {
        None
    }
}
```
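
A hedged sketch of an implementation — `TagLanguage` is hypothetical and assumes `async_trait`, `anyhow::Result`, and the string-keyed `metadata` map used in the streaming chapter:

```rust
use anyhow::Result;
use async_trait::async_trait;

// Hypothetical transformer that tags every node with a language hint.
#[derive(Debug)]
struct TagLanguage;

#[async_trait]
impl Transformer for TagLanguage {
    async fn transform_node(&self, mut node: Node) -> Result<Node> {
        node.metadata
            .insert("language".to_string(), "rust".to_string());
        Ok(node)
    }
}
```
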
2 changes: 1 addition & 1 deletion src/content/docs/index.mdx
@@ -23,7 +23,7 @@ import { Card, CardGrid } from "@astrojs/starlight/components";
<summary>A quick example</summary>

```rust
IngestionPipeline::from_loader(FileLoader::new(".").with_extensions(&["md"]))
Pipeline::from_loader(FileLoader::new(".").with_extensions(&["md"]))
.then_chunk(ChunkMarkdown::with_chunk_range(10..512))
.then(MetadataQACode::new(openai_client.clone()))
.then_in_batch(10, Embed::new(openai_client.clone()))
```
4 changes: 2 additions & 2 deletions src/content/docs/troubleshooting.md
@@ -33,7 +33,7 @@ When you then set `RUST_LOG=debug` or `RUST_LOG=trace` you will get detailed logs

Tracing has best-in-class opentelemetry support. See the [tracing-opentelemetry](https://github.com/tokio-rs/tracing-opentelemetry) crate for more information.

Note that currently the IngestionNode is attached to every transformation step. Beware of large amounts of tracing data.
Note that currently the Node is attached to every transformation step. Beware of large amounts of tracing data.

## Helpers and utility functions

@@ -43,4 +43,4 @@ There are several helpers and utility functions available on the pipeline to help
- `log_errors` Logs errors only
- `log_nodes` Logs nodes only
- `filter_errors` Filters out errors, only passing nodes
- `filter` Filter out `Result<IngestionNode>` based on a predicate
- `filter` Filter out `Result<Node>` based on a predicate
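
A hedged sketch of how these helpers chain on a pipeline (the loader mirrors the other examples, the final `run()` is assumed from the getting-started material, and the placement of each helper is illustrative):

```rust
// Hypothetical debugging setup: log errors, drop them, then inspect the
// nodes that survive, before running the pipeline to completion.
indexing::Pipeline::from_loader(FileLoader::new(".").with_extensions(&["md"]))
    .log_errors()
    .filter_errors()
    .log_nodes()
    .run()
    .await?;
```
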
15 changes: 10 additions & 5 deletions src/content/docs/what-is-swiftide.mdx
@@ -8,19 +8,24 @@ description: A brief introduction to swiftide.
import { Image } from "astro:assets";
import pipeline from "/src/assets/rag-dark.svg";

Swiftide is a straightforward, easy-to-use, easy-to-extend asynchronous data ingestion and processing library. It is designed to be used in a RAG (Retrieval Augmented Generation) system. It is built to be fast and efficient, with a focus on parallel processing and asynchronous operations.
Swiftide is an indexing and processing library, tailored for Retrieval Augmented Generation (RAG). When building applications with large language models (LLMs), these LLMs need access to external resources. Data needs to be transformed, enriched, split up, embedded, and persisted. It is built in Rust, using parallel, asynchronous streams, and is blazingly fast.

<Image src={pipeline} alt="ingestion-pipeline" />
At the same time, swiftide focusses on developer experience and ease of use. It is
<Image src={pipeline} alt="indexing-pipeline" />

At the same time, swiftide focuses on developer experience and ease of use. It is
designed to be simple and intuitive, with a clear and concise API that makes it easy
to get started, build complex pipelines, and bring your own transformations.

## What problem does swiftide solve?

In a RAG system, the data needs to be ingested, processed, and indexed. This can be a time-consuming process, especially when dealing with large amounts of data. Swiftide aims to solve this problem by providing a fast and efficient way to ingest and process data, allowing the RAG system to be more responsive and efficient.
In other solutions, the experimental phase is often done in Python and then either rewritten from scratch or deployed distributed. Swiftide aims to bring the results of experimentation to production as well.

In fact, swiftide is **so fast** that it enables real-time indexing before querying, opening up the possibility of real-time RAG systems. At the same time, the internet is booming with wild distributed, Kafka-based setups. Swiftide hopes to stretch the limits of what is possible before getting to such a setup, and beyond.

## How does swiftide work?

With swiftide you define a sequence of steps, from ingestion to processing to indexing. Under the hood, swiftide uses Rust's async and streaming features to speed things up, drastically.
With swiftide you define a sequence of steps, from loading to processing to indexing. Under the hood, swiftide uses Rust's async and streaming features to speed things up drastically.

:::warn
Swiftide is under heavy development and can have breaking changes while we work towards 1.0. Documentation here might fall short of all features and, despite our efforts, be slightly outdated. We recommend always keeping an eye on our [github](https://github.com/bosun-ai/swiftide) and [api documentation](https://docs.rs/swiftide/latest/swiftide/).
:::
3 changes: 3 additions & 0 deletions typos.toml
@@ -0,0 +1,3 @@
[files]
# Autogenerated
extend-exclude = ["CHANGELOG.md"]
