feat: 0.6 and copy improvements (#11)
timonv authored Jul 12, 2024
1 parent 61c4d27 commit e81f832
Showing 15 changed files with 67 additions and 51 deletions.
8 changes: 8 additions & 0 deletions .markdownlint.yaml
@@ -0,0 +1,8 @@
# configuration for https://github.com/DavidAnson/markdownlint

first-line-heading: false
no-inline-html: false
line-length: false

# to support repeated headers in the changelog
no-duplicate-heading: false
2 changes: 1 addition & 1 deletion src/assets/rag-dark.svg
10 changes: 5 additions & 5 deletions src/content/docs/examples/hello-world.md
@@ -1,11 +1,11 @@
---
title: Hello World
description: A simple example of an ingestion pipeline
description: A simple example of an indexing pipeline
---

## Ingesting code into Qdrant
## Indexing code with Qdrant

This example demonstrates how to ingest the Swiftide codebase itself.
This example demonstrates how to index the Swiftide codebase itself.
Note that for it to work correctly, you need to have OPENAI_API_KEY set, and Redis and Qdrant
running.

@@ -25,7 +25,7 @@ with lots of small chunks, consider the rate limits of the API.
```rust

use swiftide::{
ingestion,
indexing,
integrations::{self, qdrant::Qdrant, redis::Redis},
loaders::FileLoader,
transformers::{ChunkCode, Embed, MetadataQACode},
@@ -50,7 +50,7 @@ async fn main() -> Result<(), Box<dyn std::error::Error>> {
.unwrap_or("http://localhost:6334")
.to_owned();

ingestion::IngestionPipeline::from_loader(FileLoader::new(".").with_extensions(&["rs"]))
indexing::Pipeline::from_loader(FileLoader::new(".").with_extensions(&["rs"]))
.filter_cached(Redis::try_from_url(redis_url, "swiftide-examples")?)
.then(MetadataQACode::new(openai_client.clone()))
.then_chunk(ChunkCode::try_for_language_and_chunk_size(
```
14 changes: 7 additions & 7 deletions src/content/docs/getting-started/architecture-and-design.mdx
@@ -18,20 +18,20 @@ description: The architecture and design principles of the Swiftide project.

## The-things-we-talk-about

- **IngestionPipeline**: The main struct that holds the pipeline. It is a stream of IngestionNodes.
- **IngestionNode**: The main struct that holds the data. It has a path, chunk and metadata.
- **IngestionStream**: The internal stream of IngestionNodes in the pipeline.
- **Loader**: The starting point of the stream, creates and emits IngestionNodes.
- **Transformers**: Some behaviour that modifies the IngestionNodes.
- **Pipeline**: The main struct that holds the pipeline. It is a stream of Nodes.
- **Node**: The main struct that holds the data. It has a path, chunk and metadata.
- **IndexingStream**: The internal stream of Nodes in the pipeline.
- **Loader**: The starting point of the stream, creates and emits Nodes.
- **Transformers**: Some behaviour that modifies the Nodes.
- **BatchTransformers**: Transformers that transform multiple nodes.
- **Chunkers**: Transformers that split a node into multiple nodes.
- **Storages**: Persist the IngestionNodes.
- **Storages**: Persist the Nodes.
- **NodeCache**: Filters cached nodes.
- **Integrations**: External libraries that can be used with the pipeline.

### Pipeline structure and traits

- from_loader (impl Loader) starting point of the stream, creates and emits IngestionNodes
- from_loader (impl Loader) starting point of the stream, creates and emits Nodes
- filter_cached (impl NodeCache) filters cached nodes
- then (impl Transformer) transforms the node and puts it on the stream
- then_in_batch (impl BatchTransformer) transforms multiple nodes and puts them on the stream
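
Taken together, a hedged sketch of how these chain (constructors mirror the hello-world example elsewhere in this commit; `redis_url` and `openai_client` are assumed to be set up as there):

```rust
// A sketch composing the traits listed above, not a verbatim API reference.
// `redis_url` and `openai_client` are assumed to be configured already.
indexing::Pipeline::from_loader(FileLoader::new(".").with_extensions(&["rs"]))
    .filter_cached(Redis::try_from_url(redis_url, "swiftide-examples")?)
    .then(MetadataQACode::new(openai_client.clone()))
    .then_in_batch(10, Embed::new(openai_client.clone()))
    .run()
    .await?;
```
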
4 changes: 2 additions & 2 deletions src/content/docs/in-depth/caching-and-filtering.md
@@ -13,8 +13,8 @@ Which is defined as follows:

```rust
pub trait NodeCache: Send + Sync + Debug {
    async fn get(&self, node: &IngestionNode) -> bool;
    async fn set(&self, node: &IngestionNode);
    async fn get(&self, node: &Node) -> bool;
    async fn set(&self, node: &Node);
}
```
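
To make the trait concrete, a minimal sketch of an in-memory cache — `MemoryCache` is hypothetical, and the `async_trait` attribute and the `Node` fields (`path`, `chunk`) are assumptions for illustration:

```rust
use std::collections::HashSet;
use std::sync::Mutex;

use async_trait::async_trait;

// Hypothetical in-memory NodeCache. Assumes `Node` exposes `path` and
// `chunk`, and that implementations use the `async_trait` crate.
#[derive(Debug, Default)]
struct MemoryCache {
    seen: Mutex<HashSet<String>>,
}

#[async_trait]
impl NodeCache for MemoryCache {
    // True means "seen before", so the pipeline filters the node out.
    async fn get(&self, node: &Node) -> bool {
        let key = format!("{}:{}", node.path.display(), node.chunk);
        self.seen.lock().unwrap().contains(&key)
    }

    // Mark the node as processed.
    async fn set(&self, node: &Node) {
        let key = format!("{}:{}", node.path.display(), node.chunk);
        self.seen.lock().unwrap().insert(key);
    }
}
```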

2 changes: 1 addition & 1 deletion src/content/docs/in-depth/chunking.md
@@ -13,7 +13,7 @@ Which is defined as follows:

```rust
pub trait ChunkerTransformer: Send + Sync + Debug {
    async fn transform_node(&self, node: IngestionNode) -> IngestionStream;
    async fn transform_node(&self, node: Node) -> IndexingStream;

    /// Overrides the default concurrency of the pipeline
    fn concurrency(&self) -> Option<usize> {
        None
    }
}
```
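
As a hedged sketch of an implementation — `ParagraphChunker` is hypothetical, and it assumes `async_trait`, `anyhow::Result`, a cloneable `Node` with a `chunk` field, and the `Vec<Result<Node>>` conversion shown in the streaming chapter:

```rust
use anyhow::Result;
use async_trait::async_trait;

// Hypothetical chunker emitting one node per blank-line-separated
// paragraph; path and metadata are carried over via `clone`.
#[derive(Debug)]
struct ParagraphChunker;

#[async_trait]
impl ChunkerTransformer for ParagraphChunker {
    async fn transform_node(&self, node: Node) -> IndexingStream {
        let chunks: Vec<Result<Node>> = node
            .chunk
            .split("\n\n")
            .map(|paragraph| {
                let mut child = node.clone();
                child.chunk = paragraph.to_string();
                Ok(child)
            })
            .collect();
        chunks.into()
    }
}
```
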
20 changes: 10 additions & 10 deletions src/content/docs/in-depth/introducing-step-by-step.mdx
@@ -1,11 +1,11 @@
---
title: Step-by-step Introduction
description: A step-by-step introduction on how to use swiftide as a data ingestion pipeline in your project.
description: A step-by-step introduction on how to use swiftide as a data indexing pipeline in your project.
sidebar:
order: 0
---

Swiftide provides a pipeline model. Troughout a pipeline, `IngestionNodes` are transformed and ultimately persisted. Every step with a pipeline returns the same, _owned_ pipeline.
Swiftide provides a pipeline model. Throughout a pipeline, `Nodes` are transformed and ultimately persisted. Every step in a pipeline returns the same, _owned_ pipeline.

import { Steps } from "@astrojs/starlight/components";

@@ -16,18 +16,18 @@ import { Steps } from "@astrojs/starlight/components";
1. The pipeline starts with a loader:

```rust
let pipeline = IngestionPipeline::from_loader(FileLoader::new("./"));
let pipeline = Pipeline::from_loader(FileLoader::new("./"));
```

A loader implements the `Loader` trait which yields `IngestionNodes` to a stream.
A loader implements the `Loader` trait which yields `Nodes` to a stream.

2. Nodes can then be transformed with an existing transformer:

```rust
pipeline.then(MetadataQACode::new(openai_client.clone()));
```

Any transformer has to implement the `Transformer` trait, which takes an owned `IngestionNode` and outputs a `Result<IngestionNode>`. Closures also implement this trait!
Any transformer has to implement the `Transformer` trait, which takes an owned `Node` and outputs a `Result<Node>`. Closures also implement this trait!

3. ... so you can also do this:

@@ -44,9 +44,9 @@ import { Steps } from "@astrojs/starlight/components";
pipeline.then_in_batch(10, Embed::new(FastEmbed::try_default()?));
```

Batchable transformers implement the `BatchableTransformer` trait, which takes a vector of `IngestionNodes` and outputs an `IngestionStream`.
Batchable transformers implement the `BatchableTransformer` trait, which takes a vector of `Nodes` and outputs an `IndexingStream`.

5. Nodes can be filtered using a NodeCache at any stage, based on a cache key the node cache defines. Redis uses a prefix and the hash of an `IngestionNode`, based on the path and text, by default.
5. Nodes can be filtered using a NodeCache at any stage, based on a cache key the node cache defines. Redis uses a prefix and the hash of a `Node`, based on the path and text, by default.

```rust
pipeline.filter_cached(Redis::try_from_url(
@@ -55,7 +55,7 @@ import { Steps } from "@astrojs/starlight/components";
)?);
```

Node caches implement the `NodeCache` trait, which defines a `get` and `set` method, taking an `IngestionNode` as input.
Node caches implement the `NodeCache` trait, which defines a `get` and `set` method, taking a `Node` as input.

6. At any point in the pipeline, nodes can be chunked into smaller parts:

@@ -66,7 +66,7 @@ import { Steps } from "@astrojs/starlight/components";
)?);
```

Chunkers implement the ChunkerTransformer trait, which take an `IngestionNode` and return an `IngestionStream`. By default metadata is copied over to each node.
Chunkers implement the `ChunkerTransformer` trait, which takes a `Node` and returns an `IndexingStream`. By default, metadata is copied over to each node.

7. Also, nodes can be persisted (multiple times!) to storage:

@@ -80,7 +80,7 @@ import { Steps } from "@astrojs/starlight/components";
)
```

Storages implement the `Storage` trait, which define `setup`, `store`, `batch_store` and `batch_size` methods. They also provide ways to convert an `IngestionNode` to something that can be stored.
Storages implement the `Storage` trait, which defines `setup`, `store`, `batch_store` and `batch_size` methods. They also provide ways to convert a `Node` to something that can be stored.

8. Finally, the pipeline can be run as follows:

4 changes: 2 additions & 2 deletions src/content/docs/in-depth/loading-data.md
@@ -13,11 +13,11 @@ Which is defined as follows:

```rust
pub trait Loader {
    fn into_stream(self) -> IngestionStream;
    fn into_stream(self) -> IndexingStream;
}
```

Or in human language: "I can be turned into a stream". The assumption under the hood is that Loaders will yield the data they load as a stream of `IngestionNodes`. These can be files, messages, webpages and so on.
Or in human language: "I can be turned into a stream". The assumption under the hood is that Loaders will yield the data they load as a stream of `Nodes`. These can be files, messages, webpages and so on.
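
For instance, a minimal sketch of a loader that streams a list of strings — `StringLoader` is hypothetical and assumes `Node::default()` plus the `IndexingStream::iter` constructor shown in the streaming chapter:

```rust
// Hypothetical loader yielding one Node per string; every other field
// keeps its default value.
struct StringLoader {
    items: Vec<String>,
}

impl Loader for StringLoader {
    fn into_stream(self) -> IndexingStream {
        IndexingStream::iter(self.items.into_iter().map(|text| {
            let mut node = Node::default();
            node.chunk = text;
            Ok(node)
        }))
    }
}
```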

## Built in loaders

4 changes: 2 additions & 2 deletions src/content/docs/in-depth/storing-results.md
@@ -14,8 +14,8 @@ Which is defined as follows:
```rust
pub trait Persist: Debug + Send + Sync {
    async fn setup(&self) -> Result<()>;
    async fn store(&self, node: IngestionNode) -> Result<IngestionNode>;
    async fn batch_store(&self, nodes: Vec<IngestionNode>) -> IngestionStream;
    async fn store(&self, node: Node) -> Result<Node>;
    async fn batch_store(&self, nodes: Vec<Node>) -> IndexingStream;
    fn batch_size(&self) -> Option<usize> {
        None
    }
}
```
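
As a hedged sketch, a persister that only logs what it stores — `DebugStorage` is hypothetical and assumes `async_trait`, `anyhow::Result`, and the `Vec<Result<Node>>` stream conversion from the streaming chapter:

```rust
use anyhow::Result;
use async_trait::async_trait;

// Hypothetical storage backend that prints instead of persisting; the
// shape matches what a database-backed implementation would need.
#[derive(Debug)]
struct DebugStorage;

#[async_trait]
impl Persist for DebugStorage {
    async fn setup(&self) -> Result<()> {
        // Create tables, collections, or indices here.
        Ok(())
    }

    async fn store(&self, node: Node) -> Result<Node> {
        println!("storing a chunk of {} bytes", node.chunk.len());
        Ok(node)
    }

    async fn batch_store(&self, nodes: Vec<Node>) -> IndexingStream {
        let mut results = Vec::with_capacity(nodes.len());
        for node in nodes {
            results.push(self.store(node).await);
        }
        results.into()
    }
}
```
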
22 changes: 11 additions & 11 deletions src/content/docs/in-depth/streaming-and-concurrency.mdx
@@ -1,11 +1,11 @@
---
title: Streaming and Concurrency
description: How the ingestion pipeline handles streaming and concurrency.
description: How the indexing pipeline handles streaming and concurrency.
sidebar:
order: 6
---

The ingestion pipeline is streaming, asynchronous, unordered and concurrent.
The indexing pipeline is streaming, asynchronous, unordered and concurrent.

## Concurrency

@@ -30,30 +30,30 @@ import { Aside } from "@astrojs/starlight/components";
(short exponential) by default.
</Aside>

## Ingestion Stream
## Indexing Stream

You might have seen the `IngestionStream` type mentioned a few times. It is the internal stream that is being passed around, build on top of the Rust `Stream` and `StreamExt`. By wrapping it we have more control and less boilerplate when dealing with streams.
You might have seen the `IndexingStream` type mentioned a few times. It is the internal stream that is being passed around, built on top of the Rust `Stream` and `StreamExt`. By wrapping it, we have more control and less boilerplate when dealing with streams.

When building batch transformers, storage or chunkers, you will need to return a `IngestionStream`. We've tried to make that as easy as possible and there are multiple ways.
When building batch transformers, storage or chunkers, you will need to return an `IndexingStream`. We've tried to make that as easy as possible, and there are multiple ways.

### Using `Into`

From a list of `IngestionNodes` using `Into`:
From a list of `Nodes` using `Into`:

```rust
let nodes: Vec<Result<IngestionNode>>> = vec![Ok(IngestionNode::default())];
let stream: IngestionStream = nodes.into();
let nodes: Vec<Result<Node>> = vec![Ok(Node::default())];
let stream: IndexingStream = nodes.into();
```

There is also an implementation of `Into` for Rust streams.

### Converting an iterator

You can also convert an `Iterator` into an `IngestionStream` directly. This is great, as the iterator itself will stream it's results, instead of having to collect it first.
You can also convert an `Iterator` into an `IndexingStream` directly. This is great, as the iterator itself will stream its results, instead of having to collect them first.

```rust
let nodes: Vec<Result<IngestionNode>>> = vec![IngestionNode::default()];
let stream: IngestionStream = IngestionStream::iter(nodes.into_iter().map(|node| {
let nodes: Vec<Node> = vec![Node::default()];
let stream: IndexingStream = IndexingStream::iter(nodes.into_iter().map(|mut node| {
    node.metadata.insert("foo".to_string(), "bar".to_string());
    Ok(node)
}));
```
4 changes: 2 additions & 2 deletions src/content/docs/in-depth/transforming-and-enriching.mdx
@@ -5,7 +5,7 @@ sidebar:
order: 1
---

Transformers are the bread and butter of an ingestion pipeline. They can transform the chunk, extract, modify and add metadata, adding vectors, and probably a whole lot more that we haven't thought of.
Transformers are the bread and butter of an indexing pipeline. They can transform the chunk; extract, modify, and add metadata; add vectors; and probably a whole lot more that we haven't thought of.

There are two ways to apply a transformer: per node or in batch.

@@ -15,7 +15,7 @@ The `Transformer` trait is very straightforward:

```rust
pub trait Transformer: Send + Sync {
    async fn transform_node(&self, node: IngestionNode) -> Result<IngestionNode>;
    async fn transform_node(&self, node: Node) -> Result<Node>;

    fn concurrency(&self) -> Option<usize> {
        None
    }
}
```
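
A hedged sketch of an implementation — `TagLanguage` is hypothetical and assumes `async_trait`, `anyhow::Result`, and the string-keyed `metadata` map used in the streaming chapter:

```rust
use anyhow::Result;
use async_trait::async_trait;

// Hypothetical transformer that tags every node with a language hint.
#[derive(Debug)]
struct TagLanguage;

#[async_trait]
impl Transformer for TagLanguage {
    async fn transform_node(&self, mut node: Node) -> Result<Node> {
        node.metadata
            .insert("language".to_string(), "rust".to_string());
        Ok(node)
    }
}
```
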
2 changes: 1 addition & 1 deletion src/content/docs/index.mdx
@@ -23,7 +23,7 @@ import { Card, CardGrid } from "@astrojs/starlight/components";
<summary>A quick example</summary>

```rust
IngestionPipeline::from_loader(FileLoader::new(".").with_extensions(&["md"]))
Pipeline::from_loader(FileLoader::new(".").with_extensions(&["md"]))
.then_chunk(ChunkMarkdown::with_chunk_range(10..512))
.then(MetadataQACode::new(openai_client.clone()))
.then_in_batch(10, Embed::new(openai_client.clone()))
```
4 changes: 2 additions & 2 deletions src/content/docs/troubleshooting.md
@@ -33,7 +33,7 @@ When you then set `RUST_LOG=debug` or `RUST_LOG=trace` you will get detailed logs

Tracing has best-in-class opentelemetry support. See the [tracing-opentelemetry](https://github.com/tokio-rs/tracing-opentelemetry) crate for more information.

Note that currently the IngestionNode is attached to every transformation step. Beware of large amounts of tracing data.
Note that currently the Node is attached to every transformation step. Beware of large amounts of tracing data.

## Helpers and utility functions

@@ -43,4 +43,4 @@ There are several helpers and utility functions available on the pipeline to help
- `log_errors` Logs errors only
- `log_nodes` Logs nodes only
- `filter_errors` Filters out errors, only passing nodes
- `filter` Filter out `Result<IngestionNode>` based on a predicate
- `filter` Filter out `Result<Node>` based on a predicate
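
A hedged sketch of how these helpers chain on a pipeline (the loader mirrors the other examples, the final `run()` is assumed from the getting-started material, and the placement of each helper is illustrative):

```rust
// Hypothetical debugging setup: log errors, drop them, then inspect the
// nodes that survive, before running the pipeline to completion.
indexing::Pipeline::from_loader(FileLoader::new(".").with_extensions(&["md"]))
    .log_errors()
    .filter_errors()
    .log_nodes()
    .run()
    .await?;
```
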
15 changes: 10 additions & 5 deletions src/content/docs/what-is-swiftide.mdx
@@ -8,19 +8,24 @@ description: A brief introduction to swiftide.
import { Image } from "astro:assets";
import pipeline from "/src/assets/rag-dark.svg";

Swiftide is a straightforward, easy-to-use, easy-to-extend asynchronous data ingestion and processing library. It is designed to be used in a RAG (Retrieval Augmented Generation) system. It is built to be fast and efficient, with a focus on parallel processing and asynchronous operations.
Swiftide is an indexing and processing library, tailored for Retrieval Augmented Generation (RAG). When building applications with large language models (LLMs), these LLMs need access to external resources. Data needs to be transformed, enriched, split up, embedded, and persisted. It is built in Rust, using parallel, asynchronous streams, and is blazingly fast.

<Image src={pipeline} alt="ingestion-pipeline" />
At the same time, swiftide focusses on developer experience and ease of use. It is
<Image src={pipeline} alt="indexing-pipeline" />

At the same time, swiftide focuses on developer experience and ease of use. It is
designed to be simple and intuitive, with a clear and concise API that makes it easy
to get started, build complex pipelines, and bring your own transformations.

## What problem does swiftide solve?

In a RAG system, the data needs to be ingested, processed, and indexed. This can be a time-consuming process, especially when dealing with large amounts of data. Swiftide aims to solve this problem by providing a fast and efficient way to ingest and process data, allowing the RAG system to be more responsive and efficient.
In other solutions, the experimental phase is often done in Python and then either rewritten from scratch or deployed distributed. Swiftide aims to bring the results of experimentation to production as well.

In fact, swiftide is **so fast** that it enables real-time indexing before querying, opening up the possibility of real-time RAG systems. At the same time, the internet is booming with wild distributed, Kafka-based setups. Swiftide hopes to stretch the limits of what is possible before getting to such a setup, and beyond.

## How does swiftide work?

With swiftide you define a sequence of steps, from ingestion to processing to indexing. Under the hood, swiftide uses Rust's async and streaming features to speed things up, drastically.
With swiftide you define a sequence of steps, from loading to processing to indexing. Under the hood, swiftide uses Rust's async and streaming features to speed things up drastically.

:::warn
Swiftide is under heavy development and can have breaking changes while we work towards 1.0. Documentation here might fall short of all features and, despite our efforts, be slightly outdated. We recommend always keeping an eye on our [github](https://github.com/bosun-ai/swiftide) and [api documentation](https://docs.rs/swiftide/latest/swiftide/).
:::
3 changes: 3 additions & 0 deletions typos.toml
@@ -0,0 +1,3 @@
[files]
# Autogenerated
extend-exclude = ["CHANGELOG.md"]
