Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Content Layer #982

Merged
merged 14 commits into from
Sep 17, 2024
345 changes: 345 additions & 0 deletions proposals/content-layer.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,345 @@
**If you have feedback and the feature is released as experimental, please leave it on the Stage 3 PR. Otherwise, comment on the Stage 2 issue (links below).**

- Start Date: 2024-07-23
- Implementation PR: https://github.com/withastro/astro/pull/11360
- Stage 2 Issue: https://github.com/withastro/roadmap/issues/946
- Stage 3 PR: <!-- related roadmap PR, leave it empty if you don't have a PR yet -->

# Summary

Creates a successor to content collections with expanded use cases and improved performance.
ascorbic marked this conversation as resolved.
Show resolved Hide resolved

# Example

Collections are defined using a new `loader` property. There are built-in `file` and `glob` loaders, which load data and content from the filesystem, and users can define their own loaders.

```ts
// src/content/config.ts
import { defineCollection, z } from "astro:content";
import { glob, file } from "astro/loaders";
// Loaders can be distributed as packages
import { feedLoader } from "@ascorbic/feed-loader";

// The `glob()` loader loads multiple files, with one entry per file
const spacecraft = defineCollection({
type: "experimental_content",
loader: glob({ pattern: "*.md", base: "src/data/spacecraft" }),
// A schema is optional, but provides validation and type safety for data.
// It can also be used to transform data before it is stored.
schema: ({ image }) =>
z.object({
title: z.string(),
description: z.string(),
heroImage: image().optional(),
}),
});

// The `file()` loader loads multiple entries from one file
const dogs = defineCollection({
type: "experimental_data",
loader: file("src/data/dogs.json"),
schema: z.object({
id: z.string(),
breed: z.string(),
temperament: z.array(z.string()),
}),
});

// Custom loaders can be defined inline or imported from packages
const podcasts = defineCollection({
type: "experimental_content",
loader: feedLoader({
url: "https://feeds.99percentinvisible.org/99percentinvisible",
}),
// A loader can provide its own schema, but a user-defined schema will override it.
});
ascorbic marked this conversation as resolved.
Show resolved Hide resolved

export const collections = { spacecraft, dogs, podcasts };
```

This is then used in Astro pages in the same way as current content collections.

# Background & Motivation

Content collections are a key primitive that brings people to Astro. Content Collections make it easy to work with local content (MD, MDX, Markdoc, etc) inside of your Astro project. They give you structure (`src/content/[collection-name]/*`), schema validation for frontmatter, and querying APIs. However they are limited in a few ways:

- They can only be used with local data
- The data must be in a specific location in the project
- Large collections can be slow to load and use a lot of memory. They can also take up a lot of space when bundling for server deployment. This places an upper limit on the number of entries that can be practically included in a collection.

Content layer is designed to be a successor to content collections that addresses these limitations and opens up more use cases. It is inspired by the Gatsby data layer, but with a simpler API and no GraphQL or graph data store.

# Goals

- Create a successor to content collections that can be used with local and remote data.
- Decouple content from Vite, so that data is no longer implemented as Rollup chunks.
ascorbic marked this conversation as resolved.
Show resolved Hide resolved
- Provide a simple API for defining collections with a migration path from content collections.
- Support local files in user-defined locations with built-in file and glob loaders.
- Initially support Markdown rendering and JSON data for local files, with support for other formats in the future.
ascorbic marked this conversation as resolved.
Show resolved Hide resolved
- Provide an API for defining custom loaders that is powerful enough to handle a wide range of use cases, including CMS integrations.
ascorbic marked this conversation as resolved.
Show resolved Hide resolved
- Provide a higher-level API for simple inline loaders.
ascorbic marked this conversation as resolved.
Show resolved Hide resolved
- Make the API scalable to tens of thousands of entries, with good performance and low memory usage.
ascorbic marked this conversation as resolved.
Show resolved Hide resolved
- Allow loaders to define their own schemas, including dynamic schemas that can be introspected from the data source.
ascorbic marked this conversation as resolved.
Show resolved Hide resolved

# Non-Goals

- Non-goal: allowing loaders to define multiple collections automatically. e.g. separate collections would need to be manually defined for posts and categories in a blog.
- Non-goal: dependency tracing for entries.
- Out of scope: hot-reloading remote data.
- Out of scope: rendering markdown from remote data. A loader could store rendered HTML, but it would be up to the loader to handle this.
- Out of scope: custom `Content` components.
ascorbic marked this conversation as resolved.
Show resolved Hide resolved
- Future: support for Markdoc and MDX rendering.
- Future: SQLite-based backend for collections.
- Future: support for queries more complex than get by ID.
ascorbic marked this conversation as resolved.
Show resolved Hide resolved

# Detailed Design

## Collection Definition

Collections are defined in a similar way to current content collections, using `defineCollection()` in `src/content/config.ts`. There are two new collection types: `experimental_content` and `experimental_data`. The only difference between them is that when a collection is defined as `experimental_content`, entries will have a `render()` method that generates a component for rendering HTML content. This requires the loader to have set the `rendered.html` property.
ascorbic marked this conversation as resolved.
Show resolved Hide resolved

The `reference()` helper can be used in the same way as content collections, to reference other collections.

## Built-in loaders

There are two built-in loaders: `file()` and `glob()`, which load data from the local filesystem. The `glob()` loader covers the current use case of directories full of markdown or JSON content. The `glob()` helper is more flexible than in current content collections, as it can load data from anywhere on the filesystem. The `file()` loader loads multiple entries from a single file. Both loaders can process markdown in the same way as content collections. They can also extract images in the same way as content collections.
ascorbic marked this conversation as resolved.
Show resolved Hide resolved

```ts
const spacecraft = defineCollection({
// The glob loader can be used for either markdown files (defined as experimental_content) or JSON files (defined as experimental_data).
type: "experimental_content",
// The pattern is any valid glob pattern. It is relative to the "base" directory.
// "base" is optional and defaults to the project root. It is defined relative to the project root, or as an absolute path.
loader: glob({ pattern: "*.md", base: "src/data/spacecraft" }),
schema: ({ image }) =>
z.object({
title: z.string(),
description: z.string(),
heroImage: image().optional(),
}),
});

const dogs = defineCollection({
type: "experimental_data",
// The file loader loads a single file which contains multiple entries. The path is relative to the project root, or an absolute path.
// The data must be an array of objects, each with a unique `id` property, or an object with IDs as keys and entries as values.
loader: file("src/data/dogs.json"),
schema: z.object({
id: z.string(),
breed: z.string(),
temperament: z.array(z.string()),
}),
});
```

## Custom loaders

The new collection `loader` property is required, and supports two different loader APIs:

### High-level API

A loader is an async function that returns an array of entries. This higher-level API is useful for defining custom loaders inline that don't need access to features such as incremental updates, content digests, or caching.

```ts
const dogs = defineCollection({
type: "experimental_data",
loader: async () => {
const response = await fetch("https://restcountries.com/v3.1/all");
const data = await response.json();
// Must return an array of entries with an id property,
// or an object with IDs as keys and entries as values
return data.map((country) => ({
id: country.cca3,
...country,
}));
},
});
```

### Low-level API

For advanced loaders, the low-level API provides more control over the loading process. The loader is in control of adding entries to the data store, and has access to various helper tools. It can define its own schema, including generating it dynamically.

The loader is an object with a `load` method and optional `schema` property. The recommended pattern is to define a function that accepts configuration options and returns the loader object.

```ts
import type { Loader } from "astro/loaders";
import { ItemSchema, type Item } from "./schema.js";
import { parseFeed } from "./feed.js";

export interface FeedLoaderOptions {
/** URL of the feed */
url: URL | string;
}

export function feedLoader({ url }: FeedLoaderOptions): Loader {
const feedUrl = new URL(url);
return {
name: "feed-loader",
// The load method is called to load data
load: async ({ store, logger, parseData, meta }) => {
logger.info("Loading posts");

// The meta store is used to store metadata, such as sync tokens
// etags or last-modified times. It is persisted between builds.
const lastModified = meta.get("last-modified");

// Make a conditional request for the feed
const headers = lastModified ? { "If-Modified-Since": lastModified } : {};

const res = await fetch(feedUrl, { headers });

// If the feed hasn't changed, you do not need to update the store
if (res.status === 304) {
logger.info("Feed not modified, skipping");
return;
}
if (!res.ok || !res.body) {
throw new Error(`Failed to fetch feed: ${res.statusText}`);
}

// Store the last-modified header in the meta store so we can
// send it with the next request
meta.set("last-modified", res.headers.get("last-modified"));

const feed = parseFeed(res.body);

// If the loader doesn't handle incremental updates, clear the store before inserting new entries
store.clear();

for (const item of feed.items) {
// The parseData helper uses the schema to validate and transform data
const data = await parseData({
id: item.guid,
data: item,
});

store.set({
id,
data,
// If the data source provides HTML, it can be set in the `rendered` property
// This will allow users to use the `<Content />` component in their pages to render the HTML.
rendered: {
html: data.description ?? "",
},
});
}
},
// A loader can optionally provide its own Zod schema. This can be static, or it can be an async function
// that returns a schema. This allows an API to use introspection to determine the schema.
schema: ItemSchema,
};
}
```

### The data store

Each loader is provided with a data store object. This is an in-memory key/value store, scoped to that loader and is used to store entries. The store is persisted to disk between builds, so loaders can handle incremental updates.
The store has the following methods:

```ts
export interface ScopedDataStore {
get: (key: string) => DataEntry | undefined;
entries: () => Array<[id: string, DataEntry]>;
/**
* Sets an entry in the store. Returns true if the entry was added or updated,
* or false if the entry was not changed.
*/
set: (opts: {
/** The ID of the entry. Must be unique per collection. */
id: string;
/** The data to store. Any JSON-serializable object */
data: TData;
/** The raw body of the content, if applicable. */
body?: string;
/** The file path of the content, if applicable. Relative to the site root. */
filePath?: string;
/** An optional content digest, to check if the content has changed. */
digest?: number | string;
/** The rendered content, if applicable. */
rendered?: RenderedContent;
}) => boolean;
values: () => Array<DataEntry>;
keys: () => Array<string>;
delete: (key: string) => void;
clear: () => void;
has: (key: string) => boolean;
}
```

### The meta store

Each loader is provided with a meta store object. This is a key/value store, scoped to that loader and is used to store metadata. This data isn't available to pages, but is instead used to store information such as sync tokens, etags, or last-modified times. The meta store has the following methods:

```ts
export interface MetaStore {
get: (key: string) => string | undefined;
set: (key: string, value: string) => void;
has: (key: string) => boolean;
delete: (key: string) => void;
}
```

## Using the data

The data is accessed in the same way as content collections, using the `getCollection()` or `getEntry()` functions.

```astro
---
// src/pages/spacecraft/[id].astro
import type { GetStaticPaths } from "astro";
import { getCollection } from "astro:content";
import { Image } from "astro:assets";

export const getStaticPaths: GetStaticPaths = async () => {
const collection = await getCollection("spacecraft");
if (!collection) return [];
return collection.map((craft) => ({
params: {
id: craft.id,
},
props: {
craft,
},
}));
}

const { craft } = Astro.props;

// If a collection is defined as `experimental_content` and the loader has set the `rendered.html` property,
// the entry will have a `render()` method that generates a component for rendering HTML content.
// The raw HTML can also be accessed with `craft.rendered.html`.
// If it is markdown, the frontmatter and headings will be available as properties on `rendered.metadata`.
const { Content } = await craft.render();
---

<h1>{craft.data.title}</h1>

<Content />

```

# Testing Strategy

- Integration tests for the built-in loaders, covering markdown and JSON data.
- Integration tests for image handling in markdown.
- Integration tests for rendering components from markdown.
- Integration tests for custom loaders, covering incremental updates and schema validation.
- Unit tests for the data store and meta store.

# Drawbacks

- A lot of the performance benefits will not be available for MDX, as that is code that is executed at runtime rather than content that can be pre-rendered and persisted in the store.
- The DX for loading data from APIs is already good, so it may be harder to show the benefits (mostly the ability to cache and query the data locally, and persist it between builds).

# Alternatives

- We could keep content collections with the current scope of local content, but add support for custom directories and multiple entries per file.

# Adoption strategy

- New collections can be defined alongside existing content collections by setting the `type` property to `experimental_content` or `experimental_data`.

# Unresolved Questions

The implementation of MDX is currently uncertain. It is a key feature of Astro, but cannot be handled in the same way as markdown. MDX is more like code than content, so it's harder to prerender. This removes some of the most important performance benefits of the content layer, as the entries still need to be handled by Vite at runtime. The minimum requirement is that `glob()` supports MDX but the implementation is the same as today under the hood. This would allow users to use MDX in the same way as markdown, but without the performance benefits. The ideal solution would be to find a way to pre-render MDX content and persist it in the store, but this is a much bigger challenge.