diff --git a/examples/notebooks/databento_data_catalog.ipynb b/examples/notebooks/databento_data_catalog.ipynb
index 5f7b23009246..a11079005e7c 100644
--- a/examples/notebooks/databento_data_catalog.ipynb
+++ b/examples/notebooks/databento_data_catalog.ipynb
@@ -13,15 +13,45 @@
"id": "1",
"metadata": {},
"source": [
- "This tutorial will walk through how to setup a Nautilus Parquet data catalog with databento order book data.\n",
+ "**Info:**\n",
"\n",
- "We choose to work with the MBP-10 schema (which is just an aggregation of the top 10 levels) so that the data is more manageable and easier to work with for the example."
+ "
\n",
+ "This tutorial is currently a work in progress (WIP).\n",
+ "
"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2",
+ "metadata": {},
+ "source": [
+ "This tutorial will walk through how to setup a Nautilus Parquet data catalog with various Databento schemas.\n",
+ "\n",
+ "Prerequities:\n",
+ "- The `databento` Python client library should be installed to make data requests `pip install -U databento`\n",
+ "- A Databento account (there is a free tier)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3",
+ "metadata": {},
+ "source": [
+ "## Requesting data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4",
+ "metadata": {},
+ "source": [
+ "We'll use a Databento historical client for the rest of this tutorial. You can either initialize one by passing your Databento API key to the constructor, or implicitly use the `DATABENTO_API_KEY` environment variable (as shown)."
]
},
{
"cell_type": "code",
"execution_count": null,
- "id": "2",
+ "id": "5",
"metadata": {},
"outputs": [],
"source": [
@@ -32,52 +62,129 @@
},
{
"cell_type": "markdown",
- "id": "3",
+ "id": "6",
"metadata": {},
"source": [
- "## Request data\n",
- "\n",
- "Use the historical API to request the front-month ES futures contract for January 2024.\n",
+ "**It's important to note that every historical streaming request from `timeseries.get_range` will incur a cost (even for the same data), therefore we need to:**\n",
+ "- Know and understand the cost prior to making a request\n",
+ "- Not make requests for the same data more than once (not efficient)\n",
+ "- Persist the responses to disk by writing zstd compressed DBN files (so that we don't have to request again)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7",
+ "metadata": {},
+ "source": [
+ "We can use a metadata [get_cost endpoint](https://docs.databento.com/api-reference-historical/metadata/metadata-get-cost?historical=python&live=python) from the Databento API to get a quote on the cost, prior to each request.\n",
+ "Each request sequence will first request the cost of the data, and then make a request only if the data doesn't already exist on disk.\n",
"\n",
- "**CAUTION: This will incur a cost for every request (only run the request cell once)**"
+ "Note the response returned is in USD, displayed as fractional cents."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8",
+ "metadata": {},
+ "source": [
+ "The following request is only for a small amount of data (as used in this Medium article [Building high-frequency trading signals in Python with Databento and sklearn](https://databento.com/blog/hft-sklearn-python)), just to demonstrate the basic workflow. "
]
},
{
"cell_type": "code",
"execution_count": null,
- "id": "4",
+ "id": "9",
"metadata": {},
"outputs": [],
"source": [
- "# Path we'll use for persisting this request to disk\n",
- "path = \"es-front-glbx-mbp10.dbn.zst\"\n",
- "\n",
- "# Request lead month\n",
- "data = client.timeseries.get_range(\n",
+ "from pathlib import Path\n",
+ "from databento import DBNStore"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "10",
+ "metadata": {},
+ "source": [
+ "We'll prepare a directory for the raw Databento DBN format data, which we'll use for the rest of the tutorial."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "11",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "DATABENTO_DATA_DIR = Path(\"databento\")\n",
+ "DATABENTO_DATA_DIR.mkdir(exist_ok=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "12",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Request cost quote (USD) - this endpoint is 'free'\n",
+ "client.metadata.get_cost(\n",
" dataset=\"GLBX.MDP3\",\n",
" symbols=[\"ES.n.0\"],\n",
" stype_in=\"continuous\",\n",
" schema=\"mbp-10\",\n",
" start=\"2023-12-06T14:30:00\",\n",
" end=\"2023-12-06T20:30:00\",\n",
- " path=path,\n",
")"
]
},
+ {
+ "cell_type": "markdown",
+ "id": "13",
+ "metadata": {},
+ "source": [
+ "Use the historical API to request for the data used in the Medium article."
+ ]
+ },
{
"cell_type": "code",
"execution_count": null,
- "id": "5",
+ "id": "14",
"metadata": {},
"outputs": [],
"source": [
+ "path = DATABENTO_DATA_DIR / \"es-front-glbx-mbp10.dbn.zst\"\n",
+ "\n",
+ "if not path.exists():\n",
+ " # Request data\n",
+ " client.timeseries.get_range(\n",
+ " dataset=\"GLBX.MDP3\",\n",
+ " symbols=[\"ES.n.0\"],\n",
+ " stype_in=\"continuous\",\n",
+ " schema=\"mbp-10\",\n",
+ " start=\"2023-12-06T14:30:00\",\n",
+ " end=\"2023-12-06T20:30:00\",\n",
+ " path=path, # <--- Passing a `path` parameter will ensure the data is written to disk\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "15",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Inspect the data by reading from disk and convert to a pandas.DataFrame\n",
+ "data = DBNStore.from_file(path)\n",
+ "\n",
"df = data.to_df()\n",
"df"
]
},
{
"cell_type": "markdown",
- "id": "6",
+ "id": "16",
"metadata": {},
"source": [
"## Write to data catalog"
@@ -86,7 +193,7 @@
{
"cell_type": "code",
"execution_count": null,
- "id": "7",
+ "id": "17",
"metadata": {},
"outputs": [],
"source": [
@@ -101,91 +208,214 @@
{
"cell_type": "code",
"execution_count": null,
- "id": "8",
+ "id": "18",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "CATALOG_PATH = Path.cwd() / \"catalog\"\n",
+ "\n",
+ "# Clear if it already exists\n",
+ "if CATALOG_PATH.exists():\n",
+ " shutil.rmtree(CATALOG_PATH)\n",
+ "CATALOG_PATH.mkdir()\n",
+ "\n",
+ "# Create a catalog instance\n",
+ "catalog = ParquetDataCatalog(CATALOG_PATH)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "19",
+ "metadata": {},
+ "source": [
+ "Now that we've prepared the data catalog, we need a `DatabentoDataLoader` which we'll use to decode and load the data into Nautilus objects."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "20",
"metadata": {},
"outputs": [],
"source": [
+ "loader = DatabentoDataLoader()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "21",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "path = DATABENTO_DATA_DIR / \"es-front-glbx-mbp10.dbn.zst\"\n",
"instrument_id = InstrumentId.from_str(\"ES.n.0\") # This should be the raw symbol (update)\n",
- "loader = DatabentoDataLoader()\n",
+ "\n",
"depth10 = loader.from_dbn_file(\n",
" path=path,\n",
" instrument_id=instrument_id, # Not required but makes data loading faster (symbology mapping not required)\n",
- " as_legacy_cython=False, # This will load Rust pyo3 objects to write to the catalog\n",
+ " as_legacy_cython=False, # This will load Rust pyo3 objects to write to the catalog (we could use legacy Cython objects, but this is slightly more efficient)\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
- "id": "9",
+ "id": "22",
"metadata": {},
"outputs": [],
"source": [
- "CATALOG_PATH = Path.cwd() / \"catalog\"\n",
- "\n",
- "# Clear if it already exists, then create fresh\n",
- "if CATALOG_PATH.exists():\n",
- " shutil.rmtree(CATALOG_PATH)\n",
- "CATALOG_PATH.mkdir()\n",
- "\n",
- "# Create a catalog instance\n",
- "catalog = ParquetDataCatalog(CATALOG_PATH)"
+ "# Write data to catalog (this takes ~20 seconds or ~250,000/second for writing MBP-10 at the moment)\n",
+ "catalog.write_data(depth10)"
]
},
{
"cell_type": "code",
"execution_count": null,
- "id": "10",
+ "id": "23",
"metadata": {},
"outputs": [],
"source": [
- "# Write instrument and ticks to catalog (this takes ~20 seconds)\n",
- "catalog.write_data(depth10)"
+ "# Test reading from catalog\n",
+ "depths = catalog.order_book_depth10()\n",
+ "len(depths)"
]
},
{
"cell_type": "code",
"execution_count": null,
- "id": "11",
+ "id": "24",
"metadata": {},
"outputs": [],
"source": []
},
+ {
+ "cell_type": "markdown",
+ "id": "25",
+ "metadata": {},
+ "source": [
+ "## Preparing a month of AAPL trades"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "26",
+ "metadata": {},
+ "source": [
+ "Now we'll expand on this workflow by preparing a month of AAPL trades on the Nasdaq exchange using the Databento `trade` schema, which will translate to Nautilus `TradeTick` objects."
+ ]
+ },
{
"cell_type": "code",
"execution_count": null,
- "id": "12",
+ "id": "27",
"metadata": {},
"outputs": [],
"source": [
- "import pyarrow.parquet as pq"
+ "# Request cost quote (USD) - this endpoint is 'free'\n",
+ "client.metadata.get_cost(\n",
+ " dataset=\"XNAS.ITCH\",\n",
+ " symbols=[\"AAPL\"],\n",
+ " schema=\"trades\",\n",
+ " start=\"2024-01\",\n",
+ ")"
]
},
{
"cell_type": "code",
"execution_count": null,
- "id": "13",
+ "id": "28",
"metadata": {},
"outputs": [],
"source": [
- "depth10_parquet_path = \"catalog/data/order_book_depth10/ES.n.0/part-0.parquet\""
+ "path = DATABENTO_DATA_DIR / \"aapl-xnas-202401.trades.dbn.zst\"\n",
+ "\n",
+ "if not path.exists():\n",
+ " # Request data\n",
+ " client.timeseries.get_range(\n",
+ " dataset=\"XNAS.ITCH\",\n",
+ " symbols=[\"AAPL\"],\n",
+ " schema=\"trades\",\n",
+ " start=\"2024-01\",\n",
+ " path=path, # <--- Passing a `path` parameter will ensure the data is written to disk\n",
+ " )"
]
},
{
"cell_type": "code",
"execution_count": null,
- "id": "14",
+ "id": "29",
"metadata": {},
"outputs": [],
"source": [
- "table = pq.read_table(depth10_parquet_path)\n",
- "table.schema"
+ "# Inspect the data by reading from disk and convert to a pandas.DataFrame\n",
+ "data = DBNStore.from_file(path)\n",
+ "\n",
+ "df = data.to_df()\n",
+ "df"
]
},
{
"cell_type": "code",
"execution_count": null,
- "id": "15",
+ "id": "30",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "instrument_id = InstrumentId.from_str(\"AAPL.XNAS\") # Using the Nasdaq ISO 10383 MIC (Market Identifier Code) as the venue\n",
+ "\n",
+ "trades = loader.from_dbn_file(\n",
+ " path=path,\n",
+ " instrument_id=instrument_id, # Not required but makes data loading faster (symbology mapping not required)\n",
+ " as_legacy_cython=False, # This will load Rust pyo3 objects to write to the catalog (we could use legacy Cython objects, but this is slightly more efficient)\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "31",
+ "metadata": {},
+ "source": [
+ "Here we'll organize our data in a file per month, this is a rather arbitrary choice and a file per day could be equally valid.\n",
+ "\n",
+ "It may also be a good idea to create a function which can return the correct `basename_template` value for a given chunk of data."
+ ]
+ },
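+ {
+ "cell_type": "markdown",
+ "id": "31a",
+ "metadata": {},
+ "source": [
+ "For example, a minimal helper along these lines could derive the monthly `basename_template` from a chunk of data. This is just a sketch: the `monthly_basename_template` function is hypothetical (not part of the Nautilus or Databento APIs), and it assumes each loaded object exposes a nanosecond `ts_event` timestamp."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "31b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "\n",
+ "def monthly_basename_template(chunk) -> str:\n",
+ "    # Hypothetical helper: derive a 'YYYY-MM' basename from the first item's event timestamp\n",
+ "    ts = pd.Timestamp(chunk[0].ts_event, tz=\"UTC\")  # ts_event is UNIX nanoseconds\n",
+ "    return ts.strftime(\"%Y-%m\")\n",
+ "\n",
+ "\n",
+ "monthly_basename_template(trades)  # e.g. '2024-01'"
+ ]
+ },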
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "32",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Write data to catalog\n",
+ "catalog.write_data(trades, basename_template=\"2024-01\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "33",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "trades = catalog.trade_ticks([instrument_id])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "34",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "len(trades)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "35",
"metadata": {},
"outputs": [],
"source": []