diff --git a/examples/notebooks/databento_data_catalog.ipynb b/examples/notebooks/databento_data_catalog.ipynb index 5f7b23009246..a11079005e7c 100644 --- a/examples/notebooks/databento_data_catalog.ipynb +++ b/examples/notebooks/databento_data_catalog.ipynb @@ -13,15 +13,45 @@ "id": "1", "metadata": {}, "source": [ - "This tutorial will walk through how to setup a Nautilus Parquet data catalog with databento order book data.\n", + "**Info:**\n", "\n", - "We choose to work with the MBP-10 schema (which is just an aggregation of the top 10 levels) so that the data is more manageable and easier to work with for the example." + "<div>\n",
+    "This tutorial is currently a work in progress (WIP).\n", +    "</div>"
" + ] + }, + { + "cell_type": "markdown", + "id": "2", + "metadata": {}, + "source": [ + "This tutorial will walk through how to setup a Nautilus Parquet data catalog with various Databento schemas.\n", + "\n", + "Prerequities:\n", + "- The `databento` Python client library should be installed to make data requests `pip install -U databento`\n", + "- A Databento account (there is a free tier)" + ] + }, + { + "cell_type": "markdown", + "id": "3", + "metadata": {}, + "source": [ + "## Requesting data" + ] + }, + { + "cell_type": "markdown", + "id": "4", + "metadata": {}, + "source": [ + "We'll use a Databento historical client for the rest of this tutorial. You can either initialize one by passing your Databento API key to the constructor, or implicitly use the `DATABENTO_API_KEY` environment variable (as shown)." ] }, { "cell_type": "code", "execution_count": null, - "id": "2", + "id": "5", "metadata": {}, "outputs": [], "source": [ @@ -32,52 +62,129 @@ }, { "cell_type": "markdown", - "id": "3", + "id": "6", "metadata": {}, "source": [ - "## Request data\n", - "\n", - "Use the historical API to request the front-month ES futures contract for January 2024.\n", + "**It's important to note that every historical streaming request from `timeseries.get_range` will incur a cost (even for the same data), therefore we need to:**\n", + "- Know and understand the cost prior to making a request\n", + "- Not make requests for the same data more than once (not efficient)\n", + "- Persist the responses to disk by writing zstd compressed DBN files (so that we don't have to request again)" + ] + }, + { + "cell_type": "markdown", + "id": "7", + "metadata": {}, + "source": [ + "We can use a metadata [get_cost endpoint](https://docs.databento.com/api-reference-historical/metadata/metadata-get-cost?historical=python&live=python) from the Databento API to get a quote on the cost, prior to each request.\n", + "Each request sequence will first request the cost of the data, and then make a request only if the data doesn't already exist on disk.\n", "\n", - "**CAUTION: This will incur a cost for every request (only run the request cell once)**" + "Note the response returned is in USD, displayed as fractional cents." + ] + }, + { + "cell_type": "markdown", + "id": "8", + "metadata": {}, + "source": [ + "The following request is only for a small amount of data (as used in this Medium article [Building high-frequency trading signals in Python with Databento and sklearn](https://databento.com/blog/hft-sklearn-python)), just to demonstrate the basic workflow. " ] }, { "cell_type": "code", "execution_count": null, - "id": "4", + "id": "9", "metadata": {}, "outputs": [], "source": [ - "# Path we'll use for persisting this request to disk\n", - "path = \"es-front-glbx-mbp10.dbn.zst\"\n", - "\n", - "# Request lead month\n", - "data = client.timeseries.get_range(\n", + "from pathlib import Path\n", + "from databento import DBNStore" + ] + }, + { + "cell_type": "markdown", + "id": "10", + "metadata": {}, + "source": [ + "We'll prepare a directory for the raw Databento DBN format data, which we'll use for the rest of the tutorial." 
+ ] }, { "cell_type": "code", "execution_count": null, "id": "11", "metadata": {}, "outputs": [], "source": [ "DATABENTO_DATA_DIR = Path(\"databento\")\n", "DATABENTO_DATA_DIR.mkdir(exist_ok=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "12", "metadata": {}, "outputs": [], "source": [ "# Request cost quote (USD) - this endpoint is 'free'\n", "client.metadata.get_cost(\n", "    dataset=\"GLBX.MDP3\",\n", "    symbols=[\"ES.n.0\"],\n", "    stype_in=\"continuous\",\n", "    schema=\"mbp-10\",\n", "    start=\"2023-12-06T14:30:00\",\n", "    end=\"2023-12-06T20:30:00\",\n", - "    path=path,\n", ")" ] }, { "cell_type": "markdown", "id": "13", "metadata": {}, "source": [ "Use the historical API to request the data used in the Medium article." ] }, { "cell_type": "code", "execution_count": null, - "id": "5", + "id": "14", "metadata": {}, "outputs": [], "source": [ + "path = DATABENTO_DATA_DIR / \"es-front-glbx-mbp10.dbn.zst\"\n", + "\n", + "if not path.exists():\n", + "    # Request data\n", + "    client.timeseries.get_range(\n", + "        dataset=\"GLBX.MDP3\",\n", + "        symbols=[\"ES.n.0\"],\n", + "        stype_in=\"continuous\",\n", + "        schema=\"mbp-10\",\n", + "        start=\"2023-12-06T14:30:00\",\n", + "        end=\"2023-12-06T20:30:00\",\n", + "        path=path,  # <--- Passing a `path` parameter will ensure the data is written to disk\n", + "    )" ] }, { "cell_type": "code", "execution_count": null, "id": "15", "metadata": {}, "outputs": [], "source": [ + "# Inspect the data by reading it from disk and converting to a pandas.DataFrame\n", + "data = DBNStore.from_file(path)\n", + "\n", "df = data.to_df()\n", "df" ] }, { "cell_type": "markdown", - "id": "6", + "id": "16", "metadata": {}, "source": [ "## Write to data catalog" ] }, @@ -86,7 +193,7 @@ { "cell_type": "code", "execution_count": null, - "id": "7", + "id": "17", "metadata": {}, "outputs": [], "source": [ @@ -101,91 +208,214 @@ { "cell_type": "code", "execution_count": null, - "id": "8", + "id": "18", "metadata": {}, "outputs": [], "source": [ + "CATALOG_PATH = Path.cwd() / \"catalog\"\n", + "\n", + "# Clear if it already exists\n", + "if CATALOG_PATH.exists():\n", + "    shutil.rmtree(CATALOG_PATH)\n", + "CATALOG_PATH.mkdir()\n", + "\n", + "# Create a catalog instance\n", + "catalog = ParquetDataCatalog(CATALOG_PATH)" ] }, { "cell_type": "markdown", "id": "19", "metadata": {}, "source": [ + "Now that we've prepared the data catalog, we need a `DatabentoDataLoader`, which we'll use to decode and load the data into Nautilus objects.\n",
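+    "\n", +    "The loader can decode DBN records either into Rust pyo3 objects (slightly more efficient when writing to the catalog) or into legacy Cython objects (consumed directly by other Nautilus components). A rough sketch of the two modes, using the `loader` and `path` from the surrounding cells:\n", +    "\n", +    "```python\n", +    "# Rust pyo3 objects - what we'll write to the catalog below\n", +    "pyo3_data = loader.from_dbn_file(path=path, as_legacy_cython=False)\n", +    "\n", +    "# Legacy Cython objects - for direct use in other Nautilus components\n", +    "cython_data = loader.from_dbn_file(path=path, as_legacy_cython=True)\n", +    "```"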
+ ] }, { "cell_type": "code", "execution_count": null, "id": "20", "metadata": {}, "outputs": [], "source": [ "loader = DatabentoDataLoader()" ] }, { "cell_type": "code", "execution_count": null, "id": "21", "metadata": {}, "outputs": [], "source": [ + "path = DATABENTO_DATA_DIR / \"es-front-glbx-mbp10.dbn.zst\"\n", "instrument_id = InstrumentId.from_str(\"ES.n.0\")  # This should be the raw symbol (update)\n", - "loader = DatabentoDataLoader()\n", + "\n", "depth10 = loader.from_dbn_file(\n", "    path=path,\n", "    instrument_id=instrument_id,  # Not required but makes data loading faster (symbology mapping not required)\n", - "    as_legacy_cython=False,  # This will load Rust pyo3 objects to write to the catalog\n", + "    as_legacy_cython=False,  # This will load Rust pyo3 objects to write to the catalog (we could use legacy Cython objects, but this is slightly more efficient)\n", ")" ] }, { "cell_type": "code", "execution_count": null, - "id": "9", + "id": "22", "metadata": {}, "outputs": [], "source": [ - "CATALOG_PATH = Path.cwd() / \"catalog\"\n", - "\n", - "# Clear if it already exists, then create fresh\n", - "if CATALOG_PATH.exists():\n", - "    shutil.rmtree(CATALOG_PATH)\n", - "CATALOG_PATH.mkdir()\n", - "\n", - "# Create a catalog instance\n", - "catalog = ParquetDataCatalog(CATALOG_PATH)" + "# Write data to catalog (currently ~20 seconds, i.e. ~250,000 records/second, for this MBP-10 data)\n", + "catalog.write_data(depth10)" ] }, { "cell_type": "code", "execution_count": null, - "id": "10", + "id": "23", "metadata": {}, "outputs": [], "source": [ - "# Write instrument and ticks to catalog (this takes ~20 seconds)\n", - "catalog.write_data(depth10)" + "# Test reading from catalog\n", + "depths = catalog.order_book_depth10()\n", + "len(depths)" ] }, { "cell_type": "code", "execution_count": null, - "id": "11", + "id": "24", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "25", "metadata": {}, "source": [ + "## Preparing a month of AAPL trades" ] }, { "cell_type": "markdown", "id": "26", "metadata": {}, "source": [ + "Now we'll expand on this workflow by preparing a month of AAPL trades on the Nasdaq exchange, using the Databento `trades` schema, which will translate to Nautilus `TradeTick` objects.\n",
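+    "\n", +    "For orientation, once the data is loaded below, each decoded record will be a `TradeTick` whose fields can be inspected directly. A small illustrative sketch:\n", +    "\n", +    "```python\n", +    "trade = trades[0]  # First decoded TradeTick\n", +    "print(trade.instrument_id, trade.price, trade.size, trade.aggressor_side, trade.ts_event)\n", +    "```"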
+ ] }, { "cell_type": "code", "execution_count": null, - "id": "12", + "id": "27", "metadata": {}, "outputs": [], "source": [ - "import pyarrow.parquet as pq" + "# Request cost quote (USD) - this endpoint is 'free'\n", + "client.metadata.get_cost(\n", + "    dataset=\"XNAS.ITCH\",\n", + "    symbols=[\"AAPL\"],\n", + "    schema=\"trades\",\n", + "    start=\"2024-01\",\n", + ")" ] }, { "cell_type": "code", "execution_count": null, - "id": "13", + "id": "28", "metadata": {}, "outputs": [], "source": [ - "depth10_parquet_path = \"catalog/data/order_book_depth10/ES.n.0/part-0.parquet\"" + "path = DATABENTO_DATA_DIR / \"aapl-xnas-202401.trades.dbn.zst\"\n", + "\n", + "if not path.exists():\n", + "    # Request data\n", + "    client.timeseries.get_range(\n", + "        dataset=\"XNAS.ITCH\",\n", + "        symbols=[\"AAPL\"],\n", + "        schema=\"trades\",\n", + "        start=\"2024-01\",\n", + "        path=path,  # <--- Passing a `path` parameter will ensure the data is written to disk\n", + "    )" ] }, { "cell_type": "code", "execution_count": null, - "id": "14", + "id": "29", "metadata": {}, "outputs": [], "source": [ - "table = pq.read_table(depth10_parquet_path)\n", - "table.schema" + "# Inspect the data by reading it from disk and converting to a pandas.DataFrame\n", + "data = DBNStore.from_file(path)\n", + "\n", + "df = data.to_df()\n", + "df" ] }, { "cell_type": "code", "execution_count": null, - "id": "15", + "id": "30", "metadata": {}, "outputs": [], "source": [ + "instrument_id = InstrumentId.from_str(\"AAPL.XNAS\")  # Using the Nasdaq ISO 10383 MIC (Market Identifier Code) as the venue\n", + "\n", + "trades = loader.from_dbn_file(\n", + "    path=path,\n", + "    instrument_id=instrument_id,  # Not required but makes data loading faster (symbology mapping not required)\n", + "    as_legacy_cython=False,  # This will load Rust pyo3 objects to write to the catalog (we could use legacy Cython objects, but this is slightly more efficient)\n", + ")" ] }, { "cell_type": "markdown", "id": "31", "metadata": {}, "source": [ + "Here we'll organize our data as one file per month. This is a rather arbitrary choice; one file per day would be equally valid.\n", + "\n", + "It may also be a good idea to create a function which can return the correct `basename_template` value for a given chunk of data (a sketch follows at the end of this notebook)." ] }, { "cell_type": "code", "execution_count": null, "id": "32", "metadata": {}, "outputs": [], "source": [ + "# Write data to catalog\n", + "catalog.write_data(trades, basename_template=\"2024-01\")" ] }, { "cell_type": "code", "execution_count": null, "id": "33", "metadata": {}, "outputs": [], "source": [ + "trades = catalog.trade_ticks([instrument_id])" ] }, { "cell_type": "code", "execution_count": null, "id": "34", "metadata": {}, "outputs": [], "source": [ + "len(trades)" ] }, { "cell_type": "code", "execution_count": null, "id": "35", "metadata": {}, "outputs": [], "source": []
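}, { "cell_type": "markdown", "id": "36", "metadata": {}, "source": [ + "As noted above, a helper could derive the correct `basename_template` for a given chunk of data. A minimal sketch (the function name and the monthly naming convention are illustrative, not part of the Nautilus API):" ] }, { "cell_type": "code", "execution_count": null, "id": "37", "metadata": {}, "outputs": [], "source": [ + "import pandas as pd\n", + "\n", + "\n", + "def basename_template_for(chunk) -> str:\n", + "    # Derive a 'YYYY-MM' basename from the first record's event timestamp (UNIX nanoseconds)\n", + "    ts = pd.Timestamp(chunk[0].ts_event, unit=\"ns\", tz=\"UTC\")\n", + "    return ts.strftime(\"%Y-%m\")\n", + "\n", + "\n", + "basename_template_for(trades)  # e.g. '2024-01'" ] }, { "cell_type": "code", "execution_count": null, "id": "38", "metadata": {}, "outputs": [], "source": []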