feat(store-sync,store-indexer): schemaless indexer #1965
Conversation
🦋 Changeset detected. Latest commit: b02aedc. The changes in this PR will be included in the next version bump. This PR includes changesets to release 30 packages.
.from(tables.chainTable)
.where(eq(tables.chainTable.chainId, chainId))
.execute()
.then((rows) => rows.find(() => true));
for context, what is the purpose of the `(rows) => rows.find(() => true)` line?
I'll leave some comments around, but this basically sidesteps our lack of `noUncheckedIndexedAccess: true`.
const rows = query(...);
const row = rows[0]; // type will just be `Row`, which is inaccurate for an empty array
// vs
const row = rows.find(() => true); // type will be `Row | undefined`
ohh so it's essentially "give me the first item if it exists"; I had read it as "filter"
packages/store-indexer/package.json
@@ -27,7 +27,7 @@
   "lint": "eslint .",
   "start:postgres": "concurrently -n indexer,frontend -c cyan,magenta 'tsx bin/postgres-indexer' 'tsx bin/postgres-frontend'",
   "start:postgres:local": "DEBUG=mud:store-sync:createStoreSync DATABASE_URL=postgres://127.0.0.1/postgres RPC_HTTP_URL=http://127.0.0.1:8545 pnpm start:postgres",
-  "start:postgres:testnet": "DEBUG=mud:store-sync:createStoreSync DATABASE_URL=postgres://127.0.0.1/postgres RPC_HTTP_URL=https://follower.testnet-chain.linfra.xyz pnpm start:postgres",
+  "start:postgres:testnet": "DEBUG=mud:* DATABASE_URL=postgres://127.0.0.1/postgres RPC_HTTP_URL=https://follower.testnet-chain.linfra.xyz pnpm start:postgres",
is this change intended to be merged or for dev debug purposes?
const { lastUpdatedBlockNumber } = metadata[0] ?? {};
const tablesWithRecords: TableWithRecords[] = tables.map((table) => {
  const records = logs
    .filter((log) => getAddress(log.address) === getAddress(table.address) && log.args.tableId === table.tableId)
should the filter on `address` and `tableId` be passed to `getLogs` to let postgres handle it instead of filtering the result array?
This does less querying than we were doing before, but for the same amount of data.

Before, when records were distributed across many tables, we had to query for a list of tables, then query each table individually for its records. Now we query for all logs (passing the filters to the SQL query), then we just group them by table for the purposes of `findAll`.

This is mostly a backwards-compat thing for `findAll` with the new table structure. The clients will start using the new `getLogs` endpoint in a PR coming soon.
Ahh gotcha, this part is just grouping the logs by table, not really filtering the data. Even though this is mostly used for backwards compatibility, I wonder whether we should slightly change the logic here to iterate through the logs once and group them by table (i.e. have an object with `address/tableId` as key and an array as value, iterate over the logs to push to the right array, and then do `Object.values`), instead of iterating through all logs number-of-tables times?
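A minimal sketch of that single-pass grouping (assuming a simplified log shape and reusing viem's getAddress for address normalization; none of these names are the PR's actual types):

import { getAddress } from "viem";

type StorageLog = { address: string; args: { tableId: string } };

// Group logs by table in one pass: key each bucket by checksummed address + tableId,
// push every log into its bucket, then look buckets up per table.
function groupLogsByTable(logs: StorageLog[]): Map<string, StorageLog[]> {
  const groups = new Map<string, StorageLog[]>();
  for (const log of logs) {
    const key = `${getAddress(log.address)}:${log.args.tableId}`;
    const group = groups.get(key) ?? [];
    group.push(log);
    groups.set(key, group);
  }
  return groups;
}

// Each table's records then become a constant-time lookup instead of a filter over all logs:
// const records = groupLogsByTable(logs).get(`${getAddress(table.address)}:${table.tableId}`) ?? [];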
will adjust here but I'd prefer to save these kinds of things for an optimization pass where we can set up some measurement tools etc. (assuming we even want to keep supporting this endpoint)
although it will probably be more performant, I find your suggested approach a bit harder to read and maintain, so it'd be good to know what perf we'd actually gain in the readability vs speed/memory trade-off
agree with no premature optimization and instead prioritizing benchmarks to find the actual bottlenecks!
ah so the other reason I did it this way is that we need the table schema to decode the log, and I didn't want to create another lookup mapping
going to keep this as-is for readability for now, and can follow up with measurement
.from(tables.chainTable)
.where(eq(tables.chainTable.chainId, chainId))
.execute()
.then((rows) => rows.find(() => true));
same question here as before: why do we need the `.then((rows) => rows.find(() => true))` line?
blockNumber = bigIntMax(
  blockNumber,
  records.reduce((max, record) => bigIntMax(max, record.lastUpdatedBlockNumber ?? 0n), 0n)
);
nit: if we use `blockNumber` as the initial value for the accumulator we wouldn't need the outer `bigIntMax`. No strong opinion, could argue the intent is clearer with the explicit outer `bigIntMax`.
If the `lastUpdatedBlockNumber` of a record is higher than the `chainState.lastUpdatedBlockNumber`, would that mean there is an issue with the way we update `chainState.lastUpdatedBlockNumber`, or is there a valid situation where it could happen? If it means a bug, we should probably log something and aim to remove this step at some point.
I only did this here because of the order of queries that are run separately, one for the chain's current block number and one for all the records. I could probably do this at query time (i.e. a single query), but I was being lazy about crafting that query since Drizzle is a bit clunky/annoying with joins.
The issue I am solving for is that the indexer may be up to date for a chain but a world may not have had any recent activity. If we query for just records and return the highest block number of those records, that would signal to the client that it needs to start fetching from the RPC from a much older block number than is necessary.
So basically: we get the chain's block number first, as the "minimum" block number value. Then we get all the records and their highest block number. If it's higher than the chain, then we know some data updated between our two queries and we should use the higher number. If it's lower than the chain, we use the chain block number, because that's where the indexer is at in terms of up-to-date data.
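Roughly, as a sketch (simplified shapes; folding the chain's block number in as the reducer's initial value, per the nit above):

function bigIntMax(a: bigint, b: bigint): bigint {
  return a > b ? a : b;
}

type IndexedRecord = { lastUpdatedBlockNumber: bigint | null };

// Use the chain's indexed block number as the floor, then bump it up if any
// record was updated between the two separate queries.
function resolveBlockNumber(chainBlockNumber: bigint, records: IndexedRecord[]): bigint {
  return records.reduce(
    (max, record) => bigIntMax(max, record.lastUpdatedBlockNumber ?? 0n),
    chainBlockNumber
  );
}

// resolveBlockNumber(chainBlockNumber, records) is then the block number returned to the client.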
eventName: "Store_SetRecord",
args: {
  tableId: record.tableId,
  keyTuple: decodeDynamicField("bytes32[]", record.keyBytes),
at some point we should do benchmarks of this hot path, I hope the decoding here doesn't become a bottleneck if a large number of records is requested
I doubt it, decoding individual fields is fairly lightweight. This one just chunks the concatenated hex into individual bytes32 hex items. There's room to optimize that method but I doubt this is where our bottleneck is.
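For illustration only (this is not the actual decodeDynamicField implementation, just the shape of the chunking it does for bytes32[]):

// Split concatenated hex into 32-byte (64 hex character) items.
function chunkToBytes32(data: `0x${string}`): `0x${string}`[] {
  const hex = data.slice(2);
  const items: `0x${string}`[] = [];
  for (let i = 0; i < hex.length; i += 64) {
    items.push(`0x${hex.slice(i, i + 64)}`);
  }
  return items;
}

// chunkToBytes32(`0x${"11".repeat(32)}${"22".repeat(32)}`) -> two bytes32 items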
the reason I'm saying this is that we had performance issues with string manipulation in the past (back in network stack v1 times), but I agree we should definitely benchmark first to find the bottlenecks
import { setupTables } from "./setupTables";
import { StorageAdapter, StorageAdapterBlock } from "../common";

// Currently assumes one DB per chain ID
is this comment still accurate?
yes
I was confused about the `chain` table, but I now understand the expectation is to have only one row in that table. We also don't have a `chainId` column in the data table, so we couldn't support multiple chains in the same data table with the current schema.
);

const logs = records
  .filter((record) => !record.isDeleted)
should we just include `!isDeleted` in the SQL query?
I intentionally didn't for a few reasons:
- on the whole, I expect the number of actual deleted records to be low, because it costs gas to delete vs leaving a record laying around
- doing it at the query level will change the query planning and I want to make sure we hit the indexes I've set up here (not to say that we can't but I wanna make sure we can make use of the indexes)
- we'll be doing similar shaped queries for the update stream (query all records and return whether a record is deleted) and seems wise to use the same SQL query path for keeping cache warm
> on the whole, I expect the number of actual deleted records to be low, because it costs gas to delete vs leaving a record laying around

Independent of the number of deleted records, it would save us one linear operation over the entire logs array if we don't have to filter for `isDeleted` after fetching all the logs. If we'd add another index for `isDeleted` we could avoid that linear operation, at the cost of more expensive inserts (but since inserts are throttled by the blockchain, write performance is less critical than read performance).

> we'll be doing similar shaped queries for the update stream (query all records and filter or return whether a record is deleted) and seems wise to use the same SQL query path for caching

For the update stream the query would also include the latest block number, no? So wouldn't it already be a slightly different query shape?
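For concreteness, pushing that filter into the query with Drizzle would look roughly like the sketch below (db, tables.recordsTable, address, and the column names are assumed here for illustration, not taken from the PR's actual schema):

import { and, eq } from "drizzle-orm";

// Hypothetical sketch: filter deleted records in SQL rather than in JS.
// Whether this can still make good use of the composite indexes set up in the
// schema is exactly what an EXPLAIN/benchmark pass would need to confirm.
const records = await db
  .select()
  .from(tables.recordsTable)
  .where(
    and(
      eq(tables.recordsTable.address, address),
      eq(tables.recordsTable.isDeleted, false)
    )
  )
  .execute();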
> For the update stream the query would also include the latest block number no? So wouldn't it already be a slightly different query shape?

ah yeah, that's true
mind if we come back to this? it would be nice to set up some scaffolding for testing/measuring query performance, processing performance (SQL query -> HTTP response), etc. and I don't want to blow up this PR
some of this might be easier if we move to static SQL queries (rather than ORM) that we can run EXPLAIN on and include as part of snapshot testing etc. to make sure indices are being hit in the way we expect
yeah definitely, this is not blocking, approved the PR #1965 (review)
lastUpdatedBlockNumber: asBigInt("last_updated_block_number", "numeric"),
},
(table) => ({
  pk: primaryKey(table.address, table.tableId, table.keyBytes),
if the primary key is the combination of these three columns, do we need individual indices on `address` and `tableId` for efficient querying on those (where we don't include the `keyBytes` in the query)?
the indexes below this handle that, and I believe that an index on `[address, tableId, key0]` can still benefit a query using `address=... AND tableId=...`
pk: primaryKey(table.address, table.tableId, table.keyBytes),
key0Index: index("key0_index").on(table.address, table.tableId, table.key0),
key1Index: index("key1_index").on(table.address, table.tableId, table.key1),
// TODO: add indices for querying without table ID
// TODO: add indices for querying multiple keys
should we also add an index for `isDeleted` to be able to efficiently use it in `getLogs` instead of having to filter the array in JS?
see comment above
my comments are mostly about performance, which is hard to fully reason about without benchmarks, and it's not blocking for functionality, so I'm happy merging as is. But I think we should prioritise adding a benchmark for `getLogs` on a table with lots of rows so we can more efficiently identify bottlenecks in hot paths.
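A tiny harness along those lines might look like this (a sketch only; the endpoint URL and query parameters are placeholders, not the indexer's actual API):

// Hypothetical micro-benchmark: time repeated requests against a getLogs-style
// endpoint on a database seeded with lots of rows.
async function benchGetLogs(url: string, iterations = 10): Promise<void> {
  const timings: number[] = [];
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    const response = await fetch(url);
    await response.json();
    timings.push(performance.now() - start);
  }
  const avg = timings.reduce((sum, t) => sum + t, 0) / timings.length;
  console.log(`getLogs: avg ${avg.toFixed(1)}ms over ${iterations} runs`);
}

// benchGetLogs("http://localhost:3001/api/logs?chainId=31337&address=0x...");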
closes #1640
The client libs don't use the indexer's new `getLogs` method yet; planning to do that in a follow-up PR.