
feat(store-sync,store-indexer): schemaless indexer #1965

Merged · 23 commits · Nov 29, 2023

Conversation

@holic (Member) commented Nov 27, 2023

closes #1640

The client libs don't use the indexer's new getLogs method yet; planning to do that in a follow-up PR.

changeset-bot commented Nov 27, 2023

🦋 Changeset detected

Latest commit: b02aedc

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 30 packages
| Name | Type |
| ---- | ---- |
| @latticexyz/common | Major |
| @latticexyz/store-indexer | Major |
| @latticexyz/store-sync | Major |
| @latticexyz/block-logs-stream | Major |
| @latticexyz/cli | Major |
| @latticexyz/config | Major |
| @latticexyz/dev-tools | Major |
| @latticexyz/protocol-parser | Major |
| @latticexyz/store | Major |
| @latticexyz/world-modules | Major |
| @latticexyz/world | Major |
| @latticexyz/react | Major |
| @latticexyz/abi-ts | Major |
| create-mud | Major |
| @latticexyz/ecs-browser | Major |
| @latticexyz/faucet | Major |
| @latticexyz/gas-report | Major |
| @latticexyz/network | Major |
| @latticexyz/noise | Major |
| @latticexyz/phaserx | Major |
| @latticexyz/recs | Major |
| @latticexyz/schema-type | Major |
| @latticexyz/services | Major |
| @latticexyz/solecs | Major |
| solhint-config-mud | Major |
| solhint-plugin-mud | Major |
| @latticexyz/std-client | Major |
| @latticexyz/std-contracts | Major |
| @latticexyz/store-cache | Major |
| @latticexyz/utils | Major |


@holic marked this pull request as ready for review November 28, 2023 12:36
@holic requested a review from alvrs as a code owner November 28, 2023 12:36
.from(tables.chainTable)
.where(eq(tables.chainTable.chainId, chainId))
.execute()
.then((rows) => rows.find(() => true));
Member:

for context, what is the purpose of the (rows) => rows.find(() => true) line?

Member Author:

I'll leave some comments around but this basically sidesteps our lack of noUncheckedIndexedAccess: true.

```ts
const rows = query(...);
const row = rows[0]; // type will just be `Row`, which is inaccurate for an empty array
// vs
const row = rows.find(() => true); // type will be `Row | undefined`
```

Member:

ohh so it's essentially "give me the first item if it exists"; I had read it as "filter"

@@ -27,7 +27,7 @@
"lint": "eslint .",
"start:postgres": "concurrently -n indexer,frontend -c cyan,magenta 'tsx bin/postgres-indexer' 'tsx bin/postgres-frontend'",
"start:postgres:local": "DEBUG=mud:store-sync:createStoreSync DATABASE_URL=postgres://127.0.0.1/postgres RPC_HTTP_URL=http://127.0.0.1:8545 pnpm start:postgres",
"start:postgres:testnet": "DEBUG=mud:store-sync:createStoreSync DATABASE_URL=postgres://127.0.0.1/postgres RPC_HTTP_URL=https://follower.testnet-chain.linfra.xyz pnpm start:postgres",
"start:postgres:testnet": "DEBUG=mud:* DATABASE_URL=postgres://127.0.0.1/postgres RPC_HTTP_URL=https://follower.testnet-chain.linfra.xyz pnpm start:postgres",
Member:

is this change intended to be merged or for dev debug purposes?

const { lastUpdatedBlockNumber } = metadata[0] ?? {};
const tablesWithRecords: TableWithRecords[] = tables.map((table) => {
const records = logs
.filter((log) => getAddress(log.address) === getAddress(table.address) && log.args.tableId === table.tableId)
Member:

should the filter on address and tableId be passed to getLogs to let postgres handle it instead of filtering the result array?

Member Author:

This does less querying than we were doing before but for the same amount of data.

Before, when records were distributed across many tables, we had to query for a list of tables, then query each table individually for its records.

Now we query for all logs (passing the filters to the SQL query), then we just group them by table for the purposes of findAll.

This is mostly a backwards compat thing for findAll with the new table structure. The clients will start using the new getLogs endpoint in a PR coming soon.
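
For reference, a rough sketch of what pushing the address/tableId filters into the SQL query could look like with Drizzle. `db`, `recordsTable`, and `filters` are assumed names for illustration, not the exact code in this PR:

```ts
import { and, eq, or } from "drizzle-orm";

// Sketch: build one WHERE clause from all requested filters so Postgres only
// returns matching logs, which are then grouped per table for findAll.
const conditions = filters.map((filter) =>
  and(eq(recordsTable.address, filter.address), eq(recordsTable.tableId, filter.tableId))
);

const records = await db
  .select()
  .from(recordsTable)
  .where(or(...conditions))
  .execute();
```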

Member:

Ahh gotcha, this part is just grouping the logs by table, not really filtering the data. Even though this is mostly used for backwards compatibility, I wonder whether we should slightly change the logic here to iterate through the logs once and group them by table (i.e. have an object with address/tableId as key and an array as value, iterate over the logs to push each one to the right array, and then do Object.values), instead of iterating through all the logs once per table?
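
A minimal sketch of that single-pass grouping, reusing `logs`, `tables`, and `getAddress` from the snippet above (an illustration of the suggestion, not code from the PR):

```ts
// Group logs by address + tableId in one pass instead of filtering the
// full logs array once per table.
const logsByTable = new Map<string, (typeof logs)[number][]>();
for (const log of logs) {
  const key = `${getAddress(log.address)}:${log.args.tableId}`;
  const group = logsByTable.get(key);
  if (group) {
    group.push(log);
  } else {
    logsByTable.set(key, [log]);
  }
}

const tablesWithRecords = tables.map((table) => ({
  ...table,
  records: logsByTable.get(`${getAddress(table.address)}:${table.tableId}`) ?? [],
}));
```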

Member Author:

will adjust here but I'd prefer to save these kinds of things for an optimization pass where we can set up some measurement tools etc. (assuming we even want to keep supporting this endpoint)

although it will prob be more performant, I find your suggested approach a bit harder to read and maintain, so it'd be good to know what perf we're trading off in terms of readability vs speed/memory

Member:

agree with no premature optimization and instead prioritizing benchmarks to find the actual bottlenecks!

Member Author (@holic, Nov 29, 2023):

ah, the other reason I did it like this is that we need the table schema to decode the log, and I didn't want to create another lookup mapping

going to keep this as-is for readability for now, and can follow up with measurement

.from(tables.chainTable)
.where(eq(tables.chainTable.chainId, chainId))
.execute()
.then((rows) => rows.find(() => true));
Member:

same question here as before, why do we need the .then((rows) => rows.find(() => true)) line?

Comment on lines 46 to 49
blockNumber = bigIntMax(
blockNumber,
records.reduce((max, record) => bigIntMax(max, record.lastUpdatedBlockNumber ?? 0n), 0n)
);
Member:

nit: if we use blockNumber as the initial value for the accumulator, we wouldn't need the outer bigIntMax. No strong opinion; one could argue the intent is clearer with the explicit outer bigIntMax.
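
In code, that nit would look roughly like this (same names as the snippet above):

```ts
// Seed the reduce with blockNumber so the outer bigIntMax is no longer needed.
blockNumber = records.reduce(
  (max, record) => bigIntMax(max, record.lastUpdatedBlockNumber ?? 0n),
  blockNumber
);
```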

Member:

If the lastUpdatedBlockNumber of a record is higher than the chainState.lastUpdatedBlockNumber, would that mean there is an issue with the way we update chainState.lastUpdatedBlockNumber or is there a valid situation where it could happen? If it means a bug we should probably log something and aim to remove this step at some point.

Member Author:

I only did this here because of the order of queries that are run separately, one for the chain's current block number and one for all the records. I could probably do this at query time (i.e. a single query), but I was being lazy about crafting that query since Drizzle is a bit clunky/annoying with joins.

The issue I am solving for is that the indexer may be up to date for a chain but a world may not have had any recent activity. If we query for just records and return the highest block number of those records, that would signal to the client that it needs to start fetching from the RPC from a much older block number than is necessary.

So basically: we get the chain's block number first, as the "minimum" block number value. Then we get all the records and their highest block number. If that's higher than the chain's, we know some data was updated between our two queries and we should use the higher number. If it's lower, we use the chain's block number, because that's how up to date the indexer's data actually is.
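
(For context, bigIntMax is presumably just a small helper along these lines, since Math.max doesn't accept BigInt values:)

```ts
// Presumed shape of the helper used above.
const bigIntMax = (a: bigint, b: bigint): bigint => (a > b ? a : b);
```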

eventName: "Store_SetRecord",
args: {
tableId: record.tableId,
keyTuple: decodeDynamicField("bytes32[]", record.keyBytes),
Member:

at some point we should do benchmarks of this hot path, I hope the decoding here doesn't become a bottleneck if a large number of records is requested

Member Author:

I doubt it; decoding individual fields is fairly lightweight. This one just chunks the concatenated hex into individual bytes32 hex items. There's room to optimize that method, but I doubt this is where our bottleneck is.
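
As a rough illustration of the chunking described here (a simplified sketch, not the actual decodeDynamicField implementation):

```ts
import type { Hex } from "viem";

// Simplified: slice the concatenated hex data into bytes32-sized (64 hex character) items.
function chunkToBytes32Items(data: Hex): Hex[] {
  const hex = data.slice(2); // drop the 0x prefix
  const items: Hex[] = [];
  for (let i = 0; i < hex.length; i += 64) {
    items.push(`0x${hex.slice(i, i + 64)}` as Hex);
  }
  return items;
}
```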

Member:

The reason I'm saying this is that we had performance issues with string manipulation in the past (back in network stack v1 times), but I agree we should definitely benchmark first to find the bottlenecks.

import { setupTables } from "./setupTables";
import { StorageAdapter, StorageAdapterBlock } from "../common";

// Currently assumes one DB per chain ID
Member:

is this comment still accurate?

Member Author:

yes

Member:

I was confused about the chain table, but I now understand the expectation is to have only one row in that table. And since we don't have a chainId column in the data table, we couldn't support multiple chains in the same data table with the current schema.

);

const logs = records
.filter((record) => !record.isDeleted)
Member:

should we just include !isDeleted in the sql query?

Member Author (@holic, Nov 29, 2023):

I intentionally didn't for a few reasons:

  • on the whole, I expect the number of actual deleted records to be low, because it costs gas to delete vs leaving a record laying around
  • doing it at the query level will change the query planning, and I want to make sure we still hit the indexes I've set up here (not to say we can't, but I want to make sure we can make use of the indexes)
  • we'll be doing similarly shaped queries for the update stream (query all records and return whether a record is deleted), and it seems wise to use the same SQL query path for keeping the cache warm

Member:

> on the whole, I expect the number of actual deleted records to be low, because it costs gas to delete vs leaving a record laying around

Independent of the number of deleted records, it would save us one linear pass over the entire logs array if we didn't have to filter for isDeleted after fetching all the logs. If we added another index for isDeleted, we could avoid that linear operation at the cost of more expensive inserts (but since inserts are throttled by the blockchain, write performance is less critical than read performance).

> we'll be doing similar shaped queries for the update stream (query all records and filter or return whether a record is deleted) and seems wise to use the same SQL query path for caching

For the update stream the query would also include the latest block number no? So wouldn't it already be a slightly different query shape?
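
A sketch of that suggestion, in the index/query style shown elsewhere in this PR (`db`, `recordsTable`, and the column names here are assumptions for illustration):

```ts
import { and, eq } from "drizzle-orm";

// Hypothetical extra index so Postgres can serve the non-deleted filter directly,
// added alongside the existing key0/key1 indices in the table definition:
//   isDeletedIndex: index("is_deleted_index").on(table.address, table.tableId, table.isDeleted),

// ...and the corresponding filter pushed into SQL instead of filtering the array in JS:
const records = await db
  .select()
  .from(recordsTable)
  .where(and(eq(recordsTable.address, address), eq(recordsTable.isDeleted, false)))
  .execute();
```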

Member Author (@holic, Nov 29, 2023):

> For the update stream the query would also include the latest block number no? So wouldn't it already be a slightly different query shape?

ah yeah, that's true

Mind if we come back to this? It would be nice to set up some scaffolding for testing/measuring query performance, processing performance (SQL query -> HTTP response), etc., and I don't want to blow up this PR.

Some of this might be easier if we move to static SQL queries (rather than the ORM) that we can run EXPLAIN on and include as part of snapshot testing etc., to make sure indices are being hit in the way we expect.
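
As an illustration of that last idea, a hedged sketch of an EXPLAIN snapshot test (the table/column names and literal values are made up, and `db` is assumed to be the indexer's Drizzle Postgres client):

```ts
import { sql } from "drizzle-orm";
import { expect, test } from "vitest";

test("getLogs query uses the expected index", async () => {
  // Run EXPLAIN on a static query and snapshot the plan so this test fails
  // if Postgres stops using the index we expect. Table/column names and the
  // literal values here are made up for illustration.
  const plan = await db.execute(
    sql`EXPLAIN SELECT * FROM records WHERE address = 'some-address' AND table_id = 'some-table-id'`
  );
  expect(plan).toMatchSnapshot();
});
```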

Member:

yeah definitely, this is not blocking, approved the PR #1965 (review)

lastUpdatedBlockNumber: asBigInt("last_updated_block_number", "numeric"),
},
(table) => ({
pk: primaryKey(table.address, table.tableId, table.keyBytes),
Member:

if the primary key is the combination of these three columns, do we need individual indices on address and tableId for efficient querying on those (where we don't include the keyBytes in the query)?

Member Author:

the indexes below this handle that and I believe that an index on [address, tableId, key0] can still benefit a query using address=... AND tableId=...

Comment on lines +33 to +37
pk: primaryKey(table.address, table.tableId, table.keyBytes),
key0Index: index("key0_index").on(table.address, table.tableId, table.key0),
key1Index: index("key1_index").on(table.address, table.tableId, table.key1),
// TODO: add indices for querying without table ID
// TODO: add indices for querying multiple keys
Member:

should we also add an index for isDeleted to be able to efficiently use it in getLogs instead of having to filter the array in js?

Member Author:

see comment above

alvrs previously approved these changes Nov 29, 2023
@alvrs (Member) left a comment:

My comments are mostly about performance, which is hard to fully reason about without benchmarks, and it's not blocking for functionality, so I'm happy merging as-is. But I think we should prioritise adding a benchmark for getLogs on a table with lots of rows so we can more efficiently identify bottlenecks in hot paths.
