This repository has been archived by the owner on Jun 11, 2024. It is now read-only.

Add a storetheindex delegated provider #158

Merged
merged 8 commits on Feb 16, 2022

Conversation

willscott
Contributor

@willscott willscott commented Feb 2, 2022

This looks a lot like the current delegated provider, but makes the request with the JSON/URL format spoken by the current storetheindex find HTTP server (sketched below).

This PR also connects metrics up to views so that stats on these delegated providers become visible to Prometheus.

The code in providers/storetheindex is a re-homing of this PR, which has an end-to-end test. The go-delegated-routing repo isn't a good home for it, as this is more of a current kludge than the long-term protocol we want to support.
I'm not including the test from that PR in this repo as it depends on the storetheindex codebase, which uses a newer/incompatible version of all the libp2p core dependencies.
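For concreteness, the request/response shape is roughly the following. This is a hedged sketch assembled from the client code and review threads quoted below in this conversation; only MultihashResults, ContextID, and Metadata are named in the PR, so the ProviderResults nesting and everything else here are assumptions, not the authoritative schema.

package stifind

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// findResponse mirrors the JSON the find server returns for a query.
// Only the fields discussed in this PR are listed; the nesting under
// ProviderResults is an assumption.
type findResponse struct {
	MultihashResults []struct {
		ProviderResults []struct {
			ContextID []byte          // opaque provider-supplied context
			Metadata  json.RawMessage // provider-specific metadata
		}
	}
}

// find encodes the whole request in the URL: <endpoint>/<b58-multihash>.
func find(endpoint, b58Multihash string) (*findResponse, error) {
	u := fmt.Sprint(endpoint, "/", b58Multihash)
	resp, err := http.Get(u)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var parsed findResponse
	if err := json.NewDecoder(resp.Body).Decode(&parsed); err != nil {
		return nil, err
	}
	return &parsed, nil
}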

Contributor

@petar petar left a comment

This seems fine. One small bug/typo which I noted. Next steps:

  • We keep this PR in this branch and don't merge it until we test it in production.
  • Tommy has a way of deploying a target commit to a single Hydra machine.

head/head.go (resolved)
Contributor

@aschmahmann aschmahmann left a comment

Just did a very cursory look at this and found a config option bug. I'd recommend running this locally and testing that it works before trying to deploy this into production.

FWIW you may want to use something like https://github.com/aschmahmann/vole to issue DHT queries to a head.

head/head.go (resolved)
providers/storetheindex/findproviders.go (resolved)
@willscott
Contributor Author

Tested locally using vole.
Made queries, and can see:
a) the result from sti
b) the success measure on the Prometheus /metrics HTTP site go up
c) the Wireshark HTTP request/response made to the STI instance

@willscott
Contributor Author

willscott commented Feb 4, 2022

@thattommyhall when you get a chance, can you test with the latest commit in this branch and the config flag

-store-the-index-addr https://a190ab46c53bb433487ff687e39d34b6-795906228.us-east-1.elb.amazonaws.com/

or the env setting:

HYDRA_STORE_THE_INDEX_ADDR=https://a190ab46c53bb433487ff687e39d34b6-795906228.us-east-1.elb.amazonaws.com/

@willscott willscott temporarily deployed to DockerBuilders February 9, 2022 13:51
@willscott willscott temporarily deployed to DockerBuilders February 10, 2022 14:25
@@ -80,7 +80,9 @@ func mergeAddrInfos(infos []peer.AddrInfo) []peer.AddrInfo {
 	}
 	var r []peer.AddrInfo
 	for k, v := range m {
-		r = append(r, peer.AddrInfo{ID: k, Addrs: v})
+		if k.Validate() == nil {
+			r = append(r, peer.AddrInfo{ID: k, Addrs: v})
+		}
Contributor

This seems fine, but why wouldn't we do this check further up next to the if r.Err == nil check when accumulating the addresses before merging them?

Contributor Author

Because we aren't doing an explicit iteration through the AddrInfos / keys during that part of the accumulation.

@aschmahmann
Contributor

This needs to be rebased on master before we can merge it since it has conflicts

@willscott
Contributor Author

That's the merge commit that got pushed this morning, no? GitHub says there are no conflicts.


func (c *client) FindProviders(ctx context.Context, mh multihash.Multihash) ([]peer.AddrInfo, error) {
// encode request in URL
u := fmt.Sprint(c.endpoint.String(), "/", mh.B58String())
Contributor

would it make sense to use multibase b58 encoding here?

Contributor

yes, but this is an existing endpoint which we are planning on replacing with the delegated routing one anyway.

Side note: @willscott you probably want to change the endpoint at some point in the future to use multibase. The cost of not having that one extra character is almost never worth it.

Contributor

@guseggert guseggert Feb 11, 2022

yes, but this is an existing endpoint which we are planning on replacing with the delegated routing one anyway.

Yeah, I was just wondering if we should change the endpoint to use multibase encoding, if it's not too late. If it's getting replaced soon, then disregard :). (And consider doing this for the replacement.)

Contributor Author

This whole setup is the 'works now' variant until go-delegated-routing is solidified and migrated to.
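For reference, a multibase-prefixed variant of the URL encoding could look like the sketch below, using go-multibase. This is illustrative only; the replacement endpoint's actual format is not defined in this PR.

package stifind

import (
	"fmt"

	"github.com/multiformats/go-multibase"
	"github.com/multiformats/go-multihash"
)

// encodeForURL renders the multihash with an explicit multibase prefix
// ('z' for base58btc), so decoders can detect the base from one extra
// leading character instead of assuming bare B58.
func encodeForURL(endpoint string, mh multihash.Multihash) (string, error) {
	enc, err := multibase.Encode(multibase.Base58BTC, mh)
	if err != nil {
		return "", err
	}
	return fmt.Sprint(endpoint, "/", enc), nil
}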

providers/storetheindex/findproviders.go (resolved)
Comment on lines +48 to +50
if len(parsedResponse.MultihashResults) != 1 {
return nil, fmt.Errorf("unexpected number of responses")
}
Contributor

If this always has one, then why is it in an array? Is it expected to change in the future? If so, can we just loop over the array so this can be forwards compatible?

Contributor Author

The query endpoint allows an array of multihashes to be queried; this client only queries for an individual one at a time.
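A sketch of the forwards-compatible loop suggested above, in the style of the quoted snippet (the ProviderResults nesting and the Provider field are assumptions; this thread only shows ContextID and Metadata):

// Loop over every result instead of asserting exactly one, so a future
// batched query still works with this client.
var infos []peer.AddrInfo
for _, res := range parsedResponse.MultihashResults {
	for _, pr := range res.ProviderResults {
		infos = append(infos, pr.Provider) // Provider field assumed
	}
}
return infos, nil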

Comment on lines 69 to 70
ContextID []byte
Metadata json.RawMessage
Contributor

What are these fields for? They look unused

Contributor Author

The response from the indexer node contains these fields, which are used by some providers. They're here for completeness of the message format. There have been some conversations about providers using them in ways that could be relevant here, for instance to express priorities, or to indicate that multiple records with the same ContextID should be de-duplicated.
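For illustration, de-duplicating records that share a ContextID might look like this. The policy is hypothetical (nothing in this PR implements it), and providerResult is a stand-in type carrying the ContextID field shown above.

package stifind

// dedupeByContextID keeps the first record seen for each ContextID.
// This is a hypothetical consumer-side policy, not code from this PR.
func dedupeByContextID(results []providerResult) []providerResult {
	seen := make(map[string]bool)
	var out []providerResult
	for _, pr := range results {
		key := string(pr.ContextID)
		if seen[key] {
			continue // same ContextID: treat as a duplicate record
		}
		seen[key] = true
		out = append(out, pr)
	}
	return out
}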

providers/storetheindex.go (resolved)
main.go (resolved)
k8s/alasybil.yaml (resolved)
head/head.go
if err != nil {
return nil, nil, fmt.Errorf("failed to instantiate delegation client (%w)", err)
}
providerStore = hproviders.CombineProviders(providerStore, hproviders.AddProviderNotSupported(stiProvider))
Contributor

@guseggert guseggert Feb 11, 2022

IIUC this will try the caching provider store concurrently, which we expect to fail, which will then enqueue an async DHT lookup. Those are expensive, will always fail, and will contend for resources (the work queue) with regular DHT queries...is there a way to avoid that?

Contributor

@aschmahmann aschmahmann Feb 11, 2022

Are you saying you're concerned that all the requests that end up being fulfilled by the indexers will result in DHT queries that will likely fail and you're concerned about the load?

If so, we have some options here depending on the semantics we want. One option might be a "fallback provider" that, instead of trying all the providers in parallel, tries them sequentially, moving on only when the previous ones fail. In this case we could then decide to only do a DHT lookup in the event the Datastore and Indexer systems returned no records.

This wouldn't cover every case (e.g. if there's some record in the DHT that we're missing for a given multihash, but the data is separately advertised by the indexers)
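A minimal sketch of that fallback idea, assuming a GetProviders-style read interface (the real provider-store interfaces in this repo may differ, and all names here are illustrative):

package providers

import (
	"context"

	"github.com/libp2p/go-libp2p-core/peer"
)

// providerGetter is the narrow read interface assumed for this sketch.
type providerGetter interface {
	GetProviders(ctx context.Context, key []byte) ([]peer.AddrInfo, error)
}

// fallbackProviders tries each store sequentially and stops at the
// first one that returns records, so a hit from the datastore or the
// indexer never triggers the expensive DHT lookup.
type fallbackProviders struct {
	stores []providerGetter
}

func (f *fallbackProviders) GetProviders(ctx context.Context, key []byte) ([]peer.AddrInfo, error) {
	for _, s := range f.stores {
		infos, err := s.GetProviders(ctx, key)
		if err != nil {
			return nil, err
		}
		if len(infos) > 0 {
			return infos, nil // earlier store answered; skip the rest
		}
	}
	return nil, nil // no store had records; caller may fall back to the DHT
}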

Contributor Author

If there's a change in logic here, it should be in a different PR in order to keep the scope of this one reasonable.

Contributor

Sounds good, but can we agree to not deploy this to the Hydras until this is fixed?

Contributor Author

This change is not making the current situation any worse, right?

Do we have a consensus agreement for something better?

Contributor

I think this change does make it worse, for the reasons listed above.

What @aschmahmann brought up seems like a good compromise. I can make the change if it helps.

Contributor

Agreed with Gus. In practice merging this code will make it worse. The change I proposed should be pretty small though

Contributor Author

I think I wasn't clear earlier: what I meant by 'this change' was that this PR uses the same composition structure as the already-merged delegated routing code. I agree that spinning up load on this path is something we need to watch in case it leads to lots of failing DHT queries, and the proposed change to the composition structure seems good.

  • There isn't going to be a substantial amount of upstream bitswap data loaded into storetheindex in the coming week. It would be useful for providers to begin testing the end-to-end flow, though, so if the additional change is going to take more than this coming week, we should consider whether we can get away without it temporarily.

  • @guseggert if you're able to make the proposed change, that would be great!

head/head.go (resolved)
@willscott
Contributor Author

I think I've responded to the actionable things brought up. Please take another look.

@guseggert
Contributor

I pushed some changes to implement the logic laid out above, and fixed a few other things. Let me know if this works for you.

@guseggert
Contributor

Actually there are some problems with my commit, let me fix them.

@guseggert
Contributor

Okay I think it's correct now, and fixed up the end-to-end test to exercise the StoreTheIndex code path.

Also:

* Reuse the delegated routing HTTP client across all heads
* Don't set arbitrary error strings as Prometheus labels, to avoid hitting the time series limit (see the sketch after this list)
* Unexport some structs
* Rip out the other delegated routing stuff since it's unused & dangerous
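On the Prometheus point: the usual fix is to map errors onto a small fixed set of label values instead of raw error strings, so label cardinality stays bounded. A sketch; the function and label names are hypothetical, not this PR's actual code:

package metricsutil

import (
	"context"
	"errors"
)

// errorLabel buckets an arbitrary error into a bounded set of label
// values, so the metric's error label cannot create unbounded time
// series the way raw error strings would.
func errorLabel(err error) string {
	switch {
	case err == nil:
		return "none"
	case errors.Is(err, context.DeadlineExceeded):
		return "timeout"
	case errors.Is(err, context.Canceled):
		return "canceled"
	default:
		return "other" // never the raw error string
	}
}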
@BigLep

BigLep commented Feb 15, 2022

2022-02-15: @petar will review. Potentially sync with @guseggert.

Contributor

@petar petar left a comment

lgtm + fyi notes

}
if cfg.ProvidersFinder != nil && cfg.StoreTheIndexAddr != "" {
Contributor

I think this sequence of if statements is correct, but for future reference: when your intention is to have mutually-exclusive cases, both for readability and safety, it is best to capture them either as a switch statement or as an "if else" chain. Here a switch statement would be best (Go switch cases don't fall through, so no break is needed):

switch {
case cfg.ProvidersFinder != nil && cfg.StoreTheIndexAddr == "":
	...
case cfg.ProvidersFinder != nil && cfg.StoreTheIndexAddr != "":
	...
case cfg.ProvidersFinder == nil && cfg.StoreTheIndexAddr != "":
	...
default:
	// something is not right
}

if err != nil {
return addrInfos, err
}

if len(addrInfos) > 0 {
recordPrefetches(ctx, "local")
return addrInfos, nil
}

return nil, d.Finder.Find(ctx, d.Router, key, func(ai peer.AddrInfo) {
Contributor

This is probably not worth the effort, but for future reference: it is bad style to reuse the error return value for two different purposes. Above, returned errors indicate errors in the provider stores. Here, errors indicate errors in the finder. Considering that the finder is async functionality, independent of the GetProviders execution path, its error should be logged, not returned.
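A sketch of the suggested alternative, logging the finder error instead of returning it (the logger and surrounding names are assumptions, in the style of the snippet quoted above):

// Inside GetProviders: the finder is best-effort and async, so its
// failure shouldn't be conflated with provider-store errors.
if err := d.Finder.Find(ctx, d.Router, key, func(ai peer.AddrInfo) {
	// handle asynchronously found provider
}); err != nil {
	log.Warnw("failed to start provider find", "key", key, "err", err)
}
return nil, nil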

Contributor

Otherwise this function seems to be a correct implementation of the logic we discussed in colo.

Contributor

I haven't heard of that style guideline before; if a method is supposed to do two things and it can't do one of them, then it's normal to return an error, regardless of which one failed. The finder is best-effort, so an error from the finder here doesn't represent an error in the finding itself; it means that we couldn't even try to find (e.g. in the async case, that the work couldn't be queued for some reason). My intention was for the CachingProviderStore to not know or care that things are happening async, but given that we don't have a sync version of this, I can see how that just adds confusion, so I can rename things a bit here to clarify.

@petar
Contributor

petar commented Feb 16, 2022 via email

@BigLep

BigLep commented Mar 9, 2022

This work will be undone in the future as part of #162.
