The big migration: from the Channels to the Knowledge Graph #7398
Channels are a way of organizing content on the Tribler network. Because a channel is created by a human (and not autogenerated), its hierarchy of directories and the placement of torrents within them make sense at least to the channel's creator, if not to the people who subscribe to it. The creator's human intelligence went into organizing the content, which likely took time and effort. While moving away from channels to the knowledge database, we should try to incorporate this user-made organization rather than simply erase it. |
Following a discussion with @grimadas, it has become clear that MeritRank cannot be used for spam prevention with the current tags/knowledge. MeritRank requires peers to interact with each other, whether directly or indirectly. Based on these interactions, peer rank can be calculated, and potential spam can be ignored. As demonstrated by @devos50 in his recent research, Tribler's users are not actively generating tags. Consequently, for the average user, there won't be enough paths in the Knowledge Graph to reach another user and rank their knowledge statements (and, therefore, ignore spam). Thus, the first problem is that we need to encourage users to interact with each other through metadata or content. It remains an open question as to how we can achieve this. Another issue is that MeritRank was not designed specifically for spam prevention, and its effectiveness in addressing this particular problem is uncertain. @grimadas has taken the time to contemplate this and suggest the best way for MeritRank to be utilized. |
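To make the "not enough paths" problem concrete, here is a minimal sketch that checks which peers are reachable from a given peer in an interaction graph. It is only a plain breadth-first reachability check, not MeritRank itself; the graph layout, the peer names and the `reachable` helper are all assumptions made for illustration.

```python
from collections import deque


def reachable(graph, source, max_hops=None):
    """Return the set of peers reachable from `source` via interaction edges.

    `graph` maps a peer id to the peers it has interacted with (hypothetical layout).
    """
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, hops = queue.popleft()
        if max_hops is not None and hops >= max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    seen.discard(source)
    return seen


# Hypothetical interactions: dave creates tags but nobody has interacted with him.
interactions = {
    "alice": ["bob"],
    "bob": ["carol"],
    "dave": [],
}

rankable = reachable(interactions, "alice")
print(sorted(rankable))     # peers whose statements alice could rank
print("dave" in rankable)   # False: no path, so dave's statements stay unranked
```

With few users generating tags, most peers end up in dave's position, which is why more interactions are needed before any path-based ranking can work.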
"Evil fairy" 😈 🧚 😆 : kindly reminds:
|
Good point. MeritRank assumes a fully connected interaction graph. We could spend a long time making a good one. Let's do a minimal viable connectivity graph and recycle what we had operational before. Verified neighbors are a great idea: lowest-latency peers when bootstrapping, implemented as random sampling during startup, or build an age verification service out of this.
Update, an even more minimal design: 👀 👁️ Signed record that I have seen you 👁️ 👀 Simply a cryptographically signed record of who chatted with you. Nothing more. Beyond trivial.
Update: proof-of-rendezvous. No need to complicate it with birth, just proof that you once existed, according to this signed witness statement. |
First draft of division of responsibilities 💥 15-day counter is running for the next RC4 deployment {Tribler 7.14} of an approved "Pong-certificate PR". Feature: I hereby irrefutably certify that I have gotten a message from this public IPv8 identity/key. No zero-knowledge proof yet of past-pongs 🤣 💥
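A rough sketch of what such a rendezvous ("pong") certificate could look like as a data structure. Everything here is an assumption for illustration: the field names, the JSON encoding, and above all the signing primitive. HMAC-SHA256 with a shared key is used only as a stand-in so the sketch runs on the standard library; a real certificate would be signed with the witness's IPv8 private key so that anyone can verify it with the public key.

```python
import hashlib
import hmac
import json
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class RendezvousCertificate:
    witness: str       # id of the peer issuing the statement (hypothetical field)
    subject: str       # id of the peer that was seen (hypothetical field)
    seen_at: int       # unix timestamp of the rendezvous
    signature: bytes   # stand-in signature over the three fields above


def _payload(witness, subject, seen_at):
    # Canonical byte encoding of the signed fields.
    record = {"witness": witness, "subject": subject, "seen_at": seen_at}
    return json.dumps(record, sort_keys=True).encode()


def issue(witness, subject, key, seen_at=None):
    """Create a certificate stating that `witness` got a message from `subject`."""
    seen_at = int(time.time()) if seen_at is None else seen_at
    # Stand-in for a public-key signature (assumption, see lead-in).
    signature = hmac.new(key, _payload(witness, subject, seen_at), hashlib.sha256).digest()
    return RendezvousCertificate(witness, subject, seen_at, signature)


def verify(cert, key):
    """Check that the certificate fields were not altered after signing."""
    expected = hmac.new(
        key, _payload(cert.witness, cert.subject, cert.seen_at), hashlib.sha256
    ).digest()
    return hmac.compare_digest(expected, cert.signature)


cert = issue("witness-key", "subject-key", key=b"demo-secret")
print(verify(cert, b"demo-secret"))   # True
```

The record deliberately carries nothing beyond "who saw whom, and when", in line with the "nothing more" requirement.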
|
We discussed these questions with @grimadas, @InvictusRMC and Georgy and came to understand that the current algorithm of knowledge dissemination results in unpredictable search quality (below is the draft solution of how we can fix it).
How the current channels search works
The contemporary full-text search algorithm from the most recent release combines results from local search with remote results of sending non-forwarding requests to
The logic behind the dissemination of channel metadata is intricate. However, the primary methods for acquiring metadata are the following:
Docs: https://tribler.readthedocs.io/en/latest/metadata_store/channels_architecture.htm
Hence, many nodes in the network possess metadata about popular content due to reasons (1) and (2). As a result, searches for popular content often yield quite relevant results (@ichorid correct me if I'm wrong).
Proposed search algorithm for the 7.14 version
For the
The current algorithm for knowledge dissemination is based on gossip. Information is randomly distributed across the network amongst all peers. This information may not be accessible to a peer when it conducts a search query, as it might not be in its neighborhood, or even within the network at all if the peers holding it are offline. So, nodes with knowledge about the search topic should be in the neighborhood of each node in the network.
Light Nodes
All Tribler clients are Light Nodes by default. Light Nodes store only the minimal information necessary to keep the network responsive. They maintain and compute information about trends, and exchange this data with other nodes. By 'trends', we refer to popular queries. API:
The results returned will be subject to the
There should be an option for a user to switch the type of their node from Light to Full and vice versa.
Full Nodes
Since Light Nodes store only a limited amount of knowledge, they should have access to larger knowledge storage to be able to update their knowledge based on changing trends. Full Nodes store a comprehensive set of metadata. They are capable of responding to both search and metadata queries. API:
The results returned will be subject to the
The discovery of Full Nodes will be conducted in the same manner as Exit Node discovery. As an option, we could implement a rule requiring that at least one Full Node should be within each node's neighborhood to provide search results on less popular content.
Questions:
|
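A back-of-the-envelope sketch of the gossip availability concern from the comment above. It assumes, purely for illustration, that a fraction of peers holds the wanted metadata, that each peer is online independently with some probability, and that a search asks a fixed number of independently chosen neighbours. Under those assumptions the chance that at least one neighbour can answer is 1 - (1 - p)^k, where p is the chance that a single neighbour holds the item and is online. The function and parameter names are hypothetical.

```python
def hit_probability(holder_fraction, online_fraction, neighbours):
    """Chance that at least one queried neighbour holds the item and is online.

    Assumes independent neighbours (a simplification, not a measured property).
    """
    p = holder_fraction * online_fraction
    return 1 - (1 - p) ** neighbours


# Popular content (held by many peers) versus rare content (held by few).
for label, fraction in (("popular", 0.30), ("rare", 0.001)):
    chance = hit_probability(fraction, online_fraction=0.5, neighbours=20)
    print(f"{label}: {chance:.3f}")
```

Under these assumptions popular content is almost always found, while rare content almost never is, which matches the motivation for having at least one node with comprehensive metadata in every neighbourhood.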
Thank you for that solid design sketch for 7.14! Can you make it more minimal and easier, please? Let's leave the long-tail and rare content search problem out. First do a minimal transition from channels to tags. Minimal viable MeritRank spam prevention. Introducing complexities such as full nodes and metadata servers is best left until after we have tags, MeritRank, and the existing PopularityCommunity fully working. For 7.14 we could simply add "rendezvous certificates": somebody signed that you exist. The temptation is obviously to make this stronger using a CRDT plus a grow-only counter: Brief Announcement: The Only Undoable CRDTs are Counters. Is that smart @grimadas ? |
Sure. For 7.14, we can proceed as follows:
|
Updated Plan for Upcoming Releases
7.14
The end of this release may not present noticeable changes for regular Tribler users when comparing version 7.14 to 7.13. The primary search engine and metadata source will continue to be Channels. However, with the extension of Tribler's search API, we as developers can test the new search functionality without disrupting the user experience.
8.0
The MeritRank implementation is difficult to plan at this stage as it's not closely integrated with the Knowledge Graph yet. We've decided that spam prevention and reputation features are not priorities for the upcoming releases. As we gather more information on MeritRank, I'll update the plan accordingly. |
Migration Strategy
For the
We have implemented this migration as an extension of the existing
For the
The full metadata migration will be integrated into the Upgrader class and will form part of the standard upgrade procedure. |
Enhancements in Gossip Protocol
There's an ongoing discussion about how we should refine the knowledge gossip protocol.
Current Algorithm
As it stands, our current algorithm proceeds as follows:
This approach was modeled after the Popularity Community gossip algorithm, assuming its long-standing effectiveness would make it a good starting point for the Knowledge Community.
Potential Improvements
One immediate suggestion to enhance the protocol would be to shift the type of items gossiped. Rather than gossiping single statements, we could share comprehensive information about a specific torrent: its title, tags, and content items. This approach would maintain the integrity of the torrent's information, which could become crucial as we start to use it in the search. However, this is based on intuition and may warrant further evaluation. For instance, this approach might introduce biases favoring statements for torrents with extensive metadata.
Do We Need a Separate Gossip Protocol?
Another significant question is whether a dedicated gossip protocol for knowledge statements is necessary at all. My inclination is that we might be able to leverage the existing gossip mechanisms implemented in the Popularity Community. Here's a brief overview of how the high-level algorithm in the Popularity Community operates:
If a peer receives torrent health info for a torrent whose metadata is missing, the Popularity Community subsequently requests the missing metadata. Given that there is already a mechanism in Tribler for disseminating metadata, it might be sufficient to replace the source of metadata from
Ref: #3868 (comment)
This method could be combined with what intuitively appears to be an efficient way to disseminate metadata, namely including torrent metadata in the Remote Search Results Response. The logic here is straightforward: the more a specific torrent is searched for by Tribler users, the more widely its metadata will be distributed across the network. This mechanism is currently operational within the Channels.
The Experimental Approach vs. The Pragmatic Approach
There's ample opportunity for experimentation here. We could simulate a network with a predefined number of agents, assuming that they have complete and accurate metadata. This would allow us to evaluate the proposed algorithm's impact on search efficiency, network fault tolerance, and the speed of metadata dissemination for both popular and random torrents. Conducting this research rigorously would require the full-time commitment of a dedicated scientist.
Alternatively, we could take a more pragmatic approach by adopting methods that seem effective in existing communities. This would be quicker and easier, but it offers no guarantees that the methods will perform as expected. |
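To give a feel for what the experimental approach could look like at its very smallest, here is a toy simulation of search-driven dissemination: a searcher asks a few random peers, and if any of them holds the metadata, the response carries it and the searcher stores it. The peer count, fanout, popularity split and the `simulate` function are all assumptions made for this sketch; it is nowhere near the rigorous study described above.

```python
import random


def simulate(num_peers=200, fanout=5, rounds=2000, popular_weight=0.9, seed=0):
    """Return, per round, the fraction of peers holding each torrent's metadata."""
    rng = random.Random(seed)
    # holders[t] is the set of peer ids storing metadata for torrent t.
    holders = {"popular": {0}, "rare": {1}}
    history = []
    for _ in range(rounds):
        searcher = rng.randrange(num_peers)
        # Popular torrents are searched for far more often than rare ones.
        torrent = "popular" if rng.random() < popular_weight else "rare"
        # Non-forwarding remote search: ask `fanout` random peers directly.
        asked = rng.sample(range(num_peers), fanout)
        if any(peer in holders[torrent] for peer in asked):
            # The search response includes the metadata, so the searcher keeps it.
            holders[torrent].add(searcher)
        history.append({t: len(h) / num_peers for t, h in holders.items()})
    return history


final = simulate()[-1]
print(final)   # coverage of the popular and the rare torrent after all rounds
```

Varying `fanout`, `popular_weight` or the number of initial holders is a cheap way to explore how quickly metadata spreads before committing to a protocol change.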
The search algorithm
The most recent version of the Channels Search Algorithm was introduced in #7025. It contains two parts:
Remote search could be readily adapted for use with the Knowledge Community, as it is not dependent on the underlying database structure. Adapting the existing local search algorithm, on the contrary, is challenging due to differences in database structure. Here are some potential approaches for implementing search within the Knowledge Graph, along with rough time estimates for each:
These are rough estimates and the actual time required may vary.
DB scheme
The following lines were extracted from #7608: @kozlovsky:
Basically, it is a denormalized table that keeps all the necessary information for the full-text search. But it is not necessary to add it right now in this PR; we can do it later in a separate PR. |
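For readers unfamiliar with the idea, here is a minimal sketch of a denormalized search table. It is not the schema quoted from #7608: every table, column and function name below is hypothetical, and a plain `LIKE` query stands in for a real full-text index to keep the example dependent on the standard library only.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE torrent (infohash TEXT PRIMARY KEY, title TEXT NOT NULL);
        CREATE TABLE statement (infohash TEXT, predicate TEXT, object TEXT);
        -- Denormalized: one row per torrent holding all of its searchable text.
        CREATE TABLE search_index (infohash TEXT PRIMARY KEY, text TEXT NOT NULL);
    """)
    return conn


def rebuild_search_index(conn):
    """Flatten each torrent's title and statement objects into one text row."""
    conn.execute("DELETE FROM search_index")
    conn.execute("""
        INSERT INTO search_index (infohash, text)
        SELECT t.infohash,
               lower(t.title || ' ' || COALESCE(group_concat(s.object, ' '), ''))
        FROM torrent t
        LEFT JOIN statement s ON s.infohash = t.infohash
        GROUP BY t.infohash
    """)


def search(conn, term):
    """Return infohashes whose flattened text contains `term`."""
    rows = conn.execute(
        "SELECT infohash FROM search_index WHERE text LIKE ? ORDER BY infohash",
        (f"%{term.lower()}%",),
    )
    return [row[0] for row in rows]


conn = make_db()
conn.execute("INSERT INTO torrent VALUES ('aa', 'Ubuntu ISO')")
conn.execute("INSERT INTO statement VALUES ('aa', 'TAG', 'linux')")
rebuild_search_index(conn)
print(search(conn, "linux"))   # ['aa']
```

The trade-off is the usual one for denormalization: search touches a single table, at the cost of keeping the flattened rows in sync whenever statements change.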
DB scheme
The
They could be divided into 4 groups:
The final two categories ( Regarding
From a developmental effort perspective, the first option is the most straightforward. However, this choice comes with two significant disadvantages:
The second option seems like a middle ground. While it may not impact speed as drastically as the first, it does blur the lines of abstraction. One could then question why certain knowledge elements are designated as Statements, while others are treated as distinct entities within the same database. This might also raise questions about the database's name (
The third option is the most abstractly clear, separating different data types into distinct databases. However, it introduces an additional database to the current roster (knowledge.db, bandwidth.db, and mds.db, which will eventually be phased out). In terms of performance, it might not be as efficient as the second option. |
No exact database schema definition, |
The exact database schema definition can be found here: tribler/src/tribler/core/components/database/db/layers/knowledge_data_access_layer.py Lines 77 to 144 in 20fb224
The tribler/src/tribler/core/components/knowledge/community/knowledge_payload.py Lines 9 to 21 in 20fb224
tribler/src/tribler/core/components/knowledge/community/knowledge_payload.py Lines 43 to 45 in 20fb224
|
Complexity is the biggest risk of Tribler failure, hence my obsession with it. Quick list of open issues and priorities (ongoing editing). I'm struggling with what to do first. We should collectively think more about this. We need a clear roadmap, with a clear division of responsibility for each team member. Conflicting priorities:
(new architecture picture also discussed in this PopularityCommunity issue) |
The roadmap from the most recent dev meeting:
|
As the migration is not going to happen (we decided to retain some parts of channels, including the Related: |
One brainstorm idea to complete the migration is to also provide tooling for format compatibility and platform compatibility. YouTube Creative Commons content with metadata in the form of tags could be exported and made compatible with the tooling of the BitTorrent swarm world. (TikTok compatibility is still out of scope.) |
This issue describes the migration plan for Tribler's metadata from the Channels to the KnowledgeDatabase. See more information here: #6214
The recent history
Current state
Currently, we have the 7.13 release candidate that includes a migration from the TagDatabase to the KnowledgeDatabase (as it's a crucial step towards a global knowledge graph): #7070
After this release is successfully launched, we will be ready for the next steps.
The direction for the next step could involve completing the replacement of Channels with the global Knowledge Graph.
Future
Questions that should be answered before the big migration:
Should we use Skip Graph?
7.14
mds.db -> knowledge.db
Outcome: Tribler network is ready for testing search by knowledge.
7.14-experimental
Outcome: Search quality evaluation based on users' feedback and our own experience.
7.15
MeritRank for the KnowledgeCommunity
Outcome:
8.0
Outcome: Tribler network migrated from the Channels to the KnowledgeDatabase.