trim input to TGI, moved clustering and summarization to dataprep and store in DB (#893)

* trim input to TGI, moved clustering and summarization to dataprep and DB store

Signed-off-by: Rita Brugarolas <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* removed inspect_db, which was causing an error in pre-commit

Signed-off-by: Rita Brugarolas <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add HF token to the dataprep container because a tokenizer is now used

Signed-off-by: Rita Brugarolas <[email protected]>

* updated READMEs to reflect latest changes

Signed-off-by: Rita Brugarolas <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* bug fix: all files are ingested and graphs extracted first, followed by one clustering call for the full graph in the database

Signed-off-by: Rita Brugarolas <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update README based on the multi-file fix

Signed-off-by: Rita Brugarolas <[email protected]>

* Changes to make the GraphRAG UI work

Signed-off-by: theresa <[email protected]>

* fix bug: build communities is done once at the end of ingestion

Signed-off-by: Rita Brugarolas <[email protected]>

* minor fixes

Signed-off-by: Rita Brugarolas <[email protected]>

* README fixes

Signed-off-by: Rita Brugarolas <[email protected]>

---------

Signed-off-by: Rita Brugarolas <[email protected]>
Signed-off-by: theresa <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: theresa <[email protected]>
3 people authored Nov 14, 2024
1 parent 517a5b0 commit 0163ea6
Showing 9 changed files with 239 additions and 105 deletions.
4 changes: 2 additions & 2 deletions comps/cores/mega/gateway.py
```diff
@@ -1048,13 +1048,13 @@ def parser_input(data, TypeClass, key):
         if isinstance(response, StreamingResponse):
             return response
         last_node = runtime_graph.all_leaves()[-1]
-        response = result_dict[last_node]["text"]
+        response_content = result_dict[last_node]["choices"][0]["message"]["content"]
         choices = []
         usage = UsageInfo()
         choices.append(
             ChatCompletionResponseChoice(
                 index=0,
-                message=ChatMessage(role="assistant", content=response),
+                message=ChatMessage(role="assistant", content=response_content),
                 finish_reason="stop",
             )
         )
```
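
For context, a minimal sketch of the response shape this change assumes: the leaf node's result now follows the OpenAI-style chat-completion schema rather than exposing a bare `"text"` field. The `"last_node"` key and the content string below are illustrative, not values from the actual service.

```python
# Illustrative only: the leaf-node result is assumed to follow the OpenAI
# chat-completion schema; the key name and content here are made up.
result_dict = {
    "last_node": {
        "choices": [
            {"message": {"role": "assistant", "content": "Generated answer."}}
        ]
    }
}

# Mirrors the indexing introduced in the diff above.
response_content = result_dict["last_node"]["choices"][0]["message"]["content"]
print(response_content)  # -> Generated answer.
```
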
14 changes: 14 additions & 0 deletions comps/dataprep/neo4j/llama_index/README.md
@@ -1,5 +1,14 @@
# Dataprep Microservice with Neo4J

This Dataprep microservice performs:

- Graph extraction (entities, relationships, and descriptions) using an LLM
- hierarchical_leiden clustering to identify communities in the knowledge graph
- Generation of a community summary for each community
- Storage of all of the above in the Neo4j graph DB

This microservice follows the GraphRAG approach defined in the Microsoft paper ["From Local to Global: A Graph RAG Approach to Query-Focused Summarization"](https://www.microsoft.com/en-us/research/publication/from-local-to-global-a-graph-rag-approach-to-query-focused-summarization/), with some differences: 1) only level-zero cluster summaries are leveraged, and 2) the input context to the final answer generation is trimmed to fit the maximum context length.

This dataprep microservice ingests the input files and uses an LLM (TGI, or an OpenAI model when OPENAI_API_KEY is set) to extract entities, relationships, and their descriptions to build a graph-based text index.
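
As an illustration of the clustering step, here is a minimal sketch using the `hierarchical_leiden` implementation from `graspologic` on a toy NetworkX graph. The entity names and `max_cluster_size` value are assumptions for demonstration, not the microservice's actual configuration.

```python
# A minimal sketch of hierarchical Leiden community detection, assuming the
# extracted entities/relationships form a NetworkX graph (toy values below).
import networkx as nx
from graspologic.partition import hierarchical_leiden

graph = nx.Graph()
graph.add_edge("entity_a", "entity_b")  # hypothetical extracted relationships
graph.add_edge("entity_b", "entity_c")
graph.add_edge("entity_c", "entity_a")

partitions = hierarchical_leiden(graph, max_cluster_size=10)
for p in partitions:
    # Each entry records a node, its cluster id, and its hierarchy level;
    # level-zero clusters are the ones this microservice summarizes.
    print(p.node, p.cluster, p.level)
```
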

## Setup Environment Variables
@@ -78,6 +87,11 @@ curl -X POST \
http://${host_ip}:6004/v1/dataprep
```

Please note that clustering of the extracted entities and summarization happen in this data preparation step. As a result:

- Processing time can be long for large datasets: an LLM call is made to summarize each cluster, which may result in a large volume of LLM calls.
- The graph DB `entity_info` and `Cluster` data need to be cleaned if dataprep is run multiple times, since the resulting cluster numbering will differ between consecutive calls and will corrupt the results (a cleanup sketch follows this list).
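
A hedged cleanup sketch using the Neo4j Python driver is shown below. The `entity_info` property and `Cluster` label follow the note above, but the URI, credentials, and exact schema are assumptions to verify against your deployment before running.

```python
# A hypothetical cleanup sketch: drops community artifacts before re-running
# dataprep. The URI, credentials, and entity_info/Cluster names are assumptions.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    session.run("MATCH (n) REMOVE n.entity_info")     # clear stored entity info
    session.run("MATCH (c:Cluster) DETACH DELETE c")  # remove cluster nodes
driver.close()
```
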

We support table extraction from PDF documents. You can specify `process_table` and `table_strategy` with the following commands. `table_strategy` refers to the strategy used to understand tables for table retrieval. As the setting progresses from `fast` to `hq` to `llm`, the focus shifts towards deeper table understanding at the expense of processing speed. The default strategy is `fast`.

Note: If you specify `table_strategy=llm`, the TGI service will be used.
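
For example, a hedged Python sketch of the ingestion call with table extraction enabled. The multipart field names mirror the curl examples in this README but are assumptions to verify against your deployment.

```python
# A usage sketch, assuming the endpoint accepts multipart form fields
# `files`, `process_table`, and `table_strategy` (verify field names).
import requests

host_ip = "localhost"  # assumption: dataprep is reachable locally on port 6004
with open("report.pdf", "rb") as f:
    resp = requests.post(
        f"http://{host_ip}:6004/v1/dataprep",
        files={"files": f},
        data={"process_table": "true", "table_strategy": "hq"},
    )
print(resp.status_code, resp.text)
```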