Merge remote-tracking branch 'upstream/main'
shaoyie committed May 17, 2024
2 parents bb73251 + 434abf2 commit b664bbd
Showing 110 changed files with 7,542 additions and 5,341 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -29,3 +29,4 @@ Cargo.lock
docker/ragflow-logs/
/flask_session
/logs
rag/res/deepdoc
3 changes: 2 additions & 1 deletion Dockerfile
@@ -1,4 +1,4 @@
FROM swr.cn-north-4.myhuaweicloud.com/infiniflow/ragflow-base:v1.0
FROM infiniflow/ragflow-base:v2.0
USER root

WORKDIR /ragflow
@@ -24,6 +24,7 @@ ADD ./deepdoc ./deepdoc
ADD ./rag ./rag

ADD docker/entrypoint.sh ./entrypoint.sh
ADD docker/.env ./
RUN chmod +x ./entrypoint.sh

ENTRYPOINT ["./entrypoint.sh"]
2 changes: 1 addition & 1 deletion Dockerfile.cuda
@@ -1,4 +1,4 @@
FROM swr.cn-north-4.myhuaweicloud.com/infiniflow/ragflow-base:v1.0
FROM infiniflow/ragflow-base:v2.0
USER root

WORKDIR /ragflow
26 changes: 14 additions & 12 deletions README.md
@@ -26,7 +26,19 @@

## 💡 What is RAGFlow?

[RAGFlow](https://demo.ragflow.io) is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding. It offers a streamlined RAG workflow for businesses of any scale, combining LLM (Large Language Models) to provide truthful question-answering capabilities, backed by well-founded citations from various complex formatted data.
[RAGFlow](https://ragflow.io/) is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding. It offers a streamlined RAG workflow for businesses of any scale, combining LLM (Large Language Models) to provide truthful question-answering capabilities, backed by well-founded citations from various complex formatted data.

## 📌 Latest Updates

- 2024-05-15 Integrates OpenAI GPT-4o.
- 2024-05-08 Integrates LLM DeepSeek-V2.
- 2024-04-26 Adds file management.
- 2024-04-19 Supports conversation API ([detail](./docs/conversation_api.md)).
- 2024-04-16 Integrates an embedding model 'bce-embedding-base_v1' from [BCEmbedding](https://github.com/netease-youdao/BCEmbedding), and [FastEmbed](https://github.com/qdrant/fastembed), which is designed specifically for light and speedy embedding.
- 2024-04-11 Supports [Xinference](./docs/xinference.md) for local LLM deployment.
- 2024-04-10 Adds a new layout recognition model for analyzing legal documents.
- 2024-04-08 Supports [Ollama](./docs/ollama.md) for local LLM deployment.
- 2024-04-07 Supports Chinese UI.

## 🌟 Key Features

@@ -56,17 +68,6 @@
- Multiple recall paired with fused re-ranking.
- Intuitive APIs for seamless integration with business.

## 📌 Latest Features

- 2024-05-08 Integrates LLM DeepSeek-V2.
- 2024-04-26 Adds file management.
- 2024-04-19 Supports conversation API ([detail](./docs/conversation_api.md)).
- 2024-04-16 Integrates an embedding model 'bce-embedding-base_v1' from [BCEmbedding](https://github.com/netease-youdao/BCEmbedding), and [FastEmbed](https://github.com/qdrant/fastembed), which is designed specifically for light and speedy embedding.
- 2024-04-11 Supports [Xinference](./docs/xinference.md) for local LLM deployment.
- 2024-04-10 Adds a new layout recognition model for analyzing legal documents.
- 2024-04-08 Supports [Ollama](./docs/ollama.md) for local LLM deployment.
- 2024-04-07 Supports Chinese UI.

## 🔎 System Architecture

<div align="center" style="margin-top:20px;margin-bottom:20px;">
@@ -282,6 +283,7 @@ $ systemctl start nginx
## 📚 Documentation
- [Quickstart](./docs/quickstart.md)
- [FAQ](./docs/faq.md)
## 📜 Roadmap
29 changes: 16 additions & 13 deletions README_ja.md
@@ -26,7 +26,21 @@

## 💡 What is RAGFlow?

[RAGFlow](https://demo.ragflow.io) is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding. By combining LLMs (large language models), it delivers reliable question answering backed by well-grounded citations from data in a wide range of complex formats, providing a RAG workflow suited to businesses of any scale.
[RAGFlow](https://ragflow.io/) is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding. By combining LLMs (large language models), it delivers reliable question answering backed by well-grounded citations from data in a wide range of complex formats, providing a RAG workflow suited to businesses of any scale.

## 📌 Latest Updates

- 2024-05-15 Integrated OpenAI GPT-4o.
- 2024-05-08 Integrated the LLM DeepSeek-V2.
- 2024-04-26 Added the file management feature.
- 2024-04-19 Supports the conversation API ([details](./docs/conversation_api.md)).
- 2024-04-16 Added the embedding model 'bce-embedding-base_v1' from [BCEmbedding](https://github.com/netease-youdao/BCEmbedding).
- 2024-04-16 Added [FastEmbed](https://github.com/qdrant/fastembed), which is designed for light and fast embedding.
- 2024-04-11 Supports [Xinference](./docs/xinference.md) for local LLM deployment.
- 2024-04-10 Added a new layout recognition model for the 'Laws' method.
- 2024-04-08 Supports local deployment of large models with [Ollama](./docs/ollama.md).
- 2024-04-07 Supports a Chinese UI.

## 🌟 Key Features

@@ -56,18 +70,6 @@
- Multiple recall paired with fused re-ranking.
- Intuitive APIs for seamless integration with business.

## 📌 Latest Features

- 2024-05-08
- 2024-04-26 Added the file management feature.
- 2024-04-19 Supports the conversation API ([details](./docs/conversation_api.md)).
- 2024-04-16 Added the embedding model 'bce-embedding-base_v1' from [BCEmbedding](https://github.com/netease-youdao/BCEmbedding).
- 2024-04-16 Added [FastEmbed](https://github.com/qdrant/fastembed), which is designed for light and fast embedding.
- 2024-04-11 Supports [Xinference](./docs/xinference.md) for local LLM deployment.
- 2024-04-10 Added a new layout recognition model for the 'Laws' method.
- 2024-04-08 Supports local deployment of large models with [Ollama](./docs/ollama.md).
- 2024-04-07 Supports a Chinese UI.

## 🔎 System Architecture

<div align="center" style="margin-top:20px;margin-bottom:20px;">
@@ -251,6 +253,7 @@ $ bash ./entrypoint.sh

## 📚 Documentation

- [Quickstart](./docs/quickstart.md)
- [FAQ](./docs/faq.md)

## 📜 Roadmap
26 changes: 14 additions & 12 deletions README_zh.md
@@ -26,7 +26,19 @@

## 💡 What is RAGFlow?

[RAGFlow](https://demo.ragflow.io) is an open-source RAG (Retrieval-Augmented Generation) engine built on deep document understanding. It offers businesses and individuals of any scale a streamlined RAG workflow, combining large language models (LLMs) to deliver reliable question answering with well-grounded citations across users' many complex data formats.
[RAGFlow](https://ragflow.io/) is an open-source RAG (Retrieval-Augmented Generation) engine built on deep document understanding. It offers businesses and individuals of any scale a streamlined RAG workflow, combining large language models (LLMs) to deliver reliable question answering with well-grounded citations across users' many complex data formats.

## 📌 Latest Updates

- 2024-05-15 Integrated the LLM OpenAI GPT-4o.
- 2024-05-08 Integrated the LLM DeepSeek.
- 2024-04-26 Added the file management feature.
- 2024-04-19 Supports the conversation API ([more](./docs/conversation_api.md)).
- 2024-04-16 Integrated the embedding model [BCEmbedding](https://github.com/netease-youdao/BCEmbedding) and [FastEmbed](https://github.com/qdrant/fastembed), which is designed for light and fast embedding.
- 2024-04-11 Supports local deployment of large models with [Xinference](./docs/xinference.md).
- 2024-04-10 Added an underlying model for 'Laws' layout analysis.
- 2024-04-08 Supports local deployment of large models with [Ollama](./docs/ollama.md).
- 2024-04-07 Supports a Chinese UI.

## 🌟 Key Features

@@ -56,17 +68,6 @@
- Multiple recall paired with fused re-ranking.
- Easy-to-use APIs for simple integration into all kinds of enterprise systems.

## 📌 New Features

- 2024-05-08 Integrated the LLM DeepSeek.
- 2024-04-26 Added the file management feature.
- 2024-04-19 Supports the conversation API ([more](./docs/conversation_api.md)).
- 2024-04-16 Integrated the embedding model [BCEmbedding](https://github.com/netease-youdao/BCEmbedding) and [FastEmbed](https://github.com/qdrant/fastembed), which is designed for light and fast embedding.
- 2024-04-11 Supports local deployment of large models with [Xinference](./docs/xinference.md).
- 2024-04-10 Added an underlying model for 'Laws' layout analysis.
- 2024-04-08 Supports local deployment of large models with [Ollama](./docs/ollama.md).
- 2024-04-07 Supports a Chinese UI.

## 🔎 System Architecture

<div align="center" style="margin-top:20px;margin-bottom:20px;">
@@ -271,6 +272,7 @@ $ systemctl start nginx
```
## 📚 Documentation

- [Quickstart](./docs/quickstart.md)
- [FAQ](./docs/faq.md)

## 📜 Roadmap
136 changes: 127 additions & 9 deletions api/apps/api_app.py
@@ -13,19 +13,23 @@
# See the License for the specific language governing permissions and
# limitations under the License.
#
import json
import os
import re
from datetime import datetime, timedelta
from flask import request
from flask import request, Response
from flask_login import login_required, current_user

from api.db import FileType, ParserType
from api.db.db_models import APIToken, API4Conversation
from api.db.db_models import APIToken, API4Conversation, Task
from api.db.services import duplicate_name
from api.db.services.api_service import APITokenService, API4ConversationService
from api.db.services.dialog_service import DialogService, chat
from api.db.services.document_service import DocumentService
from api.db.services.file2document_service import File2DocumentService
from api.db.services.file_service import FileService
from api.db.services.knowledgebase_service import KnowledgebaseService
from api.db.services.task_service import queue_tasks, TaskService
from api.db.services.user_service import UserTenantService
from api.settings import RetCode
from api.utils import get_uuid, current_timestamp, datetime_format
@@ -35,6 +39,9 @@
from api.utils.file_utils import filename_type, thumbnail
from rag.utils.minio_conn import MINIO

from rag.utils.es_conn import ELASTICSEARCH
from rag.nlp import search
from elasticsearch_dsl import Q

def generate_confirmation_token(tenent_id):
serializer = URLSafeTimedSerializer(tenent_id)
@@ -164,6 +171,7 @@ def completion():
e, conv = API4ConversationService.get_by_id(req["conversation_id"])
if not e:
return get_data_error_result(retmsg="Conversation not found!")
if "quote" not in req: req["quote"] = False

msg = []
for m in req["messages"]:
@@ -180,13 +188,45 @@
return get_data_error_result(retmsg="Dialog not found!")
del req["conversation_id"]
del req["messages"]
ans = chat(dia, msg, **req)

if not conv.reference:
conv.reference = []
conv.reference.append(ans["reference"])
conv.message.append({"role": "assistant", "content": ans["answer"]})
API4ConversationService.append_message(conv.id, conv.to_dict())
return get_json_result(data=ans)
conv.message.append({"role": "assistant", "content": ""})
conv.reference.append({"chunks": [], "doc_aggs": []})

def fillin_conv(ans):
nonlocal conv
if not conv.reference:
conv.reference.append(ans["reference"])
else: conv.reference[-1] = ans["reference"]
conv.message[-1] = {"role": "assistant", "content": ans["answer"]}

def stream():
nonlocal dia, msg, req, conv
try:
for ans in chat(dia, msg, True, **req):
fillin_conv(ans)
yield "data:"+json.dumps({"retcode": 0, "retmsg": "", "data": ans}, ensure_ascii=False) + "\n\n"
API4ConversationService.append_message(conv.id, conv.to_dict())
except Exception as e:
yield "data:" + json.dumps({"retcode": 500, "retmsg": str(e),
"data": {"answer": "**ERROR**: "+str(e), "reference": []}},
ensure_ascii=False) + "\n\n"
yield "data:"+json.dumps({"retcode": 0, "retmsg": "", "data": True}, ensure_ascii=False) + "\n\n"

if req.get("stream", True):
resp = Response(stream(), mimetype="text/event-stream")
resp.headers.add_header("Cache-control", "no-cache")
resp.headers.add_header("Connection", "keep-alive")
resp.headers.add_header("X-Accel-Buffering", "no")
resp.headers.add_header("Content-Type", "text/event-stream; charset=utf-8")
return resp
else:
ans = chat(dia, msg, False, **req)
fillin_conv(ans)
API4ConversationService.append_message(conv.id, conv.to_dict())
return get_json_result(data=ans)

except Exception as e:
return server_error_response(e)
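
With `stream` defaulting to true, the completion endpoint now replies as a Server-Sent Events stream: each `data:` line carries a JSON payload whose `data.answer` is the cumulative answer so far, and a final event with `data` set to `true` marks the end of the stream. Below is a minimal client sketch; the base URL, route prefix, and placeholder token and conversation IDs are illustrative assumptions, not values fixed by this diff.

```python
# Minimal sketch of consuming the SSE stream; URL, port, and route prefix
# are assumptions -- adjust to your deployment.
import json
import requests

API_BASE = "http://localhost:9380/api/v1"      # assumed base path
TOKEN = "YOUR_API_TOKEN"                        # API token for this app
CONVERSATION_ID = "YOUR_CONVERSATION_ID"        # an existing API conversation

payload = {
    "conversation_id": CONVERSATION_ID,
    "messages": [{"role": "user", "content": "What is RAGFlow?"}],
    "stream": True,        # default in the new handler; False returns one JSON body
    "quote": False,
}

with requests.post(f"{API_BASE}/completion",
                   headers={"Authorization": f"Bearer {TOKEN}"},
                   json=payload, stream=True) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data:"):
            continue                            # skip blank keep-alive lines
        event = json.loads(line[len("data:"):])
        if event["data"] is True:               # final sentinel event
            break
        print(event["data"]["answer"])          # cumulative answer so far
```
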

Expand Down Expand Up @@ -233,6 +273,13 @@ def upload():
if file.filename == '':
return get_json_result(
data=False, retmsg='No file selected!', retcode=RetCode.ARGUMENT_ERROR)

root_folder = FileService.get_root_folder(tenant_id)
pf_id = root_folder["id"]
FileService.init_knowledgebase_docs(pf_id, tenant_id)
kb_root_folder = FileService.get_kb_folder(tenant_id)
kb_folder = FileService.new_a_file_from_kb(kb.tenant_id, kb.name, kb_root_folder["id"])

try:
if DocumentService.get_doc_count(kb.tenant_id) >= int(os.environ.get('MAX_FILE_NUM_PER_USER', 8192)):
return get_data_error_result(
@@ -264,11 +311,82 @@ def upload():
"size": len(blob),
"thumbnail": thumbnail(filename, blob)
}

form_data=request.form
if "parser_id" in form_data.keys():
if request.form.get("parser_id").strip() in list(vars(ParserType).values())[1:-3]:
doc["parser_id"] = request.form.get("parser_id").strip()
if doc["type"] == FileType.VISUAL:
doc["parser_id"] = ParserType.PICTURE.value
if re.search(r"\.(ppt|pptx|pages)$", filename):
doc["parser_id"] = ParserType.PRESENTATION.value
doc = DocumentService.insert(doc)
return get_json_result(data=doc.to_json())

doc_result = DocumentService.insert(doc)
FileService.add_file_from_kb(doc, kb_folder["id"], kb.tenant_id)
except Exception as e:
return server_error_response(e)

if "run" in form_data.keys():
if request.form.get("run").strip() == "1":
try:
info = {"run": 1, "progress": 0}
info["progress_msg"] = ""
info["chunk_num"] = 0
info["token_num"] = 0
DocumentService.update_by_id(doc["id"], info)
# if str(req["run"]) == TaskStatus.CANCEL.value:
tenant_id = DocumentService.get_tenant_id(doc["id"])
if not tenant_id:
return get_data_error_result(retmsg="Tenant not found!")

#e, doc = DocumentService.get_by_id(doc["id"])
TaskService.filter_delete([Task.doc_id == doc["id"]])
e, doc = DocumentService.get_by_id(doc["id"])
doc = doc.to_dict()
doc["tenant_id"] = tenant_id
bucket, name = File2DocumentService.get_minio_address(doc_id=doc["id"])
queue_tasks(doc, bucket, name)
except Exception as e:
return server_error_response(e)

return get_json_result(data=doc_result.to_json())
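
The upload handler now accepts two optional form fields alongside the uploaded file: `parser_id` to choose a chunking template and `run` set to `"1"` to queue parsing right after the document is stored. A hedged usage sketch follows; the route path and the `kb_name` field used to pick the knowledge base are assumptions inferred from the surrounding handler, not details shown in this hunk.

```python
# Hedged sketch of the extended upload call; the route path and the
# "kb_name" field are assumptions, not spelled out in this hunk.
import requests

API_BASE = "http://localhost:9380/api/v1"       # assumed base path
TOKEN = "YOUR_API_TOKEN"

with open("contract.pdf", "rb") as f:
    resp = requests.post(
        f"{API_BASE}/document/upload",          # assumed route name
        headers={"Authorization": f"Bearer {TOKEN}"},
        files={"file": ("contract.pdf", f)},
        data={
            "kb_name": "my_knowledge_base",     # assumed KB selector field
            "parser_id": "laws",                # optional: must be a valid ParserType value
            "run": "1",                         # optional: queue parsing right away
        },
    )
print(resp.json())
```
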


@manager.route('/list_chunks', methods=['POST'])
# @login_required
def list_chunks():
token = request.headers.get('Authorization').split()[1]
objs = APIToken.query(token=token)
if not objs:
return get_json_result(
data=False, retmsg='Token is not valid!', retcode=RetCode.AUTHENTICATION_ERROR)

form_data = request.form

try:
if "doc_name" in form_data.keys():
tenant_id = DocumentService.get_tenant_id_by_name(form_data['doc_name'])
q = Q("match", docnm_kwd=form_data['doc_name'])

elif "doc_id" in form_data.keys():
tenant_id = DocumentService.get_tenant_id(form_data['doc_id'])
q = Q("match", doc_id=form_data['doc_id'])
else:
return get_json_result(
data=False,retmsg="Can't find doc_name or doc_id"
)

res_es_search = ELASTICSEARCH.search(q,idxnm=search.index_name(tenant_id),timeout="600s")

res = [{} for _ in range(len(res_es_search['hits']['hits']))]

for index , chunk in enumerate(res_es_search['hits']['hits']):
res[index]['doc_name'] = chunk['_source']['docnm_kwd']
res[index]['content'] = chunk['_source']['content_with_weight']
if 'img_id' in chunk['_source'].keys():
res[index]['img_id'] = chunk['_source']['img_id']

except Exception as e:
return server_error_response(e)

return get_json_result(data=res)
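
The new `list_chunks` endpoint authenticates with the same Bearer token, takes either `doc_name` or `doc_id` as form data, and returns the matching chunks retrieved from Elasticsearch. A small client sketch under the same assumptions about the deployment URL and route prefix:

```python
# Sketch of querying the new list_chunks endpoint; only the form fields and
# the Bearer-token scheme are visible in the handler above, the rest is assumed.
import requests

API_BASE = "http://localhost:9380/api/v1"       # assumed base path
TOKEN = "YOUR_API_TOKEN"

resp = requests.post(
    f"{API_BASE}/list_chunks",
    headers={"Authorization": f"Bearer {TOKEN}"},
    data={"doc_name": "contract.pdf"},          # or data={"doc_id": "<document id>"}
)
for chunk in resp.json()["data"]:
    print(chunk["doc_name"], chunk["content"][:80])
```
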
2 changes: 1 addition & 1 deletion api/apps/chunk_app.py
@@ -38,7 +38,7 @@
@manager.route('/list', methods=['POST'])
@login_required
@validate_request("doc_id")
def list():
def list_chunk():
req = request.json
doc_id = req["doc_id"]
page = int(req.get("page", 1))