Chinese Full Text Search

wolfkdy edited this page Jun 26, 2020 · 4 revisions

Implementation

dds integrates cppJieba as its Chinese tokenizer. Please refer to this file for more details. When a document is inserted, the tokenizer splits it into terms. In the underlying KV store (i.e., WiredTiger or RocksDB), each term is stored as a separate key-value pair: the key is (term + weight) and the value is the RecordId (you can think of it as a row ID).
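The mapping from a document to index entries can be sketched as follows. This is a minimal illustration, not the dds implementation: `tokenize` is a toy stand-in for cppJieba that emits every word of a tiny hypothetical dictionary (`DICT`) found in the text, and the weight (term count divided by total term count) is an assumed placeholder for whatever weight the fts module actually stores.

```javascript
// Hypothetical miniature dictionary; the real tokenizer loads the jieba dict files.
const DICT = new Set(["南京", "南京市", "市长", "长江", "大桥", "长江大桥", "江大桥"]);
const MAX_WORD_LEN = 4;

// Toy stand-in for cppJieba: emit every dictionary word that occurs in the text.
function tokenize(text) {
  const chars = Array.from(text);
  const terms = [];
  for (let i = 0; i < chars.length; i++) {
    for (let len = 2; len <= MAX_WORD_LEN && i + len <= chars.length; len++) {
      const word = chars.slice(i, i + len).join("");
      if (DICT.has(word)) terms.push(word);
    }
  }
  return terms;
}

// Build one key-value pair per distinct term: key = (term, weight), value = RecordId.
// The weight formula here is an assumption made for illustration only.
function indexEntries(recordId, text) {
  const terms = tokenize(text);
  const counts = new Map();
  for (const term of terms) counts.set(term, (counts.get(term) || 0) + 1);
  const entries = [];
  for (const [term, n] of counts) {
    entries.push({ key: { term, weight: n / terms.length }, value: recordId });
  }
  return entries;
}
```

With this toy dictionary, `indexEntries(1, "南京市长")` produces three entries, one each for 南京, 南京市 and 市长, all pointing at RecordId 1.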

When a sentence is queried through the fts module, it is also split into terms. For each query term, an IndexScan stage is invoked to find the best-matching documents already in the database. Several NLP-related algorithms are then used to evaluate the similarity between the query and each matched document.
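The query path can be sketched in the same spirit. The `index` map below is a hypothetical in-memory stand-in for the KV store, the query is assumed to be already tokenized, and summing the stored weights is a deliberately simple placeholder for the similarity algorithms the fts module really uses.

```javascript
// Hypothetical postings: term -> list of (recordId, weight) pairs.
const index = new Map([
  ["南京", [{ recordId: 1, weight: 0.5 }, { recordId: 2, weight: 0.25 }]],
  ["大桥", [{ recordId: 1, weight: 0.25 }, { recordId: 3, weight: 0.5 }]],
]);

// For each query term, scan its postings (the "IndexScan" step) and
// accumulate a per-record score; then rank records by descending score.
function search(queryTerms) {
  const scores = new Map();
  for (const term of queryTerms) {
    for (const { recordId, weight } of index.get(term) || []) {
      scores.set(recordId, (scores.get(recordId) || 0) + weight);
    }
  }
  return [...scores.entries()]
    .map(([recordId, score]) => ({ recordId, score }))
    .sort((a, b) => b.score - a.score);
}
```

Here `search(["南京", "大桥"])` ranks record 1 first because it matches both terms, followed by record 3 and then record 2.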

How to Use

Please refer to MongoDB's fts manuals for details. All the operators that MongoDB supports for fts are supported in our Chinese fts implementation.

> db.quotes1.createIndex(    { content : "text" },    { default_language: "chinese" } )
{
	"createdCollectionAutomatically" : true,
	"numIndexesBefore" : 1,
	"numIndexesAfter" : 2,
	"ok" : 1
}
> db.quotes1.insert({content: "南京市长江大桥"})
WriteResult({ "nInserted" : 1 })
> db.quotes1.insert({content: "南京市长"})
WriteResult({ "nInserted" : 1 })
> db.quotes1.insert({content: "江大桥"})
> db.quotes1.find({$text: {$search: "南京"}} )
{ "_id" : ObjectId("5ef5706a520ee86cb3cd7528"), "content" : "南京市长" }
{ "_id" : ObjectId("5ef5705f520ee86cb3cd7527"), "content" : "南京市长江大桥" }

Configuration

Add the lines below to your mongod YAML configuration file. Download the dictionary files from jieba dict and put them in the directory that fts.dictDir points to. Without these files, Chinese fts is not enabled.

fts:
  dictDir: /path/to/jieba/dict

Notice

MongoDB's fts framework writes all the split terms synchronously when a user performs an insert, which means insert speed degrades heavily when long documents are indexed. You may hand off all your backend service work (e.g., message queue, database, search engine) to MongoDB for convenience, but it is better to find and use the proper tool for each task, especially when the workload scales.