Chinese Full Text Search
dds integrates cppJieba as its Chinese tokenizer. Please refer to this file for more details. When a document is inserted, the tokenizer splits it into several terms. In the underlying key-value store (i.e., WiredTiger or RocksDB), each term becomes a separate key-value pair: the key is (term + weight) and the value is the RecordId (you may think of it as a row ID).
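The insert path described above can be sketched as follows. This is only an illustration under stated assumptions: `DICT`, `tokenize`, and `indexEntries` are hypothetical names, the tokenizer is a toy dictionary lookup rather than cppJieba, and the weight is a placeholder ratio, not the formula MongoDB uses.

```javascript
// Toy stand-in for a tokenizer: emit every dictionary word found at any
// position. The dictionary is hypothetical; cppJieba is far more capable.
const DICT = new Set(["南京", "南京市", "市长", "长江", "大桥", "长江大桥"]);
const MAX_WORD_LEN = 4;

function tokenize(text) {
  const chars = Array.from(text);
  const terms = [];
  for (let start = 0; start < chars.length; start++) {
    for (let len = 2; len <= MAX_WORD_LEN && start + len <= chars.length; len++) {
      const candidate = chars.slice(start, start + len).join("");
      if (DICT.has(candidate)) terms.push(candidate);
    }
  }
  return terms;
}

// Build one key-value pair per distinct term: key = (term, weight),
// value = RecordId. The weight here is an assumed placeholder
// (term count / total terms), not MongoDB's actual weighting.
function indexEntries(recordId, text) {
  const terms = tokenize(text);
  const counts = new Map();
  for (const term of terms) counts.set(term, (counts.get(term) || 0) + 1);
  const entries = [];
  for (const [term, count] of counts) {
    entries.push({ key: { term, weight: count / terms.length }, value: recordId });
  }
  return entries;
}

const entries = indexEntries(1, "南京市长");
console.log(entries);
```

Each inserted document therefore produces as many index writes as it has distinct terms, which is why long documents cost more to insert.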
When a sentence is queried through the fts module, it is also split into terms. For each query term, an IndexScan stage is invoked to find the best-matching documents that already exist in the database. Several NLP-related algorithms are then invoked to evaluate the similarity between the query and each matched document.
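The query path can be sketched in the same spirit. This is a simplified illustration, not the actual stage implementation: `index`, `indexScan`, and `search` are hypothetical names, the postings and weights are hand-picked, the query terms are assumed to have been produced by the tokenizer already, and summing weights stands in for the real similarity algorithms.

```javascript
// Hypothetical hand-built index: term -> postings (RecordId plus weight).
// Record 1 = "南京市长江大桥", record 2 = "南京市长", record 3 = "江大桥".
const index = new Map([
  ["南京", [{ recordId: 1, weight: 0.5 }, { recordId: 2, weight: 0.75 }]],
  ["市长", [{ recordId: 1, weight: 0.5 }, { recordId: 2, weight: 0.75 }]],
  ["大桥", [{ recordId: 1, weight: 0.5 }, { recordId: 3, weight: 1.0 }]],
]);

// One scan per query term, returning that term's postings.
function indexScan(idx, term) {
  return idx.get(term) || [];
}

// Accumulate a score per record across all query terms, then rank.
// Summing weights is an assumed placeholder for the similarity measure.
function search(idx, queryTerms) {
  const scores = new Map();
  for (const term of queryTerms) {
    for (const { recordId, weight } of indexScan(idx, term)) {
      scores.set(recordId, (scores.get(recordId) || 0) + weight);
    }
  }
  return Array.from(scores, ([recordId, score]) => ({ recordId, score }))
    .sort((a, b) => b.score - a.score);
}

const hits = search(index, ["南京"]);
console.log(hits);
```

With these hand-picked weights, the shorter document ranks first for the query term 南京, and record 3 is not returned because it shares no term with the query.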
Please refer to MongoDB's fts manuals for details. All the operators MongoDB supports for fts are also supported in our Chinese fts implementation.
> db.quotes1.createIndex( { content : "text" }, { default_language: "chinese" } )
{
"createdCollectionAutomatically" : true,
"numIndexesBefore" : 1,
"numIndexesAfter" : 2,
"ok" : 1
}
> db.quotes1.insert({content: "南京市长江大桥"})
WriteResult({ "nInserted" : 1 })
> db.quotes1.insert({content: "南京市长"})
WriteResult({ "nInserted" : 1 })
> db.quotes1.insert({content: "江大桥"})
> db.quotes1.find({$text: {$search: "南京"}} )
{ "_id" : ObjectId("5ef5706a520ee86cb3cd7528"), "content" : "南京市长" }
{ "_id" : ObjectId("5ef5705f520ee86cb3cd7527"), "content" : "南京市长江大桥" }
Add the lines below to your mongod's YAML configuration file, download the dict files from jieba dict, and put them in the directory given by fts.dictDir. Without these files, Chinese fts is not enabled.
fts:
  dictDir: /path/to/jieba/dict
MongoDB's fts framework writes all the split terms synchronously when a user performs an insert, which means insert speed degrades heavily when long documents are indexed. You may hand off all the backend service work (i.e., message queue, database, search engine) to MongoDB for convenience, but it is better to find and use the proper tool for each task, especially as the workload scales.