Wrong match result with large data #16

Open
lightyen opened this issue Aug 23, 2019 · 3 comments

lightyen commented Aug 23, 2019

I ran a simple test to count the number of occurrences of every keyword, but the result seems incorrect.
With the benchmark test data, the keyword "abbe" should be counted 36 times, but I only got 10.

func testIohub(dict [][]byte, content []byte) {
	mem := new(runtime.MemStats)
	runtime.GC()
	runtime.ReadMemStats(mem)
	before := mem.TotalAlloc

	var m *iohub.Matcher

	func() {
		defer calcTime(time.Now(), "iohub/ahocorasick [build]")
		m = iohub.NewMatcher()
		for i, bs := range dict {
			m.Insert(bs, i)
		}
		m.Compile()
	}()

	runtime.GC()
	runtime.ReadMemStats(mem)
	after := mem.TotalAlloc
	fmt.Printf("iohub/ahocorasick [mem]\t\t %d KBytes\n", (after-before)/1024)

	func() {
		result := map[string]int{}
		defer func() {
			writeResult("result/iohub.txt", result)
		}()
		defer calcTime(time.Now(), "iohub/ahocorasick [match]")
		it := m.Match(content)
		for it.HasNext() {
			tokens := it.NextMatchItem(content)
			for _, t := range tokens {
				key := m.Key(content, t)
				result[string(key)]++
			}
		}

	}()
}

shima-park commented Aug 7, 2020

I ran into the same problem in my own test:

func NewMatcher(dictPath string) *cedar.Matcher {
	m := cedar.NewMatcher()

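	f, err := os.Open(dictPath)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	r := bufio.NewReader(f)
	for {
		l, err := r.ReadBytes('\n')
		// Insert before checking err so that a final line without a
		// trailing newline is not dropped; skip blank lines.
		if l = bytes.TrimSpace(l); len(l) > 0 {
			m.Insert(l, 1)
		}
		if err != nil {
			break
		}
	}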

	return m
}

func Match(m *cedar.Matcher, key []byte) map[string]interface{} {
	result := map[string]interface{}{}
	resp := m.Match(key)
	for resp.HasNext() {
		items := resp.NextMatchItem(key)
		for _, itr := range items {
			result[string(m.Key(key, itr))] = itr.Value.(int)
		}
	}
	resp.Release()
	return result
}

func main() {
	testCase := []byte("我是一只小白鼠")

	m := NewMatcher("./test1.txt")
	m.Compile()
	fmt.Println("Result:", Match(m, testCase))

	m2 := NewMatcher("./test2.txt")
	m2.Compile()
	fmt.Println("Result2:", Match(m2, testCase))
}
// Output: 
Result: map[一:1 只:1 小:1 小白:1 小白鼠:1 我:1 是:1 白:1 鼠:1]
Result2: map[只:1 小白:1 小白鼠:1 我:1 是:1 白:1 鼠:1]

test1.txt is a subset of test2.txt, but the matches [一:1 小:1] are missing from Result2.

test1.txt

test2.txt


echoface commented Jan 8, 2022

Same issue here. Based on my tests, the result is affected by Insert order, and part of the result is lost in some cases.
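
One way to look for this kind of order dependence is to build the matcher from the same dictionary in several shuffled orders and compare each result with a baseline. The sketch below assumes a hypothetical `matchFunc` adapter that would wrap the real build-and-match code. `naiveMatch` is only a stand-in so that the example runs on its own, and it is order-independent by construction.

```go
package main

import (
	"bytes"
	"fmt"
	"math/rand"
	"reflect"
)

// matchFunc is a hypothetical adapter: build a matcher from dict in the given
// order, run it on content, and return the matched keywords.
type matchFunc func(dict [][]byte, content []byte) map[string]int

// findOrderSensitivity shuffles dict up to trials times and returns the first
// ordering whose result differs from the original ordering, or nil if none.
func findOrderSensitivity(match matchFunc, dict [][]byte, content []byte, trials int, seed int64) [][]byte {
	baseline := match(dict, content)
	rng := rand.New(rand.NewSource(seed)) // seeded so failures are reproducible
	shuffled := make([][]byte, len(dict))
	copy(shuffled, dict)
	for t := 0; t < trials; t++ {
		rng.Shuffle(len(shuffled), func(i, j int) {
			shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
		})
		if !reflect.DeepEqual(match(shuffled, content), baseline) {
			out := make([][]byte, len(shuffled))
			copy(out, shuffled)
			return out
		}
	}
	return nil
}

// naiveMatch is a stand-in matcher (plain substring search), not the library.
func naiveMatch(dict [][]byte, content []byte) map[string]int {
	res := map[string]int{}
	for _, kw := range dict {
		if len(kw) > 0 && bytes.Contains(content, kw) {
			res[string(kw)] = 1
		}
	}
	return res
}

func main() {
	dict := [][]byte{[]byte("一"), []byte("小"), []byte("小白"), []byte("小白鼠")}
	content := []byte("我是一只小白鼠")
	bad := findOrderSensitivity(naiveMatch, dict, content, 20, 1)
	fmt.Println("order-sensitive:", bad != nil) // prints order-sensitive: false
}
```

Swapping `naiveMatch` for a wrapper around the real matcher would turn any returned ordering into a concrete, reproducible failing case.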

iohub (Owner) commented Jan 17, 2024

I found that this bug is caused by inserting huge words into the cedar trie. I plan to rewrite the cedar trie to fix it.
