Wrong match result with large data #16

lightyen · 2019-08-23T01:59:22Z

I ran a simple test that counts the number of occurrences of every keyword, but the result seems incorrect.
For the benchmark test data, the keyword "abbe" should be counted 36 times, but I only got 10.

func testIohub(dict [][]byte, content []byte) {
	mem := new(runtime.MemStats)
	runtime.GC()
	runtime.ReadMemStats(mem)
	before := mem.TotalAlloc

	var m *iohub.Matcher

	func() {
		defer calcTime(time.Now(), "iohub/ahocorasick [build]")
		m = iohub.NewMatcher()
		for i, bs := range dict {
			m.Insert(bs, i)
		}
		m.Compile()
	}()

	runtime.GC()
	runtime.ReadMemStats(mem)
	after := mem.TotalAlloc
	fmt.Printf("iohub/ahocorasick [mem]\t\t %d KBytes\n", (after-before)/1024)

	func() {
		result := map[string]int{}
		defer func() {
			writeResult("result/iohub.txt", result)
		}()
		defer calcTime(time.Now(), "iohub/ahocorasick [match]")
		it := m.Match(content)
		for it.HasNext() {
			tokens := it.NextMatchItem(content)
			for _, t := range tokens {
				key := m.Key(content, t)
				result[string(key)]++
			}
		}

	}()
}
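
One way to confirm an expected count such as 36 for "abbe" is to compare against a brute-force baseline that does not depend on the matcher at all. The following is a minimal sketch, not part of the original test; the helper names `countOccurrences` and `naiveCounts` are hypothetical, and it assumes every occurrence should be reported, overlapping ones included.

```go
package main

import (
	"bytes"
	"fmt"
)

// countOccurrences returns how many times keyword occurs in content.
// Overlapping occurrences are counted (assumption: the matcher is
// expected to report those as well).
func countOccurrences(content, keyword []byte) int {
	if len(keyword) == 0 {
		return 0
	}
	n := 0
	for off := 0; ; {
		i := bytes.Index(content[off:], keyword)
		if i < 0 {
			return n
		}
		n++
		off += i + 1 // advance one byte past the hit so overlaps are found
	}
}

// naiveCounts maps each keyword found in content to its occurrence count.
// Keywords that never occur are left out of the map.
func naiveCounts(dict [][]byte, content []byte) map[string]int {
	result := map[string]int{}
	for _, kw := range dict {
		if c := countOccurrences(content, kw); c > 0 {
			result[string(kw)] = c
		}
	}
	return result
}

func main() {
	content := []byte("abbe abbey aaa")
	dict := [][]byte{[]byte("abbe"), []byte("aa"), []byte("zzz")}
	fmt.Println(naiveCounts(dict, content)) // map[aa:2 abbe:2]
}
```

This baseline is quadratic in the worst case, so it is only meant for cross-checking a handful of keywords against the result map written by the test above.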

shima-park · 2020-08-07T09:43:46Z

I found the same problem in my own test:

func NewMatcher(dictPath string) *cedar.Matcher {
	m := cedar.NewMatcher()

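	f, err := os.Open(dictPath)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	r := bufio.NewReader(f)
	for {
		l, err := r.ReadBytes('\n')
		if len(l) > 0 {
			// ReadBytes can return a final line together with io.EOF when
			// the file has no trailing newline, so insert before checking err.
			m.Insert(bytes.TrimSpace(l), 1)
		}
		if err != nil {
			break
		}
	}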

	return m
}

func Match(m *cedar.Matcher, key []byte) map[string]interface{} {
	result := map[string]interface{}{}
	resp := m.Match(key)
	for resp.HasNext() {
		items := resp.NextMatchItem(key)
		for _, itr := range items {
			result[string(m.Key(key, itr))] = itr.Value.(int)
		}
	}
	resp.Release()
	return result
}

func main() {
	testCase := []byte("我是一只小白鼠")

	m := NewMatcher("./test1.txt")
	m.Compile()
	fmt.Println("Result:", Match(m, testCase))

	m2 := NewMatcher("./test2.txt")
	m2.Compile()
	fmt.Println("Result2:", Match(m2, testCase))
}

// Output: 
Result: map[一:1 只:1 小:1 小白:1 小白鼠:1 我:1 是:1 白:1 鼠:1]
Result2: map[只:1 小白:1 小白鼠:1 我:1 是:1 白:1 鼠:1]

test1.txt is a subset of test2.txt, but the matches [一:1 小:1] are missing from Result2.

test1.txt

test2.txt
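
Since a dictionary that is a superset should match at least everything the smaller one matches, the regression can be expressed as a set difference between the two result maps. Below is a minimal sketch of that check; `missingKeys` is a hypothetical helper, and the maps use `int` values instead of `interface{}` for brevity. The map literals are copied from the output above.

```go
package main

import (
	"fmt"
	"sort"
)

// missingKeys returns the keys that are present in want but absent
// from got, in sorted order.
func missingKeys(want, got map[string]int) []string {
	var missing []string
	for k := range want {
		if _, ok := got[k]; !ok {
			missing = append(missing, k)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Result: matches with the smaller dictionary (test1.txt).
	result := map[string]int{"一": 1, "只": 1, "小": 1, "小白": 1, "小白鼠": 1, "我": 1, "是": 1, "白": 1, "鼠": 1}
	// Result2: matches with the larger dictionary (test2.txt).
	result2 := map[string]int{"只": 1, "小白": 1, "小白鼠": 1, "我": 1, "是": 1, "白": 1, "鼠": 1}

	fmt.Println(missingKeys(result, result2)) // [一 小]
}
```

An empty slice would be the expected outcome if the larger dictionary behaved correctly.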

echoface · 2022-01-08T00:18:03Z

Same issue here. Based on my tests, the result is affected by Insert order, and part of the result is lost in some cases.
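
The order dependence could be checked mechanically by matching with several random permutations of the same dictionary and comparing the results. The sketch below is an assumption-laden harness, not code from this repository: `matchFunc` is a hypothetical adapter type that would wrap the Insert/Compile/Match sequence shown earlier in this thread, and `naiveMatch` is a brute-force stand-in so the example runs on its own.

```go
package main

import (
	"bytes"
	"fmt"
	"math/rand"
	"reflect"
)

// matchFunc builds a matcher from dict (inserting in the given order)
// and returns keyword -> number of matches in content.
// Hypothetical adapter: wrap the real matcher behind this signature.
type matchFunc func(dict [][]byte, content []byte) map[string]int

// naiveMatch is a brute-force stand-in that is order independent by
// construction. It counts every occurrence, overlapping ones included.
func naiveMatch(dict [][]byte, content []byte) map[string]int {
	result := map[string]int{}
	for _, kw := range dict {
		if len(kw) == 0 {
			continue
		}
		for off := 0; off+len(kw) <= len(content); off++ {
			if bytes.HasPrefix(content[off:], kw) {
				result[string(kw)]++
			}
		}
	}
	return result
}

// orderSensitive reports whether match returns a different result for
// any of `rounds` random permutations of dict than for the original order.
func orderSensitive(match matchFunc, dict [][]byte, content []byte, rounds int, seed int64) bool {
	want := match(dict, content)
	rng := rand.New(rand.NewSource(seed))
	shuffled := append([][]byte(nil), dict...) // copy so dict is not reordered
	for i := 0; i < rounds; i++ {
		rng.Shuffle(len(shuffled), func(a, b int) {
			shuffled[a], shuffled[b] = shuffled[b], shuffled[a]
		})
		if !reflect.DeepEqual(want, match(shuffled, content)) {
			return true
		}
	}
	return false
}

func main() {
	dict := [][]byte{[]byte("小"), []byte("小白"), []byte("小白鼠"), []byte("一")}
	content := []byte("我是一只小白鼠")
	fmt.Println(orderSensitive(naiveMatch, dict, content, 20, 1)) // false
}
```

Passing an adapter around the real matcher instead of `naiveMatch` would return true as soon as one permutation loses or gains a match, which also gives a failing permutation to attach to the report.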

iohub · 2024-01-17T10:57:47Z

I found that this bug is caused by inserting huge words into the cedar trie. I plan to rewrite the cedar trie to fix it.
