query on indexed field only returns 29 records #170

cbrake · 2020-08-27T15:00:40Z

If I run a query on an indexed field where there are 100 duplicate records in the database, I only get 29 results. For a similar field that is not indexed, a query returns the full 100 documents. Is this expected?

Source for test: https://github.com/simpleiot/simpleiot/tree/feature-genji/cmd/genji-test2

output of test:

2020/08/27 10:54:50 SELECT * FROM tests WHERE field1 = "test1": documents found: 29, time: 306.057µs
2020/08/27 10:54:50 SELECT * FROM tests WHERE field2 = "test1": documents found: 100, time: 618.836µs
2020/08/27 10:54:50 indexed field returned 29 records, non indexed filed returned 100 records, expected 100 records for both
2020/08/27 10:54:50 All done :-)

test source:

package main

import (
	"log"
	"time"

	"github.com/genjidb/genji"
	"github.com/genjidb/genji/database"
	"github.com/genjidb/genji/document"
)

// Test type
type Test struct {
	Field1 string
	Field2 string
}

// returns true if db already exists
func setup(db *genji.DB) bool {
	err := db.Exec("CREATE TABLE tests;")
	if err != nil {
		if err != database.ErrTableAlreadyExists {
			log.Fatal("error creating tests: ", err)
		}
	}

	err = db.Exec("CREATE INDEX idx_tests_field1 ON tests(field1)")
	if err != nil {
		if err != database.ErrIndexAlreadyExists {
			log.Fatal("error creating index: ", err)
		}
		return true
	}

	return false
}

func populateData(db *genji.DB) {
	// insert first user, then a lot of another user
	t := Test{
		Field1: "test1",
		Field2: "test1",
	}

	count := 100
	for i := 0; i < count; i++ {
		err := db.Exec("INSERT INTO tests VALUES ?", &t)
		if err != nil {
			log.Fatal("Error inserting test: ", err)
		}
	}
}

// returns # of docs found
func query(db *genji.DB, q string) int {
	start := time.Now()

	res, err := db.Query(q)

	if err != nil {
		log.Fatal("query error: ", err)
	}

	defer res.Close()

	count := 0

	err = res.Iterate(func(d document.Document) error {
		t := Test{}
		err := document.StructScan(d, &t)
		if err != nil {
			log.Fatal("Error scanning document: ", err)
		}

		count++

		return nil
	})

	log.Printf("%v: documents found: %v, time: %v", q, count, time.Since(start))
	return count
}

func main() {
	db, err := genji.Open(":memory:")

	if err != nil {
		log.Fatal(err)
	}

	defer db.Close()

	dataExists := setup(db)

	if !dataExists {
		populateData(db)
	}

	// look for the record at the beginning of the collection
	count1 := query(db, `SELECT * FROM tests WHERE field1 = "test1"`)
	count2 := query(db, `SELECT * FROM tests WHERE field2 = "test1"`)

	if count1 != count2 {
		log.Printf("indexed field returned %v records, non indexed filed returned %v records, expected 100 records for both", count1, count2)
	}

	log.Println("All done :-)")
}

The text was updated successfully, but these errors were encountered:

asdine · 2020-08-27T15:07:16Z

Thank you for the code sample, I can reproduce it, it is a regression brought by 0.7.0.

cbrake · 2020-08-27T15:17:32Z

thanks! If it's not too much trouble, can you provide a high level overview of how indexes work? I did some tracing through the code but don't have any experience in DB implementations so had some trouble figuring it out.

asdine · 2020-08-27T15:21:52Z

Basically, in Genji at least, an index is a key-value store which stores the value you want to index as the key of that store, and It uses the primary key of the associated document as the value of that store.
There is a slight difference between how list indexes and unique indexes work but that's the general idea.
You can find more information in the index package with the two implementations.

In Genji, each type gets its own separate subtree, as you can see in the implementation of the Index#Set method

cbrake · 2020-08-27T16:19:42Z

noted that if I change the primary key to TEXT, SELECT on indexed field seems to work as expected.

cbrake · 2020-08-27T18:34:13Z

out of curiosity, I created a little program that dumps a bbolt db created with Genji storing the following record and DB setup:

// Test type
type Test struct {
        ID     string
        Field1 string
        Field2 string
}

CREATE TABLE tests (id TEXT PRIMARY KEY NOT NULL);
CREATE INDEX idx_tests_field1 ON tests(field1);

And then inserted 10 duplicate values. After dumping, I get the below. So, it appears the index stores everything in the key and does not use the value at all. In the key, the format is <indexed field data>0x1e<record key>. This is really neat, because then you have a separate document for each index value (which scales indefinitely) vs putting the keys in an array of a single index record -- very elegant! Note, in the key values below, I am only printing the printable chars -- there is a record separator (0x1e) between "hi" and "0" in the index keys.

Is there any chance of things getting confused if the indexed field value has a 0x1e in it?

Bucket:  __genji_indexes
   key=idx_tests_field1, value=unique©indexnameidx_tests_field1tablenametestspathfield1
Bucket:  __genji_tables
   key=tests, value=table_nametestsstore_idt_GXfield_constraintspathidtype@is_primary_keyëis_not_nulléread_only
Bucket:  iidx_tests_field1@
   key=hi0, value=
   key=hi1, value=
   key=hi2, value=
   key=hi3, value=
   key=hi4, value=
   key=hi5, value=
   key=hi6, value=
   key=hi7, value=
   key=hi8, value=
   key=hi9, value=
Bucket:  t_GX
   key=0, value=id0field1hifield2there
   key=1, value=id1field1hifield2there
   key=2, value=id2field1hifield2there
   key=3, value=id3field1hifield2there
   key=4, value=id4field1hifield2there
   key=5, value=id5field1hifield2there
   key=6, value=id6field1hifield2there
   key=7, value=id7field1hifield2there
   key=8, value=id8field1hifield2there
   key=9, value=id9field1hifield2there

asdine · 2020-08-29T19:35:22Z

The problem was caused by the separator used 0x1E (which is equal to 30). The current implementation uses it to determine where the value ends and when the key of the document begins, which is not ideal since both the value and the key can contain the byte 0x1E. PR #171 solves the issue.

Regarding the binary, you can also use BoltBrowser which is great to analyze databases created by Genji.

asdine added the bug Something isn't working label Aug 27, 2020

asdine added this to the v0.7.1 milestone Aug 27, 2020

asdine mentioned this issue Aug 29, 2020

Fix list index implementation #171

Merged

cbrake mentioned this issue Sep 1, 2020

Switch DB layer to Genji simpleiot/simpleiot#81

Closed

2 tasks

asdine closed this as completed in #171 Sep 1, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

query on indexed field only returns 29 records #170

query on indexed field only returns 29 records #170

cbrake commented Aug 27, 2020

asdine commented Aug 27, 2020 •

edited

Loading

cbrake commented Aug 27, 2020

asdine commented Aug 27, 2020 •

edited

Loading

cbrake commented Aug 27, 2020

cbrake commented Aug 27, 2020 •

edited

Loading

asdine commented Aug 29, 2020

query on indexed field only returns 29 records #170

query on indexed field only returns 29 records #170

Comments

cbrake commented Aug 27, 2020

asdine commented Aug 27, 2020 • edited Loading

cbrake commented Aug 27, 2020

asdine commented Aug 27, 2020 • edited Loading

cbrake commented Aug 27, 2020

cbrake commented Aug 27, 2020 • edited Loading

asdine commented Aug 29, 2020

asdine commented Aug 27, 2020 •

edited

Loading

asdine commented Aug 27, 2020 •

edited

Loading

cbrake commented Aug 27, 2020 •

edited

Loading