Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

query on indexed field only returns 29 records #170

Closed
cbrake opened this issue Aug 27, 2020 · 6 comments · Fixed by #171
Closed

query on indexed field only returns 29 records #170

cbrake opened this issue Aug 27, 2020 · 6 comments · Fixed by #171
Labels
bug Something isn't working
Milestone

Comments

@cbrake
Copy link

cbrake commented Aug 27, 2020

If I run a query on an indexed field where there are 100 duplicate records in the database, I only get 29 results. For a similar field that is not indexed, a query returns the full 100 documents. Is this expected?

Source for test: https://github.com/simpleiot/simpleiot/tree/feature-genji/cmd/genji-test2

output of test:

2020/08/27 10:54:50 SELECT * FROM tests WHERE field1 = "test1": documents found: 29, time: 306.057µs
2020/08/27 10:54:50 SELECT * FROM tests WHERE field2 = "test1": documents found: 100, time: 618.836µs
2020/08/27 10:54:50 indexed field returned 29 records, non indexed filed returned 100 records, expected 100 records for both
2020/08/27 10:54:50 All done :-)

test source:

package main

import (
	"log"
	"time"

	"github.com/genjidb/genji"
	"github.com/genjidb/genji/database"
	"github.com/genjidb/genji/document"
)

// Test type
type Test struct {
	Field1 string
	Field2 string
}

// returns true if db already exists
func setup(db *genji.DB) bool {
	err := db.Exec("CREATE TABLE tests;")
	if err != nil {
		if err != database.ErrTableAlreadyExists {
			log.Fatal("error creating tests: ", err)
		}
	}

	err = db.Exec("CREATE INDEX idx_tests_field1 ON tests(field1)")
	if err != nil {
		if err != database.ErrIndexAlreadyExists {
			log.Fatal("error creating index: ", err)
		}
		return true
	}

	return false
}

func populateData(db *genji.DB) {
	// insert first user, then a lot of another user
	t := Test{
		Field1: "test1",
		Field2: "test1",
	}

	count := 100
	for i := 0; i < count; i++ {
		err := db.Exec("INSERT INTO tests VALUES ?", &t)
		if err != nil {
			log.Fatal("Error inserting test: ", err)
		}
	}
}

// returns # of docs found
func query(db *genji.DB, q string) int {
	start := time.Now()

	res, err := db.Query(q)

	if err != nil {
		log.Fatal("query error: ", err)
	}

	defer res.Close()

	count := 0

	err = res.Iterate(func(d document.Document) error {
		t := Test{}
		err := document.StructScan(d, &t)
		if err != nil {
			log.Fatal("Error scanning document: ", err)
		}

		count++

		return nil
	})

	log.Printf("%v: documents found: %v, time: %v", q, count, time.Since(start))
	return count
}

func main() {
	db, err := genji.Open(":memory:")

	if err != nil {
		log.Fatal(err)
	}

	defer db.Close()

	dataExists := setup(db)

	if !dataExists {
		populateData(db)
	}

	// look for the record at the beginning of the collection
	count1 := query(db, `SELECT * FROM tests WHERE field1 = "test1"`)
	count2 := query(db, `SELECT * FROM tests WHERE field2 = "test1"`)

	if count1 != count2 {
		log.Printf("indexed field returned %v records, non indexed filed returned %v records, expected 100 records for both", count1, count2)
	}

	log.Println("All done :-)")
}
@asdine
Copy link
Collaborator

asdine commented Aug 27, 2020

Thank you for the code sample, I can reproduce it, it is a regression brought by 0.7.0.

@asdine asdine added the bug Something isn't working label Aug 27, 2020
@asdine asdine added this to the v0.7.1 milestone Aug 27, 2020
@cbrake
Copy link
Author

cbrake commented Aug 27, 2020

thanks! If it's not too much trouble, can you provide a high level overview of how indexes work? I did some tracing through the code but don't have any experience in DB implementations so had some trouble figuring it out.

@asdine
Copy link
Collaborator

asdine commented Aug 27, 2020

Basically, in Genji at least, an index is a key-value store which stores the value you want to index as the key of that store, and It uses the primary key of the associated document as the value of that store.
There is a slight difference between how list indexes and unique indexes work but that's the general idea.
You can find more information in the index package with the two implementations.

In Genji, each type gets its own separate subtree, as you can see in the implementation of the Index#Set method

@cbrake
Copy link
Author

cbrake commented Aug 27, 2020

noted that if I change the primary key to TEXT, SELECT on indexed field seems to work as expected.

@cbrake
Copy link
Author

cbrake commented Aug 27, 2020

out of curiosity, I created a little program that dumps a bbolt db created with Genji storing the following record and DB setup:

// Test type
type Test struct {
        ID     string
        Field1 string
        Field2 string
}

CREATE TABLE tests (id TEXT PRIMARY KEY NOT NULL);
CREATE INDEX idx_tests_field1 ON tests(field1);

And then inserted 10 duplicate values. After dumping, I get the below. So, it appears the index stores everything in the key and does not use the value at all. In the key, the format is <indexed field data>0x1e<record key>. This is really neat, because then you have a separate document for each index value (which scales indefinitely) vs putting the keys in an array of a single index record -- very elegant! Note, in the key values below, I am only printing the printable chars -- there is a record separator (0x1e) between "hi" and "0" in the index keys.

Is there any chance of things getting confused if the indexed field value has a 0x1e in it?

Bucket:  __genji_indexes
   key=idx_tests_field1, value=unique©indexnameidx_tests_field1tablenametestspathfield1
Bucket:  __genji_tables
   key=tests, value=table_nametestsstore_idt_GXfield_constraintspathidtype@is_primary_keyëis_not_nulléread_only
Bucket:  iidx_tests_field1@
   key=hi0, value=
   key=hi1, value=
   key=hi2, value=
   key=hi3, value=
   key=hi4, value=
   key=hi5, value=
   key=hi6, value=
   key=hi7, value=
   key=hi8, value=
   key=hi9, value=
Bucket:  t_GX
   key=0, value=id0field1hifield2there
   key=1, value=id1field1hifield2there
   key=2, value=id2field1hifield2there
   key=3, value=id3field1hifield2there
   key=4, value=id4field1hifield2there
   key=5, value=id5field1hifield2there
   key=6, value=id6field1hifield2there
   key=7, value=id7field1hifield2there
   key=8, value=id8field1hifield2there
   key=9, value=id9field1hifield2there

@asdine
Copy link
Collaborator

asdine commented Aug 29, 2020

The problem was caused by the separator used 0x1E (which is equal to 30). The current implementation uses it to determine where the value ends and when the key of the document begins, which is not ideal since both the value and the key can contain the byte 0x1E. PR #171 solves the issue.

Regarding the binary, you can also use BoltBrowser which is great to analyze databases created by Genji.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants