key_info[secondary_key].actual_key_parts does not include primary key on partitioned tables #105

Closed · spetrunia opened this issue Nov 17, 2015 · 16 comments

@spetrunia (Contributor)

The bug was found by Daniel Lee (@dleeyh).

Start with a clean data directory, then start the server with the following
my.cnf settings:

datadir=/home/myrocks/tests/rdbdata/install.db
socket=/tmp/mysql.sock
gdb
language=./share/english
default-storage-engine=rocksdb
skip-innodb
default-tmp-storage-engine=MyISAM
rocksdb
rocksdb_strict_collation_exceptions=.*

Use MariaDB's version of RQG from
http://bazaar.launchpad.net/~elenst/randgen/mariadb-patches/changes (I used revision 1050).

Run this command (adjust host/port/socket as appropriate):

perl gentest.pl \
   --dsn=dbi:mysql:host=localhost:port=3306:mysql_socket=/tmp/mysql.sock:user=root:database=test \
   --gendata=conf/examples/example.zz \
   --grammar=conf/examples/example.yy

It doesn't happen on every run, but on my machine there is about a 30% chance that the server will crash like this:

mysqld: /home/psergey/dev-git/mysql-5.6-rocksd-look10/rocksdb/db/compaction_iterator.cc:201: void rocksdb::CompactionIterator::NextFromInput(): Assertion `false' failed.
Aborted (core dumped)

RQG output will look like this:

#2015-11-17T18:51:57 [5362] GenTest::ErrorFilter(5370) started
#2015-11-17T18:51:58 [5371] Started periodic reporting process...
#2015-11-17T18:51:58 [5370] Query: UPDATE `table100_int_autoinc` SET `col_set_utf8_not_null` = 4 WHERE `col_set_latin1` = 7 LIMIT 8 failed: 1205 Lock wait timeout exceeded; try restarting transaction. Further errors of this kind will be suppressed.
#2015-11-17T18:51:58 [5370] Query: UPDATE `table100_int_autoinc` SET `col_char_utf8_key_default_null` = 6 WHERE `col_set_latin1_key_default_null` < 0 LIMIT 8 failed: 1213 Deadlock found when trying to get lock; try restarting transaction. Further errors of this kind will be suppressed.
#2015-11-17T18:52:01 [5370] Query: UPDATE `table100_key_pk_parts_2_int_autoinc` SET `pk` = 1 WHERE `col_char_utf8` < 8 LIMIT 2 failed: 1062 Duplicate entry '%s' for key '%s'. Further errors of this kind will be suppressed.
#2015-11-17T18:52:06 [5377] Query: DELETE FROM `table10_key_pk_parts_2_int_autoinc` WHERE `col_int_unsigned_default_null` < 0 LIMIT 0 failed: 2013 Lost connection to MySQL server during query
#2015-11-17T18:52:06 [5372] Query: UPDATE `table100_int_autoinc` SET `col_enum_latin1_key` = 5 WHERE `col_set_utf8_key_default_null` < 4 LIMIT 9 failed: 2013 Lost connection to MySQL server during query
#2015-11-17T18:52:06 [5372] Mixer: Server crash or critical failure (STATUS_SERVER_CRASHED) reported at dsn dbi:mysql:host=localhost:port=3311:mysql_socket=/tmp/mysql4.sock:user=root:database=test
#2015-11-17T18:52:06 [5377] Mixer: Server crash or critical failure (STATUS_SERVER_CRASHED) reported at dsn dbi:mysql:host=localhost:port=3311:mysql_socket=/tmp/mysql4.sock:user=root:database=test
@spetrunia (Contributor, Author)

I saved a core file. Stack trace from the failing assert:

(gdb) wher
  #0  0x00007f0740bbbf79 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
  #1  0x00007f0740bbf388 in __GI_abort () at abort.c:89
  #2  0x00007f0740bb4e36 in __assert_fail_base (fmt=0x7f0740d06718 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x16c84d0 "false", file=file@entry=0x16c8410 "/home/psergey/dev-git/mysql-5.6-rocksd-look10/rocksdb/db/compaction_iterator.cc", line=line@entry=201, function=function@entry=0x16c93e0 <rocksdb::CompactionIterator::NextFromInput()::__PRETTY_FUNCTION__> "void rocksdb::CompactionIterator::NextFromInput()") at assert.c:92
  #3  0x00007f0740bb4ee2 in __GI___assert_fail (assertion=0x16c84d0 "false", file=0x16c8410 "/home/psergey/dev-git/mysql-5.6-rocksd-look10/rocksdb/db/compaction_iterator.cc", line=201, function=0x16c93e0 <rocksdb::CompactionIterator::NextFromInput()::__PRETTY_FUNCTION__> "void rocksdb::CompactionIterator::NextFromInput()") at assert.c:101
  #4  0x000000000101f545 in rocksdb::CompactionIterator::NextFromInput (this=0x7f073a8cd950) at /home/psergey/dev-git/mysql-5.6-rocksd-look10/rocksdb/db/compaction_iterator.cc:201
  #5  0x000000000101ed58 in rocksdb::CompactionIterator::Next (this=0x7f073a8cd950) at /home/psergey/dev-git/mysql-5.6-rocksd-look10/rocksdb/db/compaction_iterator.cc:90
  #6  0x00000000010091ef in rocksdb::BuildTable (dbname=..., env=0x1f65560 <rocksdb::Env::Default()::default_env>, ioptions=..., env_options=..., table_cache=0x3843f00, iter=0x7f073a8cde18, meta=0x7f073a8ce810, internal_comparator=..., int_tbl_prop_collector_factories=0x3841330, column_family_id=0, snapshots=..., compression=rocksdb::kNoCompression, compression_opts=..., paranoid_file_checks=false, internal_stats=0x3841a20, io_priority=rocksdb::Env::IO_HIGH, table_properties=0x7f073a8ceb30) at /home/psergey/dev-git/mysql-5.6-rocksd-look10/rocksdb/db/builder.cc:102
  #7  0x0000000000ee1c2b in rocksdb::FlushJob::WriteLevel0Table (this=0x7f073a8ceaa0, mems=..., edit=0x3845720, meta=0x7f073a8ce810) at /home/psergey/dev-git/mysql-5.6-rocksd-look10/rocksdb/db/flush_job.cc:234
  #8  0x0000000000ee0f54 in rocksdb::FlushJob::Run (this=0x7f073a8ceaa0, file_meta=0x7f073a8ce970) at /home/psergey/dev-git/mysql-5.6-rocksd-look10/rocksdb/db/flush_job.cc:145
  #9  0x0000000000e95d67 in rocksdb::DBImpl::FlushMemTableToOutputFile (this=0x37008f0, cfd=0x38412f0, mutable_cf_options=..., made_progress=0x7f073a8cf17f, job_context=0x7f073a8cf1c0, log_buffer=0x7f073a8cf380) at /home/psergey/dev-git/mysql-5.6-rocksd-look10/rocksdb/db/db_impl.cc:1463
  #10 0x0000000000e9b1d0 in rocksdb::DBImpl::BackgroundFlush (this=0x37008f0, made_progress=0x7f073a8cf17f, job_context=0x7f073a8cf1c0, log_buffer=0x7f073a8cf380) at /home/psergey/dev-git/mysql-5.6-rocksd-look10/rocksdb/db/db_impl.cc:2461
  #11 0x0000000000e9b3db in rocksdb::DBImpl::BackgroundCallFlush (this=0x37008f0) at /home/psergey/dev-git/mysql-5.6-rocksd-look10/rocksdb/db/db_impl.cc:2483
  #12 0x0000000000e9adc6 in rocksdb::DBImpl::BGWorkFlush (db=0x37008f0) at /home/psergey/dev-git/mysql-5.6-rocksd-look10/rocksdb/db/db_impl.cc:2410
  #13 0x0000000000f98145 in rocksdb::(anonymous namespace)::PosixEnv::ThreadPool::BGThread (this=0x3701980, thread_id=0) at /home/psergey/dev-git/mysql-5.6-rocksd-look10/rocksdb/util/env_posix.cc:809
  #14 0x0000000000f981e9 in rocksdb::(anonymous namespace)::PosixEnv::ThreadPool::BGThreadWrapper (arg=0x7f070c0489b0) at /home/psergey/dev-git/mysql-5.6-rocksd-look10/rocksdb/util/env_posix.cc:833
  #15 0x00007f0741773182 in start_thread (arg=0x7f073a8d0700) at pthread_create.c:312
  #16 0x00007f0740c8030d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

@spetrunia (Contributor, Author)

Code around the assertion failure:

      // Check whether the current key is valid, not corrupt and the same
      // as the single delete.
      if (input_->Valid() && ParseInternalKey(input_->key(), &next_ikey) &&
          cmp_->Equal(ikey_.user_key, next_ikey.user_key)) {
        // Mixing single deletes and merges is not supported. Consecutive
        // single deletes are not valid.
        if (next_ikey.type != kTypeValue) {
          assert(false);
          status_ =
              Status::InvalidArgument("Put expected after single delete.");
          break;
        }

So this might be a MyRocks issue after all.
Indeed, it looks like compaction saw two SingleDelete commands for the same key:

(gdb) p ikey_
  $30 = {user_key = {data_ = 0x7f073a8cda28 "", size_ = 9}, sequence = 94777, type = rocksdb::kTypeSingleDeletion}
(gdb) p next_ikey
  $32 = {user_key = {data_ = 0x7f073011f7de "", size_ = 9}, sequence = 11265, type = rocksdb::kTypeSingleDeletion}
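
For context: RocksDB's SingleDelete contract requires that every SingleDelete
match exactly one earlier Put for the same key, with no other write for that
key in between. A minimal standalone sketch (not from this issue; the path and
keys are made up) that violates the contract in the same way the compaction
above observed:

#include <cassert>
#include "rocksdb/db.h"

int main() {
  rocksdb::DB* db;
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/singledelete_demo", &db);
  assert(s.ok());

  // Valid usage: exactly one Put, then exactly one SingleDelete.
  db->Put(rocksdb::WriteOptions(), "key1", "value1");
  db->SingleDelete(rocksdb::WriteOptions(), "key1");

  // Invalid usage: a second SingleDelete for the same key with no
  // intervening Put. A later compaction that sees both tombstones can
  // take the assert(false) branch above ("Put expected after single
  // delete").
  db->SingleDelete(rocksdb::WriteOptions(), "key1");

  delete db;
  return 0;
}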

@yoshinorim (Contributor)

@spetrunia Could you paste the duplicate SingleDelete keys and the associated schema definition? Duplicate SingleDelete keys are definitely a bug.

spetrunia changed the title from "Assertion failure in RocksDB" to "Concurrent MyRocks load causing assertion failure in RocksDB" on Nov 17, 2015
@yoshinorim (Contributor)

I could reproduce the assertion failure with log-bin enabled and binlog_format=statement, but I haven't been able to reproduce it with binlog_format=row.

@yoshinorim (Contributor)

Each time I reproduced it, the index id was always 538 (0x0000021A, matching the first four key bytes dumped below), and the index was test.table100_key_pk_parts_2_int_autoinc#P#p1 col_set_utf8_not_null_key.

(gdb) p ikey_->user_key->data_[0]
$1 = 0 '\000'
(gdb) p ikey_->user_key->data_[1]
$2 = 0 '\000'
(gdb) p ikey_->user_key->data_[2]
$3 = 2 '\002'
(gdb) p ikey_->user_key->data_[3]
$4 = 26 '\032'

@yoshinorim (Contributor)

I added debugging logic (hexdumping Put/SingleDelete keys) and noticed that for keys 0000021A0000000080000008 and 0000021A0000002080000040, SingleDelete was called twice but Put was called only once. SingleDelete was called from ha_rocksdb::delete_row(). The statements were as follows:

0000021A0000000080000008 DELETE FROM `table100_key_pk_parts_2_int_autoinc` WHERE `col_int_unsigned_not_null_key` < 6 LIMIT 6
0000021A0000000080000008 DELETE FROM `table100_key_pk_parts_2_int_autoinc` WHERE `pk` < 9 LIMIT 6
0000021A0000002080000040 DELETE FROM `table100_key_pk_parts_2_int_autoinc` WHERE `col_char_utf8` < 3 LIMIT 6
0000021A0000002080000040 DELETE FROM `table100_key_pk_parts_2_int_autoinc` WHERE `col_char_latin1_key` < 4 LIMIT 5
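
A sketch of this kind of debugging logic (the wrapper name here is
hypothetical; the actual change was made inline in the MyRocks write path):
hexdump every SingleDelete key before issuing it, so duplicates become
visible in the log.

#include <cstdio>
#include "rocksdb/db.h"
#include "rocksdb/utilities/write_batch_with_index.h"

// Hypothetical tracing wrapper, not the actual MyRocks code.
void traced_single_delete(rocksdb::WriteBatchWithIndex* batch,
                          rocksdb::ColumnFamilyHandle* cf,
                          const rocksdb::Slice& key) {
  // Slice::ToString(true) hex-encodes the key, producing strings like
  // 0000021A0000000080000008 (index id 0x0000021A, then the key parts).
  fprintf(stderr, "SingleDelete %s\n", key.ToString(true).c_str());
  batch->SingleDelete(cf, key);
}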

yoshinorim added this to the high-pri milestone on Nov 18, 2015
@yoshinorim (Contributor)

I added the following assertion code and verified that it was hit -- so somehow MyRocks called SingleDelete() on a secondary key that did not exist.

ha_rocksdb::delete_row()
+      std::string v;
+      rocksdb::Status s= trx->Get(key_descr[i]->get_cf(), secondary_key_slice, &v);
+      if (s.IsNotFound())
+      {
+        fprintf(stderr, "Not Found! %s %s %lu\n", secondary_key_slice.ToString(true).c_str(), buf, thd->thread_id);
+        abort_with_stack_traces();
+      }
       trx->GetIndexedWriteBatch()->SingleDelete(key_descr[i]->get_cf(),
                                                 secondary_key_slice);

The next step is identifying how this happened.

@yoshinorim (Contributor)

I managed to reproduce the assertion error with the following simple SQL statements. It was not a concurrency issue. I could not reproduce it without partitioning; I assume that is because multiple index ids are assigned for a partitioned primary key.

CREATE TABLE x (
 id INT PRIMARY KEY,
 a set ('a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z')  CHARACTER SET utf8,
 b set ('a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z')  CHARACTER SET utf8 default null,
 c set ('a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z')  CHARACTER SET utf8 not null,
 INDEX (a),
 INDEX (b),
 INDEX (c)
) ENGINE=RocksDB PARTITION BY key (id) partitions 2;

INSERT INTO x (id, b) VALUES (28, 3);
UPDATE x SET id=8 WHERE c < 8 LIMIT 1;
DELETE FROM x WHERE id < 9 LIMIT 6;

Consecutive SingleDelete calls for id=8 then happened. Debugging is easier with the patch from the comment above applied -- aborting if Get() returns no result before calling SingleDelete().

yoshinorim changed the title from "Concurrent MyRocks load causing assertion failure in RocksDB" to "MyRocks assertion failure with Partition" on Nov 18, 2015
@yoshinorim (Contributor)

@spetrunia I think we should discuss how index ids should be allocated on partitioned tables. Currently index_ids differ per partition, but I think the index_id should be the same across partitions, at least for unique/primary keys.

@yoshinorim (Contributor)

I found that with partitioning, key_info[secondary_index_id].actual_key_parts does not include the primary key.

CREATE TABLE y (
 id INT PRIMARY KEY,
 a INT,
 INDEX (a)
) ENGINE=RocksDB PARTITION BY key (id) partitions 2;

Without partitioning:
(gdb) p table_arg->key_info[0].actual_key_parts
$40 = 1
(gdb) p table_arg->key_info[1].actual_key_parts
$41 = 2
With partitioning:
(gdb) p table_arg->key_info[0].actual_key_parts
$42 = 1
(gdb) p table_arg->key_info[1].actual_key_parts
$43 = 1

This affects the logic "if (old_data && !updated_indexes.is_set(i))" at https://github.com/facebook/mysql-5.6/blob/webscalesql-5.6.24.97/storage/rocksdb/ha_rocksdb.cc#L4890-L4891 -- with partitioning, the if clause returns true, and the subsequent SingleDelete and Put are skipped. In other words, with partitioning, if the primary key changed but no secondary key column changed, MyRocks may decide that the whole secondary key is unchanged, even though its primary key part actually changed.
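
A minimal sketch of the fix direction, using plain structs as stand-ins for
MySQL's KEY/KEY_PART_INFO (names here are illustrative, not the actual
patch): treat the secondary key as its user-defined parts plus the PK parts
when deciding whether it changed.

#include <cstddef>

// Illustrative stand-ins for MySQL's key metadata structures.
struct KeyPart { unsigned field_index; };
struct KeyInfo {
  const KeyPart* key_part;
  unsigned actual_key_parts;  // on partitioned tables, misses the PK parts
};

// Returns true if any column of the *extended* secondary key changed:
// the user-defined parts plus the PK parts stored at the end of the key.
bool extended_key_changed(const KeyInfo& sec, const KeyInfo& pk,
                          const bool* field_changed) {
  for (unsigned i = 0; i < sec.actual_key_parts; i++)
    if (field_changed[sec.key_part[i].field_index])
      return true;
  // The missing piece: also check the PK parts, because the stored
  // secondary key embeds the primary key, so a PK change changes the
  // stored key even when no secondary column changed.
  for (unsigned i = 0; i < pk.actual_key_parts; i++)
    if (field_changed[pk.key_part[i].field_index])
      return true;
  return false;
}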

Another side effect is that with partitioning, the covering index was not used in the following case:

mysql> CREATE TABLE y (  id INT PRIMARY KEY,  a INT,  INDEX (a) ) ENGINE=rocksdb PARTITION BY key (id) partitions 2;
Query OK, 0 rows affected (0.18 sec)

mysql> insert into y values (2,2),(3,3),(4,4),(5,5),(6,6),(7,7),(8,8),(9,9);
Query OK, 8 rows affected (0.00 sec)
Records: 8  Duplicates: 0  Warnings: 0

mysql> explain select id, a from y where a=3;
+----+-------------+-------+------+---------------+------+---------+-------+------+-------+
| id | select_type | table | type | possible_keys | key  | key_len | ref   | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+-------+------+-------+
|  1 | SIMPLE      | y     | ref  | a             | a    | 5       | const |    2 | NULL  |
+----+-------------+-------+------+---------------+------+---------+-------+------+-------+
1 row in set (0.00 sec)

In InnoDB, the covering index was used.

yoshinorim changed the title from "MyRocks assertion failure with Partition" to "key_info[secondary_key].actual_key_parts does not include primary key on partitioned tables" on Nov 18, 2015
yoshinorim added the bug label on Nov 18, 2015
@yoshinorim (Contributor)

key_info[secondary_key].actual_key_parts does not include the primary key on partitioned tables -- this happens with both InnoDB and MyRocks. I am still not sure whether it is right to use actual_key_parts to check whether keys are modified/covering.

@spetrunia (Contributor, Author)

@yoshinorim, I'm sorry for the delay:

Most of the code (e.g. encoding/decoding of index tuples) is actually aware of the fact that a secondary index is an "extended key" with the PK columns at the end.
(Besides partitioning, we also hit this problem earlier with unique secondary indexes -- the SQL layer thinks they are not "extended keys" for some reason, while internally they are.)

The only part that's not aware is ha_rocksdb::calc_updated_indexes(). Here is a draft patch that makes calc_updated_indexes() take "extended keys" into account: https://gist.github.com/spetrunia/5c138da0dbcebf0f070c. I would like to take another look at it tomorrow before committing it.

@spetrunia (Contributor, Author)

Found another example where MyRocks fails to take into account the hidden PK
columns at the end of a secondary key on a partitioned table:

create table t1 (
  pk int primary key,
  col1 int,
  col2 int,
  key (col1) comment 'rev:cf1'
) engine=rocksdb partition by hash(pk) partitions 2;

insert into t1 values (1,10,10);
insert into t1 values (2,10,10);

insert into t1 values (11,20,20);
insert into t1 values (12,20,20);
MySQL [test]> explain select * from t1 where col1=10;
+----+-------------+-------+------+---------------+------+---------+-------+------+-------+
| id | select_type | table | type | possible_keys | key  | key_len | ref   | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+-------+------+-------+
|  1 | SIMPLE      | t1    | ref  | col1          | col1 | 5       | const |    2 | NULL  |
+----+-------------+-------+------+---------------+------+---------+-------+------+-------+
1 row in set (0.01 sec)

MySQL [test]>  select * from t1 where col1=10;
Empty set (16.97 sec)

MySQL [test]>  select * from t1 use index () where col1=10;
+----+------+------+
| pk | col1 | col2 |
+----+------+------+
|  2 |   10 |   10 |
|  1 |   10 |   10 |
+----+------+------+
2 rows in set (0.00 sec)

The error happens in ha_rocksdb::read_key_exact(), which is invoked with full_key_match=true; that is wrong for this secondary key.
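
A sketch of the pitfall with a plain RocksDB iterator (hypothetical helper;
the stored secondary key is (index_id, col1, pk) while the lookup supplies
only (index_id, col1)):

#include <memory>
#include <string>
#include "rocksdb/db.h"

// lookup_key encodes only (index_id, col1); stored entries also carry the pk.
bool find_first_match(rocksdb::DB* db, const std::string& lookup_key,
                      bool full_key_match, std::string* found) {
  std::unique_ptr<rocksdb::Iterator> it(
      db->NewIterator(rocksdb::ReadOptions()));
  it->Seek(lookup_key);
  if (!it->Valid())
    return false;
  if (full_key_match && it->key() != rocksdb::Slice(lookup_key))
    // Wrong for an extended key: a stored (index_id, col1, pk) entry is
    // never byte-equal to the (index_id, col1) lookup, so every lookup
    // comes back empty -- exactly the "Empty set" seen above.
    return false;
  if (!full_key_match && !it->key().starts_with(lookup_key))
    return false;  // correct: accept any entry that starts with the prefix
  *found = it->key().ToString();
  return true;
}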

@spetrunia (Contributor, Author)

OK, I figured out why index-only scans don't work in the partitioned case.

ha_rocksdb has ha_rocksdb::pk_can_be_decoded, and ha_rocksdb::table_flags() returns the HA_PRIMARY_KEY_IN_READ_INDEX bit only when pk_can_be_decoded==true.

The problem is that the SQL layer (open_binary_frm() in particular) calls
h->table_flags() before the table fields are available (and so, before
MyRocks can tell whether HA_PRIMARY_KEY_IN_READ_INDEX should be returned).

This was resolved in LevelDB-SE as follows:

  • when the ha_rocksdb object is created, its table_flags() does not return
    HA_PRIMARY_KEY_IN_READ_INDEX.
  • open_binary_frm() fills the table_share fields and eventually calls
    h->index_flags(primary_key, n_parts, TRUE). ha_rocksdb uses this call to
    compute the pk_can_be_decoded attribute.
  • Then, this code was added into open_binary_frm():
          /*
            LevelDB & RocksDB SE: we need to re-read to get correct value of
            HA_PRIMARY_KEY_IN_READ_INDEX
          */
          ha_option= handler_file->ha_table_flags();
  • Right after it, field->part_of_key() is calculated for PK fields.

For non-partitioned tables this worked: handler_file was a ha_rocksdb object,
so calling handler_file->index_flags() updated
handler_file->cached_table_flags, and the subsequent call to
handler_file->ha_table_flags() returned the correct value.

With partitioned tables this doesn't work: ha_rocksdb::index_flags() gets
called, and ha_rocksdb::cached_table_flags is updated, but
ha_partition::cached_table_flags is not.

Also, ha_partition only calls index_flags() for the first partition. (Maybe
it calls it for the other partitions too; I need to check whether non-first
partitions eventually get the correct cached_table_flags. In any case, it's a
minor issue: ::open() is definitely called for all partitions.)
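
A toy model of the caching problem (hypothetical classes that mimic the
described behavior; not actual server code): ha_rocksdb refreshes its own
cached flags when index_flags() runs, but the ha_partition wrapper keeps
serving the stale copy it cached earlier.

#include <cstdint>

const uint64_t HA_PRIMARY_KEY_IN_READ_INDEX = 1;  // stand-in for the real flag

struct HaRocksdb {
  bool pk_can_be_decoded = false;
  uint64_t cached_table_flags = 0;
  uint64_t table_flags() const {
    return pk_can_be_decoded ? HA_PRIMARY_KEY_IN_READ_INDEX : 0;
  }
  uint64_t index_flags(unsigned /*key*/, unsigned /*parts*/, bool /*all*/) {
    pk_can_be_decoded = true;            // decided from the key definition
    cached_table_flags = table_flags();  // ha_rocksdb refreshes its cache
    return 0;
  }
  uint64_t ha_table_flags() const { return cached_table_flags; }
};

struct HaPartition {
  HaRocksdb first_partition;
  uint64_t cached_table_flags = 0;  // cached before index_flags() ever ran
  uint64_t index_flags(unsigned key, unsigned parts, bool all) {
    // Delegates to the first partition, which updates *its* cache, but
    // ha_partition::cached_table_flags is never refreshed afterwards.
    return first_partition.index_flags(key, parts, all);
  }
  uint64_t ha_table_flags() const { return cached_table_flags; }  // stale
};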

@spetrunia (Contributor, Author)

I will try to find a solution for partitioned tables.
