Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FB8-264: rocksdb.type_* fail in 8.0.23 because of cardinality #2

Closed

Conversation

kamil-holubicki
Copy link

https://jira.percona.com/browse/FB8-264

Cause:
8.0.23 changed the logic in
dd_table_share.cc::prepare_share()
table.cc::open_binary_frm()
In particular add_pk_parts_to_sk() call is made only if handler's table
flag HA_PRIMARY_KEY_IN_READ_INDEX is set.
For RocksDB the flag is by default not set, and is in init_with_fields()
if the table contains primary key.

Solution:
ha_rocksdb::init_with_fields() is sensitive only to Primary Key, move
the call just after we set PK in share.
If during the iteration some unique key is chosen to be PK call
init_with_fields() again and restart the iteration with already chosen
PK.

https://jira.percona.com/browse/FB8-264

Cause:
8.0.23 changed the logic in
dd_table_share.cc::prepare_share()
table.cc::open_binary_frm()
In particular add_pk_parts_to_sk() call is made only if handler's table
flag HA_PRIMARY_KEY_IN_READ_INDEX is set.
For RocksDB the flag is by default not set, and is in init_with_fields()
if the table contains primary key.

Solution:
ha_rocksdb::init_with_fields() is sensitive only to Primary Key, move
the call just after we set PK in share.
If during the iteration some unique key is chosen to be PK call
init_with_fields() again and restart the iteration with already chosen
PK.
inikep pushed a commit that referenced this pull request Oct 20, 2021
Summary:
This is change #2 that ports the SELECT statement parsing from feature-myrocks-bypass branch, including:

1. select_parser class that parses SELECT statements and validates the scenarios are supported, and error with proper message if not supported
2. You can control bypass policy with rocksdb_select_bypass_policy variable = 0 (ALWAYS OFF) / 1 (ALWAYS ON) / 2 (OPT IN) / 3 (OPT OUT).
3. A few metrics for number of SELECT bypass statements executed/rejected (due to not supported) /failed (something wrong in execution)

At this point this only dumps the SELECT statement parsed results without doing anything useful. The next change will be porting the query plan execution and point/range query execution part.

Please ignore the unused attributes for now - they'll be removed in next change once the execution part is ported.

Reference Patch:

facebook@e79b20a
facebook@8072e3d

 ---
Porting Notes:

There are a lot of small changes in this area but nothing too bad:
* Some bypass scenarios no longer fail or fail with different error message because of the new parse tree -> AST conversion process. For example, `SELECT *` now works correctly
* st_select_lex_t -> SELECT_LEX, and many members has changed name, such as where -> where_cond()
* protocol usage has changed to start_row, store, end_row in 8.0, which is much better.
* thd->query() is now a LEX_STRING
* SELECT option check (such as SELECT_DISTINCT) now uses active_options() and only check for SELECT_DISTINCT as there seems to be default value being set most likely due to side effect of having more parse tree processing being done. I checked the flags and it looks like only SELECT_DISTINCT are interesting and needs to be blocked.
* SELECT INTO OUTFILE check is changed to use results->needs_file_priviledge
* True/False const are now represented as Item_func_true/Item_func_false. For now I'm over-generalizing to functions that are const, and expect val_int/etc would do the right thing

Reviewed By: luqun

Differential Revision: D22808305

fbshipit-source-id: 9a51446af58
inikep pushed a commit that referenced this pull request Oct 20, 2021
… enabled

Summary:
For secondaries, when enable_super_log_bin_read_only is on and read_only is on, currently it will forbid to install/uninstall plugin during run time.

install/uninstall plugin doesn't generate event in binlog, although the thread thd contains OPTION_BIN_LOG flag due to log_slave_updates is on by default in secondaries. It should be safe to execute install/uninstall plugin.

the change is to call set_skip_readonly_check() before install/uninstall plugin and call reset_skip_readonly_check()(for completeness) after install/uninstall plugin.

BTW, mysql will always call reset_skip_readonly_check() for at the beginning of each statement. thus set_skip_readonly_check() won't affect other statement.
```
#0  THD::reset_skip_readonly_check (this=0x7fb5c1a0b000) at /home/luqun/mysql/mysql-8.0.20/sql/sql_class.h:1754
#1  0x0000000005a500e1 in THD::reset_for_next_command (this=0x7fb5c1a0b000) at /home/luqun/mysql/mysql-8.0.20/sql/sql_parse.cc:5892
#2  0x0000000005a517a5 in mysql_reset_thd_for_next_command (thd=0x7fb5c1a0b000) at /home/luqun/mysql/mysql-8.0.20/sql/sql_parse.cc:5817
#3  0x0000000005a50466 in mysql_parse (thd=0x7fb5c1a0b000, parser_state=0x7fb5f6bb4560, last_timer=0x7fb5f6bb39b0) at /home/luqun/mysql/mysql-8.0.20/sql/sql_parse.cc:6056
#4  0x0000000005a4c7c9 in dispatch_command (thd=0x7fb5c1a0b000, com_data=0x7fb5f6bb4d98, command=COM_QUERY) at /home/luqun/mysql/mysql-8.0.20/sql/sql_parse.cc:2222
#5  0x0000000005a4f991 in do_command (thd=0x7fb5c1a0b000) at /home/luqun/mysql/mysql-8.0.20/sql/sql_parse.cc:1556
#6  0x0000000005ccd4f1 in handle_connection (arg=0x7fb5cc85b740) at /home/luqun/mysql/mysql-8.0.20/sql/conn_handler/connection_handler_per_thread.cc:330
#7  0x00000000078eb95b in pfs_spawn_thread (arg=0x7fb5f8c89720) at /home/luqun/mysql/mysql-8.0.20/storage/perfschema/pfs.cc:2884
#8  0x00007fb5f957020c in start_thread (arg=0x7fb5f6bb6700) at pthread_create.c:479
#9  0x00007fb5f971881f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
```

Reviewed By: george-reynya

Differential Revision: D27213990

fbshipit-source-id: 68a234fb694
@inikep
Copy link
Owner

inikep commented Oct 21, 2021

@inikep inikep closed this Oct 21, 2021
@inikep inikep self-requested a review October 21, 2021 09:14
Copy link
Owner

@inikep inikep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

inikep pushed a commit that referenced this pull request Oct 21, 2021
Summary:
This is change #2 that ports the SELECT statement parsing from feature-myrocks-bypass branch, including:

1. select_parser class that parses SELECT statements and validates the scenarios are supported, and error with proper message if not supported
2. You can control bypass policy with rocksdb_select_bypass_policy variable = 0 (ALWAYS OFF) / 1 (ALWAYS ON) / 2 (OPT IN) / 3 (OPT OUT).
3. A few metrics for number of SELECT bypass statements executed/rejected (due to not supported) /failed (something wrong in execution)

At this point this only dumps the SELECT statement parsed results without doing anything useful. The next change will be porting the query plan execution and point/range query execution part.

Please ignore the unused attributes for now - they'll be removed in next change once the execution part is ported.

Reference Patch:

facebook@e79b20a
facebook@8072e3d

 ---
Porting Notes:

There are a lot of small changes in this area but nothing too bad:
* Some bypass scenarios no longer fail or fail with different error message because of the new parse tree -> AST conversion process. For example, `SELECT *` now works correctly
* st_select_lex_t -> SELECT_LEX, and many members has changed name, such as where -> where_cond()
* protocol usage has changed to start_row, store, end_row in 8.0, which is much better.
* thd->query() is now a LEX_STRING
* SELECT option check (such as SELECT_DISTINCT) now uses active_options() and only check for SELECT_DISTINCT as there seems to be default value being set most likely due to side effect of having more parse tree processing being done. I checked the flags and it looks like only SELECT_DISTINCT are interesting and needs to be blocked.
* SELECT INTO OUTFILE check is changed to use results->needs_file_priviledge
* True/False const are now represented as Item_func_true/Item_func_false. For now I'm over-generalizing to functions that are const, and expect val_int/etc would do the right thing

Reviewed By: luqun

Differential Revision: D22808305

fbshipit-source-id: 9a51446af58
inikep pushed a commit that referenced this pull request Oct 21, 2021
… enabled

Summary:
For secondaries, when enable_super_log_bin_read_only is on and read_only is on, currently it will forbid to install/uninstall plugin during run time.

install/uninstall plugin doesn't generate event in binlog, although the thread thd contains OPTION_BIN_LOG flag due to log_slave_updates is on by default in secondaries. It should be safe to execute install/uninstall plugin.

the change is to call set_skip_readonly_check() before install/uninstall plugin and call reset_skip_readonly_check()(for completeness) after install/uninstall plugin.

BTW, mysql will always call reset_skip_readonly_check() for at the beginning of each statement. thus set_skip_readonly_check() won't affect other statement.
```
#0  THD::reset_skip_readonly_check (this=0x7fb5c1a0b000) at /home/luqun/mysql/mysql-8.0.20/sql/sql_class.h:1754
#1  0x0000000005a500e1 in THD::reset_for_next_command (this=0x7fb5c1a0b000) at /home/luqun/mysql/mysql-8.0.20/sql/sql_parse.cc:5892
#2  0x0000000005a517a5 in mysql_reset_thd_for_next_command (thd=0x7fb5c1a0b000) at /home/luqun/mysql/mysql-8.0.20/sql/sql_parse.cc:5817
#3  0x0000000005a50466 in mysql_parse (thd=0x7fb5c1a0b000, parser_state=0x7fb5f6bb4560, last_timer=0x7fb5f6bb39b0) at /home/luqun/mysql/mysql-8.0.20/sql/sql_parse.cc:6056
#4  0x0000000005a4c7c9 in dispatch_command (thd=0x7fb5c1a0b000, com_data=0x7fb5f6bb4d98, command=COM_QUERY) at /home/luqun/mysql/mysql-8.0.20/sql/sql_parse.cc:2222
#5  0x0000000005a4f991 in do_command (thd=0x7fb5c1a0b000) at /home/luqun/mysql/mysql-8.0.20/sql/sql_parse.cc:1556
#6  0x0000000005ccd4f1 in handle_connection (arg=0x7fb5cc85b740) at /home/luqun/mysql/mysql-8.0.20/sql/conn_handler/connection_handler_per_thread.cc:330
#7  0x00000000078eb95b in pfs_spawn_thread (arg=0x7fb5f8c89720) at /home/luqun/mysql/mysql-8.0.20/storage/perfschema/pfs.cc:2884
#8  0x00007fb5f957020c in start_thread (arg=0x7fb5f6bb6700) at pthread_create.c:479
#9  0x00007fb5f971881f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
```

Reviewed By: george-reynya

Differential Revision: D27213990

fbshipit-source-id: 68a234fb694
inikep pushed a commit that referenced this pull request Oct 22, 2021
Summary:
several testcase failed in asan debug,

==106552==ERROR: AddressSanitizer: stack-buffer-underflow on address 0x7f38f3590078 at pc 0x0000093aa495 bp 0x7f38f358ffc0 sp 0x7f38f358ffb8
READ of size 4 at 0x7f38f3590078 thread T203
    #0 0x93aa494 in _db_enter_(char const*, int, char const*, unsigned int, _db_stack_frame_*)
    #1 0x8d0b9bb in Relay_log_info::remove_logged_gtids
    #2 0x8b1ec3f in trim_logged_gtid
    #3 0x8c767cf in process_raft_queue

If Use DBUG_ENTER, then you need to use DBUG_RETURN to pop current frame in CODE_STATE.
If use DBUG_TRACE, it will pop current frame during .dtor

ps. small refactor changes for sql/rpl_handler.cc

Reviewed By: bhatvinay

Differential Revision: D28809418

fbshipit-source-id: db2ac9c92f8
inikep pushed a commit that referenced this pull request Oct 22, 2021
Summary:
asandebug complain there are memory leaks during MYSQL_BIN_LOG open

Direct leak of 50 byte(s) in 1 object(s) allocated from:
    #0 0x67460ef in malloc
    #1 0x93f0777 in my_raw_malloc(unsigned long, int)
    #2 0x93f064a in my_malloc(unsigned int, unsigned long, int)
    #3 0x93f0eb0 in my_strdup(unsigned int, char const*, int)
    #4 0x8af01a6 in MYSQL_BIN_LOG::open(unsigned int, char const*, char const*, unsigned int)
    #5 0x8af8064 in MYSQL_BIN_LOG::open_binlog(char const*, char const*, unsigned long, bool, bool, bool, Format_description_log_event*, unsigned int, RaftRotateInfo*, bool)
    #6 0x8b00c00 in MYSQL_BIN_LOG::new_file_impl(bool, Format_description_log_event*, RaftRotateInfo*)
    #7 0x8d65e47 in rotate_relay_log(Master_info*, bool, bool, bool, RaftRotateInfo*)
    #8 0x8d661c0 in rotate_relay_log_for_raft(RaftRotateInfo*)
    #9 0x8c7696a in process_raft_queue
    #10 0xa0fa1fd in pfs_spawn_thread(void*)
    #11 0x7f8c9a12b20b in start_thread

release these memory before assign them

Reviewed By: Pushapgl

Differential Revision: D28819752

fbshipit-source-id: ed57cea875c
inikep pushed a commit that referenced this pull request Oct 22, 2021
Summary:
This is change #2 that ports the SELECT statement parsing from feature-myrocks-bypass branch, including:

1. select_parser class that parses SELECT statements and validates the scenarios are supported, and error with proper message if not supported
2. You can control bypass policy with rocksdb_select_bypass_policy variable = 0 (ALWAYS OFF) / 1 (ALWAYS ON) / 2 (OPT IN) / 3 (OPT OUT).
3. A few metrics for number of SELECT bypass statements executed/rejected (due to not supported) /failed (something wrong in execution)

At this point this only dumps the SELECT statement parsed results without doing anything useful. The next change will be porting the query plan execution and point/range query execution part.

Please ignore the unused attributes for now - they'll be removed in next change once the execution part is ported.

Reference Patch:

facebook@e79b20a
facebook@8072e3d

 ---
Porting Notes:

There are a lot of small changes in this area but nothing too bad:
* Some bypass scenarios no longer fail or fail with different error message because of the new parse tree -> AST conversion process. For example, `SELECT *` now works correctly
* st_select_lex_t -> SELECT_LEX, and many members has changed name, such as where -> where_cond()
* protocol usage has changed to start_row, store, end_row in 8.0, which is much better.
* thd->query() is now a LEX_STRING
* SELECT option check (such as SELECT_DISTINCT) now uses active_options() and only check for SELECT_DISTINCT as there seems to be default value being set most likely due to side effect of having more parse tree processing being done. I checked the flags and it looks like only SELECT_DISTINCT are interesting and needs to be blocked.
* SELECT INTO OUTFILE check is changed to use results->needs_file_priviledge
* True/False const are now represented as Item_func_true/Item_func_false. For now I'm over-generalizing to functions that are const, and expect val_int/etc would do the right thing

Reviewed By: luqun

Differential Revision: D22808305

fbshipit-source-id: 9a51446af58
inikep pushed a commit that referenced this pull request Oct 22, 2021
… enabled

Summary:
For secondaries, when enable_super_log_bin_read_only is on and read_only is on, currently it will forbid to install/uninstall plugin during run time.

install/uninstall plugin doesn't generate event in binlog, although the thread thd contains OPTION_BIN_LOG flag due to log_slave_updates is on by default in secondaries. It should be safe to execute install/uninstall plugin.

the change is to call set_skip_readonly_check() before install/uninstall plugin and call reset_skip_readonly_check()(for completeness) after install/uninstall plugin.

BTW, mysql will always call reset_skip_readonly_check() for at the beginning of each statement. thus set_skip_readonly_check() won't affect other statement.
```
#0  THD::reset_skip_readonly_check (this=0x7fb5c1a0b000) at /home/luqun/mysql/mysql-8.0.20/sql/sql_class.h:1754
#1  0x0000000005a500e1 in THD::reset_for_next_command (this=0x7fb5c1a0b000) at /home/luqun/mysql/mysql-8.0.20/sql/sql_parse.cc:5892
#2  0x0000000005a517a5 in mysql_reset_thd_for_next_command (thd=0x7fb5c1a0b000) at /home/luqun/mysql/mysql-8.0.20/sql/sql_parse.cc:5817
#3  0x0000000005a50466 in mysql_parse (thd=0x7fb5c1a0b000, parser_state=0x7fb5f6bb4560, last_timer=0x7fb5f6bb39b0) at /home/luqun/mysql/mysql-8.0.20/sql/sql_parse.cc:6056
#4  0x0000000005a4c7c9 in dispatch_command (thd=0x7fb5c1a0b000, com_data=0x7fb5f6bb4d98, command=COM_QUERY) at /home/luqun/mysql/mysql-8.0.20/sql/sql_parse.cc:2222
#5  0x0000000005a4f991 in do_command (thd=0x7fb5c1a0b000) at /home/luqun/mysql/mysql-8.0.20/sql/sql_parse.cc:1556
#6  0x0000000005ccd4f1 in handle_connection (arg=0x7fb5cc85b740) at /home/luqun/mysql/mysql-8.0.20/sql/conn_handler/connection_handler_per_thread.cc:330
#7  0x00000000078eb95b in pfs_spawn_thread (arg=0x7fb5f8c89720) at /home/luqun/mysql/mysql-8.0.20/storage/perfschema/pfs.cc:2884
#8  0x00007fb5f957020c in start_thread (arg=0x7fb5f6bb6700) at pthread_create.c:479
#9  0x00007fb5f971881f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
```

Reviewed By: george-reynya

Differential Revision: D27213990

fbshipit-source-id: 68a234fb694
inikep pushed a commit that referenced this pull request Oct 22, 2021
Summary:
several testcase failed in asan debug,

==106552==ERROR: AddressSanitizer: stack-buffer-underflow on address 0x7f38f3590078 at pc 0x0000093aa495 bp 0x7f38f358ffc0 sp 0x7f38f358ffb8
READ of size 4 at 0x7f38f3590078 thread T203
    #0 0x93aa494 in _db_enter_(char const*, int, char const*, unsigned int, _db_stack_frame_*)
    #1 0x8d0b9bb in Relay_log_info::remove_logged_gtids
    #2 0x8b1ec3f in trim_logged_gtid
    #3 0x8c767cf in process_raft_queue

If Use DBUG_ENTER, then you need to use DBUG_RETURN to pop current frame in CODE_STATE.
If use DBUG_TRACE, it will pop current frame during .dtor

ps. small refactor changes for sql/rpl_handler.cc

Reviewed By: bhatvinay

Differential Revision: D28809418

fbshipit-source-id: db2ac9c92f8
inikep pushed a commit that referenced this pull request Oct 22, 2021
Summary:
asandebug complain there are memory leaks during MYSQL_BIN_LOG open

Direct leak of 50 byte(s) in 1 object(s) allocated from:
    #0 0x67460ef in malloc
    #1 0x93f0777 in my_raw_malloc(unsigned long, int)
    #2 0x93f064a in my_malloc(unsigned int, unsigned long, int)
    #3 0x93f0eb0 in my_strdup(unsigned int, char const*, int)
    #4 0x8af01a6 in MYSQL_BIN_LOG::open(unsigned int, char const*, char const*, unsigned int)
    #5 0x8af8064 in MYSQL_BIN_LOG::open_binlog(char const*, char const*, unsigned long, bool, bool, bool, Format_description_log_event*, unsigned int, RaftRotateInfo*, bool)
    #6 0x8b00c00 in MYSQL_BIN_LOG::new_file_impl(bool, Format_description_log_event*, RaftRotateInfo*)
    #7 0x8d65e47 in rotate_relay_log(Master_info*, bool, bool, bool, RaftRotateInfo*)
    #8 0x8d661c0 in rotate_relay_log_for_raft(RaftRotateInfo*)
    #9 0x8c7696a in process_raft_queue
    #10 0xa0fa1fd in pfs_spawn_thread(void*)
    #11 0x7f8c9a12b20b in start_thread

release these memory before assign them

Reviewed By: Pushapgl

Differential Revision: D28819752

fbshipit-source-id: ed57cea875c
inikep pushed a commit that referenced this pull request Oct 25, 2021
Summary:
several testcase failed in asan debug,

==106552==ERROR: AddressSanitizer: stack-buffer-underflow on address 0x7f38f3590078 at pc 0x0000093aa495 bp 0x7f38f358ffc0 sp 0x7f38f358ffb8
READ of size 4 at 0x7f38f3590078 thread T203
    #0 0x93aa494 in _db_enter_(char const*, int, char const*, unsigned int, _db_stack_frame_*)
    #1 0x8d0b9bb in Relay_log_info::remove_logged_gtids
    #2 0x8b1ec3f in trim_logged_gtid
    #3 0x8c767cf in process_raft_queue

If Use DBUG_ENTER, then you need to use DBUG_RETURN to pop current frame in CODE_STATE.
If use DBUG_TRACE, it will pop current frame during .dtor

ps. small refactor changes for sql/rpl_handler.cc

Reviewed By: bhatvinay

Differential Revision: D28809418

fbshipit-source-id: db2ac9c92f8
inikep pushed a commit that referenced this pull request Oct 25, 2021
Summary:
asandebug complain there are memory leaks during MYSQL_BIN_LOG open

Direct leak of 50 byte(s) in 1 object(s) allocated from:
    #0 0x67460ef in malloc
    #1 0x93f0777 in my_raw_malloc(unsigned long, int)
    #2 0x93f064a in my_malloc(unsigned int, unsigned long, int)
    #3 0x93f0eb0 in my_strdup(unsigned int, char const*, int)
    #4 0x8af01a6 in MYSQL_BIN_LOG::open(unsigned int, char const*, char const*, unsigned int)
    #5 0x8af8064 in MYSQL_BIN_LOG::open_binlog(char const*, char const*, unsigned long, bool, bool, bool, Format_description_log_event*, unsigned int, RaftRotateInfo*, bool)
    #6 0x8b00c00 in MYSQL_BIN_LOG::new_file_impl(bool, Format_description_log_event*, RaftRotateInfo*)
    #7 0x8d65e47 in rotate_relay_log(Master_info*, bool, bool, bool, RaftRotateInfo*)
    #8 0x8d661c0 in rotate_relay_log_for_raft(RaftRotateInfo*)
    #9 0x8c7696a in process_raft_queue
    #10 0xa0fa1fd in pfs_spawn_thread(void*)
    #11 0x7f8c9a12b20b in start_thread

release these memory before assign them

Reviewed By: Pushapgl

Differential Revision: D28819752

fbshipit-source-id: ed57cea875c
inikep pushed a commit that referenced this pull request Oct 27, 2021
Summary:
several testcase failed in asan debug,

==106552==ERROR: AddressSanitizer: stack-buffer-underflow on address 0x7f38f3590078 at pc 0x0000093aa495 bp 0x7f38f358ffc0 sp 0x7f38f358ffb8
READ of size 4 at 0x7f38f3590078 thread T203
    #0 0x93aa494 in _db_enter_(char const*, int, char const*, unsigned int, _db_stack_frame_*)
    #1 0x8d0b9bb in Relay_log_info::remove_logged_gtids
    #2 0x8b1ec3f in trim_logged_gtid
    #3 0x8c767cf in process_raft_queue

If Use DBUG_ENTER, then you need to use DBUG_RETURN to pop current frame in CODE_STATE.
If use DBUG_TRACE, it will pop current frame during .dtor

ps. small refactor changes for sql/rpl_handler.cc

Reviewed By: bhatvinay

Differential Revision: D28809418

fbshipit-source-id: db2ac9c92f8
inikep pushed a commit that referenced this pull request Oct 27, 2021
Summary:
asandebug complain there are memory leaks during MYSQL_BIN_LOG open

Direct leak of 50 byte(s) in 1 object(s) allocated from:
    #0 0x67460ef in malloc
    #1 0x93f0777 in my_raw_malloc(unsigned long, int)
    #2 0x93f064a in my_malloc(unsigned int, unsigned long, int)
    #3 0x93f0eb0 in my_strdup(unsigned int, char const*, int)
    #4 0x8af01a6 in MYSQL_BIN_LOG::open(unsigned int, char const*, char const*, unsigned int)
    #5 0x8af8064 in MYSQL_BIN_LOG::open_binlog(char const*, char const*, unsigned long, bool, bool, bool, Format_description_log_event*, unsigned int, RaftRotateInfo*, bool)
    #6 0x8b00c00 in MYSQL_BIN_LOG::new_file_impl(bool, Format_description_log_event*, RaftRotateInfo*)
    #7 0x8d65e47 in rotate_relay_log(Master_info*, bool, bool, bool, RaftRotateInfo*)
    #8 0x8d661c0 in rotate_relay_log_for_raft(RaftRotateInfo*)
    #9 0x8c7696a in process_raft_queue
    #10 0xa0fa1fd in pfs_spawn_thread(void*)
    #11 0x7f8c9a12b20b in start_thread

release these memory before assign them

Reviewed By: Pushapgl

Differential Revision: D28819752

fbshipit-source-id: ed57cea875c
avbelov23 pushed a commit to avbelov23/mysql-5.6 that referenced this pull request Oct 30, 2021
Summary:
This is an upstream bug: https://bugs.mysql.com/bug.php?id=92964

The issue is that if slave instance use sysvar_session_track_gtids == OWN_GTID and also enable MTS, the slave will lag behind master and its performance degrades over time.

The reason for this lag/perf degrade is caused by each slave worker keep adding its own gtid into its own Session_consistency_gtids_ctx.m_gtid_set during each transaction commit. Overtime, some of Session_consistency_gtids_ctx.m_gtid_set may have millions gno_interval and it will take long time to insert 1 gtid into Session_consistency_gtids_ctx.m_gtid_set(time complexity O(n)), see [Gtid_set::add_gno_interval](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/rpl_gtid_set.cc#L323).

There are couple related code to this issue:
- During slave worker thread start, there is a [check](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/session_tracker.cc#L1560) to not enable tracker for system threads but that check doesn't work---The THD system_thread field is initialized with default value(NON_SYSTEM_THREAD) during THD::THD.
```
  m_enabled = thd->variables.session_track_gtids != OFF &&
              /* No need to track GTIDs for system threads. */
              thd->system_thread == NON_SYSTEM_THREAD;
```
```
Call stack:
```
- in [Session_gtids_track::Update](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/session_tracker.cc#L1581), the else code doesn't unregister listener due to mysql/mysql-server@1e0844c try to fix another issue.

Solutions:
1.  [Current]add system threads check during collecting data, since THD will be fully initialized at that time.
 -Pro: simple change(easy to port) and perf is ok
2. add `thd->session_track_gtids=OFF; thd->session_tracker.get_tracker(SESSION_GTIDS_TRACKER)->update(thd);` after [init_slave_thread()](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/rpl_slave.cc#L6788)
-Pro: During init_slave_thread(), thd->system_thread will be set correctly value.
-Con: it need to add couple places, such as handle_slave_io, handle_slave_sql, handle_slave_worker
3. add `thd->session_track_gtids=OFF;thd->session_tracker.get_tracker(SESSION_GTIDS_TRACKER)->update(thd);` inside [init_slave_thread()](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/rpl_slave.cc#L6788)
   -Pro: only add once for slave/slave worker/IO threads
   -Con: [should_collect](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/rpl_context.cc#L53) check still return true, thus it is less perf than inikep#1
4. add `thd->session_track_gtids=OFF;thd->rpl_thd_ctx.session_gtids_ctx().unregister_ctx_change_listener(thd->session_tracker.get_tracker(SESSION_GTIDS_TRACKER));` inside init_slave_thread()
   -Pro: perf is better than inikep#3, on par as inikep#1
   -Con: if there are other non-system threads(except IO/sql threads) do transactions, these threads will still collect its own gtids.
5. add another THD overload constructor which takes 'enum_thread_type' as a parameter whose default value is NON_SYSTEM_THREAD
 -Pro: it will initialize correct value with THD and won't enable those session_tracker
 -Con: similar to inikep#2/inikep#4, need to replace THD() with THD(enum_thread_type) couples places.

Differential Revision: D18072672 (facebook@4b1d7d5)

fbshipit-source-id: 4b902020e76
avbelov23 pushed a commit to avbelov23/mysql-5.6 that referenced this pull request Oct 30, 2021
…_cache_type / tx_readonly variables

Summary:
5.6 Java client we have today will issue following SQL query on connect:

```
/* MYSQL_CJ_FULL_PROD_NAME@ ( Revision: MYSQL_CJ_REVISION@ ) */SELECT  session.auto_increment_increment AS auto_increment_increment, character_set_client AS character_set_client, character_set_connection AS character_set_connection, character_set_results AS character_set_results, character_set_server AS character_set_server, init_connect AS init_connect, interactive_timeout AS interactive_timeout, license AS license, lower_case_table_names AS lower_case_table_names, max_allowed_packet AS max_allowed_packet, net_buffer_length AS net_buffer_length, net_write_timeout AS net_write_timeout, query_cache_size AS query_cache_size, query_cache_type AS query_cache_type, sql_mode AS sql_mode, system_time_zone AS system_time_zone, Timeroot_zone AS time_zone, tx_isolation AS tx_isolation, wait_timeout AS wait_timeout
```

It'll fail on 8.0 as tx_isolation is replaced by transaction_isolation, and query_cache_size/query_cache_type are deprecated (query cache feature is gone in 8.0). This adds a reasonable implementation of those variables to make the connector happy (without introducing other issues, hopefully), before we migrate our java client to 8.0:
1. Add dummy implementation of query_cache_size / query_cache_type that returns 0 size and cache off. Both are readonly. Note I changed query_cache_type to global as I don't want to pay for the extra bytes for a readonly variable and it doesn't make too much sense for a session variable to be read-only anyway. A new test is added as well.
2. Add read-only tx_isolation that directly maps to transaction_isolation variable. transaction_isolation_basic test is enhanced to ensure any change to transaction_isolation gets mapped to tx_isolation (which just works since they are the same variable on thd session)
3. Add read-only tx_read_only maps to transaction_read_only similar to inikep#2 for cases Java cient calls connection.isReadOnly.

Reviewed By: zhichengzhu

Differential Revision: D19551619 (facebook@73d384c)

fbshipit-source-id: 78f483aa3ed
avbelov23 pushed a commit to avbelov23/mysql-5.6 that referenced this pull request Oct 30, 2021
Summary:
Inside index_next_same() call, we should
1. first check whether the record matches the index
   lookup prefix,
2. then check pushed index condition.

If we try to check inikep#2 without checking inikep#1 first, we may walk
off the index lookup prefix and scan till the end of the index.

Differential Revision: https://reviews.facebook.net/D38769

fbshipit-source-id: 6c34e3ca0d2
avbelov23 pushed a commit to avbelov23/mysql-5.6 that referenced this pull request Oct 30, 2021
Summary:
MyRocks had two bugs when calculating index scan cost.
1. block_size was not considered. This made covering index scan cost
(both full index scan and range scan) much higher
2. ha_rocksdb::records_in_range() may have estimated more rows
than the estimated number of rows in the table. This was wrong,
and MySQL optimizer decided to use full index scan even though
range scan was more efficient.

This diff fixes inikep#1 by setting stats.block_size at ha_rocksdb::open(),
and fixes inikep#2 by reducing the number of estimated rows if it was
larger than stats.records.

Differential Revision: https://reviews.facebook.net/D55869

fbshipit-source-id: 15ca2c3fa8e
avbelov23 pushed a commit to avbelov23/mysql-5.6 that referenced this pull request Oct 30, 2021
…acebook#871)

Summary:
Original report: https://jira.mariadb.org/browse/MDEV-15816

To reproduce this bug just following below steps,

client 1:
USE test;
CREATE TABLE t1 (i INT) ENGINE=MyISAM;
HANDLER t1 OPEN h;
CREATE TABLE t2 (i INT) ENGINE=RocksDB;
LOCK TABLES t2 WRITE;

client 2:
FLUSH TABLES WITH READ LOCK;

client 1:
INSERT INTO t2 VALUES (1);

So client 1 acquired the lock and set m_lock_rows = RDB_LOCK_WRITE.
Then client 2 calls store_lock(TL_IGNORE) and m_lock_rows was wrongly
set to RDB_LOCK_NONE, as below

```
 #0  myrocks::ha_rocksdb::store_lock (this=0x7fffbc03c7c8, thd=0x7fffc0000ba0, to=0x7fffc0011220, lock_type=TL_IGNORE)
 inikep#1  get_lock_data (thd=0x7fffc0000ba0, table_ptr=0x7fffe84b7d20, count=1, flags=2)
 inikep#2  mysql_lock_abort_for_thread (thd=0x7fffc0000ba0, table=0x7fffbc03bbc0)
 inikep#3  THD::notify_shared_lock (this=0x7fffc0000ba0, ctx_in_use=0x7fffbc000bd8, needs_thr_lock_abort=true)
 inikep#4  MDL_lock::notify_conflicting_locks (this=0x555557a82380, ctx=0x7fffc0000cc8)
 inikep#5  MDL_context::acquire_lock (this=0x7fffc0000cc8, mdl_request=0x7fffe84b8350, lock_wait_timeout=2)
 inikep#6  Global_read_lock::lock_global_read_lock (this=0x7fffc0003fe0, thd=0x7fffc0000ba0)
```

Finally, client 1 "INSERT INTO..." hits the Assertion 'm_lock_rows == RDB_LOCK_WRITE'
failed in myrocks::ha_rocksdb::write_row()

Fix this bug by not setting m_locks_rows if lock_type == TL_IGNORE.

Closes facebook#838
Pull Request resolved: facebook#871

Differential Revision: D9417382 (facebook@8cb1dc8)

Pulled By: lth

fbshipit-source-id: 5aa7ed56c0a
avbelov23 pushed a commit to avbelov23/mysql-5.6 that referenced this pull request Oct 30, 2021
Summary: Missed a few in earlier fixes for AutoInitCopy rule. Also added a few fixes for anoymous class rule and local shadowing rule.

Reviewed By: luqun

Differential Revision: D15467213 (facebook@b1fb095)

fbshipit-source-id: 59d9e6761e1
avbelov23 pushed a commit to avbelov23/mysql-5.6 that referenced this pull request Oct 30, 2021
Summary:
1. ttl_primary has an interesting bug that it inserts using UNIX_TIMESTAMP and then updates the same keys with UNIX_TIMESTAMP in attempt to renew its TTL (while setting debug_rocksdb_ttl_ts to a positive value) but if the commands are executed fast enough the UNIX_TIMESTAMP won't change and as a result the update will not trigger any TTL update. I've verified 5.6 has the same behavior in this case so it's just flaky test. Fixed to use UNIX_TIMESTAMP() + 1 so as long as time flows in the positive direction in the foreseeable future (paring unexpected clock time jumps) we should be fine. Other TTL tests seems to employ similar pattern.
2. Rebaseline line a few other tests due to issues that already been observed. See my earlier diffs for more context.

Reviewed By: lloyd

Differential Revision: D17529590 (facebook@b30636d)

fbshipit-source-id: b235485053e
inikep pushed a commit that referenced this pull request Nov 1, 2021
Summary:
This is change #2 that ports the SELECT statement parsing from feature-myrocks-bypass branch, including:

1. select_parser class that parses SELECT statements and validates the scenarios are supported, and error with proper message if not supported
2. You can control bypass policy with rocksdb_select_bypass_policy variable = 0 (ALWAYS OFF) / 1 (ALWAYS ON) / 2 (OPT IN) / 3 (OPT OUT).
3. A few metrics for number of SELECT bypass statements executed/rejected (due to not supported) /failed (something wrong in execution)

At this point this only dumps the SELECT statement parsed results without doing anything useful. The next change will be porting the query plan execution and point/range query execution part.

Please ignore the unused attributes for now - they'll be removed in next change once the execution part is ported.

Reference Patch:

facebook@e79b20a
facebook@8072e3d

 ---
Porting Notes:

There are a lot of small changes in this area but nothing too bad:
* Some bypass scenarios no longer fail or fail with different error message because of the new parse tree -> AST conversion process. For example, `SELECT *` now works correctly
* st_select_lex_t -> SELECT_LEX, and many members has changed name, such as where -> where_cond()
* protocol usage has changed to start_row, store, end_row in 8.0, which is much better.
* thd->query() is now a LEX_STRING
* SELECT option check (such as SELECT_DISTINCT) now uses active_options() and only check for SELECT_DISTINCT as there seems to be default value being set most likely due to side effect of having more parse tree processing being done. I checked the flags and it looks like only SELECT_DISTINCT are interesting and needs to be blocked.
* SELECT INTO OUTFILE check is changed to use results->needs_file_priviledge
* True/False const are now represented as Item_func_true/Item_func_false. For now I'm over-generalizing to functions that are const, and expect val_int/etc would do the right thing

Reviewed By: luqun

Differential Revision: D22808305

fbshipit-source-id: 9a51446af58
inikep pushed a commit that referenced this pull request Nov 1, 2021
… enabled

Summary:
For secondaries, when enable_super_log_bin_read_only is on and read_only is on, currently it will forbid to install/uninstall plugin during run time.

install/uninstall plugin doesn't generate event in binlog, although the thread thd contains OPTION_BIN_LOG flag due to log_slave_updates is on by default in secondaries. It should be safe to execute install/uninstall plugin.

the change is to call set_skip_readonly_check() before install/uninstall plugin and call reset_skip_readonly_check()(for completeness) after install/uninstall plugin.

BTW, mysql will always call reset_skip_readonly_check() for at the beginning of each statement. thus set_skip_readonly_check() won't affect other statement.
```
#0  THD::reset_skip_readonly_check (this=0x7fb5c1a0b000) at /home/luqun/mysql/mysql-8.0.20/sql/sql_class.h:1754
#1  0x0000000005a500e1 in THD::reset_for_next_command (this=0x7fb5c1a0b000) at /home/luqun/mysql/mysql-8.0.20/sql/sql_parse.cc:5892
#2  0x0000000005a517a5 in mysql_reset_thd_for_next_command (thd=0x7fb5c1a0b000) at /home/luqun/mysql/mysql-8.0.20/sql/sql_parse.cc:5817
#3  0x0000000005a50466 in mysql_parse (thd=0x7fb5c1a0b000, parser_state=0x7fb5f6bb4560, last_timer=0x7fb5f6bb39b0) at /home/luqun/mysql/mysql-8.0.20/sql/sql_parse.cc:6056
#4  0x0000000005a4c7c9 in dispatch_command (thd=0x7fb5c1a0b000, com_data=0x7fb5f6bb4d98, command=COM_QUERY) at /home/luqun/mysql/mysql-8.0.20/sql/sql_parse.cc:2222
#5  0x0000000005a4f991 in do_command (thd=0x7fb5c1a0b000) at /home/luqun/mysql/mysql-8.0.20/sql/sql_parse.cc:1556
#6  0x0000000005ccd4f1 in handle_connection (arg=0x7fb5cc85b740) at /home/luqun/mysql/mysql-8.0.20/sql/conn_handler/connection_handler_per_thread.cc:330
#7  0x00000000078eb95b in pfs_spawn_thread (arg=0x7fb5f8c89720) at /home/luqun/mysql/mysql-8.0.20/storage/perfschema/pfs.cc:2884
#8  0x00007fb5f957020c in start_thread (arg=0x7fb5f6bb6700) at pthread_create.c:479
#9  0x00007fb5f971881f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
```

Reviewed By: george-reynya

Differential Revision: D27213990

fbshipit-source-id: 68a234fb694
inikep pushed a commit that referenced this pull request Nov 1, 2021
Summary:
several testcase failed in asan debug,

==106552==ERROR: AddressSanitizer: stack-buffer-underflow on address 0x7f38f3590078 at pc 0x0000093aa495 bp 0x7f38f358ffc0 sp 0x7f38f358ffb8
READ of size 4 at 0x7f38f3590078 thread T203
    #0 0x93aa494 in _db_enter_(char const*, int, char const*, unsigned int, _db_stack_frame_*)
    #1 0x8d0b9bb in Relay_log_info::remove_logged_gtids
    #2 0x8b1ec3f in trim_logged_gtid
    #3 0x8c767cf in process_raft_queue

If Use DBUG_ENTER, then you need to use DBUG_RETURN to pop current frame in CODE_STATE.
If use DBUG_TRACE, it will pop current frame during .dtor

ps. small refactor changes for sql/rpl_handler.cc

Reviewed By: bhatvinay

Differential Revision: D28809418

fbshipit-source-id: db2ac9c92f8
inikep pushed a commit that referenced this pull request Nov 1, 2021
Summary:
asandebug complain there are memory leaks during MYSQL_BIN_LOG open

Direct leak of 50 byte(s) in 1 object(s) allocated from:
    #0 0x67460ef in malloc
    #1 0x93f0777 in my_raw_malloc(unsigned long, int)
    #2 0x93f064a in my_malloc(unsigned int, unsigned long, int)
    #3 0x93f0eb0 in my_strdup(unsigned int, char const*, int)
    #4 0x8af01a6 in MYSQL_BIN_LOG::open(unsigned int, char const*, char const*, unsigned int)
    #5 0x8af8064 in MYSQL_BIN_LOG::open_binlog(char const*, char const*, unsigned long, bool, bool, bool, Format_description_log_event*, unsigned int, RaftRotateInfo*, bool)
    #6 0x8b00c00 in MYSQL_BIN_LOG::new_file_impl(bool, Format_description_log_event*, RaftRotateInfo*)
    #7 0x8d65e47 in rotate_relay_log(Master_info*, bool, bool, bool, RaftRotateInfo*)
    #8 0x8d661c0 in rotate_relay_log_for_raft(RaftRotateInfo*)
    #9 0x8c7696a in process_raft_queue
    #10 0xa0fa1fd in pfs_spawn_thread(void*)
    #11 0x7f8c9a12b20b in start_thread

release these memory before assign them

Reviewed By: Pushapgl

Differential Revision: D28819752

fbshipit-source-id: ed57cea875c
inikep pushed a commit that referenced this pull request Jul 19, 2024
Summary:
MySQL bypass infrastructure change

1. Adds a new hton override `handlerton::handle_single_table_select` to allow hijacking of SELECT statement execution from any storage engine.
2. Add lex parsing for bypass hints in optimizer hint style /*+ bypass */ and /*+ no_bypass */
3. MyRocks overrides handle_single_table_select but does nothing and simply fallback.

This is a port from from bypass feature branch - change #1 of 3. I'm planning to port the changes in the following order:

1. MySQL lexer + SELECT hijacking (this one).
2. SELECT Statement parsing and checking
3. Execution of SELECT statement in bypass

Reference Patch: facebook@2d19dc0

-----
Porting Notes:

* The entry points have changed due to MySQL 8.0 refactoring to SELECT path, and now it is in `Sql_cmd_dml::execute_inner`. I'll also see if overriding the entire execute would be better but this should be good enough for now.
* The check for whether we should go to bypass handler is mostly the same, except st_select_lex_t is now SELECT_LEX.
* The lex parsing for hints was mostly ported as-is. I was also debating whether I should just use optimizer hints instead of hacking the lexer, but decided against using optimizer hint for now so that migration from 5.6 will be easier. We can migrate to optimizer hint once 8.0 migration is complete.

Differential Revision: D22807040

---------------------------------------------------------------------------------------------

[bypass] MySQL bypass change #2 - SELECT parsing [non-RocksDB part]

Summary:
This is change #2 that ports the SELECT statement parsing from feature-myrocks-bypass branch, including:

1. select_parser class that parses SELECT statements and validates the scenarios are supported, and error with proper message if not supported
2. You can control bypass policy with rocksdb_select_bypass_policy variable = 0 (ALWAYS OFF) / 1 (ALWAYS ON) / 2 (OPT IN) / 3 (OPT OUT).
3. A few metrics for number of SELECT bypass statements executed/rejected (due to not supported) /failed (something wrong in execution)

At this point this only dumps the SELECT statement parsed results without doing anything useful. The next change will be porting the query plan execution and point/range query execution part.

Please ignore the unused attributes for now - they'll be removed in next change once the execution part is ported.

Reference Patch:

facebook@e79b20a
facebook@8072e3d

 ---
Porting Notes:

There are a lot of small changes in this area but nothing too bad:
* Some bypass scenarios no longer fail or fail with different error message because of the new parse tree -> AST conversion process. For example, `SELECT *` now works correctly
* st_select_lex_t -> SELECT_LEX, and many members has changed name, such as where -> where_cond()
* protocol usage has changed to start_row, store, end_row in 8.0, which is much better.
* thd->query() is now a LEX_STRING
* SELECT option check (such as SELECT_DISTINCT) now uses active_options() and only check for SELECT_DISTINCT as there seems to be default value being set most likely due to side effect of having more parse tree processing being done. I checked the flags and it looks like only SELECT_DISTINCT are interesting and needs to be blocked.
* SELECT INTO OUTFILE check is changed to use results->needs_file_priviledge
* True/False const are now represented as Item_func_true/Item_func_false. For now I'm over-generalizing to functions that are const, and expect val_int/etc would do the right thing

Reviewed By: luqun

Differential Revision: D22808305

-----------------------------------------------------------------------------------------------

Introduce base select parser class

Summary:
To make it easier to add a new bypass rpc parser into existing bypass implementation (nosql_access.cc),
and make minimum change to bypass engine while adding a bypass rpc parser, I introduced a new base class,
base_select_parser. It includes all essential members and methods for parsing bypass sq/rpc queries.
Existing bypass sql parser (select_parser) and a new bypass rpc parser (not included in this diff)
inherits the base function.

Bypass RPC reference diff: D30493654

Reviewed By: yizhang82

Differential Revision: D31870628
inikep pushed a commit that referenced this pull request Jul 19, 2024
Summary:
Almost a direct port of the feature in 5.6 except for minor visual changes and
some minor changes related to 8.0 thread handling

Also brings in the basic raft_listener_queue without the implemented functions

Differential Revision: D21670710

Use DBUG_TRACE instead of DBUG_ENTER

Summary:
several testcase failed in asan debug,

==106552==ERROR: AddressSanitizer: stack-buffer-underflow on address 0x7f38f3590078 at pc 0x0000093aa495 bp 0x7f38f358ffc0 sp 0x7f38f358ffb8
READ of size 4 at 0x7f38f3590078 thread T203
    #0 0x93aa494 in _db_enter_(char const*, int, char const*, unsigned int, _db_stack_frame_*)
    #1 0x8d0b9bb in Relay_log_info::remove_logged_gtids
    #2 0x8b1ec3f in trim_logged_gtid
    #3 0x8c767cf in process_raft_queue

If Use DBUG_ENTER, then you need to use DBUG_RETURN to pop current frame in CODE_STATE.
If use DBUG_TRACE, it will pop current frame during .dtor

ps. small refactor changes for sql/rpl_handler.cc

Reviewed By: bhatvinay

Differential Revision: D28809418
inikep pushed a commit that referenced this pull request Jul 19, 2024
…_cache_type / tx_readonly variables

Summary:
5.6 Java client we have today will issue following SQL query on connect:

```
/* MYSQL_CJ_FULL_PROD_NAME@ ( Revision: MYSQL_CJ_REVISION@ ) */SELECT  session.auto_increment_increment AS auto_increment_increment, character_set_client AS character_set_client, character_set_connection AS character_set_connection, character_set_results AS character_set_results, character_set_server AS character_set_server, init_connect AS init_connect, interactive_timeout AS interactive_timeout, license AS license, lower_case_table_names AS lower_case_table_names, max_allowed_packet AS max_allowed_packet, net_buffer_length AS net_buffer_length, net_write_timeout AS net_write_timeout, query_cache_size AS query_cache_size, query_cache_type AS query_cache_type, sql_mode AS sql_mode, system_time_zone AS system_time_zone, Timeroot_zone AS time_zone, tx_isolation AS tx_isolation, wait_timeout AS wait_timeout
```

It'll fail on 8.0 as tx_isolation is replaced by transaction_isolation, and query_cache_size/query_cache_type are deprecated (query cache feature is gone in 8.0). This adds a reasonable implementation of those variables to make the connector happy (without introducing other issues, hopefully), before we migrate our java client to 8.0:
1. Add dummy implementation of query_cache_size / query_cache_type that returns 0 size and cache off. Both are readonly. Note I changed query_cache_type to global as I don't want to pay for the extra bytes for a readonly variable and it doesn't make too much sense for a session variable to be read-only anyway. A new test is added as well.
2. Add read-only tx_isolation that directly maps to transaction_isolation variable. transaction_isolation_basic test is enhanced to ensure any change to transaction_isolation gets mapped to tx_isolation (which just works since they are the same variable on thd session)
3. Add read-only tx_read_only maps to transaction_read_only similar to #2 for cases Java cient calls connection.isReadOnly.

Reviewed By: zhichengzhu

Differential Revision: D19551619
inikep pushed a commit that referenced this pull request Jul 19, 2024
Summary:
port D21375044 (facebook@7fc9eaa) (facebook@7fc9eaa) and D17291883 (facebook@6a6a35f) (facebook@6a6a35f)
**Summary**
* As part of leadership changes, we will trim binlogs/relay logs on a
running mysql instance. This needs the ability to remove gtids from executed_gtid set.
This diff adds the ability to do the same. One can enqueue an event in the raft
listener queue and the raft thread in the server will take care of removing and
updating executed_gtid on a running instance.
* When raft logs are trimmed as part of TruncateOpsAfter(), the gtid of the trimmed trxs are also removed from executed/logged gtids. However, when binlog files are rotated, the previous gtid written to the new file depends on what files are rotated. During  binlog rotation the prev gtid are the logged/executed gtids of the instance. During relay log rotation prev gtids are the retrieved gtids of the instance. Hence, in addition to trimming logged gtids/executed gtids, we should also trim retrieved gtids. This prevents creation of holes in the instances executed_gtid set as the replicaset goes through multiple promotions over its lifetime.

Reviewed By: luqun

Differential Revision: D24690007

---------------------------------------------------------------------------------------

Use DBUG_TRACE instead of DBUG_ENTER

Summary:
several testcase failed in asan debug,

==106552==ERROR: AddressSanitizer: stack-buffer-underflow on address 0x7f38f3590078 at pc 0x0000093aa495 bp 0x7f38f358ffc0 sp 0x7f38f358ffb8
READ of size 4 at 0x7f38f3590078 thread T203
    #0 0x93aa494 in _db_enter_(char const*, int, char const*, unsigned int, _db_stack_frame_*)
    #1 0x8d0b9bb in Relay_log_info::remove_logged_gtids
    #2 0x8b1ec3f in trim_logged_gtid
    #3 0x8c767cf in process_raft_queue

If Use DBUG_ENTER, then you need to use DBUG_RETURN to pop current frame in CODE_STATE.
If use DBUG_TRACE, it will pop current frame during .dtor

ps. small refactor changes for sql/rpl_handler.cc

Reviewed By: bhatvinay

Differential Revision: D28809418
inikep pushed a commit that referenced this pull request Jul 19, 2024
Summary:
[Porting Notes]
We want to dump raft logs to vanilla async replicas regardless
of whether it's the relay log or binlog. Effectively after this change
we'll dump relay logs on the followers and binlogs on the leader. When
the raft role changes, the logs to the dumped are also changed.
Dump_log class is introduced as a thin wrapper/continer around
mysql_bin_log or rli->relay_log and is inited with mysql_bin_log
to emulate vanilla mysql behavior. Dump threads use the global
dump_log object instead of mysql_bin_log directly. We switch the log
in dump log only when raft role changes (in binlog_change_to_binlog()
and binlog_change_to_apply_log()).
During raft role change we take all log releated locks (LOCK_log,
LOCK_index, LOCK_binlog_end_pos, and dump log lock) to serialize it with
other log operations like dumping logs.

Related doc - https://fb.quip.com/oTVAAdgEi4zY

This diff contains below 7 patches:
D23013977
D24766787
D24716539
D24900223
D24955284
D25174166
D25775525

Reviewed By: luqun

Differential Revision: D26141496

-------------------------------------------------------------------------------

Passing raw_log pointer to wait_with_heartbeat() and wait_without_heartbeat()

Summary:
When enable_raft_plugin is OFF Dump_log::lock() is a no-op.
Which means that when enable_raft_plugin is OFF there can be a race
between log switching and dump threads. This could lead to a scenario
where the raw_log that wait_next_event() is working on might be
different than what wait_with_heartbeat()/wait_without_heartbeat() is
working on. This can cause deadlocks because
wait_with_heartbeat()/wait_without_heartbeat()'s mysql_cond_wait would
unlock and then lock a different log's LOCK_binlog_end_pos mutex which
would then never be unlocked by wait_next_event().

Reviewed By: anirbanr-fb

Differential Revision: D32152658

-----------------------------------------------------------------------------------------

Fix rpl_raft_dump_raft_logs

Summary:
This tests completes but fails because the following warning exists:
```
2022-08-30T16:28:00.159525Z 11 [ERROR] [MY-013114] [Repl] Slave I/O for channel '': Got fatal error 1236 from master when reading data from binary log: 'Slave has more GTIDs than the master has, using the master's SERVER_UUID. This may indicate that the end of the binary log was truncated or that the last binary log file was lost, e.g., after a power or disk failure when sync_binlog != 1. The master may or may not have rolled back transactions that were already replicated to the slave. Suggest to replicate any transactions that master has rolled back from slave to master, and/or commit empty transactions on master to account for transactions that have been', Error_code: MY-013114
```
Since the MTR result file is valid, we can suppress this error.

Reviewed By: yichenshen

Differential Revision: D39141846

-------------------------------------------------------------------------------

Fix heap overflow in group_relay_log_name handling

Summary:
We were accessing group_relay_log_name in
Query_log_event::do_apply_event_worker() but it's assigned only after
the coordinator thread encounters an end event (i.e. xid event or a
query event with "COMMIT" or "ROLLBACK" query). This was causing a race
between accessing group_relay_log_name in the worker thread and writing
it on the coordinator thread. We don't need to set transaction position
in events other than end event, so now we set transaction position in
query event only if it's an end event. The race is eliminated because
group_relay_log_name is set before enqueuing the event to the worker
thread (in both dep repl and vanilla mts).

Reviewed By: lth

Differential Revision: D28767430

-------------------------------------------------------------------------------

fix memory during MYSQL_BIN_LOG::open_existing_binlog

Summary:
asandebug complain there are memory leaks during MYSQL_BIN_LOG open

Direct leak of 50 byte(s) in 1 object(s) allocated from:
    #0 0x67460ef in malloc
    #1 0x93f0777 in my_raw_malloc(unsigned long, int)
    #2 0x93f064a in my_malloc(unsigned int, unsigned long, int)
    #3 0x93f0eb0 in my_strdup(unsigned int, char const*, int)
    #4 0x8af01a6 in MYSQL_BIN_LOG::open(unsigned int, char const*, char const*, unsigned int)
    #5 0x8af8064 in MYSQL_BIN_LOG::open_binlog(char const*, char const*, unsigned long, bool, bool, bool, Format_description_log_event*, unsigned int, RaftRotateInfo*, bool)
    #6 0x8b00c00 in MYSQL_BIN_LOG::new_file_impl(bool, Format_description_log_event*, RaftRotateInfo*)
    #7 0x8d65e47 in rotate_relay_log(Master_info*, bool, bool, bool, RaftRotateInfo*)
    #8 0x8d661c0 in rotate_relay_log_for_raft(RaftRotateInfo*)
    #9 0x8c7696a in process_raft_queue
    #10 0xa0fa1fd in pfs_spawn_thread(void*)
    #11 0x7f8c9a12b20b in start_thread

release these memory before assign them

Reviewed By: Pushapgl

Differential Revision: D28819752
inikep pushed a commit that referenced this pull request Jul 29, 2024
Summary:
When COUNT(*) style aggregation appears in queries, the optimizer realizes that a full scan needs to be done. The index doesn't matter - any index will do as it long it satisfies the other query predicates present.

There are two heuristics at play. If there is a choice of multiple indexes, the optimizer will pick the index that has the shortest length since that means less disk scan time, therefore faster query (heuristic #1).

In the case of vector indexes, the current design can have the vector field present in the clustered PK as well as the vector index (eg the flat index). Even though there's no clear advantage, the heuristic is based on INNODB style index layout, and assumes that a secondary index will always be a better choice as long as it does not cover all table columns (heuristic #2). In reality, in both INNODB and MyRocks, the secondary key does contain the PK. But this fact is ignored.

What ends up happening is that by only considered the # of explicitly defined key parts, the vector index ends up getting picked up.

In the case of the IVF index, the vector index may be shorter than the primary clustered index. Thus secondary index will have a shorter key length than the primary, and will end up getting picked up (i.e. if the key part heuristic doesn't already pick the vector index again).

There can be many other index combinations. We want the optimizer to pick another available index. If no other index is available, then pick the primary clustered index.

Differential Revision: D55003211

fbshipit-source-id: 5ce6b34
inikep pushed a commit that referenced this pull request Jul 30, 2024
Summary:
[Porting Notes]
We want to dump raft logs to vanilla async replicas regardless
of whether it's the relay log or binlog. Effectively after this change
we'll dump relay logs on the followers and binlogs on the leader. When
the raft role changes, the logs to the dumped are also changed.
Dump_log class is introduced as a thin wrapper/continer around
mysql_bin_log or rli->relay_log and is inited with mysql_bin_log
to emulate vanilla mysql behavior. Dump threads use the global
dump_log object instead of mysql_bin_log directly. We switch the log
in dump log only when raft role changes (in binlog_change_to_binlog()
and binlog_change_to_apply_log()).
During raft role change we take all log releated locks (LOCK_log,
LOCK_index, LOCK_binlog_end_pos, and dump log lock) to serialize it with
other log operations like dumping logs.

Related doc - https://fb.quip.com/oTVAAdgEi4zY

This diff contains below 7 patches:
D23013977
D24766787
D24716539
D24900223
D24955284
D25174166
D25775525

Reviewed By: luqun

Differential Revision: D26141496

-------------------------------------------------------------------------------

Passing raw_log pointer to wait_with_heartbeat() and wait_without_heartbeat()

Summary:
When enable_raft_plugin is OFF Dump_log::lock() is a no-op.
Which means that when enable_raft_plugin is OFF there can be a race
between log switching and dump threads. This could lead to a scenario
where the raw_log that wait_next_event() is working on might be
different than what wait_with_heartbeat()/wait_without_heartbeat() is
working on. This can cause deadlocks because
wait_with_heartbeat()/wait_without_heartbeat()'s mysql_cond_wait would
unlock and then lock a different log's LOCK_binlog_end_pos mutex which
would then never be unlocked by wait_next_event().

Reviewed By: anirbanr-fb

Differential Revision: D32152658

-----------------------------------------------------------------------------------------

Fix rpl_raft_dump_raft_logs

Summary:
This tests completes but fails because the following warning exists:
```
2022-08-30T16:28:00.159525Z 11 [ERROR] [MY-013114] [Repl] Slave I/O for channel '': Got fatal error 1236 from master when reading data from binary log: 'Slave has more GTIDs than the master has, using the master's SERVER_UUID. This may indicate that the end of the binary log was truncated or that the last binary log file was lost, e.g., after a power or disk failure when sync_binlog != 1. The master may or may not have rolled back transactions that were already replicated to the slave. Suggest to replicate any transactions that master has rolled back from slave to master, and/or commit empty transactions on master to account for transactions that have been', Error_code: MY-013114
```
Since the MTR result file is valid, we can suppress this error.

Reviewed By: yichenshen

Differential Revision: D39141846

-------------------------------------------------------------------------------

Fix heap overflow in group_relay_log_name handling

Summary:
We were accessing group_relay_log_name in
Query_log_event::do_apply_event_worker() but it's assigned only after
the coordinator thread encounters an end event (i.e. xid event or a
query event with "COMMIT" or "ROLLBACK" query). This was causing a race
between accessing group_relay_log_name in the worker thread and writing
it on the coordinator thread. We don't need to set transaction position
in events other than end event, so now we set transaction position in
query event only if it's an end event. The race is eliminated because
group_relay_log_name is set before enqueuing the event to the worker
thread (in both dep repl and vanilla mts).

Reviewed By: lth

Differential Revision: D28767430

-------------------------------------------------------------------------------

fix memory during MYSQL_BIN_LOG::open_existing_binlog

Summary:
asandebug complain there are memory leaks during MYSQL_BIN_LOG open

Direct leak of 50 byte(s) in 1 object(s) allocated from:
    #0 0x67460ef in malloc
    #1 0x93f0777 in my_raw_malloc(unsigned long, int)
    #2 0x93f064a in my_malloc(unsigned int, unsigned long, int)
    #3 0x93f0eb0 in my_strdup(unsigned int, char const*, int)
    #4 0x8af01a6 in MYSQL_BIN_LOG::open(unsigned int, char const*, char const*, unsigned int)
    #5 0x8af8064 in MYSQL_BIN_LOG::open_binlog(char const*, char const*, unsigned long, bool, bool, bool, Format_description_log_event*, unsigned int, RaftRotateInfo*, bool)
    #6 0x8b00c00 in MYSQL_BIN_LOG::new_file_impl(bool, Format_description_log_event*, RaftRotateInfo*)
    #7 0x8d65e47 in rotate_relay_log(Master_info*, bool, bool, bool, RaftRotateInfo*)
    #8 0x8d661c0 in rotate_relay_log_for_raft(RaftRotateInfo*)
    #9 0x8c7696a in process_raft_queue
    #10 0xa0fa1fd in pfs_spawn_thread(void*)
    #11 0x7f8c9a12b20b in start_thread

release these memory before assign them

Reviewed By: Pushapgl

Differential Revision: D28819752
inikep pushed a commit that referenced this pull request Jul 30, 2024
Summary:
When COUNT(*) style aggregation appears in queries, the optimizer realizes that a full scan needs to be done. The index doesn't matter - any index will do as it long it satisfies the other query predicates present.

There are two heuristics at play. If there is a choice of multiple indexes, the optimizer will pick the index that has the shortest length since that means less disk scan time, therefore faster query (heuristic #1).

In the case of vector indexes, the current design can have the vector field present in the clustered PK as well as the vector index (eg the flat index). Even though there's no clear advantage, the heuristic is based on INNODB style index layout, and assumes that a secondary index will always be a better choice as long as it does not cover all table columns (heuristic #2). In reality, in both INNODB and MyRocks, the secondary key does contain the PK. But this fact is ignored.

What ends up happening is that by only considered the # of explicitly defined key parts, the vector index ends up getting picked up.

In the case of the IVF index, the vector index may be shorter than the primary clustered index. Thus secondary index will have a shorter key length than the primary, and will end up getting picked up (i.e. if the key part heuristic doesn't already pick the vector index again).

There can be many other index combinations. We want the optimizer to pick another available index. If no other index is available, then pick the primary clustered index.

Differential Revision: D55003211

fbshipit-source-id: 5ce6b34
inikep pushed a commit that referenced this pull request Jul 31, 2024
Summary:
This is an upstream bug: https://bugs.mysql.com/bug.php?id=92964

The issue is that if slave instance use sysvar_session_track_gtids == OWN_GTID and also enable MTS, the slave will lag behind master and its performance degrades over time.

The reason for this lag/perf degrade is caused by each slave worker keep adding its own gtid into its own Session_consistency_gtids_ctx.m_gtid_set during each transaction commit. Overtime, some of Session_consistency_gtids_ctx.m_gtid_set may have millions gno_interval and it will take long time to insert 1 gtid into Session_consistency_gtids_ctx.m_gtid_set(time complexity O(n)), see [Gtid_set::add_gno_interval](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/rpl_gtid_set.cc#L323).

There are couple related code to this issue:
- During slave worker thread start, there is a [check](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/session_tracker.cc#L1560) to not enable tracker for system threads but that check doesn't work---The THD system_thread field is initialized with default value(NON_SYSTEM_THREAD) during THD::THD.
```
  m_enabled = thd->variables.session_track_gtids != OFF &&
              /* No need to track GTIDs for system threads. */
              thd->system_thread == NON_SYSTEM_THREAD;
```
```
Call stack:
```
- in [Session_gtids_track::Update](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/session_tracker.cc#L1581), the else code doesn't unregister listener due to mysql/mysql-server@1e0844c try to fix another issue.

Solutions:
1.  [Current]add system threads check during collecting data, since THD will be fully initialized at that time.
 -Pro: simple change(easy to port) and perf is ok
2. add `thd->session_track_gtids=OFF; thd->session_tracker.get_tracker(SESSION_GTIDS_TRACKER)->update(thd);` after [init_slave_thread()](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/rpl_slave.cc#L6788)
-Pro: During init_slave_thread(), thd->system_thread will be set correctly value.
-Con: it need to add couple places, such as handle_slave_io, handle_slave_sql, handle_slave_worker
3. add `thd->session_track_gtids=OFF;thd->session_tracker.get_tracker(SESSION_GTIDS_TRACKER)->update(thd);` inside [init_slave_thread()](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/rpl_slave.cc#L6788)
   -Pro: only add once for slave/slave worker/IO threads
   -Con: [should_collect](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/rpl_context.cc#L53) check still return true, thus it is less perf than #1
4. add `thd->session_track_gtids=OFF;thd->rpl_thd_ctx.session_gtids_ctx().unregister_ctx_change_listener(thd->session_tracker.get_tracker(SESSION_GTIDS_TRACKER));` inside init_slave_thread()
   -Pro: perf is better than #3, on par as #1
   -Con: if there are other non-system threads(except IO/sql threads) do transactions, these threads will still collect its own gtids.
5. add another THD overload constructor which takes 'enum_thread_type' as a parameter whose default value is NON_SYSTEM_THREAD
 -Pro: it will initialize correct value with THD and won't enable those session_tracker
 -Con: similar to #2/#4, need to replace THD() with THD(enum_thread_type) couples places.

Differential Revision: D18072672
inikep pushed a commit that referenced this pull request Jul 31, 2024
Summary:
MySQL bypass infrastructure change

1. Adds a new hton override `handlerton::handle_single_table_select` to allow hijacking of SELECT statement execution from any storage engine.
2. Add lex parsing for bypass hints in optimizer hint style /*+ bypass */ and /*+ no_bypass */
3. MyRocks overrides handle_single_table_select but does nothing and simply fallback.

This is a port from from bypass feature branch - change #1 of 3. I'm planning to port the changes in the following order:

1. MySQL lexer + SELECT hijacking (this one).
2. SELECT Statement parsing and checking
3. Execution of SELECT statement in bypass

Reference Patch: facebook@2d19dc0

-----
Porting Notes:

* The entry points have changed due to MySQL 8.0 refactoring to SELECT path, and now it is in `Sql_cmd_dml::execute_inner`. I'll also see if overriding the entire execute would be better but this should be good enough for now.
* The check for whether we should go to bypass handler is mostly the same, except st_select_lex_t is now SELECT_LEX.
* The lex parsing for hints was mostly ported as-is. I was also debating whether I should just use optimizer hints instead of hacking the lexer, but decided against using optimizer hint for now so that migration from 5.6 will be easier. We can migrate to optimizer hint once 8.0 migration is complete.

Differential Revision: D22807040

---------------------------------------------------------------------------------------------

[bypass] MySQL bypass change #2 - SELECT parsing [non-RocksDB part]

Summary:
This is change #2 that ports the SELECT statement parsing from feature-myrocks-bypass branch, including:

1. select_parser class that parses SELECT statements and validates the scenarios are supported, and error with proper message if not supported
2. You can control bypass policy with rocksdb_select_bypass_policy variable = 0 (ALWAYS OFF) / 1 (ALWAYS ON) / 2 (OPT IN) / 3 (OPT OUT).
3. A few metrics for number of SELECT bypass statements executed/rejected (due to not supported) /failed (something wrong in execution)

At this point this only dumps the SELECT statement parsed results without doing anything useful. The next change will be porting the query plan execution and point/range query execution part.

Please ignore the unused attributes for now - they'll be removed in next change once the execution part is ported.

Reference Patch:

facebook@e79b20a
facebook@8072e3d

 ---
Porting Notes:

There are a lot of small changes in this area but nothing too bad:
* Some bypass scenarios no longer fail or fail with different error message because of the new parse tree -> AST conversion process. For example, `SELECT *` now works correctly
* st_select_lex_t -> SELECT_LEX, and many members has changed name, such as where -> where_cond()
* protocol usage has changed to start_row, store, end_row in 8.0, which is much better.
* thd->query() is now a LEX_STRING
* SELECT option check (such as SELECT_DISTINCT) now uses active_options() and only check for SELECT_DISTINCT as there seems to be default value being set most likely due to side effect of having more parse tree processing being done. I checked the flags and it looks like only SELECT_DISTINCT are interesting and needs to be blocked.
* SELECT INTO OUTFILE check is changed to use results->needs_file_priviledge
* True/False const are now represented as Item_func_true/Item_func_false. For now I'm over-generalizing to functions that are const, and expect val_int/etc would do the right thing

Reviewed By: luqun

Differential Revision: D22808305

-----------------------------------------------------------------------------------------------

Introduce base select parser class

Summary:
To make it easier to add a new bypass rpc parser into existing bypass implementation (nosql_access.cc),
and make minimum change to bypass engine while adding a bypass rpc parser, I introduced a new base class,
base_select_parser. It includes all essential members and methods for parsing bypass sq/rpc queries.
Existing bypass sql parser (select_parser) and a new bypass rpc parser (not included in this diff)
inherits the base function.

Bypass RPC reference diff: D30493654

Reviewed By: yizhang82

Differential Revision: D31870628
inikep pushed a commit that referenced this pull request Jul 31, 2024
Summary:
Almost a direct port of the feature in 5.6 except for minor visual changes and
some minor changes related to 8.0 thread handling

Also brings in the basic raft_listener_queue without the implemented functions

Differential Revision: D21670710

Use DBUG_TRACE instead of DBUG_ENTER

Summary:
several testcase failed in asan debug,

==106552==ERROR: AddressSanitizer: stack-buffer-underflow on address 0x7f38f3590078 at pc 0x0000093aa495 bp 0x7f38f358ffc0 sp 0x7f38f358ffb8
READ of size 4 at 0x7f38f3590078 thread T203
    #0 0x93aa494 in _db_enter_(char const*, int, char const*, unsigned int, _db_stack_frame_*)
    #1 0x8d0b9bb in Relay_log_info::remove_logged_gtids
    #2 0x8b1ec3f in trim_logged_gtid
    #3 0x8c767cf in process_raft_queue

If Use DBUG_ENTER, then you need to use DBUG_RETURN to pop current frame in CODE_STATE.
If use DBUG_TRACE, it will pop current frame during .dtor

ps. small refactor changes for sql/rpl_handler.cc

Reviewed By: bhatvinay

Differential Revision: D28809418
inikep pushed a commit that referenced this pull request Jul 31, 2024
…_cache_type / tx_readonly variables

Summary:
5.6 Java client we have today will issue following SQL query on connect:

```
/* MYSQL_CJ_FULL_PROD_NAME@ ( Revision: MYSQL_CJ_REVISION@ ) */SELECT  session.auto_increment_increment AS auto_increment_increment, character_set_client AS character_set_client, character_set_connection AS character_set_connection, character_set_results AS character_set_results, character_set_server AS character_set_server, init_connect AS init_connect, interactive_timeout AS interactive_timeout, license AS license, lower_case_table_names AS lower_case_table_names, max_allowed_packet AS max_allowed_packet, net_buffer_length AS net_buffer_length, net_write_timeout AS net_write_timeout, query_cache_size AS query_cache_size, query_cache_type AS query_cache_type, sql_mode AS sql_mode, system_time_zone AS system_time_zone, Timeroot_zone AS time_zone, tx_isolation AS tx_isolation, wait_timeout AS wait_timeout
```

It'll fail on 8.0 as tx_isolation is replaced by transaction_isolation, and query_cache_size/query_cache_type are deprecated (query cache feature is gone in 8.0). This adds a reasonable implementation of those variables to make the connector happy (without introducing other issues, hopefully), before we migrate our java client to 8.0:
1. Add dummy implementation of query_cache_size / query_cache_type that returns 0 size and cache off. Both are readonly. Note I changed query_cache_type to global as I don't want to pay for the extra bytes for a readonly variable and it doesn't make too much sense for a session variable to be read-only anyway. A new test is added as well.
2. Add read-only tx_isolation that directly maps to transaction_isolation variable. transaction_isolation_basic test is enhanced to ensure any change to transaction_isolation gets mapped to tx_isolation (which just works since they are the same variable on thd session)
3. Add read-only tx_read_only maps to transaction_read_only similar to #2 for cases Java cient calls connection.isReadOnly.

Reviewed By: zhichengzhu

Differential Revision: D19551619
inikep pushed a commit that referenced this pull request Jul 31, 2024
Summary:
port D21375044 (facebook@7fc9eaa) (facebook@7fc9eaa) and D17291883 (facebook@6a6a35f) (facebook@6a6a35f)
**Summary**
* As part of leadership changes, we will trim binlogs/relay logs on a
running mysql instance. This needs the ability to remove gtids from executed_gtid set.
This diff adds the ability to do the same. One can enqueue an event in the raft
listener queue and the raft thread in the server will take care of removing and
updating executed_gtid on a running instance.
* When raft logs are trimmed as part of TruncateOpsAfter(), the gtid of the trimmed trxs are also removed from executed/logged gtids. However, when binlog files are rotated, the previous gtid written to the new file depends on what files are rotated. During  binlog rotation the prev gtid are the logged/executed gtids of the instance. During relay log rotation prev gtids are the retrieved gtids of the instance. Hence, in addition to trimming logged gtids/executed gtids, we should also trim retrieved gtids. This prevents creation of holes in the instances executed_gtid set as the replicaset goes through multiple promotions over its lifetime.

Reviewed By: luqun

Differential Revision: D24690007

---------------------------------------------------------------------------------------

Use DBUG_TRACE instead of DBUG_ENTER

Summary:
several testcase failed in asan debug,

==106552==ERROR: AddressSanitizer: stack-buffer-underflow on address 0x7f38f3590078 at pc 0x0000093aa495 bp 0x7f38f358ffc0 sp 0x7f38f358ffb8
READ of size 4 at 0x7f38f3590078 thread T203
    #0 0x93aa494 in _db_enter_(char const*, int, char const*, unsigned int, _db_stack_frame_*)
    #1 0x8d0b9bb in Relay_log_info::remove_logged_gtids
    #2 0x8b1ec3f in trim_logged_gtid
    #3 0x8c767cf in process_raft_queue

If Use DBUG_ENTER, then you need to use DBUG_RETURN to pop current frame in CODE_STATE.
If use DBUG_TRACE, it will pop current frame during .dtor

ps. small refactor changes for sql/rpl_handler.cc

Reviewed By: bhatvinay

Differential Revision: D28809418
inikep pushed a commit that referenced this pull request Jul 31, 2024
Summary:
[Porting Notes]
We want to dump raft logs to vanilla async replicas regardless
of whether it's the relay log or binlog. Effectively after this change
we'll dump relay logs on the followers and binlogs on the leader. When
the raft role changes, the logs to the dumped are also changed.
Dump_log class is introduced as a thin wrapper/continer around
mysql_bin_log or rli->relay_log and is inited with mysql_bin_log
to emulate vanilla mysql behavior. Dump threads use the global
dump_log object instead of mysql_bin_log directly. We switch the log
in dump log only when raft role changes (in binlog_change_to_binlog()
and binlog_change_to_apply_log()).
During raft role change we take all log releated locks (LOCK_log,
LOCK_index, LOCK_binlog_end_pos, and dump log lock) to serialize it with
other log operations like dumping logs.

Related doc - https://fb.quip.com/oTVAAdgEi4zY

This diff contains below 7 patches:
D23013977
D24766787
D24716539
D24900223
D24955284
D25174166
D25775525

Reviewed By: luqun

Differential Revision: D26141496

-------------------------------------------------------------------------------

Passing raw_log pointer to wait_with_heartbeat() and wait_without_heartbeat()

Summary:
When enable_raft_plugin is OFF Dump_log::lock() is a no-op.
Which means that when enable_raft_plugin is OFF there can be a race
between log switching and dump threads. This could lead to a scenario
where the raw_log that wait_next_event() is working on might be
different than what wait_with_heartbeat()/wait_without_heartbeat() is
working on. This can cause deadlocks because
wait_with_heartbeat()/wait_without_heartbeat()'s mysql_cond_wait would
unlock and then lock a different log's LOCK_binlog_end_pos mutex which
would then never be unlocked by wait_next_event().

Reviewed By: anirbanr-fb

Differential Revision: D32152658

-----------------------------------------------------------------------------------------

Fix rpl_raft_dump_raft_logs

Summary:
This tests completes but fails because the following warning exists:
```
2022-08-30T16:28:00.159525Z 11 [ERROR] [MY-013114] [Repl] Slave I/O for channel '': Got fatal error 1236 from master when reading data from binary log: 'Slave has more GTIDs than the master has, using the master's SERVER_UUID. This may indicate that the end of the binary log was truncated or that the last binary log file was lost, e.g., after a power or disk failure when sync_binlog != 1. The master may or may not have rolled back transactions that were already replicated to the slave. Suggest to replicate any transactions that master has rolled back from slave to master, and/or commit empty transactions on master to account for transactions that have been', Error_code: MY-013114
```
Since the MTR result file is valid, we can suppress this error.

Reviewed By: yichenshen

Differential Revision: D39141846

-------------------------------------------------------------------------------

Fix heap overflow in group_relay_log_name handling

Summary:
We were accessing group_relay_log_name in
Query_log_event::do_apply_event_worker() but it's assigned only after
the coordinator thread encounters an end event (i.e. xid event or a
query event with "COMMIT" or "ROLLBACK" query). This was causing a race
between accessing group_relay_log_name in the worker thread and writing
it on the coordinator thread. We don't need to set transaction position
in events other than end event, so now we set transaction position in
query event only if it's an end event. The race is eliminated because
group_relay_log_name is set before enqueuing the event to the worker
thread (in both dep repl and vanilla mts).

Reviewed By: lth

Differential Revision: D28767430

-------------------------------------------------------------------------------

fix memory during MYSQL_BIN_LOG::open_existing_binlog

Summary:
asandebug complain there are memory leaks during MYSQL_BIN_LOG open

Direct leak of 50 byte(s) in 1 object(s) allocated from:
    #0 0x67460ef in malloc
    #1 0x93f0777 in my_raw_malloc(unsigned long, int)
    #2 0x93f064a in my_malloc(unsigned int, unsigned long, int)
    #3 0x93f0eb0 in my_strdup(unsigned int, char const*, int)
    #4 0x8af01a6 in MYSQL_BIN_LOG::open(unsigned int, char const*, char const*, unsigned int)
    #5 0x8af8064 in MYSQL_BIN_LOG::open_binlog(char const*, char const*, unsigned long, bool, bool, bool, Format_description_log_event*, unsigned int, RaftRotateInfo*, bool)
    #6 0x8b00c00 in MYSQL_BIN_LOG::new_file_impl(bool, Format_description_log_event*, RaftRotateInfo*)
    #7 0x8d65e47 in rotate_relay_log(Master_info*, bool, bool, bool, RaftRotateInfo*)
    #8 0x8d661c0 in rotate_relay_log_for_raft(RaftRotateInfo*)
    #9 0x8c7696a in process_raft_queue
    #10 0xa0fa1fd in pfs_spawn_thread(void*)
    #11 0x7f8c9a12b20b in start_thread

release these memory before assign them

Reviewed By: Pushapgl

Differential Revision: D28819752
inikep pushed a commit that referenced this pull request Jul 31, 2024
Summary:
When COUNT(*) style aggregation appears in queries, the optimizer realizes that a full scan needs to be done. The index doesn't matter - any index will do as it long it satisfies the other query predicates present.

There are two heuristics at play. If there is a choice of multiple indexes, the optimizer will pick the index that has the shortest length since that means less disk scan time, therefore faster query (heuristic #1).

In the case of vector indexes, the current design can have the vector field present in the clustered PK as well as the vector index (eg the flat index). Even though there's no clear advantage, the heuristic is based on INNODB style index layout, and assumes that a secondary index will always be a better choice as long as it does not cover all table columns (heuristic #2). In reality, in both INNODB and MyRocks, the secondary key does contain the PK. But this fact is ignored.

What ends up happening is that by only considered the # of explicitly defined key parts, the vector index ends up getting picked up.

In the case of the IVF index, the vector index may be shorter than the primary clustered index. Thus secondary index will have a shorter key length than the primary, and will end up getting picked up (i.e. if the key part heuristic doesn't already pick the vector index again).

There can be many other index combinations. We want the optimizer to pick another available index. If no other index is available, then pick the primary clustered index.

Differential Revision: D55003211

fbshipit-source-id: 5ce6b34
inikep pushed a commit that referenced this pull request Aug 2, 2024
Summary:
This is change #2 that ports the SELECT statement parsing from feature-myrocks-bypass branch, including:

1. select_parser class that parses SELECT statements and validates the scenarios are supported, and error with proper message if not supported
2. You can control bypass policy with rocksdb_select_bypass_policy variable = 0 (ALWAYS OFF) / 1 (ALWAYS ON) / 2 (OPT IN) / 3 (OPT OUT).
3. A few metrics for number of SELECT bypass statements executed/rejected (due to not supported) /failed (something wrong in execution)

At this point this only dumps the SELECT statement parsed results without doing anything useful. The next change will be porting the query plan execution and point/range query execution part.

Please ignore the unused attributes for now - they'll be removed in next change once the execution part is ported.

Reference Patch:

facebook@e79b20a
facebook@8072e3d

 ---
Porting Notes:

There are a lot of small changes in this area but nothing too bad:
* Some bypass scenarios no longer fail or fail with different error message because of the new parse tree -> AST conversion process. For example, `SELECT *` now works correctly
* st_select_lex_t -> SELECT_LEX, and many members has changed name, such as where -> where_cond()
* protocol usage has changed to start_row, store, end_row in 8.0, which is much better.
* thd->query() is now a LEX_STRING
* SELECT option check (such as SELECT_DISTINCT) now uses active_options() and only check for SELECT_DISTINCT as there seems to be default value being set most likely due to side effect of having more parse tree processing being done. I checked the flags and it looks like only SELECT_DISTINCT are interesting and needs to be blocked.
* SELECT INTO OUTFILE check is changed to use results->needs_file_priviledge
* True/False const are now represented as Item_func_true/Item_func_false. For now I'm over-generalizing to functions that are const, and expect val_int/etc would do the right thing

Reviewed By: luqun

Differential Revision: D22808305

-----------------------------------------------------------------------------------------------

Introduce base select parser class

Summary:
To make it easier to add a new bypass rpc parser into existing bypass implementation (nosql_access.cc),
and make minimum change to bypass engine while adding a bypass rpc parser, I introduced a new base class,
base_select_parser. It includes all essential members and methods for parsing bypass sq/rpc queries.
Existing bypass sql parser (select_parser) and a new bypass rpc parser (not included in this diff)
inherits the base function.

Bypass RPC reference diff: D30493654

Reviewed By: yizhang82

Differential Revision: D31870628
inikep pushed a commit that referenced this pull request Aug 2, 2024
Summary:
some MTR failed in ubsan due to num of rows/records calculated in records_in_range() is a negative values which is caused by m_actual_disk_size is a negative value

```
storage/rocksdb/ha_rocksdb.cc:: runtime error: -56.8272 is outside the range of representable values of type 'unsigned long long'
    #0  myrocks::ha_rocksdb::records_in_range_internal(unsigned int, key_range*, key_range*, long, long, unsigned long long*, unsigned long long*) /data/sandcastle/boxes/trunk-git-mysql/storage/rocksdb/ha_rocksdb.cc:14855:16
    #1  myrocks::ha_rocksdb::records_in_range(unsigned int, key_range*, key_range*) /data/sandcastle/boxes/trunk-git-mysql/storage/rocksdb/ha_rocksdb.cc:14760:3
    #2  handler::multi_range_read_info_const(unsigned int, RANGE_SEQ_IF*, void*, unsigned int, unsigned int*, unsigned int*, Cost_estimate*) /data/sandcastle/boxes/trunk-git-mysql/sql/handler.cc:6608:26
    #3  myrocks::ha_rocksdb::multi_range_read_info_const(unsigned int, RANGE_SEQ_IF*, void*, unsigned int, unsigned int*, unsigned int*, Cost_estimate*) /data/sandcastle/boxes/trunk-git-mysql/storage/rocksdb/ha_rocksdb.cc:19002:18
    #4  check_quick_select(THD*, RANGE_OPT_P
```

Due to m_actual_disk_size is an estimated value, always reset to 0 if it i becomes negative.

Differential Revision: D50531919
inikep pushed a commit that referenced this pull request Aug 2, 2024
Summary:
This is an upstream bug: https://bugs.mysql.com/bug.php?id=92964

The issue is that if slave instance use sysvar_session_track_gtids == OWN_GTID and also enable MTS, the slave will lag behind master and its performance degrades over time.

The reason for this lag/perf degrade is caused by each slave worker keep adding its own gtid into its own Session_consistency_gtids_ctx.m_gtid_set during each transaction commit. Overtime, some of Session_consistency_gtids_ctx.m_gtid_set may have millions gno_interval and it will take long time to insert 1 gtid into Session_consistency_gtids_ctx.m_gtid_set(time complexity O(n)), see [Gtid_set::add_gno_interval](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/rpl_gtid_set.cc#L323).

There are couple related code to this issue:
- During slave worker thread start, there is a [check](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/session_tracker.cc#L1560) to not enable tracker for system threads but that check doesn't work---The THD system_thread field is initialized with default value(NON_SYSTEM_THREAD) during THD::THD.
```
  m_enabled = thd->variables.session_track_gtids != OFF &&
              /* No need to track GTIDs for system threads. */
              thd->system_thread == NON_SYSTEM_THREAD;
```
```
Call stack:
```
- in [Session_gtids_track::Update](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/session_tracker.cc#L1581), the else code doesn't unregister listener due to mysql/mysql-server@1e0844c try to fix another issue.

Solutions:
1.  [Current]add system threads check during collecting data, since THD will be fully initialized at that time.
 -Pro: simple change(easy to port) and perf is ok
2. add `thd->session_track_gtids=OFF; thd->session_tracker.get_tracker(SESSION_GTIDS_TRACKER)->update(thd);` after [init_slave_thread()](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/rpl_slave.cc#L6788)
-Pro: During init_slave_thread(), thd->system_thread will be set correctly value.
-Con: it need to add couple places, such as handle_slave_io, handle_slave_sql, handle_slave_worker
3. add `thd->session_track_gtids=OFF;thd->session_tracker.get_tracker(SESSION_GTIDS_TRACKER)->update(thd);` inside [init_slave_thread()](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/rpl_slave.cc#L6788)
   -Pro: only add once for slave/slave worker/IO threads
   -Con: [should_collect](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/rpl_context.cc#L53) check still return true, thus it is less perf than #1
4. add `thd->session_track_gtids=OFF;thd->rpl_thd_ctx.session_gtids_ctx().unregister_ctx_change_listener(thd->session_tracker.get_tracker(SESSION_GTIDS_TRACKER));` inside init_slave_thread()
   -Pro: perf is better than #3, on par as #1
   -Con: if there are other non-system threads(except IO/sql threads) do transactions, these threads will still collect its own gtids.
5. add another THD overload constructor which takes 'enum_thread_type' as a parameter whose default value is NON_SYSTEM_THREAD
 -Pro: it will initialize correct value with THD and won't enable those session_tracker
 -Con: similar to #2/#4, need to replace THD() with THD(enum_thread_type) couples places.

Differential Revision: D18072672
inikep pushed a commit that referenced this pull request Aug 2, 2024
Summary:
MySQL bypass infrastructure change

1. Adds a new hton override `handlerton::handle_single_table_select` to allow hijacking of SELECT statement execution from any storage engine.
2. Add lex parsing for bypass hints in optimizer hint style /*+ bypass */ and /*+ no_bypass */
3. MyRocks overrides handle_single_table_select but does nothing and simply fallback.

This is a port from from bypass feature branch - change #1 of 3. I'm planning to port the changes in the following order:

1. MySQL lexer + SELECT hijacking (this one).
2. SELECT Statement parsing and checking
3. Execution of SELECT statement in bypass

Reference Patch: facebook@2d19dc0

-----
Porting Notes:

* The entry points have changed due to MySQL 8.0 refactoring to SELECT path, and now it is in `Sql_cmd_dml::execute_inner`. I'll also see if overriding the entire execute would be better but this should be good enough for now.
* The check for whether we should go to bypass handler is mostly the same, except st_select_lex_t is now SELECT_LEX.
* The lex parsing for hints was mostly ported as-is. I was also debating whether I should just use optimizer hints instead of hacking the lexer, but decided against using optimizer hint for now so that migration from 5.6 will be easier. We can migrate to optimizer hint once 8.0 migration is complete.

Differential Revision: D22807040

---------------------------------------------------------------------------------------------

[bypass] MySQL bypass change #2 - SELECT parsing [non-RocksDB part]

Summary:
This is change #2 that ports the SELECT statement parsing from feature-myrocks-bypass branch, including:

1. select_parser class that parses SELECT statements and validates the scenarios are supported, and error with proper message if not supported
2. You can control bypass policy with rocksdb_select_bypass_policy variable = 0 (ALWAYS OFF) / 1 (ALWAYS ON) / 2 (OPT IN) / 3 (OPT OUT).
3. A few metrics for number of SELECT bypass statements executed/rejected (due to not supported) /failed (something wrong in execution)

At this point this only dumps the SELECT statement parsed results without doing anything useful. The next change will be porting the query plan execution and point/range query execution part.

Please ignore the unused attributes for now - they'll be removed in next change once the execution part is ported.

Reference Patch:

facebook@e79b20a
facebook@8072e3d

 ---
Porting Notes:

There are a lot of small changes in this area but nothing too bad:
* Some bypass scenarios no longer fail or fail with different error message because of the new parse tree -> AST conversion process. For example, `SELECT *` now works correctly
* st_select_lex_t -> SELECT_LEX, and many members has changed name, such as where -> where_cond()
* protocol usage has changed to start_row, store, end_row in 8.0, which is much better.
* thd->query() is now a LEX_STRING
* SELECT option check (such as SELECT_DISTINCT) now uses active_options() and only check for SELECT_DISTINCT as there seems to be default value being set most likely due to side effect of having more parse tree processing being done. I checked the flags and it looks like only SELECT_DISTINCT are interesting and needs to be blocked.
* SELECT INTO OUTFILE check is changed to use results->needs_file_priviledge
* True/False const are now represented as Item_func_true/Item_func_false. For now I'm over-generalizing to functions that are const, and expect val_int/etc would do the right thing

Reviewed By: luqun

Differential Revision: D22808305

-----------------------------------------------------------------------------------------------

Introduce base select parser class

Summary:
To make it easier to add a new bypass rpc parser into existing bypass implementation (nosql_access.cc),
and make minimum change to bypass engine while adding a bypass rpc parser, I introduced a new base class,
base_select_parser. It includes all essential members and methods for parsing bypass sq/rpc queries.
Existing bypass sql parser (select_parser) and a new bypass rpc parser (not included in this diff)
inherits the base function.

Bypass RPC reference diff: D30493654

Reviewed By: yizhang82

Differential Revision: D31870628
inikep pushed a commit that referenced this pull request Aug 2, 2024
Summary:
Almost a direct port of the feature in 5.6 except for minor visual changes and
some minor changes related to 8.0 thread handling

Also brings in the basic raft_listener_queue without the implemented functions

Differential Revision: D21670710

Use DBUG_TRACE instead of DBUG_ENTER

Summary:
several testcase failed in asan debug,

==106552==ERROR: AddressSanitizer: stack-buffer-underflow on address 0x7f38f3590078 at pc 0x0000093aa495 bp 0x7f38f358ffc0 sp 0x7f38f358ffb8
READ of size 4 at 0x7f38f3590078 thread T203
    #0 0x93aa494 in _db_enter_(char const*, int, char const*, unsigned int, _db_stack_frame_*)
    #1 0x8d0b9bb in Relay_log_info::remove_logged_gtids
    #2 0x8b1ec3f in trim_logged_gtid
    #3 0x8c767cf in process_raft_queue

If Use DBUG_ENTER, then you need to use DBUG_RETURN to pop current frame in CODE_STATE.
If use DBUG_TRACE, it will pop current frame during .dtor

ps. small refactor changes for sql/rpl_handler.cc

Reviewed By: bhatvinay

Differential Revision: D28809418
inikep pushed a commit that referenced this pull request Aug 2, 2024
…_cache_type / tx_readonly variables

Summary:
5.6 Java client we have today will issue following SQL query on connect:

```
/* MYSQL_CJ_FULL_PROD_NAME@ ( Revision: MYSQL_CJ_REVISION@ ) */SELECT  session.auto_increment_increment AS auto_increment_increment, character_set_client AS character_set_client, character_set_connection AS character_set_connection, character_set_results AS character_set_results, character_set_server AS character_set_server, init_connect AS init_connect, interactive_timeout AS interactive_timeout, license AS license, lower_case_table_names AS lower_case_table_names, max_allowed_packet AS max_allowed_packet, net_buffer_length AS net_buffer_length, net_write_timeout AS net_write_timeout, query_cache_size AS query_cache_size, query_cache_type AS query_cache_type, sql_mode AS sql_mode, system_time_zone AS system_time_zone, Timeroot_zone AS time_zone, tx_isolation AS tx_isolation, wait_timeout AS wait_timeout
```

It'll fail on 8.0 as tx_isolation is replaced by transaction_isolation, and query_cache_size/query_cache_type are deprecated (query cache feature is gone in 8.0). This adds a reasonable implementation of those variables to make the connector happy (without introducing other issues, hopefully), before we migrate our java client to 8.0:
1. Add dummy implementation of query_cache_size / query_cache_type that returns 0 size and cache off. Both are readonly. Note I changed query_cache_type to global as I don't want to pay for the extra bytes for a readonly variable and it doesn't make too much sense for a session variable to be read-only anyway. A new test is added as well.
2. Add read-only tx_isolation that directly maps to transaction_isolation variable. transaction_isolation_basic test is enhanced to ensure any change to transaction_isolation gets mapped to tx_isolation (which just works since they are the same variable on thd session)
3. Add read-only tx_read_only maps to transaction_read_only similar to #2 for cases Java cient calls connection.isReadOnly.

Reviewed By: zhichengzhu

Differential Revision: D19551619
inikep pushed a commit that referenced this pull request Aug 2, 2024
Summary:
port D21375044 (facebook@7fc9eaa) (facebook@7fc9eaa) and D17291883 (facebook@6a6a35f) (facebook@6a6a35f)
**Summary**
* As part of leadership changes, we will trim binlogs/relay logs on a
running mysql instance. This needs the ability to remove gtids from executed_gtid set.
This diff adds the ability to do the same. One can enqueue an event in the raft
listener queue and the raft thread in the server will take care of removing and
updating executed_gtid on a running instance.
* When raft logs are trimmed as part of TruncateOpsAfter(), the gtid of the trimmed trxs are also removed from executed/logged gtids. However, when binlog files are rotated, the previous gtid written to the new file depends on what files are rotated. During  binlog rotation the prev gtid are the logged/executed gtids of the instance. During relay log rotation prev gtids are the retrieved gtids of the instance. Hence, in addition to trimming logged gtids/executed gtids, we should also trim retrieved gtids. This prevents creation of holes in the instances executed_gtid set as the replicaset goes through multiple promotions over its lifetime.

Reviewed By: luqun

Differential Revision: D24690007

---------------------------------------------------------------------------------------

Use DBUG_TRACE instead of DBUG_ENTER

Summary:
several testcase failed in asan debug,

==106552==ERROR: AddressSanitizer: stack-buffer-underflow on address 0x7f38f3590078 at pc 0x0000093aa495 bp 0x7f38f358ffc0 sp 0x7f38f358ffb8
READ of size 4 at 0x7f38f3590078 thread T203
    #0 0x93aa494 in _db_enter_(char const*, int, char const*, unsigned int, _db_stack_frame_*)
    #1 0x8d0b9bb in Relay_log_info::remove_logged_gtids
    #2 0x8b1ec3f in trim_logged_gtid
    #3 0x8c767cf in process_raft_queue

If Use DBUG_ENTER, then you need to use DBUG_RETURN to pop current frame in CODE_STATE.
If use DBUG_TRACE, it will pop current frame during .dtor

ps. small refactor changes for sql/rpl_handler.cc

Reviewed By: bhatvinay

Differential Revision: D28809418
inikep pushed a commit that referenced this pull request Aug 2, 2024
Summary:
[Porting Notes]
We want to dump raft logs to vanilla async replicas regardless
of whether it's the relay log or binlog. Effectively after this change
we'll dump relay logs on the followers and binlogs on the leader. When
the raft role changes, the logs to the dumped are also changed.
Dump_log class is introduced as a thin wrapper/continer around
mysql_bin_log or rli->relay_log and is inited with mysql_bin_log
to emulate vanilla mysql behavior. Dump threads use the global
dump_log object instead of mysql_bin_log directly. We switch the log
in dump log only when raft role changes (in binlog_change_to_binlog()
and binlog_change_to_apply_log()).
During raft role change we take all log releated locks (LOCK_log,
LOCK_index, LOCK_binlog_end_pos, and dump log lock) to serialize it with
other log operations like dumping logs.

Related doc - https://fb.quip.com/oTVAAdgEi4zY

This diff contains below 7 patches:
D23013977
D24766787
D24716539
D24900223
D24955284
D25174166
D25775525

Reviewed By: luqun

Differential Revision: D26141496

-------------------------------------------------------------------------------

Passing raw_log pointer to wait_with_heartbeat() and wait_without_heartbeat()

Summary:
When enable_raft_plugin is OFF Dump_log::lock() is a no-op.
Which means that when enable_raft_plugin is OFF there can be a race
between log switching and dump threads. This could lead to a scenario
where the raw_log that wait_next_event() is working on might be
different than what wait_with_heartbeat()/wait_without_heartbeat() is
working on. This can cause deadlocks because
wait_with_heartbeat()/wait_without_heartbeat()'s mysql_cond_wait would
unlock and then lock a different log's LOCK_binlog_end_pos mutex which
would then never be unlocked by wait_next_event().

Reviewed By: anirbanr-fb

Differential Revision: D32152658

-----------------------------------------------------------------------------------------

Fix rpl_raft_dump_raft_logs

Summary:
This tests completes but fails because the following warning exists:
```
2022-08-30T16:28:00.159525Z 11 [ERROR] [MY-013114] [Repl] Slave I/O for channel '': Got fatal error 1236 from master when reading data from binary log: 'Slave has more GTIDs than the master has, using the master's SERVER_UUID. This may indicate that the end of the binary log was truncated or that the last binary log file was lost, e.g., after a power or disk failure when sync_binlog != 1. The master may or may not have rolled back transactions that were already replicated to the slave. Suggest to replicate any transactions that master has rolled back from slave to master, and/or commit empty transactions on master to account for transactions that have been', Error_code: MY-013114
```
Since the MTR result file is valid, we can suppress this error.

Reviewed By: yichenshen

Differential Revision: D39141846

-------------------------------------------------------------------------------

Fix heap overflow in group_relay_log_name handling

Summary:
We were accessing group_relay_log_name in
Query_log_event::do_apply_event_worker() but it's assigned only after
the coordinator thread encounters an end event (i.e. xid event or a
query event with "COMMIT" or "ROLLBACK" query). This was causing a race
between accessing group_relay_log_name in the worker thread and writing
it on the coordinator thread. We don't need to set transaction position
in events other than end event, so now we set transaction position in
query event only if it's an end event. The race is eliminated because
group_relay_log_name is set before enqueuing the event to the worker
thread (in both dep repl and vanilla mts).

Reviewed By: lth

Differential Revision: D28767430

-------------------------------------------------------------------------------

fix memory during MYSQL_BIN_LOG::open_existing_binlog

Summary:
asandebug complain there are memory leaks during MYSQL_BIN_LOG open

Direct leak of 50 byte(s) in 1 object(s) allocated from:
    #0 0x67460ef in malloc
    #1 0x93f0777 in my_raw_malloc(unsigned long, int)
    #2 0x93f064a in my_malloc(unsigned int, unsigned long, int)
    #3 0x93f0eb0 in my_strdup(unsigned int, char const*, int)
    #4 0x8af01a6 in MYSQL_BIN_LOG::open(unsigned int, char const*, char const*, unsigned int)
    #5 0x8af8064 in MYSQL_BIN_LOG::open_binlog(char const*, char const*, unsigned long, bool, bool, bool, Format_description_log_event*, unsigned int, RaftRotateInfo*, bool)
    #6 0x8b00c00 in MYSQL_BIN_LOG::new_file_impl(bool, Format_description_log_event*, RaftRotateInfo*)
    #7 0x8d65e47 in rotate_relay_log(Master_info*, bool, bool, bool, RaftRotateInfo*)
    #8 0x8d661c0 in rotate_relay_log_for_raft(RaftRotateInfo*)
    #9 0x8c7696a in process_raft_queue
    #10 0xa0fa1fd in pfs_spawn_thread(void*)
    #11 0x7f8c9a12b20b in start_thread

release these memory before assign them

Reviewed By: Pushapgl

Differential Revision: D28819752
inikep pushed a commit that referenced this pull request Aug 2, 2024
Summary:
When COUNT(*) style aggregation appears in queries, the optimizer realizes that a full scan needs to be done. The index doesn't matter - any index will do as it long it satisfies the other query predicates present.

There are two heuristics at play. If there is a choice of multiple indexes, the optimizer will pick the index that has the shortest length since that means less disk scan time, therefore faster query (heuristic #1).

In the case of vector indexes, the current design can have the vector field present in the clustered PK as well as the vector index (eg the flat index). Even though there's no clear advantage, the heuristic is based on INNODB style index layout, and assumes that a secondary index will always be a better choice as long as it does not cover all table columns (heuristic #2). In reality, in both INNODB and MyRocks, the secondary key does contain the PK. But this fact is ignored.

What ends up happening is that by only considered the # of explicitly defined key parts, the vector index ends up getting picked up.

In the case of the IVF index, the vector index may be shorter than the primary clustered index. Thus secondary index will have a shorter key length than the primary, and will end up getting picked up (i.e. if the key part heuristic doesn't already pick the vector index again).

There can be many other index combinations. We want the optimizer to pick another available index. If no other index is available, then pick the primary clustered index.

Differential Revision: D55003211

fbshipit-source-id: 5ce6b34
inikep pushed a commit that referenced this pull request Aug 6, 2024
Summary:
Almost a direct port of the feature in 5.6 except for minor visual changes and
some minor changes related to 8.0 thread handling

Also brings in the basic raft_listener_queue without the implemented functions

Differential Revision: D21670710

Use DBUG_TRACE instead of DBUG_ENTER

Summary:
several testcase failed in asan debug,

==106552==ERROR: AddressSanitizer: stack-buffer-underflow on address 0x7f38f3590078 at pc 0x0000093aa495 bp 0x7f38f358ffc0 sp 0x7f38f358ffb8
READ of size 4 at 0x7f38f3590078 thread T203
    #0 0x93aa494 in _db_enter_(char const*, int, char const*, unsigned int, _db_stack_frame_*)
    #1 0x8d0b9bb in Relay_log_info::remove_logged_gtids
    #2 0x8b1ec3f in trim_logged_gtid
    #3 0x8c767cf in process_raft_queue

If Use DBUG_ENTER, then you need to use DBUG_RETURN to pop current frame in CODE_STATE.
If use DBUG_TRACE, it will pop current frame during .dtor

ps. small refactor changes for sql/rpl_handler.cc

Reviewed By: bhatvinay

Differential Revision: D28809418
inikep pushed a commit that referenced this pull request Aug 6, 2024
…_cache_type / tx_readonly variables

Summary:
5.6 Java client we have today will issue following SQL query on connect:

```
/* MYSQL_CJ_FULL_PROD_NAME@ ( Revision: MYSQL_CJ_REVISION@ ) */SELECT  session.auto_increment_increment AS auto_increment_increment, character_set_client AS character_set_client, character_set_connection AS character_set_connection, character_set_results AS character_set_results, character_set_server AS character_set_server, init_connect AS init_connect, interactive_timeout AS interactive_timeout, license AS license, lower_case_table_names AS lower_case_table_names, max_allowed_packet AS max_allowed_packet, net_buffer_length AS net_buffer_length, net_write_timeout AS net_write_timeout, query_cache_size AS query_cache_size, query_cache_type AS query_cache_type, sql_mode AS sql_mode, system_time_zone AS system_time_zone, Timeroot_zone AS time_zone, tx_isolation AS tx_isolation, wait_timeout AS wait_timeout
```

It'll fail on 8.0 as tx_isolation is replaced by transaction_isolation, and query_cache_size/query_cache_type are deprecated (query cache feature is gone in 8.0). This adds a reasonable implementation of those variables to make the connector happy (without introducing other issues, hopefully), before we migrate our java client to 8.0:
1. Add dummy implementation of query_cache_size / query_cache_type that returns 0 size and cache off. Both are readonly. Note I changed query_cache_type to global as I don't want to pay for the extra bytes for a readonly variable and it doesn't make too much sense for a session variable to be read-only anyway. A new test is added as well.
2. Add read-only tx_isolation that directly maps to transaction_isolation variable. transaction_isolation_basic test is enhanced to ensure any change to transaction_isolation gets mapped to tx_isolation (which just works since they are the same variable on thd session)
3. Add read-only tx_read_only maps to transaction_read_only similar to #2 for cases Java cient calls connection.isReadOnly.

Reviewed By: zhichengzhu

Differential Revision: D19551619
inikep pushed a commit that referenced this pull request Aug 6, 2024
Summary:
port D21375044 (facebook@7fc9eaa) (facebook@7fc9eaa) and D17291883 (facebook@6a6a35f) (facebook@6a6a35f)
**Summary**
* As part of leadership changes, we will trim binlogs/relay logs on a
running mysql instance. This needs the ability to remove gtids from executed_gtid set.
This diff adds the ability to do the same. One can enqueue an event in the raft
listener queue and the raft thread in the server will take care of removing and
updating executed_gtid on a running instance.
* When raft logs are trimmed as part of TruncateOpsAfter(), the gtid of the trimmed trxs are also removed from executed/logged gtids. However, when binlog files are rotated, the previous gtid written to the new file depends on what files are rotated. During  binlog rotation the prev gtid are the logged/executed gtids of the instance. During relay log rotation prev gtids are the retrieved gtids of the instance. Hence, in addition to trimming logged gtids/executed gtids, we should also trim retrieved gtids. This prevents creation of holes in the instances executed_gtid set as the replicaset goes through multiple promotions over its lifetime.

Reviewed By: luqun

Differential Revision: D24690007

---------------------------------------------------------------------------------------

Use DBUG_TRACE instead of DBUG_ENTER

Summary:
several testcase failed in asan debug,

==106552==ERROR: AddressSanitizer: stack-buffer-underflow on address 0x7f38f3590078 at pc 0x0000093aa495 bp 0x7f38f358ffc0 sp 0x7f38f358ffb8
READ of size 4 at 0x7f38f3590078 thread T203
    #0 0x93aa494 in _db_enter_(char const*, int, char const*, unsigned int, _db_stack_frame_*)
    #1 0x8d0b9bb in Relay_log_info::remove_logged_gtids
    #2 0x8b1ec3f in trim_logged_gtid
    #3 0x8c767cf in process_raft_queue

If Use DBUG_ENTER, then you need to use DBUG_RETURN to pop current frame in CODE_STATE.
If use DBUG_TRACE, it will pop current frame during .dtor

ps. small refactor changes for sql/rpl_handler.cc

Reviewed By: bhatvinay

Differential Revision: D28809418
inikep pushed a commit that referenced this pull request Aug 6, 2024
Summary:
[Porting Notes]
We want to dump raft logs to vanilla async replicas regardless
of whether it's the relay log or binlog. Effectively after this change
we'll dump relay logs on the followers and binlogs on the leader. When
the raft role changes, the logs to the dumped are also changed.
Dump_log class is introduced as a thin wrapper/continer around
mysql_bin_log or rli->relay_log and is inited with mysql_bin_log
to emulate vanilla mysql behavior. Dump threads use the global
dump_log object instead of mysql_bin_log directly. We switch the log
in dump log only when raft role changes (in binlog_change_to_binlog()
and binlog_change_to_apply_log()).
During raft role change we take all log releated locks (LOCK_log,
LOCK_index, LOCK_binlog_end_pos, and dump log lock) to serialize it with
other log operations like dumping logs.

Related doc - https://fb.quip.com/oTVAAdgEi4zY

This diff contains below 7 patches:
D23013977
D24766787
D24716539
D24900223
D24955284
D25174166
D25775525

Reviewed By: luqun

Differential Revision: D26141496

-------------------------------------------------------------------------------

Passing raw_log pointer to wait_with_heartbeat() and wait_without_heartbeat()

Summary:
When enable_raft_plugin is OFF Dump_log::lock() is a no-op.
Which means that when enable_raft_plugin is OFF there can be a race
between log switching and dump threads. This could lead to a scenario
where the raw_log that wait_next_event() is working on might be
different than what wait_with_heartbeat()/wait_without_heartbeat() is
working on. This can cause deadlocks because
wait_with_heartbeat()/wait_without_heartbeat()'s mysql_cond_wait would
unlock and then lock a different log's LOCK_binlog_end_pos mutex which
would then never be unlocked by wait_next_event().

Reviewed By: anirbanr-fb

Differential Revision: D32152658

-----------------------------------------------------------------------------------------

Fix rpl_raft_dump_raft_logs

Summary:
This tests completes but fails because the following warning exists:
```
2022-08-30T16:28:00.159525Z 11 [ERROR] [MY-013114] [Repl] Slave I/O for channel '': Got fatal error 1236 from master when reading data from binary log: 'Slave has more GTIDs than the master has, using the master's SERVER_UUID. This may indicate that the end of the binary log was truncated or that the last binary log file was lost, e.g., after a power or disk failure when sync_binlog != 1. The master may or may not have rolled back transactions that were already replicated to the slave. Suggest to replicate any transactions that master has rolled back from slave to master, and/or commit empty transactions on master to account for transactions that have been', Error_code: MY-013114
```
Since the MTR result file is valid, we can suppress this error.

Reviewed By: yichenshen

Differential Revision: D39141846

-------------------------------------------------------------------------------

Fix heap overflow in group_relay_log_name handling

Summary:
We were accessing group_relay_log_name in
Query_log_event::do_apply_event_worker() but it's assigned only after
the coordinator thread encounters an end event (i.e. xid event or a
query event with "COMMIT" or "ROLLBACK" query). This was causing a race
between accessing group_relay_log_name in the worker thread and writing
it on the coordinator thread. We don't need to set transaction position
in events other than end event, so now we set transaction position in
query event only if it's an end event. The race is eliminated because
group_relay_log_name is set before enqueuing the event to the worker
thread (in both dep repl and vanilla mts).

Reviewed By: lth

Differential Revision: D28767430

-------------------------------------------------------------------------------

fix memory during MYSQL_BIN_LOG::open_existing_binlog

Summary:
asandebug complain there are memory leaks during MYSQL_BIN_LOG open

Direct leak of 50 byte(s) in 1 object(s) allocated from:
    #0 0x67460ef in malloc
    #1 0x93f0777 in my_raw_malloc(unsigned long, int)
    #2 0x93f064a in my_malloc(unsigned int, unsigned long, int)
    #3 0x93f0eb0 in my_strdup(unsigned int, char const*, int)
    #4 0x8af01a6 in MYSQL_BIN_LOG::open(unsigned int, char const*, char const*, unsigned int)
    #5 0x8af8064 in MYSQL_BIN_LOG::open_binlog(char const*, char const*, unsigned long, bool, bool, bool, Format_description_log_event*, unsigned int, RaftRotateInfo*, bool)
    #6 0x8b00c00 in MYSQL_BIN_LOG::new_file_impl(bool, Format_description_log_event*, RaftRotateInfo*)
    #7 0x8d65e47 in rotate_relay_log(Master_info*, bool, bool, bool, RaftRotateInfo*)
    #8 0x8d661c0 in rotate_relay_log_for_raft(RaftRotateInfo*)
    #9 0x8c7696a in process_raft_queue
    #10 0xa0fa1fd in pfs_spawn_thread(void*)
    #11 0x7f8c9a12b20b in start_thread

release these memory before assign them

Reviewed By: Pushapgl

Differential Revision: D28819752
inikep pushed a commit that referenced this pull request Aug 6, 2024
Summary:
When COUNT(*) style aggregation appears in queries, the optimizer realizes that a full scan needs to be done. The index doesn't matter - any index will do as it long it satisfies the other query predicates present.

There are two heuristics at play. If there is a choice of multiple indexes, the optimizer will pick the index that has the shortest length since that means less disk scan time, therefore faster query (heuristic #1).

In the case of vector indexes, the current design can have the vector field present in the clustered PK as well as the vector index (eg the flat index). Even though there's no clear advantage, the heuristic is based on INNODB style index layout, and assumes that a secondary index will always be a better choice as long as it does not cover all table columns (heuristic #2). In reality, in both INNODB and MyRocks, the secondary key does contain the PK. But this fact is ignored.

What ends up happening is that by only considered the # of explicitly defined key parts, the vector index ends up getting picked up.

In the case of the IVF index, the vector index may be shorter than the primary clustered index. Thus secondary index will have a shorter key length than the primary, and will end up getting picked up (i.e. if the key part heuristic doesn't already pick the vector index again).

There can be many other index combinations. We want the optimizer to pick another available index. If no other index is available, then pick the primary clustered index.

Differential Revision: D55003211

fbshipit-source-id: 5ce6b34
inikep pushed a commit that referenced this pull request Aug 12, 2024
Summary:
When COUNT(*) style aggregation appears in queries, the optimizer realizes that a full scan needs to be done. The index doesn't matter - any index will do as it long it satisfies the other query predicates present.

There are two heuristics at play. If there is a choice of multiple indexes, the optimizer will pick the index that has the shortest length since that means less disk scan time, therefore faster query (heuristic #1).

In the case of vector indexes, the current design can have the vector field present in the clustered PK as well as the vector index (eg the flat index). Even though there's no clear advantage, the heuristic is based on INNODB style index layout, and assumes that a secondary index will always be a better choice as long as it does not cover all table columns (heuristic #2). In reality, in both INNODB and MyRocks, the secondary key does contain the PK. But this fact is ignored.

What ends up happening is that by only considered the # of explicitly defined key parts, the vector index ends up getting picked up.

In the case of the IVF index, the vector index may be shorter than the primary clustered index. Thus secondary index will have a shorter key length than the primary, and will end up getting picked up (i.e. if the key part heuristic doesn't already pick the vector index again).

There can be many other index combinations. We want the optimizer to pick another available index. If no other index is available, then pick the primary clustered index.

Differential Revision: D55003211

fbshipit-source-id: 5ce6b34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants