Improve performance of default GP_Translation->for_translation() query #376

ocean90 · 2016-03-31T08:56:21Z

The query for https://translate.wordpress.org/projects/wp/dev/admin/de/formal takes between 8 and 10 seconds:

SELECT SQL_CALC_FOUND_ROWS t.*, o.*, t.id as id, o.id as original_id, t.status as translation_status, o.status as original_status, t.date_added as translation_added, o.date_added as original_added
            FROM translate_originals as o
            LEFT JOIN translate_translations AS t ON o.id = t.original_id AND t.translation_set_id = 3898 AND t.status != "rejected" AND t.status != "old" AND (t.status = 'current' OR t.status = 'waiting' OR t.status = 'fuzzy')
            WHERE o.project_id = 78 AND o.status LIKE '+%'  ORDER BY t.date_added DESC LIMIT 20 OFFSET 0;

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	o	range	project_id_status	project_id_status	262	NULL	1741	100.00	Using where; Using temporary; Using filesort
1	SIMPLE	t	ref	original_id,translation_set_id,translation_set_id_status	original_id	5	translate.o.id	55	100.00

The query for https://translate.wordpress.org/projects/wp/dev/admin/de/formal?filters%5Bstatus%5D=untranslated&sort%5Bby%5D=priority&sort%5Bhow%5D=desc takes between 8 and 15 seconds:

SELECT SQL_CALC_FOUND_ROWS t.*, o.*, t.id as id, o.id as original_id, t.status as translation_status, o.status as original_status, t.date_added as translation_added, o.date_added as original_added
            FROM translate_originals as o
            LEFT JOIN translate_translations AS t ON o.id = t.original_id AND t.translation_set_id = 3898 AND t.status != "rejected" AND t.status != "old"
            WHERE o.project_id = 78 AND o.status LIKE '+%' AND t.translation_0 IS NULL ORDER BY o.priority DESC, o.date_added DESC LIMIT 20 OFFSET 0;

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	o	range	project_id_status	project_id_status	262	NULL	1741	100.00	Using where; Using filesort
1	SIMPLE	t	ref	original_id,translation_set_id,translation_set_id_status	original_id	5	translate.o.id	55	100.00	Using where; Not exists

The query for https://translate.wordpress.org/projects/wp/dev/admin/de/formal?filters%5Btranslated%5D=yes&filters%5Bstatus%5D=current takes between 500ms and 2 seconds:

SELECT SQL_CALC_FOUND_ROWS t.*, o.*, t.id as id, o.id as original_id, t.status as translation_status, o.status as original_status, t.date_added as translation_added, o.date_added as original_added
            FROM translate_originals as o
            INNER JOIN translate_translations AS t ON o.id = t.original_id AND t.translation_set_id = 3898 AND (t.status = 'current')
            WHERE o.project_id = 78 AND o.status LIKE '+%'  ORDER BY t.date_added DESC LIMIT 20 OFFSET 0;

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	t	ref	original_id,translation_set_id,translation_set_id_status	translation_set_id_status	27	const,const	1214	100.00	Using where; Using filesort
1	SIMPLE	o	eq_ref	PRIMARY,project_id_status	PRIMARY	4	translate.t.original_id	1	100.00	Using where

The text was updated successfully, but these errors were encountered:

ocean90 · 2016-03-31T10:32:25Z

The main difference is LEFT JOIN vs INNER JOIN: https://github.com/GlotPress/GlotPress-WP/blob/1401774181b96d011b03329ec7e5c0e7f7edb1db/gp-includes/things/translation.php#L198.

toolstack · 2016-03-31T12:28:35Z

I don't think we can use an inner join in those cases as it will return different results if no translation row exists.

For example, when you create a new translation set, no rows are added to the translations table, so the inner join would return 0 rows (there are no rows common between the originals and translations tables), where as the left join would return all the rows from the originals table.

akirk · 2016-04-01T06:18:29Z

I think a compound index on the translations table for (original_id, translation_set_id, status) should speed up things. Right now only the original_id index is being used and the filtering of the other fields has to be done without an index.

akirk · 2016-04-01T14:44:02Z

I now took a closer look at the tables and there are a couple of things to optimize. Usually multiple single key indices indicate that somebody misunderstood something about MySQL keys (no offense to anyone, just an observation). It might make sense to remove some of those indices (for example date_added on translations, unless there is a query somewhere that displays all translations in the table sorted or filtered by date_added).

For example the keys in the translations table

  KEY `translation_set_id` (`translation_set_id`),
  KEY `translation_set_id_status` (`translation_set_id`,`status`),

are redundant because translation_set_id_status can also be used when just translation_set_id is being queried.

The best way to approach this is to optimize the keys based on the most common queries. I'll assume the queries you mention are the most important ones.

Query 1: optimize by adding an index (translation_set_id, date_added) to the translations table

Explanation: MySQL needs to fetch all rows from table translations with the given translation_set_id and sort them by date_added. Normally you'd also include the field status between translation_set_id and date_added but it cannot be used in this case because there are multiple possible values for it.

One interesting aspect to this query: MySQL opts to first query the translations table and join the originals table later, using the primary key (because it needs a 1:1 match anyway) and then filters out the wrong stati while fetching the rows.

Query 2: this is weird one: this is the definition translation_0 TEXT NOT NULL yet the query contains a t.translation_0 IS NULL. So this can be removed from the query. Also the o.status LIKE '+%' really hurts, especially as there is actually just one possible value: +active. So this should be changed to o.status = '+active'.

So to optimize now: add a key (original_id, translation_set_id, status) to translations and a key (project_id, status, priority, date_added) to originals

Query 3: this is already running quite fast, the key from Query 1 can be used here to improve things a little too. I would also recommend to change this to o.status = '+active'

ocean90 · 2016-04-01T16:06:20Z

Thanks for your comprehensive analysis, @akirk!

It might make sense to remove some of those indices (for example date_added on translations, unless there is a query somewhere that displays all translations in the table sorted or filtered by date_added).

I took a look at when those were introduced:

translation_set_id and translation_set_id_status where introduced in https://glotpress.trac.wordpress.org/changeset/311. Sadly without a reason.
date_added was added in https://glotpress.trac.wordpress.org/changeset/348. Can't find a reason here too.
(warnings was added in https://glotpress.trac.wordpress.org/changeset/351)

Query 2: this is weird one: this is the definition translation_0 TEXT NOT NULL yet the query contains a t.translation_0 IS NULL.

Isn't that because all originals are mapped with a translation and then the rows without a translation (all fields are NULL) are selected?

akirk · 2016-04-01T16:09:25Z

Isn't that because all originals are mapped with a translation and then the rows without a translation (all fields are NULL) are selected?

Ah, you are right. My bad, I didn't consider that the row could come in as NULL from the join.

ocean90 · 2016-04-06T22:04:16Z

Proposed indexes in SQL:

ALTER TABLE translate_translations ADD INDEX test1(translation_set_id, date_added);
ALTER TABLE translate_translations ADD INDEX test2(original_id, translation_set_id, status);
ALTER TABLE translate_originals ADD INDEX test3(project_id, status, priority, date_added);

ocean90 · 2016-04-06T22:11:35Z

Looking at the index for (original_id, translation_set_id, status) and noticed that the current schema allows an empty ID for the original and translation set. That doesn't make sense IMO. But would it enhance the index generation if we change it?

akirk · 2016-04-11T16:44:46Z

Here are my findings, running the queries on a big test database on a quite fast machine (SSD, Core i7 3GHz, 16GB RAM):

Index test1 turns out not useful for these queries.
o.status ='+active' instead of o.status LIKE '+%' already improves a lot
not having a production system I ran the queries multiple times -- the visible effects (14x, 5x improvement don't have to be the same on a production system)

Query 1

mysql> EXPLAIN SELECT SQL_CALC_FOUND_ROWS t., o., t.id as id, o.id as original_id, t.status as translation_status, o.status as original_status, t.date_added as translation_added, o.date_added as original_added FROM translate_originals as o LEFT JOIN translate_translations AS t ON o.id = t.original_id AND t.translation_set_id = 3898 AND t.status != "rejected" AND t.status != "old" AND (t.status = 'current' OR t.status = 'waiting' OR t.status = 'fuzzy') WHERE o.project_id = 78 AND o.status ='+active' ORDER BY t.date_added DESC LIMIT 20 OFFSET 0\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: o
partitions: NULL
type: ref
possible_keys: project_id_status,test3
key: project_id_status
key_len: 262
ref: const,const
rows: 1948
filtered: 100.00
Extra: Using temporary; Using filesort <- because of ORDER BY t.date_added (no big deal, small result size)
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: t
partitions: NULL
type: ref
possible_keys: original_id,translation_set_id,translation_set_id_status,test1,test2
key: test2 <- new key being used
key_len: 10
ref: translate.o.id,const
rows: 1
filtered: 100.00
Extra: Using where
2 rows in set, 1 warning (0.00 sec)

Real query: 20 rows in set (0.05/0.05/0.05/0.06/0.04 sec)
Old query: 20 rows in set (3.46/0.74/0.75/0.81/0.73 sec)

Query 2
mysql> EXPLAIN SELECT SQL_CALC_FOUND_ROWS t., o., t.id as id, o.id as original_id, t.status as translation_status, o.status as original_status, t.date_added as translation_added, o.date_added as original_added FROM translate_originals as o LEFT JOIN translate_translations AS t ON o.id = t.original_id AND t.translation_set_id = 3898 AND t.status != "rejected" AND t.status != "old" WHERE o.project_id = 14 AND o.status ='+active' AND t.translation_0 IS NULL ORDER BY o.priority DESC, o.date_added DESC LIMIT 20 OFFSET 0 \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: o
partitions: NULL
type: ref
possible_keys: project_id_status,test3
key: test3 <- new key being used
key_len: 262
ref: const,const
rows: 1548
filtered: 100.00
Extra: Using where
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: t
partitions: NULL
type: ref
possible_keys: original_id,translation_set_id,translation_set_id_status,test1,test2
key: test2 <- new key being used
key_len: 10
ref: translate.o.id,const
rows: 1
filtered: 10.00
Extra: Using where; Not exists
2 rows in set, 1 warning (0.00 sec)

Real query: 20 rows in set (0.06/0.07/0.07/0.07/0.06 sec)
Old query: 20 rows in set (0.95/0.26/0.27/0.26/0.25 sec)

Query 3
mysql> EXPLAIN SELECT SQL_CALC_FOUND_ROWS t., o., t.id as id, o.id as original_id, t.status as translation_status, o.status as original_status, t.date_added as translation_added, o.date_added as original_added FROM translate_originals as o INNER JOIN translate_translations AS t ON o.id = t.original_id AND t.translation_set_id = 3898 AND (t.status = 'current') WHERE o.project_id = 78 AND o.status = '+active' ORDER BY t.date_added DESC LIMIT 20 OFFSET 0\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: t
partitions: NULL
type: ref
possible_keys: original_id,translation_set_id,translation_set_id_status,test1,test2
key: translation_set_id_status
key_len: 27
ref: const,const
rows: 1285
filtered: 100.00
Extra: Using index condition; Using where; Using filesort <- because of ORDER BY t.date_added (no big deal, small result size)
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: o
partitions: NULL
type: eq_ref
possible_keys: PRIMARY,project_id_status,test3
key: PRIMARY <- best key to use
key_len: 4
ref: translate.t.original_id
rows: 1
filtered: 5.00
Extra: Using where
2 rows in set, 1 warning (0.00 sec)

Real query: 20 rows in set (0.04/0.04/0.04/0.03/0.04 sec)
Old query: 20 rows in set (0.05/0.05/0.04/0.05/0.05 sec)

There are only two status for originals: `+active` and `-original`. There is no need to perform a fuzzy search which affects the performance of the GP_Translation->for_translation()` query. See #376.

…d `project_id` fields. See #376.

@akirk

…r_translation()`. Props @akirk for the analysis, see #376 (comment). See #376.

@akirk

…r_translation()`. Props @akirk for the analysis, see #376 (comment). See #376.

…d `project_id` fields. See #376.

There are only two status for originals: `+active` and `-original`. There is no need to perform a fuzzy search which affects the performance of the GP_Translation->for_translation()` query. See GlotPress#376.

@akirk

…r_translation()`. Props @akirk for the analysis, see #376 (comment). See #376.

…d `project_id` fields. See #376.

@akirk

…r_translation()`. Props @akirk for the analysis, see GlotPress#376 (comment). See GlotPress#376.

@akirk

…r_translation()`. Props @akirk for the analysis, see GlotPress#376 (comment). See GlotPress#376.

…d `project_id` fields. See #376.

ocean90 added [Type] Enhancement A suggestion for improvement. [Priority] High labels Mar 31, 2016

ocean90 modified the milestone: 2.1 Apr 1, 2016

ocean90 added the [Type] Performance label Apr 1, 2016

ocean90 mentioned this issue Apr 20, 2016

376: Replace LIKE queries for the status of an original with an exact match. #419

Merged

ocean90 added a commit that referenced this issue Apr 20, 2016

Schema: Disallow NULL for the original_id, translation_set_id, an…

244d08d

…d `project_id` fields. See #376.

ocean90 mentioned this issue Apr 20, 2016

376: Schema: Disallow NULL for the original_id, translation_set_id, and project_id fields #420

Open

ocean90 added a commit that referenced this issue Apr 20, 2016

Schema: Add new indices to improve performance of `GP_Translation->fo…

a079431

…r_translation()`. Props @akirk for the analysis, see #376 (comment). See #376.

ocean90 mentioned this issue Apr 20, 2016

376: Schema: Add new indices to improve performance of GP_Translation->for_translation(). #421

Merged

ocean90 added a commit that referenced this issue Apr 25, 2016

Schema: Add new indices to improve performance of `GP_Translation->fo…

a8b38b9

…r_translation()`. Props @akirk for the analysis, see #376 (comment). See #376.

toolstack pushed a commit that referenced this issue Apr 28, 2016

Schema: Disallow NULL for the original_id, translation_set_id, an…

e38fd47

…d `project_id` fields. See #376.

ocean90 mentioned this issue May 13, 2016

Provide a better default sort order for GP_Translation::for_export() for consistent exports #441

Open

toolstack pushed a commit that referenced this issue May 17, 2016

Schema: Add new indices to improve performance of `GP_Translation->fo…

cbf2ba6

…r_translation()`. Props @akirk for the analysis, see #376 (comment). See #376.

ocean90 added a commit that referenced this issue May 22, 2016

Schema: Disallow NULL for the original_id, translation_set_id, an…

e887a6e

…d `project_id` fields. See #376.

ocean90 self-assigned this May 22, 2016

ocean90 added the [Status] In Progress Tracking issues with work in progress label May 31, 2016

yoavf pushed a commit to Automattic/GlotPress-WP that referenced this issue Jun 5, 2016

Schema: Add new indices to improve performance of `GP_Translation->fo…

9ac967a

…r_translation()`. Props @akirk for the analysis, see GlotPress#376 (comment). See GlotPress#376.

yoavf pushed a commit to Automattic/GlotPress-WP that referenced this issue Jun 5, 2016

Schema: Add new indices to improve performance of `GP_Translation->fo…

9a4ff8e

…r_translation()`. Props @akirk for the analysis, see GlotPress#376 (comment). See GlotPress#376.

ocean90 added a commit that referenced this issue Jun 7, 2016

Schema: Disallow NULL for the original_id, translation_set_id, an…

443fc99

…d `project_id` fields. See #376.

toolstack pushed a commit that referenced this issue Jun 20, 2016

Schema: Disallow NULL for the original_id, translation_set_id, an…

cbc6b9d

…d `project_id` fields. See #376.

ocean90 closed this as completed Jun 29, 2016

toolstack pushed a commit that referenced this issue Jul 12, 2016

Schema: Disallow NULL for the original_id, translation_set_id, an…

642b6a4

…d `project_id` fields. See #376.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve performance of default GP_Translation->for_translation() query #376

Improve performance of default GP_Translation->for_translation() query #376

ocean90 commented Mar 31, 2016

ocean90 commented Mar 31, 2016

toolstack commented Mar 31, 2016

akirk commented Apr 1, 2016

akirk commented Apr 1, 2016

ocean90 commented Apr 1, 2016

akirk commented Apr 1, 2016

ocean90 commented Apr 6, 2016

ocean90 commented Apr 6, 2016

akirk commented Apr 11, 2016

Improve performance of default GP_Translation->for_translation() query #376

Improve performance of default GP_Translation->for_translation() query #376

Comments

ocean90 commented Mar 31, 2016

ocean90 commented Mar 31, 2016

toolstack commented Mar 31, 2016

akirk commented Apr 1, 2016

akirk commented Apr 1, 2016

ocean90 commented Apr 1, 2016

akirk commented Apr 1, 2016

ocean90 commented Apr 6, 2016

ocean90 commented Apr 6, 2016

akirk commented Apr 11, 2016