Both Postgres and MySQL are relational databases. Both are transactional. Both are designed for OLTP loads. So they must be similar inside, right? Not at all! In fact, they are so different that it’s hard to believe they solve similar problems.
In this article we’ll go over how they store and look up rows. We’ll first discuss the general approaches used in this kind of database, and then we’ll dive into which approaches these databases chose and what problems they now face.
NB: MySQL storage is actually pluggable - it’s possible to choose a different one with different properties. We’ll be discussing the most popular OLTP storage engine - InnoDB.
Terminology:
tx - transaction
Record, tuple and row mean the same thing and are used interchangeably
How tables and indexes are structured inside
Conventional databases store tables in 2 possible forms: either a Heap-organized table or an Index-organized table. Each has its own pros and cons.
Heap-organized table
A Heap is just a file split into logical pages (let's say 8kB per page), and records are added to those pages as we INSERT into the table. Since multiple records can reside on the same page, the full address of any record becomes (page_no, record_no), where record_no is the record number within the page.
There's no particular sorting to those records. While originally we fill the table sequentially and the order coincides with the order of insertion, at some point the old (deleted) records can be removed. This leaves the pages half-full, which makes them ready to accept new records.
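To make the layout concrete, here's a minimal Python sketch of a heap-organized table. It's illustrative only - the class and field names are made up, and real pages are fixed-size byte arrays, not lists - but it shows how records get a (page_no, record_no) address and how deleted slots leave holes that later inserts can reuse.

```python
PAGE_CAPACITY = 4  # real pages are ~8 kB of bytes; 4 slots keeps the example small

class HeapTable:
    def __init__(self):
        self.pages = [[]]  # each page is a list of slots; None marks a free slot

    def insert(self, record):
        # reuse a hole left by a deleted record, if any page has one
        for page_no, page in enumerate(self.pages):
            for record_no, slot in enumerate(page):
                if slot is None:
                    page[record_no] = record
                    return (page_no, record_no)
            if len(page) < PAGE_CAPACITY:
                page.append(record)
                return (page_no, len(page) - 1)
        self.pages.append([record])            # all pages full: allocate a new one
        return (len(self.pages) - 1, 0)

    def delete(self, addr):
        page_no, record_no = addr
        self.pages[page_no][record_no] = None  # the page is now half-full again

    def fetch(self, addr):
        page_no, record_no = addr
        return self.pages[page_no][record_no]

heap = HeapTable()
tid = heap.insert({"id": 1, "name": "A"})      # -> (0, 0): (page_no, record_no)
```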
Finding some specific row in a Heap would be very slow - we’d have to inspect every record to see whether it matches our criteria. So real-world databases need indexes too. These are separate data structures, residing in separate (logical or physical) files. Indexes are sorted by some key (by a column in a simple case), and finding something in a sorted structure is fast. Very fast. So long as we're searching by that key.
So for a query like select * from users where name = 'A', if we have an index that uses name as its key, finding that record is lightning fast. To borrow analogies from different programming languages:
An unsorted array or list of objects would represent a Heap-organized table
And a key-value structure (different languages use different nomenclature: Map, Dict/Dictionary, Associative Array) - these are indexes.
Of course, we often need to search by different keys - e.g. our app may need to search users by name in one use case and by their ID in a different use case. So we have to have 2 separate indexes (2 different Maps/Dicts). Each of them is going to be sorted by its own key (column).
Maps/Dicts have a key, but they also have a value. They need to reference the actual tuples that reside in the table. If you remember, every record in a Heap has its address in the form (page_no, record_no). So indexes can reference the tuples using this address. And when searching a user by name, the database will first look it up in the respective index, find its physical location - and only then it'll be able to return the full information about the user.
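As a rough sketch of that lookup path (toy Python dictionaries standing in for the real on-disk structures; the data is made up), each index maps its own key to heap addresses, and the heap fetch happens last:

```python
# The heap is addressed by (page_no, record_no); each index maps its own key
# to such addresses. Illustrative structures only, not the real page format.
heap = {
    (0, 0): {"id": 1, "name": "A"},
    (0, 1): {"id": 2, "name": "B"},
}
name_idx = {"A": [(0, 0)], "B": [(0, 1)]}   # index sorted by name
id_idx   = {1: [(0, 0)], 2: [(0, 1)]}       # another index, sorted by id

def find_by_name(name):
    # step 1: look up the physical locations in the index
    # step 2: fetch the full rows from the heap
    return [heap[addr] for addr in name_idx.get(name, [])]

print(find_by_name("A"))   # [{'id': 1, 'name': 'A'}]
```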
Postgres uses Heap-organized tables. Some other databases (Oracle, MSSQL) are able to use Heaps too, but it’s not the default.
Index-organized table aka Clustered Index
The alternative is to keep the records sorted in the first place. Basically, our table file becomes an index, but that index also contains the actual values. We can have only one such index per table. These indexes are special. We call them Clustered Indexes, but probably a better name would be Clustering Indexes - because they sort the data and basically keep the rows with similar keys together (the rows are clustered together).
Which of the columns should become the key in such an index? Usually it defaults to the Primary Key of our table. Of course, the column marked as the Primary Key is special - it's basically a unique index, and the values can never be null.
What if we create other indexes? Those are called Secondary Indexes. And instead of referencing a physical location, they keep Primary Keys of the records. So if we're looking up a user by name, the database would have to first go to the Secondary Index, find the Primary Keys of the matching records, and then it has to make another lookup. This time - in the Clustered Index.
BTW, in the case when we were using Heaps there was no distinction between types of indexes. All of them were Secondary Indexes. Such storage doesn't give any special meaning to the Primary Key - it's just yet another Secondary Index, although it's unique and non-null without saying so explicitly.
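Here's a minimal sketch of that double lookup (again, toy dictionaries standing in for B-trees; the rows are made up):

```python
clustered = {                                  # primary key -> full row, kept sorted by PK
    1: {"id": 1, "name": "A", "age": 30},
    2: {"id": 2, "name": "B", "age": 25},
}
name_idx = {"A": [1], "B": [2]}                # secondary index: name -> primary keys

def find_by_name(name):
    pks = name_idx.get(name, [])               # lookup #1: secondary index
    return [clustered[pk] for pk in pks]       # lookup #2: clustered index

print(find_by_name("A"))   # [{'id': 1, 'name': 'A', 'age': 30}]
```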
MySQL uses Clustered Indexes. Other databases (Oracle, MSSQL) operate in this mode by default, but it’s not the only choice for them.
Multiversioning (MVCC) implementations
Databases need to handle multiple connections and transactions simultaneously. So what happens if one tx modifies a row, while another one reads it? There are 2 choices:
Writing tx (Writer) locks the record, and reading tx (Reader) must wait. Once the Writer ends, the Reader can proceed reading the new value. While this approach is supported by all the conventional SQL databases with select ... for share (we'll talk about it in Part II), it's not the default because it causes a lot of locks and thus reduces parallelism.
Instead of just overwriting the record, the Writer can create a new version of it. So Readers can read the old versions until the Writer commits. Since this allows multiple versions of the record to coexist, this is called Multiversioning or Multi-version concurrency control (MVCC). And this is the default in most databases.
The fact that most SQL databases implement MVCC doesn't mean they do it in the same way or even similarly. On the contrary! There are 2 general approaches:
Copy-on-write (COW) - during UPDATE create a full copy of the row with the new values. The Readers see the old version. This is the Postgres way.
Undo storage - we write the new value right into the table, but we copy the old values into a special storage outside the table. The parallel Readers can reconstruct the version that they want by using the current value in the table plus the data from the Undo storage. Most (if not all) other databases use this approach.
But how do Readers know which version of the tuple to use? Simple: each version must store which tx created it. While the algorithm and the exact fields may differ, the general idea is this:
Assign each tx a monotonically increasing ID (let’s call it TX_ID). It could be an integer that increments with each new tx.
When a Writer creates a tuple, it writes its TX_ID into the header of that tuple. If the Writer updates an existing tuple, then the old version either stays in the heap (COW) marked as deleted_by=TX_ID or the old attributes are written into the Undo storage with their previous TX_ID.
When a tx is started, the database captures which other transactions are currently in progress, aborted, or committed. This info is stored within every tx, and it's different for every tx. This state of transactions is called a Snapshot (PG) or a Read View (MySQL).
When it runs into a tuple, the Reader knows that it must not see versions created by in-progress or aborted transactions - so it chooses the version that it can actually see. With COW we have to iterate over all the old/new versions and find the one we're interested in. With the Undo storage we first find the latest version in the table, and then go through the Undo to find which attributes changed, and reconstruct the version we need.
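Here's a simplified visibility check along those lines. It's a sketch only: the field names (created_by, deleted_by, next_tx_id) are illustrative, and the reader's own changes are ignored for brevity.

```python
def is_visible(version, snapshot, aborted_txs):
    """Can a tx with the given snapshot see this tuple version? (simplified)"""
    creator = version["created_by"]
    if creator in aborted_txs:
        return False                      # created by an aborted tx
    if creator in snapshot["in_progress"] or creator >= snapshot["next_tx_id"]:
        return False                      # creator hadn't committed when we started
    deleter = version.get("deleted_by")
    if deleter is not None and is_visible({"created_by": deleter}, snapshot, aborted_txs):
        return False                      # deleted by a tx we consider committed
    return True

snapshot = {"in_progress": {101, 103}, "next_tx_id": 105}
print(is_visible({"created_by": 100}, snapshot, aborted_txs=set()))  # True
print(is_visible({"created_by": 103}, snapshot, aborted_txs=set()))  # False: still in progress
```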
This allows Writers not to block Readers, which is good! But there are a lot of problems and complexity hidden in this generic approach - and this is where the implementations differ (keep reading).
Postgres vs MySQL
Finally! The core of our discussion! We have 2 popular databases with completely different approaches:
Postgres uses Heaps and Copy-on-write MVCC
MySQL uses Clustered Indexes and the MVCC with the Undo storage
Let's see how these approaches compare and what complexity they bring in.
Deleting old versions
So we have all these old/aborted versions of tuples. We probably want to delete them at some point, right? For that we need to track transactions that end, and after that we can discard old versions of tuples. After all, those old versions aren't reachable by any tx anymore.
With the Undo storage it's quite efficient - all the obsolete tuples are grouped in a single place. We have to figure out which versions are in fact unreachable (some transactions are still in progress and may need them), but at least all the candidates are grouped together.
But with the Copy-on-write, each version is a first-class citizen of the table. So we have to traverse the whole table to find the obsolete records! To solve this problem PG has a special garbage collecting mechanism - VACUUM. In the systems with heavy load this process can cause problems: it can lag behind the UPDATE/DELETE operations, and it can take considerable CPU power.
Both MySQL and PG start a separate thread/process for this clean up procedure. But for PG it’s certainly less efficient.
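A rough contrast of the two clean-up jobs, as a sketch only - the real VACUUM and purge algorithms are far more involved, and aborted transactions are ignored here. oldest_active_tx stands for the oldest tx any live snapshot could still need.

```python
def vacuum_heap(heap_pages, oldest_active_tx):
    # COW: dead versions sit next to live ones, so every table page has to be scanned
    for page in heap_pages:
        page[:] = [v for v in page
                   if v.get("deleted_by") is None or v["deleted_by"] >= oldest_active_tx]

def purge_undo(undo_log, oldest_active_tx):
    # Undo storage: obsolete versions are already grouped in one place
    return [u for u in undo_log if u["tx_id"] >= oldest_active_tx]

pages = [[{"id": 1, "deleted_by": 90}, {"id": 2}]]
vacuum_heap(pages, oldest_active_tx=100)
print(pages)   # [[{'id': 2}]]
```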
Reading old versions
Suppose there's some old Reader that needs an old version of the tuple, what complexity goes into actually reading it?
In the case of Postgres and its COW it's trivial - old versions don't differ in any way from the latest version of the tuple. They are all in the Heap, and they are all equal. You can read about the exact details in a separate article.
In MySQL and its Undo storage it's more complicated:
We find the record in the table, we see that this (latest) version shouldn't be visible to us. This version (v5) points to the location of the previous version (v4) in the Undo storage.
So we jump there, we see which attributes changed between v4 and v5 (let's say the user's name changed, but the age didn't), and we merge these versions to get the state at v4. But v4 may also not be old enough for us. So this record points to an even older v3 in the Undo storage.
Another jump, and we finally found the right version (v3). Now we can merge v4 with v3 (basically removing changes brought by v4). Phew!
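In code, that reconstruction walk might look roughly like this. It's a sketch with made-up field names, not InnoDB's actual undo record format; visible stands in for the read-view check described above.

```python
def reconstruct(latest_row, undo_chain, visible):
    """Walk back through the undo chain until we reach a version we may see."""
    row = dict(latest_row)
    if visible(row["tx_id"]):
        return row
    for undo_rec in undo_chain:               # deltas ordered newest-to-oldest
        row.update(undo_rec["old_values"])    # strip the changes of that version
        row["tx_id"] = undo_rec["prev_tx_id"]
        if visible(row["tx_id"]):
            return row
    return None                               # nothing is visible to this reader

latest = {"id": 1, "name": "Carol", "age": 30, "tx_id": 105}      # v5, in the table
undo = [
    {"old_values": {"name": "Bob"},   "prev_tx_id": 104},         # v5 -> v4
    {"old_values": {"name": "Alice"}, "prev_tx_id": 101},         # v4 -> v3
]
print(reconstruct(latest, undo, visible=lambda tx: tx <= 101))
# {'id': 1, 'name': 'Alice', 'age': 30, 'tx_id': 101}
```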
As you can see, in MySQL this process is more complicated and less efficient.
Aborting transactions and resurrecting old versions
Suppose we issued a lot of UPDATE/DELETE statements, but in the end the tx aborted. How do we return the old versions back?
In Postgres it's trivial - you actually don't do anything with the records. Just mark the tx aborted. Other transactions will see the status, and they will know to skip those versions of the tuples. VACUUM then will have to clean up the aborted versions. So in Postgres tx abort is basically a free operation.
But with the Undo storage of MySQL this operation is much more complicated, as it needs to go over the Undo storage and write the old versions back into the table. If we update a million rows, we have to restore a million rows! This means that the abort command may take considerable time in MySQL.
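A tiny sketch of why the costs differ so much (illustrative only - real rollback also has to undo index changes, handle deletes, and so on):

```python
def abort_cow(tx_status, tx_id):
    tx_status[tx_id] = "aborted"            # O(1): readers will just skip those versions

def abort_undo(table, undo_records_of_tx):
    for rec in undo_records_of_tx:          # O(rows touched): write every old value back
        table[rec["pk"]].update(rec["old_values"])

rows = {1: {"name": "Bob"}}
abort_undo(rows, [{"pk": 1, "old_values": {"name": "Alice"}}])
print(rows)   # {1: {'name': 'Alice'}}
```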
Updating indexes
Okay, we change an attribute in the table. What happens to the indexes if the attribute is not indexed? What happens if it is indexed? This one is tough on both databases.
Updating indexes in Postgres
In Postgres, as you remember, indexes reference physical location. Updating a row will create a new record with a new physical location. So if we have 5 indexes for this table, we need to create index entries for the new versions in addition to keeping the entries of the old versions. Even if the indexed keys didn't change!
Okay, okay, at some point PG introduced an optimization called Heap-only Tuples (HOT). If the new version stays within the same page as the old version, then we can keep referencing the old version (and thus make no changes in the indexes). When we find the old version, it can point to the new version within the same page - and this allows us to quickly find the right version. For this to work the pages need to have spare room for extra versions - and PG has a fillfactor that defines how much we fill the pages initially. Again, this works only if we’re lucky and the new version can fit on the same page; if not, we still need to update all the indexes.
If an attribute that participates in an index changes, then we unconditionally need to update that index - add a new key that references the new version. But this part has to happen in any possible implementation - otherwise some transactions won’t be able to find the records.
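Roughly, the decision on every UPDATE boils down to something like this - a sketch of the idea, not the actual Postgres heap code, and the function and parameter names are made up:

```python
def plan_update(changed_columns, indexed_columns, fits_on_same_page):
    key_changed = bool(changed_columns & indexed_columns)
    if not key_changed and fits_on_same_page:
        # HOT: chain the new version from the old one inside the page,
        # leave every index untouched
        return "heap-only tuple update"
    # otherwise every index gets an entry for the new version,
    # even the indexes whose keys did not change
    return "regular update: new entry in each index"

print(plan_update({"age"}, {"id", "name"}, fits_on_same_page=True))    # HOT
print(plan_update({"name"}, {"id", "name"}, fits_on_same_page=True))   # regular
```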
Okay, now what about searching? There's a Reader that found all these entries by some key - how do we know which version to go for? Or even if it's just one version - how do we know this tuple is accessible to the current tx? Unfortunately, we don't. We have to actually go through the tuples because the information about which tx created/deleted them is stored only in the Heap.
This is very frustrating, because if our index contains all the attributes needed to answer the query without jumping into the Heap (we say the index covers the query), we still need to access the Heap because of the MVCC! It's a huge performance hit! And to alleviate the pain (partially), Postgres has another optimization. If all the tuples on a page were created by old transactions and are visible to every in-progress tx, then we can mark the whole page as "visible for all". PG keeps this information in a special bitmap called Visibility Map. So after finding an entry in the index, we can quickly learn if it's on a visible-for-all page, and only if not - we jump to the Heap and check the tuple itself.
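Conceptually, a covering-index lookup then works like this. It's a sketch: visible_pages stands in for the Visibility Map, and heap_fetch_and_check stands in for the per-tuple MVCC check.

```python
def index_only_scan(index_entries, visible_pages, heap_fetch_and_check):
    for key, (page_no, record_no) in index_entries:
        if page_no in visible_pages:
            yield key                                       # answered from the index alone
        else:
            row = heap_fetch_and_check(page_no, record_no)  # MVCC check in the heap
            if row is not None:
                yield key

entries = [("alice", (0, 1)), ("bob", (7, 3))]
visible_pages = {0}                                         # page 0 is marked all-visible
fetch = lambda p, r: {"name": "bob"}                        # pretend the tuple checks out
print(list(index_only_scan(entries, visible_pages, fetch))) # ['alice', 'bob']
```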
And the last thing - VACUUMing is actually more complicated than we thought:
Before deleting obsolete tuples we need to first delete obsolete index entries.
And it's VACUUM who decides which pages go into the Visibility Map.
Updating indexes in MySQL
MySQL faces similar problems. But instead of keeping a separate Visibility Map (which gets updated not as frequently as one might hope), MySQL keeps the transaction-that-last-updated right in the page. So if there are 10 records on an index page (yes, indexes also go in pages), and we just updated one of them - we have to update the page with the current TX_ID. This TX_ID doesn’t represent each individual entry; its granularity is the whole page.
When searching in the index we now can check if that TX_ID was committed. If yes, and if we know that all transactions before this one are guaranteed to be committed (both MySQL and PG track this), then we know for sure that this entry is accessible (in fact, all the entries on the page are accessible). If not, then we’re forced to jump to the Clustered Index and check the version of that particular record. And if we are looking for a different version, then jump through the Undo and start the whole reconstruction business.
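In pseudo-Python the check looks roughly like this. PAGE_MAX_TRX_ID is real InnoDB page metadata; everything else here - the "low water mark" naming and the fallback callable - is illustrative.

```python
def secondary_entry_visible(page_max_trx_id, read_view_low_water_mark,
                            check_in_clustered_index):
    if page_max_trx_id < read_view_low_water_mark:
        # everything on this index page was committed before our read view started,
        # so the entry is visible without touching the clustered index
        return True
    # otherwise fall back to the clustered index (and possibly the undo chain)
    return check_in_clustered_index()

print(secondary_entry_visible(90, 100, check_in_clustered_index=lambda: False))  # True
```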
We won't go into the details here in Part I, but if 2 transactions are updating the same row, they have to wait on one another. And to do that the database needs to create Locks. In PG it’s easy - we just lock the heap tuple itself; it's our single source of truth. But MySQL locks index records (both Clustered and Secondary). So when one tx issues update ... where name = 'a', and another tx runs update ... where age=18, MySQL somehow needs to know that both transactions arrived at the same record. For that reason MySQL either needs to put locks in all the indexes during the UPDATE, or have a way to dynamically check the locks in unrelated indexes during the SELECT. In fact MySQL does both, depending on which approach it deems more optimal in any particular situation.
And the last problem (that I know of) is that indexes are represented by tree structures (B-trees in most cases). After the tree is modified it may require rebalancing to keep good performance. But when creating Lock objects, MySQL references the physical location of the index entry (space_id, page_no, rec_no), not the value of the key (the key could be changed by a query, so we can’t lock on the key). So imagine a situation:
tx1 is writing to idx_entry1 located in some portion of the tree
tx2 modifies a completely unrelated record, but unfortunately this causes the tree to be rebalanced, including the portion where idx_entry1 is located.
During rebalancing the addresses will change, but we hold a lock on that entry. So one tx has to wait on another tx not because they update the same records, but because the index has to be rebalanced.
Summary
As you can see Postgres and MySQL InnoDB are very different, even though they solve similar problems. Clearly, they chose different paths in the beginning, which made their lives complicated in different ways.
If you’re interested in debugging these databases yourself, here are some instructions. But I must warn you - MySQL code (C++) is very convoluted and poorly documented. Postgres source code (C), on the other hand, is well structured, relatively simple and half of it (if not more) is documentation.
The reason MySQL is more complicated is partially because the Undo Storage and Clustered Indexes are more complicated architectural solutions with more problems. Another part - they seem to spend a lot of time optimizing every possible use case, while the PG team keeps it simple. But ultimately I think it’s the dev culture of the teams that had the biggest impact - after all no matter how complicated the solution is, it’s no excuse for leaving MySQL source code poorly documented.
Without running actual benchmarks it’s not obvious which solution is faster. Most likely, it will depend on the use case. From the architecture itself we can tell that probably:
Postgres is more optimal for append-only workloads, where UPDATEs don’t happen or happen rarely. We should be especially careful about updating old data, as this will make the Visibility Map useless.
If, however, UPDATEs happen often and are spread all over the tables, my hunch is that MySQL might be faster. However, we haven't yet considered the locking mechanisms. MySQL is very generous with locks, which reduces the possibility for concurrency.
In Part II we’ll discuss the differences in the locking mechanisms and transaction isolation. We’ll see some really weird shit there. Stay tuned!