[OEP 10] Disk-based transactions with record/index key granularity. #10
Hi Andrey. Here's my question (as we discussed privately): Will it be possible for a user to get RIDs from a ridbag and then see different RIDs from the same ridbag a few seconds later if a deadlock is detected in the dependency graph and then rolled back?
Hi Collin,
@SDIPro updated description
Hey @Laa, thanks for the detailed description. My questions:
The current implementation already works in this way: a TX doesn't acquire an exclusive lock on the storage, but it acquires locks on indexes, clusters, etc. So it's the same: as soon as one of the locked components is unlocked, the transaction could start before the other one has completely unlocked all the components. Or in this case, do you mean that operations on components will be ordered, so that as soon as you finish changing a component it is unlocked?
True in theory, but the new impl. would write more data on disk during the commit operation, right? So transactions could be slower than now.
@lvca regarding your next question:
That claim is incorrect both for the case when this query is executed inside of a transaction and for the case when it is not. First, regarding your concrete example query. About the second case, when the query is executed inside of a transaction: what does it mean? It means that in the current implementation millions of pages will be affected and all of them will be accumulated in RAM, which will cause OOM at a much earlier stage than in the new proposal, where only locks are stored in RAM. But even this will not happen in the new proposal. Once again, from the text: once we reach the limit on the number of locks acquired by the transaction, the transaction is rolled back with a clear exception instead of causing OOM. So, to sum up everything written above: execution of queries like update V set a = a + 20 will be more stable in the new implementation than in the current one and will actually not run into OOM, as it does now in the transactional case. Do you agree with this argumentation?
(Sorry, I've updated your first comment instead of writing a new one).
The current implementation already works in this way: a TX doesn't acquire an exclusive lock on the storage, but it acquires locks on indexes, clusters, etc. So it's the same: as soon as one of the locked components is unlocked, the transaction could start before the other one has completely unlocked all the components. Or in this case, do you mean that operations on components will be ordered, so that as soon as you finish changing a component it is unlocked?
True in theory, but the new impl. would write more data on disk during the commit operation, right? So transactions could be slower than now.
That is not true. If the same component is locked by both transactions, they are serialized. So now do you agree that: ?
@lvca Again, I do not agree with that, because:
Taking into account that a normal transaction will be faster because of items 1 and 2, and that the number of rollbacks is much smaller than the number of committed transactions, we can conclude that the new transactions will be faster, especially if we take the scalability effect into account. Do you agree with that? P.S. In reality, even rollbacks in the case of normal execution will not be slower; I will write a separate comment about that.
What about my first question:
Ok, let's talk about that in private first.
@lvca sorry, my last comment was incorrect. We cannot use deferred updates with logical rollbacks. I am a bit tired of the conversation and have different variants in mind. But I still think that we should not keep the deferred updates approach for the sake of rollback speed. For me it is better to have slow rollbacks but no OOM. So I do not think that the disk-based nature of the transaction is a disadvantage; from my point of view it is an advantage. If you do not mind, I will remove the comment above because it is not correct. Also, about your questions 1, 2 and 8: the point is still not entirely clear to me. As we discussed above, rollbacks will be slower, but normally very few rollbacks are performed in a system. Also, normal execution of transactions will be faster and transactions will be much more scalable. If as a side effect we get disk-based transactions and avoid OOM, what is the harm in that?
From a user perspective, rollbacks shouldn't really ever happen. So if they are slow, it is basically irrelevant. I do have one question though. How can a user prevent the OOM-saving rollback from actually happening? I understand the mass update example from Luca happening within a transaction could cause problems. What would be the solution to getting it done, without a failure? Is there some sort of batching of a transaction's actions possible? Scott
@smolinari the query in the example provided by Luca causes OOM in the current implementation of transactions. The same query in the new transaction model will not cause OOM because of the limit on the number of locks which can be acquired in a transaction. Once we reach this limit, we issue a transaction rollback via a simple exception and ask the user to limit the scope of the transaction. So instead of OOM, we will have a clean rollback with a reason why it happened.
Thanks. I understood what you explained. I guess what I am looking for is how to limit the scope of a transaction. Sorry for my incompetence on this, but how would it be done? For instance, how would a user get Luca's update example to work? Scott
Sorry Scott,
You said above that the rollback warning would "ask a user to limit a scope of a transaction". How would the user limit the scope of a transaction? Scott
@lvca regarding the rest of your questions.
Yes, that way is simpler, but unfortunately it is not scalable and does not solve deadlocks related to ridbags. Also, because ridbags may be updated independently from clusters (that is a real issue which we fixed several weeks ago), locking of clusters has no effect on ridbags.
@lvca last questions.
Agreed, there is not enough information.
@smolinari Sorry for the late reply, or more precisely, the lack of a reply. This is off topic for this issue; if you like, you may ask on Stack Overflow and I will answer you there.
@Laa - although I believe my question would be on-topic depending on the answer, I will oblige. http://stackoverflow.com/questions/39634845/how-to-lower-the-scope-of-a-transaction Scott
Reference:
https://github.com/orientechnologies/orientdb-labs/blob/master/OEP_10.md
Summary:
Disk-based transactions with record/index key granularity.
Goals:
Create a transaction processing engine which supports the following features:
Non-Goals:
Improve speed of component operations in single-thread mode.
Success metrics:
Improved scalability of massive write operations on performance benchmarks both in transactional and nontransactional modes.
Motivation:
Also, our investigation of the results of performance benchmarks shows that component-level locks are not enough to achieve good scalability of writes.
In the current implementation, page changes are accumulated in memory and are not applied directly to the pages (so-called deferred updates), and if we need to read changed pages, we apply those changes back to the pages. Such an approach increases system fragility: any error inside the WAL or an atomic operation will lead to data corruption.
Description:
High level design
All changes performed on pages are logged inside the WAL. Each log record consists of two parts: the first part contains the information needed to restore the page content after a crash (the redo part), and the second part contains the information needed to roll back the page changes (the undo part). The redo part is stored in the form of a binary diff between the original and the changed binary presentation of a page.
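For illustration, such a record could be modeled roughly as below. This is a minimal sketch only; the class and field names are hypothetical and simply mirror the description above, they are not actual OrientDB types.

```java
/**
 * Sketch of a WAL record as described above: a redo part (binary diff between
 * the original and the changed page image) and an undo part (data needed to
 * revert the page change). Names are illustrative only.
 */
public class PageChangeLogRecord {
    private final long lsn;          // position of this record inside the WAL
    private final long fileId;       // file of the changed page
    private final long pageIndex;    // index of the changed page
    private final byte[] redoDiff;   // binary diff: original page -> changed page
    private final byte[] undoData;   // information needed to roll the change back

    public PageChangeLogRecord(long lsn, long fileId, long pageIndex,
                               byte[] redoDiff, byte[] undoData) {
        this.lsn = lsn;
        this.fileId = fileId;
        this.pageIndex = pageIndex;
        this.redoDiff = redoDiff;
        this.undoData = undoData;
    }

    public long getLsn()        { return lsn; }
    public byte[] getRedoDiff() { return redoDiff; }
    public byte[] getUndoData() { return undoData; }
}
```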
The proposed transaction protocol allows using fine-granularity locks during transaction processing thanks to an interesting property of all our data structures: when we store/delete an entry inside any data structure, there is a page which satisfies the following requirement. If all changes are made on the data structure but this single page (the domain page) is not updated, then the data structure is still in a valid state, but the data is treated as absent from the given data structure.
Changes to each data structure may therefore be split into two parts: "structure modification changes" and "logical changes."
Let's consider, for example, a tree-based index. When we insert a (key, rid) pair into a tree, we make changes such as splits of parent nodes and a split of the leaf page itself. All those changes are "structure modification changes."
As a final step, we add the (key, rid) pair to the leaf page (which plays the role of the domain page).
Until this entry is added to the leaf page, the (key, rid) pair is still absent from the database, but the tree structure is completely valid.
Once we put the (key, rid) pair into the leaf page (i.e. execute the "logical" insertion of data into the tree), it is treated as stored inside the database.
In every index which we use, the leaf page plays the role of the domain page. For clusters, such domain pages are the pages of the cluster transaction map.
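A sketch of how such an insertion could be ordered is shown below; the types and methods are hypothetical helpers, and the only point is the ordering: structure modification changes first, the single "logical" change on the domain (leaf) page last.

```java
/**
 * Illustrative ordering of a (key, rid) insertion into a tree-based index,
 * as described above. All types and methods here are hypothetical helpers.
 */
public class TreeInsertSketch {
    interface Tree {
        // "Structure modification changes": split parent nodes / the leaf page
        // as needed and return the leaf page that will hold the entry.
        LeafPage prepareLeafFor(byte[] key);
    }

    interface LeafPage {
        // "Logical change" on the domain page: only after this call
        // does the (key, rid) pair become visible in the index.
        void put(byte[] key, long rid);
    }

    void insert(Tree tree, byte[] key, long rid) {
        // 1. Structure modification changes: the tree stays valid at every
        //    point, but the new entry is still absent from the index.
        LeafPage leaf = tree.prepareLeafFor(key);

        // 2. Logical change on the domain page: the entry now exists.
        leaf.put(key, rid);
    }
}
```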
To restore data after a system crash, we replay all transaction changes from the log up to the point of the crash and then revert all uncompleted transactions.
When a transaction rollback is performed, we read WAL records from the last record logged by the transaction back to the first one and roll back the changes logged in those records, using the information stored in the undo part of each WAL record.
During rollback, two types of rollback are possible: a "page level" rollback, where the changes performed on a page are reverted on that page, and a "logical" rollback, where an action opposite to the executed logical operation is performed. Logical rollbacks are performed only to revert logical changes; page level rollbacks are performed for structure modification changes.
Every change made by a logical rollback is again logged into the WAL. This allows restoring data in case of a system crash.
There are several cases of rollback and data restore processing. In one case we roll back all changes applied to the pages at the binary level, and at the end of the rollback the data structure looks exactly as it did at the beginning of the transaction. In the other cases the data structure is restored so that logically it looks as it did at the beginning of the transaction.
Taking all of the above into account, it becomes obvious that to implement the "isolation" property of a transaction it is enough to hold key/record level locks for the duration of the transaction and to hold page level locks only for the duration of structure modification changes.
So what about restoring the system after a crash? How do we keep data in a correct state even if the system crashed during a transaction rollback?
To make this possible, we introduce a new type of log record called a compensation log record (CLR).
This record keeps a pointer to the log record which should be rolled back next after the log record whose rollback generated the current CLR.
Every time we roll back the page changes logged in a log record, we write a related CLR record.
Such a CLR record contains only redo information.
Redo information may include the following data:
The presence of such a CLR record forces the rollback procedure to skip structure modification changes during the rollback of the transaction once they are completed and the change of the domain page has been logged.
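A CLR of this kind could be sketched as follows; this is a minimal illustration with hypothetical names that only captures the two properties described above: redo-only content and the pointer to the next record to undo.

```java
/**
 * Sketch of a compensation log record (CLR): it carries only redo information
 * and a pointer (undoNextLsn) to the log record that should be rolled back
 * next, so rollback/restart never undoes the same record twice.
 * Names are illustrative only.
 */
public class CompensationLogRecord {
    private final long lsn;          // LSN of this CLR itself
    private final long undoNextLsn;  // next record to undo after the one this CLR compensates
    private final byte[] redoData;   // redo-only payload (CLRs are never undone)

    public CompensationLogRecord(long lsn, long undoNextLsn, byte[] redoData) {
        this.lsn = lsn;
        this.undoNextLsn = undoNextLsn;
        this.redoData = redoData;
    }

    public long getUndoNextLsn() { return undoNextLsn; }
    public byte[] getRedoData()  { return redoData; }
}
```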
There are several variants of restoring the database content after a system crash:
There are no CLRs at all, so we merely restore the data structure up to the point when the system crashed, and then roll back all page changes, restoring the initial state of the data structure as it was before the transaction.
In this case we restore all structure modification changes; then, by applying the CLR records, we repeat the partial rollback, and by examining the content of the last CLR record we find which pages still have to be rolled back and restore the initial state of the data structure as it was before the transaction.
In this case we restore all structure modification changes and, by processing the CLR records, roll back all changes which come after the structure modification changes, but we do not roll back the structure modification changes themselves. There is a good reason why we cannot roll back structure modification changes: page locks are released once we apply structure modification changes, and those pages may already have been changed by other successfully committed transactions. At the end of the data restore procedure, the changed data structure will be logically in the same state as it was before the transaction started.
In the last variant we again only revert the content of the domain page and skip the structure modification changes because of the presence of the CLR record. We do not perform a logical removal of the data but only revert the content of the domain page: because of concurrent access to the data structure, it may be in an invalid state until the restore of the state of all transactions is complete.
At the end of the data restore procedure, the changed data structure will be logically in the same state as it was before the transaction started.
As you can see, if we restore the system after a crash, some space may be lost at the end of the data restore procedure, but the amount of such space should be so small that it can be ignored.
Interestingly, even if we implement only the new transaction processing protocol and do not change the lock model of the already existing components, we will still increase system speed and scalability.
Let's suppose we have two ongoing transactions in the system and the component lock model is used.
The first transaction changes components A and B, and the second transaction changes components A and C.
In the current implementation those transactions will be serialized, but in the new implementation the transactions will run in parallel once one of them completes its changes to component A.
So at the first stage we may implement the protocol itself and then change the lock model of components from component level to page level in the next releases (such models will be described in separate OEPs if this OEP is approved).
Let's consider the remaining two operations on data structures: update and delete.
The delete operation is executed in the following steps:
So during transaction rollback or restore, we revert only the domain page content, and as a result the record is automatically restored.
Consider in detail the second item of the delete algorithm: the cleanup queue.
When we delete a record in a cluster, it is reasonable to claim the space consumed by its data back to the storage manager.
To make that possible, we create a cleanup queue which is filled by the operations performed during a transaction (delete/update) and contains the position of the record to be cleared (this position stays the same even during internal page defragmentation and will not change after the record delete).
If a transaction is rolled back, the cleanup queue is cleared and no changes are performed. If a transaction is committed, the changes are applied in a background thread. This cleanup queue consumes very little memory: each queue entry consists of only two fields, clusterId and the position of the record inside the data file. It may also be bounded; for example, we may allow adding no more than 1 000 000 such entries at any time for a single transaction, with no more than 10 000 000 entries in total contained in the queue. The same kind of limit may be applied to the number of locks which may be acquired by a transaction, to avoid any risk of OOM.
One of the ideas is to use a ThreadPoolExecutor with a bounded queue and a maximum number of threads equal to the number of cores, as sketched below.
Cleanup threads which process this queue will pull each entry and process it in a separate transaction.
So even if the system crashes, that transaction will simply be rolled back and only a tiny amount of space will be lost.
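A rough sketch of such a bounded cleanup executor, following the ThreadPoolExecutor idea mentioned above; the concrete limits and the space-reclaiming callback are illustrative assumptions.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/**
 * Sketch of the background cleanup queue described above: each entry is just
 * (clusterId, position); entries are processed by a bounded thread pool,
 * each in its own small transaction. Limits and names are illustrative.
 */
public class CleanupQueueSketch {
    /** One deleted-record position waiting to have its space reclaimed. */
    public static final class CleanupEntry {
        final int clusterId;
        final long position;

        CleanupEntry(int clusterId, long position) {
            this.clusterId = clusterId;
            this.position = position;
        }
    }

    private static final int QUEUE_CAPACITY = 10_000_000; // global bound from the text

    private final ThreadPoolExecutor executor = new ThreadPoolExecutor(
            Runtime.getRuntime().availableProcessors(),   // core threads = number of cores
            Runtime.getRuntime().availableProcessors(),   // max threads = number of cores
            60, TimeUnit.SECONDS,
            new LinkedBlockingQueue<>(QUEUE_CAPACITY));   // bounded queue

    /** Called after a transaction commits, once per deleted/updated record. */
    public void schedule(CleanupEntry entry) {
        // Throws RejectedExecutionException when the bounded queue is full.
        executor.execute(() -> reclaimSpace(entry));
    }

    private void reclaimSpace(CleanupEntry entry) {
        // Hypothetical: open a separate small transaction and return the space
        // occupied by (clusterId, position) to the storage manager.
    }
}
```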
The logic of a data update is similar, because an update is a mix of deleting the previous data and adding new data.
So during rollback:
The restore logic is the same as the rollback logic, with the only exception that we do not remove the already added record but revert the domain page content. We do not perform a logical removal of the data but only revert the content of the domain page: because of concurrent access to the data structure, it may be in an invalid state until the restore of the state of all transactions is complete.
Once again it is worth noting that all those procedures will work even if we use component locks rather than page locks, providing a much better level of scalability without changing the component implementations.
Isolation of the visibility of changes for single-item requests to a cluster or index is trivial:
we lock the record or item inside the transaction and then try to find this item inside the index or cluster.
The implementation of range queries and iteration over a ridbag or cluster is a bit more complex:
Let's estimate the risk of deadlocks in the proposed transaction protocol and the approaches to avoid them.
In the current implementation, we track the changes which are going to be applied both to records and to index keys.
To prevent deadlocks between records and keys, we need to sort all records and all keys and acquire locks on all of them before we start the transaction inside the storage (see the sketch below).
Such an approach will prevent any deadlocks between keys and existing records.
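A minimal sketch of this "sort, then lock" step; the string identifiers and the lock registry are hypothetical and are used only to show the ordering idea.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Sketch of the "sort, then lock" rule described above: all record and index
 * key locks a transaction needs are acquired in a single global order, so two
 * transactions can never hold them in opposite orders and this class of
 * deadlocks cannot occur.
 */
public class OrderedLockingSketch {
    private final Map<String, Lock> locks = new ConcurrentHashMap<>();

    private Lock lockFor(String id) {
        return locks.computeIfAbsent(id, k -> new ReentrantLock());
    }

    /** Acquire locks on all touched records/keys before the storage transaction starts. */
    public void lockAll(List<String> recordAndKeyIds) {
        recordAndKeyIds.stream().sorted().forEach(id -> lockFor(id).lock());
    }
}
```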
But there are still two variants of deadlocks:
To solve the first problem, we will use "try locks."
The algorithm for insertion of records will look like the following:
To solve the second problem, it is proposed to use deadlock detection based on an algorithm for finding cycles in a "wait-for" graph. When a deadlock is detected, the system rolls back the operation on one of the ridbags and repeats it later. Such a deadlock detector is about 80% implemented in https://github.com/orientechnologies/orientdb/tree/dreadlock/core/src/main/java/com/orientechnologies/common/concur/dreadlock.
The detector has O(N) complexity, and as a result the time needed to detect a deadlock is a matter of hundreds of nanoseconds at most. So we may execute deadlock detection every 0.1 usec without risk of performance loss when a deadlock is present.
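For illustration, the core of such a detector, a cycle search in the "wait-for" graph, can be sketched as a simple depth-first search; the real detector linked above is more elaborate, and this is only a minimal sketch.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Minimal sketch of deadlock detection as a cycle search in a "wait-for"
 * graph: an edge T1 -> T2 means transaction T1 waits for a lock held by T2.
 * If a cycle is found, one of its members has to be rolled back and retried.
 */
public class WaitForGraphSketch<T> {
    // waitsFor.get(t) = set of transactions that t is currently waiting for
    private final Map<T, Set<T>> waitsFor;

    public WaitForGraphSketch(Map<T, Set<T>> waitsFor) {
        this.waitsFor = waitsFor;
    }

    /** Returns true if the given transaction participates in a wait cycle. */
    public boolean hasDeadlock(T start) {
        return dfs(start, start, new HashSet<>());
    }

    private boolean dfs(T start, T current, Set<T> visited) {
        if (!visited.add(current))
            return false;                       // already explored, no new cycle through here
        for (T next : waitsFor.getOrDefault(current, Set.of())) {
            if (next.equals(start))
                return true;                    // came back to the start: cycle found
            if (dfs(start, next, visited))
                return true;
        }
        return false;
    }
}
```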
Low-level design
Some parts of the design are already implemented and are described here for clarity.
Let's look at the data structures used in the transaction protocol.
Log records
Log records contain the following fields:
Pages
Each page contains the pageLSN and CRC32 fields. Each page update looks like the following:
When the buffer flushes a page to the disk, it checks that all WAL records are flushed at least up to the LSN stored in pageLSN.
If that is not the case, a flush of the WAL content is performed. During the page flush, a CRC32 code for the page content, including pageLSN, is calculated and assigned to the CRC32 field. The CRC32 field is used to detect whether a page was only partially flushed to the disk because of a process crash.
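A sketch of that flush rule, assuming hypothetical buffer and WAL interfaces: the write-ahead check on pageLSN comes first, then the CRC32 over the page content.

```java
import java.util.zip.CRC32;

/**
 * Sketch of the page flush rule described above: before a dirty page goes to
 * disk, the WAL must be flushed at least up to the page's pageLSN; the CRC32
 * of the page content is then stored so a partially written ("torn") page can
 * be detected after a crash. Interfaces are hypothetical.
 */
public class PageFlushSketch {
    interface WriteAheadLog {
        long flushedLsn();
        void flushUpTo(long lsn);
    }

    interface Page {
        long getPageLsn();
        byte[] contentWithoutCrc();   // page content including pageLSN, excluding the CRC field
        void setCrc32(long crc);
        void writeToDisk();
    }

    void flush(Page page, WriteAheadLog wal) {
        // Write-ahead rule: the log covering this page's last change
        // must reach the disk before the page itself does.
        if (wal.flushedLsn() < page.getPageLsn())
            wal.flushUpTo(page.getPageLsn());

        // CRC32 over the page content (including pageLSN) detects pages
        // that were only partially written because of a process crash.
        CRC32 crc = new CRC32();
        crc.update(page.contentWithoutCrc());
        page.setCrc32(crc.getValue());

        page.writeToDisk();
    }
}
```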
Transaction table
This table is used to track the state of active transactions during normal transaction processing and during the restore of the database after a crash. In our project it will be implemented as a mix of two classes: the atomic operations manager and the atomic operation itself.
The transaction table contains the following fields:
UndoNxtLSN: if the most recent log record written by the transaction is an undoable non-CLR log record, then this field's value is set to that record's LSN. If that most recent log record is a CLR, then this field's value is set to the UndoNxtLSN value from that CLR.
Dirty pages table
This table contains all pages which have not been written to the disk yet. The table consists of three fields: (pageID, fileID, dirtyLSN). The dirtyLSN value is calculated using the following approach: if a page is acquired from the disk cache for an update, then the LSN of the latest operation written to the log becomes the value of dirtyLSN. This table shows the LSNs of operations whose results are probably not stored on the disk yet. Once a page is written to the disk, its entry is removed from the dirty pages table.
Rollbacks
The protocol supports partial rollbacks down to a provided LSN. This feature is used, for example, to roll back ridbag operations after a deadlock is detected. It is possible because of the usage of CLRs, which allow rewinding a transaction to the desired point.
Pseudo code for rollback looks like the following:
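As an illustration only (not the actual pseudo code of the proposal), a minimal sketch of such a partial rollback loop under the assumptions described above could look like this; all types and methods are hypothetical.

```java
/**
 * Sketch of rollback to a given LSN: follow the transaction's undo chain
 * backwards, undo every record, and log a redo-only CLR that points to the
 * record that has to be undone next. All types/methods are hypothetical.
 */
public class RollbackSketch {
    interface Wal {
        UndoableRecord read(long lsn);
        void logClr(long undoNextLsn, byte[] redoOfUndo);
    }

    interface UndoableRecord {
        long getLsn();
        long getPrevLsn();        // previous record of the same transaction
        byte[] undoOnPage();      // revert the page change, return redo of the revert
    }

    /** Rolls the transaction back until (but not including) stopLsn. */
    void rollback(Wal wal, long lastLsnOfTx, long stopLsn) {
        long undoNext = lastLsnOfTx;
        while (undoNext > stopLsn) {
            UndoableRecord record = wal.read(undoNext);

            // Revert the page change using the undo part of the record.
            byte[] redoOfUndo = record.undoOnPage();

            // CLR: redo-only, points to the record to undo after this one,
            // so a crash during rollback never undoes the same record twice.
            wal.logClr(record.getPrevLsn(), redoOfUndo);

            undoNext = record.getPrevLsn();
        }
    }
}
```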
Checkpoints
Obviously, it is impossible to write the log forever. It is also very time consuming to restore the database starting from the first operation ever performed on it. To solve those problems, we periodically make checkpoints of the database content.
During a checkpoint the following actions are performed:
The master record is updated once the log record which indicates the end of the checkpoint is flushed to the disk. The master record is a special place in the WAL, writes to which are always forced to the disk. Its content is also protected by a CRC32 code and by usage of the Twin Blocks pattern.
The checkpoint is treated as valid only when the "end checkpoint" record is stored on the disk.
During a checkpoint, transactions can still be processed and data can still be written; the correct state of the dirty pages and transaction tables will be restored during the analysis pass of the data restore process.
Restore data after crash
The data restore process consists of three passes:
Please note that WAL operations are mostly sequential (with the exception of the undo pass), and several subsequent passes over the WAL will not introduce noticeable performance overhead. The main goal of the analysis pass is to minimize the amount of random IO caused by page loads.
Analysis Pass
The only records written during this pass are the transaction end records, which indicate that a transaction is finished.
During this pass, if a log record is encountered for a page whose identity does not already appear in the dirty pages table, then an entry is made in the table with the current log record’s LSN as the page’s recLSN. The transaction table is modified to track the state changes of transactions and also to note the LSN of the most recent log record that would need to be
undone if it were determined that the transaction had to be rolled back.
The returned RedoLSN is the minimum recLSN stored in the dirty pages table.
Pseudo code for the analysis pass looks like the following:
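As an illustration only (not the actual pseudo code of the proposal), a minimal sketch of the analysis pass described above; the table and record types are hypothetical.

```java
import java.util.Map;

/**
 * Sketch of the analysis pass: rebuild the dirty pages table and the
 * transaction table from the log and return the RedoLSN (the minimum recLSN
 * in the dirty pages table). Types/methods are hypothetical.
 */
public class AnalysisPassSketch {
    interface LogRecord {
        long getLsn();
        long getTxId();
        boolean isTxEnd();
        boolean touchesPage();
        long getPageKey();        // encodes (fileId, pageIndex)
    }

    // Assumed to iterate the log from the last checkpoint to the end.
    interface Wal extends Iterable<LogRecord> {}

    long analysis(Wal wal,
                  Map<Long, Long> dirtyPages /* pageKey -> recLSN */,
                  Map<Long, Long> txTable    /* txId -> last LSN to undo */) {
        for (LogRecord record : wal) {
            if (record.isTxEnd()) {
                // Finished transactions need no undo.
                txTable.remove(record.getTxId());
                continue;
            }
            // Remember the most recent record of every active transaction.
            txTable.put(record.getTxId(), record.getLsn());

            // The first record seen for a page defines its recLSN.
            if (record.touchesPage())
                dirtyPages.putIfAbsent(record.getPageKey(), record.getLsn());
        }
        // RedoLSN: minimum recLSN over all dirty pages.
        return dirtyPages.values().stream().mapToLong(Long::longValue).min()
                .orElse(Long.MAX_VALUE);
    }
}
```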
Redo pass
The redo pass starts scanning the log records from the RedoLSN point. When a redoable log record is encountered, a check is made to see whether the referenced page appears in the dirty pages table. If the page is not present in the dirty pages table, then it has already been flushed to disk, so the result of the operation is not lost and does not need to be restored. If the log record's LSN is greater than or equal to the recLSN for the page in the table, then it is suspected that the page state might be such that the log record's update has to be redone. To resolve this suspicion, the page is accessed and the CRC32 code of the page content is calculated. If the calculated CRC32 code and the one stored inside the page are not equal, then the page content was only partially written since the last flush and the change on the page has to be redone. If the CRC32 codes are equal, we compare the LSNs of the page and the log record. If the page's LSN is found to be less than the log record's LSN, then the update is redone. Thus, the recLSN information serves to limit the number of pages which have to be examined. This routine reestablishes the database state as of the time of the system failure.
Pseudo code for the redo pass looks like the following:
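As an illustration only (not the actual pseudo code of the proposal), a minimal sketch of the per-record decision logic described above; all types are hypothetical.

```java
import java.util.Map;
import java.util.zip.CRC32;

/**
 * Sketch of the redo pass decision per log record, as described above.
 * A change is redone only if the page may actually be missing it.
 * All types and methods are hypothetical.
 */
public class RedoPassSketch {
    interface LogRecord {
        long getLsn();
        long getPageKey();
        void redoOn(Page page);       // apply the redo (binary diff) part
    }

    interface Page {
        long getPageLsn();
        long getStoredCrc32();
        byte[] contentWithoutCrc();
    }

    void maybeRedo(LogRecord record, Map<Long, Long> dirtyPages, Page page) {
        Long recLsn = dirtyPages.get(record.getPageKey());
        if (recLsn == null || record.getLsn() < recLsn)
            return;                    // page already safely on disk with this change

        // Torn page: stored CRC32 does not match the content -> must redo.
        CRC32 crc = new CRC32();
        crc.update(page.contentWithoutCrc());
        boolean torn = crc.getValue() != page.getStoredCrc32();

        // Otherwise redo only if the page has not seen this record yet.
        if (torn || page.getPageLsn() < record.getLsn())
            record.redoOn(page);
    }
}
```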
Undo pass
The input to this routine is the restart transaction table. The dirty pages table is not consulted during this undo pass.
Also, since history is repeated before the undo pass is initiated, the LSN on the page is not consulted to determine whether an undo operation should be performed or not. The restart undo routine rolls back transactions, in reverse chronological
order, in a single sweep of the log. This is done by continually taking the maximum of the LSNs of the next log record to be processed for each of the yet-to-be-completely-undone transactions until no transaction remains to be undone. The next record to process for each transaction to be rolled back is determined by an entry in the transaction table for each of those transactions.
In the process of rolling back the transactions, this routine writes CLRs.
Pseudo code for the undo pass looks like the following:
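As an illustration only (not the actual pseudo code of the proposal), a minimal sketch of the single backward sweep described above, reusing the transaction table produced by the analysis pass; all types are hypothetical.

```java
import java.util.Map;

/**
 * Sketch of the undo pass: a single backward sweep of the log driven by the
 * transaction table, always undoing the record with the largest LSN among
 * all transactions that still have work to undo. Types are hypothetical.
 */
public class UndoPassSketch {
    interface Wal {
        UndoableRecord read(long lsn);
        void logClr(long undoNextLsn, byte[] redoOfUndo);
    }

    interface UndoableRecord {
        long getTxId();
        long getUndoNextLsn();    // previous record of the transaction, or 0 if none
        byte[] undoOnPage();      // revert the change, return redo of the revert
    }

    void undoPass(Wal wal, Map<Long, Long> txTable /* txId -> next LSN to undo */) {
        while (!txTable.isEmpty()) {
            // Pick the largest LSN still waiting to be undone.
            long maxLsn = txTable.values().stream()
                    .mapToLong(Long::longValue).max().getAsLong();

            UndoableRecord record = wal.read(maxLsn);
            byte[] redoOfUndo = record.undoOnPage();
            wal.logClr(record.getUndoNextLsn(), redoOfUndo);

            long next = record.getUndoNextLsn();
            if (next == 0)
                txTable.remove(record.getTxId());   // transaction fully rolled back
            else
                txTable.put(record.getTxId(), next);
        }
    }
}
```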
Parallelisation of data restore after crash
The time of data restore is critical for applications which rely on database high availability.
There are two approaches by which the time of data restore can be decreased:
In the redo pass, the only limitation is that all changes applied to the pages have to be ordered.
So during the redo pass we may read all log records in a single thread and partition the redo operations between different threads.
In the undo pass, all CLR records of a single transaction have to be chained, so we may process undo operations in parallel too, with the processing partitioned using the entries of the transaction table.
Alternatives:
There are several proposed alternatives; the main differences between the current proposal and the rest are the following:
page locks may block a lot of underlying pages.
Risks and assumptions:
There is a risk of data corruption in case of incorrect implementation of data rollback or data restore.
Impact matrix: