Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Process crash when iteratively writing 'batches' of values via cursor in transaction #155

Closed
4deeptech opened this issue Jan 11, 2023 · 24 comments · Fixed by #159
Closed
Labels

Comments

@4deeptech
Copy link

This code enumerates and PUTS via cursor

        public void Append(IEnumerable<EventEnvelope> envList)
        {
            long version = 0;
            using (var tx = _env.BeginTransaction())
            using (var db = tx.OpenDatabase(_dbName, new DatabaseConfiguration { Flags = DatabaseOpenFlags.Create }))
            {
                var (resultCode, versionKey, versionValue) = tx.Get(db, Encoding.UTF8.GetBytes("masterStreamVersion"));
                if (resultCode == MDBResultCode.NotFound)
                {
                    version = 0;
                }
                else if (resultCode == MDBResultCode.Success)
                {
                    version = BitConverter.ToInt64(versionValue.CopyToNewArray(), 0);
                }
                using var cursor = tx.CreateCursor(db);
                {
                    foreach (EventEnvelope env in envList)
                    {
                        version += 1;
                        var eventResult = cursor.Put(Encoding.UTF8.GetBytes("masterStream_" + version), 
                            Encoding.UTF8.GetBytes(JsonConvert.SerializeObject(env, _serializxerSettings)),
                            CursorPutOptions.None);
                        if (eventResult != MDBResultCode.Success)
                        {
                            tx.Abort();
                            throw new Exception("LightningDB cursor put exception");
                        }
                    }
                    var versionResult = cursor.Put(Encoding.UTF8.GetBytes("masterStreamVersion"), BitConverter.GetBytes(version), CursorPutOptions.None);
                    if (versionResult != MDBResultCode.Success)
                    {
                        tx.Abort();
                        throw new Exception("LightningDB write version exception");
                    }
                    var trxResult = tx.Commit();
                    if (trxResult != MDBResultCode.Success)
                    {
                        throw new Exception("LightningDB append commit exception");
                    }
                }
            }
        }

If I call this Append method in a loop 3K times with about 15 messages in the enumeration it will crash the process. However, If I call it once with say 50K items it works fine. As a sanity check I ran the loop 2K times and that succeeded.

This is my first time using LightningDB/lmdb so I may be doing something wrong. There is a key/value containing a version number that is maintained/set after each 'batch' written in the same transaction. Note that the size of the DB is the same in both scenarios and the looping is all done in the same thread like this:

for(int i=0;i<3000;i++)
{
    Append(arrayWith15OrSoItems); //crashes around iteration 2040 in a unit test
}

vs

Append(arrayWith50000Items); //works fine with same type payload in a unit test
@AlgorithmsAreCool
Copy link
Collaborator

Interesting. From a quick glance this code looks correct. I'll take a deeper look at this today or tomorrow.

What kind of exception is being thrown?

@4deeptech
Copy link
Author

No dotnet exceptions thrown, just crashes the process. When running as an xunit test it shows the following:

"The active test run was aborted. Reason: Test host process crashed."

@AlgorithmsAreCool
Copy link
Collaborator

Lemme try to repro.

Btw, what OS is this happening on?

@AlgorithmsAreCool
Copy link
Collaborator

Confirmed a repro. My test console application dies after 2039 iterations of calling append() with no observable exception.

@4deeptech
Copy link
Author

4deeptech commented Jan 13, 2023

Wow, that's exactly where mine was happening. I apologize for not putting the OS details and version. Windows 11, .NET 6, LightningDB 0.14.1. Thanks for spinning up a repro sample. I guess this is a good test case to include into the test suite. I guess I'm not crazy and doing something wrong ;) Wonder if this happens on other OS's. What I was working on may end up running in a linux VM at some point but way too early in my prototype for that.

@AlgorithmsAreCool
Copy link
Collaborator

AlgorithmsAreCool commented Jan 13, 2023

No, it's definitely not just you. There is a crash happening deep in the call stack internal to LMDB. In fact it is faulting in ntdll, probably due to some type of resource exhaustion.

I can see the following error in windows event viewer

Faulting application name: lightningnet-155.exe, version: 1.0.0.0, time stamp: 0x634eda42
Faulting module name: ntdll.dll, version: 10.0.19041.2130, time stamp: 0xb5ced1c6
Exception code: 0xc0000374
Fault offset: 0x00000000000ff6a9
Faulting process id: 0x5e40
Faulting application start time: 0x01d926da7e3c35a2
Faulting application path: D:\dev\lightningnet-155\bin\Debug\net7.0\lightningnet-155.exe
Faulting module path: C:\windows\SYSTEM32\ntdll.dll
Report Id: 4b2b1262-b76b-4923-bf05-f47c1164b6ae
Faulting package full name: 
Faulting package-relative application ID: 

But what is very weird is the program crashes, but I can't observe the exception in VS2022. But when i use WinDBG, the program does not crash and concludes 5000 iteractions successfully. I am still trying to determined what the issue is exactly.

Historically, LMDB is extremely reliable as long as it is used correctly, so i would bet this is a problem with a invalid config. But I still trying to figure out what exactly...

@AlgorithmsAreCool
Copy link
Collaborator

AlgorithmsAreCool commented Jan 13, 2023

Observations

  • The crash occurs after 2039 iterations of Append, regardless of how many entries are in the envList argument.
  • If AutoResizeWindows is set to true, the failure occurs after 2038 interactions, one less then if the setting is false (default)
  • Once MapSize is set to an adequately large value, it has no bearing onthe crash. When set to a small limit, we receive a proper failure via the API during the call to Put. No crash
  • _env.EnvironmentStats.Entries never changes and is always 1 after each call to Append. It should be increasing...
  • Omitting the call to Commit causes the crash to go away...
  • The problem is isolated to use of cursors. Rewriting the code to not use cursors fixes the crash.
  • Omitting the cursor dispose has no effect.

@4deeptech
Copy link
Author

My first attempt at using the LightningDB.net I had it writing one item at a time and didn't get good throughput. So I looked in the unit tests to see if there was a bulk way to get data in and saw the cursor approach so I tried that and got better throughput in my unit tests and thought I was good to go. Then when I wired this into my prototype project it went to hell once I ramped up the number of items in my simulation. I have an Akka actor sending the chunks of items to insert over to the writer actor. Poof, went up in smoke.

How does the transaction work if you don't commit?

Is there a better way to batch stuff without using cursors?

@AlgorithmsAreCool
Copy link
Collaborator

@4deeptech Yep, you only need to use cursors for some of the fancier features like assigning multiple values to a single key (which can be pretty useful for advanced scenarios).

To perform a batch insert simply create a transaction and then call tx.Put(db, ...) as many times as needed and then call tx.Commit(). You can get pretty great write performance this way, millions of records per second depending on disk performance. You can use some documented but dangerous hacks to get faster, but they introduce the possibility of corruption if there is a crash, so they are not recommended.

On the up side, avoiding the use of cursors fixes this issue and In can make unbounded writes as long as the map size is adequate.

I am still trying to pin down the current issue, it is related to disposal of the database handle. I'm wondering if there is a leak somewhere.

@AlgorithmsAreCool
Copy link
Collaborator

AlgorithmsAreCool commented Jan 13, 2023

How does the transaction work if you don't commit?

If do not commit, the data written during the transaction is voided and released when the transaction is disposed.
You MUST dispose the transaction in all cases. It will be obvious if you fail to do so because there is only 1 write transaction allowed at a time. This means writes to LMDB are effectively single threaded (guarded by LMDB internal locking). In practice it isn't a big deal for most scenarios because the write latency so we don't care too much, but some very write intensive workloads would benefit from use of a LSM-tree based database or an append-only style database like FASTER.

@AlgorithmsAreCool
Copy link
Collaborator

After some fiddling, I managed to get WinDBG to hook the process mid crash. It looks like a genuine native LMDB bug involving a corrupt memory heap. I'll work on minimizing a repro and reporting this upstream.

PROCESS_NAME:  lightningnet-155.exe

ERROR_CODE: (NTSTATUS) 0xc0000374 - A heap has been corrupted.

SYMBOL_NAME:  ntdll!RtlpFreeHeapInternal+0

MODULE_NAME: ntdll

IMAGE_NAME:  ntdll.dll

FAILURE_BUCKET_ID:  HEAP_CORRUPTION_ACTIONABLE_BlockNotBusy_DOUBLE_FREE_c0000374_ntdll.dll!RtlpFreeHeapInternal

FAILURE_ID_HASH:  {f9e860eb-b03f-7415-804c-7e671e26c730}

@4deeptech
In the meantime, the recommended workaround is to not use cursors. It sounds like you just need to do some bulk writes. As I mentioned above, LMDB transactions support an unbounded number of Put operations per-transaction. So you don't need to use a cursor to fulfill this use case.

Another side note, you don't actually need to re-open the database handle each iteration. You can safely open a handle once per application and then reuse it for all transactions. This has a positive impact on performance. Each time you open a DB handle in LMDB, it performs a linear scan of all the databases in the environment. If your environment has lots of databases (dozens or more...) the time taken to open a DB starts to add up.

There are some other optimization opportunities i see in your example code pertaining to string operations and memory allocations, but idk how much of a performance win they would be compared to the IO operation.

Let me know if you'd like some example code, I'd be happy to help.

@AlgorithmsAreCool
Copy link
Collaborator

@4deeptech
Here is a minimal repro program of this bug. When you get some time could you run this in your environment to ensure it still crashes?

using System.Text;
using LightningDB;

//this is a single file formatted C# program

//executions begins here
Console.WriteLine("Hello, World!");

using var environment = CreateFreshEnvironment("C:/dev/temp");
environment.Open();

using var db = OpenDatabase(environment, "testDb");

//this loop fails after 2039 iterations
for (int i = 0; i < 5000; i++) {
    ReproCoreIteration(environment, db);
    Console.WriteLine($"Completed iteration {i}");
}


static void ReproCoreIteration(LightningEnvironment environment, LightningDatabase db)
{
    using var tx = environment.BeginTransaction();//auto-disposed at end of scope
    using var cursor = tx.CreateCursor(db); //auto-disposed at end of scope

    var guid = Guid.NewGuid().ToString();
    var guidBytes = Encoding.UTF8.GetBytes(guid);

    _ = cursor.Put(
        guidBytes,
        guidBytes,
        CursorPutOptions.None
    );

    var trxResult = tx.Commit();

    if (trxResult != MDBResultCode.Success)
        throw new Exception("LightningDB append commit exception");

    //program crashes here during transaction dispose after exactly 2039 iterations
    //Note : Commit() call is successful, failure happens after commit when transaction is disposed
}


static LightningDatabase OpenDatabase(LightningEnvironment environment, string dbName)
{
    using var tx = environment.BeginTransaction();
    var db = tx.OpenDatabase(dbName, new DatabaseConfiguration { Flags = DatabaseOpenFlags.Create });
    
    tx.Commit(); //ensure database is created

    return db;
}

static LightningEnvironment CreateFreshEnvironment(string path)
{
    if (Directory.Exists(path))
    {
        Directory.Delete(path, true);
        Console.WriteLine("Cleaned up previous directory");
    }

    return new LightningEnvironment(path, new EnvironmentConfiguration {
        MapSize = 100_000_000,
        MaxDatabases = 2
    });
}

Project

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net7.0</TargetFramework> <!-- other runtimes should also repro -->
    <RootNamespace>lightningnet_155_minimal</RootNamespace>
    <ImplicitUsings>enable</ImplicitUsings>
    <Nullable>enable</Nullable>
  </PropertyGroup>

<ItemGroup>
  <PackageReference Include="lightningdb" Version="0.14.1" />
</ItemGroup>

</Project>

@4deeptech
Copy link
Author

...
Completed iteration 2034
Completed iteration 2035
Completed iteration 2036
Completed iteration 2037
Completed iteration 2038

D:\bootstrap_projects\LightningDBErrorRepro\LightningDBErrorRepro\bin\Debug\net6.0\LightningDBErrorRepro.exe (process 3652) exited with code -1073740940.

@AlgorithmsAreCool
Copy link
Collaborator

I'm going to try updating the version of LMDB we depend on, but I am not hopeful about there being a fix already made since I have tried searching their bug tracker and haven't seen one that matches this description.

So this might take a while to resolve since it is an upstream fix.

@4deeptech
Copy link
Author

@AlgorithmsAreCool Sounds good. Thanks for verifying the issue and suggesting a workable alternative!

@pepelev
Copy link

pepelev commented Jan 19, 2023

Hi!

I have done research on this issue and I think I know the root of the problem.

First of all, I set up Windows error reporting (the documentation says this doesn't work for .net applications, but it's not true) to get a crash dump.
When the dump was collected, after fiddling with the symbols, I got the following picture of the stacktrace of the main thread:
stack-trace

This looks like cursor freed twice.

Then i altered LightningCursor.Dispose(bool) to count free calls like this:

Code
private long freed;
private void Dispose(bool disposing)
{
    if (_handle == IntPtr.Zero)
        return;

    if (!disposing)
        throw new InvalidOperationException("The LightningCursor was not disposed and cannot be reliably dealt with from the finalizer");

    Interlocked.Increment(ref freed);
    mdb_cursor_close(_handle);
    _handle = IntPtr.Zero;

    Transaction.Disposing -= Dispose;

    GC.SuppressFinalize(this);
}

and collect dump again. The dump showed that the fatal call mdb_cursor_close(_handle); is the first in this instance of LightningCursor.

Another interesting thing is that if i dispose the cursor before committing the transaction, the application will not crash.

Code
Console.WriteLine("Hello, World!");

using var environment = CreateFreshEnvironment("C:/dev/temp");
environment.Open();

using var db = OpenDatabase(environment, "testDb");

//this loop fails after 2039 iterations
for (int i = 0; i < 5000; i++) {
    ReproCoreIteration(environment, db);
    Console.WriteLine($"Completed iteration {i}");
}


static void ReproCoreIteration(LightningEnvironment environment, LightningDatabase db)
{
    using var tx = environment.BeginTransaction();//auto-disposed at end of scope
    using (var cursor = tx.CreateCursor(db)) // <-------------------------------------------------------
    { // <----------------------------------------------------------------------------------------------
        var guid = Guid.NewGuid().ToString();
        var guidBytes = Encoding.UTF8.GetBytes(guid);

        _ = cursor.Put(
            guidBytes,
            guidBytes,
            CursorPutOptions.None
        );
    } // <----------------------------------------------------------------------------------------------

    var trxResult = tx.Commit();

    if (trxResult != MDBResultCode.Success)
        throw new Exception("LightningDB append commit exception");

    //program crashes here during transaction dispose after exactly 2039 iterations
    //Note : Commit() call is successful, failure happens after commit when transaction is disposed
}


static LightningDatabase OpenDatabase(LightningEnvironment environment, string dbName)
{
    using var tx = environment.BeginTransaction();
    var db = tx.OpenDatabase(dbName, new DatabaseConfiguration { Flags = DatabaseOpenFlags.Create });

    tx.Commit(); //ensure database is created

    return db;
}

static LightningEnvironment CreateFreshEnvironment(string path)
{
    if (Directory.Exists(path))
    {
        Directory.Delete(path, true);
        Console.WriteLine("Cleaned up previous directory");
    }

    return new LightningEnvironment(path, new EnvironmentConfiguration {
        MapSize = 100_000_000,
        MaxDatabases = 2
    });
}

Finally, I found that the lmdb documentation on mdb_cursor_open states "A cursor in a write-transaction can be closed before its transaction ends, and will otherwise be closed when its transaction ends.". I am not a C developer, so i can get something wrong here. But it seems mdb_txn_commit calls mdb_cursors_close, which in turn calls free for transaction cursors. mdb_cursor_close also calls free for a cursor.

Therefore, when we use LightningTransaction.Commit() and then LightningCursor.Dispose() we get cursor double freed.

I do not know where the magic number 2039 (suspiciously close to 2048 ж) ) comes from. Perhaps this is the internal specifics of the Windows allocator.

Could you guys try to dispose the cursor before committing and check if the problem is reproducing?

@sebastienros
Copy link
Contributor

To confirm your hypothesis:

1- I disposed the cursor before committing the transaction. It worked for the 5000 iterations.
2- I then changed the Dispose method of the Cursor to not call mdb_cursor_close if the transaction has been committed. And committed the transaction before disposing the cursor (like your initial repro), and it also fixed the problem.

        private void Dispose(bool disposing)
        {
            if (_handle == IntPtr.Zero)
                return;

            if (!disposing)
                throw new InvalidOperationException("The LightningCursor was not disposed and cannot be reliably dealt with from the finalizer");

            // If the transaction is committed the cursor is already closed
            if (Transaction.State != LightningTransactionState.Commited)
            {
                mdb_cursor_close(_handle);
            }

            _handle = IntPtr.Zero;

            Transaction.Disposing -= Dispose;

            GC.SuppressFinalize(this);
        }

So I agree with your hypothesis, and this fix should solve the problem. Looking at the c code committing the transaction will already close the cursor, so the c# wrapper should not do that again.

@sebastienros
Copy link
Contributor

NB: I noticed a huge performance degradation when disposing the cursor before committing the transaction, in case that matters in your use case.

@AlgorithmsAreCool
Copy link
Collaborator

This is interesting, I'll re-check my test code. I thought I tested the cursor disposal when I was chasing this last week, Perhaps I didn't.

@CoreyKaylor
Copy link
Owner

CoreyKaylor commented Jan 25, 2023

Anybody up for submitting a PR? I think @sebastienros 's approach seems reasonable. Although, I think the intention was Transaction.State == LightningTransactionState.Commited no?

BTW, love all the activity.

@sebastienros
Copy link
Contributor

@CoreyKaylor I think my code is correct, mdb_txn_commit is invoked by Commit() which sets State = LightningTransactionState.Commited.

State = LightningTransactionState.Commited;
StateChanging?.Invoke(State);
return mdb_txn_commit(_handle);

So we need to call mdb_cursor_close(_handle) when Transaction.State != LightningTransactionState.Commited

I checked other places and I think this could also happen when the Database is dropped and then calls mdb_drop

var result = mdb_drop(transaction.Handle(), _handle, true);

Which will then close the cursor for a connection

https://github.com/LMDB/lmdb/blob/mdb.master/libraries/liblmdb/mdb.c#L11113

And I don't think the State is changed so it would also invoke mdb_cursor_close when the transaction is disposed. Maybe the Transaction object needs a flag to record if the associated cursor needs to be closed.

From what I see in other libraries, they call mdb_cursor_close as soon as abort is invoked. That would be possible if a transaction could reference its cursors.

@CoreyKaylor
Copy link
Owner

Yes, you're right. I read the comment on the line above wrong in relation to the code right below.

That's probably a reasonable suggestion for this scenario to use a flag on the transaction.

@CoreyKaylor
Copy link
Owner

Please take a look at PR #159. I believe this should solve the problem using @sebastienros suggestion of conditional check. It also coincidentally stabilized one of the flakey tests that I couldn't figure out.

@CoreyKaylor
Copy link
Owner

Published version 0.15.0 that includes this change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants