Process crash when iteratively writing 'batches' of values via cursor in transaction #155

4deeptech · 2023-01-11T21:39:03Z

This code enumerates and PUTS via cursor

        public void Append(IEnumerable<EventEnvelope> envList)
        {
            long version = 0;
            using (var tx = _env.BeginTransaction())
            using (var db = tx.OpenDatabase(_dbName, new DatabaseConfiguration { Flags = DatabaseOpenFlags.Create }))
            {
                var (resultCode, versionKey, versionValue) = tx.Get(db, Encoding.UTF8.GetBytes("masterStreamVersion"));
                if (resultCode == MDBResultCode.NotFound)
                {
                    version = 0;
                }
                else if (resultCode == MDBResultCode.Success)
                {
                    version = BitConverter.ToInt64(versionValue.CopyToNewArray(), 0);
                }
                using var cursor = tx.CreateCursor(db);
                {
                    foreach (EventEnvelope env in envList)
                    {
                        version += 1;
                        var eventResult = cursor.Put(Encoding.UTF8.GetBytes("masterStream_" + version), 
                            Encoding.UTF8.GetBytes(JsonConvert.SerializeObject(env, _serializxerSettings)),
                            CursorPutOptions.None);
                        if (eventResult != MDBResultCode.Success)
                        {
                            tx.Abort();
                            throw new Exception("LightningDB cursor put exception");
                        }
                    }
                    var versionResult = cursor.Put(Encoding.UTF8.GetBytes("masterStreamVersion"), BitConverter.GetBytes(version), CursorPutOptions.None);
                    if (versionResult != MDBResultCode.Success)
                    {
                        tx.Abort();
                        throw new Exception("LightningDB write version exception");
                    }
                    var trxResult = tx.Commit();
                    if (trxResult != MDBResultCode.Success)
                    {
                        throw new Exception("LightningDB append commit exception");
                    }
                }
            }
        }

If I call this Append method in a loop 3K times with about 15 messages in the enumeration it will crash the process. However, If I call it once with say 50K items it works fine. As a sanity check I ran the loop 2K times and that succeeded.

This is my first time using LightningDB/lmdb so I may be doing something wrong. There is a key/value containing a version number that is maintained/set after each 'batch' written in the same transaction. Note that the size of the DB is the same in both scenarios and the looping is all done in the same thread like this:

for(int i=0;i<3000;i++)
{
    Append(arrayWith15OrSoItems); //crashes around iteration 2040 in a unit test
}

vs

Append(arrayWith50000Items); //works fine with same type payload in a unit test

AlgorithmsAreCool · 2023-01-12T20:08:35Z

Interesting. From a quick glance this code looks correct. I'll take a deeper look at this today or tomorrow.

What kind of exception is being thrown?

4deeptech · 2023-01-12T20:53:20Z

No dotnet exceptions thrown, just crashes the process. When running as an xunit test it shows the following:

"The active test run was aborted. Reason: Test host process crashed."

AlgorithmsAreCool · 2023-01-12T22:52:24Z

Lemme try to repro.

Btw, what OS is this happening on?

AlgorithmsAreCool · 2023-01-12T23:04:31Z

Confirmed a repro. My test console application dies after 2039 iterations of calling append() with no observable exception.

4deeptech · 2023-01-13T00:25:11Z

Wow, that's exactly where mine was happening. I apologize for not putting the OS details and version. Windows 11, .NET 6, LightningDB 0.14.1. Thanks for spinning up a repro sample. I guess this is a good test case to include into the test suite. I guess I'm not crazy and doing something wrong ;) Wonder if this happens on other OS's. What I was working on may end up running in a linux VM at some point but way too early in my prototype for that.

AlgorithmsAreCool · 2023-01-13T00:30:39Z

No, it's definitely not just you. There is a crash happening deep in the call stack internal to LMDB. In fact it is faulting in ntdll, probably due to some type of resource exhaustion.

I can see the following error in windows event viewer

Faulting application name: lightningnet-155.exe, version: 1.0.0.0, time stamp: 0x634eda42
Faulting module name: ntdll.dll, version: 10.0.19041.2130, time stamp: 0xb5ced1c6
Exception code: 0xc0000374
Fault offset: 0x00000000000ff6a9
Faulting process id: 0x5e40
Faulting application start time: 0x01d926da7e3c35a2
Faulting application path: D:\dev\lightningnet-155\bin\Debug\net7.0\lightningnet-155.exe
Faulting module path: C:\windows\SYSTEM32\ntdll.dll
Report Id: 4b2b1262-b76b-4923-bf05-f47c1164b6ae
Faulting package full name: 
Faulting package-relative application ID:

But what is very weird is the program crashes, but I can't observe the exception in VS2022. But when i use WinDBG, the program does not crash and concludes 5000 iteractions successfully. I am still trying to determined what the issue is exactly.

Historically, LMDB is extremely reliable as long as it is used correctly, so i would bet this is a problem with a invalid config. But I still trying to figure out what exactly...

AlgorithmsAreCool · 2023-01-13T00:41:09Z

Observations

The crash occurs after 2039 iterations of Append, regardless of how many entries are in the envList argument.
If AutoResizeWindows is set to true, the failure occurs after 2038 interactions, one less then if the setting is false (default)
Once MapSize is set to an adequately large value, it has no bearing onthe crash. When set to a small limit, we receive a proper failure via the API during the call to Put. No crash
_env.EnvironmentStats.Entries never changes and is always 1 after each call to Append. It should be increasing...
Omitting the call to Commit causes the crash to go away...
The problem is isolated to use of cursors. Rewriting the code to not use cursors fixes the crash.
Omitting the cursor dispose has no effect.

4deeptech · 2023-01-13T01:07:22Z

My first attempt at using the LightningDB.net I had it writing one item at a time and didn't get good throughput. So I looked in the unit tests to see if there was a bulk way to get data in and saw the cursor approach so I tried that and got better throughput in my unit tests and thought I was good to go. Then when I wired this into my prototype project it went to hell once I ramped up the number of items in my simulation. I have an Akka actor sending the chunks of items to insert over to the writer actor. Poof, went up in smoke.

How does the transaction work if you don't commit?

Is there a better way to batch stuff without using cursors?

AlgorithmsAreCool · 2023-01-13T01:14:14Z

@4deeptech Yep, you only need to use cursors for some of the fancier features like assigning multiple values to a single key (which can be pretty useful for advanced scenarios).

To perform a batch insert simply create a transaction and then call tx.Put(db, ...) as many times as needed and then call tx.Commit(). You can get pretty great write performance this way, millions of records per second depending on disk performance. You can use some documented but dangerous hacks to get faster, but they introduce the possibility of corruption if there is a crash, so they are not recommended.

On the up side, avoiding the use of cursors fixes this issue and In can make unbounded writes as long as the map size is adequate.

I am still trying to pin down the current issue, it is related to disposal of the database handle. I'm wondering if there is a leak somewhere.

AlgorithmsAreCool · 2023-01-13T01:18:24Z

How does the transaction work if you don't commit?

If do not commit, the data written during the transaction is voided and released when the transaction is disposed.
You MUST dispose the transaction in all cases. It will be obvious if you fail to do so because there is only 1 write transaction allowed at a time. This means writes to LMDB are effectively single threaded (guarded by LMDB internal locking). In practice it isn't a big deal for most scenarios because the write latency so we don't care too much, but some very write intensive workloads would benefit from use of a LSM-tree based database or an append-only style database like FASTER.

AlgorithmsAreCool · 2023-01-13T03:35:41Z

After some fiddling, I managed to get WinDBG to hook the process mid crash. It looks like a genuine native LMDB bug involving a corrupt memory heap. I'll work on minimizing a repro and reporting this upstream.

PROCESS_NAME:  lightningnet-155.exe

ERROR_CODE: (NTSTATUS) 0xc0000374 - A heap has been corrupted.

SYMBOL_NAME:  ntdll!RtlpFreeHeapInternal+0

MODULE_NAME: ntdll

IMAGE_NAME:  ntdll.dll

FAILURE_BUCKET_ID:  HEAP_CORRUPTION_ACTIONABLE_BlockNotBusy_DOUBLE_FREE_c0000374_ntdll.dll!RtlpFreeHeapInternal

FAILURE_ID_HASH:  {f9e860eb-b03f-7415-804c-7e671e26c730}

@4deeptech
In the meantime, the recommended workaround is to not use cursors. It sounds like you just need to do some bulk writes. As I mentioned above, LMDB transactions support an unbounded number of Put operations per-transaction. So you don't need to use a cursor to fulfill this use case.

Another side note, you don't actually need to re-open the database handle each iteration. You can safely open a handle once per application and then reuse it for all transactions. This has a positive impact on performance. Each time you open a DB handle in LMDB, it performs a linear scan of all the databases in the environment. If your environment has lots of databases (dozens or more...) the time taken to open a DB starts to add up.

There are some other optimization opportunities i see in your example code pertaining to string operations and memory allocations, but idk how much of a performance win they would be compared to the IO operation.

Let me know if you'd like some example code, I'd be happy to help.

AlgorithmsAreCool · 2023-01-13T16:35:23Z

@4deeptech
Here is a minimal repro program of this bug. When you get some time could you run this in your environment to ensure it still crashes?

using System.Text;
using LightningDB;

//this is a single file formatted C# program

//executions begins here
Console.WriteLine("Hello, World!");

using var environment = CreateFreshEnvironment("C:/dev/temp");
environment.Open();

using var db = OpenDatabase(environment, "testDb");

//this loop fails after 2039 iterations
for (int i = 0; i < 5000; i++) {
    ReproCoreIteration(environment, db);
    Console.WriteLine($"Completed iteration {i}");
}


static void ReproCoreIteration(LightningEnvironment environment, LightningDatabase db)
{
    using var tx = environment.BeginTransaction();//auto-disposed at end of scope
    using var cursor = tx.CreateCursor(db); //auto-disposed at end of scope

    var guid = Guid.NewGuid().ToString();
    var guidBytes = Encoding.UTF8.GetBytes(guid);

    _ = cursor.Put(
        guidBytes,
        guidBytes,
        CursorPutOptions.None
    );

    var trxResult = tx.Commit();

    if (trxResult != MDBResultCode.Success)
        throw new Exception("LightningDB append commit exception");

    //program crashes here during transaction dispose after exactly 2039 iterations
    //Note : Commit() call is successful, failure happens after commit when transaction is disposed
}


static LightningDatabase OpenDatabase(LightningEnvironment environment, string dbName)
{
    using var tx = environment.BeginTransaction();
    var db = tx.OpenDatabase(dbName, new DatabaseConfiguration { Flags = DatabaseOpenFlags.Create });
    
    tx.Commit(); //ensure database is created

    return db;
}

static LightningEnvironment CreateFreshEnvironment(string path)
{
    if (Directory.Exists(path))
    {
        Directory.Delete(path, true);
        Console.WriteLine("Cleaned up previous directory");
    }

    return new LightningEnvironment(path, new EnvironmentConfiguration {
        MapSize = 100_000_000,
        MaxDatabases = 2
    });
}

Project

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net7.0</TargetFramework> <!-- other runtimes should also repro -->
    <RootNamespace>lightningnet_155_minimal</RootNamespace>
    <ImplicitUsings>enable</ImplicitUsings>
    <Nullable>enable</Nullable>
  </PropertyGroup>

<ItemGroup>
  <PackageReference Include="lightningdb" Version="0.14.1" />
</ItemGroup>

</Project>

4deeptech · 2023-01-13T17:39:08Z

...
Completed iteration 2034
Completed iteration 2035
Completed iteration 2036
Completed iteration 2037
Completed iteration 2038

D:\bootstrap_projects\LightningDBErrorRepro\LightningDBErrorRepro\bin\Debug\net6.0\LightningDBErrorRepro.exe (process 3652) exited with code -1073740940.

AlgorithmsAreCool · 2023-01-14T00:26:33Z

I'm going to try updating the version of LMDB we depend on, but I am not hopeful about there being a fix already made since I have tried searching their bug tracker and haven't seen one that matches this description.

So this might take a while to resolve since it is an upstream fix.

4deeptech · 2023-01-14T20:48:57Z

@AlgorithmsAreCool Sounds good. Thanks for verifying the issue and suggesting a workable alternative!

pepelev · 2023-01-19T23:30:53Z

Hi!

I have done research on this issue and I think I know the root of the problem.

First of all, I set up Windows error reporting (the documentation says this doesn't work for .net applications, but it's not true) to get a crash dump.
When the dump was collected, after fiddling with the symbols, I got the following picture of the stacktrace of the main thread:

This looks like cursor freed twice.

Then i altered LightningCursor.Dispose(bool) to count free calls like this:

Code

private long freed;
private void Dispose(bool disposing)
{
    if (_handle == IntPtr.Zero)
        return;

    if (!disposing)
        throw new InvalidOperationException("The LightningCursor was not disposed and cannot be reliably dealt with from the finalizer");

    Interlocked.Increment(ref freed);
    mdb_cursor_close(_handle);
    _handle = IntPtr.Zero;

    Transaction.Disposing -= Dispose;

    GC.SuppressFinalize(this);
}

and collect dump again. The dump showed that the fatal call mdb_cursor_close(_handle); is the first in this instance of LightningCursor.

Another interesting thing is that if i dispose the cursor before committing the transaction, the application will not crash.

Code

Console.WriteLine("Hello, World!");

using var environment = CreateFreshEnvironment("C:/dev/temp");
environment.Open();

using var db = OpenDatabase(environment, "testDb");

//this loop fails after 2039 iterations
for (int i = 0; i < 5000; i++) {
    ReproCoreIteration(environment, db);
    Console.WriteLine($"Completed iteration {i}");
}


static void ReproCoreIteration(LightningEnvironment environment, LightningDatabase db)
{
    using var tx = environment.BeginTransaction();//auto-disposed at end of scope
    using (var cursor = tx.CreateCursor(db)) // <-------------------------------------------------------
    { // <----------------------------------------------------------------------------------------------
        var guid = Guid.NewGuid().ToString();
        var guidBytes = Encoding.UTF8.GetBytes(guid);

        _ = cursor.Put(
            guidBytes,
            guidBytes,
            CursorPutOptions.None
        );
    } // <----------------------------------------------------------------------------------------------

    var trxResult = tx.Commit();

    if (trxResult != MDBResultCode.Success)
        throw new Exception("LightningDB append commit exception");

    //program crashes here during transaction dispose after exactly 2039 iterations
    //Note : Commit() call is successful, failure happens after commit when transaction is disposed
}


static LightningDatabase OpenDatabase(LightningEnvironment environment, string dbName)
{
    using var tx = environment.BeginTransaction();
    var db = tx.OpenDatabase(dbName, new DatabaseConfiguration { Flags = DatabaseOpenFlags.Create });

    tx.Commit(); //ensure database is created

    return db;
}

static LightningEnvironment CreateFreshEnvironment(string path)
{
    if (Directory.Exists(path))
    {
        Directory.Delete(path, true);
        Console.WriteLine("Cleaned up previous directory");
    }

    return new LightningEnvironment(path, new EnvironmentConfiguration {
        MapSize = 100_000_000,
        MaxDatabases = 2
    });
}

Finally, I found that the lmdb documentation on mdb_cursor_open states "A cursor in a write-transaction can be closed before its transaction ends, and will otherwise be closed when its transaction ends.". I am not a C developer, so i can get something wrong here. But it seems mdb_txn_commit calls mdb_cursors_close, which in turn calls free for transaction cursors. mdb_cursor_close also calls free for a cursor.

Therefore, when we use LightningTransaction.Commit() and then LightningCursor.Dispose() we get cursor double freed.

I do not know where the magic number 2039 (suspiciously close to 2048 ж) ) comes from. Perhaps this is the internal specifics of the Windows allocator.

Could you guys try to dispose the cursor before committing and check if the problem is reproducing?

sebastienros · 2023-01-20T00:23:46Z

To confirm your hypothesis:

1- I disposed the cursor before committing the transaction. It worked for the 5000 iterations.
2- I then changed the Dispose method of the Cursor to not call mdb_cursor_close if the transaction has been committed. And committed the transaction before disposing the cursor (like your initial repro), and it also fixed the problem.

        private void Dispose(bool disposing)
        {
            if (_handle == IntPtr.Zero)
                return;

            if (!disposing)
                throw new InvalidOperationException("The LightningCursor was not disposed and cannot be reliably dealt with from the finalizer");

            // If the transaction is committed the cursor is already closed
            if (Transaction.State != LightningTransactionState.Commited)
            {
                mdb_cursor_close(_handle);
            }

            _handle = IntPtr.Zero;

            Transaction.Disposing -= Dispose;

            GC.SuppressFinalize(this);
        }

So I agree with your hypothesis, and this fix should solve the problem. Looking at the c code committing the transaction will already close the cursor, so the c# wrapper should not do that again.

sebastienros · 2023-01-20T00:24:52Z

NB: I noticed a huge performance degradation when disposing the cursor before committing the transaction, in case that matters in your use case.

AlgorithmsAreCool · 2023-01-20T04:45:54Z

This is interesting, I'll re-check my test code. I thought I tested the cursor disposal when I was chasing this last week, Perhaps I didn't.

CoreyKaylor · 2023-01-25T22:24:15Z

Anybody up for submitting a PR? I think @sebastienros 's approach seems reasonable. Although, I think the intention was Transaction.State == LightningTransactionState.Commited no?

BTW, love all the activity.

sebastienros · 2023-01-25T23:01:54Z

@CoreyKaylor I think my code is correct, mdb_txn_commit is invoked by Commit() which sets State = LightningTransactionState.Commited.

Lightning.NET/src/LightningDB/LightningTransaction.cs

Lines 299 to 301 in 4db175a

    
           State = LightningTransactionState.Commited; 
        
           StateChanging?.Invoke(State); 
        
           return mdb_txn_commit(_handle);

So we need to call mdb_cursor_close(_handle) when Transaction.State != LightningTransactionState.Commited

I checked other places and I think this could also happen when the Database is dropped and then calls mdb_drop

Lightning.NET/src/LightningDB/LightningDatabase.cs

Line 88 in 4db175a

var result = mdb_drop(transaction.Handle(), _handle, true);

Which will then close the cursor for a connection

https://github.com/LMDB/lmdb/blob/mdb.master/libraries/liblmdb/mdb.c#L11113

And I don't think the State is changed so it would also invoke mdb_cursor_close when the transaction is disposed. Maybe the Transaction object needs a flag to record if the associated cursor needs to be closed.

From what I see in other libraries, they call mdb_cursor_close as soon as abort is invoked. That would be possible if a transaction could reference its cursors.

CoreyKaylor · 2023-01-26T00:03:03Z

Yes, you're right. I read the comment on the line above wrong in relation to the code right below.

That's probably a reasonable suggestion for this scenario to use a flag on the transaction.

CoreyKaylor · 2023-02-04T17:04:31Z

Please take a look at PR #159. I believe this should solve the problem using @sebastienros suggestion of conditional check. It also coincidentally stabilized one of the flakey tests that I couldn't figure out.

CoreyKaylor · 2023-02-04T20:46:54Z

Published version 0.15.0 that includes this change.

AlgorithmsAreCool added the bug label Jan 13, 2023

CoreyKaylor added a commit that referenced this issue Feb 4, 2023

Reverting change due to test issue. Likely related to #155.

ea1aee3

CoreyKaylor mentioned this issue Feb 4, 2023

Only close cursor when it wasn't closed for us #159

Merged

CoreyKaylor closed this as completed in #159 Feb 4, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Process crash when iteratively writing 'batches' of values via cursor in transaction #155

Process crash when iteratively writing 'batches' of values via cursor in transaction #155

4deeptech commented Jan 11, 2023

AlgorithmsAreCool commented Jan 12, 2023

4deeptech commented Jan 12, 2023

AlgorithmsAreCool commented Jan 12, 2023

AlgorithmsAreCool commented Jan 12, 2023

4deeptech commented Jan 13, 2023 •

edited

Loading

AlgorithmsAreCool commented Jan 13, 2023 •

edited

Loading

AlgorithmsAreCool commented Jan 13, 2023 •

edited

Loading

4deeptech commented Jan 13, 2023

AlgorithmsAreCool commented Jan 13, 2023

AlgorithmsAreCool commented Jan 13, 2023 •

edited

Loading

AlgorithmsAreCool commented Jan 13, 2023

AlgorithmsAreCool commented Jan 13, 2023

4deeptech commented Jan 13, 2023

AlgorithmsAreCool commented Jan 14, 2023

4deeptech commented Jan 14, 2023

pepelev commented Jan 19, 2023

sebastienros commented Jan 20, 2023

sebastienros commented Jan 20, 2023

AlgorithmsAreCool commented Jan 20, 2023

CoreyKaylor commented Jan 25, 2023 •

edited

Loading

sebastienros commented Jan 25, 2023

CoreyKaylor commented Jan 26, 2023

CoreyKaylor commented Feb 4, 2023

CoreyKaylor commented Feb 4, 2023

Process crash when iteratively writing 'batches' of values via cursor in transaction #155

Process crash when iteratively writing 'batches' of values via cursor in transaction #155

Comments

4deeptech commented Jan 11, 2023

AlgorithmsAreCool commented Jan 12, 2023

4deeptech commented Jan 12, 2023

AlgorithmsAreCool commented Jan 12, 2023

AlgorithmsAreCool commented Jan 12, 2023

4deeptech commented Jan 13, 2023 • edited Loading

AlgorithmsAreCool commented Jan 13, 2023 • edited Loading

AlgorithmsAreCool commented Jan 13, 2023 • edited Loading

4deeptech commented Jan 13, 2023

AlgorithmsAreCool commented Jan 13, 2023

AlgorithmsAreCool commented Jan 13, 2023 • edited Loading

AlgorithmsAreCool commented Jan 13, 2023

AlgorithmsAreCool commented Jan 13, 2023

Project

4deeptech commented Jan 13, 2023

AlgorithmsAreCool commented Jan 14, 2023

4deeptech commented Jan 14, 2023

pepelev commented Jan 19, 2023

sebastienros commented Jan 20, 2023

sebastienros commented Jan 20, 2023

AlgorithmsAreCool commented Jan 20, 2023

CoreyKaylor commented Jan 25, 2023 • edited Loading

sebastienros commented Jan 25, 2023

CoreyKaylor commented Jan 26, 2023

CoreyKaylor commented Feb 4, 2023

CoreyKaylor commented Feb 4, 2023

4deeptech commented Jan 13, 2023 •

edited

Loading

AlgorithmsAreCool commented Jan 13, 2023 •

edited

Loading

AlgorithmsAreCool commented Jan 13, 2023 •

edited

Loading

AlgorithmsAreCool commented Jan 13, 2023 •

edited

Loading

CoreyKaylor commented Jan 25, 2023 •

edited

Loading