
Memory leak #41

Closed
hz61p1 opened this issue Apr 9, 2023 · 18 comments · Fixed by #42

hz61p1 commented Apr 9, 2023

I am observing a memory leak.

Part of the code:
import cysimdjson
import lz4.frame

# test_data is a dict of the shape {'common': {}, 'player': {}, 'vehicle': {}}, defined elsewhere
metadata_parser, gamedata_parser = cysimdjson.JSONParser(), cysimdjson.JSONParser()
with lz4.frame.open(filepath) as file:
  for line in file:
      idx, metadata, gamedata = line.rstrip(b'\n').split(chr(31).encode())
      metadata, gamedata = metadata_parser.parse(metadata), gamedata_parser.parse(gamedata)
      for key, value in gamedata.at_pointer('/0/common').items():
          if key not in test_data['common']:
              test_data['common'][key] = []
          value_type = str(type(value))
          if value_type not in test_data['common'][key]:
              test_data['common'][key].append(value_type)
      for _, player in gamedata.at_pointer('/1').items():
          for key, value in player.items():
              if key not in test_data['player']:
                  test_data['player'][key] = []
              value_type = str(type(value))
              if value_type not in test_data['player'][key]:
                  test_data['player'][key].append(value_type)
      for _, vehicles in gamedata.at_pointer('/0/vehicles').items():
          vehicles = [vehicles] if isinstance(vehicles, dict) else vehicles
          for vehicle in vehicles:
              if isinstance(vehicle, str):
                  continue
              for key, value in vehicle.items():
                  if key not in test_data['vehicle']:
                      test_data['vehicle'][key] = []
                  value_type = str(type(value))
                  if value_type not in test_data['vehicle'][key]:
                      test_data['vehicle'][key].append(value_type)

I am analyzing a large data dump (over 100 GB), and memory leaks are preventing the process from completing successfully. The leak is somewhere on the C side of the extension, since profiling the Python part didn't show anything. I followed the first manual and ran valgrind.

Valgrind log: Gist

I can provide more information; just tell me what you need and how. ))

lemire commented Apr 9, 2023

Can you reduce the problem? If you use a single line, do you see the problem? If so, can you share the line in question? If a single line is insufficient, how many lines does it take?

hz61p1 commented Apr 10, 2023

I created a repository with a minimal reproducible example in which the same data is parsed in a loop. I continue to observe memory leaks.

lemire commented Apr 10, 2023

Running valgrind --leak-check=full --show-leak-kinds=all python3 main.py on the leaky version above, I get the following valgrind dump...

dump.txt

lemire commented Apr 10, 2023

It appears to be the iteration through the object that causes the leak...

import os

import psutil
from cysimdjson import cysimdjson
from psutil._common import bytes2human

def main(filepath: str, rounds: int = 1):
    metadata_parser, gamedata_parser = cysimdjson.JSONParser(), cysimdjson.JSONParser()
    with open(filepath, 'rb') as file:
        metadata_raw, gamedata_raw = file.readline(), file.readline()
        metadata, gamedata = metadata_parser.parse(metadata_raw), gamedata_parser.parse(gamedata_raw)
        for r in range(rounds):
            items = metadata.items()
            if (r + 1) % 1000 == 0 or r == 0:
                print(f'round={r + 1 if r > 0 else 0} mem={bytes2human(psutil.Process(os.getpid()).memory_info().rss)}')
            stats = []
            # Comment the next two lines and the leak goes away...
            for key, value in items:
                stats.append("1")
            items = None
main('6493223.json', rounds=10000)

lemire commented Apr 10, 2023

Here is a simpler reproduction...

import os
import psutil
from cysimdjson import cysimdjson
from psutil._common import bytes2human

def main(rounds: int = 1):
    data = b'{"a":1, "sfs":3}'
    parser = cysimdjson.JSONParser()
    doc = parser.parse(data)
    for r in range(rounds):
        if (r + 1) % 1000 == 0 or r == 0:
            print(f'round={r + 1 if r > 0 else 0} mem={bytes2human(psutil.Process(os.getpid()).memory_info().rss)}')
        stats = []
        items = doc.items()
        for key, value in items:
            stats.append("1")
main(rounds=100000)

lemire commented Apr 10, 2023

There is seemingly no leak with arrays...

import os
import psutil
from cysimdjson import cysimdjson
from psutil._common import bytes2human

def main(rounds: int = 1):
    data = b'[1,2,3]'
    parser = cysimdjson.JSONParser()
    doc = parser.parse(data)
    for r in range(rounds):
        if (r + 1) % 1000 == 0 or r == 0:
            print(f'round={r + 1 if r > 0 else 0} mem={bytes2human(psutil.Process(os.getpid()).memory_info().rss)}')
        stats = []
        for value in doc:
            stats.append("1")
main(rounds=100000)

lemire commented Apr 10, 2023

The keys are leaking... or, at least, we are leaking when going through the keys...

def main(rounds: int = 1):
    data = b'{"":1, "ffdfd":3}'
    parser = cysimdjson.JSONParser()
    doc = parser.parse(data)
    for r in range(rounds):
        if (r + 1) % 1000 == 0 or r == 0:
            print(f'round={r + 1 if r > 0 else 0} mem={bytes2human(psutil.Process(os.getpid()).memory_info().rss)}')
        stats = []
        items = doc.keys()
        for z in items:
            stats.append("1")
main(rounds=100000)

lemire commented Apr 10, 2023

I don't think that there is a leak in simdjson itself; if you run the following program, it comes out clean under valgrind...

#include "simdjson.cpp"
#include "simdjson.h"
#include <iostream>
using namespace simdjson;
int main(int argc, char *argv[]) {

  padded_string json = R"(  { "foo": 1, "bar": 2 }  )"_padded;
  dom::parser parser;
  dom::object object; // invalid until the get() succeeds
  auto error = parser.parse(json).get(object);
  if (error) {
    return -1;
  }
  volatile size_t counter = 0;
  for (size_t times = 0; times < 100000; times++) {
    for (auto [key, value] : object) {
      counter += key.size();
    }
    if((counter %100) == 0) { std::cout << counter << std::endl; }
  }
  std::cout << counter << std::endl;
}

lemire commented Apr 10, 2023

I have built the code with refnanny support and I do not see the leak.

lemire commented Apr 10, 2023

I am giving up for now, but there is a leak, and it is easily reproducible (see my code). All you need is an object with more than one key, and you iterate over the keys (e.g., by calling keys()).

Importantly, you must iterate through it (merely calling keys() is not enough).
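For hunting leaks like this without psutil or valgrind, a standard-library-only pattern can help confirm whether a loop retains memory across rounds. This is a generic sketch, not the cysimdjson repro itself; the `leaked` list here merely simulates references that are never released:

```python
import tracemalloc

leaked = []  # stands in for references that are incremented but never released

def grown_bytes(rounds: int, leaky: bool) -> int:
    """Return how much traced memory the loop retains after `rounds` iterations."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    for i in range(rounds):
        key = str(i) * 8  # a fresh string each round, like a decoded JSON key
        if leaky:
            leaked.append(key)  # retained reference: memory grows every round
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before

# The leaky loop retains far more memory than the clean one.
assert grown_bytes(10_000, leaky=True) > grown_bytes(10_000, leaky=False)
```

A real check would call the suspect iteration (here, going through the parsed object's keys) inside the loop and compare the traced totals across rounds.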

lemire commented Apr 10, 2023

If the keys are made larger, the leak is larger.

lemire commented Apr 10, 2023

It leaks even if you just iterate once through the keys.

lemire commented Apr 10, 2023

Ok. So I think it is pretty clear that accessing the keys leaks memory.

:-(

TkTech commented Apr 10, 2023

The fix to this should be trivial. Try this @lemire:

Index: cysimdjson/cysimdjson.pyx
diff --git a/cysimdjson/cysimdjson.pyx b/cysimdjson/cysimdjson.pyx
--- a/cysimdjson/cysimdjson.pyx	(revision 5f544eeb2a9c1cfa0d48a5bb00a742f2a78d3beb)
+++ b/cysimdjson/cysimdjson.pyx	(date 1681156775080)
@@ -1,11 +1,10 @@
 # cython: language_level=3
 
-from libc.stdint cimport int64_t, uint64_t, uint32_t
+from libc.stdint cimport int64_t, uint64_t
 from libcpp cimport bool
 from libcpp.string cimport string
 
 from cpython.bytes cimport PyBytes_AsStringAndSize
-from cpython.ref cimport PyObject
 
 from cython.operator cimport preincrement
 from cython.operator cimport dereference
@@ -121,7 +120,7 @@
 
 	cdef object element_to_py_string(simdjson_element & value) except + simdjson_error_handler
 
-	cdef PyObject * string_view_to_python_string(string_view & sv)
+	cdef object string_view_to_python_string(string_view & sv)
 	cdef string get_active_implementation()
 
 	cdef const char * PyUnicode_AsUTF8AndSize(object, Py_ssize_t *)
@@ -186,11 +185,10 @@
 
 	def keys(JSONObject self):
 		cdef string_view sv
-
 		cdef simdjson_object.iterator it = self.Object.begin()
 		while it != self.Object.end():
 			sv = it.key()
-			yield <object> string_view_to_python_string(sv)
+			yield string_view_to_python_string(sv)
 			preincrement(it)

lemire commented Apr 10, 2023

@TkTech It works.

lemire commented Apr 10, 2023

I will issue a PR unless you get to it first.

lemire commented Apr 10, 2023

@TkTech Can you explain the fix? I realized that you needed to return an object to get reference counting working but I assumed that yield <object> would do that. Does it not?

TkTech commented Apr 10, 2023

To explain what's happening here: object and PyObject * ultimately represent the same thing, but are handled in different ways. object is reference counted; PyObject * is not. Casting to <object> is generally wrong and needs close attention.

yield <object> string_view_to_python_string(sv) looks like an immediate red flag to me. So let's expand it:

cdef object temp
v = string_view_to_python_string(sv)
# At this point, v has a ref count of 1
temp = <object> v
# At this point, v has a ref count of 2. Wait, what?
yield temp
# At this point, v has a ref count of 1.

Casting v (a PyObject *) to object created a new reference to it and incremented the ref count. When Python is done with the object returned from keys(), it decrements the count by 1 and... nothing is freed, because the count is still 1 and never reaches zero.

Ultimately, this is just because the signature of string_view_to_python_string is wrong, since it returns an owned reference, not a borrowed one.

cdef PyObject * string_view_to_python_string(string_view & sv)

tells Cython that this method returns a borrowed reference, while:

cdef object string_view_to_python_string(string_view & sv)

tells Cython that this method returns an owned reference.
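The ownership mix-up described above can be mimicked in plain Python, no Cython required: parking an extra, never-dropped reference on an object is the pure-Python analogue of the stray incref introduced by the <object> cast. The container name below is made up for illustration:

```python
import sys

stray_refs = []  # hypothetical stand-in for the reference the cast leaks

payload = "some decoded json key".upper()  # a fresh, non-interned string
baseline = sys.getrefcount(payload)        # includes getrefcount's own temporary ref

stray_refs.append(payload)                 # an "incref" with no matching "decref"
assert sys.getrefcount(payload) == baseline + 1

stray_refs.pop()                           # the fix: give the reference back
assert sys.getrefcount(payload) == baseline
```

This is why the patch changes the declared return type to object: Cython then manages the decref itself, and the count can reach zero again.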
