
Memory leak #41

Closed
hz61p1 opened this issue Apr 9, 2023 · 18 comments · Fixed by #42

hz61p1 commented Apr 9, 2023

I am observing a memory leak.

Part of the code:
import cysimdjson
import lz4.frame

# test_data is a dict of the shape {'common': {}, 'player': {}, 'vehicle': {}}, defined elsewhere
metadata_parser, gamedata_parser = cysimdjson.JSONParser(), cysimdjson.JSONParser()
with lz4.frame.open(filepath) as file:
  for line in file:
      idx, metadata, gamedata = line.rstrip(b'\n').split(chr(31).encode())
      metadata, gamedata = metadata_parser.parse(metadata), gamedata_parser.parse(gamedata)
      for key, value in gamedata.at_pointer('/0/common').items():
          if key not in test_data['common']:
              test_data['common'][key] = []
          value_type = str(type(value))
          if value_type not in test_data['common'][key]:
              test_data['common'][key].append(value_type)
      for _, player in gamedata.at_pointer('/1').items():
          for key, value in player.items():
              if key not in test_data['player']:
                  test_data['player'][key] = []
              value_type = str(type(value))
              if value_type not in test_data['player'][key]:
                  test_data['player'][key].append(value_type)
      for _, vehicles in gamedata.at_pointer('/0/vehicles').items():
          vehicles = [vehicles] if isinstance(vehicles, dict) else vehicles
          for vehicle in vehicles:
              if isinstance(vehicle, str):
                  continue
              for key, value in vehicle.items():
                  if key not in test_data['vehicle']:
                      test_data['vehicle'][key] = []
                  value_type = str(type(value))
                  if value_type not in test_data['vehicle'][key]:
                      test_data['vehicle'][key].append(value_type)

I am analyzing a large data dump (over 100 GB), and memory leaks are preventing the process from completing successfully. The leak is somewhere on the C side of the extension, since profiling the Python part didn't show anything. I followed the first manual and ran valgrind.

Valgrind log: Gist

I can provide more information; just tell me what you need and how. ))

lemire commented Apr 9, 2023

Can you reduce the problem? If you use a single line, do you see the problem? If so, can you share the line in question? If a single line is insufficient, how many lines does it take?

hz61p1 commented Apr 10, 2023

I created a repository with a minimal reproducible example in which the same data is parsed in a loop. I continue to observe memory leaks.

lemire commented Apr 10, 2023

Running valgrind --leak-check=full --show-leak-kinds=all python3 main.py on the leaky version above, I get the following valgrind dump...

dump.txt

lemire commented Apr 10, 2023

It appears to be the iteration through the object that causes the leak...

import os

import psutil
from cysimdjson import cysimdjson
from psutil._common import bytes2human

def main(filepath: str, rounds: int = 1):
    metadata_parser, gamedata_parser = cysimdjson.JSONParser(), cysimdjson.JSONParser()
    with open(filepath, 'rb') as file:
        metadata_raw, gamedata_raw = file.readline(), file.readline()
        metadata, gamedata = metadata_parser.parse(metadata_raw), gamedata_parser.parse(gamedata_raw)
        for r in range(rounds):
            items = metadata.items()
            if (r + 1) % 1000 == 0 or r == 0:
                print(f'round={r + 1 if r > 0 else 0} mem={bytes2human(psutil.Process(os.getpid()).memory_info().rss)}')
            stats = []
            # Comment the next two lines and the leak goes away...
            for key, value in items:
                stats.append("1")
            items = None
main('6493223.json', rounds=10000)

lemire commented Apr 10, 2023

Here is a simpler reproduction...

import os
import psutil
from cysimdjson import cysimdjson
from psutil._common import bytes2human

def main(rounds: int = 1):
    data = b'{"a":1, "sfs":3}'
    parser = cysimdjson.JSONParser()
    doc = parser.parse(data)
    for r in range(rounds):
        if (r + 1) % 1000 == 0 or r == 0:
            print(f'round={r + 1 if r > 0 else 0} mem={bytes2human(psutil.Process(os.getpid()).memory_info().rss)}')
        stats = []
        items = doc.items()
        for key, value in items:
            stats.append("1")
main(rounds=100000)

lemire commented Apr 10, 2023

There is seemingly no leak with arrays...

import os
import psutil
from cysimdjson import cysimdjson
from psutil._common import bytes2human

def main(rounds: int = 1):
    data = b'[1,2,3]'
    parser = cysimdjson.JSONParser()
    doc = parser.parse(data)
    for r in range(rounds):
        if (r + 1) % 1000 == 0 or r == 0:
            print(f'round={r + 1 if r > 0 else 0} mem={bytes2human(psutil.Process(os.getpid()).memory_info().rss)}')
        stats = []
        for value in doc:
            stats.append("1")
main(rounds=100000)

lemire commented Apr 10, 2023

The keys are leaking... or, at least, we are leaking when going through the keys...

def main(rounds: int = 1):
    data = b'{"":1, "ffdfd":3}'
    parser = cysimdjson.JSONParser()
    doc = parser.parse(data)
    for r in range(rounds):
        if (r + 1) % 1000 == 0 or r == 0:
            print(f'round={r + 1 if r > 0 else 0} mem={bytes2human(psutil.Process(os.getpid()).memory_info().rss)}')
        stats = []
        items = doc.keys()
        for z in items:
            stats.append("1")
main(rounds=100000)

lemire commented Apr 10, 2023

I don't think that there is a leak in simdjson itself; if you run the following program, it comes out clean under valgrind...

#include "simdjson.cpp"
#include "simdjson.h"
#include <iostream>
using namespace simdjson;
int main(int argc, char *argv[]) {

  padded_string json = R"(  { "foo": 1, "bar": 2 }  )"_padded;
  dom::parser parser;
  dom::object object; // invalid until the get() succeeds
  auto error = parser.parse(json).get(object);
  if (error) {
    return -1;
  }
  volatile size_t counter = 0;
  for (size_t times = 0; times < 100000; times++) {
    for (auto [key, value] : object) {
      counter += key.size();
    }
    if((counter %100) == 0) { std::cout << counter << std::endl; }
  }
  std::cout << counter << std::endl;
}

lemire commented Apr 10, 2023

I have built the code with refnanny support and I do not see the leak.

lemire commented Apr 10, 2023

I am giving up for now, but there is a leak, and it is easily reproducible (see my code). All you need is an object with more than one key, and you iterate over the keys (e.g., by calling keys()).

Importantly, you must iterate through it (merely calling keys() is not enough).
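For hunting leaks like this without psutil or valgrind, a standard-library-only pattern can help confirm whether a loop retains memory across rounds. This is a generic sketch, not the cysimdjson repro itself; the `leaked` list here merely simulates references that are never released:

```python
import tracemalloc

leaked = []  # stands in for references that are incremented but never released

def grown_bytes(rounds: int, leaky: bool) -> int:
    """Return how much traced memory the loop retains after `rounds` iterations."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    for i in range(rounds):
        key = str(i) * 8  # a fresh string each round, like a decoded JSON key
        if leaky:
            leaked.append(key)  # retained reference: memory grows every round
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before

# The leaky loop retains far more memory than the clean one.
assert grown_bytes(10_000, leaky=True) > grown_bytes(10_000, leaky=False)
```

A real check would call the suspect iteration (here, going through the parsed object's keys) inside the loop and compare the traced totals across rounds.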

lemire commented Apr 10, 2023

If the keys are made larger, the leak is larger.

lemire commented Apr 10, 2023

It leaks even if you just iterate once through the keys.

lemire commented Apr 10, 2023

Ok. So I think it is pretty clear that accessing the keys leaks memory.

:-(

TkTech commented Apr 10, 2023

The fix to this should be trivial. Try this @lemire:

Index: cysimdjson/cysimdjson.pyx
diff --git a/cysimdjson/cysimdjson.pyx b/cysimdjson/cysimdjson.pyx
--- a/cysimdjson/cysimdjson.pyx	(revision 5f544eeb2a9c1cfa0d48a5bb00a742f2a78d3beb)
+++ b/cysimdjson/cysimdjson.pyx	(date 1681156775080)
@@ -1,11 +1,10 @@
 # cython: language_level=3
 
-from libc.stdint cimport int64_t, uint64_t, uint32_t
+from libc.stdint cimport int64_t, uint64_t
 from libcpp cimport bool
 from libcpp.string cimport string
 
 from cpython.bytes cimport PyBytes_AsStringAndSize
-from cpython.ref cimport PyObject
 
 from cython.operator cimport preincrement
 from cython.operator cimport dereference
@@ -121,7 +120,7 @@
 
 	cdef object element_to_py_string(simdjson_element & value) except + simdjson_error_handler
 
-	cdef PyObject * string_view_to_python_string(string_view & sv)
+	cdef object string_view_to_python_string(string_view & sv)
 	cdef string get_active_implementation()
 
 	cdef const char * PyUnicode_AsUTF8AndSize(object, Py_ssize_t *)
@@ -186,11 +185,10 @@
 
 	def keys(JSONObject self):
 		cdef string_view sv
-
 		cdef simdjson_object.iterator it = self.Object.begin()
 		while it != self.Object.end():
 			sv = it.key()
-			yield <object> string_view_to_python_string(sv)
+			yield string_view_to_python_string(sv)
 			preincrement(it)

lemire commented Apr 10, 2023

@TkTech It works.

lemire commented Apr 10, 2023

I will issue a PR unless you get to it first.

lemire commented Apr 10, 2023

@TkTech Can you explain the fix? I realized that you needed to return an object to get reference counting working but I assumed that yield <object> would do that. Does it not?

TkTech commented Apr 10, 2023

To explain what's happening here: object and PyObject * ultimately represent the same thing, but are handled in different ways. object is reference counted; PyObject * is not. Casting to <object> is generally wrong and needs close attention.

yield <object> string_view_to_python_string(sv) looks like an immediate red flag to me. So let's expand it:

cdef object temp
v = string_view_to_python_string(sv)
# At this point, v has a ref count of 1
temp = <object> v
# At this point, v has a ref count of 2. Wait, what?
yield temp
# At this point, v has a ref count of 1.

Casting v (a PyObject *) to object created a new reference to it and incremented the ref count. When Python is done with the object returned from keys(), it decrements the count by 1 and... nothing is freed, because the count is still 1 and never reaches zero.

Ultimately, this is just because the signature of string_view_to_python_string is wrong, since it returns an owned reference, not a borrowed one.

cdef PyObject * string_view_to_python_string(string_view & sv)

tells Cython that this method returns a borrowed reference, while:

cdef object string_view_to_python_string(string_view & sv)

tells Cython that this method returns an owned reference.
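The ownership mix-up described above can be mimicked in plain Python, no Cython required: parking an extra, never-dropped reference on an object is the pure-Python analogue of the stray incref introduced by the <object> cast. The container name below is made up for illustration:

```python
import sys

stray_refs = []  # hypothetical stand-in for the reference the cast leaks

payload = "some decoded json key".upper()  # a fresh, non-interned string
baseline = sys.getrefcount(payload)        # includes getrefcount's own temporary ref

stray_refs.append(payload)                 # an "incref" with no matching "decref"
assert sys.getrefcount(payload) == baseline + 1

stray_refs.pop()                           # the fix: give the reference back
assert sys.getrefcount(payload) == baseline
```

This is why the patch changes the declared return type to object: Cython then manages the decref itself, and the count can reach zero again.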
