Memory leak #41
Comments
Can you reduce the problem? If you use a single line, do you see the problem? If so, can you share the line in question? If a single line is insufficient, how many lines does it take? |
I created a repository with a minimal reproducible example in which the same data is parsed in a loop. I continue to observe memory leaks. |
Running |
It appears to be the iteration through the object that causes the leak...

import os

import psutil as psutil
from cysimdjson import cysimdjson
from psutil._common import bytes2human

def main(filepath: str, rounds: int = 1):
    metadata_parser, gamedata_parser = cysimdjson.JSONParser(), cysimdjson.JSONParser()
    with open(filepath, 'rb') as file:
        metadata_raw, gamedata_raw = file.readline(), file.readline()
        metadata, gamedata = metadata_parser.parse(metadata_raw), gamedata_parser.parse(gamedata_raw)
    for r in range(rounds):
        items = metadata.items()
        if (r + 1) % 1000 == 0 or r == 0:
            print(f'round={r + 1 if r > 0 else 0} mem={bytes2human(psutil.Process(os.getpid()).memory_info().rss)}')
        stats = []
        # Comment the next two lines and the leak goes away...
        for key, value in items:
            stats.append("1")
        items = None

main('6493223.json', rounds=10000)
|
Here is a simpler reproduction...

import os

import psutil as psutil
from cysimdjson import cysimdjson
from psutil._common import bytes2human

def main(rounds: int = 1):
    data = b'{"a":1, "sfs":3}'
    parser = cysimdjson.JSONParser()
    doc = parser.parse(data)
    for r in range(rounds):
        if (r + 1) % 1000 == 0 or r == 0:
            print(f'round={r + 1 if r > 0 else 0} mem={bytes2human(psutil.Process(os.getpid()).memory_info().rss)}')
        stats = []
        items = doc.items()
        for key, value in items:
            stats.append("1")

main(rounds=100000)
|
There is seemingly no leak with arrays...

import os

import psutil as psutil
from cysimdjson import cysimdjson
from psutil._common import bytes2human

def main(rounds: int = 1):
    data = b'[1,2,3]'
    parser = cysimdjson.JSONParser()
    doc = parser.parse(data)
    for r in range(rounds):
        if (r + 1) % 1000 == 0 or r == 0:
            print(f'round={r + 1 if r > 0 else 0} mem={bytes2human(psutil.Process(os.getpid()).memory_info().rss)}')
        stats = []
        for value in doc:
            stats.append("1")

main(rounds=100000)
|
The keys are leaking... or, at least, we are leaking when going through the keys...
|
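As a cross-check of that conclusion, here is a minimal sketch I am adding (not code from the thread; count_blocks is just a throwaway name): if the leaked memory is Python string objects that are never released, the interpreter's own allocation count from sys.getallocatedblocks() should climb in step with RSS while iterating the object's items, even with an empty loop body, since the iteration itself is what materializes the key strings.

import sys

from cysimdjson import cysimdjson

def count_blocks(rounds: int = 100000) -> None:
    parser = cysimdjson.JSONParser()
    doc = parser.parse(b'{"a":1, "sfs":3}')
    for r in range(rounds):
        for key, value in doc.items():
            pass  # the loop body does nothing; producing the key objects is enough
        if (r + 1) % 10000 == 0:
            # Number of memory blocks currently held by CPython's allocator;
            # a steady climb here means Python objects are allocated and never freed.
            print(f'round={r + 1} blocks={sys.getallocatedblocks()}')

count_blocks()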
I don't think that there is a leak in simdjson; if you run the following program, it comes out clean in valgrind...

#include "simdjson.cpp"
#include "simdjson.h"
#include <iostream>

using namespace simdjson;

int main(int argc, char *argv[]) {
  padded_string json = R"( { "foo": 1, "bar": 2 } )"_padded;
  dom::parser parser;
  dom::object object; // invalid until the get() succeeds
  auto error = parser.parse(json).get(object);
  if (error) {
    return -1;
  }
  volatile size_t counter = 0;
  for (size_t times = 0; times < 100000; times++) {
    for (auto [key, value] : object) {
      counter += key.size();
    }
    if ((counter % 100) == 0) { std::cout << counter << std::endl; }
  }
  std::cout << counter << std::endl;
}
|
I have built the code with |
I am giving up for now, but there is a leak, and it is easily reproducible (see my code). All you need is an object with more than one key, and to iterate over the keys. Importantly, you must actually iterate through them (merely calling the accessor without iterating is not enough). |
If the keys are made larger, the leak is larger. |
It leaks even if you just iterate once through the keys. |
Ok. So I think it is pretty clear that accessing the keys leaks memory. :-( |
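To see the "larger keys, larger leak" observation above concretely, here is a variation of the earlier reproduction (my own sketch, not code from the thread): the two keys are blown up to roughly 4 KiB each, so the per-round RSS growth scales accordingly.

import os

import psutil as psutil
from cysimdjson import cysimdjson
from psutil._common import bytes2human

def main(rounds: int = 100000) -> None:
    # Same shape as before, but with ~4 KiB keys instead of "a" and "sfs".
    data = ('{"%s": 1, "%s": 2}' % ('k' * 4096, 'v' * 4096)).encode()
    parser = cysimdjson.JSONParser()
    doc = parser.parse(data)
    for r in range(rounds):
        for key, value in doc.items():
            pass
        if (r + 1) % 1000 == 0:
            print(f'round={r + 1} mem={bytes2human(psutil.Process(os.getpid()).memory_info().rss)}')

main()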
The fix to this should be trivial. Try this @lemire:
|
@TkTech It works. |
I will issue a PR unless you get to it first. |
@TkTech Can you explain the fix? I realized that you needed to return an object. |
To explain what's happening here, the generated iteration code effectively does this:

object temp;
v = string_view_to_python_string(sv)
# At this point, v has a ref count of 1
temp = <object> v
# At this point, v has a ref count of 2. Wait, what?
yield temp
# At this point, v has a ref count of 1, and that last reference is never released.

Casting v (a raw PyObject pointer) to object makes Cython INCREF it, because Cython assumes it does not own the pointer. Ultimately, this is just because a signature that returns a raw PyObject pointer is telling Cython that this method will return a borrowed reference, while a signature that returns object is telling Cython that this method will return an "owned" reference. |
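For anyone who wants to see the underlying mechanism without touching Cython, here is a tiny pure-Python illustration (my own sketch, not from the thread): a single unmatched Py_IncRef keeps an object alive forever, which is exactly what the extra INCREF from the cast was doing in the generated code.

import ctypes
import sys

ctypes.pythonapi.Py_IncRef.argtypes = [ctypes.py_object]
ctypes.pythonapi.Py_IncRef.restype = None

obj = bytearray(b"pretend this is a parsed JSON key")
print(sys.getrefcount(obj))  # 2: the 'obj' name plus getrefcount's own argument

# Simulate the stray INCREF added by casting a PyObject * to object.
ctypes.pythonapi.Py_IncRef(obj)
print(sys.getrefcount(obj))  # 3: one of these references now has no owner

del obj  # the bytearray is unreachable from Python but will never be freed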
I am observing a memory leak
Part of the code
I am analyzing a large data dump, over 100 GB, and memory leaks are preventing the process from completing successfully. The leak is somewhere on the C side of the extension, since profiling the Python part didn't show anything. I followed the first guide I found and ran valgrind:
valgrind log
Gist
I can provide more information; just tell me what and how ))