Segmentation fault when tokenizer.tokenize() is used repetitively #16

Closed
trister95 opened this issue Apr 20, 2023 · 7 comments

@trister95 commented Apr 20, 2023

I am trying to tokenize a bunch of txt-files and store them as folia.xml-files.

The first file works fine, but after that the kernel crashes.

A little bit more info:

  • I'm working with the latest python-ucto version (0.6.4);
  • I've tried this in both VSCode and Colab. In both cases it crashes;
  • I've tried it with Python 3.11.3 and 3.8.10. In both cases it crashes;
  • It doesn't seem to have anything to do with the input txt-files: even if the txt-files are exactly the same, it will work for the first file and crash at the second;
  • When running the code from a notebook I get this error: Canceled future for execute_request message before replies were done
    The Kernel crashed while executing code in the current cell or a previous cell;
  • When running the code from the command line I get: Segmentation fault.
import ucto

configurationfile_ucto = "tokconfig-nld-historical"
tokenizer = ucto.Tokenizer(configurationfile_ucto, foliaoutput=True)

# output_path is the directory for the folia.xml output;
# every path in the list points to an identical copy of the same txt file
for f in list_with_paths_to_exact_same_files:
    tokenizer.tokenize(f, output_path)

Am I doing something wrong, or is there a bug here?

@proycon (Owner) commented Apr 20, 2023

I can reproduce the problem: feeding a simple text file twice, as you said, makes ucto crash with a segfault (which is not something that should ever happen).

It seems there are some loose ends to tie up before tokenize() can be called successively. I now wonder whether it used to work or whether this bug was always there. What you can do as a workaround in the meantime is simply reinstantiate the tokenizer for each run:

import ucto
configurationfile_ucto = "tokconfig-nld-historical"

files = ["test.txt", "test.txt"]
for f in files:
    tokenizer = ucto.Tokenizer(configurationfile_ucto, foliaoutput=True)
    tokenizer.tokenize(f, "/tmp/")

This is a bit less performant because of the added initialization time on every iteration, but hopefully still manageable.
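To get a feel for that overhead, you can time the constructor separately from the tokenization itself. A minimal sketch, reusing the names from the snippet above:

import time
import ucto

configurationfile_ucto = "tokconfig-nld-historical"

t0 = time.perf_counter()
# this re-initialization is what the workaround repeats on every iteration
tokenizer = ucto.Tokenizer(configurationfile_ucto, foliaoutput=True)
t1 = time.perf_counter()
tokenizer.tokenize("test.txt", "/tmp/")
t2 = time.perf_counter()

print(f"init: {t1 - t0:.3f}s, tokenize: {t2 - t1:.3f}s")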

As to the crash itself, I produced the following backtrace so we (@kosloot and I?) can debug and fix it:

(gdb) bt
#0  folia::processor::generate_id (this=this@entry=0x5555556da9c0, prov=prov@entry=0x0, name="uctodata") at folia_provenance.cxx:168
#1  0x00007ffff68d762d in folia::processor::processor (this=this@entry=0x5555556da9c0, prov=0x0, parent=parent@entry=0x5555556d3df0, 
    atts_in=...) at folia_provenance.cxx:274
#2  0x00007ffff6897507 in folia::Document::add_processor (this=this@entry=0x5555556980d0, args=..., 
    parent=parent@entry=0x5555556d3df0) at folia_document.cxx:1068
#3  0x00007ffff7e1ba71 in Tokenizer::TokenizerClass::add_provenance_data (this=this@entry=0x7ffff774a020, 
    doc=doc@entry=0x5555556980d0, parent=parent@entry=0x0) at tokenize.cxx:533
#4  0x00007ffff7e1c182 in Tokenizer::TokenizerClass::add_provenance_setting (this=this@entry=0x7ffff774a020, 
    doc=doc@entry=0x5555556980d0, parent=parent@entry=0x0) at tokenize.cxx:603
#5  0x00007ffff7e1ccd8 in Tokenizer::TokenizerClass::start_document (this=this@entry=0x7ffff774a020, id="untitled")
    at tokenize.cxx:663
#6  0x00007ffff7e28dd8 in Tokenizer::TokenizerClass::tokenize (this=this@entry=0x7ffff774a020, IN=...) at tokenize.cxx:937
#7  0x00007ffff7e2933f in Tokenizer::TokenizerClass::tokenize (this=0x7ffff774a020, IN=..., OUT=...) at tokenize.cxx:1007
#8  0x00007ffff7e2d9f2 in Tokenizer::TokenizerClass::tokenize (this=this@entry=0x7ffff774a020, ifile="test.txt", ofile="/tmp/")
    at tokenize.cxx:999
#9  0x00007ffff7e7d706 in __pyx_pf_4ucto_9Tokenizer_2tokenize (__pyx_v_outputfile=<optimized out>, __pyx_v_inputfile=<optimized out>, 
    __pyx_v_self=0x7ffff774a010) at ucto_wrapper.cpp:3694
#10 __pyx_pw_4ucto_9Tokenizer_3tokenize (__pyx_v_self=0x7ffff774a010, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>)
    at ucto_wrapper.cpp:3649
#11 0x00007ffff7b57f4c in method_vectorcall_VARARGS_KEYWORDS (func=<optimized out>, args=0x7ffff7745bb0, nargsf=<optimized out>, 
    kwnames=<optimized out>) at Objects/descrobject.c:344
#12 0x00007ffff7b4676a in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7ffff7745bb0, 
    callable=0x7ffff6d7fe70, tstate=0x55555555e480) at ./Include/cpython/abstract.h:114
#13 PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7ffff7745bb0, callable=0x7ffff6d7fe70)
    at ./Include/cpython/abstract.h:123
#14 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, trace_info=0x7fffffffd3c0, 
    tstate=<optimized out>) at Python/ceval.c:5891
#15 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=0x7ffff7745a40, throwflag=<optimized out>) at Python/ceval.c:4198
#16 0x00007ffff7b44f80 in _PyEval_EvalFrame (throwflag=0, f=0x7ffff7745a40, tstate=0x55555555e480)
    at ./Include/internal/pycore_ceval.h:46
#17 _PyEval_Vector (tstate=tstate@entry=0x55555555e480, con=con@entry=0x7fffffffd4c0, locals=locals@entry=0x7ffff6d41dc0, 
    args=args@entry=0x0, argcount=argcount@entry=0, kwnames=kwnames@entry=0x0) at Python/ceval.c:5065
#18 0x00007ffff7bf39e4 in PyEval_EvalCode (co=0x7ffff6d3f470, globals=0x7ffff6d41dc0, locals=0x7ffff6d41dc0) at Python/ceval.c:1134
#19 0x00007ffff7c04383 in run_eval_code_obj (tstate=tstate@entry=0x55555555e480, co=co@entry=0x7ffff6d3f470, 
    globals=globals@entry=0x7ffff6d41dc0, locals=locals@entry=0x7ffff6d41dc0) at Python/pythonrun.c:1291
#20 0x00007ffff7bffaea in run_mod (mod=mod@entry=0x5555555de300, filename=filename@entry=0x7ffff6d2faf0, 
    globals=globals@entry=0x7ffff6d41dc0, locals=locals@entry=0x7ffff6d41dc0, flags=flags@entry=0x7fffffffd6a8, 
    arena=arena@entry=0x7ffff771fb90) at Python/pythonrun.c:1312
#21 0x00007ffff7aa223f in pyrun_file (fp=fp@entry=0x55555555a470, filename=filename@entry=0x7ffff6d2faf0, start=start@entry=257, 
    globals=globals@entry=0x7ffff6d41dc0, locals=locals@entry=0x7ffff6d41dc0, closeit=closeit@entry=1, flags=0x7fffffffd6a8)
    at Python/pythonrun.c:1208
#22 0x00007ffff7aa1ef0 in _PyRun_SimpleFileObject (fp=0x55555555a470, filename=0x7ffff6d2faf0, closeit=1, flags=0x7fffffffd6a8)
    at Python/pythonrun.c:456
#23 0x00007ffff7aa28a3 in _PyRun_AnyFileObject (fp=0x55555555a470, filename=0x7ffff6d2faf0, closeit=1, flags=0x7fffffffd6a8)
    at Python/pythonrun.c:90
#24 0x00007ffff7c10b5d in pymain_run_file_obj (skip_source_first_line=0, filename=0x7ffff6d2faf0, program_name=0x7ffff77cb140)
    at Modules/main.c:353
#25 pymain_run_file (config=0x5555555855a0) at Modules/main.c:372
#26 pymain_run_python (exitcode=0x7fffffffd6a4) at Modules/main.c:587
#27 Py_RunMain () at Modules/main.c:666
#28 0x00007ffff7be4f3b in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:720
#29 0x00007ffff783c790 in __libc_start_call_main (main=main@entry=0x555555555120 <main>, argc=argc@entry=2, 
    argv=argv@entry=0x7fffffffd8d8) at ../sysdeps/nptl/libc_start_call_main.h:58
#30 0x00007ffff783c84a in __libc_start_main_impl (main=0x555555555120 <main>, argc=2, argv=0x7fffffffd8d8, init=<optimized out>, 
    fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffd8c8) at ../csu/libc-start.c:360
#31 0x0000555555555045 in _start ()
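(For reference: a backtrace like this can be obtained by running the repro under gdb; assuming the two-file snippet above is saved as repro.py, a hypothetical filename:

gdb --args python3 repro.py
(gdb) run
(gdb) bt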

@proycon proycon self-assigned this Apr 20, 2023
@proycon proycon added the bug label Apr 20, 2023
@proycon proycon changed the title Kernel crashes when tokenizer.tokenize() is used repetitively Segmentation fault when tokenizer.tokenize() is used repetitively Apr 20, 2023
@proycon (Owner) commented Apr 20, 2023

I changed the title a bit. I know you meant "kernel" to refer to the Jupyter kernel, but people might misunderstand and think the entire Linux kernel crashed because of ucto; that'd be quite a feat ;)

@kosloot commented Apr 21, 2023

OK, this is definitely a bug in ucto itself; I can reproduce it without Python. It seems to be a problem inside the tokenize(string, string) function. Needs some investigation.

@kosloot commented Apr 21, 2023

Some data was not reset on the next invocation of tokenize(). This should be fixed now in ucto.
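For readers following along: the backtrace above points at provenance bookkeeping (note the null prov=0x0 pointer in frame #0). An illustrative sketch of this bug class in Python, not the actual ucto code:

# Illustrative only: per-document state initialized on the first call
# and reused across calls, which is the pattern kosloot describes.
class TokenizerSketch:
    def __init__(self):
        self._doc_state = None  # provenance/bookkeeping for one document

    def tokenize(self, infile, outdir):
        if self._doc_state is None:  # set up on the first call only...
            self._doc_state = {"processors": []}
        # ...so a second call reuses state tied to the first document,
        # which in C++ can mean following a dangling or null pointer.
        self._process(infile, outdir)

    def _process(self, infile, outdir):
        pass  # stand-in for the actual tokenization

# The fix pattern: reset the per-document state at the start of every
# tokenize() call instead of letting it leak across invocations.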

@proycon (Owner) commented Apr 21, 2023

Nice work! Are we ready for new releases? I guess such a crash warrants a new release quickly.

@trister95 (Author)

Thanks a lot for the quick replies! Great work! :)

@proycon (Owner) commented Apr 22, 2023

ucto v0.29 and python-ucto v0.6.5 are now released, solving this issue.
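For anyone who runs into this: upgrading the Python binding should pull in the fix, assuming it was installed from PyPI (a from-source build of the underlying ucto library needs to be updated separately):

pip install --upgrade python-ucto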

@proycon proycon closed this as completed Apr 22, 2023