Segmentation fault when tokenizer.tokenize() is used repetitively #16

Closed
trister95 opened this issue Apr 20, 2023 · 7 comments

@trister95 commented Apr 20, 2023

I am trying to tokenize a bunch of txt-files and store them as folia.xml-files.

The first file works fine, but after that the kernel crashes.

A little bit more info:

  • I'm working with the latest python-ucto version (0.6.4);
  • I've tried this in both VSCode and Colab. In both cases it crashes;
  • I've tried it with Python 3.11.3 and 3.8.10. In both cases it crashes;
  • It doesn't seem to have anything to do with the input txt-files: even if the txt-files are exactly the same, it will work for the first file and crash at the second;
  • When running the code from a notebook I get this error: Canceled future for execute_request message before replies were done
    The Kernel crashed while executing code in the current cell or a previous cell;
  • When running the code from the command line I get: Segmentation fault.
import ucto

configurationfile_ucto = "tokconfig-nld-historical"
tokenizer = ucto.Tokenizer(configurationfile_ucto, foliaoutput=True)

# output_path is the directory for the folia.xml output;
# every path in the list points to an identical copy of the same txt file
for f in list_with_paths_to_exact_same_files:
    tokenizer.tokenize(f, output_path)

Am I doing something wrong, or is there a bug here?

@proycon (Owner) commented Apr 20, 2023

I can reproduce the problem: feeding a simple text file twice, as you said, makes ucto crash with a segfault (which is not something that should ever happen).

It seems there are some loose ends to tie up before tokenize() can be called successively. I now wonder whether it used to work or whether this bug was always there. What you can do as a workaround in the meantime is simply reinstantiate the tokenizer for each run:

import ucto
configurationfile_ucto = "tokconfig-nld-historical"

files = ["test.txt", "test.txt"]
for f in files:
    tokenizer = ucto.Tokenizer(configurationfile_ucto, foliaoutput=True)
    tokenizer.tokenize(f, "/tmp/")

This is a bit less performant because of the added initialization time on every iteration, but hopefully still manageable.
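To get a feel for that overhead, you can time the constructor separately from the tokenization itself. A minimal sketch, reusing the names from the snippet above:

import time
import ucto

configurationfile_ucto = "tokconfig-nld-historical"

t0 = time.perf_counter()
# this re-initialization is what the workaround repeats on every iteration
tokenizer = ucto.Tokenizer(configurationfile_ucto, foliaoutput=True)
t1 = time.perf_counter()
tokenizer.tokenize("test.txt", "/tmp/")
t2 = time.perf_counter()

print(f"init: {t1 - t0:.3f}s, tokenize: {t2 - t1:.3f}s")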

As to the crash itself, I produced the following backtrace so we (@kosloot and I?) can debug and fix it:

(gdb) bt
#0  folia::processor::generate_id (this=this@entry=0x5555556da9c0, prov=prov@entry=0x0, name="uctodata") at folia_provenance.cxx:168
#1  0x00007ffff68d762d in folia::processor::processor (this=this@entry=0x5555556da9c0, prov=0x0, parent=parent@entry=0x5555556d3df0, 
    atts_in=...) at folia_provenance.cxx:274
#2  0x00007ffff6897507 in folia::Document::add_processor (this=this@entry=0x5555556980d0, args=..., 
    parent=parent@entry=0x5555556d3df0) at folia_document.cxx:1068
#3  0x00007ffff7e1ba71 in Tokenizer::TokenizerClass::add_provenance_data (this=this@entry=0x7ffff774a020, 
    doc=doc@entry=0x5555556980d0, parent=parent@entry=0x0) at tokenize.cxx:533
#4  0x00007ffff7e1c182 in Tokenizer::TokenizerClass::add_provenance_setting (this=this@entry=0x7ffff774a020, 
    doc=doc@entry=0x5555556980d0, parent=parent@entry=0x0) at tokenize.cxx:603
#5  0x00007ffff7e1ccd8 in Tokenizer::TokenizerClass::start_document (this=this@entry=0x7ffff774a020, id="untitled")
    at tokenize.cxx:663
#6  0x00007ffff7e28dd8 in Tokenizer::TokenizerClass::tokenize (this=this@entry=0x7ffff774a020, IN=...) at tokenize.cxx:937
#7  0x00007ffff7e2933f in Tokenizer::TokenizerClass::tokenize (this=0x7ffff774a020, IN=..., OUT=...) at tokenize.cxx:1007
#8  0x00007ffff7e2d9f2 in Tokenizer::TokenizerClass::tokenize (this=this@entry=0x7ffff774a020, ifile="test.txt", ofile="/tmp/")
    at tokenize.cxx:999
#9  0x00007ffff7e7d706 in __pyx_pf_4ucto_9Tokenizer_2tokenize (__pyx_v_outputfile=<optimized out>, __pyx_v_inputfile=<optimized out>, 
    __pyx_v_self=0x7ffff774a010) at ucto_wrapper.cpp:3694
#10 __pyx_pw_4ucto_9Tokenizer_3tokenize (__pyx_v_self=0x7ffff774a010, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>)
    at ucto_wrapper.cpp:3649
#11 0x00007ffff7b57f4c in method_vectorcall_VARARGS_KEYWORDS (func=<optimized out>, args=0x7ffff7745bb0, nargsf=<optimized out>, 
    kwnames=<optimized out>) at Objects/descrobject.c:344
#12 0x00007ffff7b4676a in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7ffff7745bb0, 
    callable=0x7ffff6d7fe70, tstate=0x55555555e480) at ./Include/cpython/abstract.h:114
#13 PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7ffff7745bb0, callable=0x7ffff6d7fe70)
    at ./Include/cpython/abstract.h:123
#14 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, trace_info=0x7fffffffd3c0, 
    tstate=<optimized out>) at Python/ceval.c:5891
#15 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=0x7ffff7745a40, throwflag=<optimized out>) at Python/ceval.c:4198
#16 0x00007ffff7b44f80 in _PyEval_EvalFrame (throwflag=0, f=0x7ffff7745a40, tstate=0x55555555e480)
    at ./Include/internal/pycore_ceval.h:46
#17 _PyEval_Vector (tstate=tstate@entry=0x55555555e480, con=con@entry=0x7fffffffd4c0, locals=locals@entry=0x7ffff6d41dc0, 
    args=args@entry=0x0, argcount=argcount@entry=0, kwnames=kwnames@entry=0x0) at Python/ceval.c:5065
#18 0x00007ffff7bf39e4 in PyEval_EvalCode (co=0x7ffff6d3f470, globals=0x7ffff6d41dc0, locals=0x7ffff6d41dc0) at Python/ceval.c:1134
#19 0x00007ffff7c04383 in run_eval_code_obj (tstate=tstate@entry=0x55555555e480, co=co@entry=0x7ffff6d3f470, 
    globals=globals@entry=0x7ffff6d41dc0, locals=locals@entry=0x7ffff6d41dc0) at Python/pythonrun.c:1291
#20 0x00007ffff7bffaea in run_mod (mod=mod@entry=0x5555555de300, filename=filename@entry=0x7ffff6d2faf0, 
    globals=globals@entry=0x7ffff6d41dc0, locals=locals@entry=0x7ffff6d41dc0, flags=flags@entry=0x7fffffffd6a8, 
    arena=arena@entry=0x7ffff771fb90) at Python/pythonrun.c:1312
#21 0x00007ffff7aa223f in pyrun_file (fp=fp@entry=0x55555555a470, filename=filename@entry=0x7ffff6d2faf0, start=start@entry=257, 
    globals=globals@entry=0x7ffff6d41dc0, locals=locals@entry=0x7ffff6d41dc0, closeit=closeit@entry=1, flags=0x7fffffffd6a8)
    at Python/pythonrun.c:1208
#22 0x00007ffff7aa1ef0 in _PyRun_SimpleFileObject (fp=0x55555555a470, filename=0x7ffff6d2faf0, closeit=1, flags=0x7fffffffd6a8)
    at Python/pythonrun.c:456
#23 0x00007ffff7aa28a3 in _PyRun_AnyFileObject (fp=0x55555555a470, filename=0x7ffff6d2faf0, closeit=1, flags=0x7fffffffd6a8)
    at Python/pythonrun.c:90
#24 0x00007ffff7c10b5d in pymain_run_file_obj (skip_source_first_line=0, filename=0x7ffff6d2faf0, program_name=0x7ffff77cb140)
    at Modules/main.c:353
#25 pymain_run_file (config=0x5555555855a0) at Modules/main.c:372
#26 pymain_run_python (exitcode=0x7fffffffd6a4) at Modules/main.c:587
#27 Py_RunMain () at Modules/main.c:666
#28 0x00007ffff7be4f3b in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:720
#29 0x00007ffff783c790 in __libc_start_call_main (main=main@entry=0x555555555120 <main>, argc=argc@entry=2, 
    argv=argv@entry=0x7fffffffd8d8) at ../sysdeps/nptl/libc_start_call_main.h:58
#30 0x00007ffff783c84a in __libc_start_main_impl (main=0x555555555120 <main>, argc=2, argv=0x7fffffffd8d8, init=<optimized out>, 
    fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffd8c8) at ../csu/libc-start.c:360
#31 0x0000555555555045 in _start ()
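(For reference: a backtrace like this can be obtained by running the repro under gdb; assuming the two-file snippet above is saved as repro.py, a hypothetical filename:

gdb --args python3 repro.py
(gdb) run
(gdb) bt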

@proycon proycon self-assigned this Apr 20, 2023
@proycon proycon added the bug label Apr 20, 2023
@proycon proycon changed the title Kernel crashes when tokenizer.tokenize() is used repetitively Segmentation fault when tokenizer.tokenize() is used repetitively Apr 20, 2023
@proycon (Owner) commented Apr 20, 2023

I changed the title a bit. I know you meant "kernel" to refer to the Jupyter kernel, but people might misunderstand and think the entire Linux kernel crashed because of ucto; that'd be quite a feat ;)

@kosloot commented Apr 21, 2023

OK, this is definitely a bug in ucto itself; I can reproduce it without Python. It seems to be a problem inside the tokenize(string, string) function. Needs some investigation.

@kosloot commented Apr 21, 2023

Some data was not reset on the next invocation of tokenize(). This should be fixed now in ucto.
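For readers following along: the backtrace above points at provenance bookkeeping (note the null prov=0x0 pointer in frame #0). An illustrative sketch of this bug class in Python, not the actual ucto code:

# Illustrative only: per-document state initialized on the first call
# and reused across calls, which is the pattern kosloot describes.
class TokenizerSketch:
    def __init__(self):
        self._doc_state = None  # provenance/bookkeeping for one document

    def tokenize(self, infile, outdir):
        if self._doc_state is None:  # set up on the first call only...
            self._doc_state = {"processors": []}
        # ...so a second call reuses state tied to the first document,
        # which in C++ can mean following a dangling or null pointer.
        self._process(infile, outdir)

    def _process(self, infile, outdir):
        pass  # stand-in for the actual tokenization

# The fix pattern: reset the per-document state at the start of every
# tokenize() call instead of letting it leak across invocations.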

@proycon (Owner) commented Apr 21, 2023

Nice work! Are we ready for new releases? I guess such a crash warrants a new release quickly.

@trister95 (Author)

Thanks a lot for the quick replies! Great work! :)

@proycon (Owner) commented Apr 22, 2023

ucto v0.29 and python-ucto v0.6.5 are now released, solving this issue.
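For anyone who runs into this: upgrading the Python binding should pull in the fix, assuming it was installed from PyPI (a from-source build of the underlying ucto library needs to be updated separately):

pip install --upgrade python-ucto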

@proycon proycon closed this as completed Apr 22, 2023