pywin32<300 causes NULL pointer dereference during referrer graph generation #25

Closed
cool-RR opened this issue Jan 16, 2021 · 32 comments

@cool-RR

cool-RR commented Jan 16, 2021

guppy3==3.1.0

Using Windows 7. I ran this:

import guppy
h = guppy.hpy()
he = h.heap()

It waited for a few seconds and then the process crashed. Here are the error details from the dialog:

Problem signature:
  Problem Event Name:	APPCRASH
  Application Name:	python.exe
  Application Version:	3.8.1150.1013
  Application Timestamp:	5dfab277
  Fault Module Name:	python38.dll
  Fault Module Version:	3.8.1150.1013
  Fault Module Timestamp:	5dfab24b
  Exception Code:	c0000005
  Exception Offset:	000000000002feaf
  OS Version:	6.1.7601.2.1.0.256.1
  Locale ID:	1033
  Additional Information 1:	e8ad
  Additional Information 2:	e8adaaf2e9f8209565e4a118f3c1cb38
  Additional Information 3:	359a
  Additional Information 4:	359a43a120d45186052c5a12b0526d2d
@cool-RR
Author

cool-RR commented Jan 16, 2021

Update: This only happens after keras has been imported.

@zhuyifei1999
Owner

I just tried, Python 3.8 on Linux:

$ cat issue25.py 
import keras

import guppy
h = guppy.hpy()
he = h.heap()

print(he)
$ python issue25.py 
2021-01-16 22:06:26.697870: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-01-16 22:06:26.697921: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Partition of a set of 592677 objects. Total size = 78785975 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0 150998  25 23972538  30  23972538  30 str
     1 136448  23 10686424  14  34658962  44 tuple
     2  41460   7  7348668   9  42007630  53 types.CodeType
     3  82657  14  7000740   9  49008370  62 bytes
     4  41609   7  5658824   7  54667194  69 function
     5   4039   1  3950792   5  58617986  74 type
     6   2471   0  3917936   5  62535922  79 dict of module
     7   7476   1  3280376   4  65816298  84 dict (no owner)
     8   4039   1  2036536   3  67852834  86 dict of type
     9   8052   1  1716128   2  69568962  88 dict of function
<762 more rows. Type e.g. '_.more' to view.>

Looks like importing keras plus the given guppy code doesn't cause a crash by itself.

Is it possible for you to get a core dump or a stack trace of the crash? In the meantime, I'll try to find a Windows 7 install that I can test on. I'm not very familiar with Windows debuggers though.

@zhuyifei1999
Owner

I'm not a Windows person. Would you be willing to share some steps on how to set up a test environment that is able to reproduce this crash?

@cool-RR
Author

cool-RR commented Jan 17, 2021

Thanks for the considerable effort. You can find a binary of h5py for Windows, and many other packages, here: https://www.lfd.uci.edu/~gohlke/pythonlibs/

Then run the following:

import keras
import guppy
h = guppy.hpy()
he = h.heap()

@zhuyifei1999
Owner

Ok, the wheel published there (h5py‑2.10.0‑cp38‑cp38‑win32.whl) does allow me to install h5py, and I was able to successfully install keras. However, upon importing keras it says "Keras requires TensorFlow 2.2 or higher" (also happened on Linux), and I went to check the page. Only amd64 wheels are available, both from that page and PyPI.

I googled around a bit and it seems that TensorFlow for x86 32-bit is probably very complicated and completely unsupported (https://stackoverflow.com/q/44449972). Are you using amd64 Win7? Let me try to find a test VM image for that.

@cool-RR
Author

cool-RR commented Jan 17, 2021 via email

@zhuyifei1999
Owner

zhuyifei1999 commented Jan 17, 2021

If I'm not mistaken, when people say "amd64" it's just a silly way to say
64 bit processor, regardless of whether it's Intel or AMD. In other words,
if your Windows computer is from the last 10 years and it's not a weird
netbook or something, it's likely "amd64" :)

If the win32 wheel worked for you, I guess your VM is 32bit. You should do
a 64bit VM just because that's what 99% of Windows users do.

Yes, the host machine is x86 64-bit. I don't have a licensed Win 7 to test with, hence I'm looking for a VM to download. Microsoft's test VM images from https://developer.microsoft.com/en-us/microsoft-edge/tools/vms/ are all x86 32-bit.

Also, you're going the reproduction route. Another way to tackle this would
be to get more logging output from my machine to let you figure out the
bug. If there's anything you want me to run, as long as it's not something
that requires a lot of setup and work, I'll be happy to do that.

I'm guessing this is a segfault somewhere in the C code. Is it possible for you to use faulthandler or get a C stack trace somehow? Is it possible to get a core dump somehow?

I'm also testing on a Win 10 amd64 install (licensed copy on bare metal) and am unable to reproduce the crash (he = h.heap() runs successfully and he prints successfully).

@zhuyifei1999
Owner

Got a pirated copy of Windows 7. Will try to reproduce on that later.

@cool-RR
Author

cool-RR commented Jan 17, 2021

LOL, GitHub is owned by Microsoft, hope they won't notice ;) They don't even let people buy legal copies of Windows 7, and I tried multiple times.

I've never used these tools you mentioned. I don't want to spend time researching, but if you'll give me lines to run, I'll run them.

@zhuyifei1999
Owner

zhuyifei1999 commented Jan 18, 2021

Cannot reproduce on 64-bit Win7. The packages were installed with python -m venv venv, venv\Scripts\activate, python -m pip install -U pip wheel setuptools, pip install keras tensorflow, pip install guppy3.

Screenshot_2021-01-17_18-53-41

I've never used these tools you mentioned. I don't want to spend time researching, but if you'll give me lines to run, I'll run them.

The one I suggested is faulthandler. faulthandler runs passively; you just need to enable it:

import faulthandler
faulthandler.enable()

The problem is that faulthandler is only able to dump a stack trace for the interpreted python code. The fault probably happens in some native C code and it would be helpful to pinpoint the native function that faulted.

I googled around a bit and found https://stackoverflow.com/a/49050274 regarding Windows Error Reporting which might be helpful in that.
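
For a crash that closes the console before anything can be read, one option is to point faulthandler at a log file. The following is a minimal sketch; the helper name and the idea of logging to a file are illustrative assumptions, not something guppy3 provides.

```python
import faulthandler
import sys


def enable_crash_traceback(log_path=None):
    """Enable faulthandler, optionally writing to a file instead of stderr.

    Returns the open log file (or None). The caller must keep the file open,
    because faulthandler writes to its file descriptor at crash time.
    """
    if log_path is None:
        faulthandler.enable(file=sys.stderr, all_threads=True)
        return None
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling enable_crash_traceback("crash.log") at the top of the script before importing keras would leave the Python-level traceback in crash.log. Running the script as python -X faulthandler fluff.py enables the same handler without touching the code.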

@cool-RR
Author

cool-RR commented Jan 19, 2021

This is the output from faulthandler:

$ python fluff.py
2021-01-19 17:57:03.684490: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
2021-01-19 17:57:03.696491: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Windows fatal exception: access violation

Current thread 0x0000865c (most recent call first):
  File "C:\Program Files\Python38\lib\site-packages\guppy\heapy\View.py", line 479 in referrers
  File "C:\Program Files\Python38\lib\site-packages\guppy\heapy\UniSet.py", line 556 in <lambda>
  File "C:\Program Files\Python38\lib\site-packages\guppy\etc\Descriptor.py", line 32 in __get__
  File "C:\Program Files\Python38\lib\site-packages\guppy\heapy\RefPat.py", line 613 in relimg
  File "C:\Program Files\Python38\lib\site-packages\guppy\heapy\RefPat.py", line 487 in get_children
  File "C:\Program Files\Python38\lib\site-packages\guppy\heapy\RefPat.py", line 467 in linegenerator
  File "C:\Program Files\Python38\lib\site-packages\guppy\heapy\RefPat.py", line 423 in generate
  File "C:\Program Files\Python38\lib\site-packages\guppy\heapy\RefPat.py", line 438 in get_row_indexed
  File "C:\Program Files\Python38\lib\site-packages\guppy\heapy\RefPat.py", line 456 in iterlines
  File "C:\Program Files\Python38\lib\site-packages\guppy\heapy\RefPat.py", line 417 in _oh_get_line_iter
  File "C:\Program Files\Python38\lib\site-packages\guppy\heapy\OutputHandling.py", line 212 in line_at
  File "C:\Program Files\Python38\lib\site-packages\guppy\heapy\OutputHandling.py", line 232 in lines_from
  File "C:\Program Files\Python38\lib\site-packages\guppy\heapy\OutputHandling.py", line 307 in f
  File "C:\Program Files\Python38\lib\site-packages\guppy\heapy\OutputHandling.py", line 339 in <lambda>
  File "C:\Program Files\Python38\lib\site-packages\guppy\heapy\View.py", line 256 in enter
  File "C:\Program Files\Python38\lib\site-packages\guppy\heapy\OutputHandling.py", line 339 in get_str
  File "C:\Program Files\Python38\lib\site-packages\guppy\heapy\OutputHandling.py", line 272 in get_str_of_top
  File "C:\Program Files\Python38\lib\site-packages\guppy\heapy\OutputHandling.py", line 386 in reprfunc
  File "C:\Program Files\Python38\lib\site-packages\guppy\heapy\View.py", line 356 in heap
  File "C:\Program Files\Python38\lib\site-packages\guppy\heapy\Use.py", line 192 in heap
  File "fluff.py", line 6 in <module>
Segmentation fault

@cool-RR
Author

cool-RR commented Jan 19, 2021

I was able to create the WER dump only when the debugger was on, not sure why. When it was off, the crash still happened, just without the Windows dialog. You can download the dump here but I have no idea how you would read it.

@zhuyifei1999
Owner

This is the output from faulthandler:

Looks like the last Python frame is View.py#L479, which would call into hv.c#L1518. This is a rather complex C function that works around issue #7.

You can download the dump here but I have no idea how you would read it.

Looks like a minidump file:

zhuyifei1999@zhuyifei1999-ThinkPad-T480 ~/guppy3/issue25 $ file python.exe.13268.dmp
python.exe.13268.dmp: Mini DuMP crash report, 12 streams, Tue Jan 19 16:04:33 2021, 0x1826 type

Searching on Google turned up a tool called Breakpad that works with this format, and I'm looking into it.

@zhuyifei1999
Owner

zhuyifei1999@zhuyifei1999-ThinkPad-T480 ~/guppy3/issue25 $ breakpad/src/src/processor/minidump_dump python.exe.13268.dmp > python.exe.13268.dmp.info

The dumped information reports exception

MDException
  thread_id                                  = 0x4cb8
  exception_record.exception_code            = 0xc0000005
  exception_record.exception_flags           = 0x0
  exception_record.exception_record          = 0x0
  exception_record.exception_address         = 0x7fedd47feaf
  exception_record.number_parameters         = 2
  exception_record.exception_information[ 0] = 0x0
  exception_record.exception_information[ 1] = 0xd0
  thread_context.data_size                   = 1232
  thread_context.rva                         = 0xeb50

and the context:

MDRawContextAMD64
  p1_home       = 0x2b9370
  p2_home       = 0x2b8e80
  p3_home       = 0x0
  p4_home       = 0x0
  p5_home       = 0x3ff00000000
  p6_home       = 0x0
  context_flags = 0x10005f
  mx_csr        = 0x1fa9
  cs            = 0x33
  ds            = 0x2b
  es            = 0x2b
  fs            = 0x53
  gs            = 0x2b
  ss            = 0x2b
  eflags        = 0x10246
  dr0           = 0x0
  dr1           = 0x0
  dr2           = 0x0
  dr3           = 0x0
  dr6           = 0x0
  dr7           = 0x0
  rax           = 0x1d2975e0
  rcx           = 0x1e16e580
  rdx           = 0x83c3270
  rbx           = 0x1d2975e0
  rsp           = 0x2b95b0
  rbp           = 0x1e16e580
  rsi           = 0x1e16e580
  rdi           = 0x29e7e20
  r8            = 0x0
  r9            = 0x83c3270
  r10           = 0x1c146350
  r11           = 0x2b95f8
  r12           = 0x0
  r13           = 0x0
  r14           = 0x0
  r15           = 0x1
  rip           = 0x7fedd47feaf

This address maps into python core

module[4]
MDRawModule
  base_of_image                   = 0x7fedd450000
  size_of_image                   = 0x42c000
  checksum                        = 0x40703a
  time_date_stamp                 = 0x5dfab24b 2019-12-18 23:12:11
  module_name_rva                 = 0x68c0
  version_info.signature          = 0xfeef04bd
  version_info.struct_version     = 0x10000
  version_info.file_version       = 0x30008:0x47e03f5
  version_info.product_version    = 0x30008:0x47e03f5
  version_info.file_flags_mask    = 0x3f
  version_info.file_flags         = 0x0
  version_info.file_os            = 0x4
  version_info.file_type          = 0x2
  version_info.file_subtype       = 0x0
  version_info.file_date          = 0x0:0x0
  cv_record.data_size             = 57
  cv_record.rva                   = 0x11745
  misc_record.data_size           = 0
  misc_record.rva                 = 0x0
  (code_file)                     = "C:\Program Files\Python38\python38.dll"
  (code_identifier)               = "5DFAB24B42c000"
  (cv_record).cv_signature        = 0x53445352
  (cv_record).signature           = 07725f2f-6ae8-46c5-955b-103f10b1c445
  (cv_record).age                 = 1
  (cv_record).pdb_file_name       = "C:\A\27\b\bin\amd64\python38.pdb"
  (misc_record)                   = (null)
  (debug_file)                    = "C:\A\27\b\bin\amd64\python38.pdb"
  (debug_identifier)              = "07725F2F6AE846C5955B103F10B1C4451"
  (version)                       = "3.8.1150.1013"

The offset matches the original description

  Exception Offset:	000000000002feaf
>>> hex(0x7fedd47feaf - 0x7fedd450000)
'0x2feaf'

Let me see if I can locate which function is at 0x2feaf.

@zhuyifei1999
Owner

  time_date_stamp                 = 0x5dfab24b 2019-12-18 23:12:11

This matches the release date of Python 3.8.1... hmm

Downloaded the Python 3.8 DLL from https://www.python.org/ftp/python/3.8.1/python-3.8.1-embed-amd64.zip, and readpe says:

zhuyifei1999@zhuyifei1999-ThinkPad-T480 ~/guppy3/issue25/py3.8.1/win $ readpe python38.dll
[...]
COFF/File header
[...]
    Date/time stamp:                 1576710731 (Wed, 18 Dec 2019 23:12:11 UTC)
[...]
Optional/Image header
[...]
    Checksum:                        0x40703a
[...]

Nice.
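
The same check can be done without readpe. The sketch below reads the COFF TimeDateStamp straight from the PE headers; the function name is an assumption, and it handles only the straightforward layout (DOS header, then a PE signature at e_lfanew).

```python
import struct
from datetime import datetime, timezone


def pe_timestamp(data):
    """Return the COFF TimeDateStamp of a PE image as an aware UTC datetime."""
    if data[:2] != b"MZ":
        raise ValueError("missing DOS 'MZ' signature")
    # e_lfanew, stored at 0x3C in the DOS header, is the PE signature offset.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # After the 4-byte signature: Machine (2), NumberOfSections (2),
    # then TimeDateStamp (4).
    (stamp,) = struct.unpack_from("<I", data, pe_offset + 8)
    return datetime.fromtimestamp(stamp, tz=timezone.utc)


# The time_date_stamp recorded for python38.dll in the minidump, decoded directly.
print(datetime.fromtimestamp(0x5DFAB24B, tz=timezone.utc))
```

Comparing pe_timestamp(open("python38.dll", "rb").read()) against the value in the dump's module entry is a quick way to confirm that a downloaded DLL is the same build as the one that crashed.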

I was under the assumption that you were running the latest Python 3.8 (3.8.7). Let me see if I can reproduce it using 3.8.1. If not, I'll look deeper into the symbols.

@zhuyifei1999
Owner

Not on Linux

(venv.py3.8.1) zhuyifei1999@zhuyifei1999-ThinkPad-T480 ~/guppy3/issue25/py3.8.1/Python-3.8.1 $ python
Python 3.8.1 (default, Jan 19 2021, 21:35:33) 
[GCC 10.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import keras
2021-01-19 21:40:24.925223: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-01-19 21:40:24.925262: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
>>> import guppy
>>> h = guppy.hpy()
>>> he = h.heap()
>>> he
Partition of a set of 590742 objects. Total size = 78772366 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0 149976  25 24049357  31  24049357  31 str
     1 135555  23 10635376  14  34684733  44 tuple
     2  41449   7  7346732   9  42031465  53 types.CodeType
     3  82631  14  6998144   9  49029609  62 bytes
     4  41603   7  5658008   7  54687617  69 function
     5   4038   1  3949728   5  58637345  74 type
     6   2473   0  3919752   5  62557097  79 dict of module
     7   7475   1  3242320   4  65799417  84 dict (no owner)
     8   4038   1  2036176   3  67835593  86 dict of type
     9   8052   1  1716128   2  69551721  88 dict of function
<765 more rows. Type e.g. '_.more' to view.>
>>> 

Not on Win 7 either

Screenshot_2021-01-19_22-00-17

AFAICT, the python38.dll does not have a symbol table

zhuyifei1999@zhuyifei1999-ThinkPad-T480 ~/guppy3/issue25/py3.8.1/win $ nm python38.dll 
nm: python38.dll: no symbols

And the nearest functions to 0x2feaf AFAICT, are

zhuyifei1999@zhuyifei1999-ThinkPad-T480 ~/guppy3/issue25/py3.8.1/win $ readpe python38.dll | grep -A 1 Function | grep 0x | sort | less
[...]
        0x2e140:                         PyObject_RichCompareBool
        0x2e384:                         PySet_Contains
        0x2e98:                          _PyUnicode_EncodeUTF32
        0x2f010:                         PyNumber_InPlaceAdd
        0x2f160:                         PyNumber_Add
        0x2fae4:                         PyLong_AsDouble
        0x2fe90:                         PyWeakref_NewRef
        0x302060:                        _Py_ascii_whitespace
        0x30554:                         PyLong_FromLongLong
        0x30718:                         _PyLong_Frexp
        0x30eb0:                         PyObject_GC_Track
[...]
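
Because sort orders the listing as text, entries such as 0x2e98 and 0x302060 end up interleaved with their neighbours. A small sketch that sorts numerically and picks the closest export at or below the faulting offset is shown here; the dictionary is only the excerpt above, and the approach assumes the faulting code belongs to an exported function rather than an unexported one that happens to sit after it.

```python
import bisect

# Export RVAs copied from the readpe excerpt above.
EXPORTS = {
    0x2E140: "PyObject_RichCompareBool",
    0x2E384: "PySet_Contains",
    0x2E98: "_PyUnicode_EncodeUTF32",
    0x2F010: "PyNumber_InPlaceAdd",
    0x2F160: "PyNumber_Add",
    0x2FAE4: "PyLong_AsDouble",
    0x2FE90: "PyWeakref_NewRef",
    0x302060: "_Py_ascii_whitespace",
    0x30554: "PyLong_FromLongLong",
    0x30718: "_PyLong_Frexp",
    0x30EB0: "PyObject_GC_Track",
}


def nearest_export(rva, exports):
    """Return (name, delta) for the closest export at or below rva."""
    addresses = sorted(exports)
    index = bisect.bisect_right(addresses, rva) - 1
    if index < 0:
        raise LookupError("no export at or below %#x" % rva)
    start = addresses[index]
    return exports[start], rva - start


name, delta = nearest_export(0x2FEAF, EXPORTS)
print("%s+%#x" % (name, delta))
```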

This is the disassembly of PyWeakref_NewRef up to its return (yes, I'm aware that 2ff27 jumps further down, but it's more effort than necessary to track precisely where the function ends):

zhuyifei1999@zhuyifei1999-ThinkPad-T480 ~/guppy3/issue25/py3.8.1/win $ objdump -d python38.dll | less
[...]
   18002fe90:   4c 8b dc                mov    %rsp,%r11
   18002fe93:   49 89 5b 10             mov    %rbx,0x10(%r11)
   18002fe97:   55                      push   %rbp
   18002fe98:   56                      push   %rsi
   18002fe99:   57                      push   %rdi
   18002fe9a:   41 56                   push   %r14
   18002fe9c:   41 57                   push   %r15
   18002fe9e:   48 83 ec 20             sub    $0x20,%rsp
   18002fea2:   4c 8b 41 08             mov    0x8(%rcx),%r8
   18002fea6:   45 33 f6                xor    %r14d,%r14d
   18002fea9:   4c 8b ca                mov    %rdx,%r9
   18002feac:   48 8b e9                mov    %rcx,%rbp
   18002feaf:   49 8b 80 d0 00 00 00    mov    0xd0(%r8),%rax
   18002feb6:   48 85 c0                test   %rax,%rax
   18002feb9:   0f 8e bd d2 12 00       jle    0x18015d17c
   18002febf:   48 8d 34 08             lea    (%rax,%rcx,1),%rsi
   18002fec3:   4d 89 73 08             mov    %r14,0x8(%r11)
   18002fec7:   48 8b 06                mov    (%rsi),%rax
   18002feca:   4c 8d 3d 5f 71 37 00    lea    0x37715f(%rip),%r15        # 0x1803a7030
   18002fed1:   4d 89 73 18             mov    %r14,0x18(%r11)
   18002fed5:   48 8b d0                mov    %rax,%rdx
   18002fed8:   41 8b de                mov    %r14d,%ebx
   18002fedb:   48 85 c0                test   %rax,%rax
   18002fede:   74 2c                   je     0x18002ff0c
   18002fee0:   4c 39 70 18             cmp    %r14,0x18(%rax)
   18002fee4:   75 26                   jne    0x18002ff0c
   18002fee6:   41 8b ce                mov    %r14d,%ecx
   18002fee9:   4c 39 78 08             cmp    %r15,0x8(%rax)
   18002feed:   75 0b                   jne    0x18002fefa
   18002feef:   49 89 43 08             mov    %rax,0x8(%r11)
   18002fef3:   48 8b ca                mov    %rdx,%rcx
   18002fef6:   48 8b 40 30             mov    0x30(%rax),%rax
   18002fefa:   48 8b d9                mov    %rcx,%rbx
   18002fefd:   48 85 c0                test   %rax,%rax
   18002ff00:   74 0a                   je     0x18002ff0c
   18002ff02:   4c 39 70 18             cmp    %r14,0x18(%rax)
   18002ff06:   0f 84 8e d2 12 00       je     0x18015d19a
   18002ff0c:   48 8d 05 6d 49 37 00    lea    0x37496d(%rip),%rax        # 0x1803a4880
   18002ff13:   49 8b fe                mov    %r14,%rdi
   18002ff16:   4c 3b c8                cmp    %rax,%r9
   18002ff19:   49 0f 45 f9             cmovne %r9,%rdi
   18002ff1d:   48 85 ff                test   %rdi,%rdi
   18002ff20:   49 0f 45 de             cmovne %r14,%rbx
   18002ff24:   48 85 db                test   %rbx,%rbx
   18002ff27:   74 17                   je     0x18002ff40
   18002ff29:   48 ff 03                incq   (%rbx)
   18002ff2c:   48 8b c3                mov    %rbx,%rax
   18002ff2f:   48 8b 5c 24 58          mov    0x58(%rsp),%rbx
   18002ff34:   48 83 c4 20             add    $0x20,%rsp
   18002ff38:   41 5f                   pop    %r15
   18002ff3a:   41 5e                   pop    %r14
   18002ff3c:   5f                      pop    %rdi
   18002ff3d:   5e                      pop    %rsi
   18002ff3e:   5d                      pop    %rbp
   18002ff3f:   c3                      retq   

2feaf is mov 0xd0(%r8),%rax which is indeed an instruction that could fault.

Questions for myself:

  • how does reference graph generation end up creating a new weak reference?
  • why would creating a new weak reference fault?

@zhuyifei1999
Owner

@cool-RR can you check, if you create a new virtual environment with the latest packages, it still faults inside the virtual environment? Like:

python -m venv venv
venv\Scripts\activate
python -m pip install -U pip wheel setuptools
pip install -U keras tensorflow guppy3
python fluff.py

@zhuyifei1999
Owner

In the Microsoft x64 calling convention, the first four integer arguments are passed in RCX, RDX, R8, and R9.

PyObject *
PyWeakref_NewRef(PyObject *ob, PyObject *callback)

ob in RCX, callback in RDX

   18002fea2:   4c 8b 41 08             mov    0x8(%rcx),%r8
gef➤  ptype /o PyObject
type = struct _object {
/*    0      |     8 */    Py_ssize_t ob_refcnt;
/*    8      |     8 */    struct _typeobject *ob_type;

                           /* total size (bytes):   16 */
                         }

struct _typeobject *r8 = ob->ob_type

   18002feaf:   49 8b 80 d0 00 00 00    mov    0xd0(%r8),%rax
gef➤  ptype /o struct _typeobject
/* offset    |  size */  type = struct _typeobject {
/*    0      |    24 */    PyVarObject ob_base;
/*   24      |     8 */    const char *tp_name;
[...]
/*  208      |     8 */    Py_ssize_t tp_weaklistoffset;
[...]
/*  408      |     8 */    int (*tp_print)(PyObject *, FILE *, int);

                           /* total size (bytes):  416 */
                         }

Py_ssize_t rax = r8->tp_weaklistoffset

That's weakrefobject.c#L801,

    if (!PyType_SUPPORTS_WEAKREFS(Py_TYPE(ob))) {
#define PyType_SUPPORTS_WEAKREFS(t) ((t)->tp_weaklistoffset > 0)

Then compare the context (#25 (comment))

  rcx           = 0x1e16e580
  r8            = 0x0

We have an object whose type is NULL... how does that happen?
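
The exception record supports the same conclusion. For an access violation, the first exception_information entry is the kind of access and the second is the address that could not be accessed. The sketch below only restates values already quoted in this thread; the helper name is illustrative.

```python
# Values copied from the minidump context and exception record.
REGISTERS = {"rcx": 0x1E16E580, "rdx": 0x83C3270, "r8": 0x0}
EXCEPTION_INFORMATION = (0x0, 0xD0)

# Offset of tp_weaklistoffset in this build, per the gdb ptype output (208).
TP_WEAKLISTOFFSET_OFFSET = 0xD0

# Access kinds used in the first parameter of an access violation record.
ACCESS_KINDS = {0: "read", 1: "write", 8: "execute (DEP)"}


def describe_fault(registers, exception_information):
    """Relate the faulting address to `mov 0xd0(%r8),%rax`."""
    kind, address = exception_information
    expected = registers["r8"] + TP_WEAKLISTOFFSET_OFFSET
    return {
        "access": ACCESS_KINDS.get(kind, "unknown"),
        "address": address,
        "matches_null_type": registers["r8"] == 0 and address == expected,
    }


print(describe_fault(REGISTERS, EXCEPTION_INFORMATION))
```

A read at address 0xd0 is exactly what loading tp_weaklistoffset through a NULL ob_type would produce.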

@zhuyifei1999
Copy link
Owner

Looking at stack:

  rsp           = 0x2b95b0

After patching Breakpad like:

diff --git a/src/tools/linux/md2core/minidump-2-core.cc b/src/tools/linux/md2core/minidump-2-core.cc
index aade82c9..7d64bbef 100644
--- a/src/tools/linux/md2core/minidump-2-core.cc
+++ b/src/tools/linux/md2core/minidump-2-core.cc
@@ -630,7 +630,7 @@ ParseSystemInfo(const Options& options, CrashedProcess* crashinfo,
               "Linux") &&
       sysinfo->platform_id != MD_OS_NACL) {
     fprintf(stderr, "This minidump was not generated by Linux or NaCl.\n");
-    exit(1);
+    // exit(1);
   }
 
   if (options.verbose) {

I'm able to convert it into a core dump:

zhuyifei1999@zhuyifei1999-ThinkPad-T480 ~/guppy3/issue25 $ breakpad/src/src/tools/linux/md2core/minidump-2-core -v python.exe.13268.dmp > python.exe.13268.dmp.core
[...]
zhuyifei1999@zhuyifei1999-ThinkPad-T480 ~/guppy3/issue25 $ readelf -eW python.exe.13268.dmp.core
[...]
  LOAD           0x007000 0x00000000002b8000 0x0000000000000000 0x008000 0x008000 RW  0x1000
[...]

Ok we have the stack contents in core dump, just need to load it into gdb with a dummy executable:

zhuyifei1999@zhuyifei1999-ThinkPad-T480 ~/guppy3/issue25 $ cat null.S
.globl _start
_start:
zhuyifei1999@zhuyifei1999-ThinkPad-T480 ~/guppy3/issue25 $ gcc null.S -nostartfiles -o null
zhuyifei1999@zhuyifei1999-ThinkPad-T480 ~/guppy3/issue25 $ gdb ./null python.exe.13268.dmp.core
[...]
gef➤  info reg
rax            0x58                0x58
rbx            0x2b8490            0x2b8490
rcx            0x2                 0x2
rdx            0x2b8400            0x2b8400
rsi            0x0                 0x0
rdi            0x2                 0x2
rbp            0x2                 0x2
rsp            0x2b8358            0x2b8358
r8             0x0                 0x0
r9             0x40                0x40
r10            0x0                 0x0
r11            0x286               0x286
r12            0x0                 0x0
r13            0x2b8400            0x2b8400
r14            0x0                 0x0
r15            0x0                 0x0
rip            0x77089d5a          0x77089d5a
eflags         0x246               [ PF ZF IF ]
cs             0x33                0x33
ss             0x2b                0x2b
ds             0x2b                0x2b
es             0x2b                0x2b
fs             0x53                0x53
gs             0x2b                0x2b
gef➤  x/10x 0x2b8358
0x2b8358:	0x504d444d	0x61b1a793	0x0000000c	0x00000020
0x2b8368:	0x00000000	0x60070311	0x00001826	0x00000000
0x2b8378:	0x00000003	0x00000184

Nice.

@zhuyifei1999
Owner

For future reference, gdb's thread listing is not useful (these are all at ntdll.dll + 0x69d5a):

gef➤  info threads
  Id   Target Id         Frame 
* 1849 LWP 19640         0x0000000077089d5a in ?? ()
  1850 LWP 24520         0x0000000077089d5a in ?? ()
  1851 LWP 33360         0x0000000077089d5a in ?? ()
  1852 LWP 27624         0x0000000077089d5a in ?? ()
  1853 LWP 19032         0x0000000077089d5a in ?? ()
  1854 LWP 14564         0x0000000077089d5a in ?? ()
  1855 LWP 15584         0x0000000077089d5a in ?? ()
  1856 LWP 22348         0x0000000077089d5a in ?? ()
   18002fe90:   4c 8b dc                mov    %rsp,%r11

r11 should point to the saved return address.

  rsp           = 0x2b95b0
  r11           = 0x2b95f8
gef➤  x/10xg 0x2b95f8
0x2b95f8:	0x0000000000000000	0x0000000000000000
0x2b9608:	0xf680000000000000	0x00005000000007fe
0x2b9618:	0x5caeb94d0000e00b	0xfeef04bd0000719e
0x2b9628:	0x000a000000010000	0x000a000038390bae
0x2b9638:	0x0000003f38390bae	0x0004000400000000

This makes no sense to me. The return address is NULL?

Looking at the object that's passed in

  rcx           = 0x1e16e580

it's mapped

  LOAD           0x000000 0x000000001e0f0000 0x0000000000000000 0x000000 0x09a000 R   0x1000

but the core dump does not contain the data (I'll see if I can figure out how to get it)

However, the second argument

  rdx           = 0x83c3270
  LOAD           0x007000 0x00000000002b8000 0x0000000000000000 0x008000 0x008000 RW  0x1000
  LOAD           0x000000 0x0000000009970000 0x0000000000000000 0x000000 0x1c88000 R   0x1000

This is not mapped at all.
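
The reasoning here is the usual reading of ELF program headers: an address is mapped if it falls inside a LOAD segment's memory size, and its bytes are only available if it also falls inside the file size. A sketch using the three LOAD lines quoted above follows; the names are illustrative, and it reflects only what the converted core file claims, not what the original process had mapped.

```python
from collections import namedtuple

# One PT_LOAD entry: virtual address, bytes stored in the file, bytes in memory.
Segment = namedtuple("Segment", "vaddr filesz memsz")

# The LOAD lines quoted from `readelf -eW` (VirtAddr, FileSiz, MemSiz columns).
SEGMENTS = [
    Segment(0x002B8000, 0x008000, 0x008000),
    Segment(0x1E0F0000, 0x000000, 0x09A000),
    Segment(0x09970000, 0x000000, 0x1C88000),
]


def classify(address, segments):
    """Say whether the core file can supply the bytes at address."""
    for seg in segments:
        if seg.vaddr <= address < seg.vaddr + seg.memsz:
            if address < seg.vaddr + seg.filesz:
                return "mapped, data present"
            return "mapped, no data in file"
    return "not mapped"


for name, value in (("rsp", 0x2B95B0), ("rcx", 0x1E16E580), ("rdx", 0x83C3270)):
    print(name, hex(value), classify(value, SEGMENTS))
```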

@zhuyifei1999
Owner

I stand corrected. I found another tool (https://github.com/skelsec/minidump) to look at dumps, and the address is in fact mapped:

(venv) zhuyifei1999@zhuyifei1999-ThinkPad-T480 ~/guppy3/issue25/py-minidump/minidump $ minidump ../../python.exe.13268.dmp --memory
[...]
0x8390000     | 0x8390000      | 4                 | 0x40000       | MEM_COMMIT  | PAGE_READWRITE    | MEM_PRIVATE
[...]

I guess I should write a tool myself to convert a minidump into a core dump.

@zhuyifei1999
Copy link
Owner

zhuyifei1999 commented Jan 20, 2021

Applied this patch to Breakpad: https://gist.github.com/zhuyifei1999/ff2094d04b91c8ef704e79ab816993aa

gef➤  x/10xg 0x2b95f8
0x2b95f8:	0x000007fec43ccf98	0x0000000000000000
0x2b9608:	0x000000001d2975e0	0x000000001d18a100
0x2b9618:	0x00000000002b96a9	0x000000001e16e580
0x2b9628:	0x000007fec43ccd1b	0x0000000004e08140
0x2b9638:	0x000007fec43da8a0	0x0000000008380b30
gef➤  x/10xg 0x83c3270
0x83c3270:	0x00000000000002fc	0x000007fedd7f41f0
0x83c3280:	0x000007fec43d7850	0x0000000008380b30
0x83c3290:	0x0000000000000000	0x0000000000000000
0x83c32a0:	0x000007fedd4a4940	0x0000000000000000
0x83c32b0:	0x0000000000000002	0x000007fedd7f6cf0
gef➤  x/10xg 0x1e16e580
0x1e16e580:	0x0000000000000001	0x0000000000000000
0x1e16e590:	0x0000000000000000	0x000000001e13cf68
0x1e16e5a0:	0x00000000000001a0	0x0000000000000000
0x1e16e5b0:	0x000000001e0f6800	0x0000000000000000
0x1e16e5c0:	0x0000000000000000	0x0000000000000000
gef➤  x/10xg 0x83c3270
0x83c3270:	0x00000000000002fc	0x000007fedd7f41f0
0x83c3280:	0x000007fec43d7850	0x0000000008380b30
0x83c3290:	0x0000000000000000	0x0000000000000000
0x83c32a0:	0x000007fedd4a4940	0x0000000000000000
0x83c32b0:	0x0000000000000002	0x000007fedd7f6cf0
gef➤  x/10xg 0x000007fedd7f41f0 + 24
0x7fedd7f4208:	0x000007fedd780f50	0x0000000000000038
0x7fedd7f4218:	0x0000000000000000	0x000007fedd47ab90
0x7fedd7f4228:	0x0000000000000030	0x0000000000000000
0x7fedd7f4238:	0x0000000000000000	0x0000000000000000
0x7fedd7f4248:	0x000007fedd58e300	0x0000000000000000
gef➤  p (char *)0x000007fedd780f50
$2 = 0x7fedd780f50 "builtin_function_or_method"

I'm guessing it is working?
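
The pointer chase in that gdb session (object, then ob_type at offset 8, then tp_name at offset 24) can be written down as a short sketch. It parses x/Nxg output into a dictionary and follows the same two hops; the offsets are the ones shown by ptype for this 64-bit Python 3.8 build, and the parser name is illustrative.

```python
def parse_gdb_words(text):
    """Turn gdb `x/Nxg` output into {address: 64-bit value}."""
    memory = {}
    for line in text.strip().splitlines():
        label, _, rest = line.partition(":")
        address = int(label, 16)
        for i, word in enumerate(rest.split()):
            memory[address + 8 * i] = int(word, 16)
    return memory


# Two lines copied from the gdb session above.
DUMP = """
0x83c3270:      0x00000000000002fc      0x000007fedd7f41f0
0x7fedd7f4208:  0x000007fedd780f50      0x0000000000000038
"""

memory = parse_gdb_words(DUMP)
callback = 0x83C3270
ob_type = memory[callback + 8]   # PyObject.ob_type is at offset 8
tp_name = memory[ob_type + 24]   # PyTypeObject.tp_name is at offset 24
print(hex(ob_type), hex(tp_name))
```

The second printed address is the one whose string gdb showed as "builtin_function_or_method".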

@zhuyifei1999
Owner

gef➤  x/10xg 0x2b95f8
0x2b95f8:	0x000007fec43ccf98	0x0000000000000000

Return address is 0x000007fec43ccf98.
This belongs to heapyc of guppy (offset cf98):

0x7fec43c0000-0x7fec43e1000, ChkSum: 0x00000000, GUID: 61B1A793-000C-0000-2000-000000000000,  "C:\Program Files\Python38\Lib\site-packages\guppy\heapy\heapyc.cp38-win_amd64.pyd"
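
Resolving an address to "module plus offset" is the same lookup used earlier for the faulting instruction. A minimal sketch with the two module ranges seen so far (the helper and tuple names are illustrative):

```python
from collections import namedtuple

Module = namedtuple("Module", "name base end")

# Ranges copied from the dump; python38.dll is base_of_image + size_of_image.
MODULES = [
    Module("python38.dll", 0x7FEDD450000, 0x7FEDD450000 + 0x42C000),
    Module("heapyc.cp38-win_amd64.pyd", 0x7FEC43C0000, 0x7FEC43E1000),
]


def locate(address, modules):
    """Return (module name, offset from base), or None if no module owns it."""
    for module in modules:
        if module.base <= address < module.end:
            return module.name, address - module.base
    return None


# Faulting instruction, then the saved return address found on the stack.
for address in (0x7FEDD47FEAF, 0x7FEC43CCF98):
    name, offset = locate(address, MODULES)
    print("%#x -> %s+%#x" % (address, name, offset))
```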

Assuming a wheel install,

zhuyifei1999@zhuyifei1999-ThinkPad-T480 ~/guppy3/issue25/guppy $ objdump -d heapy/heapyc.cp38-win_amd64.pyd | less
[...]
   18000cf73:   33 d2                   xor    %edx,%edx
   18000cf75:   48 8b cb                mov    %rbx,%rcx
   18000cf78:   44 8d 42 68             lea    0x68(%rdx),%r8d
   18000cf7c:   e8 81 4e ff ff          callq  0x180001e02
   18000cf81:   48 89 1f                mov    %rbx,(%rdi)
   18000cf84:   48 8b ce                mov    %rsi,%rcx
   18000cf87:   48 89 6b 40             mov    %rbp,0x40(%rbx)
   18000cf8b:   48 89 33                mov    %rsi,(%rbx)
   18000cf8e:   48 8b 55 30             mov    0x30(%rbp),%rdx
   18000cf92:   ff 15 e8 25 00 00       callq  *0x25e8(%rip)        # 0x18000f580
   18000cf98:   48 89 43 48             mov    %rax,0x48(%rbx)
   18000cf9c:   48 85 c0                test   %rax,%rax
   18000cf9f:   75 0b                   jne    0x18000cfac
   18000cfa1:   48 8b cb                mov    %rbx,%rcx
   18000cfa4:   ff 15 06 23 00 00       callq  *0x2306(%rip)        # 0x18000f2b0
   18000cfaa:   33 db                   xor    %ebx,%ebx
   18000cfac:   48 8b 6c 24 38          mov    0x38(%rsp),%rbp
   18000cfb1:   48 8b c3                mov    %rbx,%rax
   18000cfb4:   48 8b 5c 24 30          mov    0x30(%rsp),%rbx
   18000cfb9:   48 8b 74 24 40          mov    0x40(%rsp),%rsi
   18000cfbe:   48 83 c4 20             add    $0x20,%rsp
   18000cfc2:   5f                      pop    %rdi
   18000cfc3:   c3                      retq   
[...]

What could this function be?

@zhuyifei1999
Owner

Educated guess: hv.c#L384, hv_new_xt_for_type_at_xtp

It's certain that the "something" it is creating a weak reference to is a type object... let's check its name

gef➤  x/10xg 0x1e16e580
0x1e16e580:	0x0000000000000001	0x0000000000000000
0x1e16e590:	0x0000000000000000	0x000000001e13cf68
0x1e16e5a0:	0x00000000000001a0	0x0000000000000000
0x1e16e5b0:	0x000000001e0f6800	0x0000000000000000
0x1e16e5c0:	0x0000000000000000	0x0000000000000000
gef➤  x/10xg 0x1e16e580 + 24
0x1e16e598:	0x000000001e13cf68	0x00000000000001a0
0x1e16e5a8:	0x0000000000000000	0x000000001e0f6800
0x1e16e5b8:	0x0000000000000000	0x0000000000000000
0x1e16e5c8:	0x0000000000000000	0x0000000000000000
0x1e16e5d8:	0x0000000000000000	0x0000000000000000
gef➤  p (char *)0x000000001e13cf68
$4 = 0x1e13cf68 "PyOleNothing"

Googling around I see pympler/pympler#80

Looking at the code of pywin32 I see mhammond/pywin32@daeb5f2

This was released in the latest pywin32 (https://pypi.org/project/pywin32/#history), pywin32==300. I have no idea why pywin32 would be imported by an older version of keras, but we can check that it is imported and mapped into memory:

0x1e0f0000-0x1e18a000, ChkSum: 0x00000000, GUID: D6AFCF3D-D19E-4AC0-920B-1413D161983D,  "C:\Program Files\Python38\Lib\site-packages\pywin32_system32\pythoncom38.dll"
0x5b950000-0x5b978000, ChkSum: 0x00000000, GUID: 58BDE457-928B-43D9-B645-19979EA39325,  "C:\Program Files\Python38\Lib\site-packages\pywin32_system32\pywintypes38.dll"
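
The same question can also be asked from inside a live process rather than from the dump. A small sketch, assuming it runs in the interpreter that has already imported keras; `loaded_module_files` is a hypothetical helper, not part of guppy3 or pywin32:

```python
import sys


def loaded_module_files(names):
    """Map each already-imported module name to the file it was loaded from."""
    found = {}
    for name in names:
        module = sys.modules.get(name)  # a lookup only, never triggers an import
        if module is not None:
            # Built-in modules have no __file__, hence the getattr default.
            found[name] = getattr(module, '__file__', None)
    return found


# Prints an empty dict when neither module has been imported.
print(loaded_module_files(['pythoncom', 'pywintypes']))
```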

Also successfully reproduced this:

Screenshot_2021-01-20_07-05-59
Screenshot_2021-01-20_07-06-58

Considering that it is not valid for an object to have NULL as its type and still be passed around in the Python interpreter, I don't think this is something we should work around.

@cool-RR could you please confirm that the crash is resolved with an upgrade to pywin32==300?

@zhuyifei1999 changed the title from "Crash under Windows 7" to "pywin32<300 causes NULL pointer deference during referrer graph generation" on Jan 20, 2021
@cool-RR
Author

cool-RR commented Jan 20, 2021

"I'm not a Windows person." You are now 😆 I've been using Windows for years and doing some development for it, and I never got as deep as you now did.

That was amazing. Yes, the problem was fixed by upgrading pywin32, both in my test example and in my actual application. Thank you very much.

One question that can be asked now is whether to treat this as something that could be improved in guppy3. You could maybe show a warning when someone tries to use guppy3 with an old pywin32 installation, so people who get this crash wouldn't be as confused as we were. But I don't know how prevalent this problem is, and whether that's worth the code. Your decision.

@zhuyifei1999
Owner

> "I'm not a Windows person." You are now 😆 I've been using Windows for years and doing some development for it, and I never got as deep as you now did.

All I did was figure out how to convert a minidump into a core dump; the rest is my usual GDB process, just complicated by a lack of symbols 😉

> You could maybe show a warning when someone tries to use guppy3 with an old pywin32 installation, so people who get this crash wouldn't be as confused as we were. But I don't know how prevalent this problem is, and whether that's worth the code.

Good idea. Hmm

@zhuyifei1999
Owner

Wdyt of something like:

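    import sys

    # Only check when pythoncom (part of pywin32) has already been imported.
    if 'pythoncom' in sys.modules:
        try:
            import pkg_resources

            pywin32_ver = (pkg_resources.get_distribution('pywin32')
                           .parsed_version)
        except Exception:
            # pkg_resources is unavailable or no pywin32 metadata was found.
            pass
        else:
            # pywin32 versions are plain build numbers, e.g. 300.
            if pywin32_ver.major < 300:
                import warnings

                warnings.warn(
                    'pythoncom in pywin32 < 300 may cause crashes. '
                    'See https://github.com/zhuyifei1999/guppy3/issues/25')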

Should it be more visible?
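
On Python 3.8 and later, the standard library's `importlib.metadata` could stand in for `pkg_resources` here. A hedged sketch of that variant, not what guppy3 ships; `installed_version` and `warn_if_old_pywin32` are hypothetical names, and passing `RuntimeWarning` with `stacklevel=2` is only one assumed way of making the message stand out:

```python
import warnings
from importlib import metadata


def installed_version(dist_name):
    """Return the installed distribution's version string, or None."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def warn_if_old_pywin32(version):
    """Warn and return True when `version` is a build number below 300."""
    try:
        build = int(version)
    except (TypeError, ValueError):
        return False  # unknown or non-numeric version: stay silent
    if build < 300:
        warnings.warn(
            'pythoncom in pywin32 < 300 may cause crashes. '
            'See https://github.com/zhuyifei1999/guppy3/issues/25',
            RuntimeWarning, stacklevel=2)
        return True
    return False


warn_if_old_pywin32(installed_version('pywin32'))
```

Returning a boolean keeps the check easy to exercise without a real pywin32 installation.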

@cool-RR
Author

cool-RR commented Jan 21, 2021

This is probably what I'd do. I might have the warning say "You probably want to upgrade to the newest version of pywin32 by running pip install pywin32 --upgrade".

@mhammond @kxrob Does this look like a good way to test the pywin32 version?

@mhammond

IIUC, that will only work when installed via pip, but some users install via a bdist_wininst executable. If you care about that case, then you can probably look for site-packages\pywin32.version.txt as a fallback (it's just the build number with a trailing newline) - eg, https://github.com/mhammond/pywin32/blob/f3f55abf528902f3b98c37b0e661d8b52dff7f94/Pythonwin/pywin/framework/app.py#L338-L344
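
The fallback described above could look roughly like this. It assumes only what the comment states, namely that `pywin32.version.txt` holds the build number followed by a newline; `read_pywin32_build` is a hypothetical helper, and defaulting to the `platlib` directory is an assumption about where the file lives:

```python
import os
import sysconfig


def read_pywin32_build(site_packages=None):
    """Return the build number from pywin32.version.txt, or None."""
    if site_packages is None:
        # Assumption: the file sits in the platform-specific site-packages.
        site_packages = sysconfig.get_paths()['platlib']
    path = os.path.join(site_packages, 'pywin32.version.txt')
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None  # file missing, unreadable, or not a plain integer


print(read_pywin32_build())
```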

@zhuyifei1999
Owner

    import os
    import sys
    import warnings

    # Only check when pythoncom (part of pywin32) has already been imported.
    if 'pythoncom' in sys.modules:
        def get_pywin32_ver():
            # Preferred: distribution metadata, available for pip installs.
            try:
                import pkg_resources

                return pkg_resources.get_distribution('pywin32').version
            except Exception:
                pass

            # Fallback: the version file written by the bdist_wininst
            # installer (build number with a trailing newline).
            try:
                import distutils.sysconfig

                site_pkg = distutils.sysconfig.get_python_lib(plat_specific=1)
                with open(os.path.join(site_pkg, 'pywin32.version.txt')) as f:
                    return f.read().strip()
            except Exception:
                pass

            return None

        pywin32_ver = get_pywin32_ver()

        if pywin32_ver:
            try:
                pywin32_ver = int(pywin32_ver)
            except ValueError:
                # Not a plain build number; skip the check.
                pass
            else:
                if pywin32_ver < 300:
                    warnings.warn(
                        'pythoncom in pywin32 < 300 may cause crashes. See '
                        'https://github.com/zhuyifei1999/guppy3/issues/25. '
                        'You may want to upgrade to the newest version of '
                        'pywin32 by running "pip install pywin32 --upgrade"')

Wdyt?

@cool-RR
Author

cool-RR commented Jan 23, 2021

Looks good.

@zhuyifei1999
Owner

For my future reference, lldb can natively work with minidumps:

$ lldb -c python.exe.13268.dmp
(lldb) target create --core "python.exe.13268.dmp"
Core file '/home/zhuyifei1999/guppy3/python.exe.13268.dmp' (x86_64) was loaded.

(lldb) x/10xg "0x1e16e580 + 24"    
0x1e16e598: 0x000000001e13cf68 0x00000000000001a0
0x1e16e5a8: 0x0000000000000000 0x000000001e0f6800
0x1e16e5b8: 0x0000000000000000 0x0000000000000000
0x1e16e5c8: 0x0000000000000000 0x0000000000000000
0x1e16e5d8: 0x0000000000000000 0x0000000000000000
(lldb) p (char *)0x000000001e13cf68
(char *) $1 = 0x000000001e13cf68 "PyOleNothing"
