Seg fault running Assise as local FS #5

Open
hayley-leblanc opened this issue May 12, 2021 · 4 comments

hayley-leblanc commented May 12, 2021

Hi folks,

I am trying to set up Assise to run as a local file system but I'm having trouble getting it to run. I've been able to successfully build Assise, configure storage, run mkfs, and start up the KernFS/SharedFS process. I followed the instructions here to configure Assise to run as a single local file system. When I try to run a program from libfs/tests (I've been using mkdir_user but have tried a few others), the KernFS appears to segfault. I spent some time trying to figure out where it might be occurring without much luck, although it appears to occur before mkdir_user's main function actually runs.

I did make some small changes to Assise, although I don't think they are the cause of the issue. I want to run Assise on a very small emulated PM device (128 MB would be best, a couple GB at most) so I had to reduce the number of inodes and the size of each LibFS's log in order to prevent asserts from failing.
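To illustrate why the default sizes can trip asserts on a small device, here is a minimal sketch of the kind of capacity check involved. Everything in it is an assumption for illustration: the 4 KB block size, the `layout` struct, the field names, and the numbers are hypothetical and are not taken from Assise's actual on-device layout or defaults.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative assumption: 4 KB blocks. Not Assise's real configuration. */
#define BLOCK_SIZE 4096ULL

/* Hypothetical description of how a device is carved up. */
struct layout {
    uint64_t dev_bytes;    /* size of the emulated PM device in bytes */
    uint64_t inode_blocks; /* blocks reserved for the inode table */
    uint64_t log_blocks;   /* blocks reserved for each LibFS log */
    uint64_t num_logs;     /* number of LibFS logs */
};

/* Returns true when inode table plus logs leave at least one block for data. */
bool layout_fits(const struct layout *l)
{
    uint64_t total_blocks = l->dev_bytes / BLOCK_SIZE;
    uint64_t reserved = l->inode_blocks + l->log_blocks * l->num_logs;
    return reserved < total_blocks;
}
```

With these made-up numbers, a 128 MB device has 32768 blocks, so three logs of 262144 blocks each cannot fit, while three logs of 4096 blocks each can. The real limits depend on Assise's own constants.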

I'm running Assise on a QEMU/KVM virtual machine with 4 cores, Linux kernel 5.1, and 8GB of RAM. I've tried running it on 128MB, 1GB, 2GB, and 3GB of emulated PM and get the segmentation fault on all of them.

I also tried disabling the DISTRIBUTED compilation flag, but ran into build issues; I can post more details about that if I need to remove this flag to get things to work.

Thanks in advance for your help!

simpeter commented May 12, 2021 via email

Contributor

wreda commented May 13, 2021

I've added myself as a watcher, so I should be getting notifications.

@hayley-leblanc: There's no need to disable the DISTRIBUTED flag, as it has been deprecated. The steps you followed in the README should be sufficient. Since you've modified the storage configuration, I'd first double-check that you rebuilt both LibFS and KernFS and reran mkfs.sh successfully.

If you already did that, I'll likely need more context to know what might be causing this. Can you rerun KernFS in gdb and share the stack trace? You will need to first recompile KernFS with the -g flag.

@hayley-leblanc
Author

I double-checked that I cleaned and rebuilt LibFS and KernFS, ran change_dev_size.py, re-ran mkfs.sh, etc. with the new configurations, but I'm still running into the issue. Here's the output from running KernFS in gdb:

Starting program: /usr/bin/numactl -N0 -m0 kernfs
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
process 3005 is executing new program: /home/novavm/vmshare/assise/kernfs/tests/kernfs
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
initialize file system
dev-dax engine is initialized: dev_path /dev/dax0.0 size 3072 MB
[New Thread 0x7fff371ff700 (LWP 3009)]
[New Thread 0x7fff369fe700 (LWP 3010)]
[New Thread 0x7fff361fd700 (LWP 3011)]
[New Thread 0x7fff359fc700 (LWP 3012)]
[New Thread 0x7fff351fb700 (LWP 3013)]
[New Thread 0x7fff349fa700 (LWP 3014)]
[New Thread 0x7fff341f9700 (LWP 3015)]
[New Thread 0x7fff339f8700 (LWP 3016)]
[New Thread 0x7fff331f7700 (LWP 3017)]
Reading root inode with inum: 1fetching node's IP address..
Process pid is 3005
ip address on interface 'lo' is 127.0.0.1
cluster settings:
--- node 0 - ip:127.0.0.1
[New Thread 0x7fff329f6700 (LWP 3020)]
MLFS cluster initialized
[Local-Server] Listening on port 12345 for connections. interrupt (^C) to exit.
Adding connection with sockfd: 0
[New Thread 0x7fff321f5700 (LWP 3031)]
Adding connection with sockfd: 1
RECV <-- MSG_INIT [pid 0]
[New Thread 0x7fff319f4700 (LWP 3032)]
[add_peer_socket():80] Peer connected (ip: 127.0.0.1, pid: 3025)
[add_peer_socket():98] Established connection with 127.0.0.1 on sock:0 of type:0 and peer:0x7fff30e0f000
RECV <-- MSG_INIT [pid 2]
Adding connection with sockfd: 2
SEND --> MSG_SHM [paths: /shm_recv_0|/shm_send_0]
start shmem_poll_loop for sockfd 0
[add_peer_socket():98] Established connection with 127.0.0.1 on sock:1 of type:2 and peer:0x7fff30e0f000
SEND --> MSG_SHM [paths: /shm_recv_1|/shm_send_1]
start shmem_poll_loop for sockfd 1
[New Thread 0x7fff30bff700 (LWP 3033)]
RECV <-- MSG_INIT [pid 1]
[add_peer_socket():98] Established connection with 127.0.0.1 on sock:2 of type:1 and peer:0x7fff30e0f000
SEND --> MSG_SHM [paths: /shm_recv_2|/shm_send_2]
start shmem_poll_loop for sockfd 2
00000000000000000000000000000001
[New Thread 0x7fff2ffff700 (LWP 3034)]
[New Thread 0x7fff2f7fe700 (LWP 3035)]
Adding connection with sockfd: 3
[New Thread 0x7fff2effd700 (LWP 3048)]
Adding connection with sockfd: 4
RECV <-- MSG_INIT [pid 0]
[add_peer_socket():98] Established connection with 127.0.0.1 on sock:3 of type:0 and peer:0x7fff30e0f000
[New Thread 0x7fff2e7fc700 (LWP 3049)]
SEND --> MSG_SHM [paths: /shm_recv_3|/shm_send_3]
Adding connection with sockfd: 5
RECV <-- MSG_INIT [pid 2]
start shmem_poll_loop for sockfd 3
[New Thread 0x7fff2dbff700 (LWP 3050)]
[add_peer_socket():98] Established connection with 127.0.0.1 on sock:4 of type:2 and peer:0x7fff30e0f000
SEND --> MSG_SHM [paths: /shm_recv_4|/shm_send_4]
start shmem_poll_loop for sockfd 4
RECV <-- MSG_INIT [pid 1]
[add_peer_socket():98] Established connection with 127.0.0.1 on sock:5 of type:1 and peer:0x7fff30e0f000
SEND --> MSG_SHM [paths: /shm_recv_5|/shm_send_5]
start shmem_poll_loop for sockfd 5
00000000000000000000000000000011
[New Thread 0x7fff2cdff700 (LWP 3051)]
[New Thread 0x7fff2c5fe700 (LWP 3052)]

Thread 17 "kernfs" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff2effd700 (LWP 3048)]
0x00007ffff7f4b9dd in init_replication (remote_log_id=remote_log_id@entry=2, peer=0x7ffff746d0c0, begin=begin@entry=644609, size=size@entry=906753, addr=addr@entry=0, end=0x7fff2dc0f020) at ./global/mem.h:36
36		return calloc(1, size);

And the stack trace:

#0  0x00007ffff7f4b9dd in init_replication (
    remote_log_id=remote_log_id@entry=2, peer=0x7ffff746d0c0, 
    begin=begin@entry=644609, size=size@entry=906753, addr=addr@entry=0, 
    end=0x7fff2dc0f020) at ./global/mem.h:36
#1  0x00007ffff7f4d24b in register_peer_log (peer=0x7fff30e0f000, 
    find_id=<optimized out>) at distributed/peer.c:271
#2  0x00007ffff7f57d31 in signal_callback (msg=0x7ffff789f008) at fs.c:2389
#3  0x00007ffff7b11e09 in shmem_poll_loop (sockfd=sockfd@entry=3)
    at shmem_ch.c:106
#4  0x00007ffff7b121a6 in local_server_thread (arg=<optimized out>)
    at shmem_ch.c:339
#5  0x00007ffff7d18609 in start_thread (arg=<optimized out>)
    at pthread_create.c:477
#6  0x00007ffff7e54293 in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Contributor

wreda commented May 24, 2021

It seems your segfault was due to an outdated mkdir_user script. It was calling init_fs() explicitly, which is not needed in the case of Assise (since this function is called automatically by LibFS). I've introduced a patch that addresses this.

Please pull and rebuild LibFS, KernFS, and the tests directory. Let me know if you're still having issues.
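The fix described above removes the redundant call from the test program. As a general illustration of the failure mode, a library can also make its initialization entry point idempotent so that an extra explicit call is harmless. The sketch below shows that pattern with C11 atomics. The names `example_init_fs`, `do_setup`, and `example_setup_runs` are hypothetical; this is not Assise's `init_fs()` and not the patch referenced above.

```c
#include <stdatomic.h>

/* Set once by the first caller; later callers see it already set. */
static atomic_flag fs_initialized = ATOMIC_FLAG_INIT;

/* Counts how many times the real setup ran (for demonstration only). */
static int setup_runs = 0;

/* Stand-in for one-time, non-repeatable setup work. */
static void do_setup(void)
{
    setup_runs++;
}

/* Safe to call from a library startup hook and again from application code. */
void example_init_fs(void)
{
    /* test_and_set returns the previous value: false only for the first caller. */
    if (atomic_flag_test_and_set(&fs_initialized))
        return;
    do_setup();
}

int example_setup_runs(void)
{
    return setup_runs;
}
```

Calling `example_init_fs()` twice runs `do_setup()` only once. Note that this guard only prevents repeated setup; a concurrent second caller may return before the first caller's setup has finished, which `pthread_once` would avoid by blocking.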
