Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash on pg_reload_conf() #10

Closed
anse1 opened this issue Sep 9, 2017 · 11 comments
Closed

Crash on pg_reload_conf() #10

anse1 opened this issue Sep 9, 2017 · 11 comments
Assignees
Labels

Comments

@anse1
Copy link

anse1 commented Sep 9, 2017

Hi,

running sqlsmith on a database with powa installed produced
crashes on pg_reload_conf(). In the backtrace below the handler
fired in a CriticalSection, where AllocSetContextCreate is
forbidden. There are probably a lot more failure modes. For
example, any elog() would be promoted to a PANIC in a Critical
Section.

IMHO powa must do this like the romans do: Don't do anything in
the signal handler besides setting a flag that a reload is
requested and do the actual work outside the handler where it is
safe to do so.

regards,
Andreas

Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
58      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
#1  0x00007f455d6b440a in __GI_abort () at abort.c:89
#2  0x000055b745359ec3 in ExceptionalCondition (conditionName=conditionName@entry=0x55b7453dae63 "!(CritSectionCount == 0)", errorType=errorType@entry=0x55b74539d29d "FailedAssertion", fileName=fileName@e$
#3  0x000055b74537f1a2 in MemoryContextCreate (tag=tag@entry=T_AllocSetContext, size=size@entry=200, methods=methods@entry=0x55b7458023e0 <AllocSetMethods>, parent=0x55b745a55dc0, name=name@entry=0x55b745$
#4  0x000055b74537e5a0 in AllocSetContextCreate (parent=<optimized out>, name=name@entry=0x55b74551c03c "config file processing", minContextSize=minContextSize@entry=0, initBlockSize=initBlockSize@entry=8$
#5  0x000055b745379905 in ProcessConfigFile (context=PGC_SIGHUP) at guc-file.l:147
#6  0x00007f455d057dae in powa_sighup (postgres_signal_arg=<optimized out>) at powa.c:342
#7  <signal handler called>
#8  0x00007f455d75bbc0 in __close_nocancel () at ../sysdeps/unix/syscall-template.S:84
#9  0x000055b74504709c in XLogFileClose () at xlog.c:3557
#10 0x000055b74504740d in XLogWrite (WriteRqst=..., flexible=flexible@entry=0 '\000') at xlog.c:2253
#11 0x000055b74504809e in XLogFlush (record=15030149624) at xlog.c:2728
#12 0x000055b74503e6ca in RecordTransactionCommit () at xact.c:1286
#13 CommitTransaction () at xact.c:2039
#14 0x000055b74503f465 in CommitTransactionCommand () at xact.c:2874
#15 0x00007f455d057ff2 in powa_main (main_arg=<optimized out>) at powa.c:293
#16 0x000055b7451e560b in StartBackgroundWorker () at bgworker.c:742
#17 0x000055b7451f1135 in do_start_bgworker (rw=0x55b745a48ba0) at postmaster.c:5612
#18 maybe_start_bgworkers () at postmaster.c:5809
#19 0x000055b7451f14a3 in reaper (postgres_signal_arg=<optimized out>) at postmaster.c:2834
#20 <signal handler called>
#21 0x00007f455d761273 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:84
#22 0x000055b744fb29be in ServerLoop () at postmaster.c:1683
#23 0x000055b7451f3153 in PostmasterMain (argc=3, argv=0x55b745a245b0) at postmaster.c:1327
#24 0x000055b744fb469d in main (argc=3, argv=0x55b745a245b0) at main.c:228
@anse1
Copy link
Author

anse1 commented Sep 9, 2017 via email

@rjuju rjuju self-assigned this Sep 10, 2017
@rjuju rjuju added the bug label Sep 10, 2017
@rjuju
Copy link
Member

rjuju commented Sep 10, 2017

Thanks again for the report! I definitely agree about the signal handler, I'll refactor the code to use a flag. BTW, I think only elog(ERROR) and above are promoted to PANIC in a critsection (not a good reason to use elog inside such sections though). And thanks also for sqlsmith, great tool as usual :)

@rjuju
Copy link
Member

rjuju commented Sep 10, 2017

@anse1 do you still have the query(ies) that raised this issue, or should I try to hit it running sqlsmith?

@anse1
Copy link
Author

anse1 commented Sep 10, 2017 via email

@anse1
Copy link
Author

anse1 commented Sep 11, 2017

Digging some more through the collected core dumps, I also found
the one below with a backtrace that doesn't sport the signal handler.
Looking at the code, the failed assertion did hold a couple of lines earlier
in _bt_page_recyclable(page). Since I don't see code in between
that could change the outcome of the assertion, I think that the
signal handler was the culprit here as well.

Core was generated by `postgres: bgworker: powa   snapshot                              '.
Program terminated with signal SIGABRT, Aborted.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
#1  0x00007f28c03bf40a in __GI_abort () at abort.c:89
#2  0x000055b6718afec3 in ExceptionalCondition (conditionName=conditionName@entry=0x55b6718f3950 "!(((PageHeader) (page))->pd_special >= (__builtin_offsetof (PageHeaderData, pd_linp)))", errorType=errorType@entry=0x55b6718f329d "FailedAssertion", fileName=fileName@entry=0x55b6718f38c8 "../../../../src/include/storage/bufpage.h", lineNumber=lineNumber@entry=314) at assert.c:54
#3  0x000055b67156316e in PageValidateSpecialPointer (page=0x7f28bdf35300 "") at ../../../../src/include/storage/bufpage.h:314
#4  _bt_getbuf (rel=rel@entry=0x7f28bc88f1e0, access=access@entry=2, blkno=18) at nbtpage.c:629
#5  0x000055b67156321c in _bt_getbuf (rel=rel@entry=0x7f28bc88f1e0, blkno=blkno@entry=4294967295, access=access@entry=2) at nbtpage.c:573
#6  0x000055b6715605f5 in _bt_split (newitemonleft=1 '\001', newitem=0x55b674471670, newitemsz=16, newitemoff=2, firstright=192, cbuf=0, buf=1628, rel=0x7f28bc88f1e0) at nbtinsert.c:986
#7  _bt_insertonpg (rel=rel@entry=0x7f28bc88f1e0, buf=buf@entry=1628, cbuf=cbuf@entry=0, stack=stack@entry=0x55b674471600, itup=itup@entry=0x55b674471670, newitemoff=2, split_only_page=0 '\000') at nbtinsert.c:781
#8  0x000055b6715628d4 in _bt_doinsert (rel=rel@entry=0x7f28bc88f1e0, itup=itup@entry=0x55b674471670, checkUnique=checkUnique@entry=UNIQUE_CHECK_NO, heapRel=heapRel@entry=0x7f28bc887be0) at nbtinsert.c:204
#9  0x000055b671566ad1 in btinsert (rel=0x7f28bc88f1e0, values=<optimized out>, isnull=<optimized out>, ht_ctid=0x55b673635134, heapRel=0x7f28bc887be0, checkUnique=UNIQUE_CHECK_NO) at nbtree.c:284
#10 0x000055b671689e92 in ExecInsertIndexTuples (slot=slot@entry=0x55b6736fddf0, tupleid=tupleid@entry=0x55b673635134, estate=estate@entry=0x55b6736ea9d0, noDupErr=noDupErr@entry=0 '\000', specConflict=specConflict@entry=0x0, arbiterIndexes=arbiterIndexes@entry=0x0) at execIndexing.c:388
#11 0x000055b6716aa9c9 in ExecInsert (canSetTag=0 '\000', estate=0x55b6736ea9d0, onconflict=<optimized out>, arbiterIndexes=0x0, planSlot=0x55b6736fddf0, slot=0x55b6736fddf0, mtstate=0x55b6736f80c8) at nodeModifyTable.c:496
#12 ExecModifyTable (node=node@entry=0x55b6736f80c8) at nodeModifyTable.c:1511
#13 0x000055b67168edf8 in ExecProcNode (node=node@entry=0x55b6736f80c8) at execProcnode.c:396
#14 0x000055b67168ae93 in ExecPostprocessPlan (estate=0x55b6736ea9d0) at execMain.c:1421
#15 standard_ExecutorFinish (queryDesc=0x55b6735b5198) at execMain.c:407
#16 0x00007f28bff67c75 in pgss_ExecutorFinish (queryDesc=0x55b6735b5198) at pg_stat_statements.c:900
#17 0x00007f28bfb5be43 in pgqs_ExecutorFinish (queryDesc=0x55b6735b5198) at pg_qualstats.c:592
#18 0x000055b6716ba72f in _SPI_pquery (tcount=<optimized out>, fire_triggers=<optimized out>, queryDesc=<optimized out>) at spi.c:2424
#19 _SPI_execute_plan (plan=plan@entry=0x55b673354c90, paramLI=paramLI@entry=0x55b673983cf8, snapshot=snapshot@entry=0x0, crosscheck_snapshot=crosscheck_snapshot@entry=0x0, read_only=read_only@entry=0 '\000', fire_triggers=fire_triggers@entry=1 '\001', tcount=<optimized out>) at spi.c:2188
#20 0x000055b6716bac2a in SPI_execute_plan_with_paramlist (plan=0x55b673354c90, params=0x55b673983cf8, read_only=<optimized out>, tcount=1) at spi.c:446
#21 0x00007f28bc903ad2 in exec_stmt_execsql (estate=estate@entry=0x7ffe72a8b4b0, stmt=stmt@entry=0x55b67331d2b8) at pl_exec.c:3515
#22 0x00007f28bc904bc6 in exec_stmt (stmt=0x55b67331d2b8, estate=0x7ffe72a8b4b0) at pl_exec.c:1503
#23 exec_stmts (estate=0x7ffe72a8b4b0, stmts=<optimized out>) at pl_exec.c:1398
#24 0x00007f28bc907627 in exec_stmt_block (estate=estate@entry=0x7ffe72a8b4b0, block=0x55b67331d708) at pl_exec.c:1336
#25 0x00007f28bc907845 in plpgsql_exec_function (func=func@entry=0x55b673246ff0, fcinfo=fcinfo@entry=0x55b67397e138, simple_eval_estate=simple_eval_estate@entry=0x0) at pl_exec.c:434
#26 0x00007f28bc8fbf95 in plpgsql_call_handler (fcinfo=0x55b67397e138) at pl_handler.c:255
#27 0x000055b67168fd64 in ExecMakeFunctionResultNoSets (fcache=0x55b67397e0c8, econtext=0x55b67397dea0, isNull=0x55b67397ea78 "\177~\177\177\177\177\177\177p\244Ps\266U", isDone=<optimized out>) at execQual.c:2041
#28 0x000055b67169624e in ExecTargetList (tupdesc=<optimized out>, isDone=0x7ffe72a8b80c, itemIsDone=0x55b67397ebb8, isnull=0x55b67397ea78 "\177~\177\177\177\177\177\177p\244Ps\266U", values=0x55b67397ea58, econtext=0x55b67397dea0, targetlist=0x55b67397eb80) at execQual.c:5423
#29 ExecProject (projInfo=<optimized out>, isDone=isDone@entry=0x7ffe72a8b80c) at execQual.c:5647
#30 0x000055b6716ad169 in ExecResult (node=node@entry=0x55b67397dd88) at nodeResult.c:155
#31 0x000055b67168ee08 in ExecProcNode (node=node@entry=0x55b67397dd88) at execProcnode.c:392
#32 0x000055b67168ac8e in ExecutePlan (dest=0x55b671d4da60 <spi_printtupDR>, direction=<optimized out>, numberTuples=0, sendTuples=<opgtimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x55b67397dd88, estate=0x55b67397dc70) at execMain.c:1569
#33 standard_ExecutorRun (queryDesc=0x55b67397c080, direction=<optimized out>, count=0) at execMain.c:338
#34 0x00007f28bff67d5d in pgss_ExecutorRun (queryDesc=0x55b67397c080, direction=ForwardScanDirection, count=0) at pg_stat_statements.c:877
#35 0x00007f28bfb5bf25 in pgqs_ExecutorRun (queryDesc=0x55b67397c080, direction=ForwardScanDirection, count=0) at pg_qualstats.c:564
#36 0x000055b6716ba87f in _SPI_pquery (tcount=0, fire_triggers=1 '\001', queryDesc=0x55b67397c080) at spi.c:2412
#37 _SPI_execute_plan (plan=plan@entry=0x7ffe72a8bc10, paramLI=paramLI@entry=0x0, snapshot=snapshot@entry=0x0, crosscheck_snapshot=crosscheck_snapshot@entry=0x0, read_only=read_only@entry=0 '\000', fire_triggers=fire_triggers@entry=1 '\001', tcount=<optimized out>) at spi.c:2188
#38 0x000055b6716baac3 in SPI_execute (src=src@entry=0x55b673968978 "SELECT powa_statements_snapshot()", read_only=<optimized out>, tcount=tcount@entry=0) at spi.c:383
#39 0x00007f28bc906186 in exec_stmt_dynexecute (stmt=0x55b673269f80, estate=0x7ffe72a8c310) at pl_exec.c:3707
#40 exec_stmt (stmt=0x55b673269f80, estate=0x7ffe72a8c310) at pl_exec.c:1507
#41 exec_stmts (estate=0x7ffe72a8c310, stmts=<optimized out>) at pl_exec.c:1398
#42 0x00007f28bc9074d3 in exec_stmt_block (estate=estate@entry=0x7ffe72a8c310, block=block@entry=0x55b67326b0d8) at pl_exec.c:1197
#43 0x00007f28bc904dff in exec_stmt (stmt=0x55b67326b0d8, estate=0x7ffe72a8c310) at pl_exec.c:1431
#44 exec_stmts (estate=0x7ffe72a8c310, stmts=<optimized out>) at pl_exec.c:1398
#45 0x00007f28bc9086f1 in exec_for_query (estate=estate@entry=0x7ffe72a8c310, stmt=stmt@entry=0x55b673269798, portal=0x55b6732c6130, prefetch_ok=prefetch_ok@entry=1 '\001') at pl_exec.c:5209
#46 0x00007f28bc90503e in exec_stmt_fors (stmt=0x55b673269798, estate=0x7ffe72a8c310) at pl_exec.c:2116
#47 exec_stmt (stmt=0x55b673269798, estate=0x7ffe72a8c310) at pl_exec.c:1467
#48 exec_stmts (estate=0x7ffe72a8c310, stmts=<optimized out>) at pl_exec.c:1398
#49 0x00007f28bc907627 in exec_stmt_block (estate=estate@entry=0x7ffe72a8c310, block=0x55b673270810) at pl_exec.c:1336
#50 0x00007f28bc907845 in plpgsql_exec_function (func=func@entry=0x55b67323f920, fcinfo=fcinfo@entry=0x55b673727058, simple_eval_estate=simple_eval_estate@entry=0x0) at pl_exec.c:434
#51 0x00007f28bc8fbf95 in plpgsql_call_handler (fcinfo=0x55b673727058) at pl_handler.c:255
#52 0x000055b67168fd64 in ExecMakeFunctionResultNoSets (fcache=0x55b673726fe8, econtext=0x55b673726dc0, isNull=0x55b673727998 "\b~", isDone=<optimized out>) at execQual.c:2041
#53 0x000055b67169624e in ExecTargetList (tupdesc=<optimized out>, isDone=0x7ffe72a8c66c, itemIsDone=0x55b673727ad8, isnull=0x55b673727998 "\b~", values=0x55b673727978, econtext=0x55b673726dc0, targetlist=0x55b673727aa0) at execQual.c:5423
#54 ExecProject (projInfo=<optimized out>, isDone=isDone@entry=0x7ffe72a8c66c) at execQual.c:5647
#55 0x000055b6716ad169 in ExecResult (node=node@entry=0x55b673726ca8) at nodeResult.c:155
#56 0x000055b67168ee08 in ExecProcNode (node=node@entry=0x55b673726ca8) at execProcnode.c:392
#57 0x000055b67168ac8e in ExecutePlan (dest=0x55b671d4da60 <spi_printtupDR>, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x55b673726ca8, estate=0x55b673726b90) at execMain.c:1569
#58 standard_ExecutorRun (queryDesc=0x55b673722fa0, direction=<optimized out>, count=0) at execMain.c:338
#59 0x00007f28bff67d5d in pgss_ExecutorRun (queryDesc=0x55b673722fa0, direction=ForwardScanDirection, count=0) at pg_stat_statements.c:877
#60 0x00007f28bfb5bf25 in pgqs_ExecutorRun (queryDesc=0x55b673722fa0, direction=ForwardScanDirection, count=0) at pg_qualstats.c:564
#61 0x000055b6716ba87f in _SPI_pquery (tcount=0, fire_triggers=1 '\001', queryDesc=0x55b673722fa0) at spi.c:2412
#62 _SPI_execute_plan (plan=plan@entry=0x7ffe72a8ca70, paramLI=paramLI@entry=0x0, snapshot=snapshot@entry=0x0, crosscheck_snapshot=crosscheck_snapshot@entry=0x0, read_only=read_only@entry=0 '\000', fire_triggers=fire_triggers@entry=1 '\001', tcount=<optimized out>) at spi.c:2188
#63 0x000055b6716baac3 in SPI_execute (src=src@entry=0x7f28bfd63398 "SELECT powa_take_snapshot()", read_only=read_only@entry=0 '\000', tcount=tcount@entry=0) at spi.c:383
#64 0x00007f28bfd62fc2 in powa_main (main_arg=<optimized out>) at powa.c:288
#65 0x000055b67173b60b in StartBackgroundWorker () at bgworker.c:742
#66 0x000055b671747135 in do_start_bgworker (rw=0x55b672ed1ba0) at postmaster.c:5612
#67 maybe_start_bgworkers () at postmaster.c:5809
#68 0x000055b6717474a3 in reaper (postgres_signal_arg=<optimized out>) at postmaster.c:2834
#69 <signal handler called>
#70 0x00007f28c046c273 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:84
#71 0x000055b6715089be in ServerLoop () at postmaster.c:1683
#72 0x000055b671749153 in PostmasterMain (argc=3, argv=0x55b672ead5b0) at postmaster.c:1327
#73 0x000055b67150a69d in main (argc=3, argv=0x55b672ead5b0) at main.c:228
(gdb) p debug_query_string
$1 = 0x0
(gdb) frame 3
#3  0x000055b67156316e in PageValidateSpecialPointer (page=0x7f28bdf35300 "") at ../../../../src/include/storage/bufpage.h:314
(gdb) up
#4  _bt_getbuf (rel=rel@entry=0x7f28bc88f1e0, access=access@entry=2, blkno=18) at nbtpage.c:629
(gdb) list
624						 * WAL record that will allow us to conflict with queries
625						 * running on standby.
626						 */
627						if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
628						{
629							BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
630	
631							_bt_log_reuse_page(rel, blkno, opaque->btpo.xact);
632						}
633	
(gdb) list -
614				if (blkno == InvalidBlockNumber)
615					break;
616				buf = ReadBuffer(rel, blkno);
617				if (ConditionalLockBuffer(buf))
618				{
619					page = BufferGetPage(buf);
620					if (_bt_page_recyclable(page))
621					{
622						/*
623						 * If we are generating WAL for Hot Standby then create a

@anse1
Copy link
Author

anse1 commented Sep 11, 2017

Just to clarify: I didn't notice your fix on the branch, so my last report was still about a core dump with the old handler.

I'll do some testing with the new handler this evening.

@rjuju
Copy link
Member

rjuju commented Sep 11, 2017

Thanks! Actually I forgot to push it yesterday :/ FTR, @marco44 has also been doing some testing on it, and it looks ok. He couldn't reproduce the same problem as what's reported here though. I hope I'll also be able to do some more testing too soon.

@anse1
Copy link
Author

anse1 commented Sep 11, 2017 via email

@rjuju
Copy link
Member

rjuju commented Sep 11, 2017

Thanks a lot for the confirmation!

I'll do some more testing to make sure the frequency is updated as needed in every situation and push the fix.

@rjuju
Copy link
Member

rjuju commented Sep 18, 2017

Sorry for late replied. I double checked the patch and everything looks goof to me, so I just pushed 3e71401. Thanks again for reporting this issue!

@rjuju
Copy link
Member

rjuju commented Sep 19, 2017

I just released new version including this bugfix.

Thanks!

@rjuju rjuju closed this as completed Sep 19, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants