View Issue Details

IDProjectCategoryView StatusLast Update
0002572GNUnetDHT servicepublic2012-11-05 18:33
ReporterLRN Assigned ToLRN  
PriorityhighSeveritycrashReproducibilityalways
Status closedResolutionfixed 
Product VersionGit master 
Target Version0.9.4Fixed in Version0.9.4 
Summary0002572: Crash in dht service during test_stream_local
Descriptionsubj
Additional Information
Reading symbols from d:\progs\gnunet\bin\gnunet-service-dht.exe...done.
Continuing.

Program received signal SIGTRAP, Trace/breakpoint trap.
[Switching to Thread 12860.0xbe4]
0x74fd280d in KERNELBASE!DeleteAce () from C:\Windows\syswow64\KernelBase.dll
(gdb) up
#1  0x62ac6abc in GNUNET_abort () at common_logging.c:291
291       DebugBreak ();
(gdb)
#2  0x62afa2ac in GNUNET_SCHEDULER_cancel (task=13451671603782742029) at scheduler.c:918
918           GNUNET_assert (0);
(gdb)
#3  0x62acbd7f in connect_fail_continuation (connection=0x1d59c78) at connection.c:564
564         GNUNET_SCHEDULER_cancel (connection->nth.timeout_task);
(gdb) p/x *connection->nth
Structure has no component named operator*.
(gdb) p/x connection->nth
$1 = {notify_ready = 0xbaadf00d, notify_ready_cls = 0xbaadf00d, connection = 0xbaadf00d, transmit_timeout = {abs_value = 0xbaadf00dbaadf00d}, timeout_task = 0xbaadf00dbaadf00d, notify_size = 0xbaadf00d}
(gdb) p/x *connection
$2 = {cfg = 0x1d64930, ap_head = 0x1d584a8, ap_tail = 0xbaadf00d, addr = 0xbaadf00d, hostname = 0xbaadf00d, sock = 0xbaadf00d, receiver = 0xbaadf00d, receiver_cls = 0xbaadf00d, write_buffer = 0xbaadf00d, write_buffer_size = 0xbaadf00d,
  write_buffer_off = 0xbaadf00d, write_buffer_pos = 0xbaadf00d, addrlen = 0xbaadf00d, read_task = 0xbaadf00dbaadf00d, write_task = 0xbaadf00dbaadf00d, dns_active = 0xbaadf00d, nth = {notify_ready = 0xbaadf00d, notify_ready_cls = 0xbaadf00d,
    connection = 0xbaadf00d, transmit_timeout = {abs_value = 0xbaadf00dbaadf00d}, timeout_task = 0xbaadf00dbaadf00d, notify_size = 0xbaadf00d}, receive_timeout = {abs_value = 0xbaadf00dbaadf00d}, max = 0xbaadf00d, port = 0xf00d, persist = 0xbaad}
(gdb) bt
#0  0x74fd280d in KERNELBASE!DeleteAce () from C:\Windows\syswow64\KernelBase.dll
#1  0x62ac6abc in GNUNET_abort () at common_logging.c:291
#2  0x62afa2ac in GNUNET_SCHEDULER_cancel (task=13451671603782742029) at scheduler.c:918
#3  0x62acbd7f in connect_fail_continuation (connection=0x1d59c78) at connection.c:564
#4  0x62acc653 in connect_probe_continuation (cls=0x1d10440, tc=0x28fc70) at connection.c:662
#5  0x62af95bc in run_ready (rs=0x739fe8, ws=0x73b000) at scheduler.c:602
#6  0x62af9d5e in GNUNET_SCHEDULER_run (task=0x62b060b7 <service_task>, task_cls=0x28fe08) at scheduler.c:790
#7  0x62b06e16 in GNUNET_SERVICE_run (argc=3, argv=0x1d0bce0, service_name=0x41009c <__register_frame_info+4259996> "dht", options=GNUNET_SERVICE_OPTION_NONE, task=0x4015cc <run>, task_cls=0x0) at service.c:1804
#8  0x004017a6 in main (argc=3, argv=0x1d0bce0) at gnunet-service-dht.c:184
TagsNo tags attached.
Attached Files
dht-12860.log (25,980 bytes)

Relationships

related to 0002495 closedChristian Grothoff Memory corruption in core / client. 

Activities

LRN

2012-09-30 20:16

developer   ~0006393

Last edited: 2012-09-30 20:16

Log of dht service prior to the crash is attached - https://gnunet.org/bugs/file_download.php?file_id=553&type=bug .
My understanding is that testcase is shutting down, other services are down, and dht can't connect to them anymore, but still tries that (for some reason), even though it did get a GNUNET_SCHEDULER_REASON_SHUTDOWN signal in its parent control handler.

Christian Grothoff

2012-10-06 16:04

manager   ~0006404

I've triple-checked the task management code in connection.c itself --- I cannot find anything wrong there :-(. The super-high task ID indicates that we're using bad memory (use-after-free, etc.), but connection.c itself seems fine. So it could be memory corruption anywhere in the DHT service :-(.

Christian Grothoff

2012-10-06 16:06

manager   ~0006405

Last edited: 2012-10-06 16:06

$ valgrind --trace-children=yes .libs/test_stream_local

==20365== Command: /home/grothoff/bin/gnunet-service-dht -c /tmp/test_stream_localucZR0a/0/config
==20365==
==20365== Jump to the invalid address stated on the next line
==20365== at 0x5D6: ???
==20365== Address 0x5d6 is not stack'd, malloc'd or (recently) free'd
==20365==
==20365==
==20365== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==20365== Bad permissions for mapped region at address 0x5D6
==20365== at 0x5D6: ???
==20365==
==20365== Command: /home/grothoff/bin/gnunet-service-dht -c /tmp/test_stream_localucZR0a/0/config
==20365==
==20365== Jump to the invalid address stated on the next line
==20365== at 0x5D6: ???
==20365== Address 0x5d6 is not stack'd, malloc'd or (recently) free'd
==20365==
==20365==
==20365== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==20365== Bad permissions for mapped region at address 0x5D6
==20365== at 0x5D6: ???
==20365==

--- just as funky, could be very much related.

Christian Grothoff

2012-10-06 16:08

manager   ~0006406

Looks a bit different every time:

==20416== Command: /home/grothoff/bin/gnunet-service-dht -c /tmp/test_stream_localhr2lFR/0/config
==20416==
==20416== Jump to the invalid address stated on the next line
==20416== at 0x5D6: ???
==20416== by 0x40056C7: open_verify (dl-load.c:1911)
==20416== by 0x4005D57: open_path (dl-load.c:2172)
==20416== by 0x4007D41: _dl_map_object (dl-load.c:2378)
==20416== by 0x400D211: openaux (dl-deps.c:65)
==20416== by 0x400DBD5: _dl_catch_error (dl-error.c:178)
==20416== by 0x400C2DB: _dl_map_object_deps (dl-deps.c:247)
==20416== by 0x40032B6: dl_main (rtld.c:1815)
==20416== by 0x4014DDD: _dl_sysdep_start (dl-sysdep.c:244)
==20416== by 0x4001422: _dl_start (rtld.c:342)
==20416== by 0x4000AF7: ??? (in /lib/x86_64-linux-gnu/ld-2.13.so)
==20416== by 0x2: ???
==20416== Address 0x5d6 is not stack'd, malloc'd or (recently) free'd
==20416==
==20416==
==20416== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==20416== Bad permissions for mapped region at address 0x5D6
==20416== at 0x5D6: ???
==20416== by 0x40056C7: open_verify (dl-load.c:1911)
==20416== by 0x4005D57: open_path (dl-load.c:2172)
==20416== by 0x4007D41: _dl_map_object (dl-load.c:2378)
==20416== by 0x400D211: openaux (dl-deps.c:65)
==20416== by 0x400DBD5: _dl_catch_error (dl-error.c:178)
==20416== by 0x400C2DB: _dl_map_object_deps (dl-deps.c:247)
==20416== by 0x40032B6: dl_main (rtld.c:1815)
==20416== by 0x4014DDD: _dl_sysdep_start (dl-sysdep.c:244)
==20416== by 0x4001422: _dl_start (rtld.c:342)
==20416== by 0x4000AF7: ??? (in /lib/x86_64-linux-gnu/ld-2.13.so)
==20416== by 0x2: ???

Christian Grothoff

2012-10-06 16:11

manager   ~0006407

Very timing dependent -- if test_stream_local is run without valgrind on *everything*, the bug doesn't appear (because the dht-service isn't even started!).

LRN

2012-10-07 02:54

developer   ~0006415

yes, it depends on timing. If i run it under full debug logging, dht service just hangs without crashing.

Christian Grothoff

2012-10-08 15:31

manager   ~0006423

Diagnosis from LRN:


(02:39:50 PM) LRN: connect_fail_continuation() calls signal_receive_error(), which calls the receiver callback, which destroys the connection completely. Then the control goes back to connect_fail_continuation(), which then proceeds to cancel connection's nth.timeout_task and call, and even calls signal_transmit_error() again (but it never reaches that, crashes on cancelling)
(02:40:32 PM) LRN: so the problem is either that the receiver callback kills the connection OR that connection calls the receiver first, and does cancellation second.
(02:41:36 PM) LRN: ah, wait, i'm confusing signal_receive_error() and signal_transmit_error()
(02:42:20 PM) LRN: ok, it doesn't matter. But then the problem is definitely in receiver killing the connection
(03:12:33 PM) LRN: receive_handler is process_ats_message, if gdb should be beleived


Possibly fixed in SVN 24225.

LRN

2012-10-08 16:12

developer   ~0006424

Fixed in r24222, r24223, r24225.

Issue History

Date Modified Username Field Change
2012-09-30 18:46 LRN New Issue
2012-09-30 20:16 LRN File Added: dht-12860.log
2012-09-30 20:16 LRN Note Added: 0006393
2012-09-30 20:16 LRN Note Edited: 0006393
2012-10-06 16:04 Christian Grothoff Note Added: 0006404
2012-10-06 16:06 Christian Grothoff Note Added: 0006405
2012-10-06 16:06 Christian Grothoff Note Edited: 0006405
2012-10-06 16:08 Christian Grothoff Note Added: 0006406
2012-10-06 16:11 Christian Grothoff Note Added: 0006407
2012-10-07 02:54 LRN Note Added: 0006415
2012-10-07 12:51 Christian Grothoff Priority normal => high
2012-10-07 12:51 Christian Grothoff Status new => confirmed
2012-10-07 12:51 Christian Grothoff Product Version => Git master
2012-10-08 15:31 Christian Grothoff Note Added: 0006423
2012-10-08 16:12 LRN Note Added: 0006424
2012-10-08 16:18 Christian Grothoff Status confirmed => resolved
2012-10-08 16:18 Christian Grothoff Fixed in Version => 0.9.4
2012-10-08 16:18 Christian Grothoff Resolution open => fixed
2012-10-08 16:18 Christian Grothoff Assigned To => LRN
2012-10-08 23:39 Christian Grothoff Relationship added related to 0002495
2012-11-05 17:50 Christian Grothoff Target Version => 0.9.4
2012-11-05 18:33 Christian Grothoff Status resolved => closed