View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0002713 | GNUnet | testbed service | public | 2012-12-21 00:00 | 2013-12-24 20:54 |
Reporter | Bart Polot | Assigned To | Bart Polot | ||
Priority | high | Severity | crash | Reproducibility | always |
Status | closed | Resolution | won't fix | ||
Product Version | 0.9.4 | ||||
Target Version | 0.10.0 | ||||
Summary | 0002713: GNUNET_TESTBED_operation_done kills dead task | ||||
Description | Program terminated with signal 6, Aborted. #0 0x00007f151c2dafd5 in raise () from /usr/lib/libc.so.6 (gdb) bt #0 0x00007f151c2dafd5 in raise () from /usr/lib/libc.so.6 #1 0x00007f151c2dc458 in abort () from /usr/lib/libc.so.6 #2 0x00007f151d718528 in GNUNET_abort () at common_logging.c:293 #3 0x00007f151d74fd69 in GNUNET_SCHEDULER_cancel (task=17601693521461418629) at scheduler.c:914 #4 0x00007f151e3afa88 in GNUNET_TESTBED_operation_release_ ( operation=0x7f151e3b10b0 <opstart_peer_stop>) at testbed_api_operations.c:373 #5 0x00007f151e3ad036 in GNUNET_TESTBED_operation_done ( operation=0x7f151e3b10b0 <opstart_peer_stop>) at testbed_api.c:2194 #6 0x0000000000404548 in GNUNET_MESH_TEST_cleanup (ctx=0x1986c20) at mesh_test_lib.c:205 #7 0x0000000000401b4c in disconnect_mesh_peers (cls=0x2a4, tc=0x7fff7e95d7f0) at test_mesh_small.c:279 #8 0x00007f151d74efa9 in run_ready (rs=0x1987ea0, ws=0x1987f30) at scheduler.c:597 #9 0x00007f151d74f7f8 in GNUNET_SCHEDULER_run (task=0x7f151d748580 <program_main>, task_cls=0x7fff7e95da90) at scheduler.c:786 #10 0x00007f151d748fb7 in GNUNET_PROGRAM_run2 (argc=3, argv=0x7fff7e95dc30, binaryName=0x404e0b "test_mesh_small", binaryHelp=0x7f151e3bb949 "nohelp", options=0x7fff7e95dbf0, task=0x7f151e3b77a8 <run>, task_cls=0x1986cf0, run_without_scheduler=0) at program.c:273 #11 0x00007f151d7490d8 in GNUNET_PROGRAM_run (argc=3, argv=0x7fff7e95dc30, binaryName=0x404e0b "test_mesh_small", binaryHelp=0x7f151e3bb949 "nohelp", options=0x7fff7e95dbf0, task=0x7f151e3b77a8 <run>, task_cls=0x1986cf0) at program.c:308 #12 0x00007f151e3b7ab6 in GNUNET_TESTBED_test_run (testname=0x404e0b "test_mesh_small", cfg_filename=0x404df6 "test_mesh_small.conf", num_peers=5, event_mask=0, cc=0x0, cc_cls=0x0, test_master=0x4045a9 <mesh_test_run>, test_master_cls=0x1986c20) at testbed_api_test.c:152 #13 0x000000000040481f in GNUNET_MESH_TEST_run (testname=0x404e0b "test_mesh_small", cfgname=0x404df6 "test_mesh_small.conf", num_peers=5, tmain=0x403926 <tmain>, tmain_cls=0x0, new_tunnel=0x4029a9 <incoming_tunnel>, cleaner=0x402c5d <tunnel_cleaner>, handlers=0x605860 <handlers>, stypes=0x0) at mesh_test_lib.c:287 #14 0x000000000040424e in main (argc=1, argv=0x7fff7e95de28) at test_mesh_small.c:983 scheduler.c:914: Attempt to cancel dead task As far as I am aware GNUNET_TESTBED_operation_done is not called twice. The code in mesh_test_lib does as follows: { GNUNET_assert (NULL != ctx->ops[i]); GNUNET_TESTBED_operation_done (ctx->ops[i]); ctx->ops[i] = NULL; } And that's the only call to op_done in the whole library... | ||||
Steps To Reproduce | Run test_mesh_small_signal, on shutdown the testcase crashes. | ||||
Additional Information | Other testcases using the same library do not crash (test_mesh_2dtorus, test_mesh_regex). Even test with (almost) the same code as test_mesh_small_unicast do not crash. | ||||
Tags | No tags attached. | ||||
Attached Files | mesh_test_fail.patch (704 bytes)
Index: mesh_test_lib.c =================================================================== --- mesh_test_lib.c (revision 25640) +++ mesh_test_lib.c (working copy) @@ -200,6 +200,7 @@ { unsigned int i; + GNUNET_break (0); for (i = 0; i < ctx->num_peers; i++) { GNUNET_assert (NULL != ctx->ops[i]); Index: gnunet-service-mesh.c =================================================================== --- gnunet-service-mesh.c (revision 25640) +++ gnunet-service-mesh.c (working copy) @@ -5906,6 +5906,7 @@ uint32_t ttl; size_t size; + GNUNET_assert (0); GNUNET_log (GNUNET_ERROR_TYPE_DEBUG, "got a unicast packet from %s\n", GNUNET_i2s (peer)); /* Check size */ | ||||
|
Seems related to crashes in the services. If a service crashes and restarts, the handler becomes invalid and I imagine it can have something to do with the bug... Shouldn't the testbed notify the "profiler" of a service crash? Or it already does and I am missing something? |
|
Which service is crashing? mesh or testbed? In any case the client handle shouldn't become invalid if a service crashes. So, I'm not sure which one you are referring here. Testbed currently does not support notifying profiler whenever a service crashes. From the back trace it tools like there is some linking pollution. In stack frames 4 and 5 'opstart_peer_stop' is a static function in testbed_api_peer.c. I wonder how it came here at the first place. Also, I cannot reproduce this on my laptop and desktop. I'm not sure if this is relevant: Christian has bumped library version numbers; you should do a `sudo ldconfig` to get the use the correct ones after `make install` |
|
What is crashing is (was)... both. 1. I start the testcase, with a bug in mesh service. 2. Mesh service crashes so the testcase eventually fails. 3. After a timeout the testcase shuts down and calls operation_done on the connect operation to the serivces. 4. The testcase also crashes when calling operation_done on what I suspect is the operation used to connect to the now dead service. |
|
BTW, you need to make a service that you connected to in the test crash in order to reproduce this problem. If the service doesn't crash, the operation_done succeeds and the testcase doesn't crash either. If you want a "simple way" to do force a crash, insert a GNUNET_assert (0) in any mesh_handler in the mesh service, for instance in handle_mesh_data_unicast and run any test_mesh_small_* test. |
|
Ok, I've added an assertion failure in mesh service but didn't observe any crash. harsha@nebula:~/repos/gnunet/src/mesh$ libtool --mode=execute gdb ./test_mesh_small_signal GNU gdb (GDB) 7.0.1-debian Copyright (C) 2009 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... Reading symbols from /home/harsha/repos/gnunet/src/mesh/.libs/lt-test_mesh_small_signal...done. (gdb) r Starting program: /home/harsha/repos/gnunet/src/mesh/.libs/lt-test_mesh_small_signal Dec 23 10:10:43-665985 mesh-894 ERROR Assertion failed at gnunet-service-mesh.c:7403. Dec 23 10:10:43-682386 mesh-930 ERROR Assertion failed at gnunet-service-mesh.c:7403. Dec 23 10:10:43-700037 mesh-931 ERROR Assertion failed at gnunet-service-mesh.c:7403. Dec 23 10:10:43-722007 mesh-932 ERROR Assertion failed at gnunet-service-mesh.c:7403. Dec 23 10:10:43-742763 mesh-933 ERROR Assertion failed at gnunet-service-mesh.c:7403. Dec 23 10:10:43-778547 mesh-934 ERROR Assertion failed at gnunet-service-mesh.c:7403. Dec 23 10:10:43-829667 mesh-935 ERROR Assertion failed at gnunet-service-mesh.c:7403. Dec 23 10:10:43-912846 mesh-936 ERROR Assertion failed at gnunet-service-mesh.c:7403. Dec 23 10:10:44-072649 mesh-937 ERROR Assertion failed at gnunet-service-mesh.c:7403. Dec 23 10:10:44-351751 mesh-938 ERROR Assertion failed at gnunet-service-mesh.c:7403. Dec 23 10:10:44-885130 mesh-939 ERROR Assertion failed at gnunet-service-mesh.c:7403. Dec 23 10:10:45-924201 mesh-940 ERROR Assertion failed at gnunet-service-mesh.c:7403. Dec 23 10:10:47-993462 mesh-941 ERROR Assertion failed at gnunet-service-mesh.c:7403. Dec 23 10:10:52-115602 mesh-942 ERROR Assertion failed at gnunet-service-mesh.c:7403. Dec 23 10:11:00-339378 mesh-943 ERROR Assertion failed at gnunet-service-mesh.c:7403. Dec 23 10:11:16-758287 mesh-944 ERROR Assertion failed at gnunet-service-mesh.c:7403. Dec 23 10:11:49-547251 mesh-949 ERROR Assertion failed at gnunet-service-mesh.c:7403. Dec 23 10:12:43-740792 test_mesh_small-880 ERROR FAILED! (0/5) Program exited with code 01. (gdb) |
|
Ok, I'm able to reproduce this; it occurred only when I put a break in handle_mesh_data_unicast. |
|
It looks like there is some memory corruption in mesh test case: The program being debugged has been started already. Start it from the beginning? (y or n) y Starting program: /home/harsha/repos/gnunet/src/mesh/.libs/lt-test_mesh_small_signal Breakpoint 1, mesh_test_run (cls=0x607d40, num_peers=5, peers=0x636b20) at mesh_test_lib.c:229 229 struct GNUNET_MESH_TEST_Context *ctx = cls; (gdb) bt #0 0x00007ffff6f8a1b5 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 #1 0x00007ffff6f8cfc0 in *__GI_abort () at abort.c:92 #2 0x00007ffff77568d1 in GNUNET_abort () at common_logging.c:293 #3 0x00007ffff778e72a in GNUNET_SCHEDULER_cancel (task=17025232769157995141) at scheduler.c:914 #4 0x00007ffff79c2e4c in GNUNET_TESTBED_operation_release_ (operation=0x7ffff79c4477) at testbed_api_operations.c:373 #5 0x00007ffff79c03b4 in GNUNET_TESTBED_operation_done (operation=0x7ffff79c4477) at testbed_api.c:2194 #6 0x00000000004045b6 in GNUNET_MESH_TEST_cleanup (ctx=0x607d40) at mesh_test_lib.c:206 #7 0x0000000000401af9 in disconnect_mesh_peers (cls=0x2a4, tc=0x7fffffffdbb0) at test_mesh_small.c:279 #8 0x00007ffff778d931 in run_ready (rs=0x608fc0, ws=0x609050) at scheduler.c:597 #9 0x00007ffff778e1b3 in GNUNET_SCHEDULER_run (task=0x7ffff7786ca8 <program_main>, task_cls=0x7fffffffde50) at scheduler.c:786 #10 0x00007ffff7787711 in GNUNET_PROGRAM_run2 (argc=3, argv=0x7fffffffe010, binaryName=0x404eeb "test_mesh_small", binaryHelp=0x7ffff79cee89 "nohelp", options=0x7fffffffdfd0, task=0x7ffff79cabbc <run>, task_cls=0x607e20, run_without_scheduler=0) at program.c:273 #11 0x00007ffff7787846 in GNUNET_PROGRAM_run (argc=3, argv=0x7fffffffe010, binaryName=0x404eeb "test_mesh_small", binaryHelp=0x7ffff79cee89 "nohelp", options=0x7fffffffdfd0, task=0x7ffff79cabbc <run>, task_cls=0x607e20) at program.c:308 #12 0x00007ffff79caedd in GNUNET_TESTBED_test_run (testname=0x404eeb "test_mesh_small", cfg_filename=0x404ed6 "test_mesh_small.conf", num_peers=5, event_mask=0, cc=0, cc_cls=0x0, test_master=0x404634 <mesh_test_run>, test_master_cls=0x607d40) at testbed_api_test.c:152 #13 0x00000000004048b5 in GNUNET_MESH_TEST_run (testname=0x404eeb "test_mesh_small", cfgname=0x404ed6 "test_mesh_small.conf", num_peers=5, tmain=0x4038de <tmain>, tmain_cls=0x0, new_tunnel=0x40295f <incoming_tunnel>, cleaner=0x402c13 <tunnel_cleaner>, handlers=0x605800, stypes=0x0) at mesh_test_lib.c:290 #14 0x00000000004041ed in main (argc=1, argv=0x7fffffffe238) at test_mesh_small.c:983 (gdb) select-frame 14 (gdb) p test_ctx $15 = (struct GNUNET_MESH_TEST_Context *) 0x607d40 (gdb) p *test_ctx $16 = {peers = 0x639600, meshes = 0x7ffff79c2298, ops = 0x6372a0, app_main = 0, app_main_cls = 0x0, num_peers = 122, new_tunnel = 0x13bc7205875, cleaner = 0x300000000, handlers = 0xffffffffffffffff, stypes = 0x1} (gdb) p peers_running $17 = 5 (gdb) |
|
observe that in the `test_ctx' num_peers is 122 where as the static variable `peer_running' is 5. Shouldn't both of these be same? Also, I noticed that the addresses for test_ctx->peers and test_ctx->ops are also corrupted. |
|
Ok, I've put a break in GNUNET_MESH_TEST_cleanup () and from the following output it looks like it is getting called twice! This could explain the memory corruption. harsha@nebula:~/repos/gnunet/src/mesh$ libtool --mode=execute gdb ./test_mesh_small_signal GNU gdb (GDB) 7.0.1-debian Copyright (C) 2009 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... Reading symbols from /home/harsha/repos/gnunet/src/mesh/.libs/lt-test_mesh_small_signal...done. (gdb) r Starting program: /home/harsha/repos/gnunet/src/mesh/.libs/lt-test_mesh_small_signal Dec 23 10:43:07-853571 mesh-2502 ERROR Assertion failed at gnunet-service-mesh.c:5909. Dec 23 10:43:07-862344 test_mesh_small-2482 ERROR Assertion failed at mesh_test_lib.c:203. Dec 23 10:43:07-864806 test_mesh_small-2482 ERROR Assertion failed at mesh_test_lib.c:203. Dec 23 10:43:07-864869 util-scheduler-2482 ERROR Attempt to cancel dead task 17025232769157995141! Dec 23 10:43:07-864889 test_mesh_small-2482 ERROR Assertion failed at scheduler.c:914. Program received signal SIGABRT, Aborted. 0x00007ffff6f8a1b5 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 64 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory. in ../nptl/sysdeps/unix/sysv/linux/raise.c (gdb) |
|
attached a patch file showing where I was using breaks. |
|
Mesh1 testcase no longer relevant, I'm pretty sure I have resolved this issue in the mesh2 testcase recently (caused by hardcoded local array overflow). |
Date Modified | Username | Field | Change |
---|---|---|---|
2012-12-21 00:00 | Bart Polot | New Issue | |
2012-12-21 00:00 | Bart Polot | Status | new => assigned |
2012-12-21 00:00 | Bart Polot | Assigned To | => Sree Harsha Totakura |
2012-12-21 03:18 | Bart Polot | Note Added: 0006742 | |
2012-12-21 13:05 | Sree Harsha Totakura | Note Added: 0006743 | |
2012-12-21 13:06 | Sree Harsha Totakura | Status | assigned => feedback |
2012-12-21 14:15 | Bart Polot | Note Added: 0006744 | |
2012-12-21 14:15 | Bart Polot | Status | feedback => assigned |
2012-12-21 14:20 | Bart Polot | Note Added: 0006745 | |
2012-12-21 20:33 | Christian Grothoff | Priority | normal => high |
2012-12-22 16:17 | Christian Grothoff | Target Version | => 0.10.0 |
2012-12-23 10:21 | Sree Harsha Totakura | Note Added: 0006750 | |
2012-12-23 10:26 | Sree Harsha Totakura | Note Added: 0006751 | |
2012-12-23 10:47 | Sree Harsha Totakura | Note Added: 0006752 | |
2012-12-23 10:48 | Sree Harsha Totakura | Note Added: 0006753 | |
2012-12-23 10:52 | Sree Harsha Totakura | Note Added: 0006754 | |
2012-12-23 10:52 | Sree Harsha Totakura | Assigned To | Sree Harsha Totakura => Bart Polot |
2012-12-23 10:53 | Sree Harsha Totakura | File Added: mesh_test_fail.patch | |
2012-12-23 10:54 | Sree Harsha Totakura | Note Added: 0006755 | |
2013-06-19 02:32 | Bart Polot | Note Added: 0007159 | |
2013-06-19 02:32 | Bart Polot | Status | assigned => resolved |
2013-06-19 02:32 | Bart Polot | Resolution | open => won't fix |
2013-12-24 20:54 | Christian Grothoff | Status | resolved => closed |