View Issue Details

IDProjectCategoryView StatusLast Update
0002713GNUnettestbed servicepublic2013-12-24 20:54
ReporterBart PolotAssigned ToBart Polot 
PriorityhighSeveritycrashReproducibilityalways
Status closedResolutionwon't fix 
Product Version0.9.4 
Target Version0.10.0Fixed in Version 
Summary0002713: GNUNET_TESTBED_operation_done kills dead task
DescriptionProgram terminated with signal 6, Aborted.
#0 0x00007f151c2dafd5 in raise () from /usr/lib/libc.so.6
(gdb) bt
#0 0x00007f151c2dafd5 in raise () from /usr/lib/libc.so.6
#1 0x00007f151c2dc458 in abort () from /usr/lib/libc.so.6
#2 0x00007f151d718528 in GNUNET_abort () at common_logging.c:293
#3 0x00007f151d74fd69 in GNUNET_SCHEDULER_cancel (task=17601693521461418629)
    at scheduler.c:914
#4 0x00007f151e3afa88 in GNUNET_TESTBED_operation_release_ (
    operation=0x7f151e3b10b0 <opstart_peer_stop>) at testbed_api_operations.c:373
#5 0x00007f151e3ad036 in GNUNET_TESTBED_operation_done (
    operation=0x7f151e3b10b0 <opstart_peer_stop>) at testbed_api.c:2194
#6 0x0000000000404548 in GNUNET_MESH_TEST_cleanup (ctx=0x1986c20)
    at mesh_test_lib.c:205
#7 0x0000000000401b4c in disconnect_mesh_peers (cls=0x2a4, tc=0x7fff7e95d7f0)
    at test_mesh_small.c:279
#8 0x00007f151d74efa9 in run_ready (rs=0x1987ea0, ws=0x1987f30) at scheduler.c:597
#9 0x00007f151d74f7f8 in GNUNET_SCHEDULER_run (task=0x7f151d748580 <program_main>,
    task_cls=0x7fff7e95da90) at scheduler.c:786
#10 0x00007f151d748fb7 in GNUNET_PROGRAM_run2 (argc=3, argv=0x7fff7e95dc30,
    binaryName=0x404e0b "test_mesh_small", binaryHelp=0x7f151e3bb949 "nohelp",
    options=0x7fff7e95dbf0, task=0x7f151e3b77a8 <run>, task_cls=0x1986cf0,
    run_without_scheduler=0) at program.c:273
#11 0x00007f151d7490d8 in GNUNET_PROGRAM_run (argc=3, argv=0x7fff7e95dc30,
    binaryName=0x404e0b "test_mesh_small", binaryHelp=0x7f151e3bb949 "nohelp",
    options=0x7fff7e95dbf0, task=0x7f151e3b77a8 <run>, task_cls=0x1986cf0)
    at program.c:308
#12 0x00007f151e3b7ab6 in GNUNET_TESTBED_test_run (testname=0x404e0b "test_mesh_small",
    cfg_filename=0x404df6 "test_mesh_small.conf", num_peers=5, event_mask=0, cc=0x0,
    cc_cls=0x0, test_master=0x4045a9 <mesh_test_run>, test_master_cls=0x1986c20)
    at testbed_api_test.c:152
#13 0x000000000040481f in GNUNET_MESH_TEST_run (testname=0x404e0b "test_mesh_small",
    cfgname=0x404df6 "test_mesh_small.conf", num_peers=5, tmain=0x403926 <tmain>,
    tmain_cls=0x0, new_tunnel=0x4029a9 <incoming_tunnel>,
    cleaner=0x402c5d <tunnel_cleaner>, handlers=0x605860 <handlers>, stypes=0x0)
    at mesh_test_lib.c:287
#14 0x000000000040424e in main (argc=1, argv=0x7fff7e95de28) at test_mesh_small.c:983

scheduler.c:914: Attempt to cancel dead task

As far as I am aware GNUNET_TESTBED_operation_done is not called twice. The code in mesh_test_lib does as follows:
{
    GNUNET_assert (NULL != ctx->ops[i]);
    GNUNET_TESTBED_operation_done (ctx->ops[i]);
    ctx->ops[i] = NULL;
}
And that's the only call to op_done in the whole library...
Steps To ReproduceRun test_mesh_small_signal, on shutdown the testcase crashes.
Additional InformationOther testcases using the same library do not crash (test_mesh_2dtorus, test_mesh_regex). Even test with (almost) the same code as test_mesh_small_unicast do not crash.
TagsNo tags attached.

Activities

Bart Polot

2012-12-21 03:18

manager   ~0006742

Seems related to crashes in the services.

If a service crashes and restarts, the handler becomes invalid and I imagine it can have something to do with the bug...

Shouldn't the testbed notify the "profiler" of a service crash? Or it already does and I am missing something?

Sree Harsha Totakura

2012-12-21 13:05

developer   ~0006743

Which service is crashing? mesh or testbed?
In any case the client handle shouldn't become invalid if a service crashes. So, I'm not sure which one you are referring here.

Testbed currently does not support notifying profiler whenever a service crashes.

From the back trace it tools like there is some linking pollution. In stack frames 4 and 5 'opstart_peer_stop' is a static function in testbed_api_peer.c. I wonder how it came here at the first place.

Also, I cannot reproduce this on my laptop and desktop.

I'm not sure if this is relevant: Christian has bumped library version numbers; you should do a `sudo ldconfig` to get the use the correct ones after `make install`

Bart Polot

2012-12-21 14:15

manager   ~0006744

What is crashing is (was)... both.

1. I start the testcase, with a bug in mesh service.
2. Mesh service crashes so the testcase eventually fails.
3. After a timeout the testcase shuts down and calls operation_done on the connect operation to the serivces.
4. The testcase also crashes when calling operation_done on what I suspect is the operation used to connect to the now dead service.

Bart Polot

2012-12-21 14:20

manager   ~0006745

BTW, you need to make a service that you connected to in the test crash in order to reproduce this problem. If the service doesn't crash, the operation_done succeeds and the testcase doesn't crash either.

If you want a "simple way" to do force a crash, insert a GNUNET_assert (0) in any mesh_handler in the mesh service, for instance in handle_mesh_data_unicast and run any test_mesh_small_* test.

Sree Harsha Totakura

2012-12-23 10:21

developer   ~0006750

Ok, I've added an assertion failure in mesh service but didn't observe any crash.

harsha@nebula:~/repos/gnunet/src/mesh$ libtool --mode=execute gdb ./test_mesh_small_signal
GNU gdb (GDB) 7.0.1-debian
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/harsha/repos/gnunet/src/mesh/.libs/lt-test_mesh_small_signal...done.
(gdb) r
Starting program: /home/harsha/repos/gnunet/src/mesh/.libs/lt-test_mesh_small_signal
Dec 23 10:10:43-665985 mesh-894 ERROR Assertion failed at gnunet-service-mesh.c:7403.
Dec 23 10:10:43-682386 mesh-930 ERROR Assertion failed at gnunet-service-mesh.c:7403.
Dec 23 10:10:43-700037 mesh-931 ERROR Assertion failed at gnunet-service-mesh.c:7403.
Dec 23 10:10:43-722007 mesh-932 ERROR Assertion failed at gnunet-service-mesh.c:7403.
Dec 23 10:10:43-742763 mesh-933 ERROR Assertion failed at gnunet-service-mesh.c:7403.
Dec 23 10:10:43-778547 mesh-934 ERROR Assertion failed at gnunet-service-mesh.c:7403.
Dec 23 10:10:43-829667 mesh-935 ERROR Assertion failed at gnunet-service-mesh.c:7403.
Dec 23 10:10:43-912846 mesh-936 ERROR Assertion failed at gnunet-service-mesh.c:7403.
Dec 23 10:10:44-072649 mesh-937 ERROR Assertion failed at gnunet-service-mesh.c:7403.
Dec 23 10:10:44-351751 mesh-938 ERROR Assertion failed at gnunet-service-mesh.c:7403.
Dec 23 10:10:44-885130 mesh-939 ERROR Assertion failed at gnunet-service-mesh.c:7403.
Dec 23 10:10:45-924201 mesh-940 ERROR Assertion failed at gnunet-service-mesh.c:7403.
Dec 23 10:10:47-993462 mesh-941 ERROR Assertion failed at gnunet-service-mesh.c:7403.
Dec 23 10:10:52-115602 mesh-942 ERROR Assertion failed at gnunet-service-mesh.c:7403.
Dec 23 10:11:00-339378 mesh-943 ERROR Assertion failed at gnunet-service-mesh.c:7403.
Dec 23 10:11:16-758287 mesh-944 ERROR Assertion failed at gnunet-service-mesh.c:7403.
Dec 23 10:11:49-547251 mesh-949 ERROR Assertion failed at gnunet-service-mesh.c:7403.
Dec 23 10:12:43-740792 test_mesh_small-880 ERROR FAILED! (0/5)

Program exited with code 01.
(gdb)

Sree Harsha Totakura

2012-12-23 10:26

developer   ~0006751

Ok, I'm able to reproduce this; it occurred only when I put a break in handle_mesh_data_unicast.

Sree Harsha Totakura

2012-12-23 10:47

developer   ~0006752

It looks like there is some memory corruption in mesh test case:

The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /home/harsha/repos/gnunet/src/mesh/.libs/lt-test_mesh_small_signal

Breakpoint 1, mesh_test_run (cls=0x607d40, num_peers=5, peers=0x636b20) at mesh_test_lib.c:229
229 struct GNUNET_MESH_TEST_Context *ctx = cls;
(gdb) bt
#0 0x00007ffff6f8a1b5 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x00007ffff6f8cfc0 in *__GI_abort () at abort.c:92
#2 0x00007ffff77568d1 in GNUNET_abort () at common_logging.c:293
#3 0x00007ffff778e72a in GNUNET_SCHEDULER_cancel (task=17025232769157995141) at scheduler.c:914
#4 0x00007ffff79c2e4c in GNUNET_TESTBED_operation_release_ (operation=0x7ffff79c4477) at testbed_api_operations.c:373
#5 0x00007ffff79c03b4 in GNUNET_TESTBED_operation_done (operation=0x7ffff79c4477) at testbed_api.c:2194
#6 0x00000000004045b6 in GNUNET_MESH_TEST_cleanup (ctx=0x607d40) at mesh_test_lib.c:206
#7 0x0000000000401af9 in disconnect_mesh_peers (cls=0x2a4, tc=0x7fffffffdbb0) at test_mesh_small.c:279
#8 0x00007ffff778d931 in run_ready (rs=0x608fc0, ws=0x609050) at scheduler.c:597
#9 0x00007ffff778e1b3 in GNUNET_SCHEDULER_run (task=0x7ffff7786ca8 <program_main>, task_cls=0x7fffffffde50) at scheduler.c:786
#10 0x00007ffff7787711 in GNUNET_PROGRAM_run2 (argc=3, argv=0x7fffffffe010, binaryName=0x404eeb "test_mesh_small", binaryHelp=0x7ffff79cee89 "nohelp",
    options=0x7fffffffdfd0, task=0x7ffff79cabbc <run>, task_cls=0x607e20, run_without_scheduler=0) at program.c:273
#11 0x00007ffff7787846 in GNUNET_PROGRAM_run (argc=3, argv=0x7fffffffe010, binaryName=0x404eeb "test_mesh_small", binaryHelp=0x7ffff79cee89 "nohelp",
    options=0x7fffffffdfd0, task=0x7ffff79cabbc <run>, task_cls=0x607e20) at program.c:308
#12 0x00007ffff79caedd in GNUNET_TESTBED_test_run (testname=0x404eeb "test_mesh_small", cfg_filename=0x404ed6 "test_mesh_small.conf", num_peers=5,
    event_mask=0, cc=0, cc_cls=0x0, test_master=0x404634 <mesh_test_run>, test_master_cls=0x607d40) at testbed_api_test.c:152
#13 0x00000000004048b5 in GNUNET_MESH_TEST_run (testname=0x404eeb "test_mesh_small", cfgname=0x404ed6 "test_mesh_small.conf", num_peers=5,
    tmain=0x4038de <tmain>, tmain_cls=0x0, new_tunnel=0x40295f <incoming_tunnel>, cleaner=0x402c13 <tunnel_cleaner>, handlers=0x605800, stypes=0x0)
    at mesh_test_lib.c:290
#14 0x00000000004041ed in main (argc=1, argv=0x7fffffffe238) at test_mesh_small.c:983
(gdb) select-frame 14
(gdb) p test_ctx
$15 = (struct GNUNET_MESH_TEST_Context *) 0x607d40
(gdb) p *test_ctx
$16 = {peers = 0x639600, meshes = 0x7ffff79c2298, ops = 0x6372a0, app_main = 0, app_main_cls = 0x0, num_peers = 122, new_tunnel = 0x13bc7205875,
  cleaner = 0x300000000, handlers = 0xffffffffffffffff, stypes = 0x1}
(gdb) p peers_running
$17 = 5
(gdb)

Sree Harsha Totakura

2012-12-23 10:48

developer   ~0006753

observe that in the `test_ctx' num_peers is 122 where as the static variable `peer_running' is 5. Shouldn't both of these be same?

Also, I noticed that the addresses for test_ctx->peers and test_ctx->ops are also corrupted.

Sree Harsha Totakura

2012-12-23 10:52

developer   ~0006754

Ok, I've put a break in GNUNET_MESH_TEST_cleanup () and from the following output it looks like it is getting called twice! This could explain the memory corruption.


harsha@nebula:~/repos/gnunet/src/mesh$ libtool --mode=execute gdb ./test_mesh_small_signal
GNU gdb (GDB) 7.0.1-debian
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/harsha/repos/gnunet/src/mesh/.libs/lt-test_mesh_small_signal...done.
(gdb) r
Starting program: /home/harsha/repos/gnunet/src/mesh/.libs/lt-test_mesh_small_signal
Dec 23 10:43:07-853571 mesh-2502 ERROR Assertion failed at gnunet-service-mesh.c:5909.
Dec 23 10:43:07-862344 test_mesh_small-2482 ERROR Assertion failed at mesh_test_lib.c:203.
Dec 23 10:43:07-864806 test_mesh_small-2482 ERROR Assertion failed at mesh_test_lib.c:203.
Dec 23 10:43:07-864869 util-scheduler-2482 ERROR Attempt to cancel dead task 17025232769157995141!
Dec 23 10:43:07-864889 test_mesh_small-2482 ERROR Assertion failed at scheduler.c:914.

Program received signal SIGABRT, Aborted.
0x00007ffff6f8a1b5 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
    in ../nptl/sysdeps/unix/sysv/linux/raise.c
(gdb)

Sree Harsha Totakura

2012-12-23 10:53

developer  

mesh_test_fail.patch (704 bytes)
Index: mesh_test_lib.c
===================================================================
--- mesh_test_lib.c	(revision 25640)
+++ mesh_test_lib.c	(working copy)
@@ -200,6 +200,7 @@
 {
   unsigned int i;
 
+  GNUNET_break (0);
   for (i = 0; i < ctx->num_peers; i++)
   {
     GNUNET_assert (NULL != ctx->ops[i]);
Index: gnunet-service-mesh.c
===================================================================
--- gnunet-service-mesh.c	(revision 25640)
+++ gnunet-service-mesh.c	(working copy)
@@ -5906,6 +5906,7 @@
   uint32_t ttl;
   size_t size;
 
+  GNUNET_assert (0);
   GNUNET_log (GNUNET_ERROR_TYPE_DEBUG, "got a unicast packet from %s\n",
               GNUNET_i2s (peer));
   /* Check size */
mesh_test_fail.patch (704 bytes)

Sree Harsha Totakura

2012-12-23 10:54

developer   ~0006755

attached a patch file showing where I was using breaks.

Bart Polot

2013-06-19 02:32

manager   ~0007159

Mesh1 testcase no longer relevant, I'm pretty sure I have resolved this issue in the mesh2 testcase recently (caused by hardcoded local array overflow).

Issue History

Date Modified Username Field Change
2012-12-21 00:00 Bart Polot New Issue
2012-12-21 00:00 Bart Polot Status new => assigned
2012-12-21 00:00 Bart Polot Assigned To => Sree Harsha Totakura
2012-12-21 03:18 Bart Polot Note Added: 0006742
2012-12-21 13:05 Sree Harsha Totakura Note Added: 0006743
2012-12-21 13:06 Sree Harsha Totakura Status assigned => feedback
2012-12-21 14:15 Bart Polot Note Added: 0006744
2012-12-21 14:15 Bart Polot Status feedback => assigned
2012-12-21 14:20 Bart Polot Note Added: 0006745
2012-12-21 20:33 Christian Grothoff Priority normal => high
2012-12-22 16:17 Christian Grothoff Target Version => 0.10.0
2012-12-23 10:21 Sree Harsha Totakura Note Added: 0006750
2012-12-23 10:26 Sree Harsha Totakura Note Added: 0006751
2012-12-23 10:47 Sree Harsha Totakura Note Added: 0006752
2012-12-23 10:48 Sree Harsha Totakura Note Added: 0006753
2012-12-23 10:52 Sree Harsha Totakura Note Added: 0006754
2012-12-23 10:52 Sree Harsha Totakura Assigned To Sree Harsha Totakura => Bart Polot
2012-12-23 10:53 Sree Harsha Totakura File Added: mesh_test_fail.patch
2012-12-23 10:54 Sree Harsha Totakura Note Added: 0006755
2013-06-19 02:32 Bart Polot Note Added: 0007159
2013-06-19 02:32 Bart Polot Status assigned => resolved
2013-06-19 02:32 Bart Polot Resolution open => won't fix
2013-12-24 20:54 Christian Grothoff Status resolved => closed