View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0003575 | GNUnet | core service | public | 2014-12-13 03:02 | 2018-06-07 00:38 |
Reporter | Bart Polot | Assigned To | Bart Polot | ||
Priority | high | Severity | major | Reproducibility | always |
Status | closed | Resolution | fixed | ||
Product Version | Git master | ||||
Target Version | 0.11.0 | Fixed in Version | 0.11.0pre66 | ||
Summary | 0003575: Core takes an arbitrarily long time to call back | ||||
Description | When running CADET tests, core starts taking a longer and longer time to authorize CADET to transmit, This eventually causes the other end to assume the connection timed out. The tests consists of one peer sending traffic to another peer over an unreliable channel (so, max speed), the receiving peer only generating "keepalive" traffic (ACKs, key exchanges, etc, no payload). The problem *always* happens on the _receiving_ peer. | ||||
Steps To Reproduce | Run test_cadet_2_speed with a high packet count (#define in test_cadet.c, 20k does the trick for me, sometimes even 4k). Small packet counts do not expose this behaviour. | ||||
Additional Information | Expanding the config file to: [ats] # Network specific inbound/outbound quotas UNSPECIFIED_QUOTA_IN = unlimited UNSPECIFIED_QUOTA_OUT = unlimited # LOOPBACK LOOPBACK_QUOTA_IN = unlimited LOOPBACK_QUOTA_OUT = unlimited # LAN LAN_QUOTA_IN = unlimited LAN_QUOTA_OUT = unlimited #WAN WAN_QUOTA_OUT = unlimitedq WAN_QUOTA_IN = unlimited # WLAN WLAN_QUOTA_IN = unlimited WLAN_QUOTA_OUT = unlimited # BLUETOOTH BLUETOOTH_QUOTA_IN = unlimited BLUETOOTH_QUOTA_OUT = unlimited Does not help. In case of bandwith limiting, it would affect the sending peer first, anyway. The timeout seems to get gradually bigger until it exceeds the CADET connection tolerance and it times out. Here is a filtered log including the time taken to fill each core request and the messages CADET still has in the queue for CORE after the callback. Note that there is no gap between the phases, at 02:19:32-414983 something suddenly breaks and requests start taking a long time, the next callback occuring at 02:19:32-795017. This causes CADET messages to pile up (in this test the rekeying frequency is extremely high and nevertheless this requires some optimization, but that is another story...) *************************** Normal phase *************************** Dec 13 02:19:30-342949 cadet-p2p-1374 DEBUG core wait time 591 µs Dec 13 02:19:30-343078 cadet-p2p-1374 DEBUG QQQ queue length: 0 -- Dec 13 02:19:30-352436 cadet-p2p-1374 DEBUG core wait time 674 µs Dec 13 02:19:30-352568 cadet-p2p-1374 DEBUG QQQ queue length: 0 -- Dec 13 02:19:30-359621 cadet-p2p-1374 DEBUG core wait time 627 µs Dec 13 02:19:30-359748 cadet-p2p-1374 DEBUG QQQ queue length: 0 -- Dec 13 02:19:30-372446 cadet-p2p-1374 DEBUG core wait time 666 µs Dec 13 02:19:30-372573 cadet-p2p-1374 DEBUG QQQ queue length: 0 -- Dec 13 02:19:30-379735 cadet-p2p-1374 DEBUG core wait time 645 µs Dec 13 02:19:30-379854 cadet-p2p-1374 DEBUG QQQ queue length: 0 | ||||
Tags | No tags attached. | ||||
related to | 0003652 | closed | Christian Grothoff | inbound quota is enforced, but ignored by sender, leading to connectivity issues |
|
Haven't tried to run anything yet, but looking at the code, I suspect the routine to look at closely is gnunet-service-core_sessions.c::try_transmission(), which is just one BIG heuristic that tries to pack high-priority messages (knapsack!), decide on what requests to solicit from clients (based on size and priority). It doesn't do any logging (yet), but I suspect if we add some, we might see something interesting. The rest of CORE doesn't really make decisions about when to give clients a chance to transmit, so it must be this one. |
|
Just to clarify, in the testcase there are two separate peers, so two core processes, each one with just the one client. |
|
Dec 13 02:19:38-080846 cadet-p2p-1374 DEBUG core wait time 200473 µs The line to log that message doesn't even exist in SVN HEAD. So how do you plan for me to reproduce this exactly? Keeping your code in a private repo is not that helpful... |
|
When I run that test, I get a core dump with the following: warning: Could not load shared library symbols for linux-vdso.so.1. Do you need "set solib-search-path" or "set sysroot"? [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Core was generated by `/home/grothoff/lib//gnunet/libexec/gnunet-service-cadet -c /tmp/testbedWqp5Bk/1'. Program terminated with signal 11, Segmentation fault. #0 0x000000000040e816 in handle_pong (t=0x21a1590, msg=0x7fff11792378) at gnunet-service-cadet_tunnel.c:1960 1960 if (challenge != t->kx_ctx->challenge) (gdb) ba #0 0x000000000040e816 in handle_pong (t=0x21a1590, msg=0x7fff11792378) at gnunet-service-cadet_tunnel.c:1960 #1 0x000000000040f6e0 in GCT_handle_kx (t=0x21a1590, message=0x7fff11792378) at gnunet-service-cadet_tunnel.c:2126 #2 0x0000000000424300 in handle_cadet_kx (peer=0x7fff11792334, msg=0x7fff11792354) at gnunet-service-cadet_connection.c:2173 #3 0x0000000000424657 in GCC_handle_kx (cls=0x0, peer=0x7fff11792334, message=0x7fff11792354) at gnunet-service-cadet_connection.c:2220 #4 0x00007f31e33d0ea3 in main_notify_handler (cls=0x217c690, msg=0x7fff11792330) at core_api.c:967 #5 0x00007f31e382846c in receive_task (cls=0x217c020, tc=0x7fff11792450) at client.c:595 #6 0x00007f31e3892a02 in run_ready (rs=0x217f280, ws=0x217f310) at scheduler.c:595 #7 0x00007f31e3893a91 in GNUNET_SCHEDULER_run (task=0x7f31e38a938d <service_task>, task_cls=0x7fff117927f0) at scheduler.c:817 #8 0x00007f31e38ac840 in GNUNET_SERVICE_run (argc=3, argv=0x7fff11792a88, service_name=0x45caaa "cadet", options=GNUNET_SERVICE_OPTION_NONE, task=0x454670 <run>, task_cls=0x0) at service.c:1498 #9 0x0000000000454d8e in main (argc=3, argv=0x7fff11792a88) at gnunet-service-cadet.c:174 (gdb) print t->kx_ctx $2 = (struct CadetTunnelKXCtx *) 0x0 |
|
I pushed the cherry-picked stable-ish changes to svn, many crashes should be fixed, the core timing code is there as well. |
|
Re-running with SVN HEAD, I now get: Dec 15 10:35:54-832434 cadet-tun-6896 WARNING Wrong PONG challenge on 6YNB The peak latency I see (with 20000) is something like: Dec 15 10:35:54-843662 cadet-p2p-6896 DEBUG core wait time 23098 µs which IMO is not indicative of an issue. However, the test aborted early: Dec 15 10:35:54-854712 cadet-tun-6905 WARNING Tunnel 6DEV is already scheduled for destruction. Tunnel debug dump: Dec 15 10:35:54-854757 cadet-tun-6905 WARNING TTT DEBUG TUNNEL TOWARDS 6DEV Dec 15 10:35:54-854773 cadet-tun-6905 WARNING TTT cstate CADET_TUNNEL_READY, estate CADET_TUNNEL_KEY_REKEY Dec 15 10:35:54-854788 cadet-tun-6905 WARNING TTT kx_ctx 0x1f1c1f0, rekey_task 311991, finish task 0 Dec 15 10:35:54-854811 cadet-tun-6905 WARNING TTT my EPHM R3D165M3 Dec 15 10:35:54-854831 cadet-tun-6905 WARNING TTT peers EPHM: Y3WVDNPH Dec 15 10:35:54-854851 cadet-tun-6905 WARNING TTT ENC key: FRDB24EA Dec 15 10:35:54-854870 cadet-tun-6905 WARNING TTT DEC key: RXNWKWED Dec 15 10:35:54-854889 cadet-tun-6905 WARNING TTT OLD ENC key: 416QZ6EX Dec 15 10:35:54-854910 cadet-tun-6905 WARNING TTT OLD DEC key: QDY8XAX9 Dec 15 10:35:54-854930 cadet-tun-6905 WARNING TTT tq_head (nil), tq_tail (nil) Dec 15 10:35:54-854944 cadet-tun-6905 WARNING TTT destroy 312013 Dec 15 10:35:54-854957 cadet-tun-6905 WARNING TTT channels: Dec 15 10:35:54-854970 cadet-tun-6905 WARNING TTT connections: Dec 15 10:35:54-854991 cadet-con-6905 WARNING CCC DEBUG CONNECTION CFN1N735 (->6DEV) Dec 15 10:35:54-855010 cadet-con-6905 WARNING CCC path 6DEV 6YNB, own pos: 1 Dec 15 10:35:54-855025 cadet-con-6905 WARNING CCC state: CADET_CONNECTION_READY, destroy: 0 Dec 15 10:35:54-855039 cadet-con-6905 WARNING CCC pending messages: 3 Dec 15 10:35:54-855053 cadet-con-6905 WARNING CCC FWD flow control: Dec 15 10:35:54-855066 cadet-con-6905 WARNING CCC queue: 0/3 Dec 15 10:35:54-855080 cadet-con-6905 WARNING CCC last PID sent: 4294967295, recv: 4294967295 Dec 15 10:35:54-855094 cadet-con-6905 WARNING CCC last ACK sent: 0, recv: 0 Dec 15 10:35:54-855109 cadet-con-6905 WARNING CCC POLL: task 0, msg (nil), msg_ack (nil)) Dec 15 10:35:54-855124 cadet-con-6905 WARNING CCC BCK flow control: Dec 15 10:35:54-855137 cadet-con-6905 WARNING CCC queue: 0/3 Dec 15 10:35:54-855148 cadet-con-6905 WARNING CCC last PID sent: 416, recv: 20154 Dec 15 10:35:54-855162 cadet-con-6905 WARNING CCC last ACK sent: 20218, recv: 480 Dec 15 10:35:54-855175 cadet-con-6905 WARNING CCC POLL: task 0, msg (nil), msg_ack (nil)) Dec 15 10:35:54-855189 cadet-con-6905 WARNING CCC DEBUG CONNECTION END Dec 15 10:35:54-855202 cadet-tun-6905 WARNING TTT DEBUG TUNNEL END Dec 15 10:35:54-855218 cadet-6905 ERROR Assertion failed at gnunet-service-cadet_tunnel.c:2618. Caboom. |
|
Next run, I pretty much immediately got: Dec 15 10:54:59-322897 cadet-con-7217 WARNING PID 1 not expected (3+), dropping! Dec 15 10:54:59-322933 test_cadet_small-7190 ERROR Assertion failed at cadet_api.c:495. |
|
Now test fails with: Dec 18 08:45:00-790074 cadet-con-29694 WARNING PID 1581 not expected (1583+), dropping! Dec 18 08:45:00-790145 test_cadet_small-29656 ERROR payload 1576, expected: 1575 Dec 18 08:45:00-790587 test_cadet_small-29656 ERROR payload 1577, expected: 1576 Dec 18 08:45:00-790917 cadet-con-29694 WARNING PID 1584 not expected (1586+), dropping! Dec 18 08:45:00-790931 test_cadet_small-29656 ERROR payload 1579, expected: 1577 Dec 18 08:45:00-791235 test_cadet_small-29656 ERROR payload 1580, expected: 1578 Dec 18 08:45:00-791779 test_cadet_small-29656 ERROR payload 1582, expected: 1579 Dec 18 08:45:00-791818 cadet-con-29694 WARNING PID 1587 not expected (1589+), dropping! Dec 18 08:45:00-792303 test_cadet_small-29656 ERROR payload 1583, expected: 1580 Dec 18 08:45:00-793162 cadet-con-29694 WARNING PID 1590 not expected (1592+), dropping! Dec 18 08:45:00-793177 test_cadet_small-29656 ERROR payload 1585, expected: 1581 Dec 18 08:45:00-793765 test_cadet_small-29656 ERROR payload 1586, expected: 1582 |
|
This might related to 0003652. |
|
0003652 is now fixed, need to test this one again. |
|
With SVN HEAD, I see two things: 1) core wait time still increases for one of the 2 peers slowly but steadily. 2) eventually things go really bad, causing the wrong messages to be delivered to the application (testcase failure!) Starts harmless with a few of these: Mar 08 17:08:15-452419 cadet-tun-18042 WARNING Wrong PONG challenge on 6DEV Then starts to go worse with checksum errors and protocol violations: Mar 08 17:08:17-368174 cadet-p2p-18039 DEBUG core wait time 38737 µs Mar 08 17:08:17-369617 cadet-p2p-18039 DEBUG core wait time 1393 µs Mar 08 17:08:17-376794 cadet-p2p-18042 DEBUG core wait time 7456 µs Mar 08 17:08:17-377162 cadet-p2p-18042 DEBUG core wait time 345 µs Mar 08 17:08:17-377841 cadet-tun-18039 WARNING Wrong PONG challenge on 6YNB Mar 08 17:08:17-388043 cadet-p2p-18039 DEBUG core wait time 18390 µs Mar 08 17:08:17-388896 cadet-p2p-18039 DEBUG core wait time 809 µs Mar 08 17:08:17-389853 cadet-18042 ERROR Failed checksum validation on tunnel 6DEV with KX Mar 08 17:08:17-389883 cadet-18042 WARNING External protocol violation detected at gnunet-service-cadet_tunnel.c:2140. And then seems to recover (the above errors stop), except things go haywire on the application-side: Mar 08 17:08:29-566535 cadet-18042 ERROR Failed checksum validation on tunnel 6DEV with KX Mar 08 17:08:29-566557 cadet-18042 WARNING External protocol violation detected at gnunet-service-cadet_tunnel.c:2140. Mar 08 17:08:29-566673 cadet-tun-18042 WARNING Wrong PONG challenge on 6DEV Mar 08 17:08:29-566800 cadet-p2p-18042 DEBUG core wait time 225 µs Mar 08 17:08:29-585120 cadet-p2p-18039 DEBUG core wait time 19365 µs Mar 08 17:08:29-586246 cadet-p2p-18042 DEBUG core wait time 156 µs Mar 08 17:08:29-586211 test_cadet_small-18020 ERROR payload 19641, expected: 19023 Mar 08 17:08:29-599637 cadet-p2p-18042 DEBUG core wait time 172 µs Mar 08 17:08:29-606131 test_cadet_small-18020 ERROR payload 19642, expected: 19024 Mar 08 17:08:29-606147 cadet-p2p-18042 DEBUG core wait time 141 µs Mar 08 17:08:29-606459 cadet-p2p-18039 DEBUG core wait time 21293 µs Mar 08 17:08:29-625060 cadet-p2p-18039 DEBUG core wait time 18564 µs Mar 08 17:08:29-625896 cadet-p2p-18042 DEBUG core wait time 72 µs Mar 08 17:08:29-625932 test_cadet_small-18020 ERROR payload 19643, expected: 19025 Mar 08 17:08:29-645160 cadet-p2p-18039 DEBUG core wait time 20055 µs Mar 08 17:08:29-646623 test_cadet_small-18020 ERROR payload 19644, expected: 19026 Mar 08 17:08:29-646718 cadet-p2p-18042 DEBUG core wait time 225 µs Eventually, that ends: Mar 08 17:08:36-807864 test_cadet_small-18020 ERROR payload 19998, expected: 19380 Mar 08 17:08:36-807900 cadet-p2p-18042 DEBUG core wait time 141 µs Mar 08 17:08:36-828189 test_cadet_small-18020 ERROR payload 19999, expected: 19381 Mar 08 17:08:36-828234 cadet-p2p-18042 DEBUG core wait time 148 µs Mar 08 17:08:37-054200 cadet-p2p-18042 DEBUG core wait time 219 µs Mar 08 17:08:37-055602 cadet-p2p-18039 DEBUG core wait time 264 µs Mar 08 17:08:37-119840 cadet-p2p-18039 DEBUG core wait time 251 µs Mar 08 17:08:37-121396 cadet-p2p-18042 DEBUG core wait time 196 µs Mar 08 17:08:37-387752 cadet-p2p-18042 DEBUG core wait time 112 µs Mar 08 17:08:37-388761 cadet-p2p-18039 DEBUG core wait time 193 µs Mar 08 17:08:37-453400 cadet-p2p-18039 DEBUG core wait time 174 µs Mar 08 17:08:37-466031 cadet-p2p-18042 DEBUG core wait time 134 µs Mar 08 17:08:37-485125 cadet-p2p-18042 DEBUG core wait time 172 µs Mar 08 17:08:37-509867 cadet-p2p-18039 DEBUG core wait time 16566 µs Mar 08 17:08:37-518340 cadet-p2p-18042 DEBUG core wait time 253 µs Mar 08 17:08:37-518696 cadet-p2p-18042 DEBUG core wait time 326 µs Mar 08 17:08:37-519032 cadet-tun-18039 WARNING Wrong PONG challenge on 6YNB Mar 08 17:08:37-524932 cadet-p2p-18039 DEBUG core wait time 5791 µs Mar 08 17:08:37-531316 cadet-tun-18042 WARNING Wrong PONG challenge on 6DEV Mar 08 17:08:37-531430 cadet-p2p-18042 DEBUG core wait time 209 µs |
|
Closing, as recent fixes to core should improve this, and the rest is most likely to be blamed on transport (which needs a major rewrite). |
Date Modified | Username | Field | Change |
---|---|---|---|
2014-12-13 03:02 | Bart Polot | New Issue | |
2014-12-13 03:02 | Bart Polot | Status | new => assigned |
2014-12-13 03:02 | Bart Polot | Assigned To | => Christian Grothoff |
2014-12-13 21:37 | Christian Grothoff | Note Added: 0008674 | |
2014-12-13 23:38 | Bart Polot | Note Added: 0008675 | |
2014-12-14 00:52 | Christian Grothoff | Note Added: 0008676 | |
2014-12-14 00:53 | Christian Grothoff | Note Added: 0008677 | |
2014-12-14 00:54 | Christian Grothoff | Assigned To | Christian Grothoff => Bart Polot |
2014-12-15 08:24 | Bart Polot | Note Added: 0008690 | |
2014-12-15 10:57 | Christian Grothoff | Note Added: 0008692 | |
2014-12-15 10:58 | Christian Grothoff | Note Added: 0008693 | |
2014-12-15 10:59 | Christian Grothoff | Note Edited: 0008692 | |
2014-12-18 08:53 | Christian Grothoff | Note Added: 0008706 | |
2015-02-11 20:18 | Christian Grothoff | Note Added: 0008874 | |
2015-02-11 20:18 | Christian Grothoff | Relationship added | related to 0003652 |
2015-03-08 16:59 | Christian Grothoff | Note Added: 0009006 | |
2015-03-08 17:14 | Christian Grothoff | Note Added: 0009007 | |
2015-03-08 17:16 | Christian Grothoff | Additional Information Updated | |
2015-03-20 21:32 | Christian Grothoff | Target Version | 0.11.0pre66 => |
2015-03-21 00:37 | Christian Grothoff | Target Version | => 0.11.0 |
2018-06-07 00:38 | Christian Grothoff | Status | assigned => closed |
2018-06-07 00:38 | Christian Grothoff | Resolution | open => fixed |
2018-06-07 00:38 | Christian Grothoff | Fixed in Version | => 0.11.0pre66 |
2018-06-07 00:38 | Christian Grothoff | Note Added: 0012988 |