View Issue Details

IDProjectCategoryView StatusLast Update
0003575GNUnetcore servicepublic2018-06-07 00:38
ReporterBart PolotAssigned ToBart Polot 
PriorityhighSeveritymajorReproducibilityalways
Status closedResolutionfixed 
Product VersionSVN HEAD 
Target Version0.11.0Fixed in Version0.11.0pre66 
Summary0003575: Core takes an arbitrarily long time to call back
DescriptionWhen running CADET tests, core starts taking a longer and longer time to authorize CADET to transmit, This eventually causes the other end to assume the connection timed out.

The tests consists of one peer sending traffic to another peer over an unreliable channel (so, max speed), the receiving peer only generating "keepalive" traffic (ACKs, key exchanges, etc, no payload).

The problem *always* happens on the _receiving_ peer.
Steps To ReproduceRun test_cadet_2_speed with a high packet count (#define in test_cadet.c, 20k does the trick for me, sometimes even 4k). Small packet counts do not expose this behaviour.
Additional InformationExpanding the config file to:
[ats]
# Network specific inbound/outbound quotas
UNSPECIFIED_QUOTA_IN = unlimited
UNSPECIFIED_QUOTA_OUT = unlimited
# LOOPBACK
LOOPBACK_QUOTA_IN = unlimited
LOOPBACK_QUOTA_OUT = unlimited
# LAN
LAN_QUOTA_IN = unlimited
LAN_QUOTA_OUT = unlimited
#WAN
WAN_QUOTA_OUT = unlimitedq
WAN_QUOTA_IN = unlimited
# WLAN
WLAN_QUOTA_IN = unlimited
WLAN_QUOTA_OUT = unlimited
# BLUETOOTH
BLUETOOTH_QUOTA_IN = unlimited
BLUETOOTH_QUOTA_OUT = unlimited

Does not help. In case of bandwith limiting, it would affect the sending peer first, anyway.



The timeout seems to get gradually bigger until it exceeds the CADET connection tolerance and it times out.

Here is a filtered log including the time taken to fill each core request and
the messages CADET still has in the queue for CORE after the callback. Note that there is no gap between the phases, at 02:19:32-414983 something suddenly breaks and requests start taking a long time, the next callback occuring at 02:19:32-795017. This causes CADET messages to pile up (in this test the rekeying frequency is extremely high and nevertheless this requires some optimization, but that is another story...)



*************************** Normal phase ***************************
Dec 13 02:19:30-342949 cadet-p2p-1374 DEBUG core wait time 591 µs
Dec 13 02:19:30-343078 cadet-p2p-1374 DEBUG QQQ queue length: 0
--
Dec 13 02:19:30-352436 cadet-p2p-1374 DEBUG core wait time 674 µs
Dec 13 02:19:30-352568 cadet-p2p-1374 DEBUG QQQ queue length: 0
--
Dec 13 02:19:30-359621 cadet-p2p-1374 DEBUG core wait time 627 µs
Dec 13 02:19:30-359748 cadet-p2p-1374 DEBUG QQQ queue length: 0
--
Dec 13 02:19:30-372446 cadet-p2p-1374 DEBUG core wait time 666 µs
Dec 13 02:19:30-372573 cadet-p2p-1374 DEBUG QQQ queue length: 0
--
Dec 13 02:19:30-379735 cadet-p2p-1374 DEBUG core wait time 645 µs
Dec 13 02:19:30-379854 cadet-p2p-1374 DEBUG QQQ queue length: 0
TagsNo tags attached.

Relationships

related to 0003652 closedChristian Grothoff inbound quota is enforced, but ignored by sender, leading to connectivity issues 

Activities

Christian Grothoff

2014-12-13 21:37

manager   ~0008674

Haven't tried to run anything yet, but looking at the code, I suspect the routine to look at closely is gnunet-service-core_sessions.c::try_transmission(),
which is just one BIG heuristic that tries to pack high-priority messages (knapsack!), decide on what requests to solicit from clients (based on size and priority). It doesn't do any logging (yet), but I suspect if we add some, we might see something interesting.

The rest of CORE doesn't really make decisions about when to give clients a chance to transmit, so it must be this one.

Bart Polot

2014-12-13 23:38

manager   ~0008675

Just to clarify, in the testcase there are two separate peers, so two core processes, each one with just the one client.

Christian Grothoff

2014-12-14 00:52

manager   ~0008676

Dec 13 02:19:38-080846 cadet-p2p-1374 DEBUG core wait time 200473 µs

The line to log that message doesn't even exist in SVN HEAD. So how do you plan for me to reproduce this exactly? Keeping your code in a private repo is not that helpful...

Christian Grothoff

2014-12-14 00:53

manager   ~0008677

When I run that test, I get a core dump with the following:

warning: Could not load shared library symbols for linux-vdso.so.1.
Do you need "set solib-search-path" or "set sysroot"?
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/home/grothoff/lib//gnunet/libexec/gnunet-service-cadet -c /tmp/testbedWqp5Bk/1'.
Program terminated with signal 11, Segmentation fault.
#0 0x000000000040e816 in handle_pong (t=0x21a1590, msg=0x7fff11792378) at gnunet-service-cadet_tunnel.c:1960
1960 if (challenge != t->kx_ctx->challenge)
(gdb) ba
#0 0x000000000040e816 in handle_pong (t=0x21a1590, msg=0x7fff11792378) at gnunet-service-cadet_tunnel.c:1960
#1 0x000000000040f6e0 in GCT_handle_kx (t=0x21a1590, message=0x7fff11792378) at gnunet-service-cadet_tunnel.c:2126
#2 0x0000000000424300 in handle_cadet_kx (peer=0x7fff11792334, msg=0x7fff11792354) at gnunet-service-cadet_connection.c:2173
#3 0x0000000000424657 in GCC_handle_kx (cls=0x0, peer=0x7fff11792334, message=0x7fff11792354) at gnunet-service-cadet_connection.c:2220
#4 0x00007f31e33d0ea3 in main_notify_handler (cls=0x217c690, msg=0x7fff11792330) at core_api.c:967
#5 0x00007f31e382846c in receive_task (cls=0x217c020, tc=0x7fff11792450) at client.c:595
#6 0x00007f31e3892a02 in run_ready (rs=0x217f280, ws=0x217f310) at scheduler.c:595
#7 0x00007f31e3893a91 in GNUNET_SCHEDULER_run (task=0x7f31e38a938d <service_task>, task_cls=0x7fff117927f0) at scheduler.c:817
#8 0x00007f31e38ac840 in GNUNET_SERVICE_run (argc=3, argv=0x7fff11792a88, service_name=0x45caaa "cadet",
    options=GNUNET_SERVICE_OPTION_NONE, task=0x454670 <run>, task_cls=0x0) at service.c:1498
#9 0x0000000000454d8e in main (argc=3, argv=0x7fff11792a88) at gnunet-service-cadet.c:174
(gdb) print t->kx_ctx
$2 = (struct CadetTunnelKXCtx *) 0x0

Bart Polot

2014-12-15 08:24

manager   ~0008690

I pushed the cherry-picked stable-ish changes to svn, many crashes should be fixed, the core timing code is there as well.

Christian Grothoff

2014-12-15 10:57

manager   ~0008692

Last edited: 2014-12-15 10:59

View 2 revisions

Re-running with SVN HEAD, I now get:
Dec 15 10:35:54-832434 cadet-tun-6896 WARNING Wrong PONG challenge on 6YNB

The peak latency I see (with 20000) is something like:
Dec 15 10:35:54-843662 cadet-p2p-6896 DEBUG core wait time 23098 µs

which IMO is not indicative of an issue.

However, the test aborted early:

Dec 15 10:35:54-854712 cadet-tun-6905 WARNING Tunnel 6DEV is already scheduled for destruction. Tunnel debug dump:
Dec 15 10:35:54-854757 cadet-tun-6905 WARNING TTT DEBUG TUNNEL TOWARDS 6DEV
Dec 15 10:35:54-854773 cadet-tun-6905 WARNING TTT cstate CADET_TUNNEL_READY, estate CADET_TUNNEL_KEY_REKEY
Dec 15 10:35:54-854788 cadet-tun-6905 WARNING TTT kx_ctx 0x1f1c1f0, rekey_task 311991, finish task 0
Dec 15 10:35:54-854811 cadet-tun-6905 WARNING TTT my EPHM R3D165M3
Dec 15 10:35:54-854831 cadet-tun-6905 WARNING TTT peers EPHM: Y3WVDNPH
Dec 15 10:35:54-854851 cadet-tun-6905 WARNING TTT ENC key: FRDB24EA
Dec 15 10:35:54-854870 cadet-tun-6905 WARNING TTT DEC key: RXNWKWED
Dec 15 10:35:54-854889 cadet-tun-6905 WARNING TTT OLD ENC key: 416QZ6EX
Dec 15 10:35:54-854910 cadet-tun-6905 WARNING TTT OLD DEC key: QDY8XAX9
Dec 15 10:35:54-854930 cadet-tun-6905 WARNING TTT tq_head (nil), tq_tail (nil)
Dec 15 10:35:54-854944 cadet-tun-6905 WARNING TTT destroy 312013
Dec 15 10:35:54-854957 cadet-tun-6905 WARNING TTT channels:
Dec 15 10:35:54-854970 cadet-tun-6905 WARNING TTT connections:
Dec 15 10:35:54-854991 cadet-con-6905 WARNING CCC DEBUG CONNECTION CFN1N735 (->6DEV)
Dec 15 10:35:54-855010 cadet-con-6905 WARNING CCC path 6DEV 6YNB, own pos: 1
Dec 15 10:35:54-855025 cadet-con-6905 WARNING CCC state: CADET_CONNECTION_READY, destroy: 0
Dec 15 10:35:54-855039 cadet-con-6905 WARNING CCC pending messages: 3
Dec 15 10:35:54-855053 cadet-con-6905 WARNING CCC FWD flow control:
Dec 15 10:35:54-855066 cadet-con-6905 WARNING CCC queue: 0/3
Dec 15 10:35:54-855080 cadet-con-6905 WARNING CCC last PID sent: 4294967295, recv: 4294967295
Dec 15 10:35:54-855094 cadet-con-6905 WARNING CCC last ACK sent: 0, recv: 0
Dec 15 10:35:54-855109 cadet-con-6905 WARNING CCC POLL: task 0, msg (nil), msg_ack (nil))
Dec 15 10:35:54-855124 cadet-con-6905 WARNING CCC BCK flow control:
Dec 15 10:35:54-855137 cadet-con-6905 WARNING CCC queue: 0/3
Dec 15 10:35:54-855148 cadet-con-6905 WARNING CCC last PID sent: 416, recv: 20154
Dec 15 10:35:54-855162 cadet-con-6905 WARNING CCC last ACK sent: 20218, recv: 480
Dec 15 10:35:54-855175 cadet-con-6905 WARNING CCC POLL: task 0, msg (nil), msg_ack (nil))
Dec 15 10:35:54-855189 cadet-con-6905 WARNING CCC DEBUG CONNECTION END
Dec 15 10:35:54-855202 cadet-tun-6905 WARNING TTT DEBUG TUNNEL END
Dec 15 10:35:54-855218 cadet-6905 ERROR Assertion failed at gnunet-service-cadet_tunnel.c:2618.

Caboom.

Christian Grothoff

2014-12-15 10:58

manager   ~0008693

Next run, I pretty much immediately got:

Dec 15 10:54:59-322897 cadet-con-7217 WARNING PID 1 not expected (3+), dropping!
Dec 15 10:54:59-322933 test_cadet_small-7190 ERROR Assertion failed at cadet_api.c:495.

Christian Grothoff

2014-12-18 08:53

manager   ~0008706

Now test fails with:

Dec 18 08:45:00-790074 cadet-con-29694 WARNING PID 1581 not expected (1583+), dropping!
Dec 18 08:45:00-790145 test_cadet_small-29656 ERROR payload 1576, expected: 1575
Dec 18 08:45:00-790587 test_cadet_small-29656 ERROR payload 1577, expected: 1576
Dec 18 08:45:00-790917 cadet-con-29694 WARNING PID 1584 not expected (1586+), dropping!
Dec 18 08:45:00-790931 test_cadet_small-29656 ERROR payload 1579, expected: 1577
Dec 18 08:45:00-791235 test_cadet_small-29656 ERROR payload 1580, expected: 1578
Dec 18 08:45:00-791779 test_cadet_small-29656 ERROR payload 1582, expected: 1579
Dec 18 08:45:00-791818 cadet-con-29694 WARNING PID 1587 not expected (1589+), dropping!
Dec 18 08:45:00-792303 test_cadet_small-29656 ERROR payload 1583, expected: 1580
Dec 18 08:45:00-793162 cadet-con-29694 WARNING PID 1590 not expected (1592+), dropping!
Dec 18 08:45:00-793177 test_cadet_small-29656 ERROR payload 1585, expected: 1581
Dec 18 08:45:00-793765 test_cadet_small-29656 ERROR payload 1586, expected: 1582

Christian Grothoff

2015-02-11 20:18

manager   ~0008874

This might related to 0003652.

Christian Grothoff

2015-03-08 16:59

manager   ~0009006

0003652 is now fixed, need to test this one again.

Christian Grothoff

2015-03-08 17:14

manager   ~0009007

With SVN HEAD, I see two things:
1) core wait time still increases for one of the 2 peers slowly but steadily.
2) eventually things go really bad, causing the wrong messages to be delivered
   to the application (testcase failure!)

Starts harmless with a few of these:
Mar 08 17:08:15-452419 cadet-tun-18042 WARNING Wrong PONG challenge on 6DEV

Then starts to go worse with checksum errors and protocol violations:
Mar 08 17:08:17-368174 cadet-p2p-18039 DEBUG core wait time 38737 µs
Mar 08 17:08:17-369617 cadet-p2p-18039 DEBUG core wait time 1393 µs
Mar 08 17:08:17-376794 cadet-p2p-18042 DEBUG core wait time 7456 µs
Mar 08 17:08:17-377162 cadet-p2p-18042 DEBUG core wait time 345 µs
Mar 08 17:08:17-377841 cadet-tun-18039 WARNING Wrong PONG challenge on 6YNB
Mar 08 17:08:17-388043 cadet-p2p-18039 DEBUG core wait time 18390 µs
Mar 08 17:08:17-388896 cadet-p2p-18039 DEBUG core wait time 809 µs
Mar 08 17:08:17-389853 cadet-18042 ERROR Failed checksum validation on tunnel 6DEV with KX
Mar 08 17:08:17-389883 cadet-18042 WARNING External protocol violation detected at gnunet-service-cadet_tunnel.c:2140.

And then seems to recover (the above errors stop), except things go haywire on the application-side:

Mar 08 17:08:29-566535 cadet-18042 ERROR Failed checksum validation on tunnel 6DEV with KX
Mar 08 17:08:29-566557 cadet-18042 WARNING External protocol violation detected at gnunet-service-cadet_tunnel.c:2140.
Mar 08 17:08:29-566673 cadet-tun-18042 WARNING Wrong PONG challenge on 6DEV
Mar 08 17:08:29-566800 cadet-p2p-18042 DEBUG core wait time 225 µs
Mar 08 17:08:29-585120 cadet-p2p-18039 DEBUG core wait time 19365 µs
Mar 08 17:08:29-586246 cadet-p2p-18042 DEBUG core wait time 156 µs
Mar 08 17:08:29-586211 test_cadet_small-18020 ERROR payload 19641, expected: 19023
Mar 08 17:08:29-599637 cadet-p2p-18042 DEBUG core wait time 172 µs
Mar 08 17:08:29-606131 test_cadet_small-18020 ERROR payload 19642, expected: 19024
Mar 08 17:08:29-606147 cadet-p2p-18042 DEBUG core wait time 141 µs
Mar 08 17:08:29-606459 cadet-p2p-18039 DEBUG core wait time 21293 µs
Mar 08 17:08:29-625060 cadet-p2p-18039 DEBUG core wait time 18564 µs
Mar 08 17:08:29-625896 cadet-p2p-18042 DEBUG core wait time 72 µs
Mar 08 17:08:29-625932 test_cadet_small-18020 ERROR payload 19643, expected: 19025
Mar 08 17:08:29-645160 cadet-p2p-18039 DEBUG core wait time 20055 µs
Mar 08 17:08:29-646623 test_cadet_small-18020 ERROR payload 19644, expected: 19026
Mar 08 17:08:29-646718 cadet-p2p-18042 DEBUG core wait time 225 µs

Eventually, that ends:

Mar 08 17:08:36-807864 test_cadet_small-18020 ERROR payload 19998, expected: 19380
Mar 08 17:08:36-807900 cadet-p2p-18042 DEBUG core wait time 141 µs
Mar 08 17:08:36-828189 test_cadet_small-18020 ERROR payload 19999, expected: 19381
Mar 08 17:08:36-828234 cadet-p2p-18042 DEBUG core wait time 148 µs
Mar 08 17:08:37-054200 cadet-p2p-18042 DEBUG core wait time 219 µs
Mar 08 17:08:37-055602 cadet-p2p-18039 DEBUG core wait time 264 µs
Mar 08 17:08:37-119840 cadet-p2p-18039 DEBUG core wait time 251 µs
Mar 08 17:08:37-121396 cadet-p2p-18042 DEBUG core wait time 196 µs
Mar 08 17:08:37-387752 cadet-p2p-18042 DEBUG core wait time 112 µs
Mar 08 17:08:37-388761 cadet-p2p-18039 DEBUG core wait time 193 µs
Mar 08 17:08:37-453400 cadet-p2p-18039 DEBUG core wait time 174 µs
Mar 08 17:08:37-466031 cadet-p2p-18042 DEBUG core wait time 134 µs
Mar 08 17:08:37-485125 cadet-p2p-18042 DEBUG core wait time 172 µs
Mar 08 17:08:37-509867 cadet-p2p-18039 DEBUG core wait time 16566 µs
Mar 08 17:08:37-518340 cadet-p2p-18042 DEBUG core wait time 253 µs
Mar 08 17:08:37-518696 cadet-p2p-18042 DEBUG core wait time 326 µs
Mar 08 17:08:37-519032 cadet-tun-18039 WARNING Wrong PONG challenge on 6YNB
Mar 08 17:08:37-524932 cadet-p2p-18039 DEBUG core wait time 5791 µs
Mar 08 17:08:37-531316 cadet-tun-18042 WARNING Wrong PONG challenge on 6DEV
Mar 08 17:08:37-531430 cadet-p2p-18042 DEBUG core wait time 209 µs

Christian Grothoff

2018-06-07 00:38

manager   ~0012988

Closing, as recent fixes to core should improve this, and the rest is most likely to be blamed on transport (which needs a major rewrite).

Issue History

Date Modified Username Field Change
2014-12-13 03:02 Bart Polot New Issue
2014-12-13 03:02 Bart Polot Status new => assigned
2014-12-13 03:02 Bart Polot Assigned To => Christian Grothoff
2014-12-13 21:37 Christian Grothoff Note Added: 0008674
2014-12-13 23:38 Bart Polot Note Added: 0008675
2014-12-14 00:52 Christian Grothoff Note Added: 0008676
2014-12-14 00:53 Christian Grothoff Note Added: 0008677
2014-12-14 00:54 Christian Grothoff Assigned To Christian Grothoff => Bart Polot
2014-12-15 08:24 Bart Polot Note Added: 0008690
2014-12-15 10:57 Christian Grothoff Note Added: 0008692
2014-12-15 10:58 Christian Grothoff Note Added: 0008693
2014-12-15 10:59 Christian Grothoff Note Edited: 0008692 View Revisions
2014-12-18 08:53 Christian Grothoff Note Added: 0008706
2015-02-11 20:18 Christian Grothoff Note Added: 0008874
2015-02-11 20:18 Christian Grothoff Relationship added related to 0003652
2015-03-08 16:59 Christian Grothoff Note Added: 0009006
2015-03-08 17:14 Christian Grothoff Note Added: 0009007
2015-03-08 17:16 Christian Grothoff Additional Information Updated View Revisions
2015-03-20 21:32 Christian Grothoff Target Version 0.11.0pre66 =>
2015-03-21 00:37 Christian Grothoff Target Version => 0.11.0
2018-06-07 00:38 Christian Grothoff Status assigned => closed
2018-06-07 00:38 Christian Grothoff Resolution open => fixed
2018-06-07 00:38 Christian Grothoff Fixed in Version => 0.11.0pre66
2018-06-07 00:38 Christian Grothoff Note Added: 0012988