View Issue Details
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0003863||GNUnet||transport service||public||2015-06-26 04:38||2018-06-07 00:24|
|Reporter||Bart Polot||Assigned To||Christian Grothoff|
|Product Version||SVN HEAD|
|Target Version||0.11.0pre66||Fixed in Version||0.11.0pre66|
|Summary||0003863: TRANSPORT (and indirectly CORE) takes unreasonably large amounts of time to call back.|
|Description||CADET times how much CORE takes to call back after a notify_transmit_ready. Sometimes running a peer (default settings) this exceeds a minute and a half:|
Jun 26 04:25:17-007658 cadet-p2p-25737 ERROR core wait time 95906486 µs > 1 second
Recent versions of CADET complain of any delay bigger than 1 second.
Updated CORE also now complains about TRANSPORT taking forever.
|Steps To Reproduce||Run a peer normally connected to the network.|
Run a CADET testcase enabling valgrind for CADET. Valgrind does not cause CADET to block and delay the delivery, the peer logs activity while waiting for CORE (timeouts, keepalive tasks, etc).
|Additional Information||When running a testcase with unlimited bandwidth and very small messages (< 2kb) this is the delay distribution:|
Notice how most messages are sent immediately, but there are two bumps around the 0.25s and 1.2s delay.
|Tags||No tags attached.|
Running a peer in the public network with:
- CADET under valgrind
- unlimited quotas for all ats categories
- fs turned off
Also causes this behavior.
Steps to generate graph:
- run with CADET loglevel DEBUG at least for function "queue_send" > LOGFILE
- grep "core wait time.*µs" LOGFILE | sed -e "s/.*core wait time //" | cut -d' ' -f 1 > core_wait.dat
#set term png
#set output "plot.png"
set xlabel "Delay in seconds"
set ylabel "# of repetitions"
set boxwidth 0.05 absolute
set style fill solid 1.0 noborder
bin_width = 50000;
bin_number(x) = floor(x/bin_width)
rounded(x) = bin_width * ( bin_number(x) + 0.5 )
plot "core_wait.dat" u (rounded($1)/1000000):(1) with boxes ti "" smooth frequency
||Which testcase (with 2k messages, etc.) is it that I should run to reproduce?|
I've put in some extra instrumentation for diagnostics. Most will show up at level WARNING (delays > 1s), alternatively one could do stuff like:
to see every delay *below* core coming up from transport enforcing quotas.
In my little tests, I did see CADET delays > 1s, but that was with small quotas causing congestion from transport on 32k messages (transmitted at 3k/s) to bubble up to CADET.
Looking for feedback on the new diagnostics.
Right now, running the peer gives me tons of crappy CADET logs... :-(
Seems that most cadet complaints are caused by transport, as core complains as well:
Oct 06 17:04:31-430277 cadet-p2p-838 WARNING VMDC: core wait time 94 s (> 1 second) for 204 bytes
Oct 06 17:04:31-430578 core-847 WARNING Client waited 90 s for permission to transmit to `DSTJ' (priority 3)
Oct 06 17:04:31-430630 core-847 WARNING Client waited 95 s for permission to transmit to `60JY' (priority 3)
Oct 06 17:04:31-430716 cadet-p2p-838 WARNING E48H: core wait time 94 s (> 1 second) for 40 bytes
Oct 06 17:04:31-430794 cadet-p2p-838 WARNING GN10: core wait time 7 s (> 1 second) for 120 bytes
Oct 06 17:04:31-432059 core-847 WARNING Client waited 94 s for transmission of 2460 bytes to `VMDC' (corked), CORE queue is 0 entries
Oct 06 17:04:31-432767 core-847 WARNING Client waited 94 s for transmission of 148 bytes to `E48H' (corked), CORE queue is 0 entries
Oct 06 17:04:31-433870 core-847 WARNING Client waited 93 s for transmission of 1525 bytes to `VMDC' (corked), CORE queue is 1 entries
Transport shows up sporadically in the logs with quite restrictive bandwidth quotas (1024B/s, few kB):
Oct 06 17:07:14-646699 transport-api-847 WARNING At bandwidth 1024 byte/s next transmission to EGEF in 15 s
Oct 06 17:10:47-334590 transport-api-847 WARNING At bandwidth 1024 byte/s next transmission to EGEF in 14 s
Oct 06 17:10:50-544423 transport-api-847 WARNING At bandwidth 1024 byte/s next transmission to E48H in 16 s
There are two weird things:
- All quotas are requested to be removed in the config file:
UNSPECIFIED_QUOTA_IN = unlimited
UNSPECIFIED_QUOTA_OUT = unlimited
LOOPBACK_QUOTA_IN = unlimited
LOOPBACK_QUOTA_OUT = unlimited
LAN_QUOTA_IN = unlimited
LAN_QUOTA_OUT = unlimited
WAN_QUOTA_IN = unlimited
WAN_QUOTA_OUT = unlimited
WLAN_QUOTA_IN = unlimited
WLAN_QUOTA_OUT = unlimited
BLUETOOTH_QUOTA_IN = unlimited
BLUETOOTH_QUOTA_OUT = unlimited
Is the remote peer imposing a quota?
- Core mentions that transmissions are corked, while cadet always calls core_transmit_ready with a hardcoded GNUNET_NO in the cork parameter.
Looking a bit more at the log I can see that traffic between two peers that I control is also delayed, up to 90 (!) seconds:
Oct 06 17:01:45-263949 core-847 WARNING Client waited 4101 ms for permission to transmit to `BART' (priority 3)
Oct 06 17:01:45-269429 core-847 WARNING Client waited 4106 ms for transmission of 108 bytes to `BART' (corked), CORE queue is 3 entries
Oct 06 17:01:45-272691 core-847 WARNING Client waited 12 s for permission to transmit to `BART' (priority 1)
Oct 06 17:01:45-275059 core-847 WARNING Client waited 12 s for transmission of 8324 bytes to `BART' (corked), CORE queue is 3 entries
Oct 06 17:01:47-199779 core-847 WARNING Client waited 1541 ms for permission to transmit to `BART' (priority 1)
Oct 06 17:01:47-200848 core-847 WARNING Client waited 1542 ms for transmission of 1859 bytes to `BART' (corked), CORE queue is 1 entries
Oct 06 17:03:01-675691 core-847 WARNING Client waited 4963 ms for permission to transmit to `BART' (priority 3)
Oct 06 17:03:01-676209 core-847 WARNING Client waited 4964 ms for transmission of 272 bytes to `BART' (corked), CORE queue is 1 entries
Oct 06 17:04:31-647891 core-847 WARNING Client waited 93 s for permission to transmit to `BART' (priority 1)
Oct 06 17:04:31-661017 core-847 WARNING Client waited 93 s for transmission of 2100 bytes to `BART' (corked), CORE queue is 0 entries
Oct 06 17:09:20-851005 core-847 WARNING Client waited 3606 ms for permission to transmit to `BART' (priority 3)
Oct 06 17:09:20-853940 core-847 WARNING Client waited 4594 ms for permission to transmit to `BART' (priority 1)
Oct 06 17:09:20-856972 core-847 WARNING Client waited 3612 ms for transmission of 272 bytes to `BART' (corked), CORE queue is 0 entries
Oct 06 17:09:20-858242 core-847 WARNING Client waited 4598 ms for transmission of 3644 bytes to `BART' (corked), CORE queue is 1 entries
Oct 06 17:11:01-482479 core-847 WARNING Client waited 7 s for permission to transmit to `BART' (priority 3)
Oct 06 17:11:01-483440 core-847 WARNING Client waited 7 s for permission to transmit to `BART' (priority 1)
Oct 06 17:11:01-483871 core-847 WARNING Client waited 7 s for transmission of 4218 bytes to `BART' (corked), CORE queue is 0 entries
Oct 06 17:11:01-485820 core-847 WARNING Client waited 7 s for transmission of 500 bytes to `BART' (corked), CORE queue is 1 entries
Oct 06 17:11:38-627376 core-847 WARNING Client waited 5 s for permission to transmit to `BART' (priority 2)
Oct 06 17:11:38-630576 core-847 WARNING Client waited 3978 ms for permission to transmit to `BART' (priority 1)
Oct 06 17:11:38-630675 core-847 WARNING Client waited 5 s for transmission of 272 bytes to `BART' (corked), CORE queue is 0 entries
Oct 06 17:11:38-633386 core-847 WARNING Client waited 3981 ms for transmission of 4108 bytes to `BART' (corked), CORE queue is 1 entries
Oct 06 17:11:52-434109 core-847 WARNING Client waited 2948 ms for permission to transmit to `BART' (priority 3)
I am positive that both these peers have quotas set to unlimited in their config files, are not running with memory instrumentation and one of them even has [fs] disabled.
When transport establishes a connection, it starts (!) with a quota of 1kb/s until the other peer can announce the real quota. This is necessary because we don't want to send a peer more than it can handle. Naturally, the peer tries to let us know about its quota ASAP, but this is where you may get (brief) 1kb/s quotas. This may also happen on transport-level reconnects.
Anyway, this isn't a good answer, but a first hint as to what might be going on.
||What is clear is that we should put similar logging also into TRANSPORT, to see where the high delay ultimately comes from. And clearly additional scaffolding to diagnose latency issues is warranted -- I'd not have imagined core waiting 90+s for transmission and still ever succeeding. So something is badly broken.|
||Updated bug title, category and summary to reflect latest insights.|
Just FYI: I still see delays between two of my unlimited peers:
Oct 06 21:29:14-161458 core-847 WARNING Client waited 5 s for transmission of 3461 bytes to `BART' (corked), CORE queue is 1 entries
They have been running for a at least 5 hours, so they should have had time to tell each other about their quotas...
`BART' is running on a gigabit datacenter connection, the other one is running at home on a 100/10Mbps connection, and both are almost idle (i.e, no congestion going on).
Oct 06 22:00:24-597194 cadet-p2p-838 WARNING STRN: core wait time 17 m (> 1 second) for 108 bytes
Oct 06 22:00:24-597352 cadet-p2p-838 WARNING EGEF: core wait time 17 m (> 1 second) for 100 bytes
Oct 06 22:00:24-597413 cadet-p2p-838 WARNING DSTJ: core wait time 16 m (> 1 second) for 100 bytes
Oct 06 22:00:24-597474 cadet-p2p-838 WARNING 60JY: core wait time 16 m (> 1 second) for 120 bytes
Oct 06 22:00:24-597547 cadet-p2p-838 WARNING BART: core wait time 15 m (> 1 second) for 100 bytes
Oct 06 22:00:24-597603 cadet-p2p-838 WARNING RNGR: core wait time 15 m (> 1 second) for 120 bytes
Oct 06 22:00:24-597661 cadet-p2p-838 WARNING VMDC: core wait time 13 m (> 1 second) for 120 bytes
Oct 06 22:00:24-597738 cadet-p2p-838 WARNING GN10: core wait time 13 m (> 1 second) for 120 bytes
Oct 06 22:00:24-597800 cadet-p2p-838 WARNING PH0B: core wait time 12 m (> 1 second) for 120 bytes
Oct 06 22:00:24-597860 cadet-p2p-838 WARNING T03N: core wait time 87 s (> 1 second) for 164 bytes
||Clearly we need to add logging to transport to see where the delay really comes from. But 15m is in sane, as CORE should disconnect before then. Do you get CORE 'delayed' messages for the same peer here? I assume you'd not log 'core wait time' if you got a disconnect, right?|
The message is logged from the callback given to CORE_notify_transmit_ready and after checking for (buffer == NULL or size == 0), which IIUC would indicate a timeout or disconnection.
Moreover, the core_disconnect callback was not activated.
If CORE doesn't *also* log, I wonder if this issue relates to 0003910, where somehow we manage to have a disagreement between core and the lib as to which peers are connected. What does 'gnunet-core -m' say about connected peers in the meantime (i.e. before you get to send a message after 15m)? (I wonder if we got disconnected, reconnected after 10-15m and then it seems to CADET like it was just waiting forever...).
That doesn't explain the multi-second latencies on transport, but maybe we have two issues interacting here.
SVN 36492 fixes another related issue where an application that requests transmission of a high-priority (> background) message then fails to send anything (returns 0) might cause CORE to block all other lower-priority traffic (or even equal-priority) until the application sends *another* message, as CORE believes it's still waiting for that message.
This was fixed by having core_api send a 0-byte message which in CORE now triggers cleaning up the high-priority block, allowing lower-priority messages to go through.
36492 also fixes some messages dropped due to 'timeout' (so API behaves more like it was documented).
I'm cautiosly optimistic that it could be solved. I'm still seeing delays of up to a minute but only to two peers (T03N, BBCW), caused by transport_api announcing a BW of 1k/s.
I will run the regex and cadet profilers later and see if the delays are gone there.
Haven't gotten to run the profilers yet, however I still see some delays, much milder than before:
Oct 09 00:37:52-655593 transport-api-16003 WARNING Added 840 bytes of payload message for BART after 1032 ms delay at 672500315 b/s
Oct 09 00:37:52-660048 transport-api-16003 WARNING Added 624 bytes of payload message for RNGR after 1020 ms delay at 1130554465 b/s
Oct 09 00:37:52-660161 core-16003 WARNING Client waited 1013 ms for permission to transmit to `RNGR' (priority 1)
Oct 09 00:37:52-660335 core-api-15997 WARNING Transmitting overdue 1954 bytes to `RNGR' at priority 1 with 1013 ms delay
Oct 09 00:37:52-660505 core-16003 WARNING Client waited 1013 ms for transmission of 1954 bytes to `RNGR' (corked), CORE queue is 1 entries
Oct 09 00:37:52-661296 core-16003 WARNING Client waited 1016 ms for permission to transmit to `BART' (priority 1)
Oct 09 00:37:52-661889 core-api-15997 WARNING Transmitting overdue 1706 bytes to `BART' at priority 1 with 1019 ms delay
Oct 09 00:37:52-663928 core-16003 WARNING Client waited 1019 ms for transmission of 1706 bytes to `BART' (corked), CORE queue is 2 entries
Peer RNGR is connected via gigabit ethernet, all quotas off.
Another bad delay:
Oct 09 01:15:35-239148 core-api-15996 WARNING Transmitting overdue 360 bytes to `RNGR' at priority 2 with 13 m delay
Oct 09 01:15:35-239686 core-api-15997 WARNING Transmitting overdue 0 bytes to `BART' at priority 1 with 9 m delay
Oct 09 01:15:36-439496 transport-api-16003 WARNING Added 160 bytes of payload message for BART after 1201 ms delay at 683651645 b/s
||If it says "corked", it means some application set a flag that allows core to delay up to 1s. So in that case, some 'delay' is to be expected. So please also check which options may be leading to the warnings...|
||Found a UDP-specific issue where UDP could (worst-case) delay packets 60x longer than desired, so instead of say 1s it might take 60s because UDP decided to throttle (more than necessary due to a failure to consider fragmentation). So that ought to be fixed in SVN 36521.|
||Bart, is the situation better now? I still see warnings, but those seem to be largely explicable by UDP+low bandwidth simply taking a few secs sometimes.|
||Short of additional feedback, I think I can declare victory on this.|
|2015-06-26 04:38||Bart Polot||New Issue|
|2015-06-26 04:38||Bart Polot||Status||new => assigned|
|2015-06-26 04:38||Bart Polot||Assigned To||=> Christian Grothoff|
|2015-06-26 13:20||Bart Polot||Note Added: 0009342|
|2015-06-26 13:52||Bart Polot||Note Added: 0009343|
|2015-06-26 13:56||Bart Polot||Note Edited: 0009343||View Revisions|
|2015-06-28 15:37||Christian Grothoff||Note Added: 0009360|
|2015-10-05 18:37||Christian Grothoff||Note Added: 0009697|
|2015-10-05 18:37||Christian Grothoff||Note Added: 0009698|
|2015-10-05 18:37||Christian Grothoff||Assigned To||Christian Grothoff => Bart Polot|
|2015-10-05 18:37||Christian Grothoff||Status||assigned => feedback|
|2015-10-06 17:13||Bart Polot||Note Added: 0009701|
|2015-10-06 17:13||Bart Polot||Status||feedback => assigned|
|2015-10-06 17:20||Bart Polot||Note Added: 0009702|
|2015-10-06 21:40||Christian Grothoff||Note Added: 0009704|
|2015-10-06 21:43||Christian Grothoff||Note Added: 0009705|
|2015-10-06 21:43||Christian Grothoff||Assigned To||Bart Polot =>|
|2015-10-06 21:43||Christian Grothoff||Status||assigned => confirmed|
|2015-10-06 21:43||Christian Grothoff||Category||core service => transport service|
|2015-10-06 21:43||Christian Grothoff||Target Version||=> 0.11.0pre66|
|2015-10-06 21:44||Christian Grothoff||Note Added: 0009706|
|2015-10-06 21:44||Christian Grothoff||Summary||CORE takes unreasonably large amounts of time to call back. => TRANSPORT (and indirectly CORE) takes unreasonably large amounts of time to call back.|
|2015-10-06 21:44||Christian Grothoff||Description Updated||View Revisions|
|2015-10-06 21:48||Bart Polot||Note Added: 0009707|
|2015-10-06 22:02||Bart Polot||Note Added: 0009708|
|2015-10-06 23:57||Christian Grothoff||Note Added: 0009709|
|2015-10-07 00:11||Bart Polot||Note Added: 0009710|
|2015-10-07 00:25||Christian Grothoff||Note Added: 0009711|
|2015-10-08 19:12||Christian Grothoff||Note Added: 0009713|
|2015-10-08 19:57||Bart Polot||Note Added: 0009714|
|2015-10-09 00:53||Bart Polot||Note Added: 0009716|
|2015-10-09 01:18||Bart Polot||Note Added: 0009717|
|2015-10-09 01:19||Bart Polot||Note Edited: 0009717||View Revisions|
|2015-10-09 06:43||Christian Grothoff||Note Added: 0009718|
|2015-10-17 03:57||Christian Grothoff||Relationship added||parent of 0004009|
|2015-10-17 22:29||Christian Grothoff||Note Added: 0009738|
|2015-10-22 22:06||Bart Polot||Relationship added||related to 0004019|
|2015-10-26 16:47||Christian Grothoff||Note Added: 0009794|
|2015-10-29 14:33||Christian Grothoff||Assigned To||=> Christian Grothoff|
|2015-10-29 14:33||Christian Grothoff||Status||confirmed => assigned|
|2015-10-29 14:33||Christian Grothoff||Note Added: 0009854|
|2015-10-29 14:33||Christian Grothoff||Status||assigned => resolved|
|2015-10-29 14:33||Christian Grothoff||Fixed in Version||=> 0.11.0pre66|
|2015-10-29 14:33||Christian Grothoff||Resolution||open => fixed|
|2018-06-07 00:24||Christian Grothoff||Status||resolved => closed|