0003863: TRANSPORT (and indirectly CORE) takes unreasonably large amounts of time to call back.

ID	Project	Category	View Status	Date Submitted	Last Update

0003863	GNUnet	transport service	public	2015-06-26 04:38	2018-06-07 00:24

Reporter	Bart Polot	Assigned To	Christian Grothoff
Priority	normal	Severity	major	Reproducibility	sometimes
Status	closed	Resolution	fixed
Product Version	Git master
Target Version	0.11.0pre66	Fixed in Version	0.11.0pre66

Summary	0003863: TRANSPORT (and indirectly CORE) takes unreasonably large amounts of time to call back.
Description	CADET times how much CORE takes to call back after a notify_transmit_ready. Sometimes running a peer (default settings) this exceeds a minute and a half: Jun 26 04:25:17-007658 cadet-p2p-25737 ERROR core wait time 95906486 µs > 1 second Recent versions of CADET complain of any delay bigger than 1 second. Updated CORE also now complains about TRANSPORT taking forever.
Steps To Reproduce	Run a peer normally connected to the network. or Run a CADET testcase enabling valgrind for CADET. Valgrind does not cause CADET to block and delay the delivery, the peer logs activity while waiting for CORE (timeouts, keepalive tasks, etc).
Additional Information	When running a testcase with unlimited bandwidth and very small messages (< 2kb) this is the delay distribution: http://imgur.com/9Q2CNVz.png Notice how most messages are sent immediately, but there are two bumps around the 0.25s and 1.2s delay.
Tags	No tags attached.

Bart Polot 2015-06-26 13:20 reporter ~0009342	Running a peer in the public network with: - CADET under valgrind - unlimited quotas for all ats categories - fs turned off Also causes this behavior.

Bart Polot 2015-06-26 13:52 reporter ~0009343 Last edited: 2015-06-26 13:56	Steps to generate graph: - run with CADET loglevel DEBUG at least for function "queue_send" > LOGFILE - grep "core wait time.µs" LOGFILE \| sed -e "s/.core wait time //" \| cut -d' ' -f 1 > core_wait.dat GNUplot script: #set term png #set output "plot.png" set xlabel "Delay in seconds" set ylabel "# of repetitions" set boxwidth 0.05 absolute set style fill solid 1.0 noborder bin_width = 50000; bin_number(x) = floor(x/bin_width) rounded(x) = bin_width * ( bin_number(x) + 0.5 ) plot "core_wait.dat" u (rounded($1)/1000000):(1) with boxes ti "" smooth frequency

Christian Grothoff 2015-06-28 15:37 manager ~0009360	Which testcase (with 2k messages, etc.) is it that I should run to reproduce?

Christian Grothoff 2015-10-05 18:37 manager ~0009697	I've put in some extra instrumentation for diagnostics. Most will show up at level WARNING (delays > 1s), alternatively one could do stuff like: export GNUNET_FORCE_LOG=";transport_api.c;GNUNET_TRANSPORT_notify_transmit_ready;;DEBUG/" to see every delay below core coming up from transport enforcing quotas. In my little tests, I did see CADET delays > 1s, but that was with small quotas causing congestion from transport on 32k messages (transmitted at 3k/s) to bubble up to CADET.

Christian Grothoff 2015-10-05 18:37 manager ~0009698	Looking for feedback on the new diagnostics. Right now, running the peer gives me tons of crappy CADET logs... :-(

Bart Polot 2015-10-06 17:13 reporter ~0009701	Seems that most cadet complaints are caused by transport, as core complains as well: Oct 06 17:04:31-430277 cadet-p2p-838 WARNING VMDC: core wait time 94 s (> 1 second) for 204 bytes Oct 06 17:04:31-430578 core-847 WARNING Client waited 90 s for permission to transmit to `DSTJ' (priority 3) Oct 06 17:04:31-430630 core-847 WARNING Client waited 95 s for permission to transmit to `60JY' (priority 3) Oct 06 17:04:31-430716 cadet-p2p-838 WARNING E48H: core wait time 94 s (> 1 second) for 40 bytes Oct 06 17:04:31-430794 cadet-p2p-838 WARNING GN10: core wait time 7 s (> 1 second) for 120 bytes Oct 06 17:04:31-432059 core-847 WARNING Client waited 94 s for transmission of 2460 bytes to `VMDC' (corked), CORE queue is 0 entries Oct 06 17:04:31-432767 core-847 WARNING Client waited 94 s for transmission of 148 bytes to `E48H' (corked), CORE queue is 0 entries Oct 06 17:04:31-433870 core-847 WARNING Client waited 93 s for transmission of 1525 bytes to `VMDC' (corked), CORE queue is 1 entries Transport shows up sporadically in the logs with quite restrictive bandwidth quotas (1024B/s, few kB): Oct 06 17:07:14-646699 transport-api-847 WARNING At bandwidth 1024 byte/s next transmission to EGEF in 15 s [...] Oct 06 17:10:47-334590 transport-api-847 WARNING At bandwidth 1024 byte/s next transmission to EGEF in 14 s Oct 06 17:10:50-544423 transport-api-847 WARNING At bandwidth 1024 byte/s next transmission to E48H in 16 s ----------------------------------------------------------- There are two weird things: - All quotas are requested to be removed in the config file: [ats] UNSPECIFIED_QUOTA_IN = unlimited UNSPECIFIED_QUOTA_OUT = unlimited LOOPBACK_QUOTA_IN = unlimited LOOPBACK_QUOTA_OUT = unlimited LAN_QUOTA_IN = unlimited LAN_QUOTA_OUT = unlimited WAN_QUOTA_IN = unlimited WAN_QUOTA_OUT = unlimited WLAN_QUOTA_IN = unlimited WLAN_QUOTA_OUT = unlimited BLUETOOTH_QUOTA_IN = unlimited BLUETOOTH_QUOTA_OUT = unlimited Is the remote peer imposing a quota? - Core mentions that transmissions are corked, while cadet always calls core_transmit_ready with a hardcoded GNUNET_NO in the cork parameter.

Bart Polot 2015-10-06 17:20 reporter ~0009702	Looking a bit more at the log I can see that traffic between two peers that I control is also delayed, up to 90 (!) seconds: Oct 06 17:01:45-263949 core-847 WARNING Client waited 4101 ms for permission to transmit to `BART' (priority 3) Oct 06 17:01:45-269429 core-847 WARNING Client waited 4106 ms for transmission of 108 bytes to `BART' (corked), CORE queue is 3 entries Oct 06 17:01:45-272691 core-847 WARNING Client waited 12 s for permission to transmit to `BART' (priority 1) Oct 06 17:01:45-275059 core-847 WARNING Client waited 12 s for transmission of 8324 bytes to `BART' (corked), CORE queue is 3 entries Oct 06 17:01:47-199779 core-847 WARNING Client waited 1541 ms for permission to transmit to `BART' (priority 1) Oct 06 17:01:47-200848 core-847 WARNING Client waited 1542 ms for transmission of 1859 bytes to `BART' (corked), CORE queue is 1 entries [...] Oct 06 17:03:01-675691 core-847 WARNING Client waited 4963 ms for permission to transmit to `BART' (priority 3) Oct 06 17:03:01-676209 core-847 WARNING Client waited 4964 ms for transmission of 272 bytes to `BART' (corked), CORE queue is 1 entries Oct 06 17:04:31-647891 core-847 WARNING Client waited 93 s for permission to transmit to `BART' (priority 1) Oct 06 17:04:31-661017 core-847 WARNING Client waited 93 s for transmission of 2100 bytes to `BART' (corked), CORE queue is 0 entries Oct 06 17:09:20-851005 core-847 WARNING Client waited 3606 ms for permission to transmit to `BART' (priority 3) Oct 06 17:09:20-853940 core-847 WARNING Client waited 4594 ms for permission to transmit to `BART' (priority 1) Oct 06 17:09:20-856972 core-847 WARNING Client waited 3612 ms for transmission of 272 bytes to `BART' (corked), CORE queue is 0 entries Oct 06 17:09:20-858242 core-847 WARNING Client waited 4598 ms for transmission of 3644 bytes to `BART' (corked), CORE queue is 1 entries Oct 06 17:11:01-482479 core-847 WARNING Client waited 7 s for permission to transmit to `BART' (priority 3) Oct 06 17:11:01-483440 core-847 WARNING Client waited 7 s for permission to transmit to `BART' (priority 1) Oct 06 17:11:01-483871 core-847 WARNING Client waited 7 s for transmission of 4218 bytes to `BART' (corked), CORE queue is 0 entries Oct 06 17:11:01-485820 core-847 WARNING Client waited 7 s for transmission of 500 bytes to `BART' (corked), CORE queue is 1 entries Oct 06 17:11:38-627376 core-847 WARNING Client waited 5 s for permission to transmit to `BART' (priority 2) Oct 06 17:11:38-630576 core-847 WARNING Client waited 3978 ms for permission to transmit to `BART' (priority 1) Oct 06 17:11:38-630675 core-847 WARNING Client waited 5 s for transmission of 272 bytes to `BART' (corked), CORE queue is 0 entries Oct 06 17:11:38-633386 core-847 WARNING Client waited 3981 ms for transmission of 4108 bytes to `BART' (corked), CORE queue is 1 entries Oct 06 17:11:52-434109 core-847 WARNING Client waited 2948 ms for permission to transmit to `BART' (priority 3) I am positive that both these peers have quotas set to unlimited in their config files, are not running with memory instrumentation and one of them even has [fs] disabled.

Christian Grothoff 2015-10-06 21:40 manager ~0009704	When transport establishes a connection, it starts (!) with a quota of 1kb/s until the other peer can announce the real quota. This is necessary because we don't want to send a peer more than it can handle. Naturally, the peer tries to let us know about its quota ASAP, but this is where you may get (brief) 1kb/s quotas. This may also happen on transport-level reconnects. Anyway, this isn't a good answer, but a first hint as to what might be going on.

Christian Grothoff 2015-10-06 21:43 manager ~0009705	What is clear is that we should put similar logging also into TRANSPORT, to see where the high delay ultimately comes from. And clearly additional scaffolding to diagnose latency issues is warranted -- I'd not have imagined core waiting 90+s for transmission and still ever succeeding. So something is badly broken.

Christian Grothoff 2015-10-06 21:44 manager ~0009706	Updated bug title, category and summary to reflect latest insights.

Bart Polot 2015-10-06 21:48 reporter ~0009707	Just FYI: I still see delays between two of my unlimited peers: Oct 06 21:29:14-161458 core-847 WARNING Client waited 5 s for transmission of 3461 bytes to `BART' (corked), CORE queue is 1 entries They have been running for a at least 5 hours, so they should have had time to tell each other about their quotas... `BART' is running on a gigabit datacenter connection, the other one is running at home on a 100/10Mbps connection, and both are almost idle (i.e, no congestion going on).

Bart Polot 2015-10-06 22:02 reporter ~0009708	New records: Oct 06 22:00:24-597194 cadet-p2p-838 WARNING STRN: core wait time 17 m (> 1 second) for 108 bytes Oct 06 22:00:24-597352 cadet-p2p-838 WARNING EGEF: core wait time 17 m (> 1 second) for 100 bytes Oct 06 22:00:24-597413 cadet-p2p-838 WARNING DSTJ: core wait time 16 m (> 1 second) for 100 bytes Oct 06 22:00:24-597474 cadet-p2p-838 WARNING 60JY: core wait time 16 m (> 1 second) for 120 bytes Oct 06 22:00:24-597547 cadet-p2p-838 WARNING BART: core wait time 15 m (> 1 second) for 100 bytes Oct 06 22:00:24-597603 cadet-p2p-838 WARNING RNGR: core wait time 15 m (> 1 second) for 120 bytes Oct 06 22:00:24-597661 cadet-p2p-838 WARNING VMDC: core wait time 13 m (> 1 second) for 120 bytes Oct 06 22:00:24-597738 cadet-p2p-838 WARNING GN10: core wait time 13 m (> 1 second) for 120 bytes Oct 06 22:00:24-597800 cadet-p2p-838 WARNING PH0B: core wait time 12 m (> 1 second) for 120 bytes Oct 06 22:00:24-597860 cadet-p2p-838 WARNING T03N: core wait time 87 s (> 1 second) for 164 bytes

Christian Grothoff 2015-10-06 23:57 manager ~0009709	Clearly we need to add logging to transport to see where the delay really comes from. But 15m is in sane, as CORE should disconnect before then. Do you get CORE 'delayed' messages for the same peer here? I assume you'd not log 'core wait time' if you got a disconnect, right?

Bart Polot 2015-10-07 00:11 reporter ~0009710	The message is logged from the callback given to CORE_notify_transmit_ready and after checking for (buffer == NULL or size == 0), which IIUC would indicate a timeout or disconnection. Moreover, the core_disconnect callback was not activated.

Christian Grothoff 2015-10-07 00:25 manager ~0009711	If CORE doesn't also log, I wonder if this issue relates to 0003910, where somehow we manage to have a disagreement between core and the lib as to which peers are connected. What does 'gnunet-core -m' say about connected peers in the meantime (i.e. before you get to send a message after 15m)? (I wonder if we got disconnected, reconnected after 10-15m and then it seems to CADET like it was just waiting forever...). That doesn't explain the multi-second latencies on transport, but maybe we have two issues interacting here.

Christian Grothoff 2015-10-08 19:12 manager ~0009713	SVN 36492 fixes another related issue where an application that requests transmission of a high-priority (> background) message then fails to send anything (returns 0) might cause CORE to block all other lower-priority traffic (or even equal-priority) until the application sends another message, as CORE believes it's still waiting for that message. This was fixed by having core_api send a 0-byte message which in CORE now triggers cleaning up the high-priority block, allowing lower-priority messages to go through. 36492 also fixes some messages dropped due to 'timeout' (so API behaves more like it was documented).

Bart Polot 2015-10-08 19:57 reporter ~0009714	I'm cautiosly optimistic that it could be solved. I'm still seeing delays of up to a minute but only to two peers (T03N, BBCW), caused by transport_api announcing a BW of 1k/s. I will run the regex and cadet profilers later and see if the delays are gone there.

Bart Polot 2015-10-09 00:53 reporter ~0009716	Haven't gotten to run the profilers yet, however I still see some delays, much milder than before: Oct 09 00:37:52-655593 transport-api-16003 WARNING Added 840 bytes of payload message for BART after 1032 ms delay at 672500315 b/s Oct 09 00:37:52-660048 transport-api-16003 WARNING Added 624 bytes of payload message for RNGR after 1020 ms delay at 1130554465 b/s Oct 09 00:37:52-660161 core-16003 WARNING Client waited 1013 ms for permission to transmit to `RNGR' (priority 1) Oct 09 00:37:52-660335 core-api-15997 WARNING Transmitting overdue 1954 bytes to `RNGR' at priority 1 with 1013 ms delay Oct 09 00:37:52-660505 core-16003 WARNING Client waited 1013 ms for transmission of 1954 bytes to `RNGR' (corked), CORE queue is 1 entries Oct 09 00:37:52-661296 core-16003 WARNING Client waited 1016 ms for permission to transmit to `BART' (priority 1) Oct 09 00:37:52-661889 core-api-15997 WARNING Transmitting overdue 1706 bytes to `BART' at priority 1 with 1019 ms delay Oct 09 00:37:52-663928 core-16003 WARNING Client waited 1019 ms for transmission of 1706 bytes to `BART' (corked), CORE queue is 2 entries Peer RNGR is connected via gigabit ethernet, all quotas off.

Bart Polot 2015-10-09 01:18 reporter ~0009717 Last edited: 2015-10-09 01:19	Another bad delay: Oct 09 01:15:35-239148 core-api-15996 WARNING Transmitting overdue 360 bytes to `RNGR' at priority 2 with 13 m delay Oct 09 01:15:35-239686 core-api-15997 WARNING Transmitting overdue 0 bytes to `BART' at priority 1 with 9 m delay [...] Oct 09 01:15:36-439496 transport-api-16003 WARNING Added 160 bytes of payload message for BART after 1201 ms delay at 683651645 b/s :/

Christian Grothoff 2015-10-09 06:43 manager ~0009718	If it says "corked", it means some application set a flag that allows core to delay up to 1s. So in that case, some 'delay' is to be expected. So please also check which options may be leading to the warnings...

Christian Grothoff 2015-10-17 22:29 manager ~0009738	Found a UDP-specific issue where UDP could (worst-case) delay packets 60x longer than desired, so instead of say 1s it might take 60s because UDP decided to throttle (more than necessary due to a failure to consider fragmentation). So that ought to be fixed in SVN 36521.

Christian Grothoff 2015-10-26 16:47 manager ~0009794	Bart, is the situation better now? I still see warnings, but those seem to be largely explicable by UDP+low bandwidth simply taking a few secs sometimes.

Christian Grothoff 2015-10-29 14:33 manager ~0009854	Short of additional feedback, I think I can declare victory on this.

Date Modified	Username	Field	Change
2015-06-26 04:38	Bart Polot	New Issue
2015-06-26 04:38	Bart Polot	Status	new => assigned
2015-06-26 04:38	Bart Polot	Assigned To	=> Christian Grothoff
2015-06-26 13:20	Bart Polot	Note Added: 0009342
2015-06-26 13:52	Bart Polot	Note Added: 0009343
2015-06-26 13:56	Bart Polot	Note Edited: 0009343
2015-06-28 15:37	Christian Grothoff	Note Added: 0009360
2015-10-05 18:37	Christian Grothoff	Note Added: 0009697
2015-10-05 18:37	Christian Grothoff	Note Added: 0009698
2015-10-05 18:37	Christian Grothoff	Assigned To	Christian Grothoff => Bart Polot
2015-10-05 18:37	Christian Grothoff	Status	assigned => feedback
2015-10-06 17:13	Bart Polot	Note Added: 0009701
2015-10-06 17:13	Bart Polot	Status	feedback => assigned
2015-10-06 17:20	Bart Polot	Note Added: 0009702
2015-10-06 21:40	Christian Grothoff	Note Added: 0009704
2015-10-06 21:43	Christian Grothoff	Note Added: 0009705
2015-10-06 21:43	Christian Grothoff	Assigned To	Bart Polot =>
2015-10-06 21:43	Christian Grothoff	Status	assigned => confirmed
2015-10-06 21:43	Christian Grothoff	Category	core service => transport service
2015-10-06 21:43	Christian Grothoff	Target Version	=> 0.11.0pre66
2015-10-06 21:44	Christian Grothoff	Note Added: 0009706
2015-10-06 21:44	Christian Grothoff	Summary	CORE takes unreasonably large amounts of time to call back. => TRANSPORT (and indirectly CORE) takes unreasonably large amounts of time to call back.
2015-10-06 21:44	Christian Grothoff	Description Updated
2015-10-06 21:48	Bart Polot	Note Added: 0009707
2015-10-06 22:02	Bart Polot	Note Added: 0009708
2015-10-06 23:57	Christian Grothoff	Note Added: 0009709
2015-10-07 00:11	Bart Polot	Note Added: 0009710
2015-10-07 00:25	Christian Grothoff	Note Added: 0009711
2015-10-08 19:12	Christian Grothoff	Note Added: 0009713
2015-10-08 19:57	Bart Polot	Note Added: 0009714
2015-10-09 00:53	Bart Polot	Note Added: 0009716
2015-10-09 01:18	Bart Polot	Note Added: 0009717
2015-10-09 01:19	Bart Polot	Note Edited: 0009717
2015-10-09 06:43	Christian Grothoff	Note Added: 0009718
2015-10-17 03:57	Christian Grothoff	Relationship added	parent of 0004009
2015-10-17 22:29	Christian Grothoff	Note Added: 0009738
2015-10-22 22:06	Bart Polot	Relationship added	related to 0004019
2015-10-26 16:47	Christian Grothoff	Note Added: 0009794
2015-10-29 14:33	Christian Grothoff	Assigned To	=> Christian Grothoff
2015-10-29 14:33	Christian Grothoff	Status	confirmed => assigned
2015-10-29 14:33	Christian Grothoff	Note Added: 0009854
2015-10-29 14:33	Christian Grothoff	Status	assigned => resolved
2015-10-29 14:33	Christian Grothoff	Fixed in Version	=> 0.11.0pre66
2015-10-29 14:33	Christian Grothoff	Resolution	open => fixed
2018-06-07 00:24	Christian Grothoff	Status	resolved => closed

View Issue Details

Relationships

Activities

Issue History

parent of	0004009	closed	Christian Grothoff	bandwidth disagreement between ATS and transport lib
related to	0004019	closed	Christian Grothoff	CORE compains about mixed identities