View Issue Details

ID: 0003863
Project: GNUnet
Category: transport service
View Status: public
Last Update: 2018-06-07 00:24
Reporter: Bart Polot
Assigned To: Christian Grothoff
Priority: normal
Severity: major
Reproducibility: sometimes
Status: closed
Resolution: fixed
Product Version: Git master
Target Version: 0.11.0pre66
Fixed in Version: 0.11.0pre66
Summary: 0003863: TRANSPORT (and indirectly CORE) takes unreasonably large amounts of time to call back.
Description: CADET times how long CORE takes to call back after a notify_transmit_ready. Sometimes, when running a peer with default settings, this exceeds a minute and a half:

Jun 26 04:25:17-007658 cadet-p2p-25737 ERROR core wait time 95906486 µs > 1 second


Recent versions of CADET complain about any delay larger than 1 second.

Updated CORE also now complains about TRANSPORT taking forever.
Steps To Reproduce: Run a peer normally connected to the network.

or

Run a CADET testcase with valgrind enabled for CADET. Valgrind does not cause CADET to block and delay the delivery; the peer logs activity while waiting for CORE (timeouts, keepalive tasks, etc.).
Additional Information: When running a testcase with unlimited bandwidth and very small messages (< 2kb) this is the delay distribution:

http://imgur.com/9Q2CNVz.png

Notice how most messages are sent immediately, but there are two bumps, around 0.25 s and 1.2 s of delay.
Tags: No tags attached.

Relationships

parent of 0004009 (closed, Christian Grothoff): bandwidth disagreement between ATS and transport lib
related to 0004019 (closed, Christian Grothoff): CORE compains about mixed identities

Activities

Bart Polot

2015-06-26 13:20

manager   ~0009342

Running a peer in the public network with:
- CADET under valgrind
- unlimited quotas for all ats categories
- fs turned off
Also causes this behavior.

Bart Polot

2015-06-26 13:52

manager   ~0009343

Last edited: 2015-06-26 13:56

Steps to generate the graph:

- run with CADET log level DEBUG (at least for function "queue_send"), redirecting output to LOGFILE
- grep "core wait time.*µs" LOGFILE | sed -e "s/.*core wait time //" | cut -d' ' -f 1 > core_wait.dat

GNUplot script:

#set term png
#set output "plot.png"
set xlabel "Delay in seconds"
set ylabel "# of repetitions"

set boxwidth 0.05 absolute
set style fill solid 1.0 noborder

bin_width = 50000;
bin_number(x) = floor(x/bin_width)
rounded(x) = bin_width * ( bin_number(x) + 0.5 )

plot "core_wait.dat" u (rounded($1)/1000000):(1) with boxes ti "" smooth frequency
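For anyone without gnuplot handy, the same 50 ms binning can be sketched in Python (a hypothetical helper, not part of GNUnet; it assumes the core_wait.dat values are microseconds, as produced by the grep pipeline above):

```python
# Hypothetical sketch: bin µs wait times from core_wait.dat the same way
# the gnuplot script does (50 ms bins, counts per bin midpoint).
from collections import Counter

BIN_WIDTH = 50_000  # 50 ms in µs, matching bin_width in the gnuplot script

def histogram(waits_us):
    """Map each wait time to its bin's midpoint (in seconds) and count."""
    counts = Counter(w // BIN_WIDTH for w in waits_us)
    return {(b + 0.5) * BIN_WIDTH / 1_000_000: n
            for b, n in sorted(counts.items())}

if __name__ == "__main__":
    sample = [12, 40_000, 260_000, 270_000, 1_200_000]
    for midpoint, n in histogram(sample).items():
        print(f"{midpoint:6.3f}s {'#' * n}")
```

Feeding it the real core_wait.dat should reproduce the two bumps visible in the plot.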

Christian Grothoff

2015-06-28 15:37

manager   ~0009360

Which testcase (with 2k messages, etc.) is it that I should run to reproduce?

Christian Grothoff

2015-10-05 18:37

manager   ~0009697

I've put in some extra instrumentation for diagnostics. Most will show up at level WARNING (delays > 1s), alternatively one could do stuff like:

export GNUNET_FORCE_LOG=";transport_api.c;GNUNET_TRANSPORT_notify_transmit_ready;;DEBUG/"

to see every delay *below* core coming up from transport enforcing quotas.

In my little tests, I did see CADET delays > 1s, but that was with small quotas causing transport-level congestion on 32k messages (transmitted at 3k/s) to bubble up to CADET.
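That quota arithmetic is easy to sanity-check. A small sketch (my own model, not GNUnet code; the function names are invented) of how long a strict byte quota stalls queued messages:

```python
# Rough model of transport quota back-pressure: messages drain strictly
# at the quota rate, so each queued message waits for everything ahead.
def transmit_delay(msg_bytes: int, quota_bytes_per_s: int) -> float:
    """Seconds until the quota has admitted msg_bytes."""
    return msg_bytes / quota_bytes_per_s

def queue_delay(msgs, quota):
    """Cumulative wait for each queued message under a strict quota."""
    total, out = 0, []
    for m in msgs:
        total += m
        out.append(total / quota)
    return out
```

At 3k/s, a single 32k message alone accounts for roughly 10.7 s of wait, so sub-minute CADET delays are consistent with quota congestion alone.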

Christian Grothoff

2015-10-05 18:37

manager   ~0009698

Looking for feedback on the new diagnostics.

Right now, running the peer gives me tons of crappy CADET logs... :-(

Bart Polot

2015-10-06 17:13

manager   ~0009701

It seems that most CADET complaints are caused by transport, since CORE complains as well:
Oct 06 17:04:31-430277 cadet-p2p-838 WARNING VMDC: core wait time 94 s (> 1 second) for 204 bytes
Oct 06 17:04:31-430578 core-847 WARNING Client waited 90 s for permission to transmit to `DSTJ' (priority 3)
Oct 06 17:04:31-430630 core-847 WARNING Client waited 95 s for permission to transmit to `60JY' (priority 3)
Oct 06 17:04:31-430716 cadet-p2p-838 WARNING E48H: core wait time 94 s (> 1 second) for 40 bytes
Oct 06 17:04:31-430794 cadet-p2p-838 WARNING GN10: core wait time 7 s (> 1 second) for 120 bytes
Oct 06 17:04:31-432059 core-847 WARNING Client waited 94 s for transmission of 2460 bytes to `VMDC' (corked), CORE queue is 0 entries
Oct 06 17:04:31-432767 core-847 WARNING Client waited 94 s for transmission of 148 bytes to `E48H' (corked), CORE queue is 0 entries
Oct 06 17:04:31-433870 core-847 WARNING Client waited 93 s for transmission of 1525 bytes to `VMDC' (corked), CORE queue is 1 entries

Transport shows up sporadically in the logs with quite restrictive bandwidth quotas (1024B/s, few kB):
Oct 06 17:07:14-646699 transport-api-847 WARNING At bandwidth 1024 byte/s next transmission to EGEF in 15 s
[...]
Oct 06 17:10:47-334590 transport-api-847 WARNING At bandwidth 1024 byte/s next transmission to EGEF in 14 s
Oct 06 17:10:50-544423 transport-api-847 WARNING At bandwidth 1024 byte/s next transmission to E48H in 16 s

-----------------------------------------------------------

There are two weird things:

- All quotas are requested to be removed in the config file:

[ats]
UNSPECIFIED_QUOTA_IN = unlimited
UNSPECIFIED_QUOTA_OUT = unlimited

LOOPBACK_QUOTA_IN = unlimited
LOOPBACK_QUOTA_OUT = unlimited

LAN_QUOTA_IN = unlimited
LAN_QUOTA_OUT = unlimited

WAN_QUOTA_IN = unlimited
WAN_QUOTA_OUT = unlimited

WLAN_QUOTA_IN = unlimited
WLAN_QUOTA_OUT = unlimited

BLUETOOTH_QUOTA_IN = unlimited
BLUETOOTH_QUOTA_OUT = unlimited

Is the remote peer imposing a quota?

- CORE mentions that transmissions are corked, while CADET always calls core_transmit_ready with a hardcoded GNUNET_NO in the cork parameter.

Bart Polot

2015-10-06 17:20

manager   ~0009702

Looking a bit more at the log, I can see that traffic between two peers that I control is also delayed, by up to 90 (!) seconds:

Oct 06 17:01:45-263949 core-847 WARNING Client waited 4101 ms for permission to transmit to `BART' (priority 3)
Oct 06 17:01:45-269429 core-847 WARNING Client waited 4106 ms for transmission of 108 bytes to `BART' (corked), CORE queue is 3 entries
Oct 06 17:01:45-272691 core-847 WARNING Client waited 12 s for permission to transmit to `BART' (priority 1)
Oct 06 17:01:45-275059 core-847 WARNING Client waited 12 s for transmission of 8324 bytes to `BART' (corked), CORE queue is 3 entries
Oct 06 17:01:47-199779 core-847 WARNING Client waited 1541 ms for permission to transmit to `BART' (priority 1)
Oct 06 17:01:47-200848 core-847 WARNING Client waited 1542 ms for transmission of 1859 bytes to `BART' (corked), CORE queue is 1 entries
[...]
Oct 06 17:03:01-675691 core-847 WARNING Client waited 4963 ms for permission to transmit to `BART' (priority 3)
Oct 06 17:03:01-676209 core-847 WARNING Client waited 4964 ms for transmission of 272 bytes to `BART' (corked), CORE queue is 1 entries
Oct 06 17:04:31-647891 core-847 WARNING Client waited 93 s for permission to transmit to `BART' (priority 1)
Oct 06 17:04:31-661017 core-847 WARNING Client waited 93 s for transmission of 2100 bytes to `BART' (corked), CORE queue is 0 entries
Oct 06 17:09:20-851005 core-847 WARNING Client waited 3606 ms for permission to transmit to `BART' (priority 3)
Oct 06 17:09:20-853940 core-847 WARNING Client waited 4594 ms for permission to transmit to `BART' (priority 1)
Oct 06 17:09:20-856972 core-847 WARNING Client waited 3612 ms for transmission of 272 bytes to `BART' (corked), CORE queue is 0 entries
Oct 06 17:09:20-858242 core-847 WARNING Client waited 4598 ms for transmission of 3644 bytes to `BART' (corked), CORE queue is 1 entries
Oct 06 17:11:01-482479 core-847 WARNING Client waited 7 s for permission to transmit to `BART' (priority 3)
Oct 06 17:11:01-483440 core-847 WARNING Client waited 7 s for permission to transmit to `BART' (priority 1)
Oct 06 17:11:01-483871 core-847 WARNING Client waited 7 s for transmission of 4218 bytes to `BART' (corked), CORE queue is 0 entries
Oct 06 17:11:01-485820 core-847 WARNING Client waited 7 s for transmission of 500 bytes to `BART' (corked), CORE queue is 1 entries
Oct 06 17:11:38-627376 core-847 WARNING Client waited 5 s for permission to transmit to `BART' (priority 2)
Oct 06 17:11:38-630576 core-847 WARNING Client waited 3978 ms for permission to transmit to `BART' (priority 1)
Oct 06 17:11:38-630675 core-847 WARNING Client waited 5 s for transmission of 272 bytes to `BART' (corked), CORE queue is 0 entries
Oct 06 17:11:38-633386 core-847 WARNING Client waited 3981 ms for transmission of 4108 bytes to `BART' (corked), CORE queue is 1 entries
Oct 06 17:11:52-434109 core-847 WARNING Client waited 2948 ms for permission to transmit to `BART' (priority 3)

I am positive that both these peers have quotas set to unlimited in their config files, are not running with memory instrumentation and one of them even has [fs] disabled.

Christian Grothoff

2015-10-06 21:40

manager   ~0009704

When transport establishes a connection, it starts (!) with a quota of 1kb/s until the other peer can announce the real quota. This is necessary because we don't want to send a peer more than it can handle. Naturally, the peer tries to let us know about its quota ASAP, but this is where you may get (brief) 1kb/s quotas. This may also happen on transport-level reconnects.

Anyway, this isn't a good answer, but a first hint as to what might be going on.
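The provisional-quota effect described above can be modeled with simple arithmetic (assumed model and numbers; drain_time is a hypothetical helper, not a GNUnet function):

```python
# Sketch of the slow-start effect: transport enforces a provisional
# 1 kb/s quota until the remote peer announces its real quota, so
# traffic queued before the announcement sees large delays.
def drain_time(msg_bytes, announce_at, real_quota, initial=1024):
    """Seconds to send msg_bytes if the real quota arrives announce_at s in."""
    sent_early = initial * announce_at  # bytes admitted at the initial quota
    if msg_bytes <= sent_early:
        return msg_bytes / initial
    return announce_at + (msg_bytes - sent_early) / real_quota
```

Under this model, even a generous real quota cannot rescue bytes queued during the provisional window, which would explain multi-second delays on otherwise idle links after a (re)connect.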

Christian Grothoff

2015-10-06 21:43

manager   ~0009705

What is clear is that we should put similar logging also into TRANSPORT, to see where the high delay ultimately comes from. And clearly additional scaffolding to diagnose latency issues is warranted -- I'd not have imagined core waiting 90+s for transmission and still ever succeeding. So something is badly broken.

Christian Grothoff

2015-10-06 21:44

manager   ~0009706

Updated bug title, category and summary to reflect latest insights.

Bart Polot

2015-10-06 21:48

manager   ~0009707

Just FYI: I still see delays between two of my unlimited peers:

Oct 06 21:29:14-161458 core-847 WARNING Client waited 5 s for transmission of 3461 bytes to `BART' (corked), CORE queue is 1 entries

They have been running for at least 5 hours, so they should have had time to tell each other about their quotas...

`BART' is running on a gigabit datacenter connection, the other one is running at home on a 100/10 Mbps connection, and both are almost idle (i.e., no congestion going on).

Bart Polot

2015-10-06 22:02

manager   ~0009708

New records:

Oct 06 22:00:24-597194 cadet-p2p-838 WARNING STRN: core wait time 17 m (> 1 second) for 108 bytes
Oct 06 22:00:24-597352 cadet-p2p-838 WARNING EGEF: core wait time 17 m (> 1 second) for 100 bytes
Oct 06 22:00:24-597413 cadet-p2p-838 WARNING DSTJ: core wait time 16 m (> 1 second) for 100 bytes
Oct 06 22:00:24-597474 cadet-p2p-838 WARNING 60JY: core wait time 16 m (> 1 second) for 120 bytes
Oct 06 22:00:24-597547 cadet-p2p-838 WARNING BART: core wait time 15 m (> 1 second) for 100 bytes
Oct 06 22:00:24-597603 cadet-p2p-838 WARNING RNGR: core wait time 15 m (> 1 second) for 120 bytes
Oct 06 22:00:24-597661 cadet-p2p-838 WARNING VMDC: core wait time 13 m (> 1 second) for 120 bytes
Oct 06 22:00:24-597738 cadet-p2p-838 WARNING GN10: core wait time 13 m (> 1 second) for 120 bytes
Oct 06 22:00:24-597800 cadet-p2p-838 WARNING PH0B: core wait time 12 m (> 1 second) for 120 bytes
Oct 06 22:00:24-597860 cadet-p2p-838 WARNING T03N: core wait time 87 s (> 1 second) for 164 bytes

Christian Grothoff

2015-10-06 23:57

manager   ~0009709

Clearly we need to add logging to transport to see where the delay really comes from. But 15 m is insane, as CORE should disconnect before then. Do you get CORE 'delayed' messages for the same peer here? I assume you would not log 'core wait time' if you got a disconnect, right?

Bart Polot

2015-10-07 00:11

manager   ~0009710

The message is logged from the callback given to CORE_notify_transmit_ready, after checking for (buffer == NULL or size == 0), which IIUC would indicate a timeout or disconnection.

Moreover, the core_disconnect callback was not activated.

Christian Grothoff

2015-10-07 00:25

manager   ~0009711

If CORE doesn't *also* log, I wonder if this issue relates to 0003910, where somehow we manage to have a disagreement between core and the lib as to which peers are connected. What does 'gnunet-core -m' say about connected peers in the meantime (i.e. before you get to send a message after 15m)? (I wonder if we got disconnected, reconnected after 10-15m and then it seems to CADET like it was just waiting forever...).

That doesn't explain the multi-second latencies on transport, but maybe we have two issues interacting here.

Christian Grothoff

2015-10-08 19:12

manager   ~0009713

SVN 36492 fixes another related issue: an application that requests transmission of a high-priority (> background) message and then fails to send anything (returns 0) could cause CORE to block all other lower-priority (or even equal-priority) traffic until the application sends *another* message, as CORE believes it is still waiting for that message.

This was fixed by having core_api send a 0-byte message, which in CORE now triggers cleaning up the high-priority block, allowing lower-priority messages to go through.

36492 also fixes some messages being dropped due to 'timeout' (so the API behaves more as documented).
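A minimal model of the bug and fix described above (my paraphrase in Python; the real logic lives in CORE's C code, and the class and method names here are invented):

```python
# Toy model of CORE's priority gate: an unfulfilled high-priority request
# holds back lower/equal-priority traffic; the fix lets a 0-byte
# transmission clear the stale reservation instead of blocking forever.
class CoreQueue:
    def __init__(self):
        self.blocking = None  # priority reserved by an unfulfilled request

    def request(self, priority):
        """Application asks for a transmit slot at this priority."""
        if self.blocking is None or priority > self.blocking:
            self.blocking = priority

    def may_transmit(self, priority):
        """Only strictly higher-priority traffic passes while a request is pending."""
        return self.blocking is None or priority > self.blocking

    def transmit(self, nbytes):
        """The fix: even a 0-byte transmission clears the reservation."""
        self.blocking = None
        return nbytes
```

Before the fix, a `transmit(0)` left the reservation in place, which is exactly the "block until the application sends another message" behavior the note describes.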

Bart Polot

2015-10-08 19:57

manager   ~0009714

I'm cautiously optimistic that it could be solved. I'm still seeing delays of up to a minute, but only to two peers (T03N, BBCW), caused by transport_api announcing a bandwidth of 1k/s.

I will run the regex and cadet profilers later and see if the delays are gone there.

Bart Polot

2015-10-09 00:53

manager   ~0009716

I haven't gotten around to running the profilers yet; however, I still see some delays, much milder than before:

Oct 09 00:37:52-655593 transport-api-16003 WARNING Added 840 bytes of payload message for BART after 1032 ms delay at 672500315 b/s
Oct 09 00:37:52-660048 transport-api-16003 WARNING Added 624 bytes of payload message for RNGR after 1020 ms delay at 1130554465 b/s
Oct 09 00:37:52-660161 core-16003 WARNING Client waited 1013 ms for permission to transmit to `RNGR' (priority 1)
Oct 09 00:37:52-660335 core-api-15997 WARNING Transmitting overdue 1954 bytes to `RNGR' at priority 1 with 1013 ms delay
Oct 09 00:37:52-660505 core-16003 WARNING Client waited 1013 ms for transmission of 1954 bytes to `RNGR' (corked), CORE queue is 1 entries
Oct 09 00:37:52-661296 core-16003 WARNING Client waited 1016 ms for permission to transmit to `BART' (priority 1)
Oct 09 00:37:52-661889 core-api-15997 WARNING Transmitting overdue 1706 bytes to `BART' at priority 1 with 1019 ms delay
Oct 09 00:37:52-663928 core-16003 WARNING Client waited 1019 ms for transmission of 1706 bytes to `BART' (corked), CORE queue is 2 entries

Peer RNGR is connected via gigabit ethernet, all quotas off.

Bart Polot

2015-10-09 01:18

manager   ~0009717

Last edited: 2015-10-09 01:19

Another bad delay:

Oct 09 01:15:35-239148 core-api-15996 WARNING Transmitting overdue 360 bytes to `RNGR' at priority 2 with 13 m delay
Oct 09 01:15:35-239686 core-api-15997 WARNING Transmitting overdue 0 bytes to `BART' at priority 1 with 9 m delay
[...]
Oct 09 01:15:36-439496 transport-api-16003 WARNING Added 160 bytes of payload message for BART after 1201 ms delay at 683651645 b/s

:/

Christian Grothoff

2015-10-09 06:43

manager   ~0009718

If it says "corked", it means some application set a flag that allows CORE to delay transmission by up to 1 s. In that case, some 'delay' is to be expected, so please also check which options may be leading to the warnings...
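Under the semantics described here, the benign upper bound on a corked message's wait can be sketched as follows (assumed constant and function name, not GNUnet code):

```python
# Expected worst-case benign delay: whatever the quota imposes, plus the
# up-to-1-second hold that a corked message explicitly permits.
CORK_DELAY = 1.0  # seconds CORE may hold a corked message, per the note above

def expected_max_wait(corked: bool, quota_wait: float) -> float:
    """Upper bound on the delay that is by design, not a bug."""
    return quota_wait + (CORK_DELAY if corked else 0.0)
```

So warnings about corked traffic waiting just over a second may be noise, while multi-minute waits remain unexplained by corking alone.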

Christian Grothoff

2015-10-17 22:29

manager   ~0009738

Found a UDP-specific issue where UDP could (worst-case) delay packets 60x longer than desired: instead of, say, 1 s it might take 60 s, because UDP decided to throttle more than necessary due to a failure to consider fragmentation. That ought to be fixed in SVN 36521.
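Assuming the throttle delay was effectively applied per fragment rather than per message (my reading of the note, not the actual plugin code), the 60x factor falls out directly:

```python
# Illustrative arithmetic only: a throttle delay meant for one message,
# multiplied by the fragment count when fragmentation is not considered.
def buggy_total_delay(msg_bytes, mtu, per_msg_delay):
    """Worst case before the fix: delay scales with fragment count."""
    fragments = -(-msg_bytes // mtu)  # ceiling division
    return fragments * per_msg_delay

def fixed_total_delay(msg_bytes, mtu, per_msg_delay):
    """After SVN 36521: one delay for the whole fragmented message."""
    return per_msg_delay
```

A message fragmented into 60 pieces then waits 60 s instead of 1 s, matching the worst case described above.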

Christian Grothoff

2015-10-26 16:47

manager   ~0009794

Bart, is the situation better now? I still see warnings, but those seem largely explicable by UDP plus low bandwidth simply taking a few seconds sometimes.

Christian Grothoff

2015-10-29 14:33

manager   ~0009854

Short of additional feedback, I think I can declare victory on this.

Issue History

Date Modified Username Field Change
2015-06-26 04:38 Bart Polot New Issue
2015-06-26 04:38 Bart Polot Status new => assigned
2015-06-26 04:38 Bart Polot Assigned To => Christian Grothoff
2015-06-26 13:20 Bart Polot Note Added: 0009342
2015-06-26 13:52 Bart Polot Note Added: 0009343
2015-06-26 13:56 Bart Polot Note Edited: 0009343
2015-06-28 15:37 Christian Grothoff Note Added: 0009360
2015-10-05 18:37 Christian Grothoff Note Added: 0009697
2015-10-05 18:37 Christian Grothoff Note Added: 0009698
2015-10-05 18:37 Christian Grothoff Assigned To Christian Grothoff => Bart Polot
2015-10-05 18:37 Christian Grothoff Status assigned => feedback
2015-10-06 17:13 Bart Polot Note Added: 0009701
2015-10-06 17:13 Bart Polot Status feedback => assigned
2015-10-06 17:20 Bart Polot Note Added: 0009702
2015-10-06 21:40 Christian Grothoff Note Added: 0009704
2015-10-06 21:43 Christian Grothoff Note Added: 0009705
2015-10-06 21:43 Christian Grothoff Assigned To Bart Polot =>
2015-10-06 21:43 Christian Grothoff Status assigned => confirmed
2015-10-06 21:43 Christian Grothoff Category core service => transport service
2015-10-06 21:43 Christian Grothoff Target Version => 0.11.0pre66
2015-10-06 21:44 Christian Grothoff Note Added: 0009706
2015-10-06 21:44 Christian Grothoff Summary CORE takes unreasonably large amounts of time to call back. => TRANSPORT (and indirectly CORE) takes unreasonably large amounts of time to call back.
2015-10-06 21:44 Christian Grothoff Description Updated
2015-10-06 21:48 Bart Polot Note Added: 0009707
2015-10-06 22:02 Bart Polot Note Added: 0009708
2015-10-06 23:57 Christian Grothoff Note Added: 0009709
2015-10-07 00:11 Bart Polot Note Added: 0009710
2015-10-07 00:25 Christian Grothoff Note Added: 0009711
2015-10-08 19:12 Christian Grothoff Note Added: 0009713
2015-10-08 19:57 Bart Polot Note Added: 0009714
2015-10-09 00:53 Bart Polot Note Added: 0009716
2015-10-09 01:18 Bart Polot Note Added: 0009717
2015-10-09 01:19 Bart Polot Note Edited: 0009717
2015-10-09 06:43 Christian Grothoff Note Added: 0009718
2015-10-17 03:57 Christian Grothoff Relationship added parent of 0004009
2015-10-17 22:29 Christian Grothoff Note Added: 0009738
2015-10-22 22:06 Bart Polot Relationship added related to 0004019
2015-10-26 16:47 Christian Grothoff Note Added: 0009794
2015-10-29 14:33 Christian Grothoff Assigned To => Christian Grothoff
2015-10-29 14:33 Christian Grothoff Status confirmed => assigned
2015-10-29 14:33 Christian Grothoff Note Added: 0009854
2015-10-29 14:33 Christian Grothoff Status assigned => resolved
2015-10-29 14:33 Christian Grothoff Fixed in Version => 0.11.0pre66
2015-10-29 14:33 Christian Grothoff Resolution open => fixed
2018-06-07 00:24 Christian Grothoff Status resolved => closed