View Issue Details

IDProjectCategoryView StatusLast Update
0003319GNUnetcadet servicepublic2014-04-08 16:41
ReporterFlorian Dold Assigned ToBart Polot  
PriorityhighSeveritymajorReproducibilitysometimes
Status closedResolutionunable to reproduce 
Product VersionGit master 
Target Version0.10.1 
Summary0003319: mesh gets randomly stuck for 10 minutes
DescriptionThe consensus service uses mesh heavily. Sometimes a message (even a small one that is only the header with no body) will get stuck, and mesh takes about 10 minutes to deliver it.
Steps To ReproduceRun consensus with a huge timeout (so that rounds won't be aborted by mesh) and a large number of peers (e.g. 15)

./gnunet-consensus-profiler -c test_consensus.conf -t 10h -x 10 -n15
Additional InformationI sometimes get this assertion:
mesh-9066 WARNING External protocol violation detected at gnunet-service-mesh_connection.c:1749.

Here's the log with the 10 minute gap (which is probably not very useful).

[...]
Feb 10 21:01:08-018067 consensus-9034 DEBUG P5: set result from P4: done
Feb 10 21:01:08-018093 consensus-9034 DEBUG P5: all reconciliations of subround done
Feb 10 21:01:08-018119 consensus-9034 DEBUG subround timeout: 1125000 ms
Feb 10 21:01:08-018541 consensus-9034 DEBUG P5: doing exp-round, r=3, sub=2, in: 9, out: -1
---- notice the large gap in time -----
Feb 10 21:10:49-451960 set-9052 DEBUG dispatching mesh message (type: 586)
Feb 10 21:10:49-452040 set-9052 DEBUG received p2p message (t: 586, s: 4)
Feb 10 21:10:49-452067 set-9052 DEBUG got DONE, sending final DONE after elements
[...]
TagsNo tags attached.
Attached Files
log.tar.bz2 (1,482,001 bytes)

Activities

Florian Dold

2014-02-11 09:51

developer   ~0008086

Issue seems very similar to 0002967, which was fixed a while ago.

Bart Polot

2014-02-11 14:57

reporter   ~0008093

Could you try to run it with MESH DEBUG log messages and send me the log file?

I a munable to reproduce it:
$ time ./gnunet-consensus-profiler -c test_consensus.conf -t 10h -x 20 -n30 |& tee log
[...] 58 MB of log messages
Feb 11 14:56:27-463776 mesh-26172 DEBUG shutting down
Feb 11 14:56:27-463842 mesh-tun-26172 DEBUG destroying tunnel G3FJ
Feb 11 14:56:27-463859 mesh-tun-26172 DEBUG destroying tunnel 158Q
Feb 11 14:56:27-463870 mesh-tun-26172 DEBUG destroying tunnel 2IU4
Feb 11 14:56:27-463881 mesh-tun-26172 DEBUG destroying tunnel K83O
Feb 11 14:56:27-463892 mesh-tun-26172 DEBUG destroying tunnel 4PQD
Feb 11 14:56:27-463902 mesh-tun-26172 DEBUG destroying tunnel 54PM
Feb 11 14:56:27-463912 mesh-tun-26172 DEBUG destroying tunnel 5H2I
Feb 11 14:56:27-463922 mesh-tun-26172 DEBUG destroying tunnel MFO9
Feb 11 14:56:27-463933 mesh-tun-26172 DEBUG destroying tunnel 7TVP
Feb 11 14:56:27-463943 mesh-tun-26172 DEBUG destroying tunnel 800D
Feb 11 14:56:27-463953 mesh-tun-26172 DEBUG destroying tunnel QU5I
Feb 11 14:56:27-463963 mesh-tun-26172 DEBUG destroying tunnel B90S
Feb 11 14:56:27-463973 mesh-tun-26172 DEBUG destroying tunnel CB87
Feb 11 14:56:27-463984 mesh-tun-26172 DEBUG destroying tunnel D9VC
Feb 11 14:56:27-463994 mesh-tun-26172 DEBUG destroying tunnel V1SK
Feb 11 14:56:27-464043 mesh-26172 DEBUG shut down
Feb 11 14:56:27-464065 mesh-loc-26172 DEBUG client disconnected: (nil)
Feb 11 14:56:27-464075 mesh-loc-26172 DEBUG (SERVER DOWN)
Feb 11 14:56:27-464083 mesh-loc-26172 DEBUG client connected: (nil)
P0 got 20 of 20 elements
P1 got 20 of 20 elements
P2 got 20 of 20 elements
P3 got 20 of 20 elements
P4 got 20 of 20 elements
P5 got 20 of 20 elements
P6 got 20 of 20 elements
P7 got 20 of 20 elements
P8 got 20 of 20 elements
P9 got 20 of 20 elements
P10 got 20 of 20 elements
P11 got 20 of 20 elements
P12 got 20 of 20 elements
P13 got 20 of 20 elements
P14 got 20 of 20 elements
P15 got 20 of 20 elements
P16 got 20 of 20 elements
P17 got 20 of 20 elements
P18 got 20 of 20 elements
P19 got 20 of 20 elements
P20 got 20 of 20 elements
P21 got 20 of 20 elements
P22 got 20 of 20 elements
P23 got 20 of 20 elements
P24 got 20 of 20 elements
P25 got 20 of 20 elements
P26 got 20 of 20 elements
P27 got 20 of 20 elements
P28 got 20 of 20 elements
P29 got 20 of 20 elements

real 0m31.090s
user 1m56.560s
sys 0m23.130s

Bart Polot

2014-02-11 15:00

reporter   ~0008095

BTW, no errors or warnings:
[bart@saturn ~/g/src/consensus] (master *% u=)$ grep WARNING log
[bart@saturn ~/g/src/consensus] (master *% u=)$ grep ERROR log
[bart@saturn ~/g/src/consensus] (master *% u=)$

Florian Dold

2014-02-17 15:46

developer   ~0008118

This is kind of hard to reproduce, I almost thought it does not happen anymore, but it still does.

I have uploaded a log file, notice the gap between set logs, with
only stuff from mesh inbetween:

Feb 17 15:25:19-646562 set-api-6711 DEBUG processed op request from service
[...]
Feb 17 15:35:03-938307 set-6709 DEBUG dispatching mesh message (type: 585)

The log was produced with
dold@dold-x1c:~/repos/gnunet/src/consensus$ GNUNET_FORCE_LOG="consensus;;;;DEBUG/mesh;;;;DEBUG/set;;;;DEBUG" ./gnunet-consensus-profiler -c test_consensus.conf -x 20 -n 15 -t 50h |& tee log

Bart Polot

2014-02-17 18:10

reporter   ~0008120

Last messages before break are:

Feb 17 15:25:19-646475 set-api-6711 DEBUG processing operation request
Feb 17 15:25:19-646516 consensus-6711 DEBUG P2 got set request from P13
Feb 17 15:25:19-646540 consensus-6711 DEBUG P2 delaying set request from P13
Feb 17 15:25:19-646562 set-api-6711 DEBUG processed op request from service

"P2 delaying set request from P13"
- How big is this delay?
- Who is P2 and P13?


Other than that I cannot see any mesh delays. This is waht happens at 15:25:19.
+508ms 2I46 gets a message from QU5I with a channel creation request (~TCP SYN)
+567ms QU5I gets a message from 2I46 confirming a channel creation (~TCP SYNACK)
+611ms 2I46 gets a message from QU5I with a channel confirmation (~TCP ACK)
+614ms 2I46 gets a message from QU5I with data type 581, activating set and consensus callbacks
+684ms QU5I gets a message from 2I46 with an ACK for the data message

No other traffic between those peers is requested, so I don't seen how this should be mesh's fault.

Would it be possible to improve set's and consensus' debug messages with info about PEER-ID's?

Florian Dold

2014-02-17 18:26

developer   ~0008121

Delay here only means that the peer is in a lower round, and only will accept the request once other currently active requests have finished. There is no delay in the sense of a task scheduled with delay.

I am mainly wondering where the message after 10 minutes comes from. Is it possible to track down from the logs when the message was sent?

Bart Polot

2014-04-07 13:01

reporter   ~0008198

Looks like a core bandwidth issue. During the gap, mesh is trying to restransmit the second message of the channel, but the core transmit ready callback is never run.

Was the bandwidth unlocked in the ATS section of test_consensus.conf file?

Florian Dold

2014-04-07 17:09

developer   ~0008202

By unlocked bandwidth, do you mean the *_QUOTA_* options in ATS?

Bart Polot

2014-04-07 17:10

reporter   ~0008203

Yep

Florian Dold

2014-04-07 17:15

developer   ~0008204

I did not modify these options. It seems that in the default config, LOOPBACK_QUOTA_* is unlimited, which may be why I could not reproduce the bug when using the unix transport.

Now the question is, does LOOPBACK_QUOTA_* or LAN_QUOTA_* apply when using the tcp or udp transport?

Bart Polot

2014-04-07 20:38

reporter   ~0008209

Seems to be an ATS bandwidth limiting issue.

Issue History

Date Modified Username Field Change
2014-02-10 21:51 Florian Dold New Issue
2014-02-10 21:51 Florian Dold Status new => assigned
2014-02-10 21:51 Florian Dold Assigned To => Bart Polot
2014-02-11 09:51 Florian Dold Note Added: 0008086
2014-02-11 14:57 Bart Polot Note Added: 0008093
2014-02-11 14:57 Bart Polot Status assigned => feedback
2014-02-11 15:00 Bart Polot Note Added: 0008095
2014-02-17 15:45 Florian Dold File Added: log.tar.bz2
2014-02-17 15:46 Florian Dold Note Added: 0008118
2014-02-17 15:46 Florian Dold Status feedback => assigned
2014-02-17 18:10 Bart Polot Note Added: 0008120
2014-02-17 18:10 Bart Polot Status assigned => feedback
2014-02-17 18:26 Florian Dold Note Added: 0008121
2014-02-17 18:26 Florian Dold Status feedback => assigned
2014-03-11 23:43 Christian Grothoff Target Version => 0.10.1
2014-04-07 13:01 Bart Polot Note Added: 0008198
2014-04-07 13:01 Bart Polot Status assigned => feedback
2014-04-07 17:09 Florian Dold Note Added: 0008202
2014-04-07 17:09 Florian Dold Status feedback => assigned
2014-04-07 17:10 Bart Polot Note Added: 0008203
2014-04-07 17:15 Florian Dold Note Added: 0008204
2014-04-07 20:38 Bart Polot Note Added: 0008209
2014-04-07 20:38 Bart Polot Status assigned => resolved
2014-04-07 20:38 Bart Polot Resolution open => unable to reproduce
2014-04-08 16:41 Christian Grothoff Status resolved => closed
2014-05-09 18:34 Christian Grothoff Category mesh service => cadet service