View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0003319 | GNUnet | cadet service | public | 2014-02-10 21:51 | 2014-04-08 16:41 |
| Reporter | Florian Dold | Assigned To | Bart Polot | ||
| Priority | high | Severity | major | Reproducibility | sometimes |
| Status | closed | Resolution | unable to reproduce | ||
| Product Version | Git master | ||||
| Target Version | 0.10.1 | ||||
| Summary | 0003319: mesh gets randomly stuck for 10 minutes | ||||
| Description | The consensus service uses mesh heavily. Sometimes a message (even a small one that is only the header with no body) will get stuck, and mesh takes about 10 minutes to deliver it. | ||||
| Steps To Reproduce | Run consensus with a huge timeout (so that rounds won't be aborted by mesh) and a large number of peers (e.g. 15) ./gnunet-consensus-profiler -c test_consensus.conf -t 10h -x 10 -n15 | ||||
| Additional Information | I sometimes get this assertion: mesh-9066 WARNING External protocol violation detected at gnunet-service-mesh_connection.c:1749. Here's the log with the 10 minute gap (which is probably not very useful). [...] Feb 10 21:01:08-018067 consensus-9034 DEBUG P5: set result from P4: done Feb 10 21:01:08-018093 consensus-9034 DEBUG P5: all reconciliations of subround done Feb 10 21:01:08-018119 consensus-9034 DEBUG subround timeout: 1125000 ms Feb 10 21:01:08-018541 consensus-9034 DEBUG P5: doing exp-round, r=3, sub=2, in: 9, out: -1 ---- notice the large gap in time ----- Feb 10 21:10:49-451960 set-9052 DEBUG dispatching mesh message (type: 586) Feb 10 21:10:49-452040 set-9052 DEBUG received p2p message (t: 586, s: 4) Feb 10 21:10:49-452067 set-9052 DEBUG got DONE, sending final DONE after elements [...] | ||||
| Tags | No tags attached. | ||||
| Attached Files | |||||
|
|
Issue seems very similar to 0002967, which was fixed a while ago. |
|
|
Could you try to run it with MESH DEBUG log messages and send me the log file? I a munable to reproduce it: $ time ./gnunet-consensus-profiler -c test_consensus.conf -t 10h -x 20 -n30 |& tee log [...] 58 MB of log messages Feb 11 14:56:27-463776 mesh-26172 DEBUG shutting down Feb 11 14:56:27-463842 mesh-tun-26172 DEBUG destroying tunnel G3FJ Feb 11 14:56:27-463859 mesh-tun-26172 DEBUG destroying tunnel 158Q Feb 11 14:56:27-463870 mesh-tun-26172 DEBUG destroying tunnel 2IU4 Feb 11 14:56:27-463881 mesh-tun-26172 DEBUG destroying tunnel K83O Feb 11 14:56:27-463892 mesh-tun-26172 DEBUG destroying tunnel 4PQD Feb 11 14:56:27-463902 mesh-tun-26172 DEBUG destroying tunnel 54PM Feb 11 14:56:27-463912 mesh-tun-26172 DEBUG destroying tunnel 5H2I Feb 11 14:56:27-463922 mesh-tun-26172 DEBUG destroying tunnel MFO9 Feb 11 14:56:27-463933 mesh-tun-26172 DEBUG destroying tunnel 7TVP Feb 11 14:56:27-463943 mesh-tun-26172 DEBUG destroying tunnel 800D Feb 11 14:56:27-463953 mesh-tun-26172 DEBUG destroying tunnel QU5I Feb 11 14:56:27-463963 mesh-tun-26172 DEBUG destroying tunnel B90S Feb 11 14:56:27-463973 mesh-tun-26172 DEBUG destroying tunnel CB87 Feb 11 14:56:27-463984 mesh-tun-26172 DEBUG destroying tunnel D9VC Feb 11 14:56:27-463994 mesh-tun-26172 DEBUG destroying tunnel V1SK Feb 11 14:56:27-464043 mesh-26172 DEBUG shut down Feb 11 14:56:27-464065 mesh-loc-26172 DEBUG client disconnected: (nil) Feb 11 14:56:27-464075 mesh-loc-26172 DEBUG (SERVER DOWN) Feb 11 14:56:27-464083 mesh-loc-26172 DEBUG client connected: (nil) P0 got 20 of 20 elements P1 got 20 of 20 elements P2 got 20 of 20 elements P3 got 20 of 20 elements P4 got 20 of 20 elements P5 got 20 of 20 elements P6 got 20 of 20 elements P7 got 20 of 20 elements P8 got 20 of 20 elements P9 got 20 of 20 elements P10 got 20 of 20 elements P11 got 20 of 20 elements P12 got 20 of 20 elements P13 got 20 of 20 elements P14 got 20 of 20 elements P15 got 20 of 20 elements P16 got 20 of 20 elements P17 got 20 of 20 elements P18 got 20 of 20 elements P19 got 20 of 20 elements P20 got 20 of 20 elements P21 got 20 of 20 elements P22 got 20 of 20 elements P23 got 20 of 20 elements P24 got 20 of 20 elements P25 got 20 of 20 elements P26 got 20 of 20 elements P27 got 20 of 20 elements P28 got 20 of 20 elements P29 got 20 of 20 elements real 0m31.090s user 1m56.560s sys 0m23.130s |
|
|
BTW, no errors or warnings: [bart@saturn ~/g/src/consensus] (master *% u=)$ grep WARNING log [bart@saturn ~/g/src/consensus] (master *% u=)$ grep ERROR log [bart@saturn ~/g/src/consensus] (master *% u=)$ |
|
|
This is kind of hard to reproduce, I almost thought it does not happen anymore, but it still does. I have uploaded a log file, notice the gap between set logs, with only stuff from mesh inbetween: Feb 17 15:25:19-646562 set-api-6711 DEBUG processed op request from service [...] Feb 17 15:35:03-938307 set-6709 DEBUG dispatching mesh message (type: 585) The log was produced with dold@dold-x1c:~/repos/gnunet/src/consensus$ GNUNET_FORCE_LOG="consensus;;;;DEBUG/mesh;;;;DEBUG/set;;;;DEBUG" ./gnunet-consensus-profiler -c test_consensus.conf -x 20 -n 15 -t 50h |& tee log |
|
|
Last messages before break are: Feb 17 15:25:19-646475 set-api-6711 DEBUG processing operation request Feb 17 15:25:19-646516 consensus-6711 DEBUG P2 got set request from P13 Feb 17 15:25:19-646540 consensus-6711 DEBUG P2 delaying set request from P13 Feb 17 15:25:19-646562 set-api-6711 DEBUG processed op request from service "P2 delaying set request from P13" - How big is this delay? - Who is P2 and P13? Other than that I cannot see any mesh delays. This is waht happens at 15:25:19. +508ms 2I46 gets a message from QU5I with a channel creation request (~TCP SYN) +567ms QU5I gets a message from 2I46 confirming a channel creation (~TCP SYNACK) +611ms 2I46 gets a message from QU5I with a channel confirmation (~TCP ACK) +614ms 2I46 gets a message from QU5I with data type 581, activating set and consensus callbacks +684ms QU5I gets a message from 2I46 with an ACK for the data message No other traffic between those peers is requested, so I don't seen how this should be mesh's fault. Would it be possible to improve set's and consensus' debug messages with info about PEER-ID's? |
|
|
Delay here only means that the peer is in a lower round, and only will accept the request once other currently active requests have finished. There is no delay in the sense of a task scheduled with delay. I am mainly wondering where the message after 10 minutes comes from. Is it possible to track down from the logs when the message was sent? |
|
|
Looks like a core bandwidth issue. During the gap, mesh is trying to restransmit the second message of the channel, but the core transmit ready callback is never run. Was the bandwidth unlocked in the ATS section of test_consensus.conf file? |
|
|
By unlocked bandwidth, do you mean the *_QUOTA_* options in ATS? |
|
|
Yep |
|
|
I did not modify these options. It seems that in the default config, LOOPBACK_QUOTA_* is unlimited, which may be why I could not reproduce the bug when using the unix transport. Now the question is, does LOOPBACK_QUOTA_* or LAN_QUOTA_* apply when using the tcp or udp transport? |
|
|
Seems to be an ATS bandwidth limiting issue. |
| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2014-02-10 21:51 | Florian Dold | New Issue | |
| 2014-02-10 21:51 | Florian Dold | Status | new => assigned |
| 2014-02-10 21:51 | Florian Dold | Assigned To | => Bart Polot |
| 2014-02-11 09:51 | Florian Dold | Note Added: 0008086 | |
| 2014-02-11 14:57 | Bart Polot | Note Added: 0008093 | |
| 2014-02-11 14:57 | Bart Polot | Status | assigned => feedback |
| 2014-02-11 15:00 | Bart Polot | Note Added: 0008095 | |
| 2014-02-17 15:45 | Florian Dold | File Added: log.tar.bz2 | |
| 2014-02-17 15:46 | Florian Dold | Note Added: 0008118 | |
| 2014-02-17 15:46 | Florian Dold | Status | feedback => assigned |
| 2014-02-17 18:10 | Bart Polot | Note Added: 0008120 | |
| 2014-02-17 18:10 | Bart Polot | Status | assigned => feedback |
| 2014-02-17 18:26 | Florian Dold | Note Added: 0008121 | |
| 2014-02-17 18:26 | Florian Dold | Status | feedback => assigned |
| 2014-03-11 23:43 | Christian Grothoff | Target Version | => 0.10.1 |
| 2014-04-07 13:01 | Bart Polot | Note Added: 0008198 | |
| 2014-04-07 13:01 | Bart Polot | Status | assigned => feedback |
| 2014-04-07 17:09 | Florian Dold | Note Added: 0008202 | |
| 2014-04-07 17:09 | Florian Dold | Status | feedback => assigned |
| 2014-04-07 17:10 | Bart Polot | Note Added: 0008203 | |
| 2014-04-07 17:15 | Florian Dold | Note Added: 0008204 | |
| 2014-04-07 20:38 | Bart Polot | Note Added: 0008209 | |
| 2014-04-07 20:38 | Bart Polot | Status | assigned => resolved |
| 2014-04-07 20:38 | Bart Polot | Resolution | open => unable to reproduce |
| 2014-04-08 16:41 | Christian Grothoff | Status | resolved => closed |
| 2014-05-09 18:34 | Christian Grothoff | Category | mesh service => cadet service |