View Issue Details

ID: 0001973
Project: GNUnet
Category: cadet service
View Status: public
Last Update: 2012-02-28 11:05
Reporter: Bart Polot
Assigned To: Bart Polot
Priority: urgent
Severity: crash
Reproducibility: have not tried
Status: closed
Resolution: unable to reproduce
Product Version: 0.9.0
Target Version: 0.9.2
Fixed in Version: 0.9.2
Summary: 0001973: Crash on W32

Description

From IRC log, Dec 1st 2011:

[05:21:14] <LRN> there's a crash, with a GNUNET_CONNECTION_TransmitHandle pointer pointing to a region of memory filled with 0x00s and 0xffs. Again, it's difficult to make sense of it, because GNUNET_CONNECTION_TransmitHandle is a pointer to a data member of GNUNET_CONNECTION_Handle. No way to go back to the connection, no way to track when exactly it was freed.
[05:21:34] <LRN> (crash manifests in test_mesh_local_1, if you run it a few dozen times in a loop)
[06:02:46] <LRN> yeah, GNUNET_CLIENT_notify_transmit_ready_cancel is called twice somehow
[06:12:33] <LRN> interesting. Mesh triggers ******* RECONNECT *******
[06:12:41] <LRN> and that is when everything goes bonkers
[06:16:56] <LRN> i think that gnunet-service-arm closes the socket that the testcase THINKS is connected to the mesh service
[06:17:00] <LRN> not sure why
[07:50:39] <LRN> i think that this is what happens:
[07:50:59] <LRN> test initiates 2 connections to the mesh service (pretending to be two peers?)
[07:51:11] <LRN> arm starts, then begins to listen on the mesh service port
[07:52:03] <LRN> one of the connections succeeds (actually, i think both succeed), arm accepts it, then closes the listening socket without accepting the other connection
[07:52:59] <LRN> test gets "success" on first connection, and it works, and also on the second connection, and it does NOT work, gets "Connection reset by peer"
[07:53:27] <LRN> mesh API initiates reconnect, and the testcase is incapable of handling it correctly
[07:53:37] <LRN> which is why it times out after ~20 seconds
[07:54:12] <LRN> After that it begins to clean up, and cancels transmit twice for the same connection
[08:15:30] <LRN> this is rare. Usually only one connection ever succeeds, the other just waits there until mesh service actually runs
[08:17:13] <LRN> lemme check the size of backlog
[08:20:47] <LRN> aha! backlog size is 5
[09:01:18] <nil> LRN: in the error handlers
[10:20:33] <LRN> \o/ at VEH
[10:21:26] <grothoff-office> I hope you're not disappointed that I didn't put this in 0.9.0 -- I didn't think that was the right kind of change just before the release.
[10:21:45] <grothoff-office> (not that it should break anything, but one never knows...)
[10:21:45] <LRN> nah, it's only useful for debugging anyway
[10:21:53] <grothoff-office> Exactly.
[10:23:18] <LRN> what do you think about mesh & arm problems i've had?
[10:23:48] <LRN> (by the way, an update: changing backlog to 1 didn't fix anything, although i'm making a full re-build to re-check that)
[10:25:32] <grothoff-office> I think that's likely W32-specific, maybe because of the ARM interceptor. In any case, it does sound like a bug in the mesh API that Bart & I should look into.
[10:25:44] <grothoff-office> You should file those things to mantis though ;-)
Tags: No tags attached.

Activities

LRN

2011-12-02 03:01

reporter   ~0005005

This issue depends on 0001975. At least, 0001975 will fix the underlying arm-related problem. Whether to fix the test to handle reconnects correctly or not, and how to do that, is up to the assignee.

LRN

2011-12-02 03:03

reporter   ~0005006

You could try to reproduce this condition by rigging the service and the testcase to run the mesh service (for real), making two connections, then killing one of them to initiate a reconnect.

Bart Polot

2012-02-24 15:16

reporter   ~0005500

- Unable to reproduce on Linux
- 0001975, which was related, is fixed
- All calls to GNUNET_CLIENT_notify_transmit_ready_cancel in mesh_api are guarded by NULL checks/sets.
- Reconnect code recently reworked.

Issue History

Date Modified Username Field Change
2011-12-01 17:00 Bart Polot New Issue
2011-12-01 17:00 Bart Polot Status new => assigned
2011-12-01 17:00 Bart Polot Assigned To => Bart Polot
2011-12-01 17:03 Bart Polot Severity minor => crash
2011-12-02 03:01 LRN Note Added: 0005005
2011-12-02 03:03 LRN Note Added: 0005006
2011-12-19 14:25 Christian Grothoff Target Version => 0.9.1
2011-12-19 14:26 Christian Grothoff Priority normal => urgent
2011-12-23 11:00 Christian Grothoff Target Version 0.9.1 => 0.9.2
2012-02-23 14:32 Christian Grothoff Target Version 0.9.2 => 0.9.3
2012-02-24 15:16 Bart Polot Note Added: 0005500
2012-02-24 15:16 Bart Polot Status assigned => resolved
2012-02-24 15:16 Bart Polot Fixed in Version => Git master
2012-02-24 15:16 Bart Polot Resolution open => unable to reproduce
2012-02-24 20:40 Christian Grothoff Fixed in Version Git master => 0.9.2
2012-02-24 20:40 Christian Grothoff Target Version 0.9.3 => 0.9.2
2012-02-28 11:05 Christian Grothoff Status resolved => closed
2014-05-09 18:34 Christian Grothoff Category mesh service => cadet service