View Issue Details

ID: 0001973
Project: GNUnet
Category: cadet service
View Status: public
Last Update: 2012-02-28 11:05
Reporter: Bart Polot
Assigned To: Bart Polot
Priority: urgent
Severity: crash
Reproducibility: have not tried
Status: closed
Resolution: unable to reproduce
Product Version: 0.9.0
Target Version: 0.9.2
Fixed in Version: 0.9.2
Summary: 0001973: Crash on W32

Description

From IRC log, Dec 1st 2011:

[05:21:14] <LRN> there's a crash, with a GNUNET_CONNECTION_TransmitHandle pointer pointing to a region of memory filled with 0x00s and 0xffs. Again, it's difficult to make sense of it, because GNUNET_CONNECTION_TransmitHandle is a pointer to a data member of GNUNET_CONNECTION_Handle. No way to go back to the connection, no way to track when exactly it was freed.
[05:21:34] <LRN> (crash manifests in test_mesh_local_1, if you run it a few dozen times in a loop)
[06:02:46] <LRN> yeah, GNUNET_CLIENT_notify_transmit_ready_cancel is called twice somehow
[06:12:33] <LRN> interesting. Mesh triggers ******* RECONNECT *******
[06:12:41] <LRN> and that is when everything goes bonkers
[06:16:56] <LRN> i think that gnunet-service-arm closes the socket that the testcase THINKS is connected to the mesh service
[06:17:00] <LRN> not sure why
[07:50:39] <LRN> i think that this is what happens:
[07:50:59] <LRN> test initiates 2 connections to the mesh service (pretending to be two peers?)
[07:51:11] <LRN> arm starts, then begins to listen on the mesh service port
[07:52:03] <LRN> one of the connections succeeds (actually, i think both succeed), arm accepts it, then closes the listening socket without accepting the other connection
[07:52:59] <LRN> test gets "success" on first connection, and it works, and also on the second connection, and it does NOT work, gets "Connection reset by peer"
[07:53:27] <LRN> mesh API initiates reconnect, and the testcase is incapable of handling it correctly
[07:53:37] <LRN> which is why it times out after ~20 seconds
[07:54:12] <LRN> After that it begins to clean up, and cancels transmit twice for the same connection
[08:15:30] <LRN> this is rare. Usually only one connection ever succeeds, the other just waits there until mesh service actually runs
[08:17:13] <LRN> lemme check the size of backlog
[08:20:47] <LRN> aha! backlog size is 5
[09:01:18] <nil> LRN: in the error handlers
[10:20:33] <LRN> \o/ at VEH
[10:21:26] <grothoff-office> I hope you're not disappointed that I didn't put this in 0.9.0 -- I didn't think that was the right kind of change just before the release.
[10:21:45] <grothoff-office> (not that it should break anything, but one never knows...)
[10:21:45] <LRN> nah, it's only useful for debugging anyway
[10:21:53] <grothoff-office> Exactly.
[10:23:18] <LRN> what do you think about mesh & arm problems i've had?
[10:23:48] <LRN> (by the way, an update: changing backlog to 1 didn't fix anything, although i'm making a full re-build to re-check that)
[10:25:32] <grothoff-office> I think that's likely W32-specific, maybe because of the ARM interceptor. In any case, it does sound like a bug in the mesh API that Bart & I should look into.
[10:25:44] <grothoff-office> You should file those things to mantis though ;-)
Tags: No tags attached.

Activities

LRN

2011-12-02 03:01

reporter   ~0005005

This issue depends on 0001975. At least, 0001975 will fix the underlying arm-related problem. Whether to fix the test to handle reconnects correctly or not, and how to do that, is up to the assignee.

LRN

2011-12-02 03:03

reporter   ~0005006

You could try to reproduce this condition by rigging the service and the testcase to run the mesh service (for real), making two connections, then killing one of them to initiate a reconnect.

Bart Polot

2012-02-24 15:16

reporter   ~0005500

- Unable to reproduce on Linux
- 0001975, which was related, is fixed
- All calls to GNUNET_CLIENT_notify_transmit_ready_cancel in mesh_api are guarded by NULL checks/sets.
- Reconnect code recently reworked.

Issue History

Date Modified Username Field Change
2011-12-01 17:00 Bart Polot New Issue
2011-12-01 17:00 Bart Polot Status new => assigned
2011-12-01 17:00 Bart Polot Assigned To => Bart Polot
2011-12-01 17:03 Bart Polot Severity minor => crash
2011-12-02 03:01 LRN Note Added: 0005005
2011-12-02 03:03 LRN Note Added: 0005006
2011-12-19 14:25 Christian Grothoff Target Version => 0.9.1
2011-12-19 14:26 Christian Grothoff Priority normal => urgent
2011-12-23 11:00 Christian Grothoff Target Version 0.9.1 => 0.9.2
2012-02-23 14:32 Christian Grothoff Target Version 0.9.2 => 0.9.3
2012-02-24 15:16 Bart Polot Note Added: 0005500
2012-02-24 15:16 Bart Polot Status assigned => resolved
2012-02-24 15:16 Bart Polot Fixed in Version => Git master
2012-02-24 15:16 Bart Polot Resolution open => unable to reproduce
2012-02-24 20:40 Christian Grothoff Fixed in Version Git master => 0.9.2
2012-02-24 20:40 Christian Grothoff Target Version 0.9.3 => 0.9.2
2012-02-28 11:05 Christian Grothoff Status resolved => closed
2014-05-09 18:34 Christian Grothoff Category mesh service => cadet service