View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0001973 | GNUnet | cadet service | public | 2011-12-01 17:00 | 2012-02-28 11:05 |
| Reporter | Bart Polot | Assigned To | Bart Polot | ||
| Priority | urgent | Severity | crash | Reproducibility | have not tried |
| Status | closed | Resolution | unable to reproduce | ||
| Product Version | 0.9.0 | ||||
| Target Version | 0.9.2 | Fixed in Version | 0.9.2 | ||
| Summary | 0001973: Crash on W32 | ||||
| Description | From IRC log, Dec 1st 2011 (join/quit noise trimmed) | ||||

> [05:21:14] <LRN> There's a crash, with the GNUNET_CONNECTION_TransmitHandle pointer pointing to a region of memory filled with 0x00s and 0xffs. It's difficult to make sense of, because GNUNET_CONNECTION_TransmitHandle is a pointer to a data member of GNUNET_CONNECTION_Handle: no way to go back to the connection, no way to track when exactly it was freed.
> [05:21:34] <LRN> (The crash manifests in test_mesh_local_1 if you run it a few dozen times in a loop.)
> [06:02:46] <LRN> Yeah, GNUNET_CLIENT_notify_transmit_ready_cancel is called twice somehow.
> [06:12:33] <LRN> Interesting. Mesh triggers ******* RECONNECT *******
> [06:12:41] <LRN> and that is when everything goes bonkers.
> [06:16:56] <LRN> I think that gnunet-service-arm closes the socket that the testcase THINKS connects to the mesh service.
> [06:17:00] <LRN> Not sure why.
> [07:50:39] <LRN> I think that this is what happens:
> [07:50:59] <LRN> The test initiates 2 connections to the mesh service (pretending to be two peers?).
> [07:51:11] <LRN> arm starts, then begins to listen on the mesh service port.
> [07:52:03] <LRN> One of the connections succeeds (actually, I think both succeed); arm accepts it, then closes the listening socket without accepting the other connection.
> [07:52:59] <LRN> The test gets "success" on the first connection, and it works; it also gets "success" on the second connection, which does NOT work and gets "Connection reset by peer".
> [07:53:27] <LRN> The mesh API initiates a reconnect, and the testcase is incapable of handling it correctly,
> [07:53:37] <LRN> which is why it times out after ~20 seconds.
> [07:54:12] <LRN> After that it begins to clean up, and cancels transmit twice for the same connection.
> [08:15:30] <LRN> This is rare. Usually only one connection ever succeeds; the other just waits there until the mesh service actually runs.
> [08:17:13] <LRN> Lemme check the size of the backlog.
> [08:20:47] <LRN> Aha! Backlog size is 5.
> [09:01:18] <nil> LRN: in the error handlers
> [10:20:33] <LRN> \o/ at VEH
> [10:21:26] <grothoff-office> I hope you're not disappointed that I didn't put this in 0.9.0 -- I didn't think that was the right kind of change just before the release.
> [10:21:45] <grothoff-office> (Not that it should break anything, but one never knows...)
> [10:21:45] <LRN> Nah, it's only useful for debugging anyway.
> [10:21:53] <grothoff-office> Exactly.
> [10:23:18] <LRN> What do you think about the mesh & arm problems I've had?
> [10:23:48] <LRN> (By the way, an update: changing the backlog to 1 didn't fix anything, although I'm making a full re-build to re-check that.)
> [10:25:32] <grothoff-office> I think that's likely W32-specific, maybe because of the ARM interceptor. In any case, it does sound like a bug in the mesh API that Bart & I should look into.
> [10:25:44] <grothoff-office> You should file those things to mantis though ;-) | ||||
| Tags | No tags attached. | ||||
This issue depends on 0001975; at least, 0001975 will fix the underlying arm-related problem. Whether to fix the test to handle reconnects correctly, and how to do that, is up to the assignee.
You could try to reproduce this condition by rigging the service and the testcase to actually run the mesh service, making two connections, and then killing one of them to initiate a reconnect.
Resolution notes:

- Unable to reproduce on Linux.
- 0001975, which was related, is fixed.
- All calls to GNUNET_CLIENT_notify_transmit_ready_cancel in mesh_api are guarded by NULL checks/sets.
- The reconnect code was recently reworked.
| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2011-12-01 17:00 | Bart Polot | New Issue | |
| 2011-12-01 17:00 | Bart Polot | Status | new => assigned |
| 2011-12-01 17:00 | Bart Polot | Assigned To | => Bart Polot |
| 2011-12-01 17:03 | Bart Polot | Severity | minor => crash |
| 2011-12-02 03:01 | LRN | Note Added: 0005005 | |
| 2011-12-02 03:03 | LRN | Note Added: 0005006 | |
| 2011-12-19 14:25 | Christian Grothoff | Target Version | => 0.9.1 |
| 2011-12-19 14:26 | Christian Grothoff | Priority | normal => urgent |
| 2011-12-23 11:00 | Christian Grothoff | Target Version | 0.9.1 => 0.9.2 |
| 2012-02-23 14:32 | Christian Grothoff | Target Version | 0.9.2 => 0.9.3 |
| 2012-02-24 15:16 | Bart Polot | Note Added: 0005500 | |
| 2012-02-24 15:16 | Bart Polot | Status | assigned => resolved |
| 2012-02-24 15:16 | Bart Polot | Fixed in Version | => Git master |
| 2012-02-24 15:16 | Bart Polot | Resolution | open => unable to reproduce |
| 2012-02-24 20:40 | Christian Grothoff | Fixed in Version | Git master => 0.9.2 |
| 2012-02-24 20:40 | Christian Grothoff | Target Version | 0.9.3 => 0.9.2 |
| 2012-02-28 11:05 | Christian Grothoff | Status | resolved => closed |
| 2014-05-09 18:34 | Christian Grothoff | Category | mesh service => cadet service |