0005597: Deadlock for non reliable channel in case of missing message

ID	Project	Category	View Status	Date Submitted	Last Update

0005597	GNUnet	cadet service	public	2019-02-22 09:22	2019-02-28 12:24

Reporter	t3sserakt	Assigned To	t3sserakt
Priority	normal	Severity	major	Reproducibility	random
Status	closed	Resolution	fixed
Product Version	Git master
Fixed in Version	0.11.0

Summary	0005597: Deadlock for non reliable channel in case of missing message
Description	In case of a non reliable channel - which is the case for a channel with default channel option GNUNET_CADET_OPTION_DEFAULT, I encountered a deadlock at the receiving node, if a message is not delivered to that receiving node. This error can be triggered if someone send messages (I send about 20 in under 10 seconds) to a node quite fast in a row. What happens can be understood if you look into method GCCH_handle_channel_plaintext_data in gnunet-service-cadet_channel.c. There is a check if the MID (message ID) of the received message equals the MID the channel is expecting. If the client is ready and the received MID equals the expected MID the message is send to the client. If not the message is queued. Before queuing the message the queue is checked if we reached the maximum number of messages in the queue. This value is hard code to 4. Apart from that this number should be configurable, I think the resulting behavior is odd. Because we are dropping the oldest queued message in favor of an even older message. Moreover, if this message never appears the channel is blocked. Proposal how to change that behavior: We will not drop the oldest message in the queue, but we send as much messages from the queue as we have messages with consecutive MIDs. After that the queue is empty, or we again wait for the message that is missing now. Means we have to set the expected MID to that MID, because we are now waiting for another message to arrive.
Steps To Reproduce	This error can be triggered if someone send messages (I send about 20 in under 10 seconds) to a node quite fast in a row.
Tags	No tags attached.

t3sserakt 2019-02-22 22:11 developer ~0013950	I like to have feedback on the proposed change of behavior.

Christian Grothoff 2019-02-24 21:49 manager ~0014021	4 was hard-coded to make exactly these types of errors easier to spot for now. Later it should indeed be either adaptive, configurable, or both. I agree it makes no sense to drop a messsage in favor of an even older message. For unreliable channels, I think the desired behavior should be to queue the most recent THRESHOLD (=4, for now) messages. Can you fix this very soon (as in, before 28.2?)

schanzen 2019-02-24 22:12 administrator ~0014022	I don't think I agree with this "We will not drop the oldest message in the queue, but we send as much messages from the queue as we have messages with consecutive MIDs. After that the queue is empty, or we again wait for the message that is missing now. Means we have to set the expected MID to that MID, because we are now waiting for another message to arrive." if it means delivering messages out of order just because we have a long sequence in the queue. If you want them out of order, you should use another option.

t3sserakt 2019-02-25 08:58 developer ~0014023	No, no messages will be delivered out of order. We just drop message the message we are waiting for, and not the message we already received.

schanzen 2019-02-25 13:09 administrator ~0014024	I think we all agree on at least changing the queue eviction behaviour. So you can go ahead and do that. If that improves you test already and also passes a make check just put it in. The change wrt bulk message sending to the client is a bigger change for which I see a few complications, in particular a change in how messages would be delivered to the client. So maybe do the first now and if that is sufficient for your tests delay other, more significant changes for after 0.11?

t3sserakt 2019-02-25 17:25 developer ~0014026	That does not help. The channel would still be blocked, because no messages would be send to the client. If no message is send to the client, no ACK is coming from the client, and no bulk transmission of the queued messages will be triggered. It would at least be necessary to send the message with the lowest MID in the queue to the client.

schanzen 2019-02-25 18:30 administrator ~0014029	I am sorry I fail to see the situation in which this can happen. Either (a) we receive the expected next message ID, in which case the "send to client"/"ack" tandem will automatically collect all messages in order from the queue until the next missing message (see lines 1309ff). OR (b) We received a message ID > expected message ID, in which case we add it sorted to the queue. There is nothing to send in this case. There is a (c) where we receive a message ID < expected message, but in that case we can drop. Now what I do see is that if in 1309 the client is not ready to receive any messages it might block if the client does not indicate the he is ready again (idk if that is happening). But that has nothing to do with the above. Am I missing something?

schanzen 2019-02-25 18:32 administrator ~0014030	Apparently in the case of client_ready == GNUNET_NO, the client will eventually trigger GCCH_handle_local_ack, in which case the queue is send to client.

t3sserakt 2019-02-26 07:51 developer ~0014038	Your a, b and c is perfectly fine, but ... ;-) The missing puzzle peace is that if the client is ready, what he always will be, because he will not get a message, because we are waiting for a message with a certain MID... We can not break that cycle. We are ending up with (b) over and over again. Maybe we should introduce a timeout for waiting for a certain message. But if we did not reached that timeout already, and if we hit the max_pending_messages we actually drop messages we already received. But I see now that in GCCH_handle_local_ack the queue is not send in bulk. Is that what you tried to tell me? :-) Then I would change my proposal to just send the next message in the queue like in GCCH_handle_local_ack, setting the MID the channel is expecting to ID of the message we just send to the client plus one. In GCCH_handle_local_ack this is not done if the channel is reliable, but we are talking about non reliable channels here. By the way, the logging "Got LOCAL_ACK, %s-%X ready to receive more data (but next one is out-of-order %u vs. %u)!\n" in GCCH_handle_local_ack is not correct. Should be "Got LOCAL_ACK, %s-%X ready to receive more data (but next one is not the expected one in order %u vs. %u)!\n" as in the comment after the return some lines later.

schanzen 2019-02-26 09:46 administrator ~0014042	"The missing puzzle peace is that if the client is ready, what he always will be, because he will not get a message, because we are waiting for a message with a certain MID... We can not break that cycle. We are ending up with (b) over and over again." If the client is ready, the message will either be sent if it is the expected MID or it will not if it isn't in line 1320. This will trigger the ack and the rest of the queue is sent. We cannot send any other message(es) because of the in-order requirement. If you are thinking of a case where the MIDs 1,2,4,5 are in the queue and we receive MID 3 then I think this is constructed as if you think about it this can never happen.

schanzen 2019-02-26 10:03 administrator ~0014044	Can you maybe locally fix the queue eviction for replace the message with the highest MID instead of the lowest and run your tests again? See if it solves it?

Christian Grothoff 2019-02-26 16:06 manager ~0014065	schanzen: I'm confused. I would have thought we should drop the _oldest_ message in the queue and only keep the "4" most recent ones. Why do you propose to drop the highest MID? That seems wrong on many levels.

schanzen 2019-02-26 16:16 administrator ~0014069 Last edited: 2019-02-26 16:17	Well because if we get 3 -> 4 -> 5 -> 6 and then 2 -> 1 we will have the following queue: 3,4,5,6 (2 incoming) 2,4,5,6 ( 1 incoming ) 1,2,4,5,6 (triggers 1 and 2 to be sent) If we evict the highest MID, we could have sent messages 1 through 5. If you think that the above behaviour is better in general, then no change is needed. If you want to evict the "oldest" message, we would have to keep track in what order the messages arrived in addition to sorting by MID. But I do not see any benefit in that.

Christian Grothoff 2019-02-26 16:22 manager ~0014070	Eh, what I suggested was to drop 2&1 in this case (as they are older than 3-6). Clearly 3-6 have higher MIDs than 1&2, right? So if we have a max of 4, we first keep 3,4,5,6 then get 1 (discard, ancient), get 2 (discard, ancient); still deliver 3,4,5,6 to client. If the client processes a message in between, we get: 3,4,5,6 then get 1 (discard, ancient), client processes 3 (head-of-queue): 4,5,6 we receive 2 (too old, no longer in order, so also discard): 4,5,6 If OTOH we get 1,4,5,6 first: 1,4,5,6 Then we get 2, so we discard 1: 2,4,5,6 if now client processes 2, we have 4,5,6 if we now receive 3, we ideally put it in front: 3,4,5,6 Right?

schanzen 2019-02-26 18:16 administrator ~0014085	Can you please test cadet in b34bcbb78..4e8df946b. I tried to fix the deadlock.

Date Modified	Username	Field	Change
2019-02-22 09:22	t3sserakt	New Issue
2019-02-22 09:22	t3sserakt	Status	new => assigned
2019-02-22 09:22	t3sserakt	Assigned To	=> t3sserakt
2019-02-22 22:11	t3sserakt	Status	assigned => feedback
2019-02-22 22:11	t3sserakt	Note Added: 0013950
2019-02-24 21:49	Christian Grothoff	Note Added: 0014021
2019-02-24 22:12	schanzen	Note Added: 0014022
2019-02-25 08:58	t3sserakt	Note Added: 0014023
2019-02-25 13:09	schanzen	Note Added: 0014024
2019-02-25 17:25	t3sserakt	Note Added: 0014026
2019-02-25 18:30	schanzen	Note Added: 0014029
2019-02-25 18:32	schanzen	Note Added: 0014030
2019-02-26 07:51	t3sserakt	Note Added: 0014038
2019-02-26 09:46	schanzen	Note Added: 0014042
2019-02-26 10:03	schanzen	Note Added: 0014044
2019-02-26 16:06	Christian Grothoff	Note Added: 0014065
2019-02-26 16:16	schanzen	Note Added: 0014069
2019-02-26 16:17	schanzen	Note Edited: 0014069
2019-02-26 16:22	Christian Grothoff	Note Added: 0014070
2019-02-26 18:16	schanzen	Note Added: 0014085
2019-02-28 12:24	Christian Grothoff	Status	feedback => resolved
2019-02-28 12:24	Christian Grothoff	Resolution	open => fixed
2019-02-28 12:24	Christian Grothoff	Fixed in Version	=> 0.11.0
2019-02-28 12:24	Christian Grothoff	Status	resolved => closed

View Issue Details

Activities

Issue History