View Issue Details
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0005597||GNUnet||cadet service||public||2019-02-22 09:22||2019-02-28 12:24|
|Product Version||SVN HEAD|
|Target Version||Fixed in Version||0.11.0|
|Summary||0005597: Deadlock for non reliable channel in case of missing message|
|Description||In case of a non reliable channel - which is the case for a channel with default channel option GNUNET_CADET_OPTION_DEFAULT, I encountered a deadlock at the receiving node, if a message is not delivered to that receiving node.|
This error can be triggered if someone send messages (I send about 20 in under 10 seconds) to a node quite fast in a row.
What happens can be understood if you look into method GCCH_handle_channel_plaintext_data in gnunet-service-cadet_channel.c.
There is a check if the MID (message ID) of the received message equals the MID the channel is expecting.
If the client is ready and the received MID equals the expected MID the message is send to the client.
If not the message is queued. Before queuing the message the queue is checked if we reached the maximum number of messages in the queue.
This value is hard code to 4. Apart from that this number should be configurable, I think the resulting behavior is odd.
Because we are dropping the oldest queued message in favor of an even older message. Moreover, if this message never appears the channel is blocked.
Proposal how to change that behavior:
We will not drop the oldest message in the queue, but we send as much messages from the queue as we have messages with consecutive MIDs. After that the queue is empty, or we again wait for the message that is missing now. Means we have to set the expected MID to that MID, because we are now waiting for another message to arrive.
|Steps To Reproduce||This error can be triggered if someone send messages (I send about 20 in under 10 seconds) to a node quite fast in a row.|
|Tags||No tags attached.|
||I like to have feedback on the proposed change of behavior.|
4 was hard-coded to make exactly these types of errors easier to spot for now. Later it should indeed be either adaptive, configurable, or both.
I agree it makes no sense to drop a messsage in favor of an even older message.
For unreliable channels, I think the desired behavior should be to queue the most recent THRESHOLD (=4, for now) messages. Can you fix this very soon (as in, before 28.2?)
I don't think I agree with this
"We will not drop the oldest message in the queue, but we send as much messages from the queue as we have messages with consecutive MIDs. After that the queue is empty, or we again wait for the message that is missing now. Means we have to set the expected MID to that MID, because we are now waiting for another message to arrive."
if it means delivering messages out of order just because we have a long sequence in the queue. If you want them out of order, you should use another option.
||No, no messages will be delivered out of order. We just drop message the message we are waiting for, and not the message we already received.|
I think we all agree on at least changing the queue eviction behaviour. So you can go ahead and do that. If that improves you test already and also passes a make check just put it in.
The change wrt bulk message sending to the client is a bigger change for which I see a few complications, in particular a change in how messages would be delivered to the client.
So maybe do the first now and if that is sufficient for your tests delay other, more significant changes for after 0.11?
||That does not help. The channel would still be blocked, because no messages would be send to the client. If no message is send to the client, no ACK is coming from the client, and no bulk transmission of the queued messages will be triggered. It would at least be necessary to send the message with the lowest MID in the queue to the client.|
I am sorry I fail to see the situation in which this can happen.
(a) we receive the expected next message ID, in which case the "send to client"/"ack" tandem will automatically collect all messages in order from the queue until the next missing message (see lines 1309ff).
(b) We received a message ID > expected message ID, in which case we add it sorted to the queue. There is nothing to send in this case.
There is a (c) where we receive a message ID < expected message, but in that case we can drop.
Now what I do see is that if in 1309 the client is not ready to receive any messages it might block if the client does not indicate the he is ready again (idk if that is happening). But that has nothing to do with the above.
Am I missing something?
||Apparently in the case of client_ready == GNUNET_NO, the client will eventually trigger GCCH_handle_local_ack, in which case the queue is send to client.|
Your a, b and c is perfectly fine, but ... ;-)
The missing puzzle peace is that if the client is ready, what he always will be, because he will not get a message, because we are waiting for a message with a certain MID... We can not break that cycle.
We are ending up with (b) over and over again.
Maybe we should introduce a timeout for waiting for a certain message. But if we did not reached that timeout already, and if we hit the max_pending_messages we actually drop messages we already received.
But I see now that in GCCH_handle_local_ack the queue is not send in bulk. Is that what you tried to tell me? :-)
Then I would change my proposal to just send the next message in the queue like in GCCH_handle_local_ack, setting the MID the channel is expecting to ID of the message we just send to the client plus one.
In GCCH_handle_local_ack this is not done if the channel is reliable, but we are talking about non reliable channels here.
By the way, the logging
"Got LOCAL_ACK, %s-%X ready to receive more data (but next one is out-of-order %u vs. %u)!\n"
in GCCH_handle_local_ack is not correct. Should be
"Got LOCAL_ACK, %s-%X ready to receive more data (but next one is not the expected one in order %u vs. %u)!\n" as in the comment after the return some lines later.
"The missing puzzle peace is that if the client is ready, what he always will be, because he will not get a message, because we are waiting for a message with a certain MID... We can not break that cycle.
We are ending up with (b) over and over again."
If the client is ready, the message will either be sent if it is the expected MID or it will not if it isn't in line 1320.
This will trigger the ack and the rest of the queue is sent.
We cannot send any other message(es) because of the in-order requirement.
If you are thinking of a case where the MIDs 1,2,4,5 are in the queue and we receive MID 3 then I think this is constructed as if you think about it this can never happen.
||Can you maybe locally fix the queue eviction for replace the message with the highest MID instead of the lowest and run your tests again? See if it solves it?|
||schanzen: I'm confused. I would have thought we should drop the _oldest_ message in the queue and only keep the "4" most recent ones. Why do you propose to drop the highest MID? That seems wrong on many levels.|
Well because if we get 3 -> 4 -> 5 -> 6 and then 2 -> 1 we will have the following queue:
( 1 incoming )
1,2,4,5,6 (triggers 1 and 2 to be sent)
If we evict the highest MID, we could have sent messages 1 through 5.
If you think that the above behaviour is better in general, then no change is needed.
If you want to evict the "oldest" message, we would have to keep track in what order the messages arrived in addition to sorting by MID. But I do not see any benefit in that.
Eh, what I suggested was to drop 2&1 in this case (as they are older than 3-6). Clearly 3-6 have higher MIDs than 1&2, right? So if we have a max of 4, we first keep
then get 1 (discard, ancient), get 2 (discard, ancient); still deliver 3,4,5,6 to client. If the client processes a message in between, we get:
then get 1 (discard, ancient), client processes 3 (head-of-queue):
we receive 2 (too old, no longer in order, so also discard):
If OTOH we get 1,4,5,6 first:
Then we get 2, so we discard 1:
if now client processes 2, we have
if we now receive 3, we ideally put it in front:
||Can you please test cadet in b34bcbb78..4e8df946b. I tried to fix the deadlock.|
|2019-02-22 09:22||t3sserakt||New Issue|
|2019-02-22 09:22||t3sserakt||Status||new => assigned|
|2019-02-22 09:22||t3sserakt||Assigned To||=> t3sserakt|
|2019-02-22 22:11||t3sserakt||Status||assigned => feedback|
|2019-02-22 22:11||t3sserakt||Note Added: 0013950|
|2019-02-24 21:49||Christian Grothoff||Note Added: 0014021|
|2019-02-24 22:12||schanzen||Note Added: 0014022|
|2019-02-25 08:58||t3sserakt||Note Added: 0014023|
|2019-02-25 13:09||schanzen||Note Added: 0014024|
|2019-02-25 17:25||t3sserakt||Note Added: 0014026|
|2019-02-25 18:30||schanzen||Note Added: 0014029|
|2019-02-25 18:32||schanzen||Note Added: 0014030|
|2019-02-26 07:51||t3sserakt||Note Added: 0014038|
|2019-02-26 09:46||schanzen||Note Added: 0014042|
|2019-02-26 10:03||schanzen||Note Added: 0014044|
|2019-02-26 16:06||Christian Grothoff||Note Added: 0014065|
|2019-02-26 16:16||schanzen||Note Added: 0014069|
|2019-02-26 16:17||schanzen||Note Edited: 0014069||View Revisions|
|2019-02-26 16:22||Christian Grothoff||Note Added: 0014070|
|2019-02-26 18:16||schanzen||Note Added: 0014085|
|2019-02-28 12:24||Christian Grothoff||Status||feedback => resolved|
|2019-02-28 12:24||Christian Grothoff||Resolution||open => fixed|
|2019-02-28 12:24||Christian Grothoff||Fixed in Version||=> 0.11.0|
|2019-02-28 12:24||Christian Grothoff||Status||resolved => closed|