View Issue Details

IDProjectCategoryView StatusLast Update
0005597GNUnetcadet servicepublic2019-02-28 12:24
Reportert3sserakt Assigned Tot3sserakt  
PrioritynormalSeveritymajorReproducibilityrandom
Status closedResolutionfixed 
Product VersionGit master 
Fixed in Version0.11.0 
Summary0005597: Deadlock for non reliable channel in case of missing message
DescriptionIn case of a non reliable channel - which is the case for a channel with default channel option GNUNET_CADET_OPTION_DEFAULT, I encountered a deadlock at the receiving node, if a message is not delivered to that receiving node.

This error can be triggered if someone send messages (I send about 20 in under 10 seconds) to a node quite fast in a row.

What happens can be understood if you look into method GCCH_handle_channel_plaintext_data in gnunet-service-cadet_channel.c.

There is a check if the MID (message ID) of the received message equals the MID the channel is expecting.

If the client is ready and the received MID equals the expected MID the message is send to the client.

If not the message is queued. Before queuing the message the queue is checked if we reached the maximum number of messages in the queue.

This value is hard code to 4. Apart from that this number should be configurable, I think the resulting behavior is odd.

Because we are dropping the oldest queued message in favor of an even older message. Moreover, if this message never appears the channel is blocked.

Proposal how to change that behavior:

We will not drop the oldest message in the queue, but we send as much messages from the queue as we have messages with consecutive MIDs. After that the queue is empty, or we again wait for the message that is missing now. Means we have to set the expected MID to that MID, because we are now waiting for another message to arrive.
Steps To ReproduceThis error can be triggered if someone send messages (I send about 20 in under 10 seconds) to a node quite fast in a row.
TagsNo tags attached.

Activities

t3sserakt

2019-02-22 22:11

developer   ~0013950

I like to have feedback on the proposed change of behavior.

Christian Grothoff

2019-02-24 21:49

manager   ~0014021

4 was hard-coded to make exactly these types of errors easier to spot for now. Later it should indeed be either adaptive, configurable, or both.

I agree it makes no sense to drop a messsage in favor of an even older message.

For unreliable channels, I think the desired behavior should be to queue the most recent THRESHOLD (=4, for now) messages. Can you fix this very soon (as in, before 28.2?)

schanzen

2019-02-24 22:12

administrator   ~0014022

I don't think I agree with this

"We will not drop the oldest message in the queue, but we send as much messages from the queue as we have messages with consecutive MIDs. After that the queue is empty, or we again wait for the message that is missing now. Means we have to set the expected MID to that MID, because we are now waiting for another message to arrive."

if it means delivering messages out of order just because we have a long sequence in the queue. If you want them out of order, you should use another option.

t3sserakt

2019-02-25 08:58

developer   ~0014023

No, no messages will be delivered out of order. We just drop message the message we are waiting for, and not the message we already received.

schanzen

2019-02-25 13:09

administrator   ~0014024

I think we all agree on at least changing the queue eviction behaviour. So you can go ahead and do that. If that improves you test already and also passes a make check just put it in.
The change wrt bulk message sending to the client is a bigger change for which I see a few complications, in particular a change in how messages would be delivered to the client.
So maybe do the first now and if that is sufficient for your tests delay other, more significant changes for after 0.11?

t3sserakt

2019-02-25 17:25

developer   ~0014026

That does not help. The channel would still be blocked, because no messages would be send to the client. If no message is send to the client, no ACK is coming from the client, and no bulk transmission of the queued messages will be triggered. It would at least be necessary to send the message with the lowest MID in the queue to the client.

schanzen

2019-02-25 18:30

administrator   ~0014029

I am sorry I fail to see the situation in which this can happen.
Either
(a) we receive the expected next message ID, in which case the "send to client"/"ack" tandem will automatically collect all messages in order from the queue until the next missing message (see lines 1309ff).
OR
(b) We received a message ID > expected message ID, in which case we add it sorted to the queue. There is nothing to send in this case.

There is a (c) where we receive a message ID < expected message, but in that case we can drop.

Now what I do see is that if in 1309 the client is not ready to receive any messages it might block if the client does not indicate the he is ready again (idk if that is happening). But that has nothing to do with the above.
Am I missing something?

schanzen

2019-02-25 18:32

administrator   ~0014030

Apparently in the case of client_ready == GNUNET_NO, the client will eventually trigger GCCH_handle_local_ack, in which case the queue is send to client.

t3sserakt

2019-02-26 07:51

developer   ~0014038

Your a, b and c is perfectly fine, but ... ;-)

The missing puzzle peace is that if the client is ready, what he always will be, because he will not get a message, because we are waiting for a message with a certain MID... We can not break that cycle.

We are ending up with (b) over and over again.

Maybe we should introduce a timeout for waiting for a certain message. But if we did not reached that timeout already, and if we hit the max_pending_messages we actually drop messages we already received.

But I see now that in GCCH_handle_local_ack the queue is not send in bulk. Is that what you tried to tell me? :-)

Then I would change my proposal to just send the next message in the queue like in GCCH_handle_local_ack, setting the MID the channel is expecting to ID of the message we just send to the client plus one.

In GCCH_handle_local_ack this is not done if the channel is reliable, but we are talking about non reliable channels here.

By the way, the logging

"Got LOCAL_ACK, %s-%X ready to receive more data (but next one is out-of-order %u vs. %u)!\n"

in GCCH_handle_local_ack is not correct. Should be

"Got LOCAL_ACK, %s-%X ready to receive more data (but next one is not the expected one in order %u vs. %u)!\n" as in the comment after the return some lines later.

schanzen

2019-02-26 09:46

administrator   ~0014042

"The missing puzzle peace is that if the client is ready, what he always will be, because he will not get a message, because we are waiting for a message with a certain MID... We can not break that cycle.

We are ending up with (b) over and over again."

If the client is ready, the message will either be sent if it is the expected MID or it will not if it isn't in line 1320.
This will trigger the ack and the rest of the queue is sent.
We cannot send any other message(es) because of the in-order requirement.
If you are thinking of a case where the MIDs 1,2,4,5 are in the queue and we receive MID 3 then I think this is constructed as if you think about it this can never happen.

schanzen

2019-02-26 10:03

administrator   ~0014044

Can you maybe locally fix the queue eviction for replace the message with the highest MID instead of the lowest and run your tests again? See if it solves it?

Christian Grothoff

2019-02-26 16:06

manager   ~0014065

schanzen: I'm confused. I would have thought we should drop the _oldest_ message in the queue and only keep the "4" most recent ones. Why do you propose to drop the highest MID? That seems wrong on many levels.

schanzen

2019-02-26 16:16

administrator   ~0014069

Last edited: 2019-02-26 16:17

Well because if we get 3 -> 4 -> 5 -> 6 and then 2 -> 1 we will have the following queue:

3,4,5,6
(2 incoming)
2,4,5,6
( 1 incoming )
1,2,4,5,6 (triggers 1 and 2 to be sent)

If we evict the highest MID, we could have sent messages 1 through 5.

If you think that the above behaviour is better in general, then no change is needed.

If you want to evict the "oldest" message, we would have to keep track in what order the messages arrived in addition to sorting by MID. But I do not see any benefit in that.

Christian Grothoff

2019-02-26 16:22

manager   ~0014070

Eh, what I suggested was to drop 2&1 in this case (as they are older than 3-6). Clearly 3-6 have higher MIDs than 1&2, right? So if we have a max of 4, we first keep
3,4,5,6
then get 1 (discard, ancient), get 2 (discard, ancient); still deliver 3,4,5,6 to client. If the client processes a message in between, we get:
3,4,5,6
then get 1 (discard, ancient), client processes 3 (head-of-queue):
4,5,6
we receive 2 (too old, no longer in order, so also discard):
4,5,6

If OTOH we get 1,4,5,6 first:
1,4,5,6
Then we get 2, so we discard 1:
2,4,5,6
if now client processes 2, we have
4,5,6
if we now receive 3, we ideally put it in front:
3,4,5,6
Right?

schanzen

2019-02-26 18:16

administrator   ~0014085

Can you please test cadet in b34bcbb78..4e8df946b. I tried to fix the deadlock.

Issue History

Date Modified Username Field Change
2019-02-22 09:22 t3sserakt New Issue
2019-02-22 09:22 t3sserakt Status new => assigned
2019-02-22 09:22 t3sserakt Assigned To => t3sserakt
2019-02-22 22:11 t3sserakt Status assigned => feedback
2019-02-22 22:11 t3sserakt Note Added: 0013950
2019-02-24 21:49 Christian Grothoff Note Added: 0014021
2019-02-24 22:12 schanzen Note Added: 0014022
2019-02-25 08:58 t3sserakt Note Added: 0014023
2019-02-25 13:09 schanzen Note Added: 0014024
2019-02-25 17:25 t3sserakt Note Added: 0014026
2019-02-25 18:30 schanzen Note Added: 0014029
2019-02-25 18:32 schanzen Note Added: 0014030
2019-02-26 07:51 t3sserakt Note Added: 0014038
2019-02-26 09:46 schanzen Note Added: 0014042
2019-02-26 10:03 schanzen Note Added: 0014044
2019-02-26 16:06 Christian Grothoff Note Added: 0014065
2019-02-26 16:16 schanzen Note Added: 0014069
2019-02-26 16:17 schanzen Note Edited: 0014069
2019-02-26 16:22 Christian Grothoff Note Added: 0014070
2019-02-26 18:16 schanzen Note Added: 0014085
2019-02-28 12:24 Christian Grothoff Status feedback => resolved
2019-02-28 12:24 Christian Grothoff Resolution open => fixed
2019-02-28 12:24 Christian Grothoff Fixed in Version => 0.11.0
2019-02-28 12:24 Christian Grothoff Status resolved => closed