
webrtc Flag not being transferred #5877

Open
DougAnderson444 opened this issue Feb 20, 2025 · 6 comments

Comments

@DougAnderson444
Contributor

DougAnderson444 commented Feb 20, 2025

Summary

When a webrtc stream is closed, a Flag::FIN is sent, but the remote never receives it. The same goes for Flag::RESET: it is never received by the remote peer.

In fact, I don't see any Flags being received.

This results in a webrtc-utils stream being closed on one end, but left to timeout on the other end.

Specifically, this is a problem right after the handshake, where the noise DataChannel is left to time out after 10 seconds; then a new data channel is established and streams can flow.

The impact is that WebRTC can only be used after this 10-second timeout, once the new data channel has been created to replace the closed one.

Expected behavior

It's expected that when

self.io.start_send_unpin(Message {
    flag: Some(Flag::FIN),
    message: None,
})?;

is called, the remote receives the Flag.

Actual behavior

No inbound Flag (FIN, RESET) appears to be received by the remote. In fact, this code never seems to be called:

pub(crate) fn handle_inbound_flag(&mut self, flag: Flag, buffer: &mut Bytes) {
    let current = *self;

    match (current, flag) {
        (Self::Open, Flag::FIN) => {
            *self = Self::ReadClosed;
        }
        (Self::WriteClosed, Flag::FIN) => {
            *self = Self::BothClosed { reset: false };
        }
        (Self::Open, Flag::STOP_SENDING) => {
            *self = Self::WriteClosed;
        }
        (Self::ReadClosed, Flag::STOP_SENDING) => {
            *self = Self::BothClosed { reset: false };
        }
        (_, Flag::RESET) => {
            buffer.clear();
            *self = Self::BothClosed { reset: true };
        }
        _ => {}
    }
}

Relevant log output

Possible Solution

I can't tell why the FIN and RESET flags are not being received. The code looks like it should work fine, yet these Flags never arrive.

Version

0.54.1

Would you like to work on fixing this bug?

I'll try

@elenaf9
Contributor

elenaf9 commented Feb 20, 2025

Thanks @DougAnderson444 for debugging this!

@DougAnderson444
Contributor Author

This is a subtle bug, and I really only caught it because I wanted to use the WebRTC channel right away and noticed this 10-second delay before I could start.

It becomes really obvious when we set the ping interval to 1 sec! Anyway, here's what I've got:

The issue

This sequence is the culprit:

webrtc-utils noise inbound:

channel.close().await?;

webrtc-utils noise outbound:

channel.close().await?;

Each .close() is trying to:

  • send Flag::FIN
  • flush
  • close self

...at exactly the same time.

see poll_close

fn poll_close(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<io::Result<()>> {
    loop {
        match self.state.close_write_barrier()? {
            Some(Closing::Requested) => {
                ready!(self.io.poll_ready_unpin(cx))?;

                self.io.start_send_unpin(Message {
                    flag: Some(Flag::FIN),
                    message: None,
                })?;

                self.state.close_write_message_sent();

                continue;
            }
            Some(Closing::MessageSent) => {
                ready!(self.io.poll_flush_unpin(cx))?;

                self.state.write_closed();
                let _ = self
                    .drop_notifier
                    .take()
                    .expect("to not close twice")
                    .send(GracefullyClosed {});

                return Poll::Ready(Ok(()));
            }
            None => return Poll::Ready(Ok(())),
        }
    }
}

No wonder Flag::FIN never arrives at the stream: it's already closed / being closed!

This puts each webrtc noise handshake data channel in an undefined state, and the stream remains open until it times out 10 seconds later, at which point a regular data channel is opened to replace it and normal operation resumes.

Background

Recall that there are 2 types of data channels:

  1. Noise handshake data channels (negotiated: true)
  2. Regular data channels (negotiated: false)
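
For reference, here is a minimal sketch of that distinction on the browser side using web-sys (the label, id, and options are illustrative assumptions, not necessarily the exact values libp2p uses): the handshake channel is pre-negotiated out-of-band with a fixed id on both peers, while regular channels are announced in-band.

// Sketch only; assumes the `RtcPeerConnection`, `RtcDataChannel` and
// `RtcDataChannelInit` web-sys cargo features are enabled.
use web_sys::{RtcDataChannel, RtcDataChannelInit, RtcPeerConnection};

fn open_channels(pc: &RtcPeerConnection) -> (RtcDataChannel, RtcDataChannel) {
    // 1. Noise handshake channel: negotiated out-of-band with the same id on
    //    both peers, so it exists without any in-band announcement.
    let mut init = RtcDataChannelInit::new();
    init.negotiated(true);
    init.id(0);
    let handshake = pc.create_data_channel_with_data_channel_dict("", &init);

    // 2. Regular data channel: `negotiated` defaults to false and the channel
    //    is announced to the remote in-band.
    let regular = pc.create_data_channel("");

    (handshake, regular)
}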

We should keep the close process on the regular channels, but we can skip it on the handshake channel once the handshake is successful. Trying to send data on a channel where both ends are closing at the same time seems to break webrtc.

I see 2 options:

Option A

In order to send Flag::FIN on regular channels but not on handshake channels, we would have to develop a complex way of determining whether the stream is a webrtc noise data channel (there is no API for it in web-sys), then skip FIN and flush in the stream's poll_close for the handshake only. This would take a large effort and introduce a lot of complexity. Not recommended.

Option B (Recommended)

We simply Drop the outbound webrtc noise handshake channel instead of closing it.

// misc/webrtc-utils/src/noise.rs
// async fn outbound<T>()
// ..
//  channel.close().await?; <= remove close
drop(channel); // <== skip straight to drop

If we skip the outbound close and just Drop it instead, no further data is sent (nor is it required, because the inbound side is already closing itself), the remote closes the Noise handshake channel immediately, and the regular data channel picks up as desired.

Recommendation

I think we should implement Option B and just drop the outbound noise channel instead of calling close. It works and introduces the least amount of change and complexity.

Happy to hear your thoughts

@dariusc93
Member

I am not completely well versed in the webrtc implementation in libp2p (and only know a bit about the specs), but based on the information provided, it sounds like option B might be the better choice. Option A would likely be complex and could end up introducing something that isn't in the spec (unless this is something we should address in the spec itself as well?). I'm just curious about any possible implications of dropping the channel without closing it.

@DougAnderson444
Contributor Author

As far as I can tell, the only impact of not closing the channel is not sending Flag::FIN or flushing those outgoing bytes ([2, 8, 0]).

But since the remote is closing anyway once the handshake is done, this seems redundant. Plus, the exchanged FINs seem to get stuck in the data channel on the read side(?), keeping the webrtc connection open but unusable (since it's trying to close). That's what appears to be happening.

When the outbound handshake channel is directly dropped, the next channel is opened and the stream continues on to exchange bytes.

DougAnderson444 added a commit to DougAnderson444/beetswap that referenced this issue Feb 21, 2025
@elenaf9
Contributor

elenaf9 commented Mar 3, 2025

Couple of questions to make sure I understand the issue correctly. I am not super familiar with the WebRTC implementation.

Specifically, this is a problem right after the handshake, where the noise DataChannel is left to time out after 10 seconds,

Where does this 10s timeout come from? Looking through the code, I can only find a 10s timeout for creating the RTCDataChannel for the noise handshake, not the handshake itself.

Trying to send data on a channel where both ends are closing at the same time seems to break webrtc.

What do you mean by that? Where do they try to send data on the channel (that is used for the noise handshake) after it is closed?

No inbound Flag (FIN, RESET) appears to be received by the remote. In fact, this code never seems to be called:

handle_inbound_flag is only called when reading from a stream, or when writing to a stream where the reading side is already closed. As you already mentioned, both sides call close after writing the last data, and because they don't continue reading afterwards, they also won't read any incoming flag anymore.
Still, this shouldn't cause any blocking issues, because as far as I can tell stream.close() doesn't wait for or depend on the remote peer parsing the FIN flag. Am I missing something? Where does the actual timeout happen?
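
To put that in a minimal sketch (simplified names, not the actual libp2p code):

// Toy model: inbound flags are only processed on the read path.
enum Flag {
    Fin,
}

struct Stream {
    read_closed: bool,
}

impl Stream {
    // Only ever invoked while the stream is still being read.
    fn handle_inbound_flag(&mut self, flag: Flag) {
        match flag {
            Flag::Fin => self.read_closed = true,
        }
    }
}

// Once both sides have called close() and stopped reading, nothing calls
// handle_inbound_flag anymore, so the incoming FIN is simply never observed.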

As far as I can tell, the only impact to not closing the channel is not sending Flag::FIN nor doing a flush on those outgoing bytes ([2, 8, 0]).

But since the remote is closing anyway as the handshake is done, this seems redundant. Plus the exchanged FINs seem to get stuck in the data channel on the read side(?), keeping the webrtc open, but unusable (since it's trying to close). This is what it looks like is happening.

I think it generally ensures that any previously written bytes are flushed, which is otherwise not guaranteed, no?
In noise.upgrade_outbound the last call is send_identity. If I understand it correctly, the upgrade returns as soon as these bytes have been queued, but it doesn't mean that they have been sent. Without flushing I think there is no guarantee that they are actually being sent.
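
For illustration, a hypothetical variant (the function and its bounds are assumptions, not an existing libp2p API) that would keep that flush guarantee while still skipping the FIN/close handshake:

// Hypothetical sketch, not the actual libp2p-webrtc-utils code.
use futures::{AsyncWrite, AsyncWriteExt};

async fn finish_outbound<C>(mut channel: C) -> std::io::Result<()>
where
    C: AsyncWrite + Unpin,
{
    // Make sure the last handshake bytes (send_identity) actually go out...
    channel.flush().await?;
    // ...then drop the channel without sending Flag::FIN or closing it.
    drop(channel);
    Ok(())
}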

@DougAnderson444
Contributor Author

Couple of questions to make sure I understand the issue correctly. I am not super familiar with the WebRTC implementation.

Sure, welcome to the party! 🎉

Where does this 10s timeout come from? Looking through the code, I can only find a 10s timeout for creating the RTCDataChannel for the noise handshake, not the handshake itself.

I believe this is the swarm idle connection timeout, not specific to webrtc. I could be wrong.

What do you mean by that? Where do they try to send data on the channel (that is used for the noise handshake) after it is closed?

It's happening here:

channel.close().await?;

Both sides of the noise handshake call .close() within nanoseconds of each other. Since .close() sends the FIN flag and flushes, this is where both sides are closing themselves and sending to the other side at the same time.

.close() eventually leads to this send here:

self.io.start_send_unpin(Message {
    flag: Some(Flag::FIN),
    message: None,
})?;

handle_inbound_flag is only called when reading from a stream, or when writing to a stream where the reading side is already closed. As you already mentioned, both sides call close after writing the last data, and because they don't continue reading afterwards, they also won't read any incoming flag anymore. Still, this shouldn't cause any blocking issues, because as far as I can tell stream.close() doesn't wait for or depend on the remote peer parsing the FIN flag. Am I missing something? Where does the actual timeout happen?

I suspect what is happening is that the FIN message gets stuck in the webrtc-rs data channel somehow, preventing it from actually closing. Then, I think, the swarm times out the stream and creates a new one. I haven't dug deep enough into the upstream deps to confirm this though.

I think it generally ensures that any previously written bytes are flushed, which is otherwise not guaranteed, no? In noise.upgrade_outbound the last call is send_identity. If I understand it correctly, the upgrade returns as soon as these bytes have been queued, but it doesn't mean that they have been sent. Without flushing I think there is no guarantee that they are actually being sent.

Yes, I think we understand the same thing, but I suspect it's these last bytes that are sent and flushed that actually prevent the other side's underlying data channel from closing. The data channel has these bytes in it, but the stream is closed... so it's neither closed nor usable. It stays stuck until the swarm times out the stream. I've done a bit of troubleshooting to try to trace those bytes, and it's as if the FIN bytes arrive on the data channel but never reach the stream level? Likely because the stream is closing / dropped?

I'm definitely open to other theories but this is what appears to be happening. 🤷 It may be a webrtc-rs issue where the channel may not close if there are unread incoming bytes. I haven't had the capacity to chase that theory down.

At the end of the day, the handshake is over by the time the noise handshake channel is being closed. I realize the "proper" thing to do would be to close the channel, but I'm not sure the design anticipated having both ends try to close themselves and the opposite side at the same time. Maybe we need to finish reading all channel bytes before closing too?
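
A rough sketch of that last idea, purely hypothetical (the helper and its bounds are assumptions, not an existing libp2p API): drain the read side until EOF before closing our own write side.

// Hypothetical sketch, not the actual libp2p-webrtc-utils code.
use futures::{AsyncRead, AsyncReadExt, AsyncWrite, AsyncWriteExt};

async fn drain_then_close<C>(mut channel: C) -> std::io::Result<()>
where
    C: AsyncRead + AsyncWrite + Unpin,
{
    let mut scratch = [0u8; 64];
    // Keep reading until the remote's FIN surfaces as EOF (read returns 0),
    // so no inbound flag is left sitting unread in the data channel.
    while channel.read(&mut scratch).await? != 0 {}
    // Only then send our own Flag::FIN and flush via close().
    channel.close().await
}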
