WebRTC Interaction of Two Browsers During a Voice/Video Call
When Adobe Flash stopped being supported many years ago, it wasn’t only games that suffered. Flash had traditionally been strong in voice and video calls: direct access to the microphone, camera, and speakers, and the ability to work with UDP packets. In HTML5, the replacement is a technology called WebRTC, the one that finally landed in Safari and Edge several months ago. Now you can call from a web page opened on an iPhone to another web page, for example, one opened in Firefox Quantum on Linux.
One of the advantages of WebRTC over Adobe Flash is the possibility of peer-to-peer connections between browsers. But to make peer-to-peer work, the developer has to suffer a bit. How browsers agree on where to send UDP packets, and what the developer has to do about it, is what this article is about.
Signaling: the thing they try not to talk about
Most WebRTC tutorials tell the same story: a cool replacement for Flash, voice and video calls straight from the browser, a beautiful tale of peer-to-peer and a ten-megabit video stream with no delays when video calling from your iPhone to a Windows laptop, provided both are connected to the same WiFi. The code samples usually show a few lines of JavaScript, convincingly demonstrating how easy it all is.
The trick is that what these tutorials usually show is a wrapper over WebRTC. And besides hiding the guts of RTCPeerConnection and MediaDevices.getUserMedia from the developer, such wrappers also hide all the communication between the two browsers, using their own cloud and technology stack for the purpose: whether it’s PubNub, Twilio, or Voximplant.
Doing the work for the developer is good and right. But by simplifying the technology stack, we often plant a time bomb for ourselves: a misunderstanding of what happens “under the hood” leads to blown deadlines, workaround-riddled solutions, and the “technical problems” that technical support is so fond of.
This story is about signaling in WebRTC: how we and other companies do it, and how you can do it yourself if you want to build a solution from scratch, without ready-made services.
Why does a P2P call need a server?
Hearing the phrase “peer-to-peer”, we usually think of torrents, which seem to have no central server. So what is “signaling” in WebRTC, and where does a server come into it?
Suppose you created a web page with WebRTC and JavaScript code. You opened it on three laptops connected to your WiFi and want the first laptop to make a video call to the third one. How does WebRTC on the first laptop find out that it should connect to the third one? What would we do in the WebRTC developers’ shoes?
- The first method that comes to mind is to pass the IP address of the third laptop to WebRTC on the first one and let it send UDP packets. But this will work only if both devices are connected to the same network and that network lets them receive each other’s packets (surprise: public WiFi in hotels and at conference venues often does not). And what if we have not one but three WiFi access points, all three laptops are connected to different ones, and each has the same virtual IP address, for example “192.168.0.5”? Where should the browser running on the first laptop send its packets?
We could assume that in this situation there simply won’t be a call, and that we always need an external server with a “real” IP address through which the browsers on both laptops can talk to each other. But the authors of WebRTC considered that voice and video are traffic-heavy, and if millions of Skype for Web or Google Hangouts users were to call through public servers, those servers would burst. So the creators of WebRTC gave the technology the ability to “pierce” NAT and establish P2P connections even when both devices have virtual IP addresses and cannot exchange packets directly. The price was that very “signaling”. The developer cannot simply pass WebRTC the IP address of the second device or of an external server. He needs to help both browsers carefully examine the network and negotiate with each other. And for that, he needs his own signaling server.
Offer, Answer, ICE candidates and other scary words
So, what does a video call between two browsers look like from the developer’s point of view? (A code sketch of the whole exchange follows this list.)
- After all the preliminary preparation and creation of the necessary JavaScript objects, the WebRTC method createOffer() is called on the first browser; it returns a text packet in SDP format (or, in the future, a JSON-serializable object, if the ORTC flavor of the API displaces the “classic” one). The packet describes what kind of communication the developer wants: voice, video, or data, and which codecs are available
- And now, the signaling. The developer must somehow (really, that’s what the specification says!) transfer this text packet to the second browser. For example, via his own server on the Internet and WebSocket connections from both browsers
- After receiving the offer, the developer on the second browser passes it to WebRTC using the setRemoteDescription() method, then calls the createAnswer() method, which returns the same kind of text packet in SDP format, but for the second browser and taking into account the packet received from the first
- The signaling continues: the developer transfers the answer packet back to the first browser
- After receiving the answer, the developer on the first browser passes it to WebRTC using the already mentioned setRemoteDescription() method, after which WebRTC in both browsers is minimally aware of the other side. Can the connection be made now? Unfortunately, no. In fact, everything is only beginning
- WebRTC in both browsers begins to study the state of the network connection (the standard does not actually specify when this should happen, and in many browsers WebRTC starts exploring the network right after the corresponding objects are created, so as not to add delays later when connecting). When creating the WebRTC objects in the first step, the developer should at least have passed the address of a STUN server. This is a server that, in response to a UDP packet asking “what is my IP”, replies with the IP address from which it received that packet. WebRTC uses a STUN server to learn its “external” IP address, compare it with the “internal” one, and see whether there is a NAT in between. And if there is, which ports the NAT uses to route UDP packets
- From time to time, WebRTC in both browsers fires the onicecandidate callback, passing a text packet with information for the other participant in the connection. The packet contains information about internal and external IP addresses, ports used by the NAT, and so on. The developer uses signaling to transfer these packets between the browsers; a received packet is handed to WebRTC via the addIceCandidate() method.
- After a while, WebRTC establishes the peer-to-peer connection. Or fails to, if the NAT gets in the way. For such cases, the developer can pass the address of a TURN server, which will be used as an external relay: both browsers will send their UDP packets with voice or video through it. While a free STUN server is easy to find (Google runs one, for example), a TURN server you will have to set up yourself: nobody is interested in relaying terabytes of video traffic for free
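Here is a minimal sketch of the whole exchange described above. The `signaling` object is hypothetical: anything that can deliver text messages to the other browser will do, and error handling is omitted for brevity.

```js
// Google's public STUN server; a TURN server with credentials could be
// listed here as well.
const pc = new RTCPeerConnection({
  iceServers: [{ urls: 'stun:stun.l.google.com:19302' }]
});

// Send each ICE candidate to the other browser as soon as it is discovered.
pc.onicecandidate = (event) => {
  if (event.candidate) {
    signaling.send(JSON.stringify({ ice: event.candidate }));
  }
};

// Caller side: create the offer and push it into the signaling channel.
async function call(localStream) {
  localStream.getTracks().forEach((track) => pc.addTrack(track, localStream));
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  signaling.send(JSON.stringify({ sdp: pc.localDescription }));
}

// Both sides: handle packets arriving from the signaling channel.
signaling.onmessage = async (text) => {
  const packet = JSON.parse(text);
  if (packet.sdp) {
    await pc.setRemoteDescription(packet.sdp);
    if (packet.sdp.type === 'offer') {
      // Callee side: answer the received offer.
      const answer = await pc.createAnswer();
      await pc.setLocalDescription(answer);
      signaling.send(JSON.stringify({ sdp: pc.localDescription }));
    }
  } else if (packet.ice) {
    await pc.addIceCandidate(packet.ice);
  }
};
```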
All these nuances can be hidden if you use a ready-made platform. Our Web SDK correctly configures WebRTC, patches SDP packets, maintains the WebSocket connection to the Voximplant cloud, and takes care of lots of other details. And of course, we have our own STUN and TURN servers so that the connection succeeds in any case. But you don’t have to hide the nuances: you can handle everything yourself! The APIs now available in browsers allow you to implement signaling in many ways; more about them below.
Simple signaling with HTTP requests, which doesn’t work
The first thing that comes to mind is a simple HTTP server and XMLHttpRequest/fetch from the browser. Alas, this will only work for a textbook “hello world”. In real life, the server will drown in requests, which have to be made quite often so that the user doesn’t spend several minutes staring at “establishing connection” after clicking “connect”. And they do have to be frequent: WebRTC is a real-time story, and the offer/answer/ICE packets need to be transferred very quickly. A delay of even a few seconds can signal to WebRTC that “nothing is happening”, after which the engine stops trying to establish the connection. As an option, you could try the “long polling” technique, but in practice it doesn’t work very well: intermediate Internet infrastructure likes to cut off such “slow” HTTP requests.
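Just to make the approach concrete, here is a naive polling sketch; the /poll and /send endpoints and the packet format are made up for illustration. Note the aggressive interval: anything slower risks the remote WebRTC giving up while a packet sits on the server.

```js
// Poll the (hypothetical) server for signaling packets addressed to us.
async function pollSignaling(callId, onPacket) {
  for (;;) {
    const response = await fetch(`/poll?call=${callId}`);
    for (const packet of await response.json()) {
      onPacket(packet);
    }
    // Even 500 ms adds noticeable latency to the offer/answer/ICE exchange.
    await new Promise((resolve) => setTimeout(resolve, 500));
  }
}

// Push our own offer/answer/ICE packets to the other side.
function sendPacket(callId, packet) {
  return fetch(`/send?call=${callId}`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(packet),
  });
}
```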
WebSockets: the most effective tactic available
Most solutions using WebRTC use WebSockets for signaling. The protocol is old enough to be supported by the vast majority of browsers in use and by network equipment. And if you use a wrapper like socket.io or SockJS, then in those rare cases when WebSocket doesn’t work you can degrade to HTTP long polling, which will work “at least somehow”. On the server side, a WebSocket connection that isn’t transmitting data consumes almost no resources, and a server can easily hold tens of thousands of web pages waiting for calls.
What problems can there be with WebSockets? Well, connections sometimes drop, and that has to be handled. They also have long keep-alive timeouts: a connection may look alive while it has in fact already been severed somewhere on intermediate equipment, and we learn about it only when the next keep-alive packet fails to arrive, which can take ten minutes, during which people try to reach us and cannot. The keep-alive mechanism is left to browser and server implementations, so it is worth checking how your server sends ping-pong frames and tweaking it if necessary.
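A minimal client-side sketch, assuming a hypothetical wss:// endpoint, with an application-level ping on top of the protocol-level one so that a silently severed connection is noticed in seconds rather than minutes:

```js
function connectSignaling(onPacket) {
  const ws = new WebSocket('wss://signaling.example.com/call');
  let keepAlive;

  ws.onopen = () => {
    // Don't trust built-in keep-alives alone: ping every 10 seconds.
    keepAlive = setInterval(
      () => ws.send(JSON.stringify({ type: 'ping' })), 10000);
  };

  ws.onmessage = (event) => {
    const packet = JSON.parse(event.data);
    if (packet.type !== 'pong') onPacket(packet); // offer / answer / ICE
  };

  ws.onclose = () => {
    clearInterval(keepAlive);
    // Connections do drop; reconnect and let the app re-signal if needed.
    setTimeout(() => connectSignaling(onPacket), 1000);
  };

  return ws;
}
```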
HTTP/2 signaling as a modern equivalent of WebSockets
As the second version of HTTP becomes more popular, WebSockets and Server-Sent Events are likely to become a thing of the past. A binary, bidirectional channel to the server, over which you can fetch an HTML page and images and also run WebRTC signaling, is very cool. Unfortunately, despite support in the latest versions of popular browsers, HTTP/2 is still risky to use for projects with a wide audience. The reason is the intermediate equipment that makes up the “skeleton” of the Internet: all those routers, gateways, firewalls, and 20-year-old proxies often terminate HTTP/2 connections, not understanding what they are and trying to “protect” something from someone.
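If you do control the infrastructure end to end, a sketch of signaling over HTTP/2-friendly primitives might look like this: Server-Sent Events for the server-to-browser direction, plain POSTs for the other. Over HTTP/2 both get multiplexed onto one connection. The endpoints and the handleSignalingPacket helper are hypothetical.

```js
// Server-to-browser: a stream of signaling packets via Server-Sent Events.
function openSignaling(callId, handleSignalingPacket) {
  const events = new EventSource(`/events?call=${callId}`);
  events.onmessage = (event) => handleSignalingPacket(JSON.parse(event.data));
  return events;
}

// Browser-to-server: each outgoing packet is a small POST; over HTTP/2 these
// reuse the same underlying connection as the event stream.
function sendPacket(callId, packet) {
  return fetch(`/send?call=${callId}`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(packet),
  });
}
```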
WebRTC signaling as an example of recursion
Another WebRTC connection can be used for WebRTC signaling! It sounds strange, but this method has its advantages. If a first WebRTC connection is established between the browser and the cloud (as we do for non-P2P calls) using some other signaling, that connection can then carry signaling over the Data Channel API, which differs from WebSockets in that it can work not only “like TCP” but also “like UDP”, sending packets very quickly without guaranteed delivery. This makes signaling for new connections very fast, faster than WebSockets or HTTP/2. In some cases, this method is just what you need. For example, in games.
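A sketch of the idea, assuming cloudConnection is an already-established RTCPeerConnection to the cloud and handleSignalingPacket is our hypothetical dispatcher; the options below request the UDP-like mode the Data Channel API offers:

```js
// Open an unordered channel without retransmissions: minimal latency, but
// important packets (like the SDP offer) must be re-sent by the application.
const channel = cloudConnection.createDataChannel('signaling', {
  ordered: false,
  maxRetransmits: 0,
});

channel.onopen = () => {
  // e.g. forward the offer of a new RTCPeerConnection being set up
  // (pc is the connection from the earlier sketch).
  channel.send(JSON.stringify({ sdp: pc.localDescription }));
};

channel.onmessage = (event) => handleSignalingPacket(JSON.parse(event.data));
```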
TL;DR
In summary: before WebRTC establishes a peer-to-peer connection, the developer must give the two browsers (or other devices: the libwebrtc library from Google brings WebRTC to anything that compiles C++) a way to exchange several text packets. This has to be done quickly, otherwise timeouts will kick in. Platforms do the signaling (and much more) for the developer, but if you really need to, you can do everything yourself. Just keep in mind the heap of nuances, and be ready to debug it all