To me, the key innovation here is the tight integration between network conditions and codec frame size. Standard codecs are configured with a target bandwidth and produce encoded frames that "average" around that size. You could re-initialize a codec at a lower bandwidth on the fly, but you would have to send an I-frame (a large, full frame) to kick off the new series of frames (since most video frames are just updates of a previous frame). Having a codec accept a bandwidth target per frame is a really good idea.
Codecs used by real-time video systems are able to adjust the bitrate on the fly. There isn't a keyframe request every time that happens unless the resolution changes. How quickly they adjust may vary; software implementations generally do it for the next encoded frame. The frame will still be somewhat larger or smaller than the target size, since codecs can't accurately predict the encoded size for given quality parameters.
The Salsify implementation in the paper has a slightly more accurate way of producing each frame: it encodes two versions with different quality targets and takes the larger one that is still below the frame-size target.
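For concreteness, here's a toy sketch of that pick-the-best-fit step (MockEncoder, the quantizer values, and the selection rule are all illustrative stand-ins, not the paper's actual VP8 changes):

    # Minimal sketch of Salsify-style frame selection, not the paper's actual code.
    # "Encoder" here is a stand-in for a codec that can produce the same frame at
    # two quality settings from the same saved state.

    class MockEncoder:
        def encode(self, raw_frame, quantizer):
            # Stand-in: pretend a lower quantizer produces a bigger output.
            return bytes(len(raw_frame) // quantizer)

    def pick_frame(encoder, raw_frame, target_bytes):
        # Two trial encodes of the same source frame.
        higher_quality = encoder.encode(raw_frame, quantizer=10)
        lower_quality = encoder.encode(raw_frame, quantizer=40)
        # Take the better-looking version that still fits the transport's target.
        for candidate in (higher_quality, lower_quality):
            if len(candidate) <= target_bytes:
                return candidate
        return None  # neither fits: skip the frame and wait for better conditions

    if __name__ == "__main__":
        frame = bytes(1920 * 1080)  # fake raw frame
        print(len(pick_frame(MockEncoder(), frame, target_bytes=100_000) or b""))

In the real system the two trial encodes start from the same saved encoder state, so whichever version is discarded leaves no trace in the reference buffers.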
For a resolution change, couldn't you just scale the last decoded frame to the new resolution and use that as the basis for more P-frames? (originally replied to the wrong comment)
Longer answer: The codec needs to support it. Codecs actually allow prediction from multiple reference frames, and maintain a buffer of them (2 to 16, depending on the codec, profile, and level). An individual frame may refer to several (potentially all) of those. So re-scaling up to 16 frames for every frame you decode will get quite expensive, not to mention the generational losses of doing this repeatedly for every resolution change. In practice what happens is you scale individual blocks when they get referenced by the current frame. But that has to be integrated into the motion compensation routines of the codec.
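To illustrate the per-block alternative, here's a toy example with nearest-neighbour scaling and made-up dimensions (nothing like a real codec's interpolation filters): instead of rescaling whole reference frames up front, the decoder fetches and scales only the block a motion vector actually points at.

    import numpy as np

    def fetch_scaled_block(ref_frame, mv_x, mv_y, block, scale):
        """Nearest-neighbour fetch of a block x block region from ref_frame,
        which is stored at 1/scale of the current resolution."""
        ys = ((mv_y + np.arange(block)) / scale).astype(int)
        xs = ((mv_x + np.arange(block)) / scale).astype(int)
        ys = np.clip(ys, 0, ref_frame.shape[0] - 1)
        xs = np.clip(xs, 0, ref_frame.shape[1] - 1)
        return ref_frame[np.ix_(ys, xs)]

    # Example: an old 640x360 reference used while decoding at 1280x720 (scale=2).
    ref = np.zeros((360, 640), dtype=np.uint8)
    block = fetch_scaled_block(ref, mv_x=100, mv_y=50, block=16, scale=2)
    assert block.shape == (16, 16)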
I'm fairly sure that Ben Orenstein and a friend are forming a company to commercialise this as a Screenhero replacement. Discussed on this podcast: http://artofproductpodcast.com/episode-39
Very interested to see what they cook up (and kinda envious I didn't have the idea / don't have the space in my life to have a crack at it myself---it sounds very interesting).
I've been taking the Financial Markets course by Robert Shiller, and when talking about inventions and new ideas he continually makes the point that "it's crazy to me that this didn't exist before". It's usually the sign of a really good invention when you have that thought. And that's the thought I'm having looking at this combination of codec and transport protocol: "Why hasn't this been done before?" == "This is awesome!"
That's easy to say in hindsight, but it's easy to come up with all sorts of crazy ideas, ask yourself "why hasn't this been done before?", and then find out that the answer is "it has, it turned out to be a terrible idea, and that's why I've never heard of it".
A bigger frustration I experience is that some streaming seems to just "give up"; stalling and never resuming. I know the connection and server are okay because I can usually force it to resume manually, e.g. doing a page refresh, so is it just bad server architecture or a codec problem?
This happens to me with Fox Matchpass events (gotta see Champions League matches while at work). Most of the time, it will get enough of a buffer to start again, but there have been times when I manually have to stop/start the player to get it to go again. I've never had to do a hard page refresh though.
Another funny thing about Fox Matchpass streams is that when watching it live, somewhere around the 85th minute, the stream magically jumps back to the very beginning of the broadcast well before kick-off. I have to manually click the 'Live' button to get back to it. This one is consistent, and odd. It's almost like some test code got left in, and nobody has noticed/reported/etc.
Co-author here. Totally reasonable reaction, and we've heard this when the paper was posted elsewhere (e.g. on Reddit), but have not heard it from specialists, and honestly we suspect it's probably a red herring. Salsify's gains on the "delay" metric are mostly coming from two things: (1) the way that it restrains its compressed video to avoid building up in-network queues (which audio must also transit) and provoking packet loss, and (2) the way that it recovers more quickly from network glitches (check out the video).
If you wanted to add audio to Salsify, you would want to control a receiver-side video and audio buffer to reduce audio gaps and keep a/v in sync during periods of happy network, but this is unlikely to affect the system's ability to recover more quickly from glitches or to avoid building up in-network queues that delay audio and video alike. If you watch the video (or see Figure 6(f), Figure 7, and Figure 8), I don't think there's much reason to think audio can justify what the Chrome/webrtc.org codebase is doing -- WebRTC's frame delays are distributed over a broad range (so it's not like they're synchronized to some fixed timebase either) and are very high, especially in the seconds after a network glitch.
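(For anyone wondering what "control a receiver-side buffer" means concretely, a bare-bones fixed-delay de-jitter buffer is roughly the sketch below; real implementations adapt the delay over time and keep video playout slaved to audio.)

    import heapq

    class JitterBuffer:
        """Hold incoming frames until they are playout_delay_ms old."""

        def __init__(self, playout_delay_ms):
            self.playout_delay_ms = playout_delay_ms
            self.heap = []  # (capture_timestamp_ms, payload), ordered by capture time

        def push(self, capture_ts_ms, payload):
            heapq.heappush(self.heap, (capture_ts_ms, payload))

        def pop_ready(self, now_ms):
            # Release everything whose scheduled playout time has arrived;
            # if the buffer runs dry, the listener hears a gap.
            ready = []
            while self.heap and self.heap[0][0] + self.playout_delay_ms <= now_ms:
                ready.append(heapq.heappop(self.heap)[1])
            return ready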
More to the point for our academic work, it would have been trivial to add shitty audio that made no difference to the metrics. The hard-but-necessary part is in designing an evaluation metric to assess (1) the quality of the reconstructed audio (including how many gaps/rebuffering delays there were when the de-jitter buffer ran dry), (2) the delay of the reconstructed audio, keeping in mind this is not constant over time, and (3) the quality of the audio/video synchronization, which also will not be constant over time. Then measuring that in a fair way across Skype/Facetime/Hangouts/WebRTC/Salsify, and then trying to decide which compromise on those three axes is desirable. Somebody should do all that work at some point, but it's a major piece of work to bite off and pretty far from anything we've done so far.
Opus, with its low delay and solid rate control, would seem to be the natural pair here. But I agree audio is likely not the real problem in this space.
Any reason you didn’t choose to start from VP9? Is the encoder still too slow overall?
Seems highly unlikely; at a quick glance there's no overlap.
From the FAQ:
> Why the name “Salsify”?
> It's not a very interesting reason. Salsify comes from an older project called “ALFalfa,” for the use of Application Layer Framing in video. Alfalfa gave way to Sprout, a congestion-control scheme intended for real-time applications, and now Salsify, a new design where congestion-control (in the transport protocol) and rate control (in the video codec) are jointly controlled. Alfalfa, Sprout, and Salsify are all plantish foods.
The company you linked seems to be meant as "sales-ify".
Salsify is led by Sadjad Fouladi, a doctoral student in computer science at Stanford University, along with fellow Stanford students John Emmons, Emre Orbay, and Riad S. Wahby, as well as Catherine Wu, a junior at Saratoga High School in Saratoga, California. The project is advised by Keith Winstein, an assistant professor of computer science.
Salsify was funded by the National Science Foundation and the Defense Advanced Research Projects Agency (DARPA). Salsify has also received support from Google, Huawei, VMware, Dropbox, Facebook, and the Stanford Platform Lab.
Financially supported by the government and tech juggernauts, and executed by top-tier doctoral students, a high school student, and a top-tier university professor.
Assuming this could be game-changing innovation to further advance worldwide communication, it's refreshing to see the positive externalities of a combination of capitalistic (F500 tech co's) and socialistic (university, government) systems executed by a seemingly diverse set of actors.
How is university and government funding socialistic? Neither involves the workers' ownership of the means of production (and indeed, they exist in a state that is anything but socialist).
I think Richard & team have planned to develop & test the entire ecosystem with the 8 companies first then MIT license it or something. Also, Jian Yang...
(Disclaimer: This comment is my personal opinion, not that of my employer.)
Really exciting work.
Encoding multiple versions of a video and picking a smaller one in response to congestion already happens for video-on-demand (think YouTube and Netflix videos) in DASH. That said, with VOD you can encode the video slower than real-time.
I can't imagine this ever making it into Skype/FaceTime/Hangouts/Duo. The big corps will probably continue to focus on "more internet" (fiber optic, zero rating, wi-fi hotspots, and internet traffic management practices).
DASH uses big chunks of operator-selectable size, though, as it is codec-agnostic. I wonder if coupling transport and codec could have benefits for the massively scalable VOD case (i.e. pre-render a bunch of stuff up front and run a "lite" codec that is coupled to the transport layer and aware of network conditions).
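For comparison, today's per-chunk DASH selection on the client is roughly the following (the bitrate ladder and safety factor are made-up numbers for illustration):

    # The client measures recent throughput and picks the highest
    # pre-encoded rendition that fits under it.

    BITRATE_LADDER_KBPS = [235, 750, 1750, 3000, 5800]

    def pick_rendition(recent_throughput_kbps, safety_factor=0.8):
        budget = recent_throughput_kbps * safety_factor
        candidates = [b for b in BITRATE_LADDER_KBPS if b <= budget]
        return max(candidates) if candidates else BITRATE_LADDER_KBPS[0]

    print(pick_rendition(2400))  # -> 1750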
Let me rephrase the question. Assuming that in the video the red background is the network capacity and the lines are the video bitrate used, Salsify seems to be using ~6000 kbps and WebRTC ~2500 kbps. Is this higher bitrate because you're using VP8, or is it a limitation of the protocol? If it's because of VP8, how hard would it be to adapt to modern codecs like HEVC and AV1?
I don't see anything in what they did that won't work with HEVC or AV1. They just hacked VP8 to be able to save and restore codec state per frame so they can generate multiple versions of the next frame, choosing the smaller one when network conditions are bad. Their innovation is in preventing congestion rather than reacting to the aftermath of congestion.
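A back-of-the-envelope sketch of that "prevent congestion" part, where the transport's estimate sets a byte budget for the next frame (the formula and numbers are illustrative, not the paper's actual control law):

    def next_frame_budget(est_throughput_bps, frame_interval_s, bytes_in_flight,
                          target_queue_bytes=0):
        # Bytes the network can drain during one frame interval...
        drain = est_throughput_bps / 8 * frame_interval_s
        # ...minus whatever is already queued beyond what we're willing to tolerate.
        budget = drain - max(0, bytes_in_flight - target_queue_bytes)
        return max(0, int(budget))

    print(next_frame_budget(est_throughput_bps=2_000_000,
                            frame_interval_s=1 / 30,
                            bytes_in_flight=4000))
    # -> roughly 4333 bytes for the next frame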
Right, they may have used VP8 because VP9 and HEVC are more CPU intensive; with VP8 they could encode more frames per second than the 60 FPS input, which matters when every frame is encoded twice.
Some is probably due to the codec (VP8 vs. H.264), but they are not wildly different in efficiency. WebRTC doesn't use VP9 or H.265 yet. I view the higher bandwidth used as an attempt to maximize quality at every bandwidth; they just didn't set a ceiling on the encoded frame size. It pushes the total quality number higher on the chart they've published.
As a practical matter it isn't very, well, practical to have to change every codec to support this. Maybe in the next standards, but that's going to be hard.
It's 2018 and I still have many dropped calls and other weird stuff when I talk with people on my mobile. FaceTime Audio is often a good alternative but still not perfect. So, I really hope the audio version of this will be commercialized soon.
Unfortunately this would only apply to one-on-one low latency video chats. For streaming to an audience, which generally uses a distribution network between the user and the video source to help handle load and geographical distribution, the CDN itself has no influence on video encoding. The CDN would need to jump in and do this back-and-forth negotiation and delivery of lower quality frames, which it is not currently suited for. I'd love to see it come about, but it's not just the codecs we need to look at for adoption beyond point-to-point video calls.
The other major limitation is that forking the encoder state significantly inflates the number of reference buffers you need to keep, which greatly increases memory requirements. That's not much of an issue for software, but it can be a significant problem for hardware (a lot of real-time interactive encoding is still done purely in software, however).
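A quick sizing exercise for that point, using raw YUV 4:2:0 reference frames and made-up counts (the number of references and forks varies by codec and implementation):

    def ref_buffer_bytes(width, height, num_refs, num_forks):
        bytes_per_frame = width * height * 3 // 2  # 1.5 bytes per pixel, 4:2:0
        return bytes_per_frame * num_refs * num_forks

    # e.g. 1080p, 8 reference frames, state forked for 2 candidate encodes:
    print(ref_buffer_bytes(1920, 1080, num_refs=8, num_forks=2) / 2**20, "MiB")
    # ~47.5 MiB of reference memory, vs ~23.7 MiB without forking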
Barely related to this, but looking at the results (section 5.2) I'm amazed at how much worse T-Mobile is for latency. AT&T and Verizon both give about 2 s of delay for Hangouts, while T-Mobile gives 7 s of delay.
The reason T-Mobile looks so bad is because the T-Mobile trace was from a 3G network with very poor conditions, while the others (AT&T and Verizon) were from LTE networks under relatively good conditions. You shouldn't compare the quality of the carriers from our results.
Almost any telecom's network service will vary depending on the specific location. Instead of "X has better latency than Y," the correct conclusion is that X has a better network _in this particular location_ than Y.
Kudos for making things accessible. However, joint source-channel coding is not news, especially at the level of scalable video coding (probably 20-year-old research by this point). In academia this isn't as exciting as it sounds to industry.
In a completely different way (i.e. DASH). SVC came out with bad timing, as the move towards HTTP video was gaining momentum. Also, around 2008(?) or so when the spec was finalized, there was no HW encoding support, so it was pretty unusable in practice.
The idea of using layers, though, is much older (I remember reading papers about this already back in 2001 or so)
Shows you the difference in opportunities for a smart kid in Bumblefuck vs. a smart kid living in a $5 million house in Silicon Valley and having transportation to Stanford.
I always laugh (while secretly crying) about the stories from high school people write on here. Seems like everybody is going to top-tier _tech high schools_, meanwhile the most computer science I got was being taught Turing by a gym teacher.
She's an intern there. I don't see her as a git contributor, but she is knowledgeable in physics and math, so I'm assuming she helped out with the paper.