Salsify – A New Architecture for Real-time Internet Video (stanford.edu)
568 points by jremmons on May 1, 2018 | hide | past | favorite | 73 comments


To me, the key innovation here is the tight integration between network conditions and codec frame size. Standard codecs are created with specific bandwidth requirements and produce encoded frames that 'average' around that size. You could re-initialize a codec at a lower bandwidth on the fly, but you would have to send an I-frame (a large, full frame) to kick off the new series of frames (since most video frames are just updates of a previous frame). Having a codec accept a bandwidth target per frame is a really good idea.


Codecs used by real-time video systems are able to adjust the bitrate on the fly. There's no keyframe request every time that happens unless the resolution changes. How quickly they adjust varies; software implementations generally do it for the next encoded frame. The frame will still be somewhat larger or smaller than the target size, since codecs can't accurately predict the encoded size for given quality parameters.

The Salsify implementation in the paper has a slightly more accurate way of producing a single frame: it encodes the frame twice with different quality targets and sends the larger of the two that fits under the frame-size target.
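A minimal toy sketch of that pick-the-fitting-version idea (the `encode` function and its size model are made up for illustration; the real Salsify work is done inside a modified VP8 encoder):

```python
# Illustrative sketch, not the real Salsify API: encode the next frame at two
# quality settings, then send the higher-quality version only if its *actual*
# encoded size fits the network's per-frame byte budget.

def encode(frame, quality):
    # Stand-in for a real encoder: pretend higher quality -> more bytes.
    # Real codecs can only estimate output size up front, which is exactly
    # why Salsify encodes both candidates and measures the actual results.
    return {"quality": quality, "size": len(frame) * quality // 10}

def pick_version(frame, target_bytes, q_low, q_high):
    low = encode(frame, q_low)
    high = encode(frame, q_high)
    # Prefer the higher-quality version if it fits under the budget.
    if high["size"] <= target_bytes:
        return high
    return low

frame = bytes(1000)  # dummy raw frame
chosen = pick_version(frame, target_bytes=600, q_low=3, q_high=8)
print(chosen["quality"])  # -> 3 (the high-quality version was too big)
```

The key point the sketch captures is that the decision uses measured sizes, not rate-control predictions.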


For a resolution change, couldn't you just scale the old last frame to the new resolution and use that as the basis for more P frames? (originally replied to the wrong comment)


Short answer: Yes.

Longer answer: The codec needs to support it. Codecs actually allow prediction from multiple reference frames, and maintain a buffer of them (2 to 16, depending on the codec, profile, and level). An individual frame may refer to several (potentially all) of those. So re-scaling up to 16 frames for every frame you decode will get quite expensive, not to mention the generational losses of doing this repeatedly for every resolution change. In practice what happens is you scale individual blocks when they get referenced by the current frame. But that has to be integrated into the motion compensation routines of the codec.

Both VP9 and AV1 support this, for example.
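A toy sketch of that on-demand block rescaling (nearest-neighbor sampling over a row-major list of rows; the helper is hypothetical, not any real codec's motion-compensation code):

```python
# Sketch of per-block reference rescaling on a resolution change: instead of
# resampling every stored reference frame up front, scale only the block a
# motion vector actually points at, when it is referenced.

def scale_block(ref, src_w, src_h, dst_w, dst_h, x, y, bw, bh):
    """Fetch a bw x bh block at (x, y) in the *new* resolution by sampling
    the old-resolution reference frame `ref` with nearest-neighbor."""
    block = []
    for j in range(y, y + bh):
        row = []
        for i in range(x, x + bw):
            # Map new-resolution coordinates back into the old reference.
            si = i * src_w // dst_w
            sj = j * src_h // dst_h
            row.append(ref[sj][si])
        block.append(row)
    return block

# A 2x2 reference serving a 4x4 frame: fetch the top-left 2x2 block on demand.
ref = [[1, 2],
       [3, 4]]
print(scale_block(ref, 2, 2, 4, 4, 0, 0, 2, 2))  # -> [[1, 1], [1, 1]]
```

This is the cheap path the comment describes: only referenced blocks pay the scaling cost, at the price of wiring the scaling into the motion-compensation loop.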


Modern codecs support rolling I-frames (intra refresh), spreading the cost of a keyframe over many frames to avoid a spike in bandwidth.
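A minimal sketch of the rolling-refresh scheduling idea (the column-per-frame layout is one common choice, simplified here):

```python
# Sketch of "rolling" intra refresh: instead of one big keyframe, mark a
# different column of blocks as intra in each frame, so every block gets
# refreshed once per cycle without a one-frame bandwidth spike.

def refresh_column(frame_index, num_columns):
    # Which column of blocks is encoded as intra in this frame.
    return frame_index % num_columns

def columns_refreshed(start_frame, num_frames, num_columns):
    # The set of columns that received an intra refresh over a window.
    return {refresh_column(start_frame + k, num_columns) for k in range(num_frames)}

# Over one full cycle of 8 frames, all 8 columns get an intra refresh.
print(sorted(columns_refreshed(0, 8, 8)))  # -> [0, 1, 2, 3, 4, 5, 6, 7]
```

The per-frame cost is roughly one keyframe's worth of intra data divided by the cycle length, which is the whole point.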


Aside from the fact that the tech is obviously cool, I think the FAQ section is really well-written. Props to the team.


I agree, refreshingly top-notch FAQs!


It reminds me of the Def Con and Coachella FAQs


I'm fairly sure that Ben Orenstein and a friend are forming a company to commercialise this as a Screenhero replacement. Discussed on this podcast: http://artofproductpodcast.com/episode-39

Very interested to see what they cook up (and kinda envious I didn't have the idea / don't have the space in my life to have a crack at it myself; it sounds very interesting).


I've been taking Financial Markets course by Robert Shiller and he continuously makes the point when talking about inventions and new ideas that "it's crazy to me that this didn't exist before". It's usually the sign of a really good invention when you have that thought. And that's the thought I'm having looking at this combining the codec and transport protocol together: "Why hasn't this been done before?" == "This is awesome!"


That's easy to say in hindsight, but it's easy to come up with all sorts of crazy ideas, ask yourself "why hasn't this been done before?", and then find out that the answer is "it has, it turned out to be a terrible idea, and that's why I've never heard of it".


A bigger frustration I experience is that some streaming seems to just "give up": stalling and never resuming. I know the connection and server are okay because I can usually force it to resume manually, e.g. by doing a page refresh. So is it just bad server architecture, or a codec problem?


This happens to me with Fox Matchpass events (gotta see Champions League matches while at work). Most of the time, it will get enough of a buffer to start again, but there have been times when I manually have to stop/start the player to get it to go again. I've never had to do a hard page refresh though.

Another funny thing about Fox Matchpass streams is that when watching it live, somewhere around the 85th minute, the stream magically jumps back to the very beginning of the broadcast well before kick-off. I have to manually click the 'Live' button to get back to it. This one is consistent, and odd. It's almost like some test code got left in, and nobody has noticed/reported/etc.


Youtube fixed this problem many years ago, but it recently became un-fixed. New codec?


Maybe. It happens on FF fairly frequently, but never (that I can recall) on Chrome (same distro, Fedora 27).


Um ... from the paper ...

"6.1 Limitations of Salsify

No audio. Salsify does not encode or transmit audio."

Claiming that you beat a bunch of codecs that have synchronized audio (even though they disable it) is kind of misleading ...


Co-author here. Totally reasonable reaction, and we've heard this when the paper was posted elsewhere (e.g. on Reddit), but have not heard it from specialists, and honestly we suspect it's probably a red herring. Salsify's gains on the "delay" metric are mostly coming from two things: (1) the way that it restrains its compressed video to avoid building up in-network queues (which audio must also transit) and provoking packet loss, and (2) the way that it recovers more quickly from network glitches (check out the video).

If you wanted to add audio to Salsify, you would want to control a receiver-side video and audio buffer to reduce audio gaps and keep a/v in sync during periods of happy network, but this is unlikely to affect the system's ability to recover more quickly from glitches or to avoid building up in-network queues that delay audio and video alike. If you watch the video (or see Figure 6(f), Figure 7, and Figure 8), I don't think there's much reason to think audio can justify what the Chrome/webrtc.org codebase is doing -- WebRTC's frame delays are distributed over a broad range (so it's not like they're synchronized to some fixed timebase either) and are very high, especially in the seconds after a network glitch.

More to the point for our academic work, it would have been trivial to add shitty audio that made no difference to the metrics. The hard-but-necessary part is in designing an evaluation metric to assess (1) the quality of the reconstructed audio (including how many gaps/rebuffering delays occurred when the de-jitter buffer went dry), (2) the delay of the reconstructed audio, keeping in mind this is not constant over time, and (3) the quality of the audio/video synchronization, which also will not be constant over time. Then measuring that in a fair way across Skype/Facetime/Hangouts/WebRTC/Salsify, and then trying to decide which compromise on those three axes is desirable. Somebody should do all that work at some point, but it's a major piece of work to bite off and pretty far from anything we've done so far.


Opus, with its low delay and solid rate controls would seem to be the natural pair here. But I agree audio is likely not the real problem in this space.

Any reason you didn’t choose to start from VP9? Is the encoder still too slow overall?


Is this at all related to the company Salsify? https://www.salsify.com


Seems highly unlikely, at a quick glance there's no overlap.

From the FAQ:

> Why the name “Salsify”?

> It's not a very interesting reason. Salsify comes from an older project called “ALFalfa,” for the use of Application Layer Framing in video. Alfalfa gave way to Sprout, a congestion-control scheme intended for real-time applications, and now Salsify, a new design where congestion-control (in the transport protocol) and rate control (in the video codec) are jointly controlled. Alfalfa, Sprout, and Salsify are all plantish foods.

The company you linked seems to be meant as "sales-ify".


Salsify the PXM company comes from salsify the root vegetable and is pronounced the same way. Our logo is a stylized salsify flower.


Slightly tangential...

Salsify is led by Sadjad Fouladi, a doctoral student in computer science at Stanford University, along with fellow Stanford students John Emmons, Emre Orbay, and Riad S. Wahby, as well as Catherine Wu, a junior at Saratoga High School in Saratoga, California. The project is advised by Keith Winstein, an assistant professor of computer science.

Salsify was funded by the National Science Foundation and the Defense Advanced Research Projects Agency (DARPA). Salsify has also received support from Google, Huawei, VMware, Dropbox, Facebook, and the Stanford Platform Lab.

Financially supported by the government, tech juggernauts, and executed by top tier doctoral students + a high school student + a top tier university professor.

Assuming this could be game-changing innovation to further advance worldwide communication, it's refreshing to see the positive externalities of a combination of capitalistic (F500 tech co's) and socialistic (university, government) systems executed by a seemingly diverse set of actors.


How is university and government funding socialistic? Neither involves the workers' ownership of the means of production (and indeed, they exist in a state that is anything but socialist).


Socialism is not Communism.


That's correct, but it does not invalidate my point. Socialism involves the workers' ownership of the means of production.


Socialism as in what conservative rhetoric considers it to be.


Even Richard Hendricks didn't combine the codec and the transport. Genius.


Richard Hendricks is an idiot who is currently attempting a private takeover of the Internet (presumably) on behalf of the NSA.


I don't get why they say pied piper is decentralized and a "new open internet" when pied piper is building it and is closed source


Because the majority of people watching the show either don't know or don't care.


I think Richard & team have planned to develop & test the entire ecosystem with the 8 companies first then MIT license it or something. Also, Jian Yang...


(Disclaimer: This comment is my personal opinion, not that of my employer.)

Really exciting work.

Encoding multiple versions of a video and picking a smaller one in response to congestion already happens for video-on-demand (think YouTube and Netflix videos) in DASH. That said, with VOD you can encode the video slower than real-time.

I can't imagine this ever making it into Skype/FaceTime/Hangouts/Duo. The big corps will probably continue to focus on "more internet" (fiber optic, zero rating, wi-fi hotspots, and internet traffic management practices).


DASH uses big chunks of operator-selectable size, though, as it is codec-agnostic. I wonder if coupling transport and codec could have benefits for the massively scalable VOD case (i.e. pre-render a bunch of stuff up front and run a "lite" codec that is coupled to the transport layer and aware of network conditions).


Some of the things you mention already do the multi-encode trick, and get a bad rap for CPU and battery load as a result.

The real trick is balancing all such product concerns in any next gen.


The cost here seems to be bandwidth, is this because you're using VP8? Could this be adapted for other codecs like AV1?


The cost is encoding frames that won't be sent, and a non-constant frame rate. Bandwidth is more optimally utilized.


Let me rephrase the question. Assuming that in the video the red background is the network capacity and the lines are the video bitrate used, Salsify seems to be using ~6000 kbps and WebRTC ~2500 kbps. Is this higher bitrate because you're using VP8, or is it a limitation of the protocol? If it's because of VP8, how hard would it be to adapt to modern codecs like HEVC and AV1?


I don't see anything in what they did that won't work with HEVC or AV1. They just hacked VP8 to be able to save and restore codec state per frame so they can generate multiple versions of the next frame, choosing smaller when network condition is bad. Their innovation is in preventing congestion rather than reacting to the aftermath of congestion.
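A toy sketch of that save/restore trick (hypothetical toy encoder object, not libvpx's actual API): snapshot the encoder state, attempt an encode, and roll back if the output doesn't fit, so the sender never commits to a frame it can't afford to send.

```python
import copy

class ToyEncoder:
    """Stand-in for a codec whose internal state can be exported/imported."""
    def __init__(self):
        self.state = {"frames_encoded": 0}

    def save_state(self):
        return copy.deepcopy(self.state)

    def restore_state(self, snapshot):
        self.state = copy.deepcopy(snapshot)

    def encode(self, frame, quality):
        # Encoding mutates internal state (reference buffers, entropy
        # context, ...); here just a counter, plus a fake bitstream.
        self.state["frames_encoded"] += 1
        return bytes(len(frame) * quality // 10)

def encode_within_budget(enc, frame, budget, qualities):
    snapshot = enc.save_state()
    for q in sorted(qualities, reverse=True):   # try best quality first
        out = enc.encode(frame, q)
        if len(out) <= budget:
            return out                          # keep the mutated state
        enc.restore_state(snapshot)             # roll back and retry
    return None                                 # skip the frame entirely

enc = ToyEncoder()
out = encode_within_budget(enc, bytes(100), budget=60, qualities=[4, 8])
print(len(out), enc.state["frames_encoded"])    # -> 40 1
```

The restore step is what ordinary encoder APIs don't expose, which is why the paper's FAQ asks future codec implementers to standardize exporting/importing internal state.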


Right, and they may have used VP8 because VP9 and HEVC are more CPU-intensive; with VP8 they could encode more frames per second than the 60 FPS input requires.


They probably used it because source code was most available.


Some is probably due to the codec (VP8 vs. H.264), but they are not wildly different in efficiency, and WebRTC doesn't use VP9 or H.265 yet. I view the higher bandwidth used as an attempt to maximize quality at every bandwidth: they just didn't set a ceiling on the encoded frame size. It pushes the total quality number higher on the chart they've published.


As a practical matter it isn't very, well, practical to have to change every codec to support this. Maybe in the next standards, but that's going to be hard.


It's 2018 and I still have many dropped calls and other weird stuff when I talk with people on my mobile. FaceTime Audio is often a good alternative but still not perfect. So, I really hope the audio version of this will be commercialized soon.


Unfortunately this would only apply to one-on-one low latency video chats. For streaming to an audience, which generally uses a distribution network between the user and the video source to help handle load and geographical distribution, the CDN itself has no influence on video encoding. The CDN would need to jump in and do this back-and-forth negotiation and delivery of lower quality frames, which it is not currently suited for. I'd love to see it come about, but it's not just the codecs we need to look at for adoption beyond point-to-point video calls.


The other major limitation is that forking the encoder state significantly inflates the number of reference buffers you need to keep, which greatly increases memory requirements. That's not much of an issue for software, but it can be a significant problem for hardware (a lot of real-time interactive encoding is still done purely in software, however).


There's a vegetable named "salsify", very yummy. https://duckduckgo.com/?q=salsify+vegetable&t=ffab&ia=recipe...


Barely related to this, but looking at the results (section 5.2) I'm amazed at how much worse T-Mobile is for latency. AT&T and Verizon both give about 2 s of delay for Hangouts, while T-Mobile gives 7 s of delay.


The reason T-Mobile looks so bad is because the T-Mobile trace was from a 3G network with very poor conditions, while the others (AT&T and Verizon) were from LTE networks under relatively good conditions. You shouldn't compare the quality of the carriers from our results.


Darn, I forgot UMTS was 3G. I think I was confusing it with HSPA+ (which I guess isn't really 4G either). Great work on the project.


Almost any telecom's network service will vary depending on the specific location. Instead of "X has better latency than Y," the correct conclusion is that X has a better network _in this particular location_ than Y.


T-Mobile aims for the Walmart positioning and Verizon the Neiman Marcus positioning.


> What would you say to tomorrow’s codec implementers?

> Standardize an interface to export and import the encoder’s and decoder’s internal state between frames!

Can't this be achieved using sandboxing/emulation/VM techniques?


Not very efficiently, which is kind of the point here.


Another recent discussion was https://news.ycombinator.com/item?id=16802079.


Kudos for making things accessible. However, joint source-channel coding is not news, especially at the level of scalable video coding (probably 20-year-old research by this point). In academia this isn't as exciting as it sounds to industry.


Do you know if scalable video coding actually ended up being implemented in industry (YouTube/Netflix/Hulu/Amazon)?


In a completely different way (i.e. DASH). SVC came out with bad timing, as the move towards HTTP video was gaining momentum. Also, around 2008(?) or so when the spec was finalized, there was no HW encoding support, so it was pretty unusable in practice.

The idea of using layers, though, is much older (I remember reading papers about this already back in 2001 or so)


You do realise this is academic research from top researchers and published at a top conference, right?


If you read the FAQ, it addresses how this is different from a "cross-layer" approach.


This is not new. If you google adaptive rate-adaptation techniques for video telephony based on network conditions, you will find many (even from the 1980s).


How does this compare to Pied Piper's algorithm?


"Is this a startup company?

No.

Are you sure? Your website looks like a startup company’s.

It's just the HTML template! They all look like this. [...]"

Brilliant


The website won't even render in the Materialist HN client for Android.


Like a true startup website.


It is a surprisingly nice looking website for an academic project.


Or a trial balloon for commercialization.


I saw this on Reddit before. Most were not impressed, since it doesn't support audio.


> Salsify is led by [...] Catherine Wu, a junior at Saratoga High School in Saratoga, California.

Oh


Shows you the difference in opportunities for a smart kid in Bumblefuck vs. a smart kid living in a $5 million house in Silicon Valley and having transportation to Stanford.


I always laugh (while secretly crying) about the stories from high school people write on here. Seems like everybody is going to top-tier _tech high schools_, meanwhile the most computer science I got was being taught Turing by a gym teacher.


She's not the only person leading the group, but it's awesome that she's one of them!


Yes, of course--the affiliation just jumped out at me.


She's an intern there. I don't see her as a Git contributor, but she is knowledgeable in physics and math, so I'm assuming she helped out with the paper.



