Hacker News | jonathanmayer's comments

Context: I teach at Princeton and study social media and recommendation systems.

From a very quick skim of the repositories, this appears to be quite limited transparency. The documentation gives a decent high-level overview of how Tweet recommendation works—no surprises—and the code tracks that roadmap. Those are meaningful positive steps. But the underlying policies and models are almost entirely missing (there are a couple valuable components in [1]). Without those, we can't evaluate the behavior and possible effects of "the algorithm."

[1] https://github.com/twitter/the-algorithm-ml


I work on Google Assistant Suggestions and I don't think it's very practical to open-source an algorithm like that including the models and the underlying data. Both of them can live in separate services and be frequently updated.

I am assuming that open sourcing the code aims to increase transparency about the business logic of the ranking decisions. At the same time you don't want spammers to be able to easily run experiments against a cloned version of your system.


> But the underlying policies and models are almost entirely missing (there are a couple valuable components in [1]). Without those, we can't evaluate the behavior and possible effects of "the algorithm."

Haven't gone through it yet, but if that's the case, all this is is a glorified framework to plug your own models into. Not exactly what was promised.


Did you also skim the accompanying (or rather, main) repo, https://github.com/twitter/the-algorithm ?

From a quick clone and line-count, it has:

  235 kLOC .scala
  136 kLOC .java
   22 kLOC .py
    7 kLOC .rs
So I don't think you did, since you posted so quickly and that's a LOT of code.

I also haven't skimmed this code except very superficially, but perhaps you should since you're out there making statements with your Princeton credentials.

(I posted this comment with the heads-up a few minutes after your comment above and then expanded it as you didn't respond.)
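For anyone who wants to check those numbers themselves, here's a minimal sketch (the extension list and raw line-counting approach are my assumptions; a tool like cloc would be more thorough) to run against a local clone:

```python
from pathlib import Path
from collections import Counter

def loc_by_extension(root: str) -> Counter:
    """Count lines per file extension under root (divide by 1000 for kLOC)."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in {".scala", ".java", ".py", ".rs"}:
            with path.open(errors="ignore") as f:
                counts[path.suffix] += sum(1 for _ in f)
    return counts
```

Point it at a checkout of twitter/the-algorithm and compare against the kLOC figures above; exact totals will vary with how comments and blank lines are treated.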


I think you misunderstood. He's saying the training models are not there.


For example, MostRecentCombinedUserSnapshotSource seems to be influential (such as for calculating "tweepcred"), but we can't see how it's calculated.


Wouldn’t that make them easy prey for “spam SEO”? However, given the framework, isn’t it still possible to guess the models?


The spam SEO issue should be dealt with and thought through _before_ engaging in the whole adventure, and having to guess how it could work if decently implemented rather defeats the "open source" spirit of it.

More credit would be given if the very idea of open sourcing the algorithm hadn't already been discussed to death, with predictions of the difficult points and of how it probably wouldn't happen in any sane way.


And then be pilloried for not doing it, or not doing it fast enough. Damned if you do, damned if you don’t.

I’m starting to think the problem with Elon is mostly personal; he’s just a proxy, and considered wrong by default.

(Not that I approve of his behavior, but I can’t enjoy this whole mobbing that he’s getting; not that he cares, so I’m not worried he’s getting traumatized in any way. It’s just that how it’s become an identitarian trait for a certain group irks me.)


Makes me wonder if a way to override people SEO hacking the algorithm is to create a market of open-source algorithms that each individual can choose and then it's not trying to hack THE algorithm but having to hack many and not knowing which algorithm an individual is using.


You don't have to target every 'algorithm' all at once. You can target them one at a time. Hell, you can run A/B tests to single out the easiest targets.


Yes, but right now 100% of users are on the one algorithm (or chronological). If one doesn't know what percentage of people are using which algorithm, it becomes harder to know which ones to try to hack for the biggest result.
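A toy sketch of that point (the algorithm names and share numbers below are made up): with a single site-wide algorithm, one exploit reaches everyone; in a market of algorithms, the expected reach of cracking any one is only its share of users, which the attacker may not even know.

```python
def expected_reach(user_shares: dict[str, float], attacked: set[str]) -> float:
    """Fraction of users an attacker reaches by cracking the named algorithms."""
    return sum(share for name, share in user_shares.items() if name in attacked)

# Hypothetical market: 40% chronological, two ML rankers splitting the rest.
shares = {"chronological": 0.40, "ml_a": 0.35, "ml_b": 0.25}
```

Cracking only `ml_a` here reaches 35% of users, versus 100% today; and if the shares are hidden, the attacker can't even compute this expected value.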



Those look older to me. They all have last-updated dates from October and November 2022.


FB open source algo looks much better, right? /s


Is it valid to focus on tracking a Dem/Rep split when that split is exclusionary by design for many Americans? Or is it not exclusionary, in your view? I'm curious about the social science perspective.

Ignoring the global nature of Twitter for a moment.


So why did they opensource it?


So they could pretend to be open. It's the "Open"AI model. Open-washing?


This is a very cynical take. They should be commended for publishing recommendation code at all, which no other major social network does.


Well if they say “we will open source the algorithm” and then what they really open source is a little bit of slightly relevant code that doesn’t allow us to understand the algorithm, then what we can deduce is that they are trying to weasel out of public commitments.

I can’t say for sure if that happened, but if they made a clear promise and then did something else, it’s perfectly reasonable to call that out.


Devil's advocate though: imagine you were to open source (probably with quite a short deadline) some 'algorithm' used in whatever you work on, but the rest should stay private; how would you go about that?

I don't think it's easy: there's inherently some interface (or several!) where it's a hand-wavey 'get the thing from the private bit', and defining that sensibly is hard. Trying to do it well will probably lead to a lot of meetings, scope creep, etc. And it's not easy anyway, since it's highly technical and implementation-specific yet also a management/policy decision to make.


It depends on what your goal in open sourcing is. Are you looking to provide a base for others to build software on, and to provide a way for others to contribute back to your code? Then publishing the code makes sense.

Are you looking to build public trust in you and your organization? Then dumping a bunch of code with no context isn't going to help much, as it's not code but behavior that builds or destroys trust.

Are you looking to lean into a polarized partisan environment, pushing a narrative where it's you and your supporters against an unfair group of "others"? Then a big splashy move high on symbolism and low on substance that will inspire lots of high-profile, divisive media coverage is a great way to go.


If you were doing it in good faith, you wouldn't need to publish the actual code. Most likely you should publish an article and a flowchart explaining how the algorithm works. Publishing a partial chunk of code just creates a story that supporters who don't understand can parrot that "they opened their algorithm".


Exactly. Publishing what they have is the worst of both worlds. Hopefully people will create flowcharts based on it, although it sounds like even those will have a low level of accuracy.


I still hear reverse-FUD about nvidia supposedly fully open-sourcing their Linux driver, when in reality they opened a tiny kernel portion of it that allows the main proprietary blob to connect to necessary kernel interfaces. You have to call out this bullshit when you see it.


Wait, what? AFAIU what you say is true, except for the part where the “main proprietary blob” does not run on the CPU. This isn’t as glorious as an actual open-source driver would be, but it does have meaningful advantages—e.g. you now have a ghost of a chance of implementing Nvidia GPU support on a non-Linux kernel, by uploading the GPU-side blob and rewriting the CPU-side shim as required. Or is the blob license-restricted from being used like that?


The "main proprietary blob" they're talking about is the userspace portion of the driver; the portion which does all of the heavy lifting. That definitely runs on your CPU. The only part they open-sourced is the kernel portion of the driver, which just exists to facilitate communication between the userspace driver and the hardware.


Hey, we can get even more cynical. Why should we trust that this code is even similar to what they run in production currently?


I can't imagine deliberately special casing Elon's account in something they made from scratch to fool people.


Let's have reasonable goals, shall we? "Their shit doesn't stink as bad as others'" is nothing commendable, especially after so much publicity.


I say "why not both". Even if they are doing it only for good PR, we encourage it by giving them praise, because we should encourage things we want. (While remembering that they are not our friend; they are an entity we should pressure, and the way we pressure is by giving praise when they do things we like, and criticism when they do not.)


I’d give them more credit if they’d been honest and kept it secret, rather than lying to my face and pretending they didn’t.


They should be commended for open sourcing something they don't understand because they fired all of the people who understood it? Elon admitted as much.


[flagged]


Because the way he acts gives people every right to. I agree that he may be misrepresented, but if he is, then he has to shoulder at least some of the blame.


The question is: are they right?


Any time a billionaire buys a media company it's bad for the health of democracy.


Not necessarily. What if the media company was bad for the health of democracy, and the billionaire's incompetence destroys the company's social standing and thus its ability to do more damage (even in the billionaire's own interests)?


Yeah, I have to wonder how many people, if they had the money, would want to buy out Twitter just to wipe it out. Doesn't a huge chunk of HN hate Twitter and wish it were dead?

(Regardless I think that would be useless in the long run, since the millions of stranded users will still want another Twitter-like platform. And Twitter imploding without a designated archive will wipe out a tremendous amount of digital history.)

A lot of his decisions look pretty incompetent on the surface; how could he not see that charging for verification devalues the system for whoever has the money?

Instead it could just be an intentional ploy to completely devalue Twitter disguised as incompetence. He can justify firing employees and charging for API access/verification as money-saving strategies, even if they're terrible strategies that have little chance of succeeding. And he could make enough people believe he's an idiot who makes things up as he goes rather than someone specifically driven or apathetic enough to run Twitter into the ground. Not to mention he was forced to buy them after changing his mind. Almost feels like a "so that's what happens" response.

I wonder how higher powers would be able to distinguish fake incompetence from real incompetence. Would they care how Twitter as a private company ends up if it's the case that it implodes from its own legitimately bad business decisions? It reminds me of how employers won't directly fire employees for discriminatory reasons, instead they make the employees' lives miserable so they're compelled to leave on their own, thus they escape scrutiny.


This is basically at the level of "9/11 was an inside job to bring down WTC 1, but WTC 2 was destroyed in an unrelated but simultaneous terrorist attack."


> Yeah, have to wonder how many people, if they had the money, would want to buy out Twitter just to wipe it out. Doesn't a huge chunk of HN hate Twitter and wish it were dead?

> (Regardless I think that would be useless in the long run, since the millions of stranded users will still want another Twitter-like platform.

If there's not an obvious successor right when it shuts down, a lot of those people might get their habit broken and find something better to do. I know Mastodon was held up as a successor, but it's unclear to me whether it's actually capable of scaling to that level.


Mastodon is way too flawed to be anything but a niche tool for tech people and activists. I highly highly doubt such a system can cross the chasm. That doesn’t mean that’s a bad thing though.


Or, he’s as incompetent as he looks.


Can you name one relevant media company owned by someone from the working class?


If you personally own a media company, you are by definition bourgeoisie. But see:

https://en.wikipedia.org/wiki/Media_cooperative#List_of_medi...


Which is why HN was so incensed about Bezos buying the Washington Post.


And when a highly scrutinised, highly visible billionaire buys it off a different bunch of billionaires which you know little about?


i wasn't referring to him buying twitter, i was referring to him saying he was going to open source the recommendation engine and then doing it.

i agree billionaires owning media companies is a huge problem


Do you believe billionaires can do good? Is their existence an existential threat to democracy?


Yes. There are plenty of philanthropic billionaires. Yes. That much money buys a destabilizing amount of influence.


Billionaires are billionaires not by literally storing cash. The rest of society values their contributions and creations in the companies/corporations they run. Sure, they have some liquidity, but the entire concept of resentment towards billionaires is essentially resentment of the betterment of the world. There are some exceptions, but for the most part, in a well-oiled market, you can't just become a billionaire by fucking over people. See Adani and how it turned out for him: https://www.ft.com/content/5c0b6174-e66d-4fa5-89a5-6da1d69ab...


Every major media company is owned by a billionaire


[flagged]


It's because there was close to zero newsworthy information in them, just nonsense being disseminated by wannabe-journalists.


I encourage you to watch the C-SPAN recordings of the senate sessions where they brought in Twitter employees and journalists to cover what was in the Twitter files.

From your comment it sounds like you’ve been consuming the 30s sound bites from those hearings and the misinformation spreading around the internet.

A long list of 3 letter agencies were compiling lists of citizens and journalists and sending them to social media companies to review for ToS violations.

There is a very real threat to civil rights here. When this cannon swings around and points back at LGBTQ, racial equality, stopping the war on drugs, etc. this is going to be “not pretty.”

And the hearings covering them were unbelievably shameful. Senators talked past the guests in the room, refused to abandon their “sick burn” scripts regardless of where the conversation went, insulted their guests, and went off in random directions of questioning that had little to do with the root problem…

At the core of this, 3 letter agencies (seemingly across the board) have decided that it’s acceptable to ask social media companies to prevent citizens from communicating on their platforms by selectively directing the attention of their moderation teams towards individuals. Whether this is legal, or a violation of 1st amendment rights, is for sure an open question.

Only one senator directly addressed that and only briefly by saying “maybe they’re trying their best” - a statement that doesn’t exempt anyone involved from following the law.

Is the government allowed to censor citizens by weaponizing their ToS for selective enforcement and, if the government can do that, where is the line drawn? How specific are they required to be? Can a platform ban all political speech and then only selectively enforce requests from the government without doing their own moderation? How far can we launder the 1st amendment through a public-private collaboration of enforcing ToS?

Honestly I’m not sure what the hearings were really meant for, the government is unlikely to hold itself accountable. At this point I do believe the ball is in the citizen’s court to bring suit against the agencies named in the Twitter files like we did with the presidential surveillance program.


The government requesting that the ToS of a private company be upheld seems rather mild to me. Did we get the reasons for the requests in the released files? Were they trying to reduce foreign propaganda, public health misinformation, or something else important?


You like your government trying to tell a private company what's true and untrue?


More than I like a private company telling me what's true and untrue.


You clearly are oblivious as to what they contain


[flagged]


Please don't break the HN guidelines like this. It's not what this site is for, and destroys what it is for.

What's worse, if you have a true point, then posting like this actually discredits the truth and gives people a reason to reject it. That isn't in your interest and in fact hurts everyone.

https://news.ycombinator.com/newsguidelines.html

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...


"Way too negatively"? We're talking about one of the world's most influential people who uses their power to randomly accuse innocent normal people of being pedophiles. There is no portrayal too negative.


This is like FB open sourcing the compiled frontend code you can see yourself using inspect.

If we commend them for this we're helping promote and encourage this faux open source virtue signaling


No, that's very different.


There is clearly a lot of information to share. It's worth considering this could be step 1 of n as opposed to assuming the worst possible intention.


It's healthy to have a normal amount of cynicism. They released it for a reason. "The goal of our open source endeavor is to provide full transparency to you, our users, about how our systems work."

Why be transparent (or try to appear transparent)? To convince people to trust your platform (or to recruit - which seems to be another goal of the post). Why would Twitter want or need to do this now? Well, there is a bit of context. This disclosure doesn't exist in a vacuum.


I love this take. Doomed if you do, doomed if you don't.


[flagged]


I agree, which is why I wonder what your motivation is to defend Twitter. You're posting about this for a reason. If I were a social media company, I'd probably have paid agitators to defend them.


If we're willing not to assume some borderline "it's what they want you to think" conspiracy play: obviously there was always going to be a lot of highly interested and qualified people taking a very close look at this, and at some point there was always going to be a very definitive conclusion about what they actually released.

If your play was "it's some source code, hence people will think we are open, and that should be really good for us", that would make you a very special kind of idiot in this space.


That was one of Elon’s core statements when he first talked about buying Twitter. If he had gotten it out sooner there would be an easier link between the two, but if you want more context just go read the old tweets and articles from the Twitter vs Elon days.


If we can't build anything with this, is it "source"?


"Does not include batteries"


You must be new to Musk's business practices.


It's no secret that Twitter, like any other social media platform, is driven by user engagement and ad revenues. The more time we spend on the platform, the more valuable it becomes for them. With this new open-source algorithm, they're essentially crowdsourcing improvements to their system to better serve us the content we crave.

This move could be seen as a strategic PR play to boost their public image amidst the growing concerns around algorithmic bias and lack of transparency. By inviting the community to collaborate and address these issues, they're not only shifting some of the responsibility onto the users but also deflecting potential criticism.


Because they let go many of the engineers working on it?


No one has mentioned this before. I don't know if it's really related, but AFAIK the European Union is considering requiring social media platforms to be more transparent about recommendations and the like. If you can already say "hey, we have a lot online already!" then maybe the laws will become less strict.


bc he has no devs anymore and thinks the community will fix it for free


PR and it was already leaked last week.


PR


> But the underlying policies and models are almost entirely missing... Without those, we can't evaluate the behavior and possible effects of "the algorithm

And neither can spammers find and test the cracks and edge cases that would allow them to break the system; that does sound reasonable to me. If those were public, there would be an arms race between Twitter engineers and the spammers and others wishing to game the system.


Then don’t pretend to release “the algorithm.”


They’re explaining how it works without giving the specifics. Much like the US military explains how the nuclear deterrent works without disclosing detailed plans and control codes.


It's an open algorithm, but it's not open data! (joking)


[flagged]


imagine thinking you need to read every file in a project to understand the architecture and which pieces are important for specific functionality you're looking to understand. Have you ever picked up a bugfix ticket for some code you didn't write?


It's fast to read stuff when you have the domain knowledge. The weights won't be a 5 kB Scala file: they'd probably be a big binary file, which is easy to search for on GitHub, or locally after cloning.

Otherwise, if they are provided, someone in the thread will surely point to them.


You missed this in your rush to display your newly acquired sarcasm101 skills:

  "Skim": To read quickly or cursorily, to glance over, or to omit details in order to get the gist of something.


Context: I studied at Oxford

Fair point, I missed that when I skimmed OPs comment


class project, 200 students, 1500 LoC each. Time for grading.

there are contexts in which this may be well practiced.


We should really all just bow in awe as we are clearly inferior.


Princeton has a Code Reading 101 that all postdocs/professors must take, however in exchange for the Secrets of Speed Reading you must acknowledge every message with where you learnt those skills.


[flagged]


The context is relevant for indicating that they’re familiar with the problem and have thought about these issues in depth. It’s also useful for not being accused of hiding their identity if someone thinks they have an unmentioned agenda. Argument from authority is bad when it’s of the form “I am an expert, therefore you shouldn’t question this claim”, not when it’s used to provide an identity to a previously-unknown name while also providing a cogent argument and supporting evidence.


What did you expect?


I don’t know if the parent’s expectations matter here. This is more about making sure others don’t misunderstand the meaning here.


Good point. I didn't see it like that. Thanks!


Can i audit your classs for free?


Context: I teach at Princeton and used to work at the FCC.

Several comments suggest systematically comparing FCC data to what ISP websites say about availability. My research group did this! Here's the paper:

https://dl.acm.org/doi/10.1145/3419394.3423652

And here's a followup project by investigative journalists at The Markup:

https://themarkup.org/still-loading/2022/10/19/dollars-to-me...


So in other words, all the telecoms are systematically lying to the FCC in order to steal money from taxpayers, but there is no actual enforcement or penalty for doing so, so they continue to do this with impunity.

(I remember when Clinton handed the telecoms a huge giveaway so they would provide fibre everywhere, and then they did fuck all and just laughed.)

In a just society, they would quickly be presented with an estimated bill for the largest amount that they could possibly have stolen - I'm sorry, let me repeat this word, "stolen" - from the US taxpayer, PLUS massive penalties, and then they would be required to prove how much they actually stole if they wanted to reduce the cost.

Plus complete discovery of all their records should be required, with a view toward criminal prosecution of their executives.

As it is, they have absolutely no reason not to cheat, lie and defraud if they think they can make money at it.


> And here's a followup project by investigative journalists at The Markup:

> https://themarkup.org/still-loading/2022/10/19/dollars-to-me...

That article is pretty bad. It doesn't once mention DSL or that technology's inherent limitations, which can result in widely variable speeds (IIRC, your bandwidth is determined by the length of the wire between your house and the central telephone office). Then it spends a lot of time talking about race, which likely creates a misleading impression that lends itself to outrage.


DSL bandwidth is determined by the length of the wire between the DSL modem and the DSLAM, which can be in the cabinet on the curb (and I believe it usually is).

Also it depends on which version of the DSL standard we're talking about. I personally started with ADSL2, which was 12/3, eventually upgraded to VDSL2, which did 150/10, and the latest standard is G.fast, which can give 1 Gbit/s aggregate uplink and downlink at 100 m.


It also depends on the condition of the wire. 50 year old paper insulated POTS wiring will struggle for just a few Mbps even if it's a short hop to the CO. Thank god the telcos were able to collect the USF surcharge to pay for the necessary upgrades.


good point. interesting whether there are any estimates "out there" about the age of POTS wiring.


The odds of having a cable under 100m to a DSLAM basically rounds to 0.

At a highly optimistic 1 mile you’re already down to 20 Mbps, and most people are significantly further than that. Remember it’s the physical length of the cable between the actual devices that matters, and that path isn’t straight.


Depends on location, and on the greed of the telco. I used to live within 100 m of a DSLAM; in cities with dense construction etc. it's achievable. Also, with G.fast it's 1000 Mbit/s aggregate at 100 m; for longer distances the numbers are roughly 600 Mbit at 200 m, 300 Mbit at 300 m, and 100 Mbit at 500 m. Those numbers are not too bad even for a suburb. For comparison, high-speed mmWave 5G needs to be deployed every 100-200 m in order to get proper speed/penetration.

Admittedly, even G.fast is not as good or scalable as DOCSIS or fiber, but it could have been deployed "back in the day" as a perfectly good solution, and even today it's not that bad for the majority of the population, if properly deployed.
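To make the quoted rate/distance figures concrete, here's a rough sketch that linearly interpolates between them (the interpolation is my own simplification, not a standard; real G.fast rates also depend on wire condition, crosstalk, and vectoring):

```python
from bisect import bisect_left

# Aggregate-rate figures quoted above: (loop length in metres, Mbit/s).
POINTS = [(100, 1000), (200, 600), (300, 300), (500, 100)]

def gfast_rate_estimate(distance_m: float) -> float:
    """Linearly interpolate an aggregate G.fast rate from loop length."""
    xs = [d for d, _ in POINTS]
    if distance_m <= xs[0]:
        return float(POINTS[0][1])   # clamp at the shortest quoted loop
    if distance_m >= xs[-1]:
        return float(POINTS[-1][1])  # clamp at the longest quoted loop
    i = bisect_left(xs, distance_m)
    (x0, y0), (x1, y1) = POINTS[i - 1], POINTS[i]
    return y0 + (y1 - y0) * (distance_m - x0) / (x1 - x0)
```

For example, a 250 m loop comes out at roughly 450 Mbit/s aggregate under this (optimistic) straight-line model.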

As an anecdote: about 20 years ago I saw privately deployed and managed DSL systems in a kibbutz. I wonder what they have now.


Again, you can’t directly compare distances between 5G and DSL, because the wire isn’t taking the shortest path through 3D space between the telco equipment and your modem.

Also, a single 600-700 MHz 5G tower can cover hundreds of square miles with 5G service at up to 250 megabits per second. 2.5-3.5 GHz can still hit several miles at up to 900 megabits per second, and 24-39 GHz towers can cover a mile radius at up to 3 Gbps. Real-world performance depends on many, many factors, but DSL performance can be similarly degraded from its theoretical maximum.


I compared with mmWave 5G because it requires the same density of deployment as DSLAMs for proper performance, if not higher.

> Real world performance depends on many many factors,

Like how many UEs are sharing the spectrum, which is usually a lot.

> but DSL performance can be similarly degraded from it’s theoretical maximum.

totally.


DSL can get that fast…?! Why, even today, is the best offering in many mountain communities in CA 5/1?

And why is 5/1 also the only AT&T offering in pockets of high-tech Irvine, CA?


Because AT&T doesn't feel like upgrading equipment; too much effort. In general they made a strategic decision to invest mainly in wireless. You can also throw into the mix phrases like "absence of government regulation" and "regional monopoly".


In multiple areas, AT&T (I believe) successfully petitioned the FCC to not count a bunch of low-end DSL and similar services, on the basis that they were "obsolete" and lowering averages.

To be clear, they were still actively selling those services, and in some cases, it was the only option, but they just didn't want them to count.


I'd like to read the paper, but not enough to pay $15 for it. Apparently ACM would charge $5 or $10 for this one article even if I had a membership.

Is this one of those situations I hear about where researchers would be super happy to provide copies of papers that ACM etc keep behind paywalls, if you just ask?


There's a helpful raven at the hub of science for that ; )

https://sci-hub.ru/https://doi.org/10.1145/3419394.3423652


They most likely would send you a copy, I have a 100% success rate with that. Though I should also mention that paper is present on scihub at this very moment.


That video presentation is heroic! What a riot to see the numbers sliced that way, and you are clearly cautious in the estimates. Well played.


> Here's the paper:

It's super-paywalled. Can you upload a copy of it somewhere?


I previously served as CTO of the FCC Enforcement Bureau. A couple thoughts on the regulatory dimensions of this report.

* This could be a Federal Trade Commission problem. T-Mobile, like all major ISPs, has made public representations about upholding net neutrality principles [1]. These voluntary commitments were part of the Trump-era FCC's rationale for repealing net neutrality rules. Breaching the commitments could constitute a deceptive business practice under Section 5 of the Federal Trade Commission Act.

* This could also be a Federal Communications Commission problem. When repealing the Obama-era net neutrality rules, the Trump-era FCC left in place a set of transparency requirements [2]. Making an inaccurate statement about network management practices can be actionable under that remaining component of the FCC's net neutrality rules.

I haven't seen a comment from T-Mobile, so to be clear, that's just based on the report.

[1] https://www.t-mobile.com/responsibility/consumer-info/polici...

[2] https://www.ecfr.gov/current/title-47/chapter-I/subchapter-A...


> Making an inaccurate statement about network management practices can be actionable under that remaining component of the FCC's net neutrality rules.

Who would be responsible for bringing about that action and, if they don't bring about action, what can regular people do about it?


Thank you. Is there a form where one could file a complaint with the FCC to inform them of this? I'm not sure that this would be widely reported.

I am also curious if the reports about content filtering being required to deactivate the feature are accurate, and if so, what the default status of that feature is on TMobile's network.


Hi, I previously served as CTO of the FCC's Enforcement Bureau, where I worked on then-Chairman Wheeler's Robocall Strike Force. I'd like to offer a few observations that might be of interest.

* T-Mobile, like the other carriers, is offering a numerator and not a denominator. These call filtering services are plainly valuable, but it's difficult to evaluate how effective they are based on current public evidence.

* It isn't a coincidence that the top robocall destinations include locations that are popular for retirement. These scams disproportionately target and take advantage of older customers.

* Call authentication (STIR/SHAKEN) is helping, and will continue to become more effective. The FCC did not push carriers to rapidly adopt call authentication during the last administration; Congress eventually stepped in with the TRACED Act, and the FCC has since made STIR/SHAKEN a top priority.


From anecdotal evidence (n=1), the call blocking feature on T-Mobile is about 70% effective. Unfortunately I don’t know of an API to pull my full phone and Scam Shield records, but I estimate I received about 2,000 calls over the past three months. About 90% of those were spam/scam calls. Of those, T-Mobile identified and blocked about 70%.

It is reassuring to see the STIR/SHAKEN "checkmark" on my iPhone call log indicating that the call has been authenticated. Unfortunately, as you say, it's not very effective yet.

I’ve noticed that there are carriers/voip gateway providers who are proactive on shutting down spam emanating from their networks and others who are not. Not affiliated but the list here seems to be accurate: https://scammerblaster.com/the-ultimate-method-of-scammer-pa...


Wait, what. You received 540 spam calls over a quarter? (2000 × 0.9 × 0.3)

Holy crap. That's six a day. I would have thrown my phone away.


Correct. I think my high water mark was about 30 calls in one day. I would receive a call while messing with another, so I would sometimes conference them together for hilarity to ensue.


Spam as in "want to use our website building agency?" or "want to buy this incredible new coin/NFT/pump dump stock?"


At some point my phone number was sold as part of a list of "old people". So my calls consisted of a mix of Medicare supplement plans and Medicare scams ("free" diabetic supplies, something about chronic pain, and my favorite, which was "a five year renewal" of my Medicare card; all they needed was all my PII, my doctor's name, and my Medicare card number!)

I also received a lot of other scam calls targeting older folks: namely callers impersonating Social Security Administration officials who scare you into sending thousands of $$$ to them so you avoid getting arrested. You're told that your SS benefits are suspended and you'll be charged with a crime because your SS# was associated with some vague crime at the "southern border of Texas"…

It's honestly sickening to see in real time how these lowlifes fleece innocent people, and it makes me furious. I do what I can to try and shut them down, but I'm sure it's just a drop in the bucket; they just pop back up with a different VoIP provider in a few days anyway.

They can be very persistent and they will track your “identity” for years. I had invented a persona back in 2015 and forgotten about it. Someone called several dozen times - very aggressively - asking for that persona. I had fun messing with him but it was scary having him pull up personal details from over 7 years ago even if it was totally fabricated.


One time I received 30-ish calls every day for a few days, and each one was from a different number with the same prefix. It bothered me that I couldn't block that prefix.


That’s about where I’m at. A slow day is two spam calls. A bad day is 10.


It seems ridiculous to me that I regularly receive calls that are clear indicators of illegal activity but that nobody is being held accountable.

Why is there no way to find the people who are making these calls and why are the phone companies not liable for allowing these calls to be made without accountability?


The phone companies standardized on a hopelessly insecure protocol in 1975, and have no financial incentive to fix it.

If the FCC mandated a $1/spam call fine for cell phone providers (automatically paid as an unbounded rebate to subscribers), I suspect they would fix it in under 12 months.

More reading on the protocol (Signaling System 7) is here:

https://en.m.wikipedia.org/wiki/Signalling_System_No._7

The fundamental issue is that it assumes 100% of global telephone exchanges are trustworthy.


> The phone companies standardized on a hopelessly insecure protocol in 1975...

I vaguely remember an interview with somebody involved in early ARPANET standardization efforts stating pretty definitively that the prevailing direction for network protocols was source based routing. Anybody who has ever had to write an email address parser has seen vestiges of this (multiple @, ! and : symbols). Supposedly a representative from the NSA helpfully "suggested" they abandon that line of thinking and just mimic the PSTN's approach of trusting the next hop to do the routing.

I wonder how accidental it is that SS7 was implemented in such a plainly insecure manner.


It's a lot of work, and honestly the telcos don't care. Even if and when you do find them, what can you do? They're calling from halfway around the world, so "impersonating a U.S. government employee" is not a law you can enforce on a citizen of another country.


Why can't the telcos be held liable for routing these calls? If you get scammed and could sue the phone company, they'd very quickly find real solutions.


I suspect it's that pesky "rule of law" thing. We need to change the laws to make them liable.


Most of the scam/spam calls originate from overseas, while using American phone numbers:

"Five U.S. states, Costa Rica, Guatemala, India, Mexico and the Philippines are where most robocalls originate."

I imagine it's much more complicated to prosecute robocallers that live overseas, as you're now dealing with having to extradite people.


Around 99% of calls I receive, total, are spoofed to my local area and exchange. The discrimination tech is clearly not being used. Sprint/TMO.


A quick fix could be to require that the phone number match the country the call originates from.

It won't solve everything, but it might help a little.


Then every time I travel overseas I cannot use my phone or a US phone number? What about living close to Canada, Mexico, Caribbean…etc and you pick up international towers?

It’s an easier fix, but not really a solution.

The reality is that everyone wants fairness but no one really wants government regulation (Russia is a great example of this where your phone number is essentially treated like an assault rifle. Registered, monitored, and geo-tracked).


But that would be roaming, not a US number originating from a non-US phone line.


Prior to VoIP it was easier to trace the source of a call. With VoIP, the call could come from anywhere. Also, that VoIP service may have been resold several times and the end of that chain might look like a shady foreign entity with fictitious names. You kill one shady reseller and 3 more pop up.


Can I sue my carrier for breach of contract?

They are providing me phone service, but most callers are spoofed, so the phone can no longer be answered in the way a reasonable person would expect it to be useful.


T-Mobile has had recurring data security deficiencies. I know because I served as CTO of the FCC's Enforcement Bureau, before returning to academia.

In 2017, the FCC determined that T-Mobile had violated federal law in a data breach involving customer credit information [1]. There was reportedly no fine because Congress has imposed a strict one-year statute of limitations on FCC enforcement actions.

In 2020, the FCC charged T-Mobile with again violating federal law in failing to protect customer location information [2]. The FCC proposed a $91.6M fine, widely criticized as insufficient at the time [3-4]. I don't believe the FCC has finalized or collected that penalty.

There have been several other incidents, including in 2018 [5], 2019 [6], early 2020 [7], and late 2020 [8].

I hope there has not been a new data breach. But if there has been, this is the latest in a pattern, and the incentives have to change.

[1] https://www.nexttv.com/news/fcc-admonishes-t-mobile-breach-1...

[2] https://www.fcc.gov/document/fcc-proposes-916m-fine-against-...

[3] https://docs.fcc.gov/public/attachments/FCC-20-27A4.pdf

[4] https://docs.fcc.gov/public/attachments/FCC-20-27A5.pdf

[5] https://www.theverge.com/2018/8/24/17776836/tmobile-hack-dat...

[6] https://www.bleepingcomputer.com/news/security/t-mobile-disc...

[7] https://www.bleepingcomputer.com/news/security/t-mobile-data...

[8] https://www.bleepingcomputer.com/news/security/t-mobile-data...


Thank you for that context. It seems like breaches are happening every month now. What do you think needs to happen to ensure these gigantic companies secure data? I can imagine (a) new legislation enabling bigger, swifter fines or (b) anti-trust action. Do you think we should prioritize one over the other, do both, or something else?


I left TMo in 2018, when their 'forgot password' link sent me my actual password, via email.



A relevant quote from there "What if this doesn't happen because our security is amazingly good?"


oh boy...


I remember this happening in real time. People were losing their minds over it. I really hope that PR rep got fired; they have no business doing anything related to telecommunications.


This reads like a parody account.


Absolutely agree that the incentives have to change!

What does the FCC consider to be "reasonable measures to protect the confidentiality of its customers data"? Is there a document somewhere that outlines the best practices they expect you to follow?

I might be able to better convince my employer to prioritize security work if I had something like that to point to.


So the only fines that T-Mobile has paid are for the rural call completion issues, then?

Crazy that they can get away with regional and nationwide voice outages, SSNs and TINs repeatedly being leaked en masse, and the only fines they get are for rural call completion...

https://www.fcc.gov/document/settlement-t-mobile-rural-call-...


(Context: I teach computer security at Princeton and have a paper at this week's Usenix Security Symposium describing and analyzing a protocol that is similar to Apple's: https://www.usenix.org/conference/usenixsecurity21/presentat....)

The proposed attack on Apple's protocol doesn't work. The user's device adds randomness when generating an outer encryption key for the voucher. Even if an adversary obtains both the hash set and the blinding key, they're just in the same position as Apple—only able to decrypt if there's a hash match. The paper could do a better job explaining how the ECC blinding scheme works.
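For intuition, here's a toy Diffie-Hellman-style blinding sketch in Python. This is not Apple's actual construction (which uses elliptic curves and a more elaborate threshold scheme); the modulus, hash-to-group map, and image labels are all illustrative. The point it demonstrates is the one above: even a party holding the blinded hash set and the server secret only recovers h^s, which is useful for decryption only when there is a match.

```python
import hashlib
import secrets
from math import gcd

# Toy multiplicative group mod the Curve25519 prime; real schemes use EC groups.
P = 2**255 - 19

def hash_to_group(data: bytes) -> int:
    # Illustrative hash-to-group map; production schemes do this differently.
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % P

# Server setup: secret exponent s, plus the blinded hash set it ships to devices.
s = secrets.randbelow(P - 3) + 2
blinded_set = {pow(hash_to_group(h), s, P) for h in [b"bad-image-1", b"bad-image-2"]}

def client_blind(image_hash: bytes):
    # Client blinds its hash with a fresh random r invertible mod P-1.
    while True:
        r = secrets.randbelow(P - 3) + 2
        if gcd(r, P - 1) == 1:
            break
    return pow(hash_to_group(image_hash), r, P), r

def unblind(server_reply: int, r: int) -> int:
    # ((h^r)^s)^(r^-1 mod P-1) = h^s: a set member on match, a random-looking
    # group element otherwise.
    return pow(server_reply, pow(r, -1, P - 1), P)
```

A voucher key derived from h^s can therefore only be reconstructed for images whose hash is in the set; the randomness r keeps the blinded value itself from leaking anything about non-matching images.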


> only able to decrypt if there's a hash match

This is one of the concerns in the OP: have an AI generate millions of variations of a certain kind of image and check the hashes. In this case it boils down to how commonly neural hashes produce false positives.


Yes, this ^^^^^^

> The proposed attack on Apple's protocol doesn't work.

With all due respect, I think you may have misunderstood the proposed attack @jonathanmayer, as what @jobigoud said is correct.


There may be another attack.

Given some CP image, an attacker could perhaps morph it into an innocent-looking image while maintaining the hash. Then spread this image on the web and incriminate everybody.


Yes, perceptual hashes are not cryptographically secure, so you can probably generate collisions easily (i.e., a natural-looking image which has an attacker-specified hash).

Here is a proof of concept I just created on how to proceed : https://news.ycombinator.com/item?id=28105849
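For readers wondering why perceptual hashes are so much easier to collide than cryptographic ones: they are built to ignore small changes. A minimal average-hash (aHash) toy, a far simpler cousin of NeuralHash (the 8x8 gradient "image" is invented):

```python
import hashlib

def average_hash(pixels):
    # pixels: 8x8 grid of grayscale values (0-255).
    # Each bit is 1 where the pixel is brighter than the image mean.
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)

# A gradient "image" and a copy with one slightly edited pixel.
image = [[(r * 8 + c) * 4 for c in range(8)] for r in range(8)]
tweaked = [row[:] for row in image]
tweaked[0][0] += 5

phash_same = average_hash(image) == average_hash(tweaked)        # True
sha_same = (hashlib.sha256(repr(image).encode()).digest()
            == hashlib.sha256(repr(tweaked).encode()).digest())  # False
```

The same robustness that tolerates recompression and small edits is exactly what gives an attacker room to nudge pixels toward a target hash.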


Sounds like a fantastic way for law enforcement to get into your phone with probable cause. Randomly message you a benign picture with a matching hash from some rando account. Immediate capture for CP, data mine the phone, insert rootkit, 'so sorry about the time and money you lost - toodles'.


Don't warrants have to name why?

Like, a warrant for CP can't be used to collect evidence in another case, say tax fraud.


Warrants do have to name why, and where. However, anything they find along the way is fair game. If they open your trunk to find drugs and see a dead body, then the dead body is still admissible. (Assuming that opening the trunk for drugs is okay.)


It'd be interesting to see whether the way common images are reused (for example in memes, by only adding text) would be enough to change the hash. If it isn't enough, it could spread very quickly.

Of course I'd dare not research or tinker with it lest I'll be added to a list somewhere such is the chilling effect.

I guess in that case they'd delete that single hash from the database because they'd still have an endless (sadly) supply of other bad image hashes to use instead.


> Then spread this image on the web, and incriminate everybody.

You'd still have to generate several images and persuade people to download multiple of them into their photo roll. And as I understand it, there's yet another layer of Apple employees who review the photo metadata before it ever makes its way to law enforcement.


That does seem like an interesting protest vector, though. Generate a bunch of images that match CSAM images but are mundane. Then have everyone download them and send them to their cloud. Someone then needs to spend resources determining that the images are _not_ actual matches. Basically, a DDOS attack on the functionality.


Indeed, that thought occurred to me as well.

It's a risky bet, though: if somehow that intermediate layer fails and you find yourself locked up and accused of storing/disseminating CSAM material, it's not like the civil rights era when your friends and neighbors (and hopefully employers) will understand you've been arrested for a peaceful protest.


The smarter, if potentially less ethical solution is to encode such images and make memes with them. One of them going viral is likely to flag an enormous number of people along the way.


>several images and persuade people to download multiple of them into their photo roll.

I believe such images are called "Dank Memes" these days.


> with the only privacy guarantees being that the data is encrypted during transport, and a "promise" that they will run internal audits to make sure private data isn't released from their servers.

There's much more than that, including: privacy and security review before a study launches, a data minimization requirement, a sandboxed data analysis environment with strict access controls, and IRB oversight for academic studies.

> IMO this seems to provide worse privacy than even Google and Micro$oft's telemetry, which at least use differential privacy to make sure that each individual's privacy is somewhat protected (the data you send is randomised so even if the aggregator is compromised by a malicious third party (e.g. NSA) individuals have some degree of plausible deniability).

The vast majority of Google and Microsoft telemetry does not involve local differential privacy. Google, in fact, has almost entirely removed local differential privacy (RAPPOR) from Chrome telemetry [1].

We've been examining the feasibility of local differential privacy for Rally. The challenge for us—and why local differential privacy has limited deployment—is that the level of noise makes answering most (often all) research questions impossible.

[1] https://bugs.chromium.org/p/chromium/issues/detail?id=101690...
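For readers unfamiliar with local differential privacy, a minimal randomized-response sketch shows where the noise comes from (the honesty probability, cohort size, and 30% "true" rate are all made up for illustration):

```python
import random

random.seed(0)

def randomized_response(truth: bool, p_honest: float = 0.75) -> bool:
    # With probability p_honest answer honestly; otherwise flip a fair coin.
    if random.random() < p_honest:
        return truth
    return random.random() < 0.5

def debias(responses, p_honest: float = 0.75) -> float:
    # Invert the known noise to recover an unbiased population estimate.
    observed = sum(responses) / len(responses)
    return (observed - (1 - p_honest) * 0.5) / p_honest

# 100,000 simulated users, 30% of whom have the sensitive attribute.
reports = [randomized_response(i < 30_000) for i in range(100_000)]
estimate = debias(reports)  # close to 0.30
```

With 100,000 participants the debiased estimate lands near the truth, but at study-cohort sizes of a few hundred the same noise can dominate the signal, which is the deployment problem described above.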


Have you thought about using central/global differential privacy (which tends to have much less noise) on the "high level aggregates" or "aggregated datasets" that persist after the research study ends?

E.g. from the FAQ: "We do intend to release aggregated data sets in the public good to foster an open web. When we do this, we will remove your personal information and try to disclose it in a way that minimizes the risk of you being re-identified."

It's a little worrying to think that this disclosure process might be done with no formal privacy protection. See the Netflix competition, AOL search dataset, Public Transportation in Victoria, etc. case studies of how non-formal attempts at anonymization can fail users.


> Have you thought about using central/global differential privacy (which tends to have much less noise) on the "high level aggregates" or "aggregated datasets" that persist after the research study ends?

Yes. Central differential privacy is a very promising direction for datasets that result from studies on Rally.

> It's a little worrying to think that this disclosure process might be done with no formal privacy protection. See the Netflix competition, AOL search dataset, Public Transportation in Victoria, etc. case studies of how non-formal attempts at anonymization can fail users.

I've done a little re-identification research, and my faculty neighbor at Princeton CITP wrote the seminal Netflix paper, so we take this quite seriously.
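To make the central-model option concrete, here's a minimal Laplace-mechanism sketch (the epsilon, count, and query are hypothetical; a real deployment also has to budget privacy loss across queries):

```python
import random

random.seed(1)

def laplace_noise(scale: float) -> float:
    # The difference of two i.i.d. exponentials is Laplace-distributed.
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def dp_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    # Laplace mechanism: one person changes a count by at most `sensitivity`,
    # so noise of scale sensitivity/epsilon gives epsilon-DP for the count.
    return true_count + laplace_noise(sensitivity / epsilon)

noisy_count = dp_count(1_000, epsilon=0.5)  # roughly 1000, give or take a few
```

Because the noise is added once to an aggregate rather than to each user's report, the error doesn't grow with the population, which is why the central model is so much more usable than the local one.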


Interesting. I can see that RAPPOR seems to be deprecated in favor of something else called ukm (Url-keyed metrics) but not why this change is being made. Is there somewhere I can read more about it?


I am not aware of any public announcement or explanation. Which is... probably intentional, since Google is removing a headline privacy feature from Chrome.


How did you learn about it? By studying the code?


Our team looked closely at the Google, Microsoft, and Apple local differential privacy implementations when building Rally. It helped that we have friends who worked on RAPPOR.


Did you end up using differential privacy in Rally? What's the thinking behind this?


> This is a luxury many researchers that work outside of these big tech companies don't have, which creates a scientific power imbalance.

The power imbalance goes far beyond science. Independent research is foundational for platform accountability. An example: when I was working on the Senate staff, before I started teaching at Princeton, a recurring challenge was the lack of rigorous independent research on platform problems. We were mostly compelled to rely on anecdotes, which made oversight and building a factual record for legislation difficult.


I’m curious as to your take on independent scholarship, outside of the domain of academia?

Would appropriately rigorous independent scholarship be considered as a trustworthy source within your sphere?


> Would appropriately rigorous independent scholarship be considered as a trustworthy source within your sphere?

Definitely. Academia doesn't have a monopoly on excellent technology and society research. The Markup's data-driven investigative journalism, for example, is outstanding.


> Presumably, the users will be well-endowed and tax-advantaged institutions who could have just bought the information from data-aggregators anyway.

Nope. This is an important point: the type of crowdsourced science that Rally enables is something that researchers couldn't do before. (With the exception of a very small number of teams who made massive investments in building single-purpose crowdsourcing infrastructure from the ground up.)


Could you provide more detail on what makes it novel?


Common research methods have significant limitations. Web crawls, for instance, usually don't realistically simulate user activity and experiences. Lab studies often involve simplified systems that don't generalize to the real world. Surveys yield self-reported data, which can be very unreliable.

Rally studies, by contrast, reflect real-world user activity and experiences. In science jargon, Rally enables field studies and intervention experiments with excellent ecological validity.


Thanks for clarifying! Makes sense.

A few follow up questions:

1. Do you expect the opt-in nature of these studies to impact their findings?

2. To compensate for the voluntary nature of the studies, do you think researchers in general will still be incentivized to find data sources that are less respectful of people's privacy and don't require an opt-in to the study?


> 1. Do you expect the opt-in nature of these studies to impact their findings?

The Rally participant population is not representative of the U.S. population—these are users who run Firefox (other browsers coming soon), choose to join Rally, and choose to join a study. In research jargon, there's significant sampling bias.

For some studies, that's OK, because the research doesn't depend on a representative sample. For other studies, researchers can approximate U.S. population demographics. When a user joins Rally, they can optionally provide demographic information. Researchers can then use the demographics with reweighting, matching, subsampling, and similar methods to approximate a representative population. Those methods already appear throughout social science; whether they're sufficient also depends on the study.
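To make the reweighting step concrete, a toy post-stratification sketch (the group labels, population shares, and outcomes are all invented):

```python
from collections import Counter

def poststratified_mean(sample, population_shares):
    # sample: list of (group, outcome). Weight each respondent by
    # population share / sample share of their demographic group.
    n = len(sample)
    sample_shares = {g: c / n for g, c in Counter(g for g, _ in sample).items()}
    weights = {g: population_shares[g] / sample_shares[g] for g in sample_shares}
    total = sum(weights[g] for g, _ in sample)
    return sum(weights[g] * y for g, y in sample) / total

# Young users are 70% of the sample but only 30% of the population.
sample = [("young", 1.0)] * 70 + [("old", 0.0)] * 30
estimate = poststratified_mean(sample, {"young": 0.3, "old": 0.7})
# naive sample mean is 0.7; the reweighted estimate is ~0.3
```

The same caveat from survey research applies: reweighting only corrects for the demographics you measured, not for whatever else distinguishes volunteers from everyone else.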

> 2. To compensate for the voluntary nature of the studies, do you think researchers in general will still be incentivized to find data sources that are less respectful of people's privacy and don't require an opt-in to the study?

Rally is designed to provide a new research capability that didn't exist before. I don't expect a substitution effect like that.


Got it. Thanks Jonathan!


Regarding 2. that would run afoul of many ethics boards at universities. Generally they require that (informed) consent has been given to take part in the study.


> Rally studies, by contrast, reflect real-world user activity and experiences. In science jargon, Rally enables field studies and intervention experiments with excellent ecological validity.

Rally users are all opt-in. How does that impact the design of a Rally study and the conclusions you can draw from it?


Academic research in the social sciences is rigorously based on the concept of informed consent (i.e., opt-in), in the first place.

There would be no change in terms of research design and the ability to draw scientific conclusions.

edit: also, see https://news.ycombinator.com/item?id=27633212 for details on research design considerations when conducting social science.


Except as noted elsewhere, Mozilla also gets the data to "improve products and services" right?

So it sounds like a nice shiny cloak for...exactly the kind of data collection nobody actually likes.

Yay for extra steps?


Mozilla has been known to be pretty iffy when it comes to 'opt in' ( the mr. robot tie in .. etc )


>Mozilla has been known to be pretty iffy when it comes to 'opt in' ( the mr. robot tie in .. etc )

Did the instance you're referencing state it was opt-in then turn out to not be opt-in?


Princeton can't buy data from aggregators? Wikipedia says they have a $26.6B endowment.


Princeton research collaborator here. Glad to answer questions about Rally.

> What "data"? Browsing history? Identity? Something else?

That depends on the Rally study, since research questions differ and studies are required to practice data minimization. Each study is opt in, with both short-form and long-form explanations. Academic studies also involve IRB-approved informed consent. Take a look at our launch study for an example [1].

> Why? What's in it for them? Since when was giving our data to third parties a good idea? There is literally no motivation presented here.

The motivation is enabling crowdsourced scientific research that benefits society. Think Apple Research [2], NYU Ad Observatory [3], or The Markup's Citizen Browser [4]. There are many research questions at the intersection of technology and society where conventional methods like web crawls, surveys, and social media feeds aren't sufficient. That's especially true for platform accountability research; the major platforms have generally refused to facilitate independent research that might identify problems, and platform problems often involve targeting and personalization that other methods can't meaningfully examine.

[1] https://rally.mozilla.org/current-studies/political-and-covi... [2] https://www.apple.com/ios/research-app/ [3] https://adobservatory.org/ [4] https://themarkup.org/citizen-browser


These "This Study Will Collect" and "How We Protect You" sections are really good. It probably wouldn't convince me personally to sign up, but it's as comprehensive as I would expect. It's a shame that these comments didn't make it into the blog post.


I think that the motivation of 'enabling citizen science' is not a very strong one. You will get very, very skewed results, moreso than typical WEIRD, if you conduct studies on the people for whom that is sufficient motivation.

A stronger motivation would be providing a product or service that tangibly adds value to someone's life.

After reading this, I have no idea how Rally would provide any tangible benefits to me.


Exactly. It is so weird to see all this marketing speak that makes it sound like users get to benefit from something, when in the end this is just something that gets people to work and provide data for free to multi-billion-dollar universities.

We don't need any more studies or research to know that the best privacy policy is to not collect any data in the first place.


I know you mean well, but I think you completely missed the above commenter's point.

You've replied here with answers to address their (our?) potential concerns, but the commenter never said they had concerns about the project itself, rather that this particular blog post doesn't "sell" or explain the value add well. That's feedback on the project's communication strategy, not on what it's actually doing.

> > Why? What's in it for them? Since when was giving our data to third parties a good idea? There is literally no motivation presented here.

> The motivation is enabling crowdsourced scientific research that benefits society.

You seem to be confusing "theys". The question is what motivates participants, not what motivates researchers.


> You seem to be confusing "theys". The question is what motivates participants, not what motivates researchers.

Contrarily, you seem to be confusing “theys”, yourself.

There exist participants that are motivated by participating in research that benefits society.

Just like there exist individuals motivated by lending their computing resources to the various @Home research efforts.


But if the participants are limited to people who are motivated solely by participating in research, wouldn't that add significant bias to that research?


Indeed, sampling bias is a large concern.

Nonetheless, much of psychology research conducted in the US has made do with ridiculous sampling bias - the US college student is anecdotally considered to be the most-studied population in the world.


Doesn't the field of psychology have pretty serious issues with the replicability of their experimental results?


Indeed, anecdotally, if not empirically, that is the case. Nonetheless, psychology is a highly operationalized field.

In other words, every thorough study begins with an assessment and revision of the consensus language being used to describe reality.

On that front alone, psychology is one of the most hard sciences around.

Deep learning is directly attributable to psychology research, for what it is worth.


Personally I don't think that researchers have any more business doing this kind of surveillance than Google and company do.

The idea that this will benefit society seems naive to me. I feel like it will only serve to legitimize the practice by putting ostensibly trustworthy faces on the packaging.


Not just surveillance, but conducting research within corporate platforms. They would therefore have access to my data and a corporation's engine. If I think that Google already knows too much about me, do I get to choose whether that hyper-knowledge is shared with researchers? (Because I won't opt in.)


> Personally I don't think that researchers have any more business doing this kind of surveillance than Google and company do.

As other commenters have noted, then you should decline to opt-in to participating in research such as this.


> The motivation is enabling crowdsourced scientific research that benefits society.

Oh, well since it “benefits society”...

Tell me, how is it that you filter for the research that benefits society vs the research that doesn’t?

