
Do you know how to publicly comment? I couldn't find a way on the press release or their website.


Hey! Hopefully you go back and read comments on old posts, since the public comment period has begun and comments can now be submitted on regulations.gov:

https://www.regulations.gov/document/FTC-2023-0007-0001/comm...


The notice will be posted on regulations.gov. You can leave a comment there.

It's not posted yet. The public comments should open shortly. Set a reminder for Wednesday next week and it'll almost certainly be up on regulations.gov.


It's actually not linear; it's a power law. That means we need exponentially more compute, data, and model parameters to see linear improvements in performance.
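
A toy illustration of what that means in practice (the constants below are made up, purely to show the shape of the curve):

  # Assumed power law: error = a * compute**(-b). Each fixed additive
  # improvement in error requires a multiplicative jump in compute.
  a, b = 1.0, 0.1  # hypothetical constants, not from any real scaling study
  def compute_needed(err):
      return (a / err) ** (1.0 / b)  # invert err = a * C**(-b)
  for err in (0.5, 0.4, 0.3, 0.2):
      print(f"error {err:.1f} -> compute ~{compute_needed(err):.1e}")
  # each 0.1 step of improvement costs roughly an order of magnitude more compute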


Not to pour too much cold water on this, but the claim of 100% accuracy has a huge caveat. In the paper (Page 4) they state:

> Interaction. The original question may not be a prompt that synthesizes a program whose execution results in the correct answer. In addition, the answer may require multiple steps with clear plots or other modalities. We therefore may interactively prompt Codex until reaching the correct answer or visualizations, making the minimum necessary changes from the original question.

Which to me basically sounds like they had a human in the loop (who knows how to solve these math problems) and who kept changing the question until it gave the correct answer. They do measure the distance (using a sentence embedding model) between the original question and the one that yielded the correct answer, but that feels a bit contrived to me.

Nevertheless, it's still really cool that the correct answer is indeed inside the model.


Proving Douglas Adams correct. The question is harder than the answer.

This makes the "at scale" claim in the abstract clearly false IMO. Any AI system that requires that much human intervention is not scalable. When they have a second AI to produce the prompts automatically from the original questions, then they can claim to have achieved scalability.

But even without that, a system like this can still certainly be useful. And I expect rapid progress in the next few years.



Although, the correct answer is also likely on the web. With a suitable search query you would find the correct paper/textbook/wiki page with the right answer. A text-highlighting model could also likely extract this answer from the text. The training probably achieves a good degree of memorization for these known results.

This raises the question: would we be impressed with a similar compression algorithm for storing past web documents?


The main achievement is not the compression, but the search functionality (search==solve).


Well, the trivial test to make sure it's not memorized would be to change constants in the input that alter the correct answer but don't make the problem any more difficult, if it is actually doing the calculation.
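
A rough sketch of such a harness, where solve_via_model is a purely hypothetical stand-in for prompting the model and executing the program it returns (nothing here is from the paper):

  def solve_via_model(prompt):
      """Hypothetical hook: send `prompt` to the model, run the Python it
      emits, and return the result. Stubbed out here."""
      raise NotImplementedError("wire this up to the model under test")
  def memorisation_check(template, cases):
      """Ask the same question with different constants; a system that is
      actually computing should track the changed answers."""
      return {c: solve_via_model(template.format(c=c)) == expected
              for c, expected in cases}
  # Example usage (the ground-truth answers are just [c + 1]):
  # memorisation_check("Using Sympy, solve x - 1 = {c} for x.",
  #                    [(3, [4]), (5, [6]), (7, [8])])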


But the correct answer isn't inside the model at all, in any of their examples. The correct answer is inside SymPy or NumPy, at least 99% of the time. That is, the model doesn't respond with a demonstration or with the answer itself: it responds with a Python program that poses the given question to SymPy or NumPy, and then they run that program and report the answer.

Here is a basic example:

MIT Course question: Solve each equation for x. ln(x^2 − 1) = 3

Model input: Using Sympy, solve Eq ln(x**2-1)=3 for x.

Model output:

  from sympy import *
  x = symbols('x')
  solve(log(x**2-1) - 3, x)
As you can see, the model has simply translated a mechanized form of the original question to equivalent Python code. The model has 0 idea of how to solve an equation: it's using a symbolic equation solver.
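
For reference, running that generated snippet is where the mathematics actually happens; it's SymPy, not the language model, doing the algebra:

  from sympy import symbols, log, solve
  x = symbols('x')
  print(solve(log(x**2 - 1) - 3, x))
  # should give roughly [-sqrt(1 + exp(3)), sqrt(1 + exp(3))], i.e. x = ±sqrt(e^3 + 1)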

In other cases, they even "tidy" the original question into a representation of the solution. Here is their example E in Figure 2:

Original question:

> Outside of their humdrum duties as 6.042 TAs, Sayan is trying to learn to levitate using only intense concentration and Jelani is launching a “Nelson 2008” presidential campaign. Suppose that Sayan’s probability of levitating is 1/6, Jelani’s chance of becoming president is 1/4, and the success of one does not alter the other’s chances. If at most one of them succeeds, what is the probability that Jelani becomes the president of the United States?

"Tidyed question":

> Suppose that Sayan’s probability of succeeding is 1/6, and Jelani’s probability of succeeding is 1/4.

> Use numpy to find the probability at most one of Jelani and Sayan succeed.

> Use numpy to find the probability Jelani succeeds but Sayan does not succeed.

> Divide the former by the latter probability.

Model solution:

  import numpy as np
  # Probability of Sayan succeeding
  p_sayan = 1/6
  # Probability of Jelani succeeding
  p_jelani = 1/4
  # Probability of at most one of Jelani and Sayan succeeding
  p_at_most_one = 1 - (p_sayan * p_jelani)
  # Probability of Jelani succeeding but Sayan not succeeding
  p_jelani_succeed_sayan_not = p_jelani * (1 - p_sayan)
  # Probability of Jelani succeeding but Sayan not succeeding divided by probability of at most one of Jelani and Sayan succeeding
  p_jelani_succeed_sayan_not / p_at_most_one
  # Probability of at least one of Jelani and Sayan succeeding
  p_at_least_one = 1 - p_at_most_one
Tidying up the extra verbiage of the question is absolutely fair. But then, they also explain exactly how to compute the result using the data in the question; the model then generates code that perfectly matches the described algorithm. It's again not using even the tiniest bit of mathematical understanding.
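
For completeness, here is a quick check of the arithmetic the generated program above carries out (my own recomputation, using exact fractions):

  # P(at most one succeeds) = 1 - P(both) = 1 - (1/6)(1/4) = 23/24
  # P(Jelani and not Sayan) = (1/4)(5/6)                   = 5/24
  # conditional probability = (5/24) / (23/24)             = 5/23 ≈ 0.217
  from fractions import Fraction
  p_sayan, p_jelani = Fraction(1, 6), Fraction(1, 4)
  p_at_most_one = 1 - p_sayan * p_jelani
  p_jelani_not_sayan = p_jelani * (1 - p_sayan)
  print(p_jelani_not_sayan / p_at_most_one)  # 5/23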

I have browsed their examples, and I have not seen even a single one where the model does more than rephrase the question into a 1:1 Python representation of the question itself.

None of the answers would pass even the simplest undergrad exam. They are literally of the form "how would you solve equation E?" "I would write a program that says sympy.solve(E)".


Well, they do say very clearly that they "solve" problems by program synthesis and what they describe is perfectly legit program synthesis.

To clarify, program synthesis (or automatic programming) is the task of generating programs from specifications. There are two kinds of program synthesis: deductive program synthesis, from a complete specification of the target program; and inductive program synthesis, or program induction, from an incomplete specification (such as sets of program inputs and outputs, or traces). An example of deductive program synthesis is the generation of low-level code from a high-level language by a compiler.
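
As a toy illustration of the inductive case (my own example, nothing to do with Codex): given only input/output pairs, enumerate a small space of candidate programs and keep one consistent with all of them:

  # Toy inductive program synthesis: enumerate a tiny candidate space and
  # return the first program consistent with every input/output example.
  candidates = {
      "x + 1": lambda x: x + 1,
      "x * 2": lambda x: x * 2,
      "x ** 2": lambda x: x ** 2,
  }
  examples = [(1, 2), (2, 4), (3, 6)]  # the (incomplete) specification
  def synthesize(examples):
      for name, prog in candidates.items():
          if all(prog(i) == o for i, o in examples):
              return name
  print(synthesize(examples))  # x * 2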

What the paper describes is a kind of deductive program synthesis from a complete specification in natural language. I suspect the true contribution of the work is the demonstration of using natural language as a complete specification, where earlier work generally only demonstrated the use of natural language as an incomplete specification (for example, comments describing intent rather than implementation) and the combination of natural language with code, as in the original Codex work [Edit: actually, now that I look again, the Codex paper also has examples of comments that fully specify the target program, e.g. in Figure 2: https://arxiv.org/abs/2107.03374; so the work above is basically incremental].

On the other hand, it's clear to me that the training has made the model memorise answers, and all the work in prompt engineering described under "Workflow" serves to find the right prompts to retrieve the desired memorisations, much like one must fire just the right SQL query to get back the right data. Certainly interesting to see in action and useful for everyday work, but far from "solving" anything in the grandiose way that it is announced by the authors (e.g. "These astounding results..." in the "Conclusion" section, etc.).


How well would Copilot™ do on this type of problem?


I believe Copilot uses the same underlying research as the paper - Codex.


I was hoping their breakthrough was that they had found a general way to parse conceptual problems into the language of math and logic. That is the truly hard part, and what people spend a lot of time learning to do. Software like Octave and Mathematica can already evaluate tons of things once parsed.


It looks like this is just Pilocarpine, which has been used for decades to treat glaucoma, is available at pharmacies everywhere, and is commonly used (perhaps off label) to shrink pupils. I wonder what, if anything, they changed compared to the generic version.

I've been using Pilocarpine off label to shrink my pupils at night after ICL surgery (an alternative to lasik) to solve debilitating halos caused by my pupils growing larger than the implanted lens.

In my experience, it does increase close range vision (at some minor expense to long range vision). That said, it also gives a mild headache, and blurs your vision substantially for the first 5-15 minutes after use. I don't really see the appeal of using it daily unless you really have to.


Supported by a couple of searches, e.g. https://en.wikipedia.org/wiki/Pilocarpine


While I sort of agree that machine learning will end up as an experimental science, it's way, way too early to say whether the theory relating deep learning to kernel methods (e.g. Neural Tangent Kernels) will be useful or not.

As an example, just last week a (huge) paper [1] was put on arXiv that used these theoretical methods to analyze a bunch of common architecture building blocks (skip connections, normalization, etc.), and then applied their theoretical findings to figure out how to train ResNet-like models in similar training time without these seemingly "required" building blocks.

Deep Learning is still in its infancy in many ways, and this type of research takes time, slowly building on successive results.

[1] https://arxiv.org/abs/2110.01765


When you wrote 'huge', I thought you meant huge potential impact; I wasn't expecting 172 pages.

The team behind the paper is DeepMind/Google. It is probably worth a read.


Another absolutely fantastic resource is this Jupyter Notebook based textbook on Kalman Filters and related topics: https://github.com/rlabbe/Kalman-and-Bayesian-Filters-in-Pyt...

An addition to your correction: there are many other ways to solve the SLAM problem beyond (Kalman/information/particle) filters. Optimization-based approaches are very popular (search terms: Graph SLAM, Factor Graphs, Pose Graphs).
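
For a flavour of the optimization view, here is a minimal 1D pose-graph sketch (toy numbers of my own, no SLAM library assumed): a few odometry constraints plus one loop closure reduce to a small linear least-squares problem.

  # 4 poses on a line. Odometry says each step moves +1.0; a noisy loop
  # closure says x3 - x0 = 2.7. Anchor x0 = 0 and solve least squares,
  # which spreads the 0.3 of disagreement across the trajectory.
  import numpy as np
  A = np.array([[-1,  1,  0,  0],   # x1 - x0 = 1
                [ 0, -1,  1,  0],   # x2 - x1 = 1
                [ 0,  0, -1,  1],   # x3 - x2 = 1
                [-1,  0,  0,  1],   # x3 - x0 = 2.7 (loop closure)
                [ 1,  0,  0,  0]],  # x0 = 0 (anchor)
               dtype=float)
  b = np.array([1.0, 1.0, 1.0, 2.7, 0.0])
  poses, *_ = np.linalg.lstsq(A, b, rcond=None)
  print(poses)  # approximately [0.0, 0.93, 1.85, 2.78]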


True! I don’t find too many people out there who know about this stuff. Any interest in trading battle stories through DM? Would be interested to know what you’re working on and where you’ve been!


Were there any statistically significant changes in compensation at each level based on more unique skills or more granular classifications of engineers?

In particular I'm curious about Machine Learning/Data Science, Robotics, Distributed Systems etc, but I would imagine Web vs Mobile vs DevOps vs Data may look different too.


My first ever internship in software engineering (in 2012) was in the MUMPS world! I worked on a project for a contracting company, building an automated testing framework (in Python) to interface with and allow the VA to refactor their massive VistA EHR, which was written entirely in MUMPS.

The fact that the system works at all is total magic to me, with hundreds of subsystems and millions of lines of code, all with the same shared global variable pool. I remember having to spend a few days digging through hundreds of pages of kernel documentation (it has its own kernel!) to simply find out how to write to a file..

What you can remember seems largely correct. I think commands and syntax were case insensitive, while variables were case sensitive? All kinds of insanity like that.


Sounds like a job for Symantec Designs to get a translation from. Although I wonder if it would even be lucrative for them to specialize in an obscure language like this, but maybe!


My bad, make that "Semantic Designs"


In this specific case, it was probably further from the strike zone than the pitcher wanted. That said, pitchers throw balls often to see if hitters will swing at them, and some hitters are more apt to swing at bad pitches (and therefore pitchers exploit this).

Pitches at the bottom of the strike zone, or pitches that are low and drop below the strike zone (which would be called balls), are generally hard to hit, so a lot of pitchers throw sinking pitches there hoping to either get a very borderline strike or a swing and miss.


I agree that this is "better" than the alternative, but it can be absolutely exhausting for candidates actively searching for a job. I feel like it's recently become much, much more common (from my small-ish sample of me and some friends).

My issue with this approach is fourfold:

1. Most companies have no idea how to structure a problem that is both informative to them and also not abusive to the candidates time.

2. Companies generally do this right after the recruiter phone screen, which most likely doesn't give the candidate enough information to decide if the next steps are worth their time.

3. Most companies still do a whole suite of normal tech screens after you work on a take home problem.

4. If you're actively looking, getting a bunch of these over a short period of time is likely. I know during my full time search, more than 50% of companies had a take home test right after the recruiter screen. Most of these were 4-8 hours of work each, due within the week.

A lot of startups structure it more like hazing or a barrier to entry than an evaluation criterion. I have some fun (read: horrifying) anecdotes from my recent search that illustrate the problems above, but I don't think any of my points are surprising.

A nice alternative would have been to simply have one or two completed projects that are straightforward to evaluate and to walk companies through them, letting them ask me questions.


Here's a really simple test for whether a work-sample scheme is effective, or just a bullshit trend-chasing afterthought:

Does the work sample test offset all or most of the unstructured tech evaluation the company would otherwise do?

If it does, that means they have a work-sample test rubric they believe in, and they're hiring confidently. If it doesn't, they don't believe in what they're doing, and the programming challenges they're assigning can thus reasonably be seen as more hoop-jumping.

In the latter case, it's up to you to decide whether a prospective job is worth extra hoop-jumping. Some teams are worth it!


I think that's fair. I've had both the former and the latter, but unfortunately most of my experiences fall into the latter case, where it's simply been hoop-jumping. Most of my friends (all about to graduate, so a good number of examples) are experiencing the same.

For example, one company gave a problem with five parts, with the final part being to solve longest path on a weighted bipartite graph (which is quite a hard and time-consuming problem). After that, the next step was a phone technical screen, then an on-site with 4-5 more interviews, most being whiteboarding. It was basically hazing instead of an evaluation criterion.

An alternative is my last job, which had a take-home test that took about 6 hours, but that was the whole technical part of the process. Having been on the other side reviewing submissions, the problem absolutely gave enough information.

I totally get there's a right way to do it, but like most interviewing trends, companies seem to just be adding this as a step instead of revamping their process.


Does the job they're interviewing for involve finding longest paths on weighted bipartite graphs? Or is this just non-recursive Towers of Hanoi pretending to be a realistic work sample?


No, the position most definitely had absolutely nothing to do with longest path or combinatorial optimization.

Anyway, my larger point is that, from what I've been seeing while interviewing, these tests are becoming much more common at US startups without companies removing or reducing the rest of their technical evaluation process, or really structuring the problems to be a good signal.

In an ideal world where companies do take-home tests right, I think it's a great solution. But what I've been seeing more often than not doesn't bear that out, which makes it hard to support.

I'm really curious what you've been seeing at Starfighter. Are partnering companies still going on to do a full technical interview? Or does Starfighter largely replace their normal technical evaluation?

Ignoring the fun of the challenges themselves (which probably isn't entirely fair), the latter makes it very compelling for a candidate. The former does not.


Most of our partners have a somewhat abbreviated interview for our candidates, but everyone (as far as I know) still techs our candidates out.

I'm actually fine with that! We make no pretense of having designed a screening process that is appropriate for every job. What I'm less fine with is the fact that the norm, even for abbreviated tech-outs, is 7-8 hours of on-site whiteboard interview.

