Wasn't there something similar on HN a few months ago, where the top comment talked about how it's not as impressive as it sounds [0]? The main issue is that this type of methodology pulls from a pool of images rather than directly reconstructing, from brain activity, the image that was actually seen.
> I immediately found the results suspect, and think I have found what is actually going on. The dataset it was trained on was 2770 images, minus 982 of those used for validation. I posit that the system did not actually read any pictures from the brains, but simply overfitted all the training images into the network itself. For example, if one looks at a picture of a teddy bear, you'd get an overfitted picture of another teddy bear from the training dataset instead.
> The best evidence for this is a picture(1) from page 6 of the paper. Look at the second row. The building generated by 'mind reading' subject 2 and 4 look strikingly similar, but not very similar to the ground truth! From manually combing through the training dataset, I found a picture of a building that does look like that, and by scaling it down and cropping it exactly in the middle, it overlays rather closely(2) on the output that was ostensibly generated for an unrelated image.
> If so, at most they found that looking at similar subjects light up similar regions of the brain, putting Stable Diffusion on top of it serves no purpose. At worst it's entirely cherry-picked coincidences.
> 1. https://i.imgur.com/ILCD2Mu.png
> 2. https://i.imgur.com/ftMlGq8.png
[0] https://news.ycombinator.com/item?id=35012981
Our model generates CLIP image embeddings from fMRI signals. Those image embeddings can be used for retrieval (using cosine similarity, for example) or passed into a pretrained diffusion model that takes CLIP image embeddings and generates an image (it's a bit more complicated than that, but that's the gist; read the blog post for more info).
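Roughly, the retrieval step looks like this. This is only an illustrative sketch, not the actual code; `brain_to_clip` and the gallery tensor are placeholders:

```python
import torch
import torch.nn.functional as F

def retrieve(pred_embedding: torch.Tensor, gallery: torch.Tensor, k: int = 5):
    """pred_embedding: (d,) CLIP-space embedding predicted from fMRI.
    gallery: (n, d) CLIP image embeddings of candidate images."""
    pred = F.normalize(pred_embedding, dim=-1)
    gal = F.normalize(gallery, dim=-1)
    sims = gal @ pred                # (n,) cosine similarities
    return sims.topk(k).indices      # indices of the k closest candidates

# usage (placeholders):
# pred = brain_to_clip(voxels)              # hypothetical fMRI -> CLIP model
# top = retrieve(pred, gallery_embeddings)  # best-matching candidate images
```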
So we are doing both reconstruction and retrieval.
The reconstruction achieves SOTA results. The retrieval demonstrates that the image embeddings contain fine-grained information: the model isn't just saying "it's a picture of a teddy bear" and then having the diffusion model generate a random teddy bear picture.
I think the zebra example really highlights that. The generated image embedding matches the exact zebra image that was seen by the person. If the model could only say "it's a zebra picture," it wouldn't be able to do that. The model is picking up on fine-grained info present in the fMRI signal.
The blog post has more information, and the paper itself has even more, so please check them out! :)
I'm curious what answers you would find acceptable. I'm not being snarky - I genuinely struggle with this line of thinking. People seem to find "if I don't then someone else will" to be an unacceptable answer, but it seems to me to be fairly central.
There's an inevitability about most scientific discoveries (there are notable exceptions, but they are few), and unless we're talking about something with a capital outlay in the trillions of dollars, it's going to happen whether we like it or not - short of a global totalitarian state capable of deep scrutiny of all research.
>People seem to find "if I don't then someone else will" to be an unacceptable answer but it seems to me to be fairly central.
Because you can use this as a cop-out for truly heinous work, e.g. gain-of-function research, autonomous weapons, chemical weapons, etc. It's not a coherent worldview for someone who actually cares about doing good.
I think you've hit upon some interesting examples. Maybe the way to look at this is cost vs "benefit" (in the broadest sense of the word).
When research has an obvious and immediate negative outcome that's a cost. The difficulty/expense of the research is also a cost.
The "benefit" would be the incentive to know the outcome. This may be profit, military advantage, academic kudos etc.
Maybe the problem with the type of research being discussed here is that there isn't necessarily any agreement that the outcome is negative. For many people, I suspect this will remove a lot of the weight on the "cost" side of things.
I'm not making a specific point here - I'm actually trying to work this out in my head as I write.
> I think you've hit upon some interesting examples. Maybe the way to look at this is cost vs "benefit" (in the broadest sense of the word).
This is obviously a better framework to be in.
"If I don't do it someone else will" is really fraught and that's why people reject it.
So one would really need to ask whether there is a net benefit to having a "mind reading" system out in the world. In fact I find it hard to think of positive use cases that aren't just dwarfed by the possibility of Orwellian/panopticon type hellscapes.
> In fact I find it hard to think of positive use cases
Firstly - forcing people to think of positive use-cases up front is a terrible way to think about science. Most discoveries would have failed this test.
Secondly - can you really not? Off the top of my head:
a) Research tools for psychology and other disciplines
b) Assistive devices for the severely disabled
c) An entirely new form of human-computer interface with many possible areas of application
As I mentioned, do any of those outweigh the possibility that some three-letter agency might start mass-scanning US citizens for what amounts to thought crime? The very fundamental idea of privacy would cease to exist.
That's a very big leap. If we're at the stage where a three letter agency can put you in an fMRI machine, then we're probably also at the stage where they can beat you with a rubber hose until you confess.
My point is that there's already a wide variety of things a future draconian state can do. This doesn't seem to move the dial very much.
I'm not suggesting I have some ability to judge whatever the answer is, I'm just curious because TFA didn't include a lot of detail on this point except some vague bullet points at the end.
The underlying NSD dataset used in the three prominent (and impressive) recent papers on this topic, including the one linked here, is a bit problematic because it invites exactly this (classification/identification rather than reconstruction): it only has 80 categories and was not recorded with reconstruction in mind.
Reconstruction is the primary and difficult aim, and it is what you want and expect when people talk about such "mind reading". Classifying something from brain activity has long been solved and is not difficult; it is almost trivial with modern data sizes and quality. At 80 categories, and with data from higher visual areas, you could even use an SVM for the basic classifier and then some method for getting a similar blob shape from the activity (V1-V3 are map-like), and get good results.
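To make concrete how low that bar is: the classifier part is a few lines of scikit-learn. Purely illustrative sketch with random stand-in data (real inputs would be voxel beta estimates from higher visual areas and the 80 category labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Dummy stand-ins: X would be (n_trials, n_voxels) responses, y the 80 categories.
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 3000)).astype("float32")
y = rng.integers(0, 80, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = make_pipeline(StandardScaler(), LinearSVC(C=0.1))
clf.fit(X_tr, y_tr)
print("80-way category accuracy:", clf.score(X_te, y_te))
```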
If you ignore the question of whether you are just doing classification, you can easily get too-good-to-be-true results. With these newer methods relying on pretrained features, this classification case can hide deep inside the model too, and can easily be missed.
One thing they showed is that the 80 categories of that data collapse to just 40 clusters in the semantic space.
(Kamitani has been working on the reconstruction question for a long time and knows all these traps quite well.)
The deeprecon dataset proposed as an alternative has been around for a few years and has been used in multiple reconstruction papers. It has many more classes, out-of-distribution "abstract" images, and no class overlap between train and test images, so it's quite suitable for proving that it is actually reconstruction. But it's also one order of magnitude smaller than the NSD data used for the newer reconstruction studies. If you modify the 80-class NSD data to not have train-test class overlap, the two diffusion methods tested there do not work as well, but still look like they do some reconstruction.
On deeprecon the two tested diffusion methods fail at reconstructing the abstract OOD images (which NSD does not have), something previous reconstruction methods could do.
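For clarity, the class-disjoint split I'm referring to is nothing more than holding out entire categories. A sketch, assuming a hypothetical list of (fmri, image, category) records rather than the actual NSD/deeprecon loading code:

```python
import random

def class_disjoint_split(samples, test_fraction=0.2, seed=0):
    """samples: list of (fmri, image, category) records (hypothetical format).
    No category that appears in the test set is allowed in the training set."""
    categories = sorted({cat for _, _, cat in samples})
    rng = random.Random(seed)
    rng.shuffle(categories)
    held_out = set(categories[: max(1, int(len(categories) * test_fraction))])
    train = [s for s in samples if s[2] not in held_out]
    test = [s for s in samples if s[2] in held_out]
    return train, test
```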
Yes there was. However this is a different paper, describing a different method, applied to a different dataset, with different results.
As the abstract says, "In particular, MindEye can retrieve the exact original image even among highly similar candidates indicating that its brain embeddings retain fine-grained image-specific information. This allows us to accurately retrieve images even from large-scale databases like LAION-5B. We demonstrate through ablations that MindEye's performance improvements over previous methods result from specialized submodules for retrieval and reconstruction, improved training techniques, and training models with orders of magnitude more parameters."
> To achieve the goals of retrieval and reconstruction with a single model trained end-to-end, we adopt a novel approach of using two parallel submodules that are specialized for retrieval (using contrastive learning) and reconstruction (using a diffusion prior).
You can think of contrastive learning as two separate models that take different inputs and produce vectors of the same length as outputs. This is achieved by training both models on pairs of training data (in this case, fMRI scans and the observed images).
What the LAION-5B work shows is that they did a good enough job of this training that the models are really good at creating similar vectors for nearly any image and fMRI pair.
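In code, the contrastive part is essentially the standard CLIP-style symmetric loss. A sketch (the encoders producing `fmri_emb` and `img_emb` are placeholders, not the paper's actual modules):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(fmri_emb, img_emb, temperature=0.07):
    """fmri_emb, img_emb: (batch, d) outputs of the two encoders for matched pairs."""
    fmri_emb = F.normalize(fmri_emb, dim=-1)
    img_emb = F.normalize(img_emb, dim=-1)
    logits = fmri_emb @ img_emb.T / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(len(logits))           # i-th fMRI matches i-th image
    # symmetric cross-entropy: pull matched pairs together, push mismatched apart
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```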
Then, they make a prior model which basically says “our fMRI vectors are essentially image vectors with an arbitrary amount of randomness in them (representing the difference between the contrastive learning models). Let’s train a model to learn to remove that randomness, then we have image vectors.”
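A heavily simplified sketch of that prior idea. The real thing is a diffusion prior; this MLP trained with an MSE loss is only meant to convey the mapping being learned (sizes are assumptions for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 768  # embedding size, assumed for illustration
prior = nn.Sequential(nn.Linear(d, 2048), nn.GELU(), nn.Linear(2048, d))
opt = torch.optim.Adam(prior.parameters(), lr=1e-4)

def prior_step(fmri_side_emb, clip_image_emb):
    """fmri_side_emb: (batch, d) 'noisy' vectors from the fMRI side.
    clip_image_emb: (batch, d) CLIP embeddings of the images actually seen."""
    loss = F.mse_loss(prior(fmri_side_emb), clip_image_emb)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```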
So yes, this is an impressive result at first glance and not some overfitting trick.
It’s also sort of bread and butter at this point (replace fMRI with “text” and that’s just what Stable Diffusion is).
There'll be lots of these sorts of results coming out soon.
This is mostly correct, except that there is only one model. This model takes an fMRI and predicts 2 outputs. The first is specialized for retrieval and the second can be fed into a diffusion model to reconstruct images.
You can see the comparison in performance between LAION-5B retrieval and actual reconstructions in the paper. When retrieving from a large enough database like LAION-5B, we can get images that are quite similar to the seen images in terms of high level content, but not so similar in low-level details (relative position of objects, colors, texture, etc). Reconstruction with diffusion models does much better in terms of low-level metrics.
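To illustrate the distinction, metrics of roughly this flavor are involved. This is illustrative only, not the exact metrics from the paper; `embed` stands in for any semantic image encoder such as a CLIP vision model:

```python
import torch
import torch.nn.functional as F

def low_level_score(recon, target):
    """Pixel-wise correlation between two (3, H, W) images in [0, 1]:
    sensitive to layout, color and texture."""
    r, t = recon.flatten(), target.flatten()
    r, t = r - r.mean(), t - t.mean()
    return float((r @ t) / (r.norm() * t.norm() + 1e-8))

def high_level_score(recon, target, embed):
    """Cosine similarity of semantic embeddings: sensitive to content
    and category rather than exact layout."""
    e_r, e_t = embed(recon).flatten(), embed(target).flatten()
    return float(F.cosine_similarity(e_r, e_t, dim=0))
```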
How is contrastive learning done with one model, exactly?
I agree only one is used in inference, but two are needed for training (otherwise how do you calculate a meaningful loss function?). Notice in the original CLIP paper, there's an image encoder and a text encoder, even though only the text encoder is used during inference. [0]
There are two submodules in our model, a contrastive submodule and a diffusion prior submodule, but they still form one model because they are trained end-to-end. In the final architecture that we picked, there is a common backbone that maps from fMRIs to an intermediate space. Then there is an MLP projector that produces the retrieval embeddings and a diffusion prior that produces the Stable Diffusion embeddings.
Both the prior and the MLP projector make use of the same intermediate space, and the backbone + projector + prior are all trained end-to-end (the contrastive loss on the projector outputs and the MSE loss on the prior outputs are simply added together).
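Schematically, the setup looks like the following. This is only a sketch with placeholder sizes and simple linear heads, not the actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BrainModel(nn.Module):
    # Placeholder sizes; the real model is far larger.
    def __init__(self, n_voxels=15000, hidden=4096, clip_dim=768):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(n_voxels, hidden), nn.GELU())  # fMRI -> intermediate space
        self.projector = nn.Linear(hidden, clip_dim)   # retrieval embeddings (contrastive loss)
        self.prior_head = nn.Linear(hidden, clip_dim)  # stand-in for the diffusion prior (MSE loss)

    def forward(self, voxels):
        h = self.backbone(voxels)
        return self.projector(h), self.prior_head(h)

def clip_style_loss(a, b, temperature=0.07):
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.T / temperature
    targets = torch.arange(len(logits))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

def training_step(model, voxels, clip_img_emb):
    retrieval_emb, recon_emb = model(voxels)
    # The two objectives are simply added and backpropagated end-to-end.
    return clip_style_loss(retrieval_emb, clip_img_emb) + F.mse_loss(recon_emb, clip_img_emb)
```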
We found that this works better than first training a contrastive model then freezing it and training a diffusion prior on its outputs (similar to CLIP + DALLE-2). That is, the retrieval objective improves reconstruction and the reconstruction objective slightly improves retrieval.
If it's still retrieving an image rather than reconstructing it, that's decently fine when the dataset is large enough, but that's not how diffusion models generally work, and I'd have expected the model to map the fMRI data to a wholly new image.