
Yes there was. However this is a different paper, describing a different method, applied to a different dataset, with different results.

As the abstract says, "In particular, MindEye can retrieve the exact original image even among highly similar candidates indicating that its brain embeddings retain fine-grained image-specific information. This allows us to accurately retrieve images even from large-scale databases like LAION-5B. We demonstrate through ablations that MindEye's performance improvements over previous methods result from specialized submodules for retrieval and reconstruction, improved training techniques, and training models with orders of magnitude more parameters."

Note that LAION-5B has five billion images.



> To achieve the goals of retrieval and reconstruction with a single model trained end-to-end, we adopt a novel approach of using two parallel submodules that are specialized for retrieval (using contrastive learning) and reconstruction (using a diffusion prior).

You can think of contrastive learning as two separate models that take different inputs and produce vectors of the same length as outputs. This is achieved by training both models on pairs of training data (in this case, fMRI scans and the images the subject was viewing). A rough sketch of that setup is below.
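As a minimal illustration (not the paper's actual code; the encoder types, input sizes, embedding dimension, and temperature here are made-up placeholders), a CLIP-style contrastive step looks roughly like this in PyTorch:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Toy stand-ins for the two encoders: one maps flattened fMRI activity,
    # the other maps image features, both into the same embedding dimension.
    fmri_encoder = nn.Linear(4096, 256)    # hypothetical input/output sizes
    image_encoder = nn.Linear(2048, 256)

    def clip_style_loss(fmri_batch, image_batch):
        # Embed both modalities and L2-normalize so dot products are cosine similarities.
        f = F.normalize(fmri_encoder(fmri_batch), dim=-1)
        i = F.normalize(image_encoder(image_batch), dim=-1)
        logits = f @ i.T / 0.07              # similarity of every fMRI to every image
        targets = torch.arange(len(f))       # matching pairs sit on the diagonal
        # Symmetric cross-entropy pulls matching pairs together, pushes mismatches apart.
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

    loss = clip_style_loss(torch.randn(8, 4096), torch.randn(8, 2048))
    loss.backward()

The cross-entropy over the similarity matrix is what forces the two encoders to agree on a shared embedding space.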

What the LAION-5B result shows is that this training was done well enough that the two models produce similar vectors for nearly any matching image and fMRI pair. Once you have those vectors, retrieval is just nearest-neighbor search in the embedding space.
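For example (a toy sketch with a random database; in practice the candidates would be precomputed embeddings of real images):

    import torch
    import torch.nn.functional as F

    # Pretend database_embeddings holds precomputed, normalized embeddings of the
    # candidate images and query is the embedding produced from an fMRI scan.
    database_embeddings = F.normalize(torch.randn(100_000, 256), dim=-1)
    query = F.normalize(torch.randn(256), dim=-1)

    similarities = database_embeddings @ query       # cosine similarity to every candidate
    top_scores, top_indices = similarities.topk(5)   # the 5 most similar images

At the scale of five billion images you would use an approximate nearest-neighbor index rather than a brute-force matrix product, but the principle is the same.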

Then, they make a prior model which basically says “our fMRI vectors are essentially image vectors with an arbitrary amount of randomness in them (representing the difference between the contrastive learning models). Let’s train a model to learn to remove that randomness, then we have image vectors.”
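A heavily simplified version of that idea (the real model is a multi-step diffusion prior in the style of DALL-E 2; this toy version collapses it to a single denoising step with made-up layer sizes):

    import torch
    import torch.nn as nn

    # Toy "prior": given an fMRI-derived embedding plus a noised image embedding,
    # learn to predict the clean image embedding.
    prior = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 256))
    optimizer = torch.optim.Adam(prior.parameters(), lr=1e-4)

    def prior_step(fmri_emb, image_emb):
        noise_level = torch.rand(len(image_emb), 1)               # random amount of corruption
        noisy = image_emb + noise_level * torch.randn_like(image_emb)
        pred = prior(torch.cat([fmri_emb, noisy], dim=-1))        # condition on the fMRI embedding
        loss = ((pred - image_emb) ** 2).mean()                   # recover the clean embedding
        optimizer.zero_grad(); loss.backward(); optimizer.step()
        return loss

    prior_step(torch.randn(8, 256), torch.randn(8, 256))

The output of the trained prior can then be dropped into a diffusion model in place of a real image embedding.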

So yes, this is an impressive result at first glance and not some overfitting trick.

It’s also sort of bread and butter at this point (replace fMRI with “text” and that’s just what Stable Diffusion is).

There’ll be lots of these sorts of results coming out soon.


This is mostly correct, except that there is only one model. This model takes an fMRI scan and predicts two outputs. The first is specialized for retrieval and the second can be fed into a diffusion model to reconstruct images.

You can see the comparison in performance between LAION-5B retrieval and actual reconstructions in the paper. When retrieving from a large enough database like LAION-5B, we can get images that are quite similar to the seen images in terms of high-level content, but not so similar in low-level details (relative position of objects, colors, texture, etc.). Reconstruction with diffusion models does much better in terms of low-level metrics.


How is contrastive learning done with one model, exactly?

I agree only one is used in inference, but two are needed for training (otherwise how do you calculate a meaningful loss function?). Notice in the original CLIP paper, there's an image encoder and a text encoder, even though only the text encoder is used during inference. [0]

[0] https://arxiv.org/pdf/2103.00020.pdf


There are 2 submodules in our model — a contrastive submodule and a diffusion prior submodule, but they still form 1 model because they are trained end-to-end. In the final architecture that we picked there is a common backbone that maps from fMRIs to an intermediate space. Then there is an MLP projector that produces the retrieval embeddings and a diffusion prior that produces the stable diffusion embeddings.

Both the prior and the MLP projector make use of the same intermediate space, and the backbone + projector + prior are all trained end-to-end (the contrastive loss on the projector output and the MSE loss on the prior outputs are simply added together). A rough sketch of that training step is below.
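In rough PyTorch (layer sizes and the plain linear "prior" are placeholders, not the actual MindEye architecture), the combined objective looks something like:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    backbone = nn.Sequential(nn.Linear(4096, 1024), nn.ReLU())  # fMRI -> intermediate space
    projector = nn.Linear(1024, 768)                            # -> retrieval embeddings
    prior = nn.Linear(1024, 768)                                # -> embeddings for reconstruction

    params = [*backbone.parameters(), *projector.parameters(), *prior.parameters()]
    optimizer = torch.optim.Adam(params, lr=1e-4)

    def training_step(fmri, target_image_emb):
        h = backbone(fmri)
        # Contrastive loss on the projector output (retrieval objective).
        z = F.normalize(projector(h), dim=-1)
        c = F.normalize(target_image_emb, dim=-1)
        logits = z @ c.T / 0.07
        contrastive = F.cross_entropy(logits, torch.arange(len(z)))
        # MSE loss on the prior output (reconstruction objective).
        mse = F.mse_loss(prior(h), target_image_emb)
        # The two objectives are simply added and backpropagated through everything.
        loss = contrastive + mse
        optimizer.zero_grad(); loss.backward(); optimizer.step()
        return loss

    training_step(torch.randn(8, 4096), torch.randn(8, 768))

Because both losses flow back into the shared backbone, each objective can help the other.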

We found that this works better than first training a contrastive model, then freezing it and training a diffusion prior on its outputs (similar to CLIP + DALL-E 2). That is, the retrieval objective improves reconstruction, and the reconstruction objective slightly improves retrieval.


If it's still retrieving an image rather than reconstructing it, that's decently fine as long as the dataset is large enough, but that's not how diffusion models generally work, and I'd have expected the model to map the fMRI data to a wholly new image.


Please read the paper. Or at least the blog post. It's really quite readable.

They explain that they've done both retrieval and reconstruction, and have lots of pictures showing examples of each.

https://medarc-ai.github.io/mindeye/


If you can retrieve an image using a latent vector, it’s trivial to reconstruct it (decently well) with a diffusion model.


They tested themselves both on retrieval and reconstruction.



