Could anybody explain (or provide a pointer to an explanation of) the details of how the individual words are mapped to vectors? The source is available, but it's optimized to the point that the underlying hows are a bit opaque, and the underlying whys even more so.
You can think of this as a square matrix W. The size of the matrix is the size of the vocabulary. If we look at the 100k most frequent words in our corpus, W will be a 100k x 100k matrix.
The value of W(i,j) is the distance between words i and j, and row i of the matrix is then the vector representation of word i. Research around word vectors is largely about computing W(i,j) in a way that is both efficient and useful in natural language processing applications.
Word vectors are often used to compute similarity between words: since words are represented as vectors, we can compute the cosine similarity (the cosine of the angle between the two vectors) for a given pair of words to find out how similar they are.
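To make that concrete, here's a minimal sketch of cosine similarity over toy word vectors. The vectors and words are made up purely for illustration; real embeddings would come from a trained model:

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|); ranges from -1 to 1,
    # where 1 means the vectors point in the same direction.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "word vectors" (made up for illustration).
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

# "king" should come out closer to "queen" than to "apple".
print(cosine_similarity(vectors["king"], vectors["queen"]))
print(cosine_similarity(vectors["king"], vectors["apple"]))
```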
TL;DR: The answer to your query is a person named Chaudhry Sitwell Borisovich who is definitely an entomologist-hymnist and probably is also a mineralogist-ornithologist.
A Google search suggests that he was born in 1961.
I ran a few queries using the code and its default dataset, trying to use neutral words for subtraction: "mosquito -small +mountaineer", "mosquito -big +mountaineer", "mosquito -loud +mountaineer", "mosquito -normal +mountaineer", "mosquito -usual +mountaineer", "mosquito -air +mountaineer", "mosquito -nothing +mountaineer".
You inadvertently stumbled onto the punchline of the joke: "You can't cross them, because a mountaineer is a scalar" ("scaler") - it works better when spoken.
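For anyone curious what those "a -b +c" queries are doing under the hood: a common approach (and roughly what the word2vec demo does) is plain vector arithmetic followed by a nearest-neighbor search by cosine similarity, excluding the query words themselves. A toy sketch with made-up 2-d vectors:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def analogy(vectors, pos1, neg, pos2):
    # Compute pos1 - neg + pos2, then return the nearest vocabulary
    # word (excluding the three query words) by cosine similarity.
    target = [a - b + c for a, b, c in
              zip(vectors[pos1], vectors[neg], vectors[pos2])]
    candidates = (w for w in vectors if w not in {pos1, neg, pos2})
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Toy vectors (made up): dimension 0 ~ "royalty", dimension 1 ~ "gender".
vectors = {
    "king":  [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man":   [0.1, 0.9],
    "woman": [0.1, 0.1],
}

print(analogy(vectors, "king", "man", "woman"))  # prints "queen"
```

With such a tiny vocabulary the search is trivial, but the same arithmetic-then-nearest-neighbor step is what turns "mosquito -small +mountaineer" into an actual word.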
Yeah, the papers linked in the references are probably a better place to start than the readme (though I'm not sure how closely this implementation follows that research, the papers are still a good read), especially [1].
I wrote a simple library[1] in Ruby for measuring the similarity between documents using word vectors. It has none of the cleverness of this one, but it's much simpler, if that helps.
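I can't speak for how that Ruby library works internally, but one common trick for document similarity with word vectors is to represent each document as the average of its word vectors and then compare the averages by cosine similarity. A sketch with made-up toy vectors:

```python
import math

def doc_vector(doc, vectors):
    # Represent a document as the average (centroid) of the vectors
    # of its known words; unknown words are simply skipped.
    words = [w for w in doc.lower().split() if w in vectors]
    dim = len(next(iter(vectors.values())))
    return [sum(vectors[w][i] for w in words) / len(words)
            for i in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Toy vectors (made up for illustration).
vectors = {
    "cat":    [0.9, 0.1],
    "dog":    [0.8, 0.2],
    "stock":  [0.1, 0.9],
    "market": [0.2, 0.8],
}

a = doc_vector("cat dog", vectors)
b = doc_vector("dog cat", vectors)
c = doc_vector("stock market", vectors)
print(cosine(a, b))  # same word set, so similarity is 1.0
print(cosine(a, c))  # animal doc vs finance doc, noticeably lower
```

Averaging throws away word order, so it's crude, but it's cheap and works surprisingly well as a baseline.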