My question is fairly open and allows me to continue asking significantly more complicated questions. So far, nobody has (for example) asked enough clarifying questions to determine that a bloom counting filter could potentially solve the problem- most people just store ASCII string keys in a hash table, which wastes tons of memory and requires a lookup for every string.
So usually I end up after 45 minutes finding that the candidate has more or less tapped themselves out at "make a hash table of string keys, use it to store counts" (OK for a very junior programmer) or "make a perfect minimal hash" (I help them get there if they don't know what those are) or "use a probabilistic counting filter". This is about all I need to make a determination.
The problem with this question is that outside the junior people who won't understand the hints, many good programmers would pass or fail just based on whether they have seen a similar problem or know what a Bloom filter is. CS is so broad that you may have been working for many years and never come across one topic that the interviewer deems standard.
I don't fail people who don't suggest a bloom filter. What I meant was: knowledge of probabilistic data structures and knowing this is a situation where they could be used is a sign of a more experienced programmer (likely one who's worked in clickstream processing).
If all you do in my interview is know that a hash table or other associative data structure is good for maintaining counts, that using a full ASCII byte to store symbols from { A G T C } is wasteful and compress the letters to 2 bits each, you can pack multiple 2 bits into bytes, and can write a function that codes and decodes such data, I'm happy and you get a pass. I'm even happy to sit there and help people through nearly all the steps of the codec. Not trying to trick anybody or select for obscure CS knowledxge.
To me the real question is, how much hinting is reasonable for the coding and decoding function? Many programmers (including senior ones) struggle to implement:
x <<= 2 # left shift the int to make room for the next item
x |= pattern # or the new 2-bit item into the accumulator.
Since I never use bit munging in my day job (it's 90% python data science) should I really ding somebody for not knowing about left shift or or-assignment?
What I really don't get is why people immediately jump to trying to implement huffman coding, RLE, or lookbacks as a solution to reduce the size of the DNA string. I still wonder if how I present the question is giving everybody a fair chance to shine.
I am 100% certain that your questions are not giving everybody a fair chance to shine. You have to start with the area of expertise they have already worked in and not the area that you happen to have experience in or be interested in. I could (for example) ask “simple” dynamic optimisation questions that are “easy” for me and I know that 95% of great experienced developers would fail. So obviously I don’t do that. I want to hire smart developers who can learn. Not developers who happens to have the same experience/interest that I do.
I hope you realised that your questions are extremely narrow and in no way filter for competent developers. Imagine yourself going to an interview and being asked similar narrow specific questions but in a different area you have no previous experience in. You would probably fail. For example: how would you represent a lambda closure when compiling a functional language to machine code? It is a super easy question for me but I know most people would fail. So I don’t ask questions like that when interviewing.
This sounds like a perfect question for a company that's hiring engineers who want to focus on optimizing DNA storage. If that's your goal you should just put that in the hiring criteria. You'll get more candidates that know those type of algs. If your goal is know if the candidate knows about bloom filters a better question might be "What do you know about bloom filters?" Definitely seems like you'd get answer in less than 45 minutes.
I think "key/value counts" is absolutely the correct starting answer. Anything after that is optimization. In the real world optimization comes at length through research, learning, grokking, reading, benchmarking, sleeping, mulling things over. The only way to short change that is knowing about those optimizations in advance, so might as well just ask about those algs specifically.
All that being said, I think it's an enlightening question. Thank you for sharing it! You've educated me. I've been coding a loooooooong time and didn't know about bloom counting filters specifically.
But have you had a chance to look back and see if those questions worked out? Like, are they a good predictor for how the person actually performed in the job?
So usually I end up after 45 minutes finding that the candidate has more or less tapped themselves out at "make a hash table of string keys, use it to store counts" (OK for a very junior programmer) or "make a perfect minimal hash" (I help them get there if they don't know what those are) or "use a probabilistic counting filter". This is about all I need to make a determination.