
There have been a lot of these RAG abstractions posted recently. As someone working on this problem, it's unclear to me whether the calculation and ingestion of embeddings from source data should be abstracted into the same software package as their search and retrieval. It probably depends on the complexity of the problem. This one is interesting in that a built-in DB extension makes intuitive sense when the source data itself is coming from the same place the embeddings are going. But so far I have preferred a separation of concerns in this respect, since in some cases the models will be used to compute embeddings outside the DB context (for example, the user's search query needs to get vectorized; why not have the frontend and the backend query the same embedding service? A sketch of what I mean is below.) Anyone else have thoughts on this?
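
Something like this hypothetical sketch, assuming FastAPI and sentence-transformers (the model name and endpoint are illustrative, not a recommendation):

    # One embedding service shared by the ingestion pipeline and the
    # search frontend/backend, so the model is tracked in one place.
    from fastapi import FastAPI
    from sentence_transformers import SentenceTransformer

    MODEL_NAME = "all-MiniLM-L6-v2"  # every consumer must agree on this
    model = SentenceTransformer(MODEL_NAME)
    app = FastAPI()

    @app.post("/embed")
    def embed(texts: list[str]) -> dict:
        # Returning the model name lets callers sanity-check that stored
        # vectors and query vectors came from the same model.
        return {"model": MODEL_NAME, "vectors": model.encode(texts).tolist()}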


It's certainly up for debate, and there is a lot of nuance. I think it can simplify the system's architecture quite a bit if all the consumers of the data do not need to keep track of which transformer model to use. After all, once the embeddings are first derived from the source data, any subsequent search query will need to use the same transformer model that created the embeddings in the first place.

I think the same problem exists with classical/supervised machine learning. Most models' features went through some sort of transformation, and when it's time to call the model for inference those same transformations need to happen again.
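
One common way to keep those transformations and the model in sync is to bundle them together. A minimal scikit-learn sketch (synthetic data, illustrative file name):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    import joblib

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    pipe = make_pipeline(StandardScaler(), LogisticRegression())
    pipe.fit(X, y)                     # scaler statistics are learned here
    joblib.dump(pipe, "model.joblib")  # transform and model travel together

    # At inference time the same scaling is reapplied automatically:
    restored = joblib.load("model.joblib")
    print(restored.predict(X[:3]))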


There's an issue in the pgvector repo where someone with several ~10-20 million row tables got acceptable performance with the right hardware and some performance tuning: https://github.com/pgvector/pgvector/issues/455

I'm in the early stages of evaluating pgvector myself, but having used Pinecone, I currently like pgvector better because it is open source. The indexing algorithms are transparent; one can understand and modify their parameters. Furthermore, the database is PostgreSQL, not a proprietary document store: when the other data in the problem is stored relationally, it is very convenient to have the vectors stored that way as well, and PostgreSQL has good observability and metrics. When it comes to flexibility for specialized applications, pgvector seems like the clear winner. But I can definitely see Pinecone's appeal if vector search is not a core component of the problem/business, as it is very easy to use and scales very easily.
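
To give a flavor of the knobs you get (a rough sketch, assuming psycopg2, pgvector >= 0.5 for HNSW, and an existing "items" table with a vector column; the parameter values are illustrative, not recommendations):

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")  # hypothetical DSN
    cur = conn.cursor()

    # The index algorithm and its build parameters are in the open:
    cur.execute("""
        CREATE INDEX ON items USING hnsw (embedding vector_l2_ops)
        WITH (m = 16, ef_construction = 64);
    """)
    conn.commit()

    # The recall/speed trade-off at query time is a setting you control:
    cur.execute("SET hnsw.ef_search = 100;")
    cur.execute(
        "SELECT id FROM items ORDER BY embedding <-> %s::vector LIMIT 10;",
        ("[0.1, 0.2, 0.3]",),  # query vector; dimension must match the column
    )
    print(cur.fetchall())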


After seeing raw source text performance, I agree that representation learning of higher-level semantic "context clusters", as you say, seems like an interesting direction.


For those who don't know: http://www.incompleteideas.net/IncIdeas/BitterLesson.html

I agree with you for the NLP domain, but I wonder if there will also be a bitter lesson learned about the perceived generality of language for universal applications.


I don't disagree with the premise that Google should be responsible here and should explicitly acknowledge that the average computer-interested person trying out BigQuery has no clue how sharp a knife it is; such users really do need to be protected from themselves, and I was in this boat only a few months ago. One thing I will say, though, is that the documentation is actually quite comprehensive, and after taking the time to RTFM and actually understand things like columnar storage, partitioned and clustered tables, etc., I was able to optimize costs quite a bit for our use case; I am quite pleased with the product overall. It just takes time to learn; it's a (necessarily, imo) intricate machine.
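
One concrete self-protection habit: dry-run queries before executing them, since BigQuery bills by bytes scanned. A sketch assuming the google-cloud-bigquery client (the project/dataset/table names are hypothetical):

    from google.cloud import bigquery

    client = bigquery.Client()
    # Filtering on the partitioning column prunes the scan to one partition:
    query = """
        SELECT user_id, event
        FROM `myproject.mydataset.events`
        WHERE DATE(event_ts) = '2024-01-01'
    """
    # A dry run reports bytes scanned without running (or billing) the query:
    job = client.query(query, job_config=bigquery.QueryJobConfig(dry_run=True))
    print(f"Would scan {job.total_bytes_processed} bytes")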


> to store and retrieve

retrieve != query

This project is extremely simplistic with regard to its vector search tech. pgvector is an open-source implementation of an _index_ (multiple algorithms, actually), whereas this uses Cloudflare's completely proprietary index behind a single call.


I am not sure what you mean specifically by 'overlapping'. But high-dimensional vector space is really "big" in a counterintuitive way: under the Euclidean norm, the contrast between the nearest and farthest neighbors shrinks as dimension grows (this is the curse of dimensionality), and that is something one has to think about regardless of how similar the source documents are. From reading Wikipedia, it seems it has been argued that the curse is worst with independent and uniformly distributed features.
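
You can see the concentration effect directly with a quick numpy experiment (illustrative, using uniform random data as in that argument):

    import numpy as np

    rng = np.random.default_rng(0)
    for d in (2, 10, 100, 1000):
        points = rng.uniform(size=(10_000, d))
        query = rng.uniform(size=d)
        dists = np.linalg.norm(points - query, axis=1)
        # The max/min distance ratio collapses toward 1 as d grows:
        print(d, round(dists.max() / dists.min(), 2))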


I think the confusion comes from the mixup between the words "database", "store", and "index". Vector "storage" is trivial: even for hundreds of millions of vectors you are still in the realm of what is possible on a single disk. A vector "index" that enables efficient aNN search is not trivial for large numbers of high-dimensional vectors, and that is usually the proposed value-add of someone providing a vector "database", which combines the two. I think this is also how the words are understood more generally. This project is a wrapper over Cloudflare's infrastructure, which does provide a vector index, though it is not clear how well their index performs in real-world use cases.
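
To make the store/index distinction concrete: exact brute-force search over a stored array is a few lines of numpy, but it touches every vector on every query, which is exactly the O(n) cost an aNN index exists to avoid (sizes here are illustrative):

    import numpy as np

    # The "store": just an array in RAM or on disk, trivial to hold.
    vectors = np.random.rand(100_000, 384).astype(np.float32)
    query = np.random.rand(384).astype(np.float32)

    # Exact search: one full pass over all n vectors per query. Fine here,
    # painful at hundreds of millions; that gap is the index's value-add.
    dists = np.linalg.norm(vectors - query, axis=1)
    top_k = np.argsort(dists)[:10]
    print(top_k)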


Young engineer here, in charge of a project and feeling quite out of my depth, and I agree with this. I currently have no mentorship, and a senior hire is a couple of months away. Do you have any advice? What does a senior engineer love/hate to see when they come onto a project started by engineers earlier in their career? How can I be most helpful?


My personal opinion on this topic is a little bit "postmodernist": we do not yet have a "scientific theory" of software engineering, of management, or of managing software projects (despite the fact that there exist people who claim otherwise). So there exists a multitude of "schools" of thought on how to handle these topics, often with quite conflicting opinions on what is "good" and "bad".

With this consideration in the back of your mind:

What might be helpful concerning your questions is to consider that many programming languages suggest a specific style of approaching a programming project, structuring the code, and often managing the team. There are good reasons why one talks of "Java shops", "C# shops", "Python shops", and so on: these programming languages often imply very different company cultures. Note that very prevalent programming languages can also have various "subcultures".

So what will likely be valued is a good understanding of the "desired programming style", "desired problem-solving approach", "desired management style", and so on that is encouraged by the "programming ecosystem" in which the company operates, and going by it.

This will likely yield a decent, conservative code base that can more easily be handed over to a more experienced programmer as soon as one becomes available. Especially in such a situation, it is in my opinion much more important not to make huge (including architectural) mistakes than to deliver the fanciest/most ideal code. (Note, though, that if you work at a startup with an "all or nothing" approach, i.e. if the product won't become great the startup has failed, or one in fear of running out of money, the priorities might differ quite a lot.)


Thank you for the response; it is helpful. I am not used to Python and know that I am not using the language well, so I think it is worth focusing on this.

> not to make huge (also architectural) mistakes

I'm noticing this is the first time I've had to make significant architectural decisions, which is difficult because I don't have much experience to draw from, so even the smallest decisions often require a lot of research.


I agree, vector search on music seems pretty obvious at this point.

