Not all improvements come from adding complexity — sometimes it's about removing it.
PageIndex takes a different approach to RAG. Instead of relying on vector databases or artificial chunking, it builds a hierarchical tree structure from documents and uses reasoning-based tree search to locate the most relevant sections. This mirrors how humans approach reading: navigating through sections and context rather than matching embeddings.
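A minimal sketch of what reasoning-based tree search over a document hierarchy might look like. All names here (`Node`, `choose_child`, `tree_search`) are hypothetical, and the branch-selection step is stubbed with naive keyword overlap standing in for an LLM call:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # One section of the document hierarchy: a title, a short summary,
    # and child sections.
    title: str
    summary: str
    children: list = field(default_factory=list)

def choose_child(node, query):
    # Stand-in for the reasoning step that picks which branch to follow.
    # A real system would ask an LLM; here we use crude word overlap
    # between the query and each child's title + summary.
    def overlap(child):
        q = set(query.lower().split())
        s = set((child.title + " " + child.summary).lower().split())
        return len(q & s)
    best = max(node.children, key=overlap, default=None)
    return best if best is not None and overlap(best) > 0 else None

def tree_search(root, query):
    # Descend the hierarchy until no child looks relevant;
    # return the path taken and the final node.
    path, node = [root.title], root
    while node.children:
        nxt = choose_child(node, query)
        if nxt is None:
            break
        path.append(nxt.title)
        node = nxt
    return path, node
```

The appeal is that the returned `path` is itself an explanation of why a section was retrieved.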
As a result, the retrieval feels transparent, structured, and explainable. It moves RAG away from approximate "semantic vibes" and toward explicit reasoning about where information lives. That clarity can help teams trust outputs and debug workflows more effectively.
The broader implication is that retrieval doesn't need to scale endlessly in vectors to be powerful. By leaning on document structure and reasoning, it reminds us that efficiency and human-like logic can be just as transformative as raw horsepower.
How is this not precisely "vibe retrieval", and arguably more approximate, where the approximation in this case is uncertainty over the LLM's precise reasoning at each step?
Converting to high-dimensional vectors and then running something like kNN on similarity seems significantly less approximate, and less "vibe"-based, than this.
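For contrast, the vector route being referenced is just deterministic similarity math. A toy sketch with made-up 3-dimensional "embeddings" (real systems use learned embeddings with hundreds or thousands of dimensions):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product over the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    mag = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / mag

def knn(query_vec, corpus, k=2):
    # Rank chunks by similarity to the query vector; keep the top k.
    ranked = sorted(corpus.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

# Toy corpus: chunk name -> pretend embedding.
corpus = {
    "pricing": [0.9, 0.1, 0.0],
    "refunds": [0.8, 0.3, 0.1],
    "careers": [0.0, 0.2, 0.9],
}
```

Given the same inputs, `knn([0.85, 0.2, 0.05], corpus)` returns `["pricing", "refunds"]` every time, which is the sense in which it is less "vibe"-based.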
This also appears to be completely predicated on pre-enriching the documents, adding structure through API calls to (in the example) OpenAI.
It doesn't at all seem accurate to:
1: Toss out mathematical similarity calculations
2: Add structure with LLMs
3: Use LLMs to traverse the structure
4: Label this as less vibe-ish
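Put concretely, the pipeline being critiqued looks something like the sketch below. Every name is a hypothetical placeholder, and `llm` stands in for a network call to a hosted model; the point is that both the structuring stage and the traversal stage depend on it:

```python
def build_structure(raw_docs, llm):
    # Step 2: add structure with an LLM -- one API call per document.
    # raw_docs maps doc_id -> full text; the LLM decides the outline.
    return {doc_id: llm(f"Outline this document:\n{text}")
            for doc_id, text in raw_docs.items()}

def retrieve(outlines, query, llm):
    # Step 3: use an LLM again to traverse the structure it produced.
    # Two LLM-dependent stages replace one similarity calculation.
    return llm(f"Query: {query}\nOutlines: {outlines}\nPick the best doc_id.")
```

Injecting `llm` as a parameter makes the dependency explicit: swap in a different model and both the structure and the traversal can change.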
Also, for any sufficiently large set of documents (or fine enough granularity on a smaller set), scaling will become problematic as the document structure itself approaches the context limit of the LLM doing the retrieval.
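A back-of-envelope check of that scaling concern, assuming roughly 50 tokens of outline per section, 2,000 tokens of prompt overhead, and a 128k-token context window (all three numbers are illustrative, not measured):

```python
def sections_that_fit(context_limit_tokens,
                      tokens_per_section=50,
                      prompt_overhead=2000):
    # How many section outlines fit into one retrieval prompt
    # alongside the fixed instructions and the query itself.
    return (context_limit_tokens - prompt_overhead) // tokens_per_section

# Under these assumptions a 128k window holds about 2,520 section
# outlines per call, while a corpus chunked into 10,000 sections
# would need ~500k tokens of outline alone.
```

Past that point you need pagination, multiple rounds of LLM calls, or coarser outlines, each of which adds cost and another place for the reasoning to go wrong.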