Good quick read. Worth it for the line "See these telephone numbers? I need data like that. I don't care how you get it; I'm just showing you this particular representation, because you're a programmer, and we rarely understand each other."
If I ever invent a time machine, I want to go back to my first year at the day job and say "When the boss says he wants a 'web service', he really just needs data from page X displayed on all these sidebars. Use an iframe and you won't be stuck in the office doing overtime for the next 3 months fighting Java's ridiculously obtuse frameworks."
The example given in this subpar submission is not unique to PDF. The problem would not be much easier if the resumes were in Word, RTF or even plain text. There are numerous tools to extract text from PDF documents. The hard part of this problem is finding discrete data in text that isn't in a standardized format.
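To make "non-standardized" concrete, here's a toy sketch (illustrative only, not the author's code): a pattern like this catches common US phone formats and quietly misses everything else.

```python
# Toy example: a pattern that handles common US phone formats but little else.
import re

# Matches e.g. 555-867-5309, 555.867.5309, (555) 867-5309
PHONE = re.compile(r"\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}")

sample = "Call (555) 867-5309 or 555.867.5309. Abroad: +44 20 7946 0958."
print(PHONE.findall(sample))
# -> ['(555) 867-5309', '555.867.5309']; the UK number is silently missed,
# and every resume formats contact details a little differently.
```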
It is difficult to determine whether the author is referring to PDF as an image format, or whether he actually means that the resume is an image inside the PDF file (as opposed to a text layer created through OCR). If it is the latter, it is again not a PDF problem. JPEG images would not make the problem less difficult.
I also don't like the attitude that non-technical people should know what is and isn't possible. That is our job. Many technical people also claim things are "impossible" when they simply don't want to do them. If a technical person spends weeks on this before realizing that it is an extremely difficult problem, then they are incompetent.
The problem would not be much easier if the resumes were in Word, RTF or even plain text. There are numerous tools to extract text from PDF documents.
That is not entirely true. PDF can be (and is frequently) generated in a way that doesn't even allow you to extract the sequence of words deterministically. A text, Word or RTF file always makes this possible (the pathological case of text embedded in images notwithstanding).
There are tools to extract text from PDF, but all of them rely on more or less reliable heuristics to recover the original order of words and letters, unless the PDF file was generated with particular settings that appear to be non-default in many tools.
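To see what those heuristics look like in practice, here's a minimal sketch assuming pdfminer.six (`pip install pdfminer.six`); the file name is a placeholder and the LAParams values are just knobs to play with:

```python
# A minimal sketch, assuming pdfminer.six; "resume.pdf" is a placeholder.
# PDF stores text as positioned glyph runs, so extract_pages has to apply
# layout heuristics (LAParams) to guess the reading order -- changing these
# knobs can change which fragments get joined into words and lines.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LAParams, LTTextContainer

laparams = LAParams(word_margin=0.1, line_margin=0.5)  # heuristic knobs
for page_layout in extract_pages("resume.pdf", laparams=laparams):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            # Each container is pdfminer's guess at a contiguous text block.
            print(round(element.y0), repr(element.get_text()))
```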
The fun comes in, of course, when the PDF really is the only available data source (which, thankfully, has been rare for me). Then you just have to hope you're dealing with standardized forms, or else you're in for some grief.
We created a digital archive of about 80 years' worth of magazine issues that were in all sorts of formats. The OCR worked pretty well in most cases, but we found that the older material worked better, due to there being limited typography and simpler layouts 80 years ago.
The more modern issues were only available in PDF and were the biggest challenge, but the OCR still did a reasonably good job, even with the complex layouts and fonts. The tricky bit was preserving the flow of the document, i.e. where to go when a column ends.
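If it helps anyone facing the same thing, Tesseract's own layout analysis can be coaxed into respecting columns; a hedged sketch with pytesseract and Pillow (not our actual pipeline, and "page.png" is a placeholder scan):

```python
# A hedged sketch, assuming pytesseract and Pillow. Tesseract's layout
# analysis assigns block numbers in roughly reading order, which helps
# finish one column before starting the next.
from PIL import Image
import pytesseract
from pytesseract import Output

img = Image.open("page.png")
data = pytesseract.image_to_data(img, output_type=Output.DICT)

blocks = {}
for i, word in enumerate(data["text"]):
    if word.strip():
        blocks.setdefault(data["block_num"][i], []).append(word)

# Emit block by block rather than scanning straight across the page.
for num in sorted(blocks):
    print(" ".join(blocks[num]))
```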
If you have PDFs that are scans of resumes (as in his example), then PDF text extraction is the least of your problems. It's actually extremely useful to automatically generate an index of the words in resumes if you have a lot of them, but you'll need OCR to do it.
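A rough sketch of that indexing step, assuming pdf2image (which needs poppler installed) and pytesseract (which needs the tesseract binary); the path is a placeholder:

```python
# A rough sketch, not production code: render scanned-resume PDFs to images
# (pdf2image/poppler), OCR them (pytesseract/tesseract), and build a word index.
from collections import Counter

import pytesseract
from pdf2image import convert_from_path

pages = convert_from_path("resumes/scan.pdf", dpi=300)  # placeholder path
index = Counter()
for page in pages:
    text = pytesseract.image_to_string(page)  # OCR one rendered page
    index.update(word.lower().strip(".,;:()") for word in text.split())

print(index.most_common(20))  # crude index of the words in this resume
```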
Recently I tried PyPDF (http://pybrary.net/pyPdf/) and was not happy with the results. Has anyone used an open-source tool they liked for parsing PDF, specifically for extracting the text?
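For reference, this is the sort of minimal usage I'm after, sketched here with pdfminer.six, which I haven't vetted seriously ("some.pdf" is a placeholder):

```python
# A minimal sketch, assuming pdfminer.six ("pip install pdfminer.six").
# Whether the output order is sane depends on the layout heuristics
# discussed upthread.
from pdfminer.high_level import extract_text

text = extract_text("some.pdf")
print(text[:500])
```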
As a soon-to-be graduating college student, this just makes me more and more sad that my tastefully yet interestingly laid-out resume will be reduced to a set of data that I could have spent 5 minutes putting into an email or a web form. I guess it's nice to hand out at a job fair though...