Good quick read. Worth it for the line "See these telephone numbers? I need data like that. I don't care how you get it; I'm just showing you this particular representation, because you're a programmer, and we rarely understand each other."
If I ever invent a time machine, I want to go back to my first year at the day job and say "When the boss says he wants a 'web service', he really just needs data from page X displayed on all these sidebars. Use an iframe and you won't be stuck in the office doing overtime for the next 3 months fighting Java's ridiculously obtuse frameworks."
The example given in this subpar submission is not unique to PDF. The problem would not be much easier if the resumes were in Word, RTF or even plain text. There are numerous tools to extract text from PDF documents. The hard part of this problem is finding discrete data in text that isn't in a standardized format.
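To make "non-standardized" concrete, here's a toy sketch (illustrative only, not the author's code): a pattern like this catches common US phone formats and quietly misses everything else.

```python
# Toy example: a pattern that handles common US phone formats but little else.
import re

# Matches e.g. 555-867-5309, 555.867.5309, (555) 867-5309
PHONE = re.compile(r"\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}")

sample = "Call (555) 867-5309 or 555.867.5309. Abroad: +44 20 7946 0958."
print(PHONE.findall(sample))
# -> ['(555) 867-5309', '555.867.5309']; the UK number is silently missed,
# and every resume formats contact details a little differently.
```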
It is difficult to determine whether the author is referring to PDF as an image format, or whether he actually means that the resume is an image inside the PDF file (as opposed to a text layer created through OCR). If it is the latter, it is again not a PDF problem. JPEG images would not make the problem less difficult.
I also don't like the attitude that non-technical people should know what is and isn't possible. That is our job. Many technical people also claim things are "impossible" when they simply don't want to do them. If a technical person spends weeks on this before realizing that it is an extremely difficult problem, then they are incompetent.
The problem would not be much easier if the resumes were in Word, RTF or even plain text. There are numerous tools to extract text from PDF documents.
That is not entirely true. PDF can be (and is frequently) generated in a way that doesn't even allow you to extract the sequence of words deterministically. A text, Word or RTF file always makes this possible (the pathological case of text embedded in images notwithstanding).
There are tools to extract text from PDF, but all of them rely on more or less reliable heuristics to recover the original order of words and letters, unless the PDF file was generated with particular settings that appear to be non-default in many tools.
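To see what those heuristics look like in practice, here's a minimal sketch assuming pdfminer.six (`pip install pdfminer.six`); the file name is a placeholder and the LAParams values are just knobs to play with:

```python
# A minimal sketch, assuming pdfminer.six; "resume.pdf" is a placeholder.
# PDF stores text as positioned glyph runs, so extract_pages has to apply
# layout heuristics (LAParams) to guess the reading order -- changing these
# knobs can change which fragments get joined into words and lines.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LAParams, LTTextContainer

laparams = LAParams(word_margin=0.1, line_margin=0.5)  # heuristic knobs
for page_layout in extract_pages("resume.pdf", laparams=laparams):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            # Each container is pdfminer's guess at a contiguous text block.
            print(round(element.y0), repr(element.get_text()))
```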
The fun comes in, of course, when the PDF really is the only available data source (which, thankfully, has been rare for me). Then you just have to hope you're dealing with standardized forms, or else you're in for some grief.
We created a digital archive of about 80 years' worth of magazine issues that were in all sorts of formats. The OCR worked pretty well in most cases, but we found that the older material worked better, due to there being limited typography and simpler layouts 80 years ago.
The more modern issues were only available in PDF and were the biggest challenge, but the OCR still did a reasonably good job, even with the complex layouts and fonts. The tricky bit was preserving the flow of the document, i.e. where to go when a column ends.
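If it helps anyone facing the same thing, Tesseract's own layout analysis can be coaxed into respecting columns; a hedged sketch with pytesseract and Pillow (not our actual pipeline, and "page.png" is a placeholder scan):

```python
# A hedged sketch, assuming pytesseract and Pillow. Tesseract's layout
# analysis assigns block numbers in roughly reading order, which helps
# finish one column before starting the next.
from PIL import Image
import pytesseract
from pytesseract import Output

img = Image.open("page.png")
data = pytesseract.image_to_data(img, output_type=Output.DICT)

blocks = {}
for i, word in enumerate(data["text"]):
    if word.strip():
        blocks.setdefault(data["block_num"][i], []).append(word)

# Emit block by block rather than scanning straight across the page.
for num in sorted(blocks):
    print(" ".join(blocks[num]))
```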
If you have PDFs that are scans of resumes (as in his example), then PDF text extraction is the least of your problems. It's actually extremely useful to automatically generate an index of the words in resumes if you have a lot of them, but you'll need OCR to do it.
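A rough sketch of that indexing step, assuming pdf2image (which needs poppler installed) and pytesseract (which needs the tesseract binary); the path is a placeholder:

```python
# A rough sketch, not production code: render scanned-resume PDFs to images
# (pdf2image/poppler), OCR them (pytesseract/tesseract), and build a word index.
from collections import Counter

import pytesseract
from pdf2image import convert_from_path

pages = convert_from_path("resumes/scan.pdf", dpi=300)  # placeholder path
index = Counter()
for page in pages:
    text = pytesseract.image_to_string(page)  # OCR one rendered page
    index.update(word.lower().strip(".,;:()") for word in text.split())

print(index.most_common(20))  # crude index of the words in this resume
```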
Recently I tried PyPDF (http://pybrary.net/pyPdf/) and was not happy with the results. Has anyone used an open-source tool they liked for parsing PDF, specifically for extracting the text?
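For reference, this is the sort of minimal usage I'm after, sketched here with pdfminer.six, which I haven't vetted seriously ("some.pdf" is a placeholder):

```python
# A minimal sketch, assuming pdfminer.six ("pip install pdfminer.six").
# Whether the output order is sane depends on the layout heuristics
# discussed upthread.
from pdfminer.high_level import extract_text

text = extract_text("some.pdf")
print(text[:500])
```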
As a soon-to-be graduating college student, this just makes me more and more sad that my tastefully yet interestingly laid-out resume will be reduced to a set of data that I could have spent 5 minutes putting into an email or a web form. I guess it's nice to hand out at a job fair though...