
My favorite technique is:

wget URL > HTML
tidy HTML > XHTML
xslt [identity transform based content extraction] XHTML > XML
XML > DB

The whole process is glued together with Perl or shell scripts. Depending on how you construct your content extraction, this technique can weather many of the inevitable content style changes, and it is easy to adjust when changes do need to be made.
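As a rough illustration of the last two stages (extraction, then loading into a database), here is a minimal Python sketch. It is not the XSLT identity transform described above: it swaps in the standard library's ElementTree for the extraction and sqlite3 for the database, and it assumes the wget and tidy stages have already produced well-formed XHTML. The sample page, the "story" class name, the table layout, and the function names are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Stand-in for the output of the wget and tidy stages (hypothetical page).
SAMPLE_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div id="nav">Home | About</div>
    <div class="story"><h2>First title</h2><a href="http://example.com/1">link</a></div>
    <div class="story"><h2>Second title</h2><a href="http://example.com/2">link</a></div>
  </body>
</html>"""


def extract_stories(xhtml_text):
    """Return (title, url) pairs for every div marked class="story"."""
    root = ET.fromstring(xhtml_text)
    rows = []
    # Match on a semantic hook (the class name) rather than on position in
    # the page, so layout changes around the stories do not break extraction.
    for div in root.iterfind(".//x:div[@class='story']", XHTML_NS):
        title = div.findtext("x:h2", default="", namespaces=XHTML_NS).strip()
        link = div.find("x:a", XHTML_NS)
        url = link.get("href", "") if link is not None else ""
        rows.append((title, url))
    return rows


def load_rows(conn, rows):
    """Insert the extracted pairs into a (hypothetical) stories table."""
    conn.execute("CREATE TABLE IF NOT EXISTS stories (title TEXT, url TEXT)")
    conn.executemany("INSERT INTO stories (title, url) VALUES (?, ?)", rows)
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_rows(conn, extract_stories(SAMPLE_XHTML))
    for row in conn.execute("SELECT title, url FROM stories"):
        print(row)
```

The selector is the part that decides how well this weathers site changes: keying on a class or id tends to survive redesigns better than a path that counts nested elements.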


