wget URL > HTML
tidy HTML > XHTML
xslt [identity transform based content extraction] XHTML > XML
XML > DB
The whole process is glued together with Perl or shell scripts. Depending on how you construct your content extraction, this technique can weather many of the inevitable content style changes, and it is easy to adjust when changes do need to be made.
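As a rough illustration, the four steps could be glued together with a shell script along these lines. This is only a sketch under several assumptions that are not in the steps above: xsltproc as the XSLT processor, the sqlite3 command-line shell as the database, a hypothetical `pages` table, and a hypothetical `div id="content"` wrapper around the content you want. The function names `write_stylesheet` and `run_pipeline` are made up for the example.

```shell
#!/bin/sh
# Sketch of the wget > tidy > xslt > DB pipeline.
# Assumes wget, HTML Tidy, xsltproc and the sqlite3 CLI are installed.
set -eu

# Write a hypothetical identity-transform stylesheet to the path given in $1.
write_stylesheet() {
    cat > "$1" <<'EOF'
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:h="http://www.w3.org/1999/xhtml">
  <xsl:output method="xml" indent="yes"/>

  <!-- Identity transform: copy every node and attribute unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>

  <!-- Overrides: drop markup that is never content. -->
  <xsl:template match="h:script|h:style"/>

  <!-- Start at the content container (the id is an assumption). -->
  <xsl:template match="/">
    <page><xsl:apply-templates select="//h:div[@id='content']"/></page>
  </xsl:template>
</xsl:stylesheet>
EOF
}

# Fetch URL ($1), clean it, extract content, store the XML in SQLite file ($2).
run_pipeline() {
    url=$1
    db=$2
    work=$(mktemp -d)

    # wget URL > HTML
    wget -q -O "$work/page.html" "$url"

    # tidy HTML > XHTML (tidy exits with 1 on warnings, which is normal)
    tidy -q -asxhtml -numeric -o "$work/page.xhtml" "$work/page.html" \
        || [ "$?" -eq 1 ]

    # xslt XHTML > XML (--novalid skips loading the XHTML DTD)
    write_stylesheet "$work/extract.xsl"
    xsltproc --novalid -o "$work/page.xml" "$work/extract.xsl" "$work/page.xhtml"

    # XML > DB (double any single quotes in the URL for the SQL literal)
    url_sql=$(printf '%s' "$url" | sed "s/'/''/g")
    sqlite3 "$db" <<EOF
CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, fetched_at TEXT, xml TEXT);
INSERT OR REPLACE INTO pages
VALUES ('$url_sql', datetime('now'), CAST(readfile('$work/page.xml') AS TEXT));
EOF

    rm -rf "$work"
}

# Run only when both arguments are supplied: ./scrape.sh URL DBFILE
if [ "$#" -eq 2 ]; then
    run_pipeline "$1" "$2"
fi
```

Because the stylesheet starts from the identity transform and only adds small override templates, a change in the page's styling usually means editing one match pattern rather than rewriting the extraction.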