
The author has also written related tools: one to convert XML to JSON and back (https://github.com/ldn-softdev/jtm) and another to convert JSON to SQLite tables (https://github.com/ldn-softdev/jsl). Combining these with the hxnormalize tool (https://www.w3.org/Tools/HTML-XML-utils/man1/hxnormalize.htm...), one can do very sophisticated manipulation of HTML web pages.

HTML -> XML (via hxnormalize) -> JSON (via jtm) -> process using jtc (or even jq)
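To make the stages concrete, here is a minimal Python sketch of the same shape of pipeline. It does not use hxnormalize, jtm, or jtc; it is a standard-library stand-in that assumes the HTML has already been normalized into well-formed XML. The sample markup and the helper names `element_to_json` and `find_all` are made up for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Assumption: this is the output of the normalization stage (well-formed XML).
XHTML = """<html><body>
<ul>
  <li><a href="/a">First</a></li>
  <li><a href="/b">Second</a></li>
</ul>
</body></html>"""

def element_to_json(elem):
    """Convert an element to a JSON-friendly dict, keeping child order.

    Tail text (text after a closing tag) is ignored in this sketch.
    """
    return {
        "tag": elem.tag,
        "attrs": dict(elem.attrib),
        "text": (elem.text or "").strip(),
        "children": [element_to_json(child) for child in elem],
    }

def find_all(node, tag):
    """Yield every node in the converted tree with the given tag name."""
    if node["tag"] == tag:
        yield node
    for child in node["children"]:
        yield from find_all(child, tag)

# XML -> JSON-like structure -> processing step
tree = element_to_json(ET.fromstring(XHTML))
links = [(n["attrs"]["href"], n["text"]) for n in find_all(tree, "a")]
print(json.dumps(links))
```

Storing children as an ordered list, rather than keying them by tag name, is what lets this representation keep repeated elements such as the two `li` entries.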



> convert XML to JSON and back

This is basically impossible to do in a way that is compatible with other tools. Things like repeated child elements with the same name can exist in XML, but duplicate keys in a JSON object cannot be relied on. You can still work around these limitations if your whole pipeline uses the same toolset, but part of the point of these tools is to then convert the result back to a format that some other tool can use, which is where this pattern breaks down.

Here's a list of pitfalls: https://stackoverflow.com/questions/33072812/potential-probl...
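One concrete instance of the mismatch, as a hedged Python sketch: repeated sibling elements are legal in XML, but a naive tag-to-value mapping (the obvious JSON object shape) can hold only one value per key. The converter below is a deliberately simplistic illustration, not how jtm or any particular tool works; `naive_to_dict` and `dict_to_xml` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

XML = "<order><item>apple</item><item>pear</item><note>gift</note></order>"

def naive_to_dict(elem):
    """Map each child tag to its text; later duplicates overwrite earlier ones."""
    return {child.tag: child.text for child in elem}

def dict_to_xml(root_tag, mapping):
    """Rebuild XML from the flat mapping produced by naive_to_dict."""
    root = ET.Element(root_tag)
    for tag, text in mapping.items():
        ET.SubElement(root, tag).text = text
    return ET.tostring(root, encoding="unicode")

as_dict = naive_to_dict(ET.fromstring(XML))
round_tripped = dict_to_xml("order", as_dict)

print(as_dict)        # the first <item> is gone
print(round_tripped)  # so the round trip no longer matches the input
```

Real converters avoid this particular loss by turning repeated elements into arrays or by using a more verbose encoding, but each tool picks its own convention, which is why output from one converter is hard to feed into another.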


This suggests a very scalable, easy approach to extracting data from somewhat regular HTML...


I generally use xidel [1] for that type of task. Feed it xpath, css selectors or its own pattern matching thing.

[1] https://github.com/benibela/xidel


or just use xpath
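A minimal sketch of the XPath route in Python, assuming the page is already well-formed XML (raw HTML usually is not). It uses the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath; full XPath 1.0 needs a separate library such as lxml. The table markup is invented for the example.

```python
import xml.etree.ElementTree as ET

# Assumption: already-normalized, well-formed markup.
XHTML = """<html><body>
<table id="prices">
  <tr class="row"><td>Widget</td><td>4.50</td></tr>
  <tr class="row"><td>Gadget</td><td>12.00</td></tr>
  <tr class="total"><td>Total</td><td>16.50</td></tr>
</table>
</body></html>"""

root = ET.fromstring(XHTML)

# Select only the data rows of the table with id="prices".
rows = root.findall(".//table[@id='prices']/tr[@class='row']")

# First cell is the product name, second cell is the price.
prices = {row[0].text: float(row[1].text) for row in rows}
print(prices)
```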



