It's part of HTML. https://en.wikipedia.org/wiki/Character_encodings_in_HTML

bch · on Oct 7, 2014

Thanks. I'm trying to get cURL to decode it, but it doesn't seem to handle this natively. Now I'm digging into .../escape.c

I feel like I must be missing something... :/

tlb · on Oct 7, 2014

You can't simply decode each character without losing information. For example, &lt; means a literal < character to be shown on the page, as opposed to a < in the stream, which starts an HTML tag.

If you're just planning on displaying the text in a browser, no decoding is needed. If you want to parse the text to do some sort of textual analysis, an HTML parser library might be best.
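To make the point above concrete, here is a minimal sketch in Python (my choice of language, not something from the thread; the sample string and the TextOnly class are made up for illustration). It shows that a blind decode turns the escaped < into something indistinguishable from real markup, while a parser keeps the two apart:

```python
import html
from html.parser import HTMLParser

# Hypothetical sample: one escaped "<" (text) and one real tag (markup).
raw = "1 &lt; 2 is <b>true</b>"

# Blind decode: the literal "<" and the tag's "<" now look the same.
decoded = html.unescape(raw)
print(decoded)

class TextOnly(HTMLParser):
    """Collects text nodes; character references are decoded by the parser."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

parser = TextOnly()
parser.feed(raw)
parser.close()  # flush any buffered trailing text
text = "".join(parser.parts)
print(text)
```

The first print yields a string where nothing marks which < was text, which is the information loss described above. The parser route drops the tags and decodes only the text nodes.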

bch · on Oct 7, 2014

I understand what you're talking about re: &lt; and '<' -- the json -looks- page (terminal in my case) displayable, barring the &#xhhhh; encoding. cURL has facilities for decoding %20 (for example), but not what we're getting back w/ this json.

You've given me an idea though, so back to vi for me.

Thx.
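For anyone following along, the two escaping schemes mentioned above (%20 vs. &#xhhhh;) are unrelated, which is why a URL decoder won't touch the latter. A rough sketch in Python, assuming a made-up sample string (this is not the actual JSON from the API):

```python
import html
from urllib.parse import unquote

# Hypothetical sample mixing percent-escapes and HTML character references.
sample = "caf&#xe9;%20au%20lait &#x27;quoted&#x27;"

url_only = unquote(sample)         # handles %20, leaves &#x...; alone
html_only = html.unescape(sample)  # handles &#x...;, leaves %20 alone
both = html.unescape(unquote(sample))

print(url_only)
print(html_only)
print(both)
```

Percent-decoding is the part cURL's escape helpers cover; the &#xhhhh; references need an HTML-aware decoder on top.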

mh- · on Oct 7, 2014

not sure if you figured something out already, but just saw your comment and remembered that this exists in PHP:

http://us1.php.net/manual/en/function.get-html-translation-t...

absent another source, you could dump it out for your usage elsewhere.

  % php -r 'print_r(get_html_translation_table(HTML_ENTITIES, ENT_QUOTES|ENT_HTML5));'

or

  % php -r 'print json_encode(get_html_translation_table(HTML_ENTITIES, ENT_QUOTES|ENT_HTML5));' | jq .

edit: just found http://dev.w3.org/html5/html-author/charref (but might be harder to parse..)
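If PHP isn't handy, Python's standard library ships a similar named-entity table (html.entities.html5), so the same kind of dump can be sketched there. The "&" prefixing and the variable names below are my own choices for illustration, not an equivalent of the PHP output format:

```python
import json
from html.entities import html5

# html5 maps names like "lt;" to their characters; prefix "&" so the keys
# read like the references that appear in the text.
table = {"&" + name: char for name, char in html5.items()}

dumped = json.dumps(table)  # non-ASCII characters are \u-escaped by default
print(len(table), "entries")
print(table["&lt;"], table["&amp;"])
```

The resulting JSON can be written to a file and loaded from whatever language is doing the decoding. Note that the table covers named references only; numeric ones like &#xhhhh; still have to be converted from the code point directly.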