Parsing HTML with regex

pmjv@lemmy.sdf.org · 1 year ago

Parsing HTML with regex

MonkderZweite@feddit.ch · edit-2 1 year ago

Actually, you can’t even reliably parse HTML(5) with specialized tools, or by converting it and then running XML linters on the result (they quit because of too many errors). The only tools that can (mostly) reliably parse HTML are the big 3 browser engines. That’s my experience from converting saved webpages to Asciidoctor: it still involves manual cleanup, despite tidy and pandoc.

kevincox@lemmy.ml · 1 year ago

This isn’t true. HTML5 defined a very strict set of parsing rules and there are a fair number of compliant parsers. But yes, you absolutely can’t use an XML parser. You can’t even use an XML emitter, as you can emit valid XML that means something completely different in HTML.

…what a fucking disaster. I still wish XHTML won.
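
A rough sketch of the emitter problem, using only Python's standard `xml.etree.ElementTree` (the `body`/`div`/`p` tree is just a made-up example). The XML serializer writes an empty `div` as `<div />`, which is fine as XML. As far as I understand the HTML5 parsing rules, though, the trailing slash is ignored on non-void elements like `div`, so an HTML parser would see an opening `<div>` and put the following `<p>` inside it. Serializing with `method="html"` avoids that by writing an explicit end tag.

```python
import xml.etree.ElementTree as ET

# Hypothetical example tree: an empty <div> followed by a <p> sibling.
body = ET.Element("body")
ET.SubElement(body, "div")
p = ET.SubElement(body, "p")
p.text = "text"

# Default XML serialization: the empty element is written self-closed.
xml_out = ET.tostring(body, encoding="unicode")

# HTML serialization: non-void elements always get an explicit end tag.
html_out = ET.tostring(body, encoding="unicode", method="html")

print(xml_out)   # <body><div /><p>text</p></body>
print(html_out)  # <body><div></div><p>text</p></body>
```

The first output is valid XML where `div` and `p` are siblings. Fed to an HTML5 parser, it would (under the assumption above) be read as a `div` containing the `p`. The second output should mean the same thing in both worlds.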

piecat@lemmy.world · 1 year ago

Real question, why? I feel like there’s a story there