Title: Parsing <noscript> tags with goquery
Date: 2024-12-13 22:15

As I ditched Thunderbird for [miniflux](https://github.com/miniflux/v2/) as my main [RSS reader](https://en.wikipedia.org/wiki/News_aggregator), I've [spent quite some time](https://github.com/miniflux/v2/graphs/contributors) improving it. I was casually browsing its code when I stumbled upon [the following regex](https://github.com/miniflux/v2/blob/c3649bd6b1d89d52162b198c5019cf7bc69dc6eb/internal/reader/rewrite/rewrite_functions.go#L26): ```imgRegex = regexp.MustCompile(`<img [^>]+>`)```, used in a [single place](https://github.com/miniflux/v2/blob/c3649bd6b1d89d52162b198c5019cf7bc69dc6eb/internal/reader/rewrite/rewrite_functions.go#L151C1-L159C5):

```go
doc.Find("noscript").Each(func(i int, noscript *goquery.Selection) {
	matches := imgRegex.FindAllString(noscript.Text(), 2)

	if len(matches) == 1 {
		changed = true

		noscript.ReplaceWithHtml(matches[0])
	}
})
```

This looks like a [terrible idea](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags), and shouldn't be hard to replace with something better like this:

```go
doc.Find("noscript").Each(func(i int, noscript *goquery.Selection) {
	if img := noscript.Find("img"); img.Length() == 1 {
		img.Unwrap()
		changed = true
	}
})
```

Unfortunately, it didn't work, and led to a significant amount of time being ~~wasted~~ spent trying to debug/understand what was going on.

Turns out goquery is using [cascadia](https://github.com/andybalholm/cascadia/issues/14), which in turn uses Go's `x/net/html`, which parses HTML with [scripting enabled](https://html.spec.whatwg.org/multipage/webappapis.html#enabling-and-disabling-scripting), making it not play nice with `<noscript>` tags in `<head>`, meh. With scripting enabled, the content of a `<noscript>` element is kept as plain text instead of being parsed into child nodes, so `noscript.Find("img")` has nothing to match.

In [goquery's issue](https://github.com/PuerkitoBio/goquery/issues/139) on the topic, [someone suggested](https://github.com/PuerkitoBio/goquery/issues/139#issuecomment-517526070) using something like this to populate `<noscript>`'s HTML content:

```go
root.Find(`noscript`).Each(func(i int, noscript *goquery.Selection) {
	noscript.SetHtml(noscript.Text())
})
```

Unfortunately, this didn't work for me. A horrible alternative would be to use `x/net/html` or `goquery` to manually parse `noscript.Html()`, but this would be ridiculously overkill; surely there is a better way.

`ParseOptionEnableScripting`'s [documentation](https://pkg.go.dev/golang.org/x/net/html#ParseOptionEnableScripting) doesn't say anything about the `<head>` context, and by looking at the history of `html/parse.go`, we can see that [namusyaka](https://namusyaka.com/) implemented `<noscript>` parsing in `<body>` as well in [December 2019](https://go-review.googlesource.com/c/net/+/210318)!

So the proper solution is this simple diff:

```diff
- doc, err := goquery.NewDocumentFromReader(strings.NewReader(entryContent))
+ parserHtml, err := nethtml.ParseWithOptions(strings.NewReader(entryContent), nethtml.ParseOptionEnableScripting(false))
+ doc := goquery.NewDocumentFromNode(parserHtml)
```

The corresponding miniflux pull-request can be found [here](https://github.com/miniflux/v2/pull/3011), no more ugly regex!

May this little blogpost prevent others from wasting as much time as I did.