Title: Parsing &lt;noscript&gt; tags with goquery
Date: 2024-12-13 22:15

As I ditched Thunderbird for [miniflux](https://github.com/miniflux/v2/) as
my main [RSS reader](https://en.wikipedia.org/wiki/News_aggregator), I've [spent
quite some time](https://github.com/miniflux/v2/graphs/contributors) improving it.

I was casually browsing its code when I stumbled upon [the following
regex](https://github.com/miniflux/v2/blob/c3649bd6b1d89d52162b198c5019cf7bc69dc6eb/internal/reader/rewrite/rewrite_functions.go#L26):
```imgRegex = regexp.MustCompile(`<img [^>]+>`)```, used in a [single place](https://github.com/miniflux/v2/blob/c3649bd6b1d89d52162b198c5019cf7bc69dc6eb/internal/reader/rewrite/rewrite_functions.go#L151C1-L159C5):

```go
doc.Find("noscript").Each(func(i int, noscript *goquery.Selection) {
    matches := imgRegex.FindAllString(noscript.Text(), 2)

    if len(matches) == 1 {
        changed = true

        noscript.ReplaceWithHtml(matches[0])
    }
})
```

This looks like a [terrible
idea](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags),
and shouldn't be hard to replace with something better like this:

```go
doc.Find("noscript").Each(func(i int, noscript *goquery.Selection) {
    if img := noscript.Find("img"); img.Length() == 1 {
        img.Unwrap()
        changed = true
    }
})
```

Unfortunately, it didn't work, and led to a significant amount of time being
<s>wasted</s> spent trying to debug and understand what was going on.
It turns out goquery uses [cascadia](https://github.com/andybalholm/cascadia/issues/14), which
in turn uses Go's `x/net/html`, which parses HTML with [scripting
enabled](https://html.spec.whatwg.org/multipage/webappapis.html#enabling-and-disabling-scripting).
With scripting enabled, the content of a `<noscript>` tag ends up as plain text
instead of child elements, so there is no `img` node to find. An
[issue](https://github.com/golang/go/issues/16318) was opened upstream in
July 2016, and closed by [a fix](https://go-review.googlesource.com/c/net/+/174157) from April 2019, but
unfortunately that fix only works for `<noscript>` tags in `<head>`, meh.

In [goquery's issue](https://github.com/PuerkitoBio/goquery/issues/139) on the
topic, [someone suggested](https://github.com/PuerkitoBio/goquery/issues/139#issuecomment-517526070)
to use something like this, to populate `<noscript>`'s html content:

```go
root.Find(`noscript`).Each(func(i int, noscript *goquery.Selection) {
    noscript.SetHtml(noscript.Text())
})
```

Unfortunately, this didn't work for me. A horrible alternative would be to use
`x/net/html` or `goquery` to manually parse `noscript.Html()`, but this would
be ridiculous overkill; surely there is a better way.
`ParseOptionEnableScripting`'s
[documentation](https://pkg.go.dev/golang.org/x/net/html#ParseOptionEnableScripting)
doesn't say anything about `<head>` context, and by looking at the history of
`html/parse.go`, we can see that [namusyaka](https://namusyaka.com/)
implemented `<noscript>` parsing in `<body>` as well in [December
2019](https://go-review.googlesource.com/c/net/+/210318)! So the proper
solution is this simple diff:

```diff
-       doc, err := goquery.NewDocumentFromReader(strings.NewReader(entryContent))
+       parserHtml, err := nethtml.ParseWithOptions(strings.NewReader(entryContent), nethtml.ParseOptionEnableScripting(false))
+       doc := goquery.NewDocumentFromNode(parserHtml)
```

The corresponding miniflux pull-request can be found
[here](https://github.com/miniflux/v2/pull/3011), no more ugly regex! May this
little blogpost prevent others from wasting as much time as I did.

