Title: MAT2 0.8.0
Date: 2019-02-28 00:01

There is a new minor version of MAT2, the [0.8.0](https://0xacab.org/jvoisin/mat2/tags/0.8.0),
with a single new feature: support for the epub format!
This release is super-close to the previous one, because the [Debian Buster
freeze](https://release.debian.org/buster/freeze_policy.html) is near,
and some people were really eager to have epub support in mat2 on it, so I
wrote the code as fast as I could.

# Changelog

- Add support for epub files
- Fix the setup.py file crashing on non-utf8 platforms
- Improve css support
- Improve html support

# Debugging an annoying issue on Debian

While adding support for epub, I stumbled upon an interesting issue: everything
was working great, except on the Debian instances of the CI. I tried to
reproduce the issue in a [debootstrap](https://wiki.debian.org/Debootstrap),
but didn't manage to: the testsuite was working. I tried inside a virtual
machine: same behaviour, everything was green.

So I added a lot of calls to `print` everywhere, to see what was going on in
the CI, and this finally boiled down to the infamous
`UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1210: ordinal not in range(128)`.
Why was this exception silently ignored? Well, it's because it's a
subclass of `UnicodeError`, which is itself a subclass of `ValueError`, which
is the exception raised internally by mat2 when something goes wrong with the
parsing of a file.
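
To illustrate how this kind of exception can vanish, here is a minimal sketch. The `parse` and `is_supported` functions are hypothetical stand-ins, not mat2's actual code; only the exception hierarchy is taken from Python itself.

```python
# The inheritance chain described above: both checks hold in Python 3.
assert issubclass(UnicodeDecodeError, UnicodeError)
assert issubclass(UnicodeError, ValueError)


def parse(data: bytes) -> str:
    """Hypothetical parser that decodes as ASCII, like a platform defaulting to US-ASCII."""
    return data.decode('ascii')


def is_supported(data: bytes) -> bool:
    """Hypothetical caller that treats any ValueError as 'file could not be parsed'."""
    try:
        parse(data)
    except ValueError:
        # A UnicodeDecodeError lands here too, so the real cause is hidden.
        return False
    return True
```

With this shape, a file containing the byte `0xc2` is reported as unparseable, with no hint that the encoding is to blame.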

The fix was simply to specify that mat2 should always use `utf-8`, by adding
`encoding='utf-8'` to every call to methods/functions related to file-content
manipulation, because the Debian instance in the CI apparently doesn't express
any preference in its environment saying that programs shouldn't use
US-ASCII by default for everything.
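
As a rough sketch of what this looks like in practice (the helper names and the sample file are made up for the example, they aren't from mat2): when `encoding` is omitted, `open()` falls back to a locale-dependent default, so passing it explicitly makes the behaviour the same everywhere.

```python
import os
import tempfile


def write_text(path: str, text: str) -> None:
    # Explicit encoding: don't rely on the platform's preferred encoding.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)


def read_text(path: str) -> str:
    # Same on the read side, otherwise the locale decides how bytes are decoded.
    with open(path, encoding='utf-8') as f:
        return f.read()


def demo() -> str:
    """Round-trip a non-ASCII string through a temporary file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'sample.txt')
        write_text(path, 'caf\u00e9')
        return read_text(path)
```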

Moreover, to avoid losing time again, mat2 now displays the content of the
exception instead of silently swallowing it.
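
Something along these lines, as a sketch only: `parse` and `try_parse` are hypothetical names, and the use of `logging` is an assumption for the example, not necessarily what mat2 does.

```python
import logging


def parse(data: bytes) -> str:
    """Hypothetical stand-in for a parser that can raise ValueError."""
    return data.decode('ascii')


def try_parse(data: bytes):
    """Return (text, None) on success, or (None, error message) on failure."""
    try:
        return parse(data), None
    except ValueError as e:
        # Surface the message instead of dropping it on the floor.
        logging.warning('unable to parse: %s', e)
        return None, str(e)
```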

# Implementing epub support

The [previous version]({filename}/metadata/mat2_0.7.0.md) added html support. This
was done to support epub, since this format is basically a bunch of html/css
files stitched together in a zip archive. I thought this would be pretty easy
to implement. I was wrong.
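
To make the "html/css files in a zip archive" part concrete, here is a small sketch using only the stdlib. The archive built here is a fake with made-up file names, just enough to show the layout; it is not a valid epub and not how mat2 handles them.

```python
import io
import zipfile


def build_fake_epub() -> bytes:
    """Build an in-memory archive loosely shaped like an epub (illustrative only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as z:
        # Real epub files start with a `mimetype` entry holding this value.
        z.writestr('mimetype', 'application/epub+zip')
        z.writestr('META-INF/container.xml', '<container/>')
        z.writestr('OEBPS/chapter1.xhtml', '<html><body><p>Hi</p></body></html>')
        z.writestr('OEBPS/style.css', 'p { color: black; }')
    return buf.getvalue()


def members_by_extension(epub: bytes, extensions: tuple) -> list:
    """List archive members whose name ends with one of the given extensions."""
    with zipfile.ZipFile(io.BytesIO(epub)) as z:
        return [name for name in z.namelist() if name.endswith(extensions)]
```

Each of those html and css members is something that might carry metadata, hence the need for a parser.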

Python has a [html parser](https://docs.python.org/3/library/html.parser.html) in its stdlib, but:

- It's implemented via [regular expressions](https://github.com/python/cpython/blob/3.7/Lib/html/parser.py#L20), which is a [notoriously bad idea](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags).
- It's non-validating, meaning that you have to implement validation on top of
	it. It was a great amount of <s>pain</s> fun to write one.
- There is a `get_starttag_text` method to get the text of an opening tag, but there is
	no `get_endtag_text`, so you have to somehow cache the opening tags in a LIFO
	to be able to turn them into closing ones when needed.
- Non-validating __and__ not really made for export is a nice combo, because
	writing a state-machine to modify and validate a pseudo-xml document you're
	iterating on convinced me that having a drawing board in my room is a
	valuable investment of space, time, and not becoming crazy.
- The bulk of its code was written by Guido himself, in
	[2001](https://github.com/python/cpython/commit/8846d7178b8caf1411ca6f458b78b9f46ba73abe#diff-a07dd7eb9cb779be7f57ea2282a94d96),
	with only small bugfixes and no major cleanup/overhaul in 18 years.
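
To give an idea of the LIFO dance described above, here is a stripped-down sketch built on `html.parser.HTMLParser`. The `Rewriter` class, its `VOID` set and the `rewrite` helper are invented for this example; mat2's real implementation is more involved and this should not be read as its code.

```python
from html.parser import HTMLParser


class Rewriter(HTMLParser):
    """Re-emit a document while checking that tags are properly nested."""

    # Small, non-exhaustive set of elements that never get a closing tag.
    VOID = {'br', 'hr', 'img', 'meta', 'link', 'input'}

    def __init__(self):
        super().__init__()
        self.out = []    # pieces of the rewritten document
        self.stack = []  # LIFO of currently open tags

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags like <br/> are emitted as-is and never pushed.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.VOID:
            return
        if not self.stack:
            raise ValueError('closing tag without opening tag: ' + tag)
        expected = self.stack.pop()
        if expected != tag:
            raise ValueError('expected </%s>, got </%s>' % (expected, tag))
        # There is no get_endtag_text, so the closing tag is rebuilt by hand.
        self.out.append('</' + tag + '>')

    def handle_data(self, data):
        self.out.append(data)

    def finalize(self) -> str:
        self.close()  # flush any buffered text
        if self.stack:
            raise ValueError('unclosed tags: ' + ', '.join(self.stack))
        return ''.join(self.out)


def rewrite(document: str) -> str:
    parser = Rewriter()
    parser.feed(document)
    return parser.finalize()
```

A well-nested snippet comes out unchanged, while something like `<p><b>oops</p>` raises a `ValueError`. Comments, doctypes, entities and attribute filtering are all left out here, and that's where most of the fun was.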

Moreover, the [epub specification](https://www.w3.org/publishing/groups/epub3-cg/)
has different versions, each of them more or less correctly implemented by e-readers.
I might have written some quick-and-dirty Python scripts to upload a bunch of random
epub files to various online validators, scrape their answers, and diff them
against the answers for the same files once cleaned up by mat2.

So, yeah, this was fun.

# Conclusion

As usual, help is [more than welcome](https://0xacab.org/jvoisin/mat2/issues?label_name%5B%5D=good+first+issue).