Title: MAT2 0.9.0
Date: 2019-05-10 22:30

There is a new minor version of MAT2,
[0.9.0](https://0xacab.org/jvoisin/mat2/tags/0.9.0), with support for
tar/tar.gz/tar.bz2/tar.xz files as its major new feature.  It might also be the
very last one before the almighty 1.0, maybe, who knows…

# Changelog

- Add tar/tar.gz/tar.bz2/tar.xz archives support
- Add support for xhtml files
- Improve handling of read-only files
- Slightly improve the command line's documentation
- Fix a confusing error message
- Add even more tests
- Fix a possible mp3-related crash
- Usual internal cleanups/refactorings

# What's happening Debian-side?

The last release, [0.8.0]({filename}/metadata/mat2_0.8.0.md), was released
shortly after [0.7.0]({filename}/metadata/mat2_0.7.0.md), because of the
[Debian buster freeze](https://release.debian.org/buster/freeze_policy.html):
after the freeze, large/disruptive changes are no longer accepted, only
bugfixes. This is why I rushed the release a bit, to get the .epub support in.

On the 28<sup>th</sup> of February, mat2-0.8.0-1 was
[uploaded into Debian](https://tracker.debian.org/news/1032952/accepted-mat2-080-1-source-into-unstable/).
The trailing `-1` in the package version is Debian-specific, and means that
it's the package's first revision.

On the 1<sup>st</sup> of March, mat2-0.8.0-2 was
[uploaded](https://tracker.debian.org/news/1033243/accepted-mat2-080-2-source-all-into-experimental/),
to gracefully handle the transition between mat and mat2,
by declaring that mat2 is breaking (and replacing) mat, and shipping a
transitional package (also named mat, since mat isn't shipped anymore in Debian)
that effectively pulls mat2. The effects of this change can be seen on the
following graph:

[![Popcon graph showing the transition from mat to mat2](https://qa.debian.org/cgi-bin/popcon-png?packages=mat2%20mat&show_installed=on&want_legend=on&want_ticks=on&date_fmt=%25Y-%25m&beenhere=1)](https://qa.debian.org/popcon.php?package=mat2)

But not everything is [super green](https://www.urbandictionary.com/define.php?term=Super%20green)
in Debian-land: the [nautilus-python](https://tracker.debian.org/pkg/nautilus-python)
package in Buster is still using python2, while mat2 requires python3.
Fortunately, a solution was found: the Debian package is
[patching the extension](https://sources.debian.org/src/mat2/0.8.0-3/debian/patches/0001-nautilus-ext-python2.7.patch/)
to use the mat2 binary instead of calling libmat2. While this is a *super-gross* hack,
it makes it possible for everyone to clean their metadata via a simple
right-click, which is what matters.

All of this is happening because mat2 is actively packaged in Debian by amazing
people: the original maintainer for mat used to be
[intrigeri](https://people.debian.org/~intrigeri/blog/), but nowadays it's
[georg](https://qa.debian.org/developer.php?login=georg@riseup.net) and
[jonas](https://qa.debian.org/developer.php?email=jonas%40freesources.org) who
are taking [good care of mat2](https://tracker.debian.org/pkg/mat2) in Debian.

# What's happening Fedora-side?

[Fedora 30](https://fedoraproject.org/wiki/Releases/30/Schedule) was released,
and while I'm quite sure it comes with a lot of amazing stuff, the only thing I
care about is the complete migration to Python3 for <s>Nautilus</s> Files and
its ecosystem! This means that the mat2 extension is now [working
properly](https://github.com/atenart/copr-build-mat2/issues/4), thanks to the
Fedora maintainer, [atenart](https://ack.tf)! To be fair, mat2 is only in
[COPR](https://copr.fedorainfracloud.org/) for now, but getting into Fedora is
on the [todo list](https://github.com/atenart/copr-build-mat2/issues/2).

# Speeding up the CI

mat2 is using [Gitlab's
CI](https://about.gitlab.com/product/continuous-integration/) to trigger a run
of the testsuite for each commit, and at least once every week, on Debian (with
and without [bubblewrap](https://github.com/projectatomic/bubblewrap)),
Archlinux, Fedora and Gentoo, to ensure that everything is working correctly on
those platforms. The CI is also used to find potential bugs via *linters* like
[pyflakes](https://pypi.org/project/pyflakes/) and
[pylint](http://pylint.pycqa.org/en/stable/), or typing-related issues via
[mypy](http://mypy-lang.org/) (because Python's typing system is awful,
reliability-wise).  Moreover, this is also how code coverage is enforced, to
make sure that all the paths in the codebase are triggered by the testsuite.

A downside of such extensive testing is the time it takes to run: around 5
minutes. While this doesn't sound like a lot, when people are submitting a
merge-request, they want to quickly know if their code is acceptable or not.

Thanks to georg (again), who provided a privileged [gitlab
runner](https://docs.gitlab.com/runner/), the testsuite is now running on
tailored containers, with all the required dependencies already installed,
shrinking a whole testsuite run from ~300s to ~90s.

# A minor internal naming-related change

I'm not a native English speaker (hence why this blog is mostly made of
butchered sentences riddled with horrible grammatical mistakes), and while I can
read, write and speak it *fluently*, I don't know much about its history, where
it comes from, what shaped its evolution, the history of its speakers, …

Luckily, one of my flatmates was born in the US and is patient enough to
highlight and correct the mistakes I'm making when I'm speaking English at
home.  She was also kind enough to hand me a copy of
[A Person Paper on Purity in Language](https://www.cs.virginia.edu/~evans/cs655/readings/purity.html)
by [Douglas Hofstadter](https://en.wikipedia.org/wiki/Douglas_Hofstadter).

Here is a small excerpt:

> Most of the clamor, as you certainly know by now, revolves around the age-old
usage of the noun "white" and words built from it, such as chairwhite,
mailwhite, repairwhite, clergywhite, middlewhite, Frenchwhite, forewhite,
whitepower, whiteslaughter, oneupswhiteship, straw white, whitehandle, and so
on. The negrists claim that using the word "white," either on its own or as a
component, to talk about all the members of the human species is somehow
degrading to blacks and reinforces racism. Therefore the libbers propose that
we substitute "person" everywhere where "white" now occurs.

> Sensitive speakers of our secretary tongue of course find this preposterous.
There is great beauty to a phrase such as "All whites are created equal." Our
forebosses who framed the Declaration of Independence well understood the
poetry of our language.  Think how ugly it would be to say "All persons are
created equal," or "All whites and blacks are created equal." Besides, as any
schoolwhitey can tell you, such phrases are redundant. In most contexts, it is
self-evident when "white" is being used in an inclusive sense, in which case it
subsumes members of the darker race just as much as fairskins.

It made me realise that using the terms `whitelist`/`blacklist` wasn't as
innocuous as I thought it was, so we
[replaced](https://0xacab.org/jvoisin/mat2/commit/5ac91cd4f94a822c81bd0bc55a2f7034b31eee7a)
them with `allowlist`/`blocklist`.

# Tar files support

Python has a [zipfile](https://docs.python.org/3/library/zipfile.html) module
for handling zip files, and a
[tarfile](https://docs.python.org/3/library/tarfile.html) module for handling
tar files: they are sufficiently similar to be wrapped in a single *parser*
class in mat2, but also different enough that I spent a whole afternoon and a
good chunk of a night trying to make this happen.

To open a zip file, one can use `zipfile.ZipFile()`. To open a tar file, one
can use `tarfile.TarFile`, except that this will burst into flames with 3
nested exceptions about invalid headers and the fact that the ascii codec can't
decode some shit as soon as it's used on compressed files,
because as said in the [documentation](
https://docs.python.org/3/library/tarfile.html#tarfile.TarFile),
`tarfile.open` should be used instead.
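
To make the difference concrete, here is a minimal sketch (not mat2's actual code) that builds a tiny gzip-compressed archive in memory and opens it both ways. The `make_tar_gz` helper and the `hello.txt` member are made up for the illustration; only `tarfile.open`, `tarfile.TarFile`, `tarfile.TarInfo` and `tarfile.TarError` come from the standard library.

```python
import io
import tarfile


def make_tar_gz() -> bytes:
    """Build a small gzip-compressed tar archive in memory (hypothetical helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        data = b"hello"
        info = tarfile.TarInfo(name="hello.txt")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


archive = make_tar_gz()

# tarfile.open() figures out the compression on its own in read mode.
with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
    names = tar.getnames()

# The TarFile constructor expects an uncompressed archive, so it chokes
# on the gzip stream while trying to read the first header.
try:
    tarfile.TarFile(fileobj=io.BytesIO(archive), mode="r")
    direct_open_failed = False
except tarfile.TarError:
    direct_open_failed = True
```

With an uncompressed archive both calls would work, which is probably why this is so easy to get wrong.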

To get all the members of a zip file, it's `ZipFile.infolist()`;
for tar, it's `TarFile.getmembers()`. `ZipFile.extract` isn't vulnerable to
path traversals, but `TarFile.extract` is. To add stuff to a zip, one `write`s
it, but for tar, it's `add`. Zip archive members have a `filename` and
a `date_time` as a 6-member tuple, while tar ones have a `name` and a
`mtime` as a timestamp.
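
As a rough illustration of how these differences can be papered over, the sketch below lists the members of both archive types as `(name, 6-tuple)` pairs. All the function names and the `STAMP` example date are hypothetical, and interpreting the tar `mtime` as UTC is an assumption made here to keep the output deterministic; this is not how mat2 itself is structured.

```python
import calendar
import datetime
import io
import tarfile
import zipfile

STAMP = (2019, 5, 10, 22, 30, 0)  # arbitrary example date


def build_zip() -> bytes:
    """Create a one-member zip archive in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr(zipfile.ZipInfo("a.txt", date_time=STAMP), b"x")
    return buf.getvalue()


def build_tar() -> bytes:
    """Create a one-member tar archive in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as archive:
        info = tarfile.TarInfo("a.txt")
        info.size = 1
        info.mtime = calendar.timegm(STAMP)  # seconds since the epoch, UTC
        archive.addfile(info, io.BytesIO(b"x"))
    return buf.getvalue()


def zip_members(data: bytes):
    """Zip members expose `filename` and a `date_time` tuple."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return [(m.filename, m.date_time) for m in archive.infolist()]


def tar_members(data: bytes):
    """Tar members expose `name` and an `mtime` timestamp, converted here to a tuple."""
    with tarfile.open(fileobj=io.BytesIO(data)) as archive:
        result = []
        for m in archive.getmembers():
            when = datetime.datetime.fromtimestamp(m.mtime, tz=datetime.timezone.utc)
            result.append((m.name, when.timetuple()[:6]))
        return result
```

Both `zip_members(build_zip())` and `tar_members(build_tar())` end up with the same shape, which is the kind of normalisation a shared parser class needs.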

Amusingly, since tar files support permissions, care had to be taken
to correctly handle unreadable/unwritable files, and to restore their
permissions after processing.
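
One possible way to do this kind of thing, sketched under the assumption of a POSIX system, is a small context manager that grants the owner read/write access for the duration of the processing and then puts the original mode back. The name `temporarily_accessible` is made up for this example and is not part of mat2.

```python
import contextlib
import os
import stat


@contextlib.contextmanager
def temporarily_accessible(path):
    """Give the owner read/write access to `path`, then restore its original mode."""
    original_mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, original_mode | stat.S_IRUSR | stat.S_IWUSR)
    try:
        yield
    finally:
        # Restore the permissions even if the processing failed.
        os.chmod(path, original_mode)
```

Wrapping the processing of an extracted member in `with temporarily_accessible(path):` keeps the permission juggling in one place.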

But the best part, the very best one is about security: for zip files, there is
a [nice warning](https://docs.python.org/3/library/zipfile.html#zipfile.ZipFile.extractall), 
and a [safe method](https://docs.python.org/3/library/zipfile.html#zipfile.ZipFile.extract).
However, for tar files, there is a [nice warning](https://docs.python.org/3/library/tarfile.html#tarfile.TarFile.extractall),
and a … [other one](https://docs.python.org/3/library/tarfile.html#tarfile.TarFile.extract),
but no safe method to extract stuff. There is a 4-year-old bug open on
Python's bugtracker about this, with attached patches to provide a
secure-by-default way, but it's still being bikeshedded.

So I implemented checks myself, for:

- Absolute symlinks
- Relative symlinks
- Setuid files
- Setgid files
- External symlinks
- Hardlinks
- Block devices
- Character devices

Of course, I'm quite sure that I forgot some *interesting* cases, and that I'll
get a CVE about this sooner or later, but there isn't really a better solution
for now.
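
To give an idea of what such checks can look like, here is a simplified sketch, assuming POSIX paths. It is not the code shipped in mat2: the `unsafe_reason` function, its return strings and the `/tmp/extract` destination are all invented for the example, and it almost certainly misses cases too.

```python
import os
import stat
import tarfile


def _escapes(dest: str, path: str) -> bool:
    """True if `path` ends up outside of the `dest` directory."""
    return os.path.commonpath([dest, os.path.abspath(path)]) != dest


def unsafe_reason(member: tarfile.TarInfo, dest: str = "/tmp/extract"):
    """Return a short reason if `member` looks dangerous, None otherwise."""
    dest = os.path.abspath(dest)
    target = os.path.join(dest, member.name)
    if os.path.isabs(member.name) or _escapes(dest, target):
        return "path escapes the destination"
    if member.issym():
        if os.path.isabs(member.linkname):
            return "absolute symlink"
        # Relative links are resolved from the directory holding the link.
        link_target = os.path.join(os.path.dirname(target), member.linkname)
        if _escapes(dest, link_target):
            return "symlink pointing outside the destination"
    if member.islnk():
        return "hardlink"
    if member.ischr() or member.isblk():
        return "device file"
    if member.mode & stat.S_ISUID:
        return "setuid file"
    if member.mode & stat.S_ISGID:
        return "setgid file"
    return None
```

The idea would be to run every entry returned by `TarFile.getmembers()` through such a function and refuse to process the archive as soon as one of them returns something other than `None`.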

# External services

Because people are lazy, I added a [github
mirror](https://github.com/jvoisin/mat2) of mat2, automatically kept in sync
because gitlab is magic. Besides making it easier for people to contribute, this
allows me to throw mat2's codebase at various [static
analysers](https://lgtm.com/projects/g/jvoisin/mat2/alerts/?mode=list) that
didn't find any issue that the open-source trio
[pylint](https://www.pylint.org/)/[pyflakes](https://github.com/PyCQA/pyflakes)/[mypy](http://mypy-lang.org/)
didn't catch before. So maybe all those "fuck you pylint" and "mypy is stupid"
commits weren't in vain after all.

# Conclusion

Bolting on archive support was an interesting software design problem with no
elegant solution (I would be happy to be proved wrong), the rest was mostly bug
fixes.

As usual, if you know some Python, help is
[more than welcome](https://0xacab.org/jvoisin/mat2/issues?label_name%5B%5D=good+first+issue).
