Title: Cleaning PDF metadata in depth
Date: 2015-08-25 13:30

I already [mentioned]( {filename}/metadata/some-funny-stuffs-about-pdf.md ) that
the PDF format is a real mess, which makes it non-trivial to process,
and thus non-trivial to remove all the metadata it can carry.

Some people recommend [exiftool]( http://owl.phy.queensu.ca/~phil/exiftool/ )
for this, despite the [warning]( http://www.sno.phy.queensu.ca/~phil/exiftool/TagNames/PDF.html )
in its documentation:

> All metadata edits are reversible. While this would normally be considered an
> advantage, it is a potential security problem because old information is never
> actually deleted from the file. 

You can indeed restore metadata *removed* this way by running `exiftool -pdf-update:all= file.pdf`.
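
The reason this works is that the edit is appended to the file as an incremental update instead of replacing the old data. As a rough illustration, here is a minimal Python sketch that counts `%%EOF` markers in a file: each appended update normally ends with its own marker. The function names are mine, and this is only a heuristic, since linearized PDFs also carry more than one marker and the byte sequence could appear inside a stream.

```python
def count_eof_markers(data: bytes) -> int:
    """Count the '%%EOF' markers found in the raw bytes of a PDF."""
    return data.count(b"%%EOF")


def looks_incrementally_updated(data: bytes) -> bool:
    """Heuristic: more than one '%%EOF' hints at appended revisions.

    Linearized PDFs also contain several markers, so a positive result
    is a hint that old data may still be present, not a proof.
    """
    return count_eof_markers(data) > 1
```

Reading a file with `open("file.pdf", "rb").read()` and passing the bytes to `looks_incrementally_updated` is enough to try it on a document *cleaned* with exiftool.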

[Others]( https://github.com/Glutanimate/PDFMtEd#pdfmted-editor-1 ) use
exiftool and [qpdf]( http://qpdf.sourceforge.net/ ) to:

1. Append a new version of the metadata with exiftool
2. Remove unreferenced PDF objects (like old metadata) with qpdf
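
For reference, the two steps above could be chained roughly like this in Python. This is only a sketch under my own assumptions: the exact flags (`-all=`, `-overwrite_original`, `--linearize`) are one plausible choice and not necessarily what PDFMtEd runs, the function names are hypothetical, and both tools must be installed and on the `PATH`.

```python
import os
import shutil
import subprocess
from typing import List


def build_commands(work_copy: str, cleaned: str) -> List[List[str]]:
    """Return the two command lines, in the order they should run."""
    return [
        # Step 1: exiftool appends an update that blanks the metadata.
        ["exiftool", "-all=", "-overwrite_original", work_copy],
        # Step 2: qpdf rewrites the file, dropping unreferenced objects.
        ["qpdf", "--linearize", work_copy, cleaned],
    ]


def clean_with_exiftool_and_qpdf(source: str, cleaned: str) -> None:
    """Run both steps on a temporary copy, leaving `source` untouched."""
    work_copy = cleaned + ".tmp"
    shutil.copyfile(source, work_copy)
    try:
        for command in build_commands(work_copy, cleaned):
            # check=True stops at the first step that exits with an error.
            subprocess.run(command, check=True)
    finally:
        os.remove(work_copy)
```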

This method has several drawbacks in my opinion:

1. Nothing guarantees that your old metadata will actually be removed
if it is still referenced somewhere else in the file.
2. This approach won't clean the metadata of files embedded within the PDF.

That's why [MAT]( https://mat.boum.org ) takes a different approach:
it completely **re-renders** the PDF file
onto a [Cairo]( http://cairographics.org/ )
[PDF Surface]( http://cairographics.org/manual/cairo-PDF-Surfaces.html )
and exports the result as a brand new PDF file, much like a normal, physical printout.

This ensures that:

- Metadata from images is removed, since they are re-rendered
- Videos are transformed into *screenshots* (this is actually
a feature, because it makes video-powered fingerprinting much harder)
- [Weird]( http://www.bluebeam.com/us/products/revu/3d-pdfs.asp ) embedded objects are discarded
- JavaScript is disabled (goodbye [exploit-kits]( https://helpx.adobe.com/security/products/reader.html ))
- ...
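
To give an idea of what re-rendering looks like in practice, here is a minimal Python sketch using Poppler (through PyGObject) to read the pages and pycairo to draw them onto a PDF surface. It is not MAT's actual code: the function names are mine, and it assumes that the Poppler GObject bindings (version `0.18`) and the cairo integration for PyGObject are installed.

```python
from pathlib import Path


def to_file_uri(path: str) -> str:
    """Poppler opens documents by URI, so turn a path into file://..."""
    return Path(path).absolute().as_uri()


def rerender_pdf(source: str, destination: str) -> None:
    """Draw every page of `source` onto a fresh cairo PDF surface."""
    # Imported here so the module loads even without the bindings.
    import cairo
    import gi
    gi.require_version("Poppler", "0.18")  # assumed binding version
    from gi.repository import Poppler

    document = Poppler.Document.new_from_file(to_file_uri(source), None)
    # Placeholder size; the real size is set for each page below.
    surface = cairo.PDFSurface(destination, 1, 1)
    context = cairo.Context(surface)
    for index in range(document.get_n_pages()):
        page = document.get_page(index)
        width, height = page.get_size()
        surface.set_size(width, height)     # must precede any drawing
        page.render_for_printing(context)   # only the printed output is drawn
        context.show_page()                 # finish this page, start the next
    surface.finish()
```

Only what Poppler draws ends up in the output, which is why scripts, attachments and the original object structure do not survive the trip.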

To my knowledge, this is currently the <s>best</s> least bad way to clean a PDF file,
but I'll be delighted to be proven wrong ;)

(Oh, and by the way, since several people asked me about this,
I set up a [GitHub mirror]( https://github.com/jvoisin/MAT ) for MAT.
Send me pull requests to prove that it's worth keeping alive.)
