Tuesday, February 2, 2016

Detecting File Type and Encoding In Python

Read this blog post in Brazilian Portuguese.

I was looking for a simple and fast Python library to implement proper file type and encoding detection in my rows library, and found that many libraries on the Python Package Index claim to do it. None of them appealed to me, for at least one of the following reasons:

  • It does not have a pythonic implementation,
  • It is not available on Debian as a package (this matters so people can install rows and its dependencies using either pip or apt-get),
  • It is not maintained anymore, or
  • It is missing some feature.

None seemed to be the de facto way to do it in Python (I think pythonistas do it in many ways). Many pythonistas use the chardet library, but its results are sometimes wrong and it's pretty slow (especially if you need to detect during an HTTP API request, while the client is waiting).

But there should be one -- and preferably only one -- obvious way to do it.

So I thought: why not use the file software, which is well known to all UNIX hackers and is faster and more accurate than all the other solutions I know? To my surprise, there was no good Python binding for file on PyPI (and calling it as a child process was not an option, since it would add one more system-dependent package, one that pip install cannot detect if missing, and would also make the solution less portable). Then, searching its repository, I found a simple Python wrapper, which was not available on PyPI at that time and was not as pythonic as I expected.

The Solution

Since it's free/libre software, I created an issue on the file bug tracker to solve the problem, and Christos Zoulas (the current maintainer) asked for a patch, which I implemented and sent, and which was accepted. I'm pretty happy I could contribute to software I've been using since my early days on GNU/Linux (2003? 2004?). During this process I found that the first commit to the free/libre file implementation (which every GNU/Linux distribution and BSD flavor uses) was made when I was less than 4 months old (!) -- and it's still maintained today.

Now you can use the official file Python binding: the new library is called file-magic and can be installed by running:

pip install file-magic

It provides some functions and attributes, but the most important ones are simple and intuitive to use: they return a namedtuple with the data you want! Let's take a tour through an example:

>>> import magic

>>> # You can pass the filename and it'll open the file for you:
>>> filename_detected = magic.detect_from_filename('turicas.jpg')
>>> print(filename_detected)
FileMagic(mime_type='image/jpeg', encoding='binary',
          name='JPEG image data, JFIF standard 1.02, aspect ratio, density 1x1, segment length 16, progressive, precision 8, 842x842, frames 3')
>>> # It's a `namedtuple` so you can access the attributes directly:
>>> print(filename_detected.mime_type)
image/jpeg

>>> # If you have the file contents already, just use `detect_from_content`:
>>> # (read in binary mode so the raw bytes are what gets inspected)
>>> with open('data.html', 'rb') as fobj:
...     data = fobj.read()
>>> content_detected = magic.detect_from_content(data)
>>> print(content_detected)
FileMagic(mime_type='text/html', encoding='utf-8',
          name='HTML document, UTF-8 Unicode text')
>>> print(content_detected.encoding)
utf-8
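
Once you have the detected encoding, a natural next step is to decode the raw bytes with it. The sketch below is only an illustration, not part of file-magic: `decode_detected`, `fake_detect` and the `fallback` parameter are hypothetical names, and the local `FileMagic` namedtuple is a stand-in with the same three fields shown above, so the example runs even without the library installed. It assumes the detector reports `binary` for non-text data (as in the JPEG example) and that it may occasionally report a label Python has no codec for.

```python
import codecs
from collections import namedtuple

# Stand-in with the same fields as the namedtuple shown above, so this
# sketch runs without file-magic installed (an assumption for the demo).
FileMagic = namedtuple('FileMagic', ['mime_type', 'encoding', 'name'])


def decode_detected(data, detect, fallback='utf-8'):
    """Decode `data` (bytes) using the encoding reported by `detect`.

    `detect` is any callable that returns an object with an `encoding`
    attribute. Returns None when the content is reported as binary.
    """
    detected = detect(data)
    if detected.encoding == 'binary':
        return None
    try:
        # Check that Python actually has a codec for the reported label
        codecs.lookup(detected.encoding)
        encoding = detected.encoding
    except LookupError:
        # Unknown label: fall back to a default (hypothetical policy)
        encoding = fallback
    return data.decode(encoding, errors='replace')


def fake_detect(data):
    # Hypothetical detector used only for this demonstration
    return FileMagic('text/html', 'utf-8', 'HTML document, UTF-8 Unicode text')


print(decode_detected(u'<p>Olá</p>'.encode('utf-8'), fake_detect))
```

With the library installed, you would pass `magic.detect_from_content` as the `detect` argument instead of `fake_detect`.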

There are still some things to improve (like running the tests on other platforms, including Python 3), but it's pip-installable and usable now, so we can benefit from it. Feel free to contribute. :)

Hope you enjoy it!
