I was looking for a simple and fast Python library to add proper file type and encoding detection to my rows library, and found that many libraries on the Python Package Index claim to do it. None of them attracted me, for at least one of the following reasons:
- It does not have a pythonic implementation,
- It is not available on Debian to install as a package (this matters so people can install rows and its dependencies using pip or apt-get),
- It is not maintained anymore, or
- It is missing some feature.
None seemed to be the de facto way to do it in Python (I think pythonistas do it in many ways). Many pythonistas use the chardet library, but its results are sometimes wrong and it's pretty slow (especially if you need to detect during an HTTP API request, while the client is waiting).
But there should be one -- and preferably only one -- obvious way to do it.
So I thought: why not use the file software, which is well known to all UNIX hackers, and faster and more accurate than all the other solutions I know? To my surprise, there was no good Python binding for file on PyPI (and calling it as a child process was not an option, since it would add one more system-dependent package, not detectable during a pip install if missing, and would also make this solution less portable). Then, searching its repository, I found a simple Python wrapper, which was not available on PyPI at that time and was not as pythonic as I expected.
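For comparison, the child-process approach ruled out above might look like the sketch below. It assumes a file executable on the PATH that accepts the --brief and --mime options; the helper names parse_mime_output and detect_with_file_command are hypothetical, made up for this illustration.

```python
import shutil
import subprocess


def parse_mime_output(output):
    """Split a line like 'text/html; charset=utf-8' into ('text/html', 'utf-8')."""
    mime_type, _, rest = output.strip().partition(";")
    # Everything after 'charset=' is the encoding; None if it is absent
    encoding = rest.strip().partition("charset=")[2] or None
    return mime_type.strip(), encoding


def detect_with_file_command(filename):
    """Return (mime_type, encoding), or None if `file` is not installed."""
    executable = shutil.which("file")
    if executable is None:
        # The missing dependency only shows up here, at run time,
        # not during `pip install`
        return None
    output = subprocess.check_output(
        [executable, "--brief", "--mime", "--", filename]
    )
    return parse_mime_output(output.decode("utf-8", errors="replace"))
```

Every detection pays for a new process, and the caller has to handle the case where the executable is missing, which are the portability costs mentioned above.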
The Solution
Since it's free/libre software, I created an issue on the file bug tracker to solve the problem, and Christos Zoulas (the current maintainer) asked for a patch, which I implemented and sent, and which was accepted. I'm pretty happy that I can contribute to software I've been using since my early days on GNU/Linux (2003? 2004?). During this process I found that the first commit of the free/libre file implementation (which every GNU/Linux distribution and BSD flavor uses) was made when I was less than 4 months old (!) -- and it's still maintained today.
Now you can use the official file Python binding: the new library is called file-magic and can be installed by running:
pip install file-magic
It provides some methods and attributes, but the most important ones are pretty simple and intuitive to use: they return a namedtuple with the data you want! Let's take a tour through an example:
>>> import magic
>>> # You can pass the filename and it'll open the file for you:
>>> filename_detected = magic.detect_from_filename('turicas.jpg')
>>> print(filename_detected)
FileMagic(mime_type='image/jpeg', encoding='binary',
name='JPEG image data, JFIF standard 1.02, aspect ratio, density 1x1, segment length 16, progressive, precision 8, 842x842, frames 3')
>>> # It's a `namedtuple` so you can access the attributes directly:
>>> print(filename_detected.mime_type)
image/jpeg
>>> # If you have the file contents already, just use `detect_from_content`:
>>> with open('data.html', 'rb') as fobj:
...     data = fobj.read()
...
>>> content_detected = magic.detect_from_content(data)
>>> print(content_detected)
FileMagic(mime_type='text/html', encoding='utf-8',
name='HTML document, UTF-8 Unicode text')
>>> print(content_detected.encoding)
utf-8
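Since the goal was encoding detection for rows, here is a minimal sketch of how the detected encoding could be used to decode file contents. The helper decode_text and the stand-in Detected namedtuple are hypothetical names for this illustration; only the mime_type, encoding and name fields come from the example above. It assumes that the reported encoding is sometimes not a codec name Python knows, hence the fallback.

```python
from collections import namedtuple

# Stand-in with the same fields as the FileMagic namedtuple shown above,
# so the sketch runs without file-magic installed
Detected = namedtuple("Detected", ["mime_type", "encoding", "name"])


def decode_text(data, detected, fallback="utf-8"):
    """Decode bytes with the detected encoding; return None for binary data."""
    if detected.encoding == "binary":
        return None
    try:
        return data.decode(detected.encoding)
    except (LookupError, UnicodeDecodeError):
        # Either Python has no codec with that name or the detection
        # was wrong: decode with the fallback, replacing invalid bytes
        return data.decode(fallback, errors="replace")
```

With file-magic installed, the call would be decode_text(data, magic.detect_from_content(data)), where data holds the raw bytes read from the file.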
There are still some things to be improved (like running tests on other platforms -- including Python 3), but it's pip-installable and usable now, so we can benefit from it. Feel free to contribute. :)
Hope you enjoy it!