Blog do @Turicas: March 2016

Monday, March 14, 2016

Reading Parquet Files in Python with rows

Many people in the data science field use the Parquet format to store tabular data: it's an efficient storage format for analytics and the default format used by Apache Spark. The problem is that the format is binary (you can't just open it with your preferred code editor) and there was no good Python library to read it. Not until today!

I found a Python library called parquet-python on GitHub, but it's hard to use, doesn't have a single code example, was not available on PyPI and looks like it's not maintained anymore. So I decided to implement a read-only Parquet plugin for my library rows: it uses parquet-python under the hood (I needed to upload it to PyPI so you can install it easily) and exposes the data in a pretty simple, Pythonic way.

Installation

I haven't released a rows version with this plugin yet, so you need to grab the most recent development version by running:

pip install -U git+https://github.com/turicas/rows.git@develop

And also the dependency:

pip install parquet

If the data is compressed using Google's Snappy you'll also need the library headers and another Python dependency. Install everything by running:

apt-get install libsnappy-dev
pip install python-snappy

Then you can use rows.import_from_parquet(filename) in your programs! \o/

Python Example

A quick Python code example:

import rows

table = rows.import_from_parquet('myfile.parquet')
for row in table:
    print(row)  # access field values with `row.field_name`

Note that the current implementation is not optimized (for example, it'll put everything into memory) but at least you can extract desired data and then convert to a more friendly format easily.
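To illustrate the "extract and convert" step, here is a minimal sketch that filters rows and writes them out as CSV using only the standard library. It assumes each row behaves like a namedtuple (attribute access plus `_fields`); the `Nation` records, the field names and the `save_as_csv` helper are made up for illustration and stand in for the table returned by `rows.import_from_parquet`.

```python
import csv
import io
from collections import namedtuple


def save_as_csv(table, fobj, keep=lambda row: True):
    """Write the header and every row accepted by `keep`; return the row count."""
    writer = csv.writer(fobj, lineterminator="\n")
    header_written = False
    count = 0
    for row in table:
        if not header_written:
            # Assumption: rows are namedtuple-like, so `_fields` has the names
            writer.writerow(row._fields)
            header_written = True
        if keep(row):
            writer.writerow(row)
            count += 1
    return count


# Hypothetical data standing in for rows.import_from_parquet('myfile.parquet')
Nation = namedtuple("Nation", ["nation_key", "name"])
table = [Nation(0, "ALGERIA"), Nation(7, "GERMANY"), Nation(12, "JAPAN")]

output = io.StringIO()
total = save_as_csv(table, output, keep=lambda row: row.nation_key < 10)
print(total)
print(output.getvalue())
```

In a real program you would pass an open file instead of the `io.StringIO` buffer and the imported table instead of the hand-made list.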

Converting Parquet to Other Formats with `rows`' CLI

You can convert Parquet files to many tabular formats (like CSV) using rows' command-line interface, so you don't need to write any code.

Install the rows CLI by running:

pip install rows[cli]

Now convert the parquet file:

rows convert myfile.parquet myfile.csv  # yes, simple like this!

You can replace csv with any other supported format (the list is always growing!), such as: txt, html, xls, xlsx and sqlite.

If your file is small enough you can actually see it without needing to save the output to another file by using the print subcommand:

rows print myfile.parquet  # will extract, convert and print data as text

And you can actually query data as in SQL (this CLI is awesome!), for example:

rows query 'nation_key < 10' tests/data/nation.dict.parquet \
     --output=data.csv

By running this command the CLI will:

Import data from tests/data/nation.dict.parquet file into memory;
Export to SQLite (:memory:);
Run the query (nation_key < 10) and get the results;
Convert the results to a new rows.Table object;
Export the table to CSV format and save it into data.csv (the result format could be html, xls, xlsx or any other write-plugin supported by rows).
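The steps above can be sketched with nothing but the standard library. This is not the actual rows implementation, just the same pipeline under stated assumptions: the in-memory data, the `nation` table name and the column types are hypothetical, and the condition passed to the CLI is treated as a plain `WHERE` clause.

```python
import csv
import io
import sqlite3

# Hypothetical stand-in for the data imported from the Parquet file
data = [(0, "ALGERIA"), (7, "GERMANY"), (12, "JAPAN")]

# Export to an in-memory SQLite database (table name is an assumption)
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE nation (nation_key INTEGER, name TEXT)")
connection.executemany("INSERT INTO nation VALUES (?, ?)", data)

# Run the query: the condition given on the command line becomes the WHERE clause
where = "nation_key < 10"
cursor = connection.execute(
    "SELECT * FROM nation WHERE " + where + " ORDER BY nation_key"
)
result_fields = [column[0] for column in cursor.description]
result = cursor.fetchall()
connection.close()

# Export the result set as CSV (written to a string here instead of data.csv)
output = io.StringIO()
writer = csv.writer(output, lineterminator="\n")
writer.writerow(result_fields)
writer.writerows(result)
csv_text = output.getvalue()
print(csv_text)
```

Concatenating the condition into the SQL string is fine for a query you type yourself on your own machine, but it should never be done with untrusted input.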

With this addition, I think rows and its command-line interface have become one of the tools every data scientist should have installed. ;-)

Sunday, March 6, 2016

Searching dd-wrt Router Database with ddwrtdb

I really like the dd-wrt router operating system: I can install it on cheap routers (starting from 27 USD, such as the TP-Link WR741ND) and get a great Web configuration interface and better performance (way better than the usual factory software).

I'm always looking for new router models to check if they're supported by dd-wrt and to compare prices and hardware specs (as I'm always buying new routers to help some friends with their Wi-Fi networks). The problem is that the usability of dd-wrt's website is not that good, especially the router database search. As I prefer to use my terminal instead of the Web browser, I've created a command-line tool to deal with it: it's called ddwrtdb and the code is available on my GitHub account!

It's also available on the Python Package Index, so you can install it with pip by running:

pip install https://github.com/turicas/rows/archive/develop.zip
pip install ddwrtdb

That's it! Now run ddwrtdb --help to see the available commands (it's pretty intuitive). You can also check out the project's README for command examples.

This simple command-line tool (< 200 lines of Python code) was created using these awesome libraries:

click, to easily create a beautiful command-line interface;
lxml, to use XPath in order to parse HTML more easily;
requests, to make HTTP requests to dd-wrt's website;
rows, to automatically extract tables from HTML and to export data to any tabular format.
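To give an idea of what "extracting tables from HTML" involves, here is a small standard-library sketch. It is a stand-in for what lxml and rows do in ddwrtdb, not the tool's actual code, and the HTML snippet with its column names is hypothetical rather than copied from dd-wrt's website.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the <tr> being parsed, if any
        self._cell = None  # text chunks of the cell being parsed, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page fragment; the real router database markup may differ
html = """
<table>
  <tr><th>Manufacturer</th><th>Model</th><th>Supported</th></tr>
  <tr><td>TP-Link</td><td>WR741ND</td><td>yes</td></tr>
</table>
"""

parser = TableExtractor()
parser.feed(html)
parser.close()

# The first row is the header; turn the remaining rows into dictionaries
header, records = parser.rows[0], parser.rows[1:]
devices = [dict(zip(header, record)) for record in records]
print(devices)
```

With lxml and rows most of this bookkeeping goes away, which is a big part of why the tool fits in fewer than 200 lines.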