Many people in the data science field use the parquet format to store tabular data: it's the default format used by Apache Spark and an efficient storage format for analytics. The problem is that the format is binary (you can't just open it with your preferred code editor) and there was no good Python library to read it -- not until today!
I found a Python library called parquet-python on GitHub, but it's hard to use, doesn't have a single code example, was not available on PyPI and looks like it's not maintained anymore. So I decided to implement a parquet plugin (read-only) for my library rows: it uses the parquet-python library under the hood (I needed to upload it to PyPI so you can install it easily) and exposes the data in a pretty simple, pythonic way.
I haven't released the rows version with this plugin yet, so you need to grab the most recent rows version by running:
pip install -U git+https://github.com/turicas/rows.git@develop
And also the dependency:
pip install parquet
If the data is compressed using Google's snappy you'll also need the library headers and another Python dependency -- install everything by running:

apt-get install libsnappy-dev
pip install python-snappy
Then you can use rows.import_from_parquet(filename) in your programs! \o/
A quick Python code example:
import rows

table = rows.import_from_parquet('myfile.parquet')
for row in table:
    print row  # access field values with `row.field_name`
Note that the current implementation is not optimized (for example, it'll read everything into memory), but at least you can extract the data you need and then easily convert it to a friendlier format.
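As a minimal sketch of that extract-then-convert workflow (myfile.parquet is just a placeholder filename; export_to_csv is rows's regular CSV export function):

import rows

# Read the whole parquet file into memory
table = rows.import_from_parquet('myfile.parquet')

# ...work with the data, then save it in a friendlier format
rows.export_to_csv(table, 'myfile.csv')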
Converting Parquet to Other Formats with the rows CLI
You can convert parquet files to many other tabular formats (like CSV) by using rows's command-line interface, so you don't even need to write code.
Install the rows CLI by running:
pip install rows[cli]
Now convert the parquet file:
rows convert myfile.parquet myfile.csv  # yes, it's that simple!
You can replace csv with any other format supported by a write plugin (the list is always growing!), such as xlsx.
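For instance, to get an Excel spreadsheet instead (xlsx is one of the write plugins mentioned below; other formats follow the same pattern):

rows convert myfile.parquet myfile.xlsx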
If your file is small enough you can actually see it without saving the output to another file by using the print subcommand:

rows print myfile.parquet  # will extract, convert and print data as text
And you can even query the data using SQL (this CLI is awesome!), for example:
rows query 'nation_key < 10' tests/data/nation.dict.parquet \
    --output=data.csv
By running this command the CLI will:
- Import data from the tests/data/nation.dict.parquet file into memory;
- Export the data to SQLite;
- Run the query (nation_key < 10) and get the results;
- Convert the results to a new table;
- Export the table to CSV format and save it into data.csv (the result format could be xlsx or any other format supported by a rows write plugin).
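If you're curious, here's a rough Python sketch of what that pipeline looks like -- not the CLI's actual code, and it assumes rows's export_to_sqlite accepts an open sqlite3 connection plus a table_name argument:

import csv
import sqlite3

import rows

# Import the parquet data into memory
table = rows.import_from_parquet('tests/data/nation.dict.parquet')

# Export the data to an in-memory SQLite database
connection = sqlite3.connect(':memory:')
rows.export_to_sqlite(table, connection, table_name='nation')

# Run the query and get the results
cursor = connection.execute('SELECT * FROM nation WHERE nation_key < 10')

# Save the results into data.csv
with open('data.csv', 'w') as fobj:
    writer = csv.writer(fobj)
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)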
With this addition to rows, I think the library and its command-line interface have become tools every data scientist should have installed. ;-)