Many people in the data science field use the Parquet format to store tabular data, since it's the default format used by Apache Spark and an efficient storage format for analytics. The problem: the format is binary (you can't just open it in your preferred code editor) and there was no good Python library to read it -- not until today!
I found a Python library called parquet-python on GitHub, but it's hard to use, doesn't have a single code example, was not available on PyPI and looks unmaintained. So I decided to implement a parquet plugin (read-only) for my library rows: it uses the parquet-python library under the hood (I needed to upload it to PyPI so you can install it easily) and exposes the data in a pretty simple, pythonic way.
Installation
I haven't released the rows version with this plugin yet, so you need to grab the most recent rows version by running:
pip install -U git+https://github.com/turicas/rows.git@develop
And also the dependency:
pip install parquet
If the data is compressed using Google's snappy, you'll also need the library headers and another Python dependency -- install everything by running:
apt-get install libsnappy-dev
pip install python-snappy
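If you want to confirm the snappy bindings installed correctly, a quick sanity check is to round-trip some bytes (a minimal sketch; the snappy module below is the one provided by the python-snappy package):
import snappy  # provided by the python-snappy package

data = b'hello, parquet!'
assert snappy.decompress(snappy.compress(data)) == data
print('snappy is working')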
Then you can use rows.import_from_parquet(filename) in your programs! \o/
Python Example
A quick Python code example:
import rows

table = rows.import_from_parquet('myfile.parquet')
for row in table:
    print(row)  # access field values with `row.field_name`
Note that the current implementation is not optimized (for example, it'll load everything into memory), but at least you can extract the desired data and then easily convert it to a friendlier format.
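As a sketch of that workflow, the conversion can also be done programmatically. This assumes rows' export_to_csv plugin, which ships with the library:
import rows

# read the Parquet file and write it out as CSV
table = rows.import_from_parquet('myfile.parquet')
rows.export_to_csv(table, 'myfile.csv')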
Converting Parquet to Other Formats with rows' CLI
You can convert Parquet files to many tabular formats (like CSV) by using rows' command-line interface, so you don't even need to write code.
Install the rows CLI by running:
pip install rows[cli]
Now convert the parquet file:
rows convert myfile.parquet myfile.csv  # yes, it's as simple as that!
You can replace csv with any other supported format (the list is always growing!), such as txt, html, xls, xlsx and sqlite.
If your file is small enough you can actually see it without needing to save the output to another file by using the print subcommand:
rows print myfile.parquet # will extract, convert and print data as text
And you can even query the data using SQL (this CLI is awesome!), for example:
rows query 'nation_key < 10' tests/data/nation.dict.parquet \
--output=data.csv
By running this command the CLI will:
- Import data from tests/data/nation.dict.parquet into memory;
- Export it to SQLite (:memory:);
- Run the query (nation_key < 10) and get the results;
- Convert the results to a new rows.Table object;
- Export the table to CSV format and save it into data.csv (the result format could be html, xls, xlsx or any other write-plugin supported by rows).
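The same pipeline can be reproduced in Python. Below is a minimal sketch under a couple of assumptions: that rows' SQLite plugin accepts an open connection and that import_from_sqlite takes a query parameter -- check your installed version's signatures if this fails:
import sqlite3

import rows

# 1. import the Parquet data into memory
table = rows.import_from_parquet('tests/data/nation.dict.parquet')

# 2. export it to an in-memory SQLite database
connection = sqlite3.connect(':memory:')
rows.export_to_sqlite(table, connection, table_name='nation')

# 3. run the query and get the results back as a new rows.Table
result = rows.import_from_sqlite(
    connection,
    query='SELECT * FROM nation WHERE nation_key < 10',
)

# 4. export the result table to CSV
rows.export_to_csv(result, 'data.csv')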
With this addition to rows, I think the library and its command-line interface have become tools every data scientist should have installed. ;-)