How to query Apache Arrow with chDB
Apache Arrow is a standardized column-oriented memory format that's gained popularity in the data community.
In this guide, we will learn how to query Apache Arrow using the Python table function.
Setup
Let's first create a virtual environment:
And now we'll install chDB. Make sure you have version 2.0.2 or higher:
And now we're going to install PyArrow, pandas, and ipython:
We're going to use ipython to run the commands in the rest of the guide, which you can launch by running:
You can also use the code in a Python script or in your favorite notebook.
Creating an Apache Arrow table from a file
Let's first download one of the Parquet files for the Ookla dataset, using the AWS CLI tool:
If you want to download more files, use aws s3 ls to get a list of all the files and then update the above command.
Next, we'll import the Parquet module from the pyarrow package:
And then we can read the Parquet file into an Apache Arrow table:
The schema is shown below:
And we can get the row and column count by calling the shape attribute:
Querying Apache Arrow
Now let's query the Arrow table from chDB. First, let's import chDB:
And then we can describe the table:
We can also count the number of rows:
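The describe and count steps might look roughly like the sketch below. It assumes that Python(arrow_table) resolves to the Python variable named arrow_table and that chdb.query accepts "DataFrame" as its output format. The table contents are made up, and the import guard is an addition of this sketch, since chDB is not available on every platform.

```python
import importlib.util

# Third-party packages from the Setup section; the demo is skipped if any is missing.
deps_installed = all(
    importlib.util.find_spec(name) is not None
    for name in ("chdb", "pandas", "pyarrow")
)

schema_df = None
row_count = None
if deps_installed:
    import chdb
    import pyarrow as pa

    # Made-up stand-in for the Arrow table created earlier in the guide.
    arrow_table = pa.table({
        "quadkey": ["0022133222312322", "0022133222312323", "0022133222330013"],
        "avg_d_kbps": [5983, 12000, 40500],
        "tests": [1, 2, 5],
    })

    # Describe the table: one result row per column.
    schema_df = chdb.query("DESCRIBE Python(arrow_table)", "DataFrame")
    print(schema_df)

    # Count the rows; the single value comes back in a one-cell DataFrame.
    count_df = chdb.query("SELECT count() FROM Python(arrow_table)", "DataFrame")
    row_count = int(count_df.iloc[0, 0])
    print(row_count)
```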
Now, let's do something a bit more interesting.
The following query excludes the quadkey column and the columns matching tile.*, and then computes the average and max values for all remaining columns:
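One way to write such a query, as a sketch rather than the guide's exact statement, is to chain ClickHouse's EXCEPT and APPLY column modifiers. The CTE name numericColumns, the rounding to two decimals, and the sample rows are assumptions made for illustration. The import guard is again an addition of this sketch.

```python
import importlib.util

# Third-party packages from the Setup section; the demo is skipped if any is missing.
deps_installed = all(
    importlib.util.find_spec(name) is not None
    for name in ("chdb", "pandas", "pyarrow")
)

summary = None
if deps_installed:
    import chdb
    import pyarrow as pa

    # Made-up stand-in table: two non-numeric columns to exclude, two numeric ones.
    arrow_table = pa.table({
        "quadkey": ["0022133222312322", "0022133222312323", "0022133222330013"],
        "tile": ["POLYGON A", "POLYGON B", "POLYGON C"],
        "avg_d_kbps": [1000, 2000, 4500],
        "tests": [1, 2, 6],
    })

    summary = chdb.query(
        """
        WITH numericColumns AS (
            -- Drop columns matching the regex 'tile.*' and the quadkey column.
            SELECT * EXCEPT ('tile.*') EXCEPT (quadkey)
            FROM Python(arrow_table)
        )
        -- Apply max, then rounded avg, to every remaining column.
        SELECT * APPLY(max), * APPLY(avg) APPLY(x -> round(x, 2))
        FROM numericColumns
        """,
        "DataFrame",
    )

    # Transpose so each aggregate is printed on its own line.
    print(summary.T)
```

Using the modifiers keeps the query independent of the exact list of numeric columns, so the same statement should work unchanged if further files with the same layout are loaded.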