How to query Pandas DataFrames with chDB
Pandas is a popular library for data manipulation and analysis in Python.
In version 2 of chDB, we've improved the performance of querying Pandas DataFrames and introduced the Python
table function.
In this guide, we will learn how to query Pandas using the Python
table function.
Setup
Let's first create a virtual environment:
And now we'll install chDB. Make sure you have version 2.0.2 or higher:
And now we're going to install Pandas and a couple of other libraries:
We're going to use ipython
to run the commands in the rest of the guide, which you can launch by running:
You can also use the code in a Python script or in your favorite notebook.
Creating a Pandas DataFrame from a URL
We're going to query some data from the StatsBomb GitHub repository. Let's first import requests and pandas:
Then, we'll load one of the matches JSON files into a DataFrame:
Let's have a look what data we'll be working with:
Next, we'll load one of the events JSON files and also add a column called match_id
to that DataFrame:
And again, let's have a look at the first row:
Querying Pandas DataFrames
Next, let's see how to query these DataFrames using chDB. We'll import the library:
We can query Pandas DataFrames by using the Python
table function:
So, if we wanted to list the columns in matches_df
, we could write the following:
We could then find out which referees have officiated more than one match by writing the following query:
Now, let's explore events_df
.
Joining Pandas DataFrames
We can also join DataFrames together in a query. For example, to get an overview of the match, we could write the following query:
Populating a table from a DataFrame
We can also create and populate ClickHouse tables from DataFrames. If we want to create a table in chDB, we need to use the Stateful Session API.
Let's import the session module:
Initialize a session:
Next, we'll create a database:
Then, create an events
table based on events_df
:
We can then run the query that returns the top pass recipient:
Joining a Pandas DataFrame and table
Finally, we can also update our join query to join the matches_df
DataFrame with the statsbomb.events
table: