Data Analysis With PandasAI, The Generative AI Python Library

October 27, 2023

1129

With PandasAI, you can turn complex queries into simple conversations, unlocking insights that reshape your approach to data analysis. Let’s see how…

Business users and data analysts can now become more productive at work than ever before, thanks to OpenAI’s ChatGPT artificial intelligence tool. This new technology’s capacity to facilitate the creation of SQL queries is a significant perk for business analysts. For years, business analysts have used SQL to query and analyse data. Learning and writing SQL is now easier than the conventional approach, thanks to the usage of AI technology, which can swiftly and automatically generate, edit, debug, and optimise SQL queries.

Python Pandas is an open source toolkit that gives data scientists and analysts the ability to manipulate and analyse data. In the preprocessing stage of machine learning and deep learning, the Pandas library is particularly popular. Now, with generative AI capabilities, the PandasAI module enables conversational data frame operation features for queries to the record set of a DataFrame.

Conversational query

PandasAI provides great strength to Pandas. A great amount of time is usually required to prepare a data set for final analysis. With the help of the PandasAI generative tool, it is possible to initiate the desired data analysis task on the raw data set with minimal preprocessing.

As PandasAI works with Pandas like a supplementary tool, one can apply generative tools on Panda’s DataFrames and get back resultant DataFrames. OpenAI’s PandasAI can engage in dialogue with a machine to get the desired results in DataFrame format without the need for writing lengthy queries or graphical Python codes.

Use of PandasAI: Installation and setup

The first task is to install PandasAI using the pip install command from the command line:

C:\mypython> pip install pandasai

Once installation is complete, the system is ready for action. Here, the Google Colab platform is used to demonstrate the role of PandasAI in data analysis. Start by calling https://colab.research.google.com/ and then import the necessary modules as shown here:

import pandas as pd
import numpy as np
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI

To upload the required input CSV file from the local directory, write down these file uploading instructions:

from google.colab import files
uploaded = files.upload()

Once this command is executed, Python will request an input file. A CSV file named CRISAT-WB_Rice.csv has been uploaded in this instance. The file contains 364 record sets detailing rice production in various districts of West Bengal. Interested readers can download the file using the link provided at the end of the article.

Exploring data with PandasAI

After uploading the input CSV file, proceed to read the CSV file from memory:

import pandas as pd
import io
df = pd.read_csv(io.StringIO(uploaded[‘ICRISAT-WB_Rice.csv’].decode(‘utf-8’)))

df.head()

The output is shown in Figure 1.

To call pandasai(), it is required to have a unique OpenAI API key. Obtain the key by creating an OpenAI account from https://openai.com/blog/chatgpt. From there it is possible to create an API key as shown here in Figure 2.

Figure 2: Generation of API key from ChatGPT

Once you have this key, you can call these two functions to start PandasAI:

oail = OpenAI(api_token=”Insert your API-KEY here”)
pandas_ai = PandasAI(oali, conversational=False)

For the ‘True’ value of the conversational option, PandasAI delivers a descriptive text record set, whereas it returns a tabular set of data sets for the ‘False’ value.

Year	Dist_Name	RICE_AREA _1000_ha	RICE_PRODUCTION_ 1000_tons	RICE YIELD Kg_per_ha
1966	Burdwan	436.48	597.28	1368.4
1967	Burdwan	459.82	626.65	1364.61
1968	Burdwan	473.8	687.22	1454.51
1969	Burdwan	476.1	629.26	1322.7
1970	Burdwan	463.2	617.88	1332.78
1971	Burdwan	500.8	616.16	1233.81
1972	Burdwan	496	709.42	1430.28
1973	Burdwan	526.2	664.17	1262.2
1974	Burdwan	535.4	806.35	1506.07

Working with Pandas DataFrame

The index of a record can be explored with the pandas_ai() call:

# Finding the index of a row using the value of a column
result = pandas_ai(df, “What is the index of 1982?”)
print(result)
16

To display the first 5 records of the data set you can write this textual query:

response = pandas_ai(df, “Show 5 rows of data in tabular form”)
print(response)

The output is shown in Figure 3.

A more detailed conversational query can be written as follows:

response = pandas_ai(df, “Show 15 rows of data in tabular form   columns Year, DIST_NAME, RICE_PRODUCTION_1000_tons of DIST_NAME Hooghly RICE_PRODUCTION_1000_tons GREATER THAN 400”)
print(response)

The output is shown in Figure 4.

Figure 4: Result of a detailed conversational query

Figure 5: Basic structures of a DataFrame

Similarly, we can also execute the following query over the DataFrame.

response = pandas_ai(df, “Show 15 rows of data in tabular form   columns Year, DIST_NAME and RICE_PRODUCTION_1000_tons DIST_NAME Bankura”)
print(response)

Data structure

It is possible to get all the basic structures of a DataFrame with simple textual instructions.

response = pandas_ai(df, “Show structure of data “)
print(response)

The shape of the DataFrame can also be explored with this instruction.

response = pandas_ai(df, “what is the shape of data?”)
print(response)
364 5

Figure 6: Distribution of rice production over the years

Data analysis

Overall basic statistics can easily be calculated with the following simple textual instruction:

response = pandas_ai(df, “what is the distribution  of RICE_PRODUCTION_1000_tans?”)
print(response)

count 364.000000
mean 641.068819
std 456.838868
min 51.490000
25% 289.392500
50% 493.310000
75% 848.247500
max 2145.290000
Name: RICE_PRODUCTION_1000_tons, dtype: float64

Skewness

To know the skewness of the distribution, you can use this simple query:

response = pandas_ai(df, “Show the skewness  of RICE_PRODUCTION_1000_tons”)
print(response)
1.2902333826377734

Figure 7: Distribution of district-wise rice production

Graphics

Graphical representation is essential for effective data analysis. Distributions of different parameters can easily be plotted with the pandasai module with the help of ChatGPT. For example, the distribution of rice production can be depicted as follows:

response = pandas_ai(df, “Show the distribution graph  of RICE_PRODUCTION_1000_tons with Year”)
print(response)

Similarly, the distribution of rice production in a district can be plotted with an instruction like:

response = pandas_ai(df, “Show the distribution graph  of RICE_PRODUCTION_1000_tons with Year DIST_NAME Hooghly”)
print(response)

A bar graph of the district-wise distribution of rice production can easily be drawn with an instruction like this:

response = pandas_ai(df, “Show the distribution graph  of average RICE_PRODUCTION_1000_tons with Dist_NAME “)
print(response)

District-wise average rice production can be plotted like this:

df_sorted_rp=df.sort_values(by=’RICE_PRODUCTION_1000_tons’)

response = pandas_ai(df_sorted_rp, “line plot average RICE_PRODUCTION_1000_tons with DIST_NAME”)
print(response)

Figure 8: Line plot of average rice production by district

PandasAI, a supplementary tool based on ChatGPT, enhances data analysis in the Pandas environment. Its unstructured query language support makes data set analysis easier. Due to this textual informal approach, it is a paradigm change in data exploration and subsequent analysis. Here, we have focused on a few examples of the application of PandasAI on a data set. Readers can explore the several possibilities of this textual query-based data analysis. Programs and the data set are available at https://drive.google.com/drive/folders/1gVvRcwDNDP4j_RbYdBm7Z0FKcUp95wNQ?usp=sharing. Interested readers can download the data and explore it further.

Conversational query

Use of PandasAI: Installation and setup

Exploring data with PandasAI

Working with Pandas DataFrame

Data structure

Data analysis

Skewness

Graphics

LEAVE A REPLY Cancel reply

Thought Leaders

HOW TOs

MOST POPULAR

Open Journey

EDITOR PICKS

POPULAR POSTS

POPULAR CATEGORY