Data Analysis With PandasAI, The Generative AI Python Library

0
1070
Businessman in suit holding notebook with graphs and statistic

With PandasAI, you can turn complex queries into simple conversations, unlocking insights that reshape your approach to data analysis. Let’s see how…

Business users and data analysts can now become more productive at work than ever before, thanks to OpenAI’s ChatGPT artificial intelligence tool. This new technology’s capacity to facilitate the creation of SQL queries is a significant perk for business analysts. For years, business analysts have used SQL to query and analyse data. Learning and writing SQL is now easier than the conventional approach, thanks to the usage of AI technology, which can swiftly and automatically generate, edit, debug, and optimise SQL queries.

Python Pandas is an open source toolkit that gives data scientists and analysts the ability to manipulate and analyse data. In the preprocessing stage of machine learning and deep learning, the Pandas library is particularly popular. Now, with generative AI capabilities, the PandasAI module enables conversational data frame operation features for queries to the record set of a DataFrame.

Conversational query

PandasAI provides great strength to Pandas. A great amount of time is usually required to prepare a data set for final analysis. With the help of the PandasAI generative tool, it is possible to initiate the desired data analysis task on the raw data set with minimal preprocessing.

As PandasAI works with Pandas like a supplementary tool, one can apply generative tools on Panda’s DataFrames and get back resultant DataFrames. OpenAI’s PandasAI can engage in dialogue with a machine to get the desired results in DataFrame format without the need for writing lengthy queries or graphical Python codes.

Use of PandasAI: Installation and setup

The first task is to install PandasAI using the pip install command from the command line:

C:\mypython> pip install pandasai

Once installation is complete, the system is ready for action. Here, the Google Colab platform is used to demonstrate the role of PandasAI in data analysis. Start by calling https://colab.research.google.com/ and then import the necessary modules as shown here:

import pandas as pd
import numpy as np
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI

To upload the required input CSV file from the local directory, write down these file uploading instructions:

from google.colab import files
uploaded = files.upload()

Once this command is executed, Python will request an input file. A CSV file named CRISAT-WB_Rice.csv has been uploaded in this instance. The file contains 364 record sets detailing rice production in various districts of West Bengal. Interested readers can download the file using the link provided at the end of the article.

Exploring data with PandasAI

After uploading the input CSV file, proceed to read the CSV file from memory:

import pandas as pd
import io
df = pd.read_csv(io.StringIO(uploaded[‘ICRISAT-WB_Rice.csv’].decode(‘utf-8’)))

df.head()

The output is shown in Figure 1.

Rice production
Figure 1: Rice production

To call pandasai(), it is required to have a unique OpenAI API key. Obtain the key by creating an OpenAI account from https://openai.com/blog/chatgpt. From there it is possible to create an API key as shown here in Figure 2.

Generation of API key from ChatGPT
Figure 2: Generation of API key from ChatGPT

Once you have this key, you can call these two functions to start PandasAI:

oail = OpenAI(api_token=”Insert your API-KEY here”)
pandas_ai = PandasAI(oali, conversational=False)

For the ‘True’ value of the conversational option, PandasAI delivers a descriptive text record set, whereas it returns a tabular set of data sets for the ‘False’ value.

Year Dist_Name RICE_AREA _1000_ha RICE_PRODUCTION_
1000_tons
RICE YIELD Kg_per_ha
1966 Burdwan 436.48 597.28 1368.4
1967 Burdwan 459.82 626.65 1364.61
1968 Burdwan 473.8 687.22 1454.51
1969 Burdwan 476.1 629.26 1322.7
1970 Burdwan 463.2 617.88 1332.78
1971 Burdwan 500.8 616.16 1233.81
1972 Burdwan 496 709.42 1430.28
1973 Burdwan 526.2 664.17 1262.2
1974 Burdwan 535.4 806.35 1506.07

Working with Pandas DataFrame

The index of a record can be explored with the pandas_ai() call:

# Finding the index of a row using the value of a column
result = pandas_ai(df, “What is the index of 1982?”)
print(result)
16

To display the first 5 records of the data set you can write this textual query:

response = pandas_ai(df, “Show 5 rows of data in tabular form”)
print(response)

The output is shown in Figure 3.

 Output in tabular form
Figure 3: Output in tabular form

A more detailed conversational query can be written as follows:

response = pandas_ai(df, “Show 15 rows of data in tabular form   columns Year, DIST_NAME, RICE_PRODUCTION_1000_tons of DIST_NAME Hooghly RICE_PRODUCTION_1000_tons GREATER THAN 400”)
print(response)

The output is shown in Figure 4.

Result of a detailed conversational query
Figure 4: Result of a detailed conversational query
Basic structures of a DataFrame
Figure 5: Basic structures of a DataFrame

Similarly, we can also execute the following query over the DataFrame.

response = pandas_ai(df, “Show 15 rows of data in tabular form   columns Year, DIST_NAME and RICE_PRODUCTION_1000_tons DIST_NAME Bankura”)
print(response)

Data structure

It is possible to get all the basic structures of a DataFrame with simple textual instructions.

response = pandas_ai(df, “Show structure of data “)
print(response)

The shape of the DataFrame can also be explored with this instruction.

response = pandas_ai(df, “what is the shape of data?”)
print(response)
364 5
Distribution of rice production over the years
Figure 6: Distribution of rice production over the years

Data analysis

Overall basic statistics can easily be calculated with the following simple textual instruction:

response = pandas_ai(df, “what is the distribution  of RICE_PRODUCTION_1000_tans?”)
print(response)

count 364.000000
mean 641.068819
std 456.838868
min 51.490000
25% 289.392500
50% 493.310000
75% 848.247500
max 2145.290000
Name: RICE_PRODUCTION_1000_tons, dtype: float64

Skewness

To know the skewness of the distribution, you can use this simple query:

response = pandas_ai(df, “Show the skewness  of RICE_PRODUCTION_1000_tons”)
print(response)
1.2902333826377734
Distribution of district-wise rice production
Figure 7: Distribution of district-wise rice production

Graphics

Graphical representation is essential for effective data analysis. Distributions of different parameters can easily be plotted with the pandasai module with the help of ChatGPT. For example, the distribution of rice production can be depicted as follows:

response = pandas_ai(df, “Show the distribution graph  of RICE_PRODUCTION_1000_tons with Year”)
print(response)

Similarly, the distribution of rice production in a district can be plotted with an instruction like:

response = pandas_ai(df, “Show the distribution graph  of RICE_PRODUCTION_1000_tons with Year DIST_NAME Hooghly”)
print(response)

A bar graph of the district-wise distribution of rice production can easily be drawn with an instruction like this:

response = pandas_ai(df, “Show the distribution graph  of average RICE_PRODUCTION_1000_tons with Dist_NAME “)
print(response)

District-wise average rice production can be plotted like this:

df_sorted_rp=df.sort_values(by=’RICE_PRODUCTION_1000_tons’)

response = pandas_ai(df_sorted_rp, “line plot average RICE_PRODUCTION_1000_tons with DIST_NAME”)
print(response)
Line plot of average rice production by district
Figure 8: Line plot of average rice production by district

PandasAI, a supplementary tool based on ChatGPT, enhances data analysis in the Pandas environment. Its unstructured query language support makes data set analysis easier. Due to this textual informal approach, it is a paradigm change in data exploration and subsequent analysis. Here, we have focused on a few examples of the application of PandasAI on a data set. Readers can explore the several possibilities of this textual query-based data analysis. Programs and the data set are available at https://drive.google.com/drive/folders/1gVvRcwDNDP4j_RbYdBm7Z0FKcUp95wNQ?usp=sharing. Interested readers can download the data and explore it further.

Previous articleElon Musk Envisions X as an Open Source News Platform
Next articleUnleashing the Power of ReactPy
The author is a member of IEEE, IET, with more than 20 years of experience in open source versions of UNIX operating systems and Sun Solaris. He is presently working on data analysis and machine learning using a neural network and different statistical tools. He has also jointly authored a textbook called ‘MATLAB for Engineering and Science’. He can be reached at dipankarray@ieee.org.

LEAVE A REPLY

Please enter your comment!
Please enter your name here