Python Pandas Library
Pandas is one of the most commonly used libraries in Python, and it becomes even more powerful when combined with the NumPy library.

Pandas is similar to NumPy, but it is better known for labeled series and large-scale data analysis. Let's start our article by sharing the official documentation addresses for the Python Pandas library.

Installation, Documentation: https://pandas.pydata.org/
GitHub : https://github.com/pandas-dev/pandas
GitHub Pandas Education: https://github.com/omersahintr/BootCampEdu/tree/main/Pandas
The Pandas library is usually used together with the NumPy library, so both libraries are imported together.
Index
Python Pandas Startup Commands
Series in Pandas
How to create a Pandas Series from a dictionary, a NumPy array, or a list?
Let's look at how to do this with two different methods. Protocol names and their corresponding port numbers will be defined as a Pandas Series. The screen output of both methods is given in the comment lines.
import pandas as pd
import numpy as np
##Python Pandas Series ##
#Method-1
dictionary = {"http" : 80, "https" : 443, "rdc" : 3389, "smtp" : 25 }
print(pd.Series(dictionary)) # http 80
# https 443
# rdc 3389
# smtp 25
# dtype: int64
#Method-2
ports = [80, 443, 3389, 25]
protocols = ["http", "https", "rdc", "smtp"]
print(pd.Series(index=protocols, data=ports)) # http 80
# https 443
# rdc 3389
# smtp 25
# dtype: int64
Creating a Pandas Series from a NumPy Array
How to create a Pandas Series from an existing NumPy array?
Since NumPy arrays carry no labels, Pandas automatically assigns an index starting at 0 when converting to a Series. The screen output is again written in the comment lines.
import pandas as pd
import numpy as np
##NumPy Array to Python Pandas Series
n_array = np.arange(20,70,10)
print(pd.Series(n_array)) # 0 20
# 1 30
# 2 40
# 3 50
# 4 60
# dtype: int64
Combining a List and a NumPy Array to Create a Pandas Series
Let's create a Pandas Series from two different data types. The screen output is given as comment lines in the code block. The important point here is that the variables to be merged must have the same number of elements.
import pandas as pd
import numpy as np
##NumpyArray and Python List merge to Python Pandas Series:
nArray = np.array([10,20,30,40]) # a NumPy Array
pList = ["Mercedes","BMW", "Tesla","Audi"] # a List
pSeries = pd.Series(index=pList,data=nArray) # Merge to Pandas Series
print(pSeries) # Mercedes 10
# BMW 20
# Tesla 30
# Audi 40
print(pSeries["Tesla"]) # 30
Mathematical Operations in Pandas Series
Mathematical operations on Pandas Series work much like NumPy arrays; the main difference is that Pandas aligns the values by index.
Let's build our code around a scenario.
Scenario: 3 different brands of cars run 2 different races and collect speed-index points. Let's write Python code that finds the car with the highest total speed index over the two races:
import pandas as pd
import numpy as np
## Series Math Operation in Python Pandas:
race1 = pd.Series(data=[220,250,280], index=["Audi","Mercedes", "Tesla"])
race2 = pd.Series(data=[260,230,300], index=["Audi","Mercedes", "Tesla"])
result = race1 + race2
print(result)
print(f"The Winner is {result[result == result.max()]}")
#Results:
# Audi 480
# Mercedes 480
# Tesla 580
# dtype: int64
# The Winner is Tesla 580
# dtype: int64
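If you only need the winner's name rather than the whole filtered Series, `idxmax()` is a handy alternative (a small sketch reusing the same race data):

```python
import pandas as pd

race1 = pd.Series(data=[220, 250, 280], index=["Audi", "Mercedes", "Tesla"])
race2 = pd.Series(data=[260, 230, 300], index=["Audi", "Mercedes", "Tesla"])
result = race1 + race2

winner = result.idxmax()  # index label of the largest value
print(f"The Winner is {winner} with {result.max()} points")  # The Winner is Tesla with 580 points
```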
Note: Car brands and speed indexes are made up at random; they do not reflect reality.
Mathematical Operations on Pandas Series Consisting of Data with Different Indexes
The first Series contains data for the cities of London, Tokyo, and New York.
The second Series contains data for the cities of London, Tokyo, and Istanbul.
For London and Tokyo, which are present in both Series, addition is performed; but since New York and Istanbul do not appear in both Series at the same time, the result "NaN" is returned for them. You can see the results in the comment lines.
import pandas as pd
import numpy as np
## Pandas Series Operations with Different indexes
serie1 = pd.Series(data=[100,200,300], index=["London","Tokyo","New York"])
serie2 = pd.Series(data=[140,250,410], index=["London","Tokyo","Istanbul"])
operation = serie1 + serie2
print(operation) # Istanbul NaN
# London 240.0
# New York NaN
# Tokyo 450.0
# dtype: float64
Pandas DataFrame
How do we work with the DataFrame data type, which allows more extensive operations than a Series? With
- data = np.array([[1,2,3],[0,3,2],[8,9,7],[3,5,8]])
- data_frame = pd.DataFrame(data)
a NumPy array is converted to a Pandas DataFrame.
import pandas as pd
import numpy as np
## Pandas DataFrame:
data = np.array([[1,2,3],[0,3,2],[8,9,7],[3,5,8]])
data_frame = pd.DataFrame(data)
print(data_frame[1][2]) # : 9
print(data_frame) # 0 1 2
# 0 1 2 3
# 1 0 3 2
# 2 8 9 7
# 3 3 5 8
The 0-1-2 column indexes and 0-1-2-3 row indexes along the edges are quite confusing. The Pandas DataFrame object has a solution for this: if you want, you can name rows and columns by assigning them from an array (or even a list) variable. You can see the example below.
In our example, the number of times Neo, Morpheus, Agent Smith, and Trinity entered the Matrix in each month is kept in a DataFrame object.
import pandas as pd
import numpy as np
## Pandas DataFrame row-name and column-name define:
data = np.array([[1,2,3,2],[1,0,3,2],[3,8,9,7],[3,5,8,9]])
name_data = ["Neo","Morpheus","Smith","Trinity"]
period_data = ["January","February","March","April"]
data_frame = pd.DataFrame(data, index=name_data, columns=period_data) # column and row names defined
print(data_frame)
# January February March April
#Neo 1 2 3 2
#Morpheus 1 0 3 2
#Smith 3 8 9 7
#Trinity 3 5 8 9
How do we pull any information we want from this new DataFrame? It is also very simple.
- data_frame["March"] lists everyone's March entry counts as a Pandas Series.
- data_frame["March"].Neo gives Neo's number of entries in March. Or;
- data_frame["March"]["Neo"] also gives Neo's number of entries in March.
import pandas as pd
import numpy as np
## Pandas DataFrame row-name and column-name define:
data = np.array([[1,2,3,2],[1,0,3,2],[3,8,9,7],[3,5,8,9]])
name_data = ["Neo","Morpheus","Smith","Trinity"]
period_data = ["January","February","March","April"]
data_frame = pd.DataFrame(data, index=name_data, columns=period_data) # column and row names defined
print(f"Neo on March {data_frame['March'].Neo}") # : Neo on March 3
print(data_frame["March"]["Neo"]) # : 3 -- same as above
print(data_frame["April"])
# Neo 2
# Morpheus 2
# Smith 7
# Trinity 9
print((data_frame["April"].mean())) # 5.0
Data Extraction with loc and iloc
- data_frame.loc["Neo"] gives Neo's entry counts for all months.
- data_frame.loc["Neo"].mean() calculates Neo's average number of entries into the Matrix.
- data_frame.iloc[0] again gives Neo's entry counts (because his row index is 0).
import pandas as pd
import numpy as np
## Pandas DataFrame row-name and column-name define:
data = np.array([[1,2,3,2],[1,0,3,2],[3,8,9,7],[3,5,8,9]])
name_data = ["Neo","Morpheus","Smith","Trinity"]
period_data = ["January","February","March","April"]
data_frame = pd.DataFrame(data, index=name_data, columns=period_data) # column and row names defined
print(data_frame)
# January February March April
#Neo 1 2 3 2
#Morpheus 1 0 3 2
#Smith 3 8 9 7
#Trinity 3 5 8 9
print(f"Neo on March {data_frame['March'].Neo}") # : Neo on March 3
print(data_frame["March"]["Neo"]) # : 3 -- same as above
print(data_frame["April"])
# Neo 2
# Morpheus 2
# Smith 7
# Trinity 9
print((data_frame["April"].mean())) # 5.0
print((data_frame.loc["Neo"])) #: Neo's values by months
# January 1
# February 2
# March 3
# April 2
print((data_frame.loc["Neo"]).mean()) # : 2.0
print((data_frame.iloc[0]).mean()) # : 2.0
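`loc` and `iloc` also accept lists and slices, which is handy for pulling several rows or columns at once. A brief sketch on the same table (note that label slices with `loc` are inclusive):

```python
import numpy as np
import pandas as pd

data = np.array([[1, 2, 3, 2], [1, 0, 3, 2], [3, 8, 9, 7], [3, 5, 8, 9]])
data_frame = pd.DataFrame(data, index=["Neo", "Morpheus", "Smith", "Trinity"],
                          columns=["January", "February", "March", "April"])

print(data_frame.loc[["Neo", "Trinity"], "March"])        # March counts for two people only
print(data_frame.iloc[0:2])                               # first two rows, by position
print(data_frame.loc["Neo":"Smith", "February":"March"])  # label slices include both ends
```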
Adding a New Data Set (May)
There is a very easy method for this. Let's add a May column whose values are twice those of April. The Pandas library provides a very flexible working environment for data sets.
- data_frame["May"] = data_frame["April"] * 2
import pandas as pd
import numpy as np
## Pandas DataFrame row-name and column-name define:
data = np.array([[1,2,3,2],[1,0,3,2],[3,8,9,7],[3,5,8,9]])
name_data = ["Neo","Morpheus","Smith","Trinity"]
period_data = ["January","February","March","April"]
data_frame = pd.DataFrame(data, index=name_data, columns=period_data) # column and row names defined
data_frame["May"] = data_frame["April"] * 2 # : create new month and April x 2
print(data_frame)
# January February March April May
# Neo 1 2 3 2 4
# Morpheus 1 0 3 2 4
# Smith 3 8 9 7 14
# Trinity 3 5 8 9 18
Dropping an Existing Row or Column and Printing the Remaining Part
For example, we will use the drop() method to drop Neo's row or the February column and print the remaining data on the screen.
- data_frame.drop(labels="Neo", axis=0) deletes a row
- data_frame.drop(labels="February", axis=1) deletes a column.
After drop(), the original DataFrame is not changed; only the modified version is displayed on the screen. If you print the DataFrame object again, the old version appears.
import pandas as pd
import numpy as np
## Pandas DataFrame row-name and column-name define:
data = np.array([[1,2,3,2],[1,0,3,2],[3,8,9,7],[3,5,8,9]])
name_data = ["Neo","Morpheus","Smith","Trinity"]
period_data = ["January","February","March","April"]
data_frame = pd.DataFrame(data, index=name_data, columns=period_data) # column and row names defined
data_frame["May"] = data_frame["April"] * 2 # May column carried over from the previous example
# drop row operations:
print(data_frame.drop(labels="Neo", axis=0))
# January February March April May
# Morpheus 1 0 3 2 4
# Smith 3 8 9 7 14
# Trinity 3 5 8 9 18
# drop column operations:
print(data_frame.drop(labels="February",axis=1))
# January March April May
# Neo 1 3 2 4
# Morpheus 1 3 2 4
# Smith 3 9 7 14
# Trinity 3 8 9 18
If you drop from a Python Pandas DataFrame object and want the original data set to change accordingly, you must set the inplace= parameter to True. Since this parameter is False by default, your original data set will not change unless you specify it. Let's see it in the example;
import pandas as pd
import numpy as np
## Pandas DataFrame row-name and column-name define:
data = np.array([[1,2,3,2],[1,0,3,2],[3,8,9,7],[3,5,8,9]])
name_data = ["Neo","Morpheus","Smith","Trinity"]
period_data = ["January","February","March","April"]
data_frame = pd.DataFrame(data, index=name_data, columns=period_data) # column and row names defined
data_frame["May"] = data_frame["April"] * 2 # May column carried over from the previous example
# drop column operations change the original data
data_frame.drop(labels="April", axis=1, inplace=True)
print(data_frame)
# January February March May
# Neo 1 2 3 4
# Morpheus 1 0 3 4
# Smith 3 8 9 14
# Trinity 3 5 8 18
Greater Than, Less Than, Equality and Inequality Comparisons (>, <, ==, !=)
How do we use greater-than, less-than, and equality expressions on a Pandas DataFrame object?
import pandas as pd
import numpy as np
## Pandas DataFrame row-name and column-name define:
data = np.array([[1,2,3,2],[1,0,3,2],[3,8,9,7],[3,5,8,9]])
name_data = ["Neo","Morpheus","Smith","Trinity"]
period_data = ["January","February","March","April"]
data_frame = pd.DataFrame(data, index=name_data, columns=period_data) # column and row names defined
data_frame["May"] = data_frame["April"] * 2 # carry over May and drop April as in the previous example
data_frame.drop(labels="April", axis=1, inplace=True)
# comparison (> < == !=) operations:
print(">2 elements:")
print(data_frame[data_frame>2])
# >2 elements:
# January February March May
# Neo NaN NaN 3 4
# Morpheus NaN NaN 3 4
# Smith 3.0 8.0 9 14
# Trinity 3.0 5.0 8 18
print("<4 elements:")
print(data_frame[data_frame<4])
# <4 elements:
# January February March May
# Neo 1 2.0 3.0 NaN
# Morpheus 1 0.0 3.0 NaN
# Smith 3 NaN NaN NaN
# Trinity 3 NaN NaN NaN
print("=3 elements")
print(data_frame[data_frame==3])
# =3 elements
# January February March May
# Neo NaN NaN 3.0 NaN
# Morpheus NaN NaN 3.0 NaN
# Smith 3.0 NaN NaN NaN
# Trinity 3.0 NaN NaN NaN
print("!=3 elements")
print(data_frame[data_frame!=3])
# !=3 elements
# January February March May
# Neo 1.0 2 NaN 4
# Morpheus 1.0 0 NaN 4
# Smith NaN 8 9.0 14
# Trinity NaN 5 8.0 18
Here, NaN replaces the elements that do not meet the condition.
If you want, you can apply condition expressions within a single month or a single person's data only. Let's see it in the example:
import pandas as pd
import numpy as np
## Pandas DataFrame row-name and column-name define:
data = np.array([[1,2,3,2],[1,0,3,2],[3,8,9,7],[3,5,8,9]])
name_data = ["Neo","Morpheus","Smith","Trinity"]
period_data = ["January","February","March","April"]
data_frame = pd.DataFrame(data, index=name_data, columns=period_data) # column and row names defined
data_frame["May"] = data_frame["April"] * 2 # carry over May and drop April as in the previous example
data_frame.drop(labels="April", axis=1, inplace=True)
# matching on selected rows or columns only
print(">4 just in May:")
print(data_frame[data_frame["May"]>4])
# >4 just in May:
# January February March May
# Smith 3 8 9 14
# Trinity 3 5 8 18
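Multiple conditions can be combined with `&` (and) and `|` (or); each condition must be wrapped in parentheses. A small sketch on the same data:

```python
import numpy as np
import pandas as pd

data = np.array([[1, 2, 3, 2], [1, 0, 3, 2], [3, 8, 9, 7], [3, 5, 8, 9]])
data_frame = pd.DataFrame(data, index=["Neo", "Morpheus", "Smith", "Trinity"],
                          columns=["January", "February", "March", "April"])
data_frame["May"] = data_frame["April"] * 2

# rows where May > 4 AND January < 4 (only Smith and Trinity qualify)
filtered = data_frame[(data_frame["May"] > 4) & (data_frame["January"] < 4)]
print(filtered)
```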
Index Reset Operation
- data_frame.reset_index()
turns the indexes in the DataFrame into 0-1-2…
- data_frame.set_index("col_name")
lets you use any column you want as the index. This does not change the original indexes; it only affects the displayed result. To change them permanently,
- data_frame.set_index("col_name", inplace=True)
is how you should write it (note that set_index takes no axis parameter).
Let's see it in the example:
import pandas as pd
import numpy as np
## Pandas DataFrame row-name and column-name define:
data = np.array([[1,2,3,2],[1,0,3,2],[3,8,9,7],[3,5,8,9]])
name_data = ["Neo","Morpheus","Smith","Trinity"]
period_data = ["January","February","March","April"]
data_frame = pd.DataFrame(data, index=name_data, columns=period_data) # column and row names defined
data_frame["May"] = data_frame["April"] * 2 # carry over May and drop April as in the previous example
data_frame.drop(labels="April", axis=1, inplace=True)
# set_index() and reset_index() methods:
print("reset_index():")
print(data_frame.reset_index())
# reset_index():
# index January February March May
# 0 Neo 1 2 3 4
# 1 Morpheus 1 0 3 4
# 2 Smith 3 8 9 14
# 3 Trinity 3 5 8 18
print("index --> May columns")
print(data_frame.set_index("May"))
# index --> May columns
# January February March
# May
# 4 1 2 3
# 4 1 0 3
# 14 3 8 9
# 18 3 5 8
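A minimal sketch of making the index change permanent with inplace=True:

```python
import numpy as np
import pandas as pd

data = np.array([[1, 2, 3, 2], [1, 0, 3, 2], [3, 8, 9, 7], [3, 5, 8, 9]])
data_frame = pd.DataFrame(data, index=["Neo", "Morpheus", "Smith", "Trinity"],
                          columns=["January", "February", "March", "April"])

data_frame.reset_index(inplace=True)         # names move into a regular "index" column
data_frame.set_index("index", inplace=True)  # and can be set back as the index permanently
print(data_frame)
```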
Advanced Pandas Operations
A data set prepared with the Python Pandas library may contain missing data, which will pull your calculations away from the real result when you work with large data sets.
- data_frame.dropna() – removes rows containing NaN
- data_frame.dropna(axis=1) – removes columns containing NaN.
import numpy as np
import pandas as pd
data_dict = {"Neo":[30,50,np.nan], "Morpheus":[10,np.nan,40], "Trinity":[20,30,40]}
data_frame = pd.DataFrame(data_dict) # renamed to avoid shadowing the built-in dict
print(data_frame)
# Neo Morpheus Trinity
# 0 30.0 10.0 20
# 1 50.0 NaN 30
# 2 NaN 40.0 40
# dropna() method:
print(data_frame.dropna())
# Neo Morpheus Trinity
# 0 30.0 10.0 20
print(data_frame.dropna(axis=1))
# Trinity
# 0 20
# 1 30
# 2 40
As seen in the example, rows or columns containing NaN are completely removed.
Using dropna() in Data Sets with More NaNs
- new_data_frame.dropna(axis=1, thresh=2)
keeps only the columns that contain at least 2 valid (non-NaN) values and drops the rest. The thresh value sets the minimum number of valid entries required; with 2, each of the Neo, Morpheus, Trinity, and Smith columns must hold at least 2 normal values to survive.
import numpy as np
import pandas as pd
# data with a higher NaN count
new_dict = {"Neo":[30,50,np.nan], "Morpheus":[10,np.nan,40], "Trinity":[20,30,40], "Smith":[15,np.nan,np.nan]}
new_data_frame = pd.DataFrame(new_dict)
print(new_data_frame.dropna(axis=1, thresh=2)) # columns with fewer than 2 valid values are dropped; Smith's column is removed.
# Neo Morpheus Trinity
# 0 30.0 10.0 20
# 1 50.0 NaN 30
# 2 NaN 40.0 40
print(new_data_frame.dropna(axis=1, thresh=1))
# Neo Morpheus Trinity Smith
# 0 30.0 10.0 20 15.0
# 1 50.0 NaN 30 NaN
# 2 NaN 40.0 40 NaN
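Beyond axis= and thresh=, dropna() also accepts how= and subset=: how="all" removes a row only when every value in it is NaN, and subset= restricts the NaN check to the listed columns. A brief sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Neo": [30, 50, np.nan],
                   "Morpheus": [10, np.nan, 40],
                   "Trinity": [20, 30, 40]})

print(df.dropna(how="all"))       # keeps every row: none is entirely NaN
print(df.dropna(subset=["Neo"]))  # drops only the rows where the Neo column is NaN
```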
Fill Missing Data (NaN) with Other Data
Fill the NaN values with different values instead.
- new_data_frame.fillna(20) – substitutes 20 for the NaN values.
import numpy as np
import pandas as pd
new_dict = {"Neo":[30,50,np.nan], "Morpheus":[10,np.nan,40], "Trinity":[20,30,40], "Smith":[15,np.nan,np.nan]}
new_data_frame = pd.DataFrame(new_dict)
# fillna() methods:
print(new_data_frame.fillna(20))
# Neo Morpheus Trinity Smith
# 0 30.0 10.0 20 15.0
# 1 50.0 20.0 30 20.0
# 2 20.0 40.0 40 20.0
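Instead of one constant, each column can also be filled with its own mean, which usually distorts summary statistics less (a small sketch):

```python
import numpy as np
import pandas as pd

new_dict = {"Neo": [30, 50, np.nan], "Morpheus": [10, np.nan, 40],
            "Trinity": [20, 30, 40], "Smith": [15, np.nan, np.nan]}
new_data_frame = pd.DataFrame(new_dict)

# fill each column's NaNs with that column's mean
filled = new_data_frame.fillna(new_data_frame.mean())
print(filled)
```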
Pandas Data Grouping
Grouping is a frequently used feature of the Pandas library. It collects rows of the same type so you can then run Pandas operations (max, min, count, etc.) within each group.
- sales_data_frames.groupby("Column Name") returns a GroupBy object (printing it shows only a memory address). You can run almost all Pandas operations on it with a period at the end.
- sales_data_frames.groupby("Column Name").count()
- sales_data_frames.groupby("Column Name")["numeric Column Name"].mean()
- sales_data_frames.groupby("Column Name").mean(numeric_only=True) averages only the numeric columns.
- sales_data_frames.groupby("Column Name").describe() gives summary statistics.
- sales_data_frames.groupby("Column Name").describe().transpose() swaps rows and columns for a more readable view.
You can examine the sample code and screen outputs below.
import numpy as np
import pandas as pd
# Group Data
sales_data = {"Code":["C++","PHP","Python","PHP","C++","PHP"],"Names":["Neo","Morpheus","Trinity","Smith","Adam","Joe"], "Money":[250,120,200,350,300,100]}
sales_data_frame = pd.DataFrame(sales_data)
print(sales_data_frame)
# Code Names Money
# 0 C++ Neo 250
# 1 PHP Morpheus 120
# 2 Python Trinity 200
# 3 PHP Smith 350
# 4 C++ Adam 300
# 5 PHP Joe 100
groups = sales_data_frame.groupby("Code")
print(groups.count())
# Names Money
# Code
# C++ 2 2
# PHP 3 3
# Python 1 1
print(groups.mean(numeric_only=True))
# Money
# Code
# C++ 275.0
# PHP 190.0
# Python 200.0
print(groups.min(numeric_only=True))
# Money
# Code
# C++ 250
# PHP 100
# Python 200
print(groups.max(numeric_only=True))
# Money
# Code
# C++ 300
# PHP 350
# Python 200
print(groups.describe())
# Money
# count mean std min 25% 50% 75% max
# Code
# C++ 2.0 275.0 35.355339 250.0 262.5 275.0 287.5 300.0
# PHP 3.0 190.0 138.924440 100.0 110.0 120.0 235.0 350.0
# Python 1.0 200.0 NaN 200.0 200.0 200.0 200.0 200.0
print(groups.describe().transpose())
# Code C++ PHP Python
# Money count 2.000000 3.00000 1.0
# mean 275.000000 190.00000 200.0
# std 35.355339 138.92444 NaN
# min 250.000000 100.00000 200.0
# 25% 262.500000 110.00000 200.0
# 50% 275.000000 120.00000 200.0
# 75% 287.500000 235.00000 200.0
# max 300.000000 350.00000 200.0
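Several statistics can also be requested in one shot with agg(), without the full describe() table (a sketch on the same sales data):

```python
import pandas as pd

sales_data = {"Code": ["C++", "PHP", "Python", "PHP", "C++", "PHP"],
              "Names": ["Neo", "Morpheus", "Trinity", "Smith", "Adam", "Joe"],
              "Money": [250, 120, 200, 350, 300, 100]}
sales_data_frame = pd.DataFrame(sales_data)

# count, mean and max of Money for each Code group, in one table
stats = sales_data_frame.groupby("Code")["Money"].agg(["count", "mean", "max"])
print(stats)
```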
Data Frame Concatenation – concat()
The concat() method combines multiple Pandas DataFrame objects into a single DataFrame.
- pd.concat([new_data_frame1, new_data_frame2, new_data_frame3], ignore_index=True) concatenates the 3 data frames, and the ignore_index=True parameter renumbers the indexes from scratch.
import pandas as pd
import numpy as np
new_dict1 = {"Name":["John","Alice","Bob","Doe","Smith"],
"Sports":["Football","Basketball","Cricket","Tennis","Baseball"],
"Calori":[230,240,220,250,260]
}
new_dict2 = {"Name":["Jack","Adam","Brain","Doh","Sarah"],
"Sports":["Football","Basketball","Cricket","Tennis","Baseball"],
"Calori":[130,140,120,150,160]
}
new_dict3 = {"Name":["Homer","Jason","Track","Dora","Fiona"],
"Sports":["Football","Basketball","Cricket","Tennis","Baseball"],
"Calori":[330,340,320,350,360]
}
new_data_frame1 = pd.DataFrame(new_dict1)
new_data_frame2 = pd.DataFrame(new_dict2)
new_data_frame3 = pd.DataFrame(new_dict3)
#Concatenation of DataFrames
new_data_frame = pd.concat([new_data_frame1,new_data_frame2,new_data_frame3],ignore_index=True)
"""##### Name Sports Calori
0 John Football 230
1 Alice Basketball 240
2 Bob Cricket 220
3 Doe Tennis 250
4 Smith Baseball 260
5 Jack Football 130
6 Adam Basketball 140
7 Brain Cricket 120
8 Doh Tennis 150
9 Sarah Baseball 160
10 Homer Football 330
11 Jason Basketball 340
12 Track Cricket 320
13 Dora Tennis 350
14 Fiona Baseball 360
#####"""
print(new_data_frame)
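concat() can also glue frames side by side with axis=1; rows are matched by index (a small sketch with made-up frames):

```python
import pandas as pd

left = pd.DataFrame({"Name": ["John", "Alice"]})
right = pd.DataFrame({"Calori": [230, 240]})

# axis=1 places the frames next to each other instead of stacking them
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side)
#     Name  Calori
# 0   John     230
# 1  Alice     240
```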
Merging with the merge() Method
merge() combines frames on their common data, which gives a more useful view. It will be clearer with an example:
The Sports column is our intersection set, meaning more than one person plays the same sport. To merge on it, we set the parameter on="Sports".
import pandas as pd
import numpy as np
new_dict1 = {"Name":["John","Alice","Bob","Doe","Smith"],
"Sports":["Football","Basketball","Cricket","Tennis","Baseball"],
"Calori":[230,240,220,250,260]
}
new_dict2 = {"Name":["Jack","Adam","Brain","Doh","Sarah"],
"Sports":["Football","Basketball","Cricket","Tennis","Baseball"],
"Calori":[130,140,120,150,160]
}
new_data_frame1 = pd.DataFrame(new_dict1)
new_data_frame2 = pd.DataFrame(new_dict2)
#Merging of DataFrames
merge_data_frame = pd.merge(new_data_frame1,new_data_frame2,on="Sports")
print(merge_data_frame)
"""#### Name_x Sports Calori_x Name_y Calori_y
0 John Football 230 Jack 130
1 Alice Basketball 240 Adam 140
2 Bob Cricket 220 Brain 120
3 Doe Tennis 250 Doh 150
4 Smith Baseball 260 Sarah 160
#####"""
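merge() performs an inner join by default; the how= parameter switches to left, right, or outer joins, and suffixes= replaces the default _x/_y endings (a small sketch with made-up frames):

```python
import pandas as pd

df1 = pd.DataFrame({"Sports": ["Football", "Tennis"], "Calori": [230, 250]})
df2 = pd.DataFrame({"Sports": ["Football", "Golf"], "Calori": [130, 170]})

# outer join keeps Tennis and Golf too, filling the gaps with NaN
merged = pd.merge(df1, df2, on="Sports", how="outer", suffixes=("_a", "_b"))
print(merged)
```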
Pulling Unique Data with the unique() and nunique() Methods
For example, consider a university's student list whose columns contain first name, last name, and department. Here, the Pandas unique() method pulls the list of all departments out of tens of thousands of rows; nunique() gives the number of distinct values.
import pandas as pd
import numpy as np
#unique values in the DataFrame
new_sales_dict = {"Name":["John","Alice","Bob","Doe","Smith"],
"Sales":[230,240,220,250,260],
"Profit":[30,40,20,50,60]}
new_sales_frame = pd.DataFrame(new_sales_dict)
print(new_sales_frame)
"""#### Name Sales Profit
0 John 230 30
1 Alice 240 40
2 Bob 220 20
3 Doe 250 50
4 Smith 260 60
#####"""
print(new_sales_frame["Name"].unique())
# ['John' 'Alice' 'Bob' 'Doe' 'Smith']
print(new_sales_frame["Name"].nunique())
# 5
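Closely related is value_counts(), which reports how many times each distinct value occurs (a small sketch):

```python
import pandas as pd

sports = pd.Series(["Football", "Tennis", "Football", "Golf", "Football"])
print(sports.value_counts())  # occurrences per distinct value, most frequent first
print(sports.nunique())       # number of distinct values
```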
Applying a Custom Function to DataFrame Data
We will use the apply() method to apply a built-in Pandas function or a function you define to DataFrame data. Let's define a VAT calculation function and apply it to the DataFrame elements.
import pandas as pd
import numpy as np
# apply() a custom function to DataFrame data
new_sales_dict = {"Name":["John","Alice","Bob","Doe","Smith"],
"Sales":[230,240,220,250,260],
"Profit":[30,40,20,50,60]}
new_sales_frame = pd.DataFrame(new_sales_dict)
#VAT Calculation in DataFrame
def vat_calc(sales):
    return sales * 0.18
print(new_sales_frame["Sales"].apply(vat_calc))
"""#### 0 41.4
1 43.2
2 39.6
3 45.0
4 46.8
Name: Sales, dtype: float64
#####"""
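For one-liners like this, the same calculation can be written with a lambda instead of a named function (a sketch):

```python
import pandas as pd

sales = pd.Series([230, 240, 220, 250, 260])
vat = sales.apply(lambda s: s * 0.18)  # 18% VAT per element
print(vat)
```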
Excel File Operations with Pandas
With the Python Pandas library, we can read and write Excel tables and process their data. The code structure is extremely simple.
- data_frame = pd.read_excel("File.xlsx")
All other operations proceed exactly as with any DataFrame object. Don't forget to install the openpyxl library so that the Excel table can be read into the DataFrame correctly.
import pandas as pd
import numpy as np
import openpyxl
data_frame = pd.read_excel("sales.xlsx")
print(data_frame)
# 1) How many rows of data are there in total?
count_rows = data_frame["Employee Name"].count()
print(f"Total Rows Count: {count_rows}")
# Total Rows Count: 100
# 2) What is the average salary this company pays?
average_salary = data_frame["Salary"].mean()
print(f"Average Salary: {average_salary:.2f}")
# Average Salary: 725.84
# 3) How does the average salary compare across departments at this company?
average_salary_of_department = data_frame.groupby("Department")["Salary"].mean()
print(f"Average Salary of Department: \n{average_salary_of_department.round(2)}")
""" Average Salary of Department:
Department
Finance 805.41
HR 640.94
Marketing 707.00
Sales 656.67
Software Development 826.41
Name: Salary, dtype: float64
"""
# 4) How does the average salary compare by title (senior - junior) at this company?
average_salary_of_title = data_frame.groupby("Title")["Salary"].mean()
print(f"Average Salary of Title: \n{average_salary_of_title.round(2)}")
""" Average Salary of Title:
Title
C-level 1058.33
Junior 440.62
Mid 641.45
Mid-Senior 725.50
Senior 867.78
Name: Salary, dtype: float64
"""
#5) On average, how many percent more salary does a senior person earn than a junior person?
match_avg = data_frame.groupby("Title")["Salary"].mean()
senior_avg = match_avg["Senior"].round(2)
junior_avg = match_avg["Junior"].round(2)
percent_avg = (((senior_avg - junior_avg)/senior_avg)*100).round(2)
print(f"mean difference = ({senior_avg} - {junior_avg}) / {senior_avg} = {percent_avg}%")
""" mean difference = (867.78 - 440.62) / 867.78 = 49.22%
"""
# 6) How much more, on average, is the salary of a senior person
# in the software development department compared to a junior person?
salaries_developers = data_frame.groupby("Department")["Salary"].mean()
salaries_juniors = data_frame.groupby("Title")["Salary"].mean()
salary_minus = (salaries_developers["Software Development"] - salaries_juniors["Junior"]).round(2)
print(f"Developers - Juniors= {salary_minus}")
""" Developers - Juniors= 385.78
"""
#7) How much more salary does a c-level person in the finance department earn on average
# compared to a mid-senior person?
salary_finance = data_frame.loc[data_frame["Department"] == "Finance"].groupby("Title")["Salary"].mean()
print(f"Result(7) = {salary_finance['C-level'] - salary_finance['Mid-Senior']}")
""" Result(7) = 240.0
"""
#8) How many times more c-level employees are there in the
# software development department than in the marketing department?
developer_count = (data_frame.loc[data_frame["Department"] == "Software Development"]).groupby("Title")["Salary"].count()
marketing_count = (data_frame.loc[data_frame["Department"] == "Marketing"]).groupby("Department")["Salary"].count()
print(f"Result(8) = {developer_count['C-level'] + marketing_count['Marketing']}")
""" Result(8) = 24
"""
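Writing results back is just as simple with to_excel(); the file name below is only an example, and openpyxl must be installed for .xlsx output:

```python
import pandas as pd

# a small summary frame with made-up values, written out to a new Excel file
summary = pd.DataFrame({"Department": ["HR", "Finance"], "Salary": [640.94, 805.41]})
summary.to_excel("salary_summary.xlsx", index=False)  # index=False drops the 0-1-2 column
```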