Python Pandas Library
Pandas is one of the most commonly used libraries in Python, and it becomes even more powerful when combined with the NumPy library.

Pandas is similar to NumPy, but it is better known for labeled series and large-scale data analysis. Let's start our article by sharing the official documentation addresses for the Python Pandas library.

Installation, Documentation: https://pandas.pydata.org/
GitHub : https://github.com/pandas-dev/pandas
GitHub Pandas Education: https://github.com/omersahintr/BootCampEdu/tree/main/Pandas
The Pandas library is usually used together with the NumPy library, so both libraries are imported together.
Index
Python Pandas Startup Commands
Series in Pandas
How to create a Pandas Series from a dictionary, a NumPy array, or a list?
Let's look at how to do this with two different methods. Protocol names and their corresponding port numbers will be defined as a Pandas Series. The screen output of both methods is given in the comment lines.
import pandas as pd
import numpy as np
##Python Pandas Series ##
#Method-1
dictionary = {"http" : 80, "https" : 443, "rdc" : 3389, "smtp" : 25 }
print(pd.Series(dictionary)) # http 80
# https 443
# rdc 3389
# smtp 25
# dtype: int64
#Method-2
ports = [80, 443, 3389, 25]
protocols = ["http", "https", "rdc", "smtp"]
print(pd.Series(index=protocols, data=ports)) # http 80
# https 443
# rdc 3389
# smtp 25
# dtype: int64
Creating a Pandas Series from a NumPy Array
How to create a Pandas Series from an existing NumPy array?
Since NumPy arrays carry no labels, Pandas automatically assigns an index starting at 0 when converting to a Series. The screen output is again written in the comment lines.
import pandas as pd
import numpy as np
##NumPy Array to Python Pandas Series
n_array = np.arange(20,70,10)
print(pd.Series(n_array)) # 0 20
# 1 30
# 2 40
# 3 50
# 4 60
# dtype: int64
Combining a List and a NumPy Array to Create a Pandas Series
Let's create a Pandas Series from two different data types. The screen output is given as comment lines in the code block. The important point here is that the variables to be merged must have the same number of elements.
import pandas as pd
import numpy as np
##NumpyArray and Python List merge to Python Pandas Series:
nArray = np.array([10,20,30,40]) # a NumPy Array
pList = ["Mercedes","BMW", "Tesla","Audi"] # a List
pSeries = pd.Series(index=pList,data=nArray) # Merge to Pandas Series
print(pSeries) # Mercedes 10
# BMW 20
# Tesla 30
# Audi 40
print(pSeries["Tesla"]) # 30
Mathematical Operations in Pandas Series
Mathematical operations on Pandas Series work much like NumPy arrays; the main difference is that Pandas aligns the values by index.
Let's build our code around a scenario.
Scenario: 3 different brands of cars run 2 different races and collect speed-index points. Let's write Python code that finds the car with the highest total speed index over the two races:
import pandas as pd
import numpy as np
## Series Math Operation in Python Pandas:
race1 = pd.Series(data=[220,250,280], index=["Audi","Mercedes", "Tesla"])
race2 = pd.Series(data=[260,230,300], index=["Audi","Mercedes", "Tesla"])
result = race1 + race2
print(result)
print(f"The Winner is {result[result == result.max()]}")
#Results:
# Audi 480
# Mercedes 480
# Tesla 580
# dtype: int64
# The Winner is Tesla 580
# dtype: int64
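If you only need the winner's name rather than the whole filtered Series, `idxmax()` is a handy alternative (a small sketch reusing the same race data):

```python
import pandas as pd

race1 = pd.Series(data=[220, 250, 280], index=["Audi", "Mercedes", "Tesla"])
race2 = pd.Series(data=[260, 230, 300], index=["Audi", "Mercedes", "Tesla"])
result = race1 + race2

winner = result.idxmax()  # index label of the largest value
print(f"The Winner is {winner} with {result.max()} points")  # The Winner is Tesla with 580 points
```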
Note: Car brands and speed indexes are made up at random; they do not reflect reality.
Mathematical Operations on Pandas Series Consisting of Data with Different Indexes
The first Series contains data for the cities of London, Tokyo, and New York.
The second Series contains data for the cities of London, Tokyo, and Istanbul.
For London and Tokyo, which are present in both Series, addition is performed; but since New York and Istanbul do not appear in both Series at the same time, the result "NaN" is returned for them. You can see the results in the comment lines.
import pandas as pd
import numpy as np
## Pandas Series Operations with Different indexes
serie1 = pd.Series(data=[100,200,300], index=["London","Tokyo","New York"])
serie2 = pd.Series(data=[140,250,410], index=["London","Tokyo","Istanbul"])
operation = serie1 + serie2
print(operation) # Istanbul NaN
# London 240.0
# New York NaN
# Tokyo 450.0
# dtype: float64
Pandas DataFrame
How do we work with the DataFrame data type, which allows more extensive operations than a Series? With
- data = np.array([[1,2,3],[0,3,2],[8,9,7],[3,5,8]])
- data_frame = pd.DataFrame(data)
a NumPy array is converted to a Pandas DataFrame.
import pandas as pd
import numpy as np
## Pandas DataFrame:
data = np.array([[1,2,3],[0,3,2],[8,9,7],[3,5,8]])
data_frame = pd.DataFrame(data)
print(data_frame[1][2]) # : 9
print(data_frame) # 0 1 2
# 0 1 2 3
# 1 0 3 2
# 2 8 9 7
# 3 3 5 8
The 0-1-2 column indexes and 0-1-2-3 row indexes along the edges are quite confusing. The Pandas DataFrame object has a solution for this: if you want, you can name rows and columns by assigning them from an array (or even a list) variable. You can see the example below.
In our example, the number of times Neo, Morpheus, Agent Smith, and Trinity entered the Matrix in each month is kept in a DataFrame object.
import pandas as pd
import numpy as np
## Pandas DataFrame row-name and column-name define:
data = np.array([[1,2,3,2],[1,0,3,2],[3,8,9,7],[3,5,8,9]])
name_data = ["Neo","Morpheus","Smith","Trinity"]
period_data = ["January","February","March","April"]
data_frame = pd.DataFrame(data, index=name_data, columns=period_data) # column and row names defined
print(data_frame)
# January February March April
#Neo 1 2 3 2
#Morpheus 1 0 3 2
#Smith 3 8 9 7
#Trinity 3 5 8 9
How do we pull any information we want from this new DataFrame? It is also very simple.
- data_frame["March"] lists everyone's March entry counts as a Pandas Series.
- data_frame["March"].Neo gives Neo's number of entries in March. Or;
- data_frame["March"]["Neo"] also gives Neo's number of entries in March.
import pandas as pd
import numpy as np
## Pandas DataFrame row-name and column-name define:
data = np.array([[1,2,3,2],[1,0,3,2],[3,8,9,7],[3,5,8,9]])
name_data = ["Neo","Morpheus","Smith","Trinity"]
period_data = ["January","February","March","April"]
data_frame = pd.DataFrame(data, index=name_data, columns=period_data) # column and row names defined
print(f"Neo on March {data_frame['March'].Neo}") # : Neo on March 3
print(data_frame["March"]["Neo"]) # : 3 -- same as above
print(data_frame["April"])
# Neo 2
# Morpheus 2
# Smith 7
# Trinity 9
print((data_frame["April"].mean())) # 5.0
Data Extraction with loc and iloc
- data_frame.loc["Neo"] gives Neo's entry counts for all months.
- data_frame.loc["Neo"].mean() calculates Neo's average number of entries into the Matrix.
- data_frame.iloc[0] again gives Neo's entry counts (because his row index is 0).
import pandas as pd
import numpy as np
## Pandas DataFrame row-name and column-name define:
data = np.array([[1,2,3,2],[1,0,3,2],[3,8,9,7],[3,5,8,9]])
name_data = ["Neo","Morpheus","Smith","Trinity"]
period_data = ["January","February","March","April"]
data_frame = pd.DataFrame(data, index=name_data, columns=period_data) # column and row names defined
print(data_frame)
# January February March April
#Neo 1 2 3 2
#Morpheus 1 0 3 2
#Smith 3 8 9 7
#Trinity 3 5 8 9
print(f"Neo on March {data_frame['March'].Neo}") # : Neo on March 3
print(data_frame["March"]["Neo"]) # : 3 -- same as above
print(data_frame["April"])
# Neo 2
# Morpheus 2
# Smith 7
# Trinity 9
print((data_frame["April"].mean())) # 5.0
print((data_frame.loc["Neo"])) #: Neo's values by months
# January 1
# February 2
# March 3
# April 2
print((data_frame.loc["Neo"]).mean()) # : 2.0
print((data_frame.iloc[0]).mean()) # : 2.0
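`loc` and `iloc` also accept lists and slices, which is handy for pulling several rows or columns at once. A brief sketch on the same table (note that label slices with `loc` are inclusive):

```python
import numpy as np
import pandas as pd

data = np.array([[1, 2, 3, 2], [1, 0, 3, 2], [3, 8, 9, 7], [3, 5, 8, 9]])
data_frame = pd.DataFrame(data, index=["Neo", "Morpheus", "Smith", "Trinity"],
                          columns=["January", "February", "March", "April"])

print(data_frame.loc[["Neo", "Trinity"], "March"])        # March counts for two people only
print(data_frame.iloc[0:2])                               # first two rows, by position
print(data_frame.loc["Neo":"Smith", "February":"March"])  # label slices include both ends
```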
Adding a New Data Set (May)
There is a very easy method for this. Let's add a May column whose values are twice those of April. The Pandas library provides a very flexible working environment for data sets.
- data_frame["May"] = data_frame["April"] * 2
import pandas as pd
import numpy as np
## Pandas DataFrame row-name and column-name define:
data = np.array([[1,2,3,2],[1,0,3,2],[3,8,9,7],[3,5,8,9]])
name_data = ["Neo","Morpheus","Smith","Trinity"]
period_data = ["January","February","March","April"]
data_frame = pd.DataFrame(data, index=name_data, columns=period_data) # column and row names defined
data_frame["May"] = data_frame["April"] * 2 # : create new month and April x 2
print(data_frame)
# January February March April May
# Neo 1 2 3 2 4
# Morpheus 1 0 3 2 4
# Smith 3 8 9 7 14
# Trinity 3 5 8 9 18
Dropping an Existing Row or Column and Printing the Remaining Part
For example, we will use the drop() method to drop Neo's row or the February column and print the remaining data on the screen.
- data_frame.drop(labels="Neo", axis=0) deletes a row
- data_frame.drop(labels="February", axis=1) deletes a column.
After drop(), the original DataFrame is not changed; only the modified version is displayed on the screen. If you print the DataFrame object again, the old version appears.
import pandas as pd
import numpy as np
## Pandas DataFrame row-name and column-name define:
data = np.array([[1,2,3,2],[1,0,3,2],[3,8,9,7],[3,5,8,9]])
name_data = ["Neo","Morpheus","Smith","Trinity"]
period_data = ["January","February","March","April"]
data_frame = pd.DataFrame(data, index=name_data, columns=period_data) # column and row names defined
data_frame["May"] = data_frame["April"] * 2 # May column carried over from the previous example
# drop row operations:
print(data_frame.drop(labels="Neo", axis=0))
# January February March April May
# Morpheus 1 0 3 2 4
# Smith 3 8 9 7 14
# Trinity 3 5 8 9 18
# drop column operations:
print(data_frame.drop(labels="February",axis=1))
# January March April May
# Neo 1 3 2 4
# Morpheus 1 3 2 4
# Smith 3 9 7 14
# Trinity 3 8 9 18
If you drop from a Python Pandas DataFrame object and want the original data set to change accordingly, you must set the inplace= parameter to True. Since this parameter is False by default, your original data set will not change unless you specify it. Let's see it in the example;
import pandas as pd
import numpy as np
## Pandas DataFrame row-name and column-name define:
data = np.array([[1,2,3,2],[1,0,3,2],[3,8,9,7],[3,5,8,9]])
name_data = ["Neo","Morpheus","Smith","Trinity"]
period_data = ["January","February","March","April"]
data_frame = pd.DataFrame(data, index=name_data, columns=period_data) # column and row names defined
data_frame["May"] = data_frame["April"] * 2 # May column carried over from the previous example
# drop column operations change the original data
data_frame.drop(labels="April", axis=1, inplace=True)
print(data_frame)
# January February March May
# Neo 1 2 3 4
# Morpheus 1 0 3 4
# Smith 3 8 9 14
# Trinity 3 5 8 18
Greater Than, Less Than, Equality and Inequality Comparisons (>, <, ==, !=)
How do we use greater-than, less-than, and equality expressions on a Pandas DataFrame object?
import pandas as pd
import numpy as np
## Pandas DataFrame row-name and column-name define:
data = np.array([[1,2,3,2],[1,0,3,2],[3,8,9,7],[3,5,8,9]])
name_data = ["Neo","Morpheus","Smith","Trinity"]
period_data = ["January","February","March","April"]
data_frame = pd.DataFrame(data, index=name_data, columns=period_data) # column and row names defined
data_frame["May"] = data_frame["April"] * 2 # carry over May and drop April as in the previous example
data_frame.drop(labels="April", axis=1, inplace=True)
# comparison (> < == !=) operations:
print(">2 elements:")
print(data_frame[data_frame>2])
# >2 elements:
# January February March May
# Neo NaN NaN 3 4
# Morpheus NaN NaN 3 4
# Smith 3.0 8.0 9 14
# Trinity 3.0 5.0 8 18
print("<4 elements:")
print(data_frame[data_frame<4])
# <4 elements:
# January February March May
# Neo 1 2.0 3.0 NaN
# Morpheus 1 0.0 3.0 NaN
# Smith 3 NaN NaN NaN
# Trinity 3 NaN NaN NaN
print("=3 elements")
print(data_frame[data_frame==3])
# =3 elements
# January February March May
# Neo NaN NaN 3.0 NaN
# Morpheus NaN NaN 3.0 NaN
# Smith 3.0 NaN NaN NaN
# Trinity 3.0 NaN NaN NaN
print("!=3 elements")
print(data_frame[data_frame!=3])
# !=3 elements
# January February March May
# Neo 1.0 2 NaN 4
# Morpheus 1.0 0 NaN 4
# Smith NaN 8 9.0 14
# Trinity NaN 5 8.0 18
Here, NaN replaces the elements that do not meet the condition.
If you want, you can apply condition expressions within a single month or a single person's data only. Let's see it in the example:
import pandas as pd
import numpy as np
## Pandas DataFrame row-name and column-name define:
data = np.array([[1,2,3,2],[1,0,3,2],[3,8,9,7],[3,5,8,9]])
name_data = ["Neo","Morpheus","Smith","Trinity"]
period_data = ["January","February","March","April"]
data_frame = pd.DataFrame(data, index=name_data, columns=period_data) # column and row names defined
data_frame["May"] = data_frame["April"] * 2 # carry over May and drop April as in the previous example
data_frame.drop(labels="April", axis=1, inplace=True)
# matching on selected rows or columns only
print(">4 just in May:")
print(data_frame[data_frame["May"]>4])
# >4 just in May:
# January February March May
# Smith 3 8 9 14
# Trinity 3 5 8 18
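Multiple conditions can be combined with `&` (and) and `|` (or); each condition must be wrapped in parentheses. A small sketch on the same data:

```python
import numpy as np
import pandas as pd

data = np.array([[1, 2, 3, 2], [1, 0, 3, 2], [3, 8, 9, 7], [3, 5, 8, 9]])
data_frame = pd.DataFrame(data, index=["Neo", "Morpheus", "Smith", "Trinity"],
                          columns=["January", "February", "March", "April"])
data_frame["May"] = data_frame["April"] * 2

# rows where May > 4 AND January < 4 (only Smith and Trinity qualify)
filtered = data_frame[(data_frame["May"] > 4) & (data_frame["January"] < 4)]
print(filtered)
```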
Index Reset Operation
- data_frame.reset_index()
turns the indexes in the DataFrame into 0-1-2…
- data_frame.set_index("col_name")
lets you use any column you want as the index. This does not change the original indexes; it only affects the displayed result. To change them permanently,
- data_frame.set_index("col_name", inplace=True)
is how you should write it (note that set_index takes no axis parameter).
Let's see it in the example:
import pandas as pd
import numpy as np
## Pandas DataFrame row-name and column-name define:
data = np.array([[1,2,3,2],[1,0,3,2],[3,8,9,7],[3,5,8,9]])
name_data = ["Neo","Morpheus","Smith","Trinity"]
period_data = ["January","February","March","April"]
data_frame = pd.DataFrame(data, index=name_data, columns=period_data) # column and row names defined
data_frame["May"] = data_frame["April"] * 2 # carry over May and drop April as in the previous example
data_frame.drop(labels="April", axis=1, inplace=True)
# set_index() and reset_index() methods:
print("reset_index():")
print(data_frame.reset_index())
# reset_index():
# index January February March May
# 0 Neo 1 2 3 4
# 1 Morpheus 1 0 3 4
# 2 Smith 3 8 9 14
# 3 Trinity 3 5 8 18
print("index --> May columns")
print(data_frame.set_index("May"))
# index --> May columns
# January February March
# May
# 4 1 2 3
# 4 1 0 3
# 14 3 8 9
# 18 3 5 8
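A minimal sketch of making the index change permanent with inplace=True:

```python
import numpy as np
import pandas as pd

data = np.array([[1, 2, 3, 2], [1, 0, 3, 2], [3, 8, 9, 7], [3, 5, 8, 9]])
data_frame = pd.DataFrame(data, index=["Neo", "Morpheus", "Smith", "Trinity"],
                          columns=["January", "February", "March", "April"])

data_frame.reset_index(inplace=True)         # names move into a regular "index" column
data_frame.set_index("index", inplace=True)  # and can be set back as the index permanently
print(data_frame)
```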
Advanced Pandas Operations
A data set prepared with the Python Pandas library may contain missing data, which will pull your calculations away from the real result when you work with large data sets.
- data_frame.dropna() – removes rows containing NaN
- data_frame.dropna(axis=1) – removes columns containing NaN.
import numpy as np
import pandas as pd
data_dict = {"Neo":[30,50,np.nan], "Morpheus":[10,np.nan,40], "Trinity":[20,30,40]}
data_frame = pd.DataFrame(data_dict) # renamed to avoid shadowing the built-in dict
print(data_frame)
# Neo Morpheus Trinity
# 0 30.0 10.0 20
# 1 50.0 NaN 30
# 2 NaN 40.0 40
# dropna() method:
print(data_frame.dropna())
# Neo Morpheus Trinity
# 0 30.0 10.0 20
print(data_frame.dropna(axis=1))
# Trinity
# 0 20
# 1 30
# 2 40
As seen in the example, rows or columns containing NaN are completely removed.
Using dropna() in Data Sets with More NaNs
- new_data_frame.dropna(axis=1, thresh=2)
keeps only the columns that contain at least 2 valid (non-NaN) values and drops the rest. The thresh value sets the minimum number of valid entries required; with 2, each of the Neo, Morpheus, Trinity, and Smith columns must hold at least 2 normal values to survive.
import numpy as np
import pandas as pd
# data with a higher NaN count
new_dict = {"Neo":[30,50,np.nan], "Morpheus":[10,np.nan,40], "Trinity":[20,30,40], "Smith":[15,np.nan,np.nan]}
new_data_frame = pd.DataFrame(new_dict)
print(new_data_frame.dropna(axis=1, thresh=2)) # columns with fewer than 2 valid values are dropped; Smith's column is removed.
# Neo Morpheus Trinity
# 0 30.0 10.0 20
# 1 50.0 NaN 30
# 2 NaN 40.0 40
print(new_data_frame.dropna(axis=1, thresh=1))
# Neo Morpheus Trinity Smith
# 0 30.0 10.0 20 15.0
# 1 50.0 NaN 30 NaN
# 2 NaN 40.0 40 NaN
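Beyond axis= and thresh=, dropna() also accepts how= and subset=: how="all" removes a row only when every value in it is NaN, and subset= restricts the NaN check to the listed columns. A brief sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Neo": [30, 50, np.nan],
                   "Morpheus": [10, np.nan, 40],
                   "Trinity": [20, 30, 40]})

print(df.dropna(how="all"))       # keeps every row: none is entirely NaN
print(df.dropna(subset=["Neo"]))  # drops only the rows where the Neo column is NaN
```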
Fill Missing Data (NaN) with Other Data
Fill the NaN values with different values instead.
- new_data_frame.fillna(20) – substitutes 20 for the NaN values.
import numpy as np
import pandas as pd
new_dict = {"Neo":[30,50,np.nan], "Morpheus":[10,np.nan,40], "Trinity":[20,30,40], "Smith":[15,np.nan,np.nan]}
new_data_frame = pd.DataFrame(new_dict)
# fillna() methods:
print(new_data_frame.fillna(20))
# Neo Morpheus Trinity Smith
# 0 30.0 10.0 20 15.0
# 1 50.0 20.0 30 20.0
# 2 20.0 40.0 40 20.0
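Instead of one constant, each column can also be filled with its own mean, which usually distorts summary statistics less (a small sketch):

```python
import numpy as np
import pandas as pd

new_dict = {"Neo": [30, 50, np.nan], "Morpheus": [10, np.nan, 40],
            "Trinity": [20, 30, 40], "Smith": [15, np.nan, np.nan]}
new_data_frame = pd.DataFrame(new_dict)

# fill each column's NaNs with that column's mean
filled = new_data_frame.fillna(new_data_frame.mean())
print(filled)
```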
Pandas Data Grouping
Grouping is a frequently used feature of the Pandas library. It collects rows of the same type so you can then run Pandas operations (max, min, count, etc.) within each group.
- sales_data_frames.groupby("Column Name") returns a GroupBy object (printing it shows only a memory address). You can run almost all Pandas operations on it with a period at the end.
- sales_data_frames.groupby("Column Name").count()
- sales_data_frames.groupby("Column Name")["numeric Column Name"].mean()
- sales_data_frames.groupby("Column Name").mean(numeric_only=True) averages only the numeric columns.
- sales_data_frames.groupby("Column Name").describe() gives summary statistics.
- sales_data_frames.groupby("Column Name").describe().transpose() swaps rows and columns for a more readable view.
You can examine the sample code and screen outputs below.
import numpy as np
import pandas as pd
# Group Data
sales_data = {"Code":["C++","PHP","Python","PHP","C++","PHP"],"Names":["Neo","Morpheus","Trinity","Smith","Adam","Joe"], "Money":[250,120,200,350,300,100]}
sales_data_frame = pd.DataFrame(sales_data)
print(sales_data_frame)
# Code Names Money
# 0 C++ Neo 250
# 1 PHP Morpheus 120
# 2 Python Trinity 200
# 3 PHP Smith 350
# 4 C++ Adam 300
# 5 PHP Joe 100
groups = sales_data_frame.groupby("Code")
print(groups.count())
# Names Money
# Code
# C++ 2 2
# PHP 3 3
# Python 1 1
print(groups.mean(numeric_only=True))
# Money
# Code
# C++ 275.0
# PHP 190.0
# Python 200.0
print(groups.min(numeric_only=True))
# Money
# Code
# C++ 250
# PHP 100
# Python 200
print(groups.max(numeric_only=True))
# Money
# Code
# C++ 300
# PHP 350
# Python 200
print(groups.describe())
# Money
# count mean std min 25% 50% 75% max
# Code
# C++ 2.0 275.0 35.355339 250.0 262.5 275.0 287.5 300.0
# PHP 3.0 190.0 138.924440 100.0 110.0 120.0 235.0 350.0
# Python 1.0 200.0 NaN 200.0 200.0 200.0 200.0 200.0
print(groups.describe().transpose())
# Code C++ PHP Python
# Money count 2.000000 3.00000 1.0
# mean 275.000000 190.00000 200.0
# std 35.355339 138.92444 NaN
# min 250.000000 100.00000 200.0
# 25% 262.500000 110.00000 200.0
# 50% 275.000000 120.00000 200.0
# 75% 287.500000 235.00000 200.0
# max 300.000000 350.00000 200.0
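Several statistics can also be requested in one shot with agg(), without the full describe() table (a sketch on the same sales data):

```python
import pandas as pd

sales_data = {"Code": ["C++", "PHP", "Python", "PHP", "C++", "PHP"],
              "Names": ["Neo", "Morpheus", "Trinity", "Smith", "Adam", "Joe"],
              "Money": [250, 120, 200, 350, 300, 100]}
sales_data_frame = pd.DataFrame(sales_data)

# count, mean and max of Money for each Code group, in one table
stats = sales_data_frame.groupby("Code")["Money"].agg(["count", "mean", "max"])
print(stats)
```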
Data Frame Concatenation – concat()
The concat() method combines multiple Pandas DataFrame objects into a single DataFrame.
- pd.concat([new_data_frame1, new_data_frame2, new_data_frame3], ignore_index=True) concatenates the 3 data frames, and the ignore_index=True parameter renumbers the indexes from scratch.
import pandas as pd
import numpy as np
new_dict1 = {"Name":["John","Alice","Bob","Doe","Smith"],
"Sports":["Football","Basketball","Cricket","Tennis","Baseball"],
"Calori":[230,240,220,250,260]
}
new_dict2 = {"Name":["Jack","Adam","Brain","Doh","Sarah"],
"Sports":["Football","Basketball","Cricket","Tennis","Baseball"],
"Calori":[130,140,120,150,160]
}
new_dict3 = {"Name":["Homer","Jason","Track","Dora","Fiona"],
"Sports":["Football","Basketball","Cricket","Tennis","Baseball"],
"Calori":[330,340,320,350,360]
}
new_data_frame1 = pd.DataFrame(new_dict1)
new_data_frame2 = pd.DataFrame(new_dict2)
new_data_frame3 = pd.DataFrame(new_dict3)
#Concatenation of DataFrames
new_data_frame = pd.concat([new_data_frame1,new_data_frame2,new_data_frame3],ignore_index=True)
"""##### Name Sports Calori
0 John Football 230
1 Alice Basketball 240
2 Bob Cricket 220
3 Doe Tennis 250
4 Smith Baseball 260
5 Jack Football 130
6 Adam Basketball 140
7 Brain Cricket 120
8 Doh Tennis 150
9 Sarah Baseball 160
10 Homer Football 330
11 Jason Basketball 340
12 Track Cricket 320
13 Dora Tennis 350
14 Fiona Baseball 360
#####"""
print(new_data_frame)
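concat() can also glue frames side by side with axis=1; rows are matched by index (a small sketch with made-up frames):

```python
import pandas as pd

left = pd.DataFrame({"Name": ["John", "Alice"]})
right = pd.DataFrame({"Calori": [230, 240]})

# axis=1 places the frames next to each other instead of stacking them
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side)
#     Name  Calori
# 0   John     230
# 1  Alice     240
```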
Merging with the merge() Method
merge() combines frames on their common data, which gives a more useful view. It will be clearer with an example:
The Sports column is our intersection set, meaning more than one person plays the same sport. To merge on it, we set the parameter on="Sports".
import pandas as pd
import numpy as np
new_dict1 = {"Name":["John","Alice","Bob","Doe","Smith"],
"Sports":["Football","Basketball","Cricket","Tennis","Baseball"],
"Calori":[230,240,220,250,260]
}
new_dict2 = {"Name":["Jack","Adam","Brain","Doh","Sarah"],
"Sports":["Football","Basketball","Cricket","Tennis","Baseball"],
"Calori":[130,140,120,150,160]
}
new_data_frame1 = pd.DataFrame(new_dict1)
new_data_frame2 = pd.DataFrame(new_dict2)
#Merging of DataFrames
merge_data_frame = pd.merge(new_data_frame1,new_data_frame2,on="Sports")
print(merge_data_frame)
"""#### Name_x Sports Calori_x Name_y Calori_y
0 John Football 230 Jack 130
1 Alice Basketball 240 Adam 140
2 Bob Cricket 220 Brain 120
3 Doe Tennis 250 Doh 150
4 Smith Baseball 260 Sarah 160
#####"""
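merge() performs an inner join by default; the how= parameter switches to left, right, or outer joins, and suffixes= replaces the default _x/_y endings (a small sketch with made-up frames):

```python
import pandas as pd

df1 = pd.DataFrame({"Sports": ["Football", "Tennis"], "Calori": [230, 250]})
df2 = pd.DataFrame({"Sports": ["Football", "Golf"], "Calori": [130, 170]})

# outer join keeps Tennis and Golf too, filling the gaps with NaN
merged = pd.merge(df1, df2, on="Sports", how="outer", suffixes=("_a", "_b"))
print(merged)
```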
Pulling Unique Data with the unique() and nunique() Methods
For example, consider a university's student list whose columns contain first name, last name, and department. Here, the Pandas unique() method pulls the list of all departments out of tens of thousands of rows; nunique() gives the number of distinct values.
import pandas as pd
import numpy as np
#unique values in the DataFrame
new_sales_dict = {"Name":["John","Alice","Bob","Doe","Smith"],
"Sales":[230,240,220,250,260],
"Profit":[30,40,20,50,60]}
new_sales_frame = pd.DataFrame(new_sales_dict)
print(new_sales_frame)
"""#### Name Sales Profit
0 John 230 30
1 Alice 240 40
2 Bob 220 20
3 Doe 250 50
4 Smith 260 60
#####"""
print(new_sales_frame["Name"].unique())
# ['John' 'Alice' 'Bob' 'Doe' 'Smith']
print(new_sales_frame["Name"].nunique())
# 5
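Closely related is value_counts(), which reports how many times each distinct value occurs (a small sketch):

```python
import pandas as pd

sports = pd.Series(["Football", "Tennis", "Football", "Golf", "Football"])
print(sports.value_counts())  # occurrences per distinct value, most frequent first
print(sports.nunique())       # number of distinct values
```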
Applying a Custom Function to DataFrame Data
We will use the apply() method to apply a built-in Pandas function or a function you define to DataFrame data. Let's define a VAT calculation function and apply it to the DataFrame elements.
import pandas as pd
import numpy as np
# apply() a custom function to DataFrame data
new_sales_dict = {"Name":["John","Alice","Bob","Doe","Smith"],
"Sales":[230,240,220,250,260],
"Profit":[30,40,20,50,60]}
new_sales_frame = pd.DataFrame(new_sales_dict)
#VAT Calculation in DataFrame
def vat_calc(sales):
    return sales * 0.18
print(new_sales_frame["Sales"].apply(vat_calc))
"""#### 0 41.4
1 43.2
2 39.6
3 45.0
4 46.8
Name: Sales, dtype: float64
#####"""
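For one-liners like this, the same calculation can be written with a lambda instead of a named function (a sketch):

```python
import pandas as pd

sales = pd.Series([230, 240, 220, 250, 260])
vat = sales.apply(lambda s: s * 0.18)  # 18% VAT per element
print(vat)
```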
Excel File Operations with Pandas
With the Python Pandas library, we can read and write Excel tables and process their data. The code structure is extremely simple.
- data_frame = pd.read_excel("File.xlsx")
All other operations proceed exactly as with any DataFrame object. Don't forget to install the openpyxl library so that the Excel table can be read into the DataFrame correctly.
import pandas as pd
import numpy as np
import openpyxl
data_frame = pd.read_excel("sales.xlsx")
print(data_frame)
# 1) How many rows of data are there in total?
count_rows = data_frame["Employee Name"].count()
print(f"Total Rows Count: {count_rows}")
# Total Rows Count: 100
# 2) What is the average salary this company pays?
average_salary = data_frame["Salary"].mean()
print(f"Average Salary: {average_salary:.2f}")
# Average Salary: 725.84
# 3) How does the average salary compare across departments at this company?
average_salary_of_department = data_frame.groupby("Department")["Salary"].mean()
print(f"Average Salary of Department: \n{average_salary_of_department.round(2)}")
""" Average Salary of Department:
Department
Finance 805.41
HR 640.94
Marketing 707.00
Sales 656.67
Software Development 826.41
Name: Salary, dtype: float64
"""
# 4) How does the average salary compare by title (senior - junior) at this company?
average_salary_of_title = data_frame.groupby("Title")["Salary"].mean()
print(f"Average Salary of Title: \n{average_salary_of_title.round(2)}")
""" Average Salary of Title:
Title
C-level 1058.33
Junior 440.62
Mid 641.45
Mid-Senior 725.50
Senior 867.78
Name: Salary, dtype: float64
"""
#5) On average, how many percent more salary does a senior person earn than a junior person?
match_avg = data_frame.groupby("Title")["Salary"].mean()
senior_avg = match_avg["Senior"].round(2)
junior_avg = match_avg["Junior"].round(2)
percent_avg = (((senior_avg - junior_avg)/senior_avg)*100).round(2)
print(f"mean difference = ({senior_avg} - {junior_avg}) / {senior_avg} = {percent_avg}%")
""" mean difference = (867.78 - 440.62) / 867.78 = 49.22%
"""
# 6) How much more, on average, is the salary of a senior person
# in the software development department compared to a junior person?
salaries_developers = data_frame.groupby("Department")["Salary"].mean()
salaries_juniors = data_frame.groupby("Title")["Salary"].mean()
salary_minus = (salaries_developers["Software Development"] - salaries_juniors["Junior"]).round(2)
print(f"Developers - Juniors= {salary_minus}")
""" Developers - Juniors= 385.78
"""
#7) How much more salary does a c-level person in the finance department earn on average
# compared to a mid-senior person?
salary_finance = data_frame.loc[data_frame["Department"] == "Finance"].groupby("Title")["Salary"].mean()
print(f"Result(7) = {salary_finance['C-level'] - salary_finance['Mid-Senior']}")
""" Result(7) = 240.0
"""
#8) How many times more c-level employees are there in the
# software development department than in the marketing department?
developer_count = (data_frame.loc[data_frame["Department"] == "Software Development"]).groupby("Title")["Salary"].count()
marketing_count = (data_frame.loc[data_frame["Department"] == "Marketing"]).groupby("Department")["Salary"].count()
print(f"Result(8) = {developer_count['C-level'] + marketing_count['Marketing']}")
""" Result(8) = 24
"""
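Writing results back is just as simple with to_excel(); the file name below is only an example, and openpyxl must be installed for .xlsx output:

```python
import pandas as pd

# a small summary frame with made-up values, written out to a new Excel file
summary = pd.DataFrame({"Department": ["HR", "Finance"], "Salary": [640.94, 805.41]})
summary.to_excel("salary_summary.xlsx", index=False)  # index=False drops the 0-1-2 column
```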