Data Structures in Pandas

hemanshi
28 min read · Jul 3, 2021


Pandas is an open-source library used for working with relational or labeled data easily and intuitively. It provides various data structures and operations for manipulating numerical data and time series, and it offers tools for cleaning and processing your data. It is the most popular Python library used for data analysis. In this article, we are going to learn about the Pandas data structures.

It supports two data structures:

  • Series
  • Dataframe

Series

A Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.).

Syntax: pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)

Parameters:

  • data: array-like, the data stored in the Series.
  • index: array-like or Index (1d)
  • dtype: str, numpy.dtype, or ExtensionDtype, optional
  • name: str, optional
  • copy: bool, default False
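
As a quick illustrative sketch of several of these parameters together (the values here are made up):

import pandas as pd

# labelled Series with an explicit dtype and a name
marks = pd.Series([10, 20, 30], index=['a', 'b', 'c'],
                  dtype='float64', name='marks')
print(marks)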

Example 1: Series holding the char data type.

import pandas as pd

# a simple list of characters
char_list = ['g', 'e', 'e', 'k', 's']

# create a Series from the char list
res = pd.Series(char_list)

print(res)

Output:

0    g
1    e
2    e
3    k
4    s
dtype: object

Using the Pandas read_csv() and .to_csv() Functions

A comma-separated values (CSV) file is a plaintext file with a .csv extension that holds tabular data. This is one of the most popular file formats for storing large amounts of data. Each row of the CSV file represents a single table row. The values in the same row are by default separated with commas, but you could change the separator to a semicolon, tab, space, or some other character.

Write a CSV File

You can save your Pandas DataFrame as a CSV file with .to_csv():


>>> df.to_csv('data.csv')

That’s it! You’ve created the file data.csv in your current working directory.

This text file contains the data separated with commas. The first column contains the row labels. In some cases, you’ll find them irrelevant. If you don’t want to keep them, then you can pass the argument index=False to .to_csv().
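
As a small, hedged sketch (the DataFrame and the file name scores.csv are made up for illustration):

>>> import pandas as pd
>>> scores = pd.DataFrame({'name': ['Alice', 'Bob'], 'score': [90, 85]})
>>> scores.to_csv('scores.csv')               # keeps the row labels 0 and 1 as the first column
>>> scores.to_csv('scores.csv', index=False)  # writes the file without row labels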

Read a CSV File

Once your data is saved in a CSV file, you’ll likely want to load and use it from time to time. You can do that with the Pandas read_csv() function:


>>> df = pd.read_csv('data.csv', index_col=0)
>>> df
COUNTRY POP AREA GDP CONT IND_DAY
CHN China 1398.72 9596.96 12234.78 Asia NaN
IND India 1351.16 3287.26 2575.67 Asia 1947-08-15
USA US 329.74 9833.52 19485.39 N.America 1776-07-04
IDN Indonesia 268.07 1910.93 1015.54 Asia 1945-08-17
BRA Brazil 210.32 8515.77 2055.51 S.America 1822-09-07
PAK Pakistan 205.71 881.91 302.14 Asia 1947-08-14
NGA Nigeria 200.96 923.77 375.77 Africa 1960-10-01
BGD Bangladesh 167.09 147.57 245.63 Asia 1971-03-26
RUS Russia 146.79 17098.25 1530.75 NaN 1992-06-12
MEX Mexico 126.58 1964.38 1158.23 N.America 1810-09-16
JPN Japan 126.22 377.97 4872.42 Asia NaN
DEU Germany 83.02 357.11 3693.20 Europe NaN
FRA France 67.02 640.68 2582.49 Europe 1789-07-14
GBR UK 66.44 242.50 2631.23 Europe NaN
ITA Italy 60.36 301.34 1943.84 Europe NaN
ARG Argentina 44.94 2780.40 637.49 S.America 1816-07-09
DZA Algeria 43.38 2381.74 167.56 Africa 1962-07-05
CAN Canada 37.59 9984.67 1647.12 N.America 1867-07-01
AUS Australia 25.47 7692.02 1408.68 Oceania NaN
KAZ Kazakhstan 18.53 2724.90 159.41 Asia 1991-12-16

In this case, the Pandas read_csv() function returns a new DataFrame with the data and labels from the file data.csv, which you specified with the first argument. This string can be any valid path, including URLs.

The parameter index_col specifies the column of the CSV file that contains the row labels. You assign a zero-based column index to this parameter. You should set index_col whenever the CSV file contains row labels, so that they aren’t loaded as data.
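
For comparison, a rough sketch of reading the same file with and without index_col:

>>> df = pd.read_csv('data.csv')               # row labels come back as a column named 'Unnamed: 0'
>>> df = pd.read_csv('data.csv', index_col=0)  # row labels become the DataFrame index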

You’ll learn more about using Pandas with CSV files later on in this tutorial. You can also check out Reading and Writing CSV Files in Python to see how to handle CSV files with the built-in Python library csv as well.

Slicing, Indexing, Manipulating and Cleaning Pandas Dataframe


With the help of Pandas, we can perform many operations on a data set, such as slicing, indexing, manipulating, and cleaning a data frame.

Case 1: Slicing Pandas Data frame using DataFrame.iloc[]

Example 1: Slicing Rows

# importing pandas library
import pandas as pd

# Initializing the nested list with Data set
player_list = [['M.S.Dhoni', 36, 75, 5428000],
               ['A.B.D Villers', 38, 74, 3428000],
               ['V.Kholi', 31, 70, 8428000],
               ['S.Smith', 34, 80, 4428000],
               ['C.Gayle', 40, 100, 4528000],
               ['J.Root', 33, 72, 7028000],
               ['K.Peterson', 42, 85, 2528000]]

# creating a pandas dataframe
df = pd.DataFrame(player_list, columns=['Name', 'Age', 'Weight', 'Salary'])

# data frame before slicing
df

Output:

# Slicing rows in data frame
df1 = df.iloc[0:4]

# data frame after slicing
df1

Output:

In the above example, we sliced the rows from the data frame.

Example 2: Slicing Columns

# importing pandas library
import pandas as pd

# Initializing the nested list with Data set
player_list = [['M.S.Dhoni', 36, 75, 5428000],
               ['A.B.D Villers', 38, 74, 3428000],
               ['V.Kholi', 31, 70, 8428000],
               ['S.Smith', 34, 80, 4428000],
               ['C.Gayle', 40, 100, 4528000],
               ['J.Root', 33, 72, 7028000],
               ['K.Peterson', 42, 85, 2528000]]

# creating a pandas dataframe
df = pd.DataFrame(player_list, columns=['Name', 'Age', 'Weight', 'Salary'])

# data frame before slicing
df

Output:

# Slicing columns in data frame
df1 = df.iloc[:, 0:2]

# data frame after slicing
df1

Output:

In the above example, we sliced the columns from the data frame.

Case 2: Indexing Pandas Data frame

# importing pandas library
import pandas as pd

# Initializing the nested list with Data set
player_list = [['M.S.Dhoni', 36, 75, 5428000],
               ['A.B.D Villers', 38, 74, 3428000],
               ['V.Kholi', 31, 70, 8428000],
               ['S.Smith', 34, 80, 4428000],
               ['C.Gayle', 40, 100, 4528000],
               ['J.Root', 33, 72, 7028000],
               ['K.Peterson', 42, 85, 2528000]]

# creating a pandas dataframe and indexing it using alphabets
df = pd.DataFrame(player_list, columns=['Name', 'Age', 'Weight', 'Salary'],
                  index=['A', 'B', 'C', 'D', 'E', 'F', 'G'])

# Displaying data frame
df

Output:

In the above example, we index the data frame using alphabetic labels.

Case 3: Manipulating Pandas Data frame

Manipulation of the data frame can be done in multiple ways like applying functions, changing a data type of columns, splitting, adding rows and columns to a data frame, etc.
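
For example, a small sketch of changing a column’s data type and adding a new column (the data below is made up for illustration):

import pandas as pd

df = pd.DataFrame({'Name': ['Rohan', 'Elvish'], 'Marks': [455, 250]})

# changing the data type of a column
df['Marks'] = df['Marks'].astype('float64')

# adding a new column derived from an existing one
df['Passed'] = df['Marks'] >= 250

print(df.dtypes)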

Example 1: Applying lambda function to a column using Dataframe.assign()

# importing pandas library
import pandas as pd

# creating and initializing a list
values = [['Rohan', 455], ['Elvish', 250], ['Deepak', 495],
          ['Sai', 400], ['Radha', 350], ['Vansh', 450]]

# creating a pandas dataframe
df = pd.DataFrame(values, columns=['Name', 'Univ_Marks'])

# Applying lambda function to find percentage of
# 'Univ_Marks' column using df.assign()
df = df.assign(Percentage=lambda x: (x['Univ_Marks'] / 500 * 100))

# displaying the data frame
df

Output:

In the above example, the lambda function is applied to the ‘Univ_Marks’ column and a new column ‘Percentage’ is formed with the help of it.

Example 2: Sorting the Data frame in Ascending order

# importing pandas library
import pandas as pd

# Initializing the nested list with Data set
player_list = [['M.S.Dhoni', 36, 75, 5428000],
               ['A.B.D Villers', 38, 74, 3428000],
               ['V.Kholi', 31, 70, 8428000],
               ['S.Smith', 34, 80, 4428000],
               ['C.Gayle', 40, 100, 4528000],
               ['J.Root', 33, 72, 7028000],
               ['K.Peterson', 42, 85, 2528000]]

# creating a pandas dataframe
df = pd.DataFrame(player_list, columns=['Name', 'Age', 'Weight', 'Salary'])

# Sorting by column 'Weight'
df.sort_values(by=['Weight'])

Output:

In the above example, we sort the data frame by the ‘Weight’ column. Note that sort_values() returns a new sorted DataFrame; df itself is left unchanged unless you pass inplace=True.
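
To sort in descending order instead, pass ascending=False; a quick sketch:

# Sorting by column 'Weight' in descending order
df.sort_values(by=['Weight'], ascending=False)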

Case 4: Cleaning Pandas Data frame

# importing pandas and NumPy libraries
import pandas as pd
import numpy as np

# Initializing the nested list with Data set
player_list = [['M.S.Dhoni', 36, 75, 5428000],
               ['A.B.D Villers', np.nan, 74, np.nan],
               ['V.Kholi', 31, 70, 8428000],
               ['S.Smith', 34, 80, 4428000],
               ['C.Gayle', np.nan, 100, np.nan],
               [np.nan, 33, np.nan, 7028000],
               ['K.Peterson', 42, 85, 2528000]]

# creating a pandas dataframe
df = pd.DataFrame(player_list, columns=['Name', 'Age', 'Weight', 'Salary'])

df

Output:

# Checking for missing values
df.isnull().sum()

Output:

Name      1
Age       2
Weight    1
Salary    2
dtype: int64

# dropping or cleaning the missing data
df = df.dropna()

df

Output:

         Name   Age  Weight     Salary
0   M.S.Dhoni  36.0    75.0  5428000.0
2     V.Kholi  31.0    70.0  8428000.0
3     S.Smith  34.0    80.0  4428000.0
6  K.Peterson  42.0    85.0  2528000.0
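
Instead of dropping incomplete rows, you could also fill the missing values with fillna(). A hedged sketch, assuming df still holds the original data with NaNs (the fill values chosen here are arbitrary):

# filling missing values instead of dropping rows
df_filled = df.fillna({'Name': 'Unknown',
                       'Age': df['Age'].mean(),
                       'Weight': df['Weight'].mean(),
                       'Salary': 0})

df_filled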

Pythonic Data Cleaning With Pandas and NumPy

Dropping Columns in a DataFrame

Often, you’ll find that not all the categories of data in a dataset are useful to you. For example, you might have a dataset containing student information (name, grade, standard, parents’ names, and address) but want to focus on analyzing student grades.

In this case, the address or parents’ names categories are not important to you. Retaining these unneeded categories will take up unnecessary space and potentially also bog down runtime.

Pandas provides a handy way of removing unwanted columns or rows from a DataFrame with the drop() function. Let’s look at a simple example where we drop a number of columns from a DataFrame.

First, let’s create a DataFrame out of the CSV file ‘BL-Flickr-Images-Book.csv’. In the examples below, we pass a relative path to pd.read_csv, meaning that all of the datasets are in a folder named Datasets in our current working directory:


>>> df = pd.read_csv('Datasets/BL-Flickr-Images-Book.csv')
>>> df.head()
Identifier Edition Statement Place of Publication \
0 206 NaN London
1 216 NaN London; Virtue & Yorston
2 218 NaN London
3 472 NaN London
4 480 A new edition, revised, etc. London
Date of Publication Publisher \
0 1879 [1878] S. Tinsley & Co.
1 1868 Virtue & Co.
2 1869 Bradbury, Evans & Co.
3 1851 James Darling
4 1857 Wertheim & Macintosh
Title Author \
0 Walter Forbes. [A novel.] By A. A A. A.
1 All for Greed. [A novel. The dedication signed... A., A. A.
2 Love the Avenger. By the author of “All for Gr... A., A. A.
3 Welsh Sketches, chiefly ecclesiastical, to the... A., E. S.
4 [The World in which I live, and my place in it... A., E. S.
Contributors Corporate Author \
0 FORBES, Walter. NaN
1 BLAZE DE BURY, Marie Pauline Rose - Baroness NaN
2 BLAZE DE BURY, Marie Pauline Rose - Baroness NaN
3 Appleyard, Ernest Silvanus. NaN
4 BROOME, John Henry. NaN
Corporate Contributors Former owner Engraver Issuance type \
0 NaN NaN NaN monographic
1 NaN NaN NaN monographic
2 NaN NaN NaN monographic
3 NaN NaN NaN monographic
4 NaN NaN NaN monographic
Flickr URL \
0 http://www.flickr.com/photos/britishlibrary/ta...
1 http://www.flickr.com/photos/britishlibrary/ta...
2 http://www.flickr.com/photos/britishlibrary/ta...
3 http://www.flickr.com/photos/britishlibrary/ta...
4 http://www.flickr.com/photos/britishlibrary/ta...
Shelfmarks
0 British Library HMNTS 12641.b.30.
1 British Library HMNTS 12626.cc.2.
2 British Library HMNTS 12625.dd.1.
3 British Library HMNTS 10369.bbb.15.
4 British Library HMNTS 9007.d.28.

When we look at the first five entries using the head() method, we can see that a handful of columns provide ancillary information that would be helpful to the library but isn’t very descriptive of the books themselves: Edition Statement, Corporate Author, Corporate Contributors, Former owner, Engraver, Issuance type and Shelfmarks.

We can drop these columns in the following way:


>>> to_drop = ['Edition Statement',
... 'Corporate Author',
... 'Corporate Contributors',
... 'Former owner',
... 'Engraver',
... 'Contributors',
... 'Issuance type',
... 'Shelfmarks']
>>> df.drop(to_drop, inplace=True, axis=1)

Above, we defined a list that contains the names of all the columns we want to drop. Next, we call the drop() function on our object, passing in the inplace parameter as True and the axis parameter as 1. This tells Pandas that we want the changes to be made directly in our object and that it should look for the values to be dropped in the columns of the object.

When we inspect the DataFrame again, we’ll see that the unwanted columns have been removed:


>>> df.head()
Identifier Place of Publication Date of Publication \
0 206 London 1879 [1878]
1 216 London; Virtue & Yorston 1868
2 218 London 1869
3 472 London 1851
4 480 London 1857
Publisher Title \
0 S. Tinsley & Co. Walter Forbes. [A novel.] By A. A
1 Virtue & Co. All for Greed. [A novel. The dedication signed...
2 Bradbury, Evans & Co. Love the Avenger. By the author of “All for Gr...
3 James Darling Welsh Sketches, chiefly ecclesiastical, to the...
4 Wertheim & Macintosh [The World in which I live, and my place in it...
Author Flickr URL
0 A. A. http://www.flickr.com/photos/britishlibrary/ta...
1 A., A. A. http://www.flickr.com/photos/britishlibrary/ta...
2 A., A. A. http://www.flickr.com/photos/britishlibrary/ta...
3 A., E. S. http://www.flickr.com/photos/britishlibrary/ta...
4 A., E. S. http://www.flickr.com/photos/britishlibrary/ta...

Alternatively, we could also remove the columns by passing them to the columns parameter directly instead of separately specifying the labels to be removed and the axis where Pandas should look for the labels:


>>> df.drop(columns=to_drop, inplace=True)

This syntax is more intuitive and readable. What we’re trying to do here is directly apparent.

If you know in advance which columns you’d like to retain, another option is to pass them to the usecols argument of pd.read_csv.
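
For instance, a hedged sketch using the column names from this dataset:

>>> cols_to_keep = ['Identifier', 'Place of Publication', 'Date of Publication',
...                 'Publisher', 'Title', 'Author', 'Flickr URL']
>>> df = pd.read_csv('Datasets/BL-Flickr-Images-Book.csv', usecols=cols_to_keep)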

Changing the Index of a DataFrame

A Pandas Index extends the functionality of NumPy arrays to allow for more versatile slicing and labeling. In many cases, it is helpful to use a uniquely valued identifying field of the data as its index.

For example, in the dataset used in the previous section, it can be expected that when a librarian searches for a record, they may input the unique identifier (values in the Identifier column) for a book:


>>> df['Identifier'].is_unique
True

Let’s replace the existing index with this column using set_index:


>>> df = df.set_index('Identifier')
>>> df.head()
Place of Publication Date of Publication \
206 London 1879 [1878]
216 London; Virtue & Yorston 1868
218 London 1869
472 London 1851
480 London 1857
Publisher \
206 S. Tinsley & Co.
216 Virtue & Co.
218 Bradbury, Evans & Co.
472 James Darling
480 Wertheim & Macintosh
Title Author \
206 Walter Forbes. [A novel.] By A. A A. A.
216 All for Greed. [A novel. The dedication signed... A., A. A.
218 Love the Avenger. By the author of “All for Gr... A., A. A.
472 Welsh Sketches, chiefly ecclesiastical, to the... A., E. S.
480 [The World in which I live, and my place in it... A., E. S.
Flickr URL
206 http://www.flickr.com/photos/britishlibrary/ta...
216 http://www.flickr.com/photos/britishlibrary/ta...
218 http://www.flickr.com/photos/britishlibrary/ta...
472 http://www.flickr.com/photos/britishlibrary/ta...
480 http://www.flickr.com/photos/britishlibrary/ta...

Technical Detail: Unlike primary keys in SQL, a Pandas Index doesn’t make any guarantee of being unique, although many indexing and merging operations will notice a speedup in runtime if it is.

We can access each record in a straightforward way with loc[]. Although loc[] may not have all that intuitive of a name, it allows us to do label-based indexing, which is the labeling of a row or record without regard to its position:


>>> df.loc[206]
Place of Publication London
Date of Publication 1879 [1878]
Publisher S. Tinsley & Co.
Title Walter Forbes. [A novel.] By A. A
Author A. A.
Flickr URL http://www.flickr.com/photos/britishlibrary/ta...
Name: 206, dtype: object

In other words, 206 is the first label of the index. To access it by position, we could use df.iloc[0], which does position-based indexing.

Technical Detail: .loc[] is technically a class instance and has some special syntax that doesn’t conform exactly to most plain-vanilla Python instance methods.

Previously, our index was a RangeIndex: integers starting from 0, analogous to Python’s built-in range. By passing a column name to set_index, we have changed the index to the values in Identifier.

You may have noticed that we reassigned the variable to the object returned by the method with df = df.set_index(...). This is because, by default, the method returns a modified copy of our object and does not make the changes directly to the object. We can avoid this by setting the inplace parameter:

df.set_index('Identifier', inplace=True)

Tidying up Fields in the Data

So far, we have removed unnecessary columns and changed the index of our DataFrame to something more sensible. In this section, we will clean specific columns and get them to a uniform format to get a better understanding of the dataset and enforce consistency. In particular, we will be cleaning Date of Publication and Place of Publication.

Upon inspection, all of the data types are currently the object dtype, which is roughly analogous to str in native Python.

It encapsulates any field that can’t be neatly fit as numerical or categorical data. This makes sense since we’re working with data that is initially a bunch of messy strings:


>>> df.get_dtype_counts()
object 6
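
Note: get_dtype_counts() was deprecated and later removed from pandas, so the call above may fail on recent versions; df.dtypes.value_counts() gives the same summary:

>>> df.dtypes.value_counts()
object    6
dtype: int64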

One field where it makes sense to enforce a numeric value is the date of publication so that we can do calculations down the road:


>>> df.loc[1905:, 'Date of Publication'].head(10)
Identifier
1905 1888
1929 1839, 38-54
2836 [1897?]
2854 1865
2956 1860-63
2957 1873
3017 1866
3131 1899
4598 1814
4884 1820
Name: Date of Publication, dtype: object

A particular book can have only one date of publication. Therefore, we need to do the following:

  • Remove the extra dates in square brackets, wherever present: 1879 [1878]
  • Convert date ranges to their “start date”, wherever present: 1860–63; 1839, 38–54
  • Completely remove the dates we are not certain about and replace them with NumPy’s NaN: [1897?]
  • Convert the string nan to NumPy’s NaN value

Synthesizing these patterns, we can actually take advantage of a single regular expression to extract the publication year:

>>> regex = r'^(\d{4})'

The regular expression above is meant to find any four digits at the beginning of a string, which suffices for our case. The above is a raw string (meaning that a backslash is no longer an escape character), which is standard practice with regular expressions.

The \d represents any digit, and {4} repeats this rule four times. The ^ character matches the start of a string, and the parentheses denote a capturing group, which signals to Pandas that we want to extract that part of the regex. (We want ^ to avoid cases where [ starts off the string.)

Let’s see what happens when we run this regex across our dataset:


>>> extr = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)
>>> extr.head()
Identifier
206 1879
216 1868
218 1869
472 1851
480 1857
Name: Date of Publication, dtype: object

Further Reading: Not familiar with regex? You can inspect the expression above at regex101.com and learn all about regular expressions with Regular Expressions: Regexes in Python.

Technically, this column still has object dtype, but we can easily get its numerical version with pd.to_numeric:


>>> df['Date of Publication'] = pd.to_numeric(extr)
>>> df['Date of Publication'].dtype
dtype('float64')

This results in about one in every ten values being missing, which is a small price to pay for now being able to do computations on the remaining valid values:


>>> df['Date of Publication'].isnull().sum() / len(df)
0.11717147339205986

Combining str Methods with NumPy to Clean Columns

Above, you may have noticed the use of df['Date of Publication'].str. This attribute is a way to access speedy string operations in Pandas that largely mimic operations on native Python strings or compiled regular expressions, such as .split(), .replace(), and .capitalize().

To clean the Place of Publication field, we can combine Pandas str methods with NumPy’s np.where function, which is basically a vectorized form of Excel’s IF() macro. It has the following syntax:


>>> np.where(condition, then, else)

Here, condition is either an array-like object or a Boolean mask. then is the value to be used if condition evaluates to True, and else is the value to be used otherwise.

Essentially, .where() takes each element in the object used for condition, checks whether that particular element evaluates to True in the context of the condition, and returns an ndarray containing then or else, depending on which applies.
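
As a minimal, standalone sketch with made-up values:

>>> import numpy as np
>>> np.where(np.array([10, 30, 50]) > 25, 'big', 'small')
array(['small', 'big', 'big'], dtype='<U5')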

It can be nested into a compound if-then statement, allowing us to compute values based on multiple conditions:

>>> np.where(condition1, x1,
...     np.where(condition2, x2,
...         np.where(condition3, x3, ...)))

We’ll be making use of these two functions to clean Place of Publication since this column has string objects. Here are the contents of the column:


>>> df['Place of Publication'].head(10)
Identifier
206 London
216 London; Virtue & Yorston
218 London
472 London
480 London
481 London
519 London
667 pp. 40. G. Bryan & Co: Oxford, 1898
874 London]
1143 London
Name: Place of Publication, dtype: object

We see that for some rows, the place of publication is surrounded by other unnecessary information. If we were to look at more values, we would see that this is the case for only some rows that have their place of publication as ‘London’ or ‘Oxford’.

Let’s take a look at two specific entries:


>>> df.loc[4157862]
Place of Publication Newcastle-upon-Tyne
Date of Publication 1867
Publisher T. Fordyce
Title Local Records; or, Historical Register of rema...
Author T. Fordyce
Flickr URL http://www.flickr.com/photos/britishlibrary/ta...
Name: 4157862, dtype: object
>>> df.loc[4159587]
Place of Publication Newcastle upon Tyne
Date of Publication 1834
Publisher Mackenzie & Dent
Title An historical, topographical and descriptive v...
Author E. (Eneas) Mackenzie
Flickr URL http://www.flickr.com/photos/britishlibrary/ta...
Name: 4159587, dtype: object

These two books were published in the same place, but one has hyphens in the name of the place while the other does not.

To clean this column in one sweep, we can use str.contains() to get a Boolean mask.

We clean the column as follows:


>>> pub = df['Place of Publication']
>>> london = pub.str.contains('London')
>>> london[:5]
Identifier
206 True
216 True
218 True
472 True
480 True
Name: Place of Publication, dtype: bool
>>> oxford = pub.str.contains('Oxford')

We combine them with np.where:

>>> df['Place of Publication'] = np.where(london, 'London',
...                                       np.where(oxford, 'Oxford',
...                                                pub.str.replace('-', ' ')))
>>> df['Place of Publication'].head()
Identifier
206 London
216 London
218 London
472 London
480 London
Name: Place of Publication, dtype: object

Here, the np.where function is called in a nested structure, with condition being a Series of Booleans obtained with str.contains(). The contains() method works similarly to the built-in in keyword used to find the occurrence of an entity in an iterable (or substring in a string).

The replacement to be used is a string representing our desired place of publication. We also replace hyphens with a space with str.replace() and reassign to the column in our DataFrame.

Although there is more dirty data in this dataset, we will discuss only these two columns for now.

Let’s have a look at the first five entries, which look a lot crisper than when we started out:


>>> df.head()
Place of Publication Date of Publication Publisher \
206 London 1879 S. Tinsley & Co.
216 London 1868 Virtue & Co.
218 London 1869 Bradbury, Evans & Co.
472 London 1851 James Darling
480 London 1857 Wertheim & Macintosh
Title Author \
206 Walter Forbes. [A novel.] By A. A AA
216 All for Greed. [A novel. The dedication signed... A. A A.
218 Love the Avenger. By the author of “All for Gr... A. A A.
472 Welsh Sketches, chiefly ecclesiastical, to the... E. S A.
480 [The World in which I live, and my place in it... E. S A.
Flickr URL
206 http://www.flickr.com/photos/britishlibrary/ta...
216 http://www.flickr.com/photos/britishlibrary/ta...
218 http://www.flickr.com/photos/britishlibrary/ta...
472 http://www.flickr.com/photos/britishlibrary/ta...
480 http://www.flickr.com/photos/britishlibrary/ta...

Note: At this point, Place of Publication would be a good candidate for conversion to a Categorical dtype, because we can encode the fairly small unique set of cities with integers. (The memory usage of a Categorical is proportional to the number of categories plus the length of the data; an object dtype is a constant times the length of the data.)
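
A hedged sketch of that conversion:

>>> # store the repeated city names as integer codes plus a small table of categories
>>> df['Place of Publication'] = df['Place of Publication'].astype('category')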

Cleaning the Entire Dataset Using the applymap Function

In certain situations, you will see that the “dirt” is not localized to one column but is more spread out.

There are some instances where it would be helpful to apply a customized function to each cell or element of a DataFrame. Pandas .applymap() method is similar to the in-built map() function and simply applies a function to all the elements in a DataFrame.

Let’s look at an example. We will create a DataFrame out of the “university_towns.txt” file:

$ head Datasets/university_towns.txt
Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
Alaska[edit]

We see that we have periodic state names followed by the university towns in that state: StateA TownA1 TownA2 StateB TownB1 TownB2.... If we look at the way state names are written in the file, we’ll see that all of them have the “[edit]” substring in them.

We can take advantage of this pattern by creating a list of (state, city) tuples and wrapping that list in a DataFrame:

>>> university_towns = []
>>> with open('Datasets/university_towns.txt') as file:
...     for line in file:
...         if '[edit]' in line:
...             # Remember this `state` until the next is found
...             state = line
...         else:
...             # Otherwise, we have a city; keep `state` as last-seen
...             university_towns.append((state, line))
>>> university_towns[:5]
[('Alabama[edit]\n', 'Auburn (Auburn University)[1]\n'),
('Alabama[edit]\n', 'Florence (University of North Alabama)\n'),
('Alabama[edit]\n', 'Jacksonville (Jacksonville State University)[2]\n'),
('Alabama[edit]\n', 'Livingston (University of West Alabama)[2]\n'),
('Alabama[edit]\n', 'Montevallo (University of Montevallo)[2]\n')]

We can wrap this list in a DataFrame and set the columns as “State” and “RegionName”. Pandas will take each element in the list and set State to the left value and RegionName to the right value.

The resulting DataFrame looks like this:


>>> towns_df = pd.DataFrame(university_towns,
... columns=['State', 'RegionName'])
>>> towns_df.head()
State RegionName
0 Alabama[edit]\n Auburn (Auburn University)[1]\n
1 Alabama[edit]\n Florence (University of North Alabama)\n
2 Alabama[edit]\n Jacksonville (Jacksonville State University)[2]\n
3 Alabama[edit]\n Livingston (University of West Alabama)[2]\n
4 Alabama[edit]\n Montevallo (University of Montevallo)[2]\n

While we could have cleaned these strings in the for loop above, Pandas makes it easy. We only need the state name and the town name and can remove everything else. While we could use Pandas’ .str() methods again here, we could also use applymap() to map a Python callable to each element of the DataFrame.

We have been using the term element, but what exactly do we mean by it? Consider the following “toy” DataFrame:

        0        1
0    Mock  Dataset
1  Python   Pandas
2    Real   Python
3   NumPy    Clean

In this example, each cell (‘Mock’, ‘Dataset’, ‘Python’, ‘Pandas’, etc.) is an element. Therefore, applymap() will apply a function to each of these independently. Let’s define that function:

>>> def get_citystate(item):
...     if ' (' in item:
...         return item[:item.find(' (')]
...     elif '[' in item:
...         return item[:item.find('[')]
...     else:
...         return item

Pandas’ .applymap() only takes one parameter, which is the function (callable) that should be applied to each element:


>>> towns_df =  towns_df.applymap(get_citystate)

First, we define a Python function that takes an element from the DataFrame as its parameter. Inside the function, checks are performed to determine whether there’s a ( or [ in the element or not.

Depending on the check, values are returned accordingly by the function. Finally, the applymap() function is called on our object. Now the DataFrame is much neater:


>>> towns_df.head()
State RegionName
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
4 Alabama Montevallo

The applymap() method took each element from the DataFrame, passed it to the function, and the original value was replaced by the returned value. It’s that simple!

Technical Detail: While it is a convenient and versatile method, .applymap() can have significant runtime for larger datasets, because it maps a Python callable to each individual element. In some cases, it can be more efficient to do vectorized operations that utilize Cython or NumPy (which, in turn, makes calls in C) under the hood.
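
For comparison, here is a rough vectorized sketch of the same cleanup using the .str methods instead of applymap(); it is an alternative, not the article’s approach, and it assumes it is run on the raw towns_df built above (before the applymap() call):

>>> towns_df['State'] = towns_df['State'].str.replace(r'\[edit\]', '', regex=True).str.strip()
>>> towns_df['RegionName'] = towns_df['RegionName'].str.replace(r' \(.*|\[.*', '', regex=True).str.strip()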

Renaming Columns and Skipping Rows

Often, the datasets you’ll work with will have either column names that are not easy to understand, or unimportant information in the first few and/or last rows, such as definitions of the terms in the dataset, or footnotes.

In that case, we’d want to rename columns and skip certain rows so that we can drill down to necessary information with correct and sensible labels.

To demonstrate how we can go about doing this, let’s first take a glance at the initial five rows of the “olympics.csv” dataset:

$ head -n 5 Datasets/olympics.csv
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
,? Summer,01 !,02 !,03 !,Total,? Winter,01 !,02 !,03 !,Total,? Games,01 !,02 !,03 !,Combined total
Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2
Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15
Argentina (ARG),23,18,24,28,70,18,0,0,0,0,41,18,24,28,70

Now, we’ll read it into a Pandas DataFrame:


>>> olympics_df = pd.read_csv('Datasets/olympics.csv')
>>> olympics_df.head()
0 1 2 3 4 5 6 7 8 \
0 NaN ? Summer 01 ! 02 ! 03 ! Total ? Winter 01 ! 02 !
1 Afghanistan (AFG) 13 0 0 2 2 0 0 0
2 Algeria (ALG) 12 5 2 8 15 3 0 0
3 Argentina (ARG) 23 18 24 28 70 18 0 0
4 Armenia (ARM) 5 1 2 9 12 6 0 0
9 10 11 12 13 14 15
0 03 ! Total ? Games 01 ! 02 ! 03 ! Combined total
1 0 0 13 0 0 2 2
2 0 0 15 5 2 8 15
3 0 0 41 18 24 28 70
4 0 0 11 1 2 9 12

This is messy indeed! The columns are the string form of integers indexed at 0. The row which should have been our header (i.e. the one to be used to set the column names) is at olympics_df.iloc[0]. This happened because our CSV file starts with 0, 1, 2, …, 15.

Also, if we were to go to the source of this dataset, we’d see that NaN above should really be something like “Country”, ? Summer is supposed to represent “Summer Games”, 01 ! should be “Gold”, and so on.

Therefore, we need to do two things:

  • Skip one row and set the header as the first (0-indexed) row
  • Rename the columns

We can skip rows and set the header while reading the CSV file by passing some parameters to the read_csv() function.

This function takes a lot of optional parameters, but in this case we only need one (header) to remove the 0th row:


>>> olympics_df = pd.read_csv('Datasets/olympics.csv', header=1)
>>> olympics_df.head()
Unnamed: 0 ? Summer 01 ! 02 ! 03 ! Total ? Winter \
0 Afghanistan (AFG) 13 0 0 2 2 0
1 Algeria (ALG) 12 5 2 8 15 3
2 Argentina (ARG) 23 18 24 28 70 18
3 Armenia (ARM) 5 1 2 9 12 6
4 Australasia (ANZ) [ANZ] 2 3 4 5 12 0
01 !.1 02 !.1 03 !.1 Total.1 ? Games 01 !.2 02 !.2 03 !.2 \
0 0 0 0 0 13 0 0 2
1 0 0 0 0 15 5 2 8
2 0 0 0 0 41 18 24 28
3 0 0 0 0 11 1 2 9
4 0 0 0 0 2 3 4 5
Combined total
0 2
1 15
2 70
3 12
4 12

We now have the correct row set as the header and all unnecessary rows removed. Take note of how Pandas has changed the name of the column containing the name of the countries from NaN to Unnamed: 0.

To rename the columns, we will make use of a DataFrame’s rename() method, which allows you to relabel an axis based on a mapping (in this case, a dict).

Let’s start by defining a dictionary that maps current column names (as keys) to more usable ones (the dictionary’s values):


>>> new_names =  {'Unnamed: 0': 'Country',
... '? Summer': 'Summer Olympics',
... '01 !': 'Gold',
... '02 !': 'Silver',
... '03 !': 'Bronze',
... '? Winter': 'Winter Olympics',
... '01 !.1': 'Gold.1',
... '02 !.1': 'Silver.1',
... '03 !.1': 'Bronze.1',
... '? Games': '# Games',
... '01 !.2': 'Gold.2',
... '02 !.2': 'Silver.2',
... '03 !.2': 'Bronze.2'}

We call the rename() function on our object:


>>> olympics_df.rename(columns=new_names, inplace=True)

Setting inplace to True specifies that our changes be made directly to the object. Let’s see if this checks out:


>>> olympics_df.head()
Country Summer Olympics Gold Silver Bronze Total \
0 Afghanistan (AFG) 13 0 0 2 2
1 Algeria (ALG) 12 5 2 8 15
2 Argentina (ARG) 23 18 24 28 70
3 Armenia (ARM) 5 1 2 9 12
4 Australasia (ANZ) [ANZ] 2 3 4 5 12
Winter Olympics Gold.1 Silver.1 Bronze.1 Total.1 # Games Gold.2 \
0 0 0 0 0 0 13 0
1 3 0 0 0 0 15 5
2 18 0 0 0 0 41 18
3 6 0 0 0 0 11 1
4 0 0 0 0 0 2 3
Silver.2 Bronze.2 Combined total
0 0 2 2
1 2 8 15
2 24 28 70
3 2 9 12
4 4 5 12

Python Data Cleaning: Recap and Resources

In this tutorial, you learned how you can drop unnecessary information from a dataset using the drop() function, as well as how to set an index for your dataset so that items in it can be referenced easily.

Moreover, you learned how to clean object fields with the .str() accessor and how to clean the entire dataset using the applymap() method. Lastly, we explored how to skip rows in a CSV file and rename columns using the rename() method.

Knowing about data cleaning is very important, because it is a big part of data science. You now have a basic understanding of how Pandas and NumPy can be leveraged to clean datasets!

Check out the links below to find additional resources that will help you on your Python data science journey:

  • The Pandas documentation
  • The NumPy documentation
  • Python for Data Analysis by Wes McKinney, the creator of Pandas
  • Pandas Cookbook by Ted Petrou, a data science trainer and consultant

pandas.DataFrame.aggregate

DataFrame.aggregate(func=None, axis=0, *args, **kwargs)

Aggregate using one or more operations over the specified axis.

Parameters

func : function, str, list or dict

Function to use for aggregating the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply.

Accepted combinations are:

  • function
  • string function name
  • list of functions and/or function names, e.g. [np.sum, 'mean']
  • dict of axis labels -> functions, function names or list of such.

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

If 0 or ‘index’: apply function to each column. If 1 or ‘columns’: apply function to each row.

*args

Positional arguments to pass to func.

**kwargs

Keyword arguments to pass to func.

Returns

scalar, Series or DataFrame

The return can be:

  • scalar : when Series.agg is called with single function
  • Series : when DataFrame.agg is called with a single function
  • DataFrame : when DataFrame.agg is called with several functions

Return scalar, Series or DataFrame.

The aggregation operations are always performed over an axis, either the index (default) or the column axis. This behavior is different from numpy aggregation functions (mean, median, prod, sum, std, var), where the default is to compute the aggregation of the flattened array, e.g., numpy.mean(arr_2d) as opposed to numpy.mean(arr_2d, axis=0). agg is an alias for aggregate. Use the alias.

See also

DataFrame.apply

Perform any type of operations.

DataFrame.transform

Perform transformation type operations.

core.groupby.GroupBy

Perform operations over groups.

core.resample.Resampler

Perform operations over resampled bins.

core.window.Rolling

Perform operations over rolling window.

core.window.Expanding

Perform operations over expanding window.

core.window.ExponentialMovingWindow

Perform operation over exponential weighted window.

Notes

agg is an alias for aggregate. Use the alias.

Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See Mutating with User Defined Function (UDF) methods for more details.

A passed user-defined-function will be passed a Series for evaluation.

Examples

>>> df = pd.DataFrame([[1, 2, 3],
... [4, 5, 6],
... [7, 8, 9],
... [np.nan, np.nan, np.nan]],
... columns=['A', 'B', 'C'])

Aggregate these functions over the rows.

>>> df.agg(['sum', 'min'])
A B C
sum 12.0 15.0 18.0
min 1.0 2.0 3.0

Different aggregations per column.

>>> df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max']})
A B
sum 12.0 NaN
min 1.0 2.0
max NaN 8.0

Aggregate different functions over the columns and rename the index of the resulting DataFrame.

>>> df.agg(x=('A', max), y=('B', 'min'), z=('C', np.mean))
A B C
x 7.0 NaN NaN
y NaN 2.0 NaN
z NaN NaN 6.0

Aggregate over the columns.

>>> df.agg("mean", axis="columns")
0 2.0
1 5.0
2 8.0
3 NaN
dtype: float64

Merge, join, concatenate and compare

pandas provides various facilities for easily combining together Series or DataFrame with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.

In addition, pandas also provides utilities to compare two Series or DataFrame and summarize their differences.
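
Neither concatenation nor comparison is demonstrated in the merge reference below, so here is a brief, hedged sketch with made-up frames (pd.concat stacks objects along an axis; DataFrame.compare, available since pandas 1.1, summarizes element-wise differences):

>>> import pandas as pd
>>> a = pd.DataFrame({'x': [1, 2]})
>>> b = pd.DataFrame({'x': [3, 4]})
>>> pd.concat([a, b], ignore_index=True)   # stack the two frames vertically
   x
0  1
1  2
2  3
3  4
>>> diff = a.compare(b)   # rows and columns where the two aligned frames differ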

pandas.DataFrame.merge

DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

Merge DataFrame or named Series objects with a database-style join.

A named Series object is treated as a DataFrame with a single named column.

The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.

Parameters

right : DataFrame or named Series

Object to merge with.

how : {‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘inner’

Type of merge to be performed.

  • left: use only keys from left frame, similar to a SQL left outer join; preserve key order.
  • right: use only keys from right frame, similar to a SQL right outer join; preserve key order.
  • outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
  • inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.
  • cross: creates the cartesian product from both frames, preserves the order of the left keys. New in version 1.2.0.

on : label or list

Column or index level names to join on. These must be found in both DataFrames. If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.

left_on : label or list, or array-like

Column or index level names to join on in the left DataFrame. Can also be an array or list of arrays of the length of the left DataFrame. These arrays are treated as if they are columns.

right_on : label or list, or array-like

Column or index level names to join on in the right DataFrame. Can also be an array or list of arrays of the length of the right DataFrame. These arrays are treated as if they are columns.

left_index : bool, default False

Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels.

right_index : bool, default False

Use the index from the right DataFrame as the join key. Same caveats as left_index.

sort : bool, default False

Sort the join keys lexicographically in the result DataFrame. If False, the order of the join keys depends on the join type (how keyword).

suffixes : list-like, default is (“_x”, “_y”)

A length-2 sequence where each element is optionally a string indicating the suffix to add to overlapping column names in left and right respectively. Pass a value of None instead of a string to indicate that the column name from left or right should be left as-is, with no suffix. At least one of the values must not be None.

copy : bool, default True

If False, avoid copy if possible.

indicator : bool or str, default False

If True, adds a column to the output DataFrame called “_merge” with information on the source of each row. The column can be given a different name by providing a string argument. The column will have a Categorical type with the value of “left_only” for observations whose merge key only appears in the left DataFrame, “right_only” for observations whose merge key only appears in the right DataFrame, and “both” if the observation’s merge key is found in both DataFrames.

validate : str, optional

If specified, checks if merge is of specified type.

  • “one_to_one” or “1:1”: check if merge keys are unique in both left and right datasets.
  • “one_to_many” or “1:m”: check if merge keys are unique in left dataset.
  • “many_to_one” or “m:1”: check if merge keys are unique in right dataset.
  • “many_to_many” or “m:m”: allowed, but does not result in checks.

Returns

DataFrame

A DataFrame of the two merged objects.

See also

merge_ordered

Merge with optional filling/interpolation.

merge_asof

Merge on nearest keys.

DataFrame.join

Similar method using indices.

Notes

Support for specifying index levels as the on, left_on, and right_on parameters was added in version 0.23.0. Support for merging named Series objects was added in version 0.24.0.

Examples

>>> df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
... 'value': [1, 2, 3, 5]})
>>> df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
... 'value': [5, 6, 7, 8]})
>>> df1
lkey value
0 foo 1
1 bar 2
2 baz 3
3 foo 5
>>> df2
rkey value
0 foo 5
1 bar 6
2 baz 7
3 foo 8

Merge df1 and df2 on the lkey and rkey columns. The value columns have the default suffixes, _x and _y, appended.

>>> df1.merge(df2, left_on='lkey', right_on='rkey')
lkey value_x rkey value_y
0 foo 1 foo 5
1 foo 1 foo 8
2 foo 5 foo 5
3 foo 5 foo 8
4 bar 2 bar 6
5 baz 3 baz 7

Merge DataFrames df1 and df2 with specified left and right suffixes appended to any overlapping columns.

>>> df1.merge(df2, left_on='lkey', right_on='rkey',
... suffixes=('_left', '_right'))
lkey value_left rkey value_right
0 foo 1 foo 5
1 foo 1 foo 8
2 foo 5 foo 5
3 foo 5 foo 8
4 bar 2 bar 6
5 baz 3 baz 7

Merge DataFrames df1 and df2, but raise an exception if the DataFrames have any overlapping columns.

>>> df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=(False, False))
Traceback (most recent call last):
...
ValueError: columns overlap but no suffix specified:
Index(['value'], dtype='object')
>>> df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
>>> df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})
>>> df1
a b
0 foo 1
1 bar 2
>>> df2
a c
0 foo 3
1 baz 4
>>> df1.merge(df2, how='inner', on='a')
a b c
0 foo 1 3
>>> df1.merge(df2, how='left', on='a')
a b c
0 foo 1 3.0
1 bar 2 NaN
>>> df1 = pd.DataFrame({'left': ['foo', 'bar']})
>>> df2 = pd.DataFrame({'right': [7, 8]})
>>> df1
left
0 foo
1 bar
>>> df2
right
0 7
1 8
>>> df1.merge(df2, how='cross')
left right
0 foo 7
1 foo 8
2 bar 7
3 bar 8
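
The indicator and validate parameters described above are not covered by these examples. A brief, hedged sketch, reusing frames shaped like the earlier df1 and df2 that share the column a:

>>> df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
>>> df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})
>>> # indicator=True adds a _merge column marking each row as 'left_only',
>>> # 'right_only' or 'both'; validate='one_to_one' raises MergeError if the
>>> # merge keys are not unique on both sides
>>> df1.merge(df2, on='a', how='outer', indicator=True, validate='one_to_one')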
