What is Pandas?

hemanshi · Jul 2, 2021


Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name “Pandas” refers to both “Panel Data” and “Python Data Analysis”. The library was created by Wes McKinney in 2008.

Why Use Pandas?

Pandas allows us to analyze large data sets and draw conclusions based on statistical methods.

Pandas can clean messy data sets and make them readable and relevant.

Relevant data is very important in data science.
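As a tiny illustration (the data here is made up), “cleaning” often means dropping or filling missing values:

import numpy as np
import pandas as pd

# hypothetical messy data with missing values
raw = pd.DataFrame({"price": [10.0, np.nan, 12.5], "qty": [1, 2, None]})

clean = raw.dropna()    # drop rows that contain missing values
filled = raw.fillna(0)  # or fill missing values with a default instead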

Intro to data structures

We’ll start with a quick, non-comprehensive overview of the fundamental data structures in pandas to get you started. The fundamental behavior regarding data types, indexing, and axis labeling/alignment applies across all of the objects. To get started, import NumPy and load pandas into your namespace:

In [1]: import numpy as np

In [2]: import pandas as pd

Here is a basic tenet to keep in mind: data alignment is intrinsic. The link between labels and data will not be broken unless done so explicitly by you.
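For example (a minimal sketch with made-up values, using the pd imported above), arithmetic between two Series aligns on labels rather than positions, and labels present in only one operand produce NaN:

s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

s1 + s2  # a: NaN, b: 12.0, c: 23.0, d: NaN (aligned by label, not position)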

We’ll give a brief intro to the data structures, then consider all of the broad categories of functionality and methods in separate sections.

Series

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:

>>> s = pd.Series(data, index=index)

Here, data can be many different things:

  • a Python dict
  • an ndarray
  • a scalar value (like 5)

The passed index is a list of axis labels. Thus, this separates into a few cases depending on what data is:

From ndarray

If data is an ndarray, index must be the same length as data. If no index is passed, one will be created having values [0, ..., len(data) - 1].

In [3]: s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])

In [4]: s
Out[4]:
a    0.469112
b   -0.282863
c   -1.509059
d   -1.135632
e    1.212112
dtype: float64

In [5]: s.index
Out[5]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [6]: pd.Series(np.random.randn(5))
Out[6]:
0   -0.173215
1    0.119209
2   -1.044236
3   -0.861849
4   -2.104569
dtype: float64

Note

pandas supports non-unique index values. If an operation that does not support duplicate index values is attempted, an exception will be raised at that time. The reason for being lazy is nearly all performance-based (there are many instances in computations, like parts of GroupBy, where the index is not used).
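For instance (a small made-up sketch), constructing and indexing a Series with duplicate labels works fine, but an operation that requires unique labels, such as reindex, raises only when you call it:

s_dup = pd.Series([1, 2, 3], index=["a", "a", "b"])
s_dup["a"]            # returns both values labeled "a"
s_dup.reindex(["a"])  # raises ValueError: duplicate labels cannot be reindexed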

From dict

Series can be instantiated from dicts:

In [7]: d = {"b": 1, "a": 0, "c": 2}

In [8]: pd.Series(d)
Out[8]:
b    1
a    0
c    2
dtype: int64

Note

When the data is a dict, and an index is not passed, the Series index will be ordered by the dict’s insertion order, if you’re using Python version >= 3.6 and pandas version >= 0.23.

If you’re using Python < 3.6 or pandas < 0.23, and an index is not passed, the Series index will be the lexically ordered list of dict keys.

In the example above, if you were on a Python version lower than 3.6 or a pandas version lower than 0.23, the Series would be ordered by the lexical order of the dict keys (i.e. ['a', 'b', 'c'] rather than ['b', 'a', 'c']).

If an index is passed, the values in data corresponding to the labels in the index will be pulled out.

In [9]: d = {"a": 0.0, "b": 1.0, "c": 2.0}

In [10]: pd.Series(d)
Out[10]:
a    0.0
b    1.0
c    2.0
dtype: float64

In [11]: pd.Series(d, index=["b", "c", "d", "a"])
Out[11]:
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

Note

NaN (not a number) is the standard missing data marker used in pandas.

From scalar value

If data is a scalar value, an index must be provided. The value will be repeated to match the length of index.

In [12]: pd.Series(5.0, index=["a", "b", "c", "d", "e"])
Out[12]:
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

Series is ndarray-like

Series acts very similarly to an ndarray, and is a valid argument to most NumPy functions. However, operations such as slicing will also slice the index.

In [13]: s[0]
Out[13]: 0.4691122999071863

In [14]: s[:3]
Out[14]:
a    0.469112
b   -0.282863
c   -1.509059
dtype: float64

In [15]: s[s > s.median()]
Out[15]:
a    0.469112
e    1.212112
dtype: float64

In [16]: s[[4, 3, 1]]
Out[16]:
e    1.212112
d   -1.135632
b   -0.282863
dtype: float64

In [17]: np.exp(s)
Out[17]:
a    1.598575
b    0.753623
c    0.221118
d    0.321219
e    3.360575
dtype: float64

Note

We will address array-based indexing like s[[4, 3, 1]] in a later section.

Like a NumPy array, a pandas Series has a dtype.

In [18]: s.dtype
Out[18]: dtype('float64')

This is often a NumPy dtype. However, pandas and 3rd-party libraries extend NumPy’s type system in a few places, in which case the dtype would be an ExtensionDtype. Some examples within pandas are Categorical data and Nullable integer data type. See dtypes for more.

If you need the actual array backing a Series, use Series.array.

In [19]: s.array
Out[19]:
<PandasArray>
[ 0.4691122999071863, -0.2828633443286633, -1.5090585031735124,
-1.1356323710171934, 1.2121120250208506]
Length: 5, dtype: float64

Accessing the array can be useful when you need to do some operation without the index (to disable automatic alignment, for example).

Series.array will always be an ExtensionArray. Briefly, an ExtensionArray is a thin wrapper around one or more concrete arrays like a numpy.ndarray. pandas knows how to take an ExtensionArray and store it in a Series or a column of a DataFrame. See dtypes for more.

While Series is ndarray-like, if you need an actual ndarray, then use Series.to_numpy().

In [20]: s.to_numpy()
Out[20]: array([ 0.4691, -0.2829, -1.5091, -1.1356, 1.2121])

Even if the Series is backed by an ExtensionArray, Series.to_numpy() will return a NumPy ndarray.
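For instance, operating on the underlying NumPy arrays combines values purely by position and ignores the index. A sketch with a hypothetical second Series whose index is reversed relative to s:

s2 = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=["e", "d", "c", "b", "a"])
s + s2                        # aligns on the labels "a" through "e"
s.to_numpy() + s2.to_numpy()  # purely positional, no label alignment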

Series is dict-like

A Series is like a fixed-size dict in that you can get and set values by index label:

In [21]: s["a"]
Out[21]: 0.4691122999071863
In [22]: s["e"] = 12.0In [23]: s
Out[23]:
a 0.469112
b -0.282863
c -1.509059
d -1.135632
e 12.000000
dtype: float64
In [24]: "e" in s
Out[24]: True
In [25]: "f" in s
Out[25]: False

If a label is not contained, an exception is raised:

>>> s["f"]
KeyError: 'f'

Using the get method, a missing label will return None or a specified default:

In [26]: s.get("f")

In [27]: s.get("f", np.nan)
Out[27]: nan

The pandas DataFrame

A DataFrame is a two-dimensional data structure composed of rows and columns, exactly like a spreadsheet or a SQL table. Each column of a DataFrame is a pandas Series. The columns should be of the same length, but they can be of different data types: float, int, bool, and so on. DataFrames are both value-mutable and size-mutable, whereas a Series is only value-mutable, not size-mutable (the length of a Series cannot be changed, although its values can). This lets us perform operations that alter values held within the DataFrame or add/delete columns to/from the DataFrame.
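A quick sketch of this mutability (hypothetical data):

df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
df["c"] = df["a"] + df["b"]  # size-mutable: add a new column
df.loc[0, "a"] = 10          # value-mutable: change a single value
del df["c"]                  # size-mutable: delete a column again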

A DataFrame consists of three components.

  • Two-dimensional data (Values)
  • Row index
  • Column index

The DataFrame has two index arrays. The first functions much like the index array of a Series: each label is associated with all the values in its row. The second contains a series of labels, each associated with a particular column. A DataFrame therefore has two axes (dimensions), commonly referred to as axis 0 and axis 1, or the row/index axis and the column axis, respectively.

DataFrame creation: Introduction

A DataFrame is the most commonly used data structure in pandas. The DataFrame() class constructor accepts many different types of arguments: a dict of lists, ndarrays, or Series; a 2-D ndarray; a list of dicts; another DataFrame; and so on. The general construct for creating a DataFrame is pd.DataFrame(data, index=index, columns=columns), where index labels the rows and columns labels the columns.

Creating a pandas DataFrame from another object
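For example, here is a minimal sketch (with made-up data) that builds a DataFrame from a dict of lists and supplies explicit row labels:

data = {"name": ["alpha", "beta"], "value": [1.5, 2.5]}
df = pd.DataFrame(data, index=["r1", "r2"])

df.index    # Index(['r1', 'r2'], dtype='object'), the row index
df.columns  # Index(['name', 'value'], dtype='object'), the column index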

The Pandas Series Object

A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array as follows:

In [2]:

data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

Out[2]:

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

As we see in the output, the Series wraps both a sequence of values and a sequence of indices, which we can access with the values and index attributes. The values are simply a familiar NumPy array:

In [3]:

data.values

Out[3]:

array([ 0.25,  0.5 ,  0.75,  1.  ])

The index is an array-like object of type pd.Index, which we'll discuss in more detail momentarily.

In [4]:

data.index

Out[4]:

RangeIndex(start=0, stop=4, step=1)

Like with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation:

In [5]:

data[1]

Out[5]:

0.5

In [6]:

data[1:3]

Out[6]:

1    0.50
2    0.75
dtype: float64

As we will see, though, the Pandas Series is much more general and flexible than the one-dimensional NumPy array that it emulates.
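For example, unlike a NumPy array, the index need not be a sequence of integers. Here is a small sketch (same made-up values as above) with explicit string labels:

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=["a", "b", "c", "d"])
data["b"]  # 0.5, accessed by label rather than by position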

Using the Pandas read_csv() and .to_csv() Functions

A comma-separated values (CSV) file is a plaintext file with a .csv extension that holds tabular data. This is one of the most popular file formats for storing large amounts of data. Each row of the CSV file represents a single table row. The values in the same row are by default separated with commas, but you could change the separator to a semicolon, tab, space, or some other character.
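In the examples below, df is assumed to be a DataFrame of country data like the one printed in the Read a CSV File section; a truncated stand-in that you could build yourself might look like this:

df = pd.DataFrame(
    {"COUNTRY": ["China", "India"], "POP": [1398.72, 1351.16]},
    index=["CHN", "IND"],
)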

Write a CSV File

You can save your Pandas DataFrame as a CSV file with .to_csv():


>>> df.to_csv('data.csv')

That’s it! You’ve created the file data.csv in your current working directory.

This text file contains the data separated with commas. The first column contains the row labels. In some cases, you’ll find them irrelevant. If you don’t want to keep them, then you can pass the argument index=False to .to_csv().
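For instance:

>>> df.to_csv('data.csv', index=False)

You can also change the separator with the sep parameter, e.g. sep=';' for a semicolon-separated file.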

Read a CSV File

Once your data is saved in a CSV file, you’ll likely want to load and use it from time to time. You can do that with the Pandas read_csv() function:


>>> df = pd.read_csv('data.csv', index_col=0)
>>> df
COUNTRY POP AREA GDP CONT IND_DAY
CHN China 1398.72 9596.96 12234.78 Asia NaN
IND India 1351.16 3287.26 2575.67 Asia 1947-08-15
USA US 329.74 9833.52 19485.39 N.America 1776-07-04
IDN Indonesia 268.07 1910.93 1015.54 Asia 1945-08-17
BRA Brazil 210.32 8515.77 2055.51 S.America 1822-09-07
PAK Pakistan 205.71 881.91 302.14 Asia 1947-08-14
NGA Nigeria 200.96 923.77 375.77 Africa 1960-10-01
BGD Bangladesh 167.09 147.57 245.63 Asia 1971-03-26
RUS Russia 146.79 17098.25 1530.75 NaN 1992-06-12
MEX Mexico 126.58 1964.38 1158.23 N.America 1810-09-16
JPN Japan 126.22 377.97 4872.42 Asia NaN
DEU Germany 83.02 357.11 3693.20 Europe NaN
FRA France 67.02 640.68 2582.49 Europe 1789-07-14
GBR UK 66.44 242.50 2631.23 Europe NaN
ITA Italy 60.36 301.34 1943.84 Europe NaN
ARG Argentina 44.94 2780.40 637.49 S.America 1816-07-09
DZA Algeria 43.38 2381.74 167.56 Africa 1962-07-05
CAN Canada 37.59 9984.67 1647.12 N.America 1867-07-01
AUS Australia 25.47 7692.02 1408.68 Oceania NaN
KAZ Kazakhstan 18.53 2724.90 159.41 Asia 1991-12-16

In this case, the Pandas read_csv() function returns a new DataFrame with the data and labels from the file data.csv, which you specified with the first argument. This string can be any valid path, including URLs.

The parameter index_col specifies the column of the CSV file that contains the row labels. You assign a zero-based column index to this parameter. You should set index_col whenever the CSV file contains row labels, so that they aren’t loaded as data.

You’ll learn more about using Pandas with CSV files later on in this tutorial. You can also check out Reading and Writing CSV Files in Python to see how to handle CSV files with the built-in Python library csv as well.

pandas.DataFrame.to_sql

DataFrame.to_sql(name, con, schema=None, if_exists='fail', index=True, index_label=None, chunksize=None, dtype=None, method=None)

Write records stored in a DataFrame to a SQL database.

Databases supported by SQLAlchemy [1] are supported. Tables can be newly created, appended to, or overwritten.

Parameters

name : str

Name of SQL table.

con : sqlalchemy.engine.(Engine or Connection) or sqlite3.Connection

Using SQLAlchemy makes it possible to use any DB supported by that library. Legacy support is provided for sqlite3.Connection objects. The user is responsible for engine disposal and connection closure for the SQLAlchemy connectable; see the SQLAlchemy documentation for details.

schema : str, optional

Specify the schema (if database flavor supports this). If None, use default schema.

if_exists : {'fail', 'replace', 'append'}, default 'fail'

How to behave if the table already exists.

  • fail: Raise a ValueError.
  • replace: Drop the table before inserting new values.
  • append: Insert new values to the existing table.

index : bool, default True

Write DataFrame index as a column. Uses index_label as the column name in the table.

index_label : str or sequence, default None

Column label for index column(s). If None is given (default) and index is True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex.

chunksize : int, optional

Specify the number of rows in each batch to be written at a time. By default, all rows will be written at once.

dtype : dict or scalar, optional

Specifying the datatype for columns. If a dictionary is used, the keys should be the column names and the values should be the SQLAlchemy types or strings for the sqlite3 legacy mode. If a scalar is provided, it will be applied to all columns.

method : {None, 'multi', callable}, optional

Controls the SQL insertion clause used:

  • None : Uses standard SQL INSERT clause (one per row).
  • ‘multi’: Pass multiple values in a single INSERT clause.
  • callable with signature (pd_table, conn, keys, data_iter).

Details and a sample callable implementation can be found in the pandas documentation’s section on the insert method.
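As a minimal sketch of to_sql in action (using the sqlite3 legacy support; the file and table names here are hypothetical):

import sqlite3

import pandas as pd

df = pd.DataFrame({"name": ["alpha", "beta"], "value": [1, 2]})

con = sqlite3.connect("example.db")
df.to_sql("records", con, if_exists="replace", index=False)  # drop and recreate if the table exists
print(pd.read_sql("SELECT * FROM records", con))  # read it back to verify
con.close()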

Enriching a DataFrame Through an API

Many APIs allow applications to talk to one another and exchange data in a highly scalable, automated way.

This post is going to show you a basic example of how you can use the pandas apply method to enrich a DataFrame with API data.

A few cases where you might want to use an external API:

  • Calculate shipping costs with Fedex/UPS/etc between your warehouse and your customer
  • Extract Latitude and Longitude from a street address or zipcode
  • Trigger an automation in Zapier
  • Email a customer an order status update

Really, the list is endless. We’re going to focus on integrating one API.

Running through this post is 10x better if you use a Jupyter notebook.

1. The Google Geocoding API

Keep in mind that Google’s Geocoding API allows a number of free requests, but after a certain point it starts to charge you.

1. Sign up for Google Cloud Platform (GCP).
2. Log in to the console.
3. Go to the Geocoding API page and click “Enable”. If you can’t find it directly, do a search for “Geocoding API”, open it, and click “Enable”.

You might have to create a project and complete a few other configuration steps to get to the point where you can activate the Geocoding API.

4. Get your Google Cloud API credentials:

  • Do a search for “Credentials” within GCP.
  • Click “Create Credentials” > “API Key”.
  • Copy the key value (mine was AIzaSyBSXMpu6lqd8kViIpy1GNWQ1symTXdMRzw); this is your Google Cloud API key.

Google Cloud API keys have unrestricted privileges, so it’s highly recommended that you restrict the API key to the Geocoding API as well as to your local IP address.

2. Install Requirements


pip install pandas requests

3. Let’s code


import datetime
import pandas as pd
import requests


# our dataset
data = {"addresses": ['Newport Beach, California', 'New York City', 'London, England', 10001, 'Sydney, Au']}


# Calling DataFrame constructor on addresses list 
df = pd.DataFrame(data)
df

We only have one column of data here. But what we’re about to do can work for nearly any number of columns in a similar way.

Let’s take a look at how the pandas apply method works with a simple example.


# create a throw-away dataframe
df_throwaway = df.copy()

def apply_this_function(passed_row):
    passed_row['new_col'] = True
    passed_row['added'] = datetime.datetime.now()
    return passed_row

df_throwaway.apply(apply_this_function, axis=1)  # axis=1 is important to use the row itself

As we see, the pandas apply function works really well for adding additional columns to our current columns. But we can go one step further: we can create new columns and row values based on what’s within the row itself.


# create another throw-away dataframe
df_example_2 = df.copy()

def apply_this_other_function(row):
    column_name = 'addresses'
    address_value = row[column_name]
    if isinstance(address_value, int):
        row[column_name] = address_value * 2
    return row

df_example_2.apply(apply_this_other_function, axis=1)  # axis=1 is important to use the row itself

Now we see two main things about the .apply method. In each row, we can:

1. Add new columns based on the values of other columns (within that row).
2. Change the value of any column within that single row.

In other words, .apply enables us to change rows very dynamically. Now it's time to call our Geocoding API.


# create a working example. I like using a copy of the source data in case we make mistakes
rest_api_df = df.copy()
GOOGLE_API_KEY = 'your_api_key_from_above'

def extract_lat_long_via_address(address_or_zipcode):
    lat, lng = None, None
    api_key = GOOGLE_API_KEY
    base_url = "https://maps.googleapis.com/maps/api/geocode/json"
    endpoint = f"{base_url}?address={address_or_zipcode}&key={api_key}"
    # see how our endpoint includes our API key? Yes, this is yet another reason to restrict the key
    r = requests.get(endpoint)
    if r.status_code not in range(200, 299):
        return None, None
    try:
        # This try block is in case any of our inputs are invalid. This is done instead
        # of actually writing out handlers for all kinds of responses.
        results = r.json()['results'][0]
        lat = results['geometry']['location']['lat']
        lng = results['geometry']['location']['lng']
    except (IndexError, KeyError):
        pass
    return lat, lng

def enrich_with_geocoding_api(row):
    column_name = 'addresses'
    address_value = row[column_name]
    address_lat, address_lng = extract_lat_long_via_address(address_value)
    row['lat'] = address_lat
    row['lng'] = address_lng
    return row

rest_api_df.apply(enrich_with_geocoding_api, axis=1)  # axis=1 is important to use the row itself
