What is Pandas?

hemanshi · Jul 2, 2021


Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name “Pandas” refers to both “Panel Data” and “Python Data Analysis”. The library was created by Wes McKinney in 2008.

Why Use Pandas?

Pandas allows us to analyze large data sets and draw conclusions based on statistical methods.

Pandas can clean messy data sets and make them readable and relevant.

Relevant data is very important in data science.
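As a tiny illustration (the data here is made up), “cleaning” often means dropping or filling missing values:

import numpy as np
import pandas as pd

# hypothetical messy data with missing values
raw = pd.DataFrame({"price": [10.0, np.nan, 12.5], "qty": [1, 2, None]})

clean = raw.dropna()    # drop rows that contain missing values
filled = raw.fillna(0)  # or fill missing values with a default instead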

Intro to data structures

We’ll start with a quick, non-comprehensive overview of the fundamental data structures in pandas to get you started. The fundamental behavior regarding data types, indexing, and axis labeling/alignment applies across all of the objects. To get started, import NumPy and load pandas into your namespace:

In [1]: import numpy as np

In [2]: import pandas as pd

Here is a basic tenet to keep in mind: data alignment is intrinsic. The link between labels and data will not be broken unless done so explicitly by you.
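For example (a minimal sketch with made-up values, using the pd imported above), arithmetic between two Series aligns on labels rather than positions, and labels present in only one operand produce NaN:

s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

s1 + s2  # a: NaN, b: 12.0, c: 23.0, d: NaN (aligned by label, not position)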

We’ll give a brief intro to the data structures, then consider all of the broad categories of functionality and methods in separate sections.

Series

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:

>>> s = pd.Series(data, index=index)

Here, data can be many different things:

  • a Python dict
  • an ndarray
  • a scalar value (like 5)

The passed index is a list of axis labels. Thus, this separates into a few cases depending on what data is:

From ndarray

If data is an ndarray, index must be the same length as data. If no index is passed, one will be created having values [0, ..., len(data) - 1].

In [3]: s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])

In [4]: s
Out[4]:
a    0.469112
b   -0.282863
c   -1.509059
d   -1.135632
e    1.212112
dtype: float64

In [5]: s.index
Out[5]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [6]: pd.Series(np.random.randn(5))
Out[6]:
0   -0.173215
1    0.119209
2   -1.044236
3   -0.861849
4   -2.104569
dtype: float64

Note

pandas supports non-unique index values. If an operation that does not support duplicate index values is attempted, an exception will be raised at that time. The reason for being lazy is nearly all performance-based (there are many instances in computations, like parts of GroupBy, where the index is not used).
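For instance (a small made-up sketch), constructing and indexing a Series with duplicate labels works fine, but an operation that requires unique labels, such as reindex, raises only when you call it:

s_dup = pd.Series([1, 2, 3], index=["a", "a", "b"])
s_dup["a"]            # returns both values labeled "a"
s_dup.reindex(["a"])  # raises ValueError: duplicate labels cannot be reindexed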

From dict

Series can be instantiated from dicts:

In [7]: d = {"b": 1, "a": 0, "c": 2}

In [8]: pd.Series(d)
Out[8]:
b    1
a    0
c    2
dtype: int64

Note

When the data is a dict, and an index is not passed, the Series index will be ordered by the dict’s insertion order, if you’re using Python version >= 3.6 and pandas version >= 0.23.

If you’re using Python < 3.6 or pandas < 0.23, and an index is not passed, the Series index will be the lexically ordered list of dict keys.

In the example above, if you were on a Python version lower than 3.6 or a pandas version lower than 0.23, the Series would be ordered by the lexical order of the dict keys (i.e. ['a', 'b', 'c'] rather than ['b', 'a', 'c']).

If an index is passed, the values in data corresponding to the labels in the index will be pulled out.

In [9]: d = {"a": 0.0, "b": 1.0, "c": 2.0}

In [10]: pd.Series(d)
Out[10]:
a    0.0
b    1.0
c    2.0
dtype: float64

In [11]: pd.Series(d, index=["b", "c", "d", "a"])
Out[11]:
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

Note

NaN (not a number) is the standard missing data marker used in pandas.

From scalar value

If data is a scalar value, an index must be provided. The value will be repeated to match the length of index.

In [12]: pd.Series(5.0, index=["a", "b", "c", "d", "e"])
Out[12]:
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

Series is ndarray-like

Series acts very similarly to an ndarray, and is a valid argument to most NumPy functions. However, operations such as slicing will also slice the index.

In [13]: s[0]
Out[13]: 0.4691122999071863

In [14]: s[:3]
Out[14]:
a    0.469112
b   -0.282863
c   -1.509059
dtype: float64

In [15]: s[s > s.median()]
Out[15]:
a    0.469112
e    1.212112
dtype: float64

In [16]: s[[4, 3, 1]]
Out[16]:
e    1.212112
d   -1.135632
b   -0.282863
dtype: float64

In [17]: np.exp(s)
Out[17]:
a    1.598575
b    0.753623
c    0.221118
d    0.321219
e    3.360575
dtype: float64

Note

We will address array-based indexing like s[[4, 3, 1]] in a later section.

Like a NumPy array, a pandas Series has a dtype.

In [18]: s.dtype
Out[18]: dtype('float64')

This is often a NumPy dtype. However, pandas and 3rd-party libraries extend NumPy’s type system in a few places, in which case the dtype would be an ExtensionDtype. Some examples within pandas are Categorical data and Nullable integer data type. See dtypes for more.

If you need the actual array backing a Series, use Series.array.

In [19]: s.array
Out[19]:
<PandasArray>
[ 0.4691122999071863, -0.2828633443286633, -1.5090585031735124,
-1.1356323710171934, 1.2121120250208506]
Length: 5, dtype: float64

Accessing the array can be useful when you need to do some operation without the index (to disable automatic alignment, for example).

Series.array will always be an ExtensionArray. Briefly, an ExtensionArray is a thin wrapper around one or more concrete arrays like a numpy.ndarray. pandas knows how to take an ExtensionArray and store it in a Series or a column of a DataFrame. See dtypes for more.

While Series is ndarray-like, if you need an actual ndarray, then use Series.to_numpy().

In [20]: s.to_numpy()
Out[20]: array([ 0.4691, -0.2829, -1.5091, -1.1356, 1.2121])

Even if the Series is backed by an ExtensionArray, Series.to_numpy() will return a NumPy ndarray.
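For instance, operating on the underlying NumPy arrays combines values purely by position and ignores the index. A sketch with a hypothetical second Series whose index is reversed relative to s:

s2 = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=["e", "d", "c", "b", "a"])
s + s2                        # aligns on the labels "a" through "e"
s.to_numpy() + s2.to_numpy()  # purely positional, no label alignment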

Series is dict-like

A Series is like a fixed-size dict in that you can get and set values by index label:

In [21]: s["a"]
Out[21]: 0.4691122999071863
In [22]: s["e"] = 12.0In [23]: s
Out[23]:
a 0.469112
b -0.282863
c -1.509059
d -1.135632
e 12.000000
dtype: float64
In [24]: "e" in s
Out[24]: True
In [25]: "f" in s
Out[25]: False

If a label is not contained, an exception is raised:

>>> s["f"]
KeyError: 'f'

Using the get method, a missing label will return None or a specified default:

In [26]: s.get("f")

In [27]: s.get("f", np.nan)
Out[27]: nan

The pandas DataFrame

A DataFrame is a two-dimensional data structure composed of rows and columns, exactly like a spreadsheet or a SQL table. Each column of a DataFrame is a pandas Series. The columns should be of the same length, but they can be of different data types: float, int, bool, and so on. DataFrames are both value-mutable and size-mutable, whereas a Series is only value-mutable, not size-mutable (the length of a Series cannot be changed, although its values can). This lets us perform operations that alter values held within the DataFrame or add/delete columns to/from the DataFrame.
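A quick sketch of this mutability (hypothetical data):

df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
df["c"] = df["a"] + df["b"]  # size-mutable: add a new column
df.loc[0, "a"] = 10          # value-mutable: change a single value
del df["c"]                  # size-mutable: delete a column again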

A DataFrame consists of three components.

  • Two-dimensional data (Values)
  • Row index
  • Column index

The DataFrame has two index arrays. The first functions much like the index array of a Series: each label is associated with all the values in its row. The second contains a series of labels, each associated with a particular column. A DataFrame therefore has two axes (dimensions), commonly referred to as axis 0 and axis 1, or the row/index axis and the column axis, respectively.

DataFrame creation: Introduction

A DataFrame is the most commonly used data structure in pandas. The DataFrame() class constructor accepts many different types of arguments: a dict of lists, ndarrays, or Series; a 2-D ndarray; a list of dicts; another DataFrame; and so on. The general construct for creating a DataFrame is pd.DataFrame(data, index=index, columns=columns), where index labels the rows and columns labels the columns.

Creating a pandas DataFrame from another object
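For example, here is a minimal sketch (with made-up data) that builds a DataFrame from a dict of lists and supplies explicit row labels:

data = {"name": ["alpha", "beta"], "value": [1.5, 2.5]}
df = pd.DataFrame(data, index=["r1", "r2"])

df.index    # Index(['r1', 'r2'], dtype='object'), the row index
df.columns  # Index(['name', 'value'], dtype='object'), the column index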

The Pandas Series Object

A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array as follows:

In [2]:

data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

Out[2]:

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

As we see in the output, the Series wraps both a sequence of values and a sequence of indices, which we can access with the values and index attributes. The values are simply a familiar NumPy array:

In [3]:

data.values

Out[3]:

array([ 0.25,  0.5 ,  0.75,  1.  ])

The index is an array-like object of type pd.Index, which we'll discuss in more detail momentarily.

In [4]:

data.index

Out[4]:

RangeIndex(start=0, stop=4, step=1)

Like with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation:

In [5]:

data[1]

Out[5]:

0.5

In [6]:

data[1:3]

Out[6]:

1    0.50
2    0.75
dtype: float64

As we will see, though, the Pandas Series is much more general and flexible than the one-dimensional NumPy array that it emulates.
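For example, unlike a NumPy array, the index need not be a sequence of integers. Here is a small sketch (same made-up values as above) with explicit string labels:

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=["a", "b", "c", "d"])
data["b"]  # 0.5, accessed by label rather than by position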

Using the Pandas read_csv() and .to_csv() Functions

A comma-separated values (CSV) file is a plaintext file with a .csv extension that holds tabular data. This is one of the most popular file formats for storing large amounts of data. Each row of the CSV file represents a single table row. The values in the same row are by default separated with commas, but you could change the separator to a semicolon, tab, space, or some other character.
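In the examples below, df is assumed to be a DataFrame of country data like the one printed in the Read a CSV File section; a truncated stand-in that you could build yourself might look like this:

df = pd.DataFrame(
    {"COUNTRY": ["China", "India"], "POP": [1398.72, 1351.16]},
    index=["CHN", "IND"],
)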

Write a CSV File

You can save your Pandas DataFrame as a CSV file with .to_csv():


>>> df.to_csv('data.csv')

That’s it! You’ve created the file data.csv in your current working directory.

This text file contains the data separated with commas. The first column contains the row labels. In some cases, you’ll find them irrelevant. If you don’t want to keep them, then you can pass the argument index=False to .to_csv().
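For instance:

>>> df.to_csv('data.csv', index=False)

You can also change the separator with the sep parameter, e.g. sep=';' for a semicolon-separated file.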

Read a CSV File

Once your data is saved in a CSV file, you’ll likely want to load and use it from time to time. You can do that with the Pandas read_csv() function:


>>> df = pd.read_csv('data.csv', index_col=0)
>>> df
COUNTRY POP AREA GDP CONT IND_DAY
CHN China 1398.72 9596.96 12234.78 Asia NaN
IND India 1351.16 3287.26 2575.67 Asia 1947-08-15
USA US 329.74 9833.52 19485.39 N.America 1776-07-04
IDN Indonesia 268.07 1910.93 1015.54 Asia 1945-08-17
BRA Brazil 210.32 8515.77 2055.51 S.America 1822-09-07
PAK Pakistan 205.71 881.91 302.14 Asia 1947-08-14
NGA Nigeria 200.96 923.77 375.77 Africa 1960-10-01
BGD Bangladesh 167.09 147.57 245.63 Asia 1971-03-26
RUS Russia 146.79 17098.25 1530.75 NaN 1992-06-12
MEX Mexico 126.58 1964.38 1158.23 N.America 1810-09-16
JPN Japan 126.22 377.97 4872.42 Asia NaN
DEU Germany 83.02 357.11 3693.20 Europe NaN
FRA France 67.02 640.68 2582.49 Europe 1789-07-14
GBR UK 66.44 242.50 2631.23 Europe NaN
ITA Italy 60.36 301.34 1943.84 Europe NaN
ARG Argentina 44.94 2780.40 637.49 S.America 1816-07-09
DZA Algeria 43.38 2381.74 167.56 Africa 1962-07-05
CAN Canada 37.59 9984.67 1647.12 N.America 1867-07-01
AUS Australia 25.47 7692.02 1408.68 Oceania NaN
KAZ Kazakhstan 18.53 2724.90 159.41 Asia 1991-12-16

In this case, the Pandas read_csv() function returns a new DataFrame with the data and labels from the file data.csv, which you specified with the first argument. This string can be any valid path, including URLs.

The parameter index_col specifies the column of the CSV file that contains the row labels. You assign a zero-based column index to this parameter. You should set index_col whenever the CSV file contains row labels, so that they aren’t loaded as data.

You’ll learn more about using Pandas with CSV files later on in this tutorial. You can also check out Reading and Writing CSV Files in Python to see how to handle CSV files with the built-in Python library csv as well.

pandas.DataFrame.to_sql

DataFrame.to_sql(name, con, schema=None, if_exists='fail', index=True, index_label=None, chunksize=None, dtype=None, method=None)

Write records stored in a DataFrame to a SQL database.

Databases supported by SQLAlchemy [1] are supported. Tables can be newly created, appended to, or overwritten.

Parameters

name : str

Name of SQL table.

con : sqlalchemy.engine.(Engine or Connection) or sqlite3.Connection

Using SQLAlchemy makes it possible to use any DB supported by that library. Legacy support is provided for sqlite3.Connection objects. The user is responsible for engine disposal and connection closure for the SQLAlchemy connectable; see the SQLAlchemy documentation for details.

schema : str, optional

Specify the schema (if database flavor supports this). If None, use default schema.

if_exists : {'fail', 'replace', 'append'}, default 'fail'

How to behave if the table already exists.

  • fail: Raise a ValueError.
  • replace: Drop the table before inserting new values.
  • append: Insert new values to the existing table.

index : bool, default True

Write DataFrame index as a column. Uses index_label as the column name in the table.

index_label : str or sequence, default None

Column label for index column(s). If None is given (default) and index is True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex.

chunksize : int, optional

Specify the number of rows in each batch to be written at a time. By default, all rows will be written at once.

dtype : dict or scalar, optional

Specifying the datatype for columns. If a dictionary is used, the keys should be the column names and the values should be the SQLAlchemy types or strings for the sqlite3 legacy mode. If a scalar is provided, it will be applied to all columns.

method : {None, 'multi', callable}, optional

Controls the SQL insertion clause used:

  • None : Uses standard SQL INSERT clause (one per row).
  • ‘multi’: Pass multiple values in a single INSERT clause.
  • callable with signature (pd_table, conn, keys, data_iter).

Details and a sample callable implementation can be found in the pandas documentation’s section on the insert method.
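As a minimal sketch of to_sql in action (using the sqlite3 legacy support; the file and table names here are hypothetical):

import sqlite3

import pandas as pd

df = pd.DataFrame({"name": ["alpha", "beta"], "value": [1, 2]})

con = sqlite3.connect("example.db")
df.to_sql("records", con, if_exists="replace", index=False)  # drop and recreate if the table exists
print(pd.read_sql("SELECT * FROM records", con))  # read it back to verify
con.close()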

Enriching a DataFrame Through an API

Many APIs allow applications to talk to one another and exchange data in a highly scalable, automated way.

This post is going to show you a basic example of how you can use the pandas apply method to enrich a DataFrame with API data.

A few cases where you might want to use an external API:

  • Calculate shipping costs with Fedex/UPS/etc between your warehouse and your customer
  • Extract Latitude and Longitude from a street address or zipcode
  • Trigger an automation in Zapier
  • Email a customer an order status update

Really, the list is endless. We’re going to focus on integrating one API.

Running through this post is 10x better if you use a Jupyter notebook.

1. The Google Geocoding API

Keep in mind that Google’s Geocoding API allows a number of free requests, but after a certain point it starts to charge you.

1. Sign up for Google Cloud Platform (GCP).
2. Log in to the console.
3. Go to the Geocoding API page and click “Enable”. If you can’t find it directly, do a search for “Geocoding API”, open it, and click “Enable”.

You might have to create a project and complete a few other configuration steps to get to the point where you can activate the Geocoding API.

4. Get your Google Cloud API credentials:

  • Do a search for “Credentials” within GCP.
  • Click “Create Credentials” > “API Key”.
  • Copy the key value (mine was AIzaSyBSXMpu6lqd8kViIpy1GNWQ1symTXdMRzw); this is your Google Cloud API key.

Google Cloud API keys have unrestricted privileges, so it’s highly recommended that you restrict the API key to the Geocoding API as well as to your local IP address.

2. Install Requirements


pip install pandas requests

3. Let’s code


import datetime
import pandas as pd
import requests


# our dataset
data = {"addresses": ['Newport Beach, California', 'New York City', 'London, England', 10001, 'Sydney, Au']}


# Calling DataFrame constructor on addresses list 
df = pd.DataFrame(data)
df

We only have one column of data here. But what we’re about to do can work for nearly any number of columns in a similar way.

Let’s take a look at how the pandas apply method works with a simple example.


# create a throw-away dataframe
df_throwaway = df.copy()

def apply_this_function(passed_row):
    passed_row['new_col'] = True
    passed_row['added'] = datetime.datetime.now()
    return passed_row

df_throwaway.apply(apply_this_function, axis=1)  # axis=1 is important to use the row itself

As we see, the pandas apply function works really well for adding additional columns to our current columns. But we can go one step further: we can create new columns and row values based on what’s within the row itself.


# create another throw-away dataframe
df_example_2 = df.copy()

def apply_this_other_function(row):
    column_name = 'addresses'
    address_value = row[column_name]
    if isinstance(address_value, int):
        row[column_name] = address_value * 2
    return row

df_example_2.apply(apply_this_other_function, axis=1)  # axis=1 is important to use the row itself

Now we see two main things about the .apply method. In each row, we can:

1. Add new columns based on the values of other columns (within that row).
2. Change the value of any column within that single row.

In other words, .apply enables us to change rows very dynamically. Now it's time to call our Geocoding API.


# create a working example. I like using a copy of the source data in case we make mistakes
rest_api_df = df.copy()
GOOGLE_API_KEY = 'your_api_key_from_above'

def extract_lat_long_via_address(address_or_zipcode):
    lat, lng = None, None
    api_key = GOOGLE_API_KEY
    base_url = "https://maps.googleapis.com/maps/api/geocode/json"
    endpoint = f"{base_url}?address={address_or_zipcode}&key={api_key}"
    # see how our endpoint includes our API key? Yes, this is yet another reason to restrict the key
    r = requests.get(endpoint)
    if r.status_code not in range(200, 299):
        return None, None
    try:
        # This try block is in case any of our inputs are invalid. This is done instead
        # of actually writing out handlers for all kinds of responses.
        results = r.json()['results'][0]
        lat = results['geometry']['location']['lat']
        lng = results['geometry']['location']['lng']
    except (IndexError, KeyError):
        pass
    return lat, lng

def enrich_with_geocoding_api(row):
    column_name = 'addresses'
    address_value = row[column_name]
    address_lat, address_lng = extract_lat_long_via_address(address_value)
    row['lat'] = address_lat
    row['lng'] = address_lng
    return row

rest_api_df.apply(enrich_with_geocoding_api, axis=1)  # axis=1 is important to use the row itself
