A Brief Guide to Pandas for Time-Series Data Analysis

Quarterly Average Electricity Production from 1985 to 2018

Brief Guide To Pandas For Time Series Analysys

A series of data is taken in a timely manner. For example total sales at a showroom every day, daily billing data for a service like Apple Music, hourly readings of a telemetry device, and so on.

Pandas is a powerful Python library that provides various functionalities for time series analysis. Here are some key features and techniques in pandas that you can use for time series analysis:

Loading and Manipulating Time Series Data: Pandas provides convenient tools for reading time series data from various formats (CSV, Excel, SQL databases, etc.) and transforming it into a pandas DataFrame with a datetime index. You can use functions like read_csv(), read_excel(), and read_sql() to load the data. Once the data is loaded, you can manipulate it using pandas’ indexing, slicing, filtering, and resampling capabilities.

import pandas as pd

# Load time series data
data = pd.read_csv('data.csv')

# Convert a column to datetime index
data['datetime_column'] = pd.to_datetime(data['datetime_column'])
data.set_index('datetime_column', inplace=True)

# Accessing and slicing time series data
subset = data['start_date':'end_date']

Handling Missing Values: Time series data often contains missing values that need to be handled before analysis. Pandas provides functions like dropna(), fillna(), and interpolate() to handle missing data by dropping the rows, filling with specific values, or interpolating based on neighboring values.

# Drop rows with missing values
data = data.dropna()

# Fill missing values with a specific value
data = data.fillna(0)

# Interpolate missing values
data = data.interpolate()

Resampling and Frequency Conversion: Pandas allows you to resample time series data to different frequencies (e.g., downsampling or upsampling) using functions like resample() and asfreq(). This can be useful for aggregating or interpolating data at different time intervals.y

# Resample data to monthly frequency
monthly_data = data.resample('M').mean()

# Upsample data to a higher frequency using interpolation
upsampled_data = data.resample('D').interpolate()
Time Series Forecasting

Forecasting and predicting the sales/values/outcomes based on historical data points.

Helpful Methods in Pandas for Time-Series

shift() – Shift index by desired number of dates/periods. It can work as LAG and LEAD in SQL.

## points shifted by -3 or lagged by 3
df['Lag_by_3'] = df.shift(-3)

# points shifted by 3 or lead by 3
df['Lead_by_3'] = df.shift(3)

rolling() – for any calculation related to the rolling window i.e. rolling/moving average, moving standard deviation, etc.

This is used to get the rolling/moving average, mean, median, % changes, and sum on a particular window.

Example: Rolling Sum of Sales with window of 7 days. Here for day d, we will take the sum sales data from d-6 and add it to d.

The rolling sum of 7 days window at day d = Sum(Sales from d-6 to d)

# get rolling sum with window of 7 days (given we've day-wise data)
df.rolling(7).sum()

resample() – for upsampling or downsampling of time-series data.

# get quarterly mean from our time series
quarterly_df = df.resample('Q').mean()

# it will show 4 rows of data for year 2017 if Date column is used as index
quarterly_df['2017']

diff() –  get the discrete difference of data points over the given axis (columns, periods, etc.)

# it will calculate difference of data points with period=1
df['Sales'].diff()

You can find examples of these methods in this Kaggle Notebook:

Further readings and applications

If we don’t have more data for our time series, we can use Pandas and related eco-system. Every business has some kind of time-series data and we have just scratched the surface of time-series with simple exploratory data analysis. We have a lot of advanced libraries like Prophet that the industry is using very heavily for time-series analysis and forecasting. Stay tuned for more on that.

Leave a Comment

Comments

No comments yet. Why don’t you start the discussion?

Leave a Reply