Pandas To PySpark Transition For Beginners

Today PySpark is one of the favorite choices among data engineers and data scientists. This tutorial is for someone who has already worked with pandas and is either making a fresh start with PySpark or already working with it. I personally feel pandas is superior to PySpark in terms of functions, documentation, and open-source community support, but PySpark does all the magic when you are dealing with real-world data and data ingestion pipelines. This tutorial is not about which tool is best but about understanding the basic differences in code between them.

PySpark, built on top of Spark, offers many features such as Spark SQL, DataFrames, Streaming, and machine learning support. PySpark exposes the Spark interface in Python, which helps scale workloads across distributed environments.
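
As a quick orientation, a typical PySpark script starts by creating a SparkSession, the entry point to DataFrames and Spark SQL. A minimal sketch (the application name is just a placeholder):

# Create (or reuse) a SparkSession; the app name is illustrative
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('pandas-to-pyspark').getOrCreate()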

When to use PySpark over Pandas?

Dealing with large datasets in pandas can be painful due to system memory constraints while performing complex operations. PySpark can be one of the alternatives when dealing with big data, and pandas users will find most of its commands quite similar. So why is PySpark faster than pandas? The main driver program in a Spark application controls the execution and launches parallel operations on the cluster. Spark partitions the data across the nodes of the cluster and operates on the elements in parallel. This distributed computing is what lets PySpark perform operations faster than pandas.
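
To see this partitioning in action, you can inspect how many partitions Spark has split a DataFrame into and redistribute it if needed. An illustrative sketch, assuming a Spark DataFrame named spark_df:

# Check how the data is split across the cluster
spark_df.rdd.getNumPartitions()
# Redistribute the data into 8 partitions
spark_df.repartition(8)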

However, pandas is still very powerful for complex calculations and data transformations. It has rich open-source community support and good documentation, which ease the developer's life, whereas PySpark is still evolving and lacks documentation. To get the best of both worlds, use PySpark and pandas alternately whenever required. For example, PySpark DataFrames can be easily converted to pandas DataFrames:

spark_df.toPandas()  # Convert a Spark DataFrame to a pandas DataFrame
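
The reverse direction works as well: an existing pandas DataFrame can be handed to spark.createDataFrame. A small sketch, assuming a SparkSession named spark already exists:

import pandas as pd

pdf = pd.DataFrame({'Name': ['Ajay', 'Arun'], 'Age': [29, 21]})
spark_df = spark.createDataFrame(pdf)  # pandas -> Spark
pdf_again = spark_df.toPandas()        # Spark -> pandas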

Pandas To PySpark Code Comparison

The comparison below includes creating a DataFrame, viewing the top and bottom rows, checking datatypes, group-by aggregations, joins, statistical summaries, and more.


# Using Pandas

import pandas as pd

df = pd.DataFrame({
    'Name': ['Ajay', 'Arun', 'Deepak'],
    'Age': [29, 21, 34]})

# shape of a dataframe
df.shape

# display top 5 rows
df.head(5)


# display last 5 rows
df.tail(5)

# Attributes (column names)
df.columns

# Rename Columns
df.rename(columns={'old': 'new'})

# Drop Columns
df.drop('column',axis=1)

# To filter records
df[df.col>50]

# To fill null values
df.fillna(0)

# Groupby Aggregation
df.groupby(['col1', 'col2']).agg({'col3': 'sum', 'col4': 'mean'})

# Summary Statistics
df.describe()

# DF datatype
df.info()

# Correlation
# pandas supports the pearson (default), kendall, and spearman methods
df.corr()
df['A'].corr(df['B'])

# Covariance
df.A.cov(df.B)

# Log Transformation
import numpy as np
df['log_A'] = np.log(df.A)

# Joins
left.merge(right, on='Key')
left.merge(right, left_on='col1', right_on='col2')
# Using PySpark

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    ('Ajay', 29), ('Arun', 21), ('Deepak', 34)], ['Name', 'Age'])
df.show()

# shape of a dataframe
df.count(), len(df.columns)

# display top 5 rows
df.show(5)
df.head(5)

# display last 5 rows
df.tail(5)

# Attributes (column names)
df.columns

# Rename Columns
df.withColumnRenamed('old', 'new')

# Drop Columns
df.drop('column')

# To filter records
df[df.col>50]

# To fill null values
df.fillna(0)

# Groupby Aggregation
df.groupBy(['col1', 'col2']).agg({'col3': 'sum', 'col4': 'avg'})

# Summary Statistics
df.summary().show()
df.describe().show()

# DF datatype
df.dtypes

# Correlation
# pearson coefficient is the only method available at this time
df.corr('A', 'B') 

# Covariance
df.cov('A', 'B')

# Log Transformation
from pyspark.sql import functions as f
df.withColumn('log_A', f.log(df.A))

# Joins
left.join(right, on='Key')
left.join(right, left.col1 == right.col2)


One may still debate pandas vs PySpark, but pandas has lots of features such as support for time-series analysis, EDA, hypothesis testing, etc., which brings us back to the same point: knowing when to use PySpark and when to use pandas. Both also offer SQL-style querying, which means one can write familiar, relational-database-like queries.
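
For instance, on the PySpark side a DataFrame can be registered as a temporary view and queried with plain SQL. A minimal sketch, assuming the df created earlier and a SparkSession named spark:

# Register the DataFrame as a temporary view and query it with Spark SQL
df.createOrReplaceTempView('people')
spark.sql('SELECT Name, Age FROM people WHERE Age > 25').show()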


RDD in PySpark

Spark is built around the concept of the RDD (Resilient Distributed Dataset), a fault-tolerant collection of elements that can be operated on in parallel. All transformations in PySpark are lazy, which means they are not computed until an action is called. PySpark should be a first choice for data engineers as it provides in-memory computation, lazy evaluation, partitioning, persistence, and coarse-grained operations.

There are two ways to create an RDD:

  1. Parallelized collections: the elements of an existing collection are distributed across the cluster so they can be operated on in parallel.
  2. External datasets: distributed datasets can be created from HDFS, local files, Amazon S3, Cassandra, etc.

# Parallelized collection (sc is the SparkContext, e.g. spark.sparkContext)
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)

# External dataset: read a text file into an RDD
distFile = sc.textFile("mydataset.txt")
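
Because transformations are lazy, nothing above is actually computed until an action is called. A minimal sketch using the distData RDD created above:

# Transformations (map, filter) are lazy: no work happens yet
squared = distData.map(lambda x: x * x)
evens = squared.filter(lambda x: x % 2 == 0)
# The action (collect) triggers the distributed computation -> [4, 16]
result = evens.collect()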

PySpark Best Practices

  1. Use pyspark.sql.functions and the other built-in functions available in PySpark for faster computation (they run in the JVM); see the sketch after this list.
  2. Use the same versions of Python and packages on the cluster as on the driver.
  3. For data visualization, use a platform with built-in Spark UI support, such as Databricks (e.g. Azure Databricks).
  4. Use Spark MLlib for machine-learning work.
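
As an illustration of point 1, the same uppercase transformation can be written with a Python UDF or with a built-in function; only the latter runs inside the JVM and avoids Python serialization overhead. A sketch, assuming a DataFrame df with a string column Name:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Slower: a Python UDF is executed row by row outside the JVM
upper_udf = F.udf(lambda s: s.upper() if s else None, StringType())
df.withColumn('Name_upper', upper_udf(df.Name))

# Faster: the built-in upper() runs inside the JVM
df.withColumn('Name_upper', F.upper(df.Name))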

Things not to do:

  1. Don't iterate through rows; that pattern works fine in pandas but not in PySpark. Express the logic as column operations instead (see the sketch below).
  2. Don't call df.toPandas().head() on a large DataFrame, which can run out of memory; use df.limit(5).toPandas() instead.
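
Instead of looping over rows, express the logic as column operations that Spark can run in parallel. An illustrative sketch, assuming a numeric Age column:

from pyspark.sql import functions as F

# Column-wise alternative to a Python row loop
df.withColumn('Age_next_year', F.col('Age') + 1)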

Data Visualization in PySpark

Unlike pandas, which works with many rich Python visualization libraries such as matplotlib, seaborn, plotly, and more, PySpark doesn't have any inbuilt support for data visualization. However, data scientists can still make the best use of platforms like Databricks, which has an inbuilt UI for data visualization.

In general, we can take a sample of the data and convert it to pandas, then plot it to explore hypotheses about the sample that can be interpreted for the full dataset.
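
A small sketch of that workflow, assuming a Spark DataFrame df with a numeric Age column and matplotlib installed:

import matplotlib.pyplot as plt

# Sample roughly 10% of the rows, convert the sample to pandas, then plot locally
sample_pdf = df.sample(fraction=0.1, seed=42).toPandas()
sample_pdf['Age'].hist()
plt.show()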
