I have a pair of pyspark dataframes and I am looking to compute the total sum of points in the second dataframe, which will be based on the

Question

This is my first dataframe, which holds the points for each player:

	Playername	pid	matchid	points
0	Virat Kohli	10	2	0
1	Ravichandran Ashwin	11	2	9
2	Gautam Gambhir	12	2	1
3	Ravindra Jadeja	13	2	7
4	Amit Mishra	14	2	2
5	Mohammed Shami	15	2	2
6	Karun Nair	16	2	4
7	Hardik Pandya	17	2	0
8	Cheteshwar Pujara	18	2	9
9	Ajinkya Rahane	19	2	5

This is the second dataframe. For each row, I need the sum of the points of the players listed in that row:

https://i.stack.imgur.com/8CRm0.png

Desired output can be seen here https://i.stack.imgur.com/3B15m.png

I have a working solution (shown below, written with pandas), but I'm looking for a more efficient way to do this in PySpark.

## A Function that returns the corresponding points for a Player from df1
def replacepoints(x):
    return df1['points'].where(df1['Playername']==x).sum()

## Iterating through all names and replacing them with their respective points to calculate total points for each row

df3 = df2[['p1','p2','p3','p4','p5','p6','p7','p8','p9','p10','p11']].copy()
# df3
length = len(df3)
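for i in range(length):
    j_len = len(df3.iloc[i])
    for j in range(j_len):
        name = df3.iloc[i, j]
        # Assign with a single .iloc[i, j] indexer; chained df3.iloc[i][j] = ... may write to a temporary copy
        df3.iloc[i, j] = replacepoints(name)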

## df3 now only contains points
# df3

## Storing the sum of points
points = df3.sum(axis=1)
points

# Adding points to df2 points column
df2['points'] = points

python apache-spark pyspark

Answer №1

Python Code Snippet

import pandas as pd

df_player_points = pd.read_csv('player_points.csv')

df_small_input_spark = pd.read_csv('small_input_spark.csv')

player_names = list(df_player_points['Playername'])

points = list(df_player_points['points'])

# Replace each player's name with that player's points in the player columns
for name, pts in zip(player_names, points):
    df_small_input_spark.iloc[:,7:] = df_small_input_spark.iloc[:,7:].replace([name], int(pts))

df_small_input_spark['points'] = df_small_input_spark.iloc[:,7:].sum(axis=1)

df_small_input_spark.head()

This avoids the nested loop and does not create a copy of the data.

Note: This approach replaces the player names in df_small_input_spark with their points in place and then calculates the row-wise sum of points. The CSV files themselves are not altered.
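
As a variation on the same idea, the loop over player names can be swapped for a single lookup per column. The following is only a minimal sketch, not the code from the answer: the two small frames are made-up stand-ins for the CSV files, and the names `lookup`, `player_cols`, and `as_points` are illustrative. It also assumes every name in the player columns appears in the points table.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (not the real CSV files)
player_points = pd.DataFrame({
    "Playername": ["Virat Kohli", "Ravichandran Ashwin", "Gautam Gambhir"],
    "points": [0, 9, 1],
})
teams = pd.DataFrame({
    "p1": ["Virat Kohli", "Gautam Gambhir"],
    "p2": ["Ravichandran Ashwin", "Ravichandran Ashwin"],
})

# Series indexed by player name, used as a name -> points lookup
lookup = player_points.set_index("Playername")["points"]

# Map every player column through the lookup, then sum across each row
player_cols = ["p1", "p2"]
as_points = teams[player_cols].apply(lambda column: column.map(lookup))
teams["points"] = as_points.sum(axis=1)

print(teams)
```

Here the name columns are left untouched and only the new `points` column is added, at the cost of one temporary frame (`as_points`).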

Answer №2

Here is the solution I came up with:

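from functools import reduce
from operator import add

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

sc = SparkContext('local[*]')
spark = SparkSession(sparkContext=sc)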

df2 = spark.read.options(inferSchema='True',delimiter=',',header='True').csv("D:\\bop\\small_input_spark.csv")
df1 = spark.read.options(inferSchema='True',delimiter=',',header='True').csv("D:\\bop\\player_points.csv")

# start = time.time()

# Collect the (small) points table into a name -> points dict on the driver
dictn = {row['Playername']:row['points'] for row in df1.collect()}

print(dictn)

# The columns being replaced hold strings, so the replacement values must be strings as well
dictn = {k:str(v) for k,v in dictn.items()}

df3 = df2.na.replace(dictn,1,("captain","v-captain","MoM","p1","p2","p3","p4","p5","p6","p7","p8","p9","p10","p11"))

integer_type = ["captain","v-captain","MoM","p1","p2","p3","p4","p5","p6","p7","p8","p9","p10","p11"]

for c in integer_type:
    df3 = df3.withColumn(c, df3[c].cast(IntegerType()))

numeric_col_list=df3.schema.names
numeric_col_list=numeric_col_list[4:]   

df3 = df3.withColumn('v-captain', ((col('v-captain') / 2 )))
df3 = df3.withColumn('MoM', ((col('MoM') * 2 )))

df3 = df3.withColumn('points',reduce(add, [col(x) for x in numeric_col_list]))
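
To make the arithmetic of this answer easier to follow, here is a small plain-Python sketch of what a single row works out to. It is an illustration only, not part of the Spark job: the sample row, the `weights` dict, and the `row_total` helper are hypothetical, and it assumes the summed columns are `captain`, `v-captain`, `MoM`, and the player columns.

```python
# Points lookup, using values from the first dataframe in the question
points = {"Virat Kohli": 0, "Ravichandran Ashwin": 9, "Ravindra Jadeja": 7}

# Hypothetical row of the second dataframe, shortened to three players
row = {
    "captain": "Ravichandran Ashwin",
    "v-captain": "Ravindra Jadeja",
    "MoM": "Ravichandran Ashwin",
    "p1": "Virat Kohli",
    "p2": "Ravichandran Ashwin",
    "p3": "Ravindra Jadeja",
}

# v-captain points are halved and MoM points are doubled, as in the answer;
# every other column counts once
weights = {"v-captain": 0.5, "MoM": 2}


def row_total(row, points, weights):
    """Sum the weighted points of every name in the row."""
    return sum(points[name] * weights.get(column, 1) for column, name in row.items())


total = row_total(row, points, weights)
print(total)
```

In the Spark version the same steps happen column-wise: `na.replace` does the lookup, the two `withColumn` calls apply the weights, and `reduce(add, ...)` does the row sum.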
