I have a pair of pyspark dataframes and I am looking to compute the total sum of points in the second dataframe, which will be based on the

Here is my first dataframe, which holds the points for each player:

   Playername           pid  matchid  points
0  Virat Kohli           10        2       0
1  Ravichandran Ashwin   11        2       9
2  Gautam Gambhir        12        2       1
3  Ravindra Jadeja       13        2       7
4  Amit Mishra           14        2       2
5  Mohammed Shami        15        2       2
6  Karun Nair            16        2       4
7  Hardik Pandya         17        2       0
8  Cheteshwar Pujara     18        2       9
9  Ajinkya Rahane        19        2       5

This is the second dataframe, where I need to calculate the sum of points for the players listed in each row:

https://i.stack.imgur.com/8CRm0.png

Desired output can be seen here https://i.stack.imgur.com/3B15m.png

I have a working solution in pandas, but I'm looking for an efficient method using PySpark:

## A Function that returns the corresponding points for a Player from df1
def replacepoints(x):
    return df1['points'].where(df1['Playername']==x).sum()

## Iterating through all names and replacing them with their respective points to calculate total points for each row

df3 = df2[['p1','p2','p3','p4','p5','p6','p7','p8','p9','p10','p11']].copy()
# df3
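for i in range(len(df3)):
    for j in range(len(df3.columns)):
        name = df3.iloc[i, j]
        # Assign through a single .iloc[i, j] indexer; the chained form
        # df3.iloc[i][j] = ... may write to a temporary copy instead of df3
        df3.iloc[i, j] = replacepoints(name)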

## df3 now only contains points
# df3

## Storing the sum of points
points = df3.sum(axis=1)
points

# Adding points to df2 points column
df2['points'] = points

Answer №1

Python Code Snippet

import pandas as pd

df_player_points = pd.read_csv('player_points.csv')

df_small_input_spark = pd.read_csv('small_input_spark.csv')

player_names = list(df_player_points['Playername'])

points = list(df_player_points['points'])

# Columns from index 7 onward are treated as the player-name columns
for name, pts in zip(player_names, points):
    df_small_input_spark.iloc[:,7:] = df_small_input_spark.iloc[:,7:].replace([name], int(pts))

df_small_input_spark['points'] = df_small_input_spark.iloc[:,7:].sum(axis=1)

df_small_input_spark.head()

Avoiding the Nested Loop

Note: This approach replaces each player name with that player's points across all player columns at once, then calculates the row-wise sum of points. The replacement is applied to the DataFrame loaded in memory; the CSV files on disk are not altered.
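The same replace-then-sum idea can also be written with no explicit loop, by turning the points table into a lookup Series and mapping every player column through it. The following is only a sketch under assumptions: the two miniature frames are hypothetical stand-ins for the CSV files, and the column names `p1` and `p2` stand in for the full set of player columns.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (points taken from the question).
player_points = pd.DataFrame({
    "Playername": ["Virat Kohli", "Ravichandran Ashwin", "Gautam Gambhir"],
    "points": [0, 9, 1],
})
teams = pd.DataFrame({
    "p1": ["Virat Kohli", "Gautam Gambhir"],
    "p2": ["Ravichandran Ashwin", "Ravichandran Ashwin"],
})

# Series indexed by player name, used as a lookup table.
lookup = player_points.set_index("Playername")["points"]

# Assumed player columns; extend this list to p1..p11 for the real data.
player_cols = ["p1", "p2"]

# Map each name column to points in a separate frame, so the name columns
# in `teams` keep their original values.
mapped = teams[player_cols].apply(lambda column: column.map(lookup))

# Row-wise total, stored alongside the untouched name columns.
teams["points"] = mapped.sum(axis=1)
```

Names missing from the lookup become NaN in `mapped`, and `sum(axis=1)` skips NaN by default, so a misspelled name silently contributes nothing; checking `mapped.isna().any()` first is one way to catch that.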

Answer №2

Here is the solution I came up with:

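from functools import reduce
from operator import add

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

sc = SparkContext('local[*]')
spark = SparkSession(sparkContext=sc)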

df2 = spark.read.options(inferSchema='True',delimiter=',',header='True').csv("D:\\bop\\small_input_spark.csv")
df1 = spark.read.options(inferSchema='True',delimiter=',',header='True').csv("D:\\bop\\player_points.csv")

# start = time.time()

# Collect the small points table to the driver as a name -> points lookup
dictn = {row['Playername']:row['points'] for row in df1.collect()}

print(dictn)

# na.replace does not support mixed-type replacements, so the points are converted to strings
dictn = {k: str(v) for k, v in dictn.items()}

# Columns that hold player names and are replaced by points
integer_type = ["captain","v-captain","MoM","p1","p2","p3","p4","p5","p6","p7","p8","p9","p10","p11"]

# With a dict as the first argument, the value argument (1) is ignored
df3 = df2.na.replace(dictn, 1, integer_type)

# Cast the replaced strings back to integers so they can be summed
for c in integer_type:
    df3 = df3.withColumn(c, df3[c].cast(IntegerType()))

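# Columns from index 4 onward are the ones summed into the total
numeric_col_list = df3.schema.names[4:]

# Weight the special roles: v-captain points are halved, MoM points are doubled
df3 = df3.withColumn('v-captain', col('v-captain') / 2)
df3 = df3.withColumn('MoM', col('MoM') * 2)

# Row-wise total across the selected columns
df3 = df3.withColumn('points', reduce(add, [col(x) for x in numeric_col_list]))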
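To make the weighting in the last steps concrete, here is a small plain-Python sketch of the arithmetic for a single row. It assumes that the summed columns are exactly `captain`, `v-captain`, `MoM` and the player slots; the row itself is hypothetical and shows only two player slots, while the points come from the first dataframe in the question. The helper `row_total` is an illustrative name, not part of the answer's code.

```python
# Points per player, copied from the first dataframe in the question.
points_by_player = {
    "Virat Kohli": 0,
    "Ravichandran Ashwin": 9,
    "Gautam Gambhir": 1,
    "Ravindra Jadeja": 7,
}

# Hypothetical row of the second dataframe (only two player slots shown).
row = {
    "captain": "Ravichandran Ashwin",
    "v-captain": "Ravindra Jadeja",
    "MoM": "Gautam Gambhir",
    "p1": "Virat Kohli",
    "p2": "Ravichandran Ashwin",
}


def row_total(row, points_by_player):
    """Sum a row's points with the Spark code's weights: v-captain / 2, MoM * 2, others unchanged."""
    weights = {"v-captain": 0.5, "MoM": 2}
    return sum(
        points_by_player[name] * weights.get(column, 1)
        for column, name in row.items()
    )


# 9 (captain) + 3.5 (v-captain) + 2 (MoM) + 0 (p1) + 9 (p2)
total = row_total(row, points_by_player)
print(total)  # 23.5
```

Because the vice-captain term is divided by 2, the total is a floating-point value, which matches Spark's `/` operator returning a double even for integer columns.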
