Can the client's Python version impact the driver Python version when using Spark with PySpark?

I am currently using Python with PySpark for my projects.

For testing purposes, I run a standalone cluster on Docker.

I found this repository of code to be very useful.

Note that before running the code, you must first run this command to create the Docker network:

docker network create --gateway 10.5.0.1 --subnet 10.5.0.0/24 spark_master

On both the worker and the master, I checked the Python version using the command:

which python

Both report the same Python version (3.5).
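Note that `which python` prints the path of the interpreter rather than its version. A minimal sketch of a check that reports both, meant to be run with the interpreter in question inside each container and on the client; the helper name `interpreter_info` is made up for illustration:

```python
import sys


def interpreter_info():
    """Return the path and the major.minor version of the running interpreter."""
    return {
        "executable": sys.executable,
        "version": "%d.%d" % sys.version_info[:2],
    }


# Run this on the master, on the worker, and on the client, then compare.
print(interpreter_info())
```

Only the major.minor part is reported here because that is the part the error message below complains about.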

However, when I run a basic PySpark script from outside the containers, like this:

from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName('appName').setMaster('spark://0.0.0.0:7077')
sc = SparkContext(conf=conf)
rdd = sc.parallelize([1, 2, 3, 4])
rdd.map(lambda x: x**2).collect()

I always get the following error:

Exception: Python in worker has different version 3.5 than that in driver 3.7, PySpark cannot run with different minor versions. Please ensure that environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

After consulting resources online, such as this link, it was recommended to make sure the driver and worker Python versions match. Despite setting up a new conda environment with Python 3.5, the issue persisted.

I also tried setting the environment variables through os.environ by adding the following lines to my Python code:

import os

os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.5"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3.5"
os.environ["PYSPARK_WORKER_PYTHON"] = "/usr/bin/python3.5"

Unfortunately, the error remains the same. When I set these variables to a path that does not exist, I get a different error: "no such file or directory." This suggests that the setting does reach the cluster, but it does not solve the version mismatch.
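A small sketch of why setting these variables inside the script is unlikely to change the driver's version. It assumes the script is started directly with `python`, as in the question: by then the interpreter is already running, so assigning `PYSPARK_DRIVER_PYTHON` cannot swap it out (that variable is consulted by the `pyspark` and `spark-submit` launchers). `PYSPARK_PYTHON` names the interpreter for the workers, so its path has to exist on the workers. The helper names are made up for illustration.

```python
import os
import sys

# Path taken from the question; it has to exist on the workers.
WORKER_PYTHON = "/usr/bin/python3.5"


def set_worker_python(path):
    """Set PYSPARK_PYTHON; do this before the SparkContext is created."""
    os.environ["PYSPARK_PYTHON"] = path


def driver_version():
    """Major.minor version of the interpreter running this script."""
    return "%d.%d" % sys.version_info[:2]


before = driver_version()
set_worker_python(WORKER_PYTHON)
os.environ["PYSPARK_DRIVER_PYTHON"] = WORKER_PYTHON
after = driver_version()

# The running interpreter is unaffected by the assignments above.
print(before == after)  # True
```

In other words, the worker side can be steered from the script, but the driver side is fixed by whichever interpreter launched the script.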

Answer №1

After some investigation, the issue turned out to be a mismatch between the client's Python version and the driver and workers, which run version 3.5. Making all of the versions match resolved the problem.

I do not understand why the client's Python version would affect the driver; as far as I know, they should not interact that way.

It may be related to how PySpark executes Python functions.
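One likely explanation, which the answer itself does not confirm: when a script creates the SparkContext itself, as in the question, that Python process is the driver program, so the "driver" version in the error is the client's interpreter version. Under that assumption, a fail-fast guard can be sketched as below. `require_python` is a hypothetical helper, and the expected version (3, 5) is the cluster version reported in the question.

```python
import sys


def require_python(expected, actual=None):
    """Raise RuntimeError unless the (major, minor) version matches `expected`.

    `actual` defaults to the interpreter running this script.
    """
    actual = tuple(sys.version_info[:2]) if actual is None else tuple(actual)
    if actual != tuple(expected):
        raise RuntimeError(
            "Client runs Python %d.%d but the cluster expects %d.%d"
            % (actual + tuple(expected))
        )
    return actual


# Intended use, before creating the context:
#   require_python((3, 5))
#   sc = SparkContext(conf=conf)
```

Calling the guard first replaces the failure inside a Spark task with an immediate, local error message.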
