Can the client's Python version impact the driver Python version when using Spark with PySpark?

Question

I am currently using Python with PySpark for my projects.

For testing purposes, I run a standalone cluster on Docker.

I found this repository of code to be very useful.

Note that before running the code from that repository, you must first create a Docker network with this command:

docker network create --gateway 10.5.0.1 --subnet 10.5.0.0/24 spark_master

Inside both the worker and the master containers, I checked the Python interpreter with:

which python

Both use the same Python version (3.5).

However, when I run a basic PySpark script outside the containers, like this:

from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName('appName').setMaster('spark://0.0.0.0:7077')
sc = SparkContext(conf=conf)
rdd = sc.parallelize([1, 2, 3, 4])
rdd.map(lambda x: x**2).collect()

I consistently get this error:

Exception: Python in worker has different version 3.5 than that in driver 3.7, PySpark cannot run with different minor versions. Please ensure that environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

Resources online, such as this link, recommend making sure the driver and worker Python versions match. I set up a new conda environment with Python 3.5, but the issue persisted.

I also tried setting the environment variables through os.environ by adding the following lines to my Python code:

import os

os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.5"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3.5"
os.environ["PYSPARK_WORKER_PYTHON"] = "/usr/bin/python3.5"

Unfortunately, the error remains unchanged. If I point these variables at a path that does not exist, I get a "no such file or directory" error instead, which suggests the code does run within the cluster, but setting the variables does not fix the root problem.

python apache-spark pyspark

Answer 1

After some investigation, it turned out the issue was a mismatch between the client's Python version (3.7, according to the error message) and the driver/workers, which run version 3.5. Updating everything to the same version resolved the problem.

What puzzles me is why the client's Python version would affect the driver, since as far as I know they should not interact in that way.

It may be related to how PySpark executes Python functions: a script launched outside the cluster acts as the driver itself, and the functions it defines are serialized there and then run by the workers' Python.
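For what it's worth, here is a minimal sketch of a check that could be run on the client before creating the SparkContext. It only compares the local interpreter's major.minor version with the version the workers are assumed to run (3.5, taken from the question). The names WORKER_PYTHON, minor_versions_match and describe_mismatch are made up for illustration and are not part of PySpark.

```python
import sys

# Assumption: the workers in the Docker cluster run Python 3.5,
# as reported in the question. Adjust to whatever the workers use.
WORKER_PYTHON = (3, 5)


def minor_versions_match(driver_version, worker_version=WORKER_PYTHON):
    """Return True when both interpreters share the same major.minor version."""
    return tuple(driver_version[:2]) == tuple(worker_version[:2])


def describe_mismatch(driver_version, worker_version=WORKER_PYTHON):
    """Build a readable message for a driver/worker version mismatch."""
    return "driver runs Python {}.{}, workers run Python {}.{}".format(
        driver_version[0], driver_version[1], worker_version[0], worker_version[1]
    )


if __name__ == "__main__":
    # The script that creates the SparkContext is the driver, so its own
    # interpreter is the one that gets compared against the workers.
    local = sys.version_info
    if minor_versions_match(local):
        print("OK: local interpreter matches the workers")
    else:
        print("Mismatch: " + describe_mismatch(local))
```

Running this with the same interpreter that launches the PySpark script shows which Python the driver side will actually use, which is a quick way to confirm whether the intended conda environment is really the active one.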
