Questions tagged [apache-spark]

Apache Spark is an open source, large-scale data processing engine written in Scala. It provides a unified API for working with distributed datasets and supports both batch and streaming workloads. Common use cases include machine learning, deep learning, and graph processing.

Error status: disconnecting Zeppelin Zephyr

Currently, I am utilizing Zeppelin 0.70 and have successfully built Zeppelin from the Git source. Everything was running smoothly when I started Zeppelin on port 8000 using the command below. ./zeppelin-daemon start https://i.stack.imgur.com/tiLyF.png H ...

Using Spark to read JSON files based on the file names

I am looking to extract JSON files from an HDFS directory for processing with Spark. Once the processing is complete, I want Spark to move the files to a different location. However, new files may be added while processing is ongoing, so I need a way to ke ...

Converting JSON arrays into structured arrays using Spark

Does anyone know how to convert an Array of JSON strings into an Array of structures? Sample data: { "col1": "col1Value", "col2":[ "{\"SubCol1\":\"ABCD\",\"SubCol ...
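
One hedged way to do this in Spark 2.4+, assuming the inner objects only contain string fields with the hypothetical names SubCol1 and SubCol2 from the sample, is to apply from_json to every element of the string array with the higher-order transform function:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("col1Value", ['{"SubCol1":"ABCD","SubCol2":"EFGH"}'])],
        ["col1", "col2"],
    )

    # Spark 2.4+: parse each JSON string element into a struct
    parsed = df.withColumn(
        "col2",
        F.expr("transform(col2, x -> from_json(x, 'SubCol1 string, SubCol2 string'))"),
    )
    parsed.printSchema()  # col2: array<struct<SubCol1:string,SubCol2:string>>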

Detect JSON files without data using Spark 2.4

I am looking for a solution to detect and avoid processing empty JSON files. Some of the JSON files I receive only contain open and close square brackets, like: []. If a file only contains these brackets, it should be considered empty. Previously, with Sp ...
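
One hedged approach, assuming the incoming files are small enough to inspect as text first and using hypothetical paths: read each file with wholeTextFiles, keep only those whose stripped content is more than just [], and pass the surviving paths to spark.read.json.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # (path, content) pairs for every JSON file in the directory
    files = sc.wholeTextFiles("hdfs:///data/incoming/*.json")

    # keep only files whose content is more than an empty array
    non_empty_paths = (
        files.filter(lambda kv: kv[1].strip() not in ("", "[]"))
             .keys()
             .collect()
    )

    if non_empty_paths:
        df = spark.read.json(non_empty_paths)
        df.show()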

Unable to convert a string into a date format in Spark SQL, encountering an error

I'm encountering an issue in spark.sql where I can't seem to convert a string to date format. Initially, when passing the raw string directly, it converts successfully. However, when attempting to store that value into a variable and use it as an argumen ...
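
A minimal sketch of the pattern that usually resolves this, assuming the date string sits in a Python variable (column and format below are hypothetical): build the SQL text yourself so Spark receives a properly quoted literal, or stay in the DataFrame API and avoid interpolation entirely.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    date_str = "2020-01-15"

    # Option 1: interpolate the variable into the SQL text before calling spark.sql
    df1 = spark.sql(f"SELECT to_date('{date_str}', 'yyyy-MM-dd') AS dt")

    # Option 2: use the DataFrame API with a literal column
    df2 = spark.range(1).select(F.to_date(F.lit(date_str), "yyyy-MM-dd").alias("dt"))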

Rodeo encounters a version conflict between the worker and driver when running pySpark

When executing this basic script in Pyspark from the terminal, it runs without any issues: import pyspark sc = pyspark.SparkContext() foo = sc.parallelize([1,2]) foo.foreach(print) However, running the same script in Rodeo results in an error message, w ...
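
This symptom usually means the driver (Rodeo's Python) and the executors are launching different Python interpreters. A hedged fix, with a hypothetical interpreter path, is to point both sides at the same binary before the SparkContext is created:

    import os

    # hypothetical path; use the interpreter the cluster workers actually have
    os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"
    os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"

    import pyspark

    sc = pyspark.SparkContext()
    foo = sc.parallelize([1, 2])
    print(foo.collect())  # foreach(print) prints on the executors; collect() prints locally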

I have a pair of pyspark dataframes and I am looking to compute the total sum of points in the second dataframe, which will be based on the

Here is my first data frame, which shows player points: Playername pid matchid points 0 Virat Kohli 10 2 0 1 Ravichandran Ashwin 11 2 9 2 Gautam Gambhir 12 2 1 3 Ravindra Jadeja 13 2 7 4 Amit Mishra 14 2 2 5 Mohammed Shami 15 2 2 6 ...

Retrieving a specific value from a JSON JDBC column in Spark Scala

Within the mysql jdbc data source used for loading data into Spark, there is a column that contains JSON data in string format. // Establish JDBC Connection and load table into Dataframe val verDf = spark.read.format("jdbc").option("driver&q ...

Spark failing to produce output when parsing a newline delimited JSON file

The JSON file shown below is formatted with newline delimiters. [ {"name": "Vishay Electronics", "specifications": " feature low on-resistance and high Zener switching speed\n1/lineup from small signal produ ...
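
When a file is a single JSON array pretty-printed across many lines, the default one-record-per-line reader yields _corrupt_record rows and no usable output. A minimal sketch, with a hypothetical path, is to enable the multiLine option:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # multiLine lets Spark parse a JSON array that spans multiple lines
    df = spark.read.option("multiLine", True).json("/path/to/products.json")
    df.select("name", "specifications").show(truncate=False)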

Sending data from Spark to my Angular8 project

I am utilizing Spark 2.4.4 and Scala to retrieve data from my MySQL database, and now I am looking to showcase this data in my Angular8 project. Can anyone provide guidance on the process? I have been unable to locate any relevant documentation so far. ...

Merge together all the columns within a dataframe

Currently working on Python coding in Databricks using Spark 2.4.5. Trying to create a UDF that takes two parameters - a Dataframe and an SKid, where I need to hash all columns in that Dataframe based on the SKid. Although I have written some code for th ...
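
Rather than a Python UDF, the built-in sha2 and concat_ws functions can hash every column in one pass. A hedged sketch, assuming a dataframe and an SKid column name (both hypothetical):

    from pyspark.sql import functions as F

    def hash_all_columns(df, sk_col="SKid"):
        """Return df with a 'row_hash' column built from every column except sk_col."""
        cols_to_hash = [c for c in df.columns if c != sk_col]
        return df.withColumn(
            "row_hash",
            F.sha2(F.concat_ws("||", *[F.col(c).cast("string") for c in cols_to_hash]), 256),
        )

    # hashed = hash_all_columns(some_df)   # usage sketch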

What is the best method for storing information in a Cassandra table using Spark with Python?

I am currently working on developing a consumer-producer application. In this application, the Producer generates data on specific topics. The Consumer consumes this data from the same topic, processes it using the Spark API, and stores it in a Cassandra ...

Error encountered while attempting to parse schema in JSON format in PySpark: Unknown token 'ArrayType' found, expected JSON String, Number, Array, Object, or token

Following up on a previous question thread, I want to thank @abiratis for the helpful response. We are now working on implementing the solution in our glue jobs; however, we do not have a static schema defined. To address this, we have added a new column c ...
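
That error usually means the string handed to the parser is a Python/Scala repr of the schema (containing ArrayType(...)) rather than the JSON form Spark expects. A hedged sketch of round-tripping a schema through JSON in PySpark, with hypothetical field names:

    import json
    from pyspark.sql.types import StructType, StructField, StringType, ArrayType

    schema = StructType([
        StructField("id", StringType(), True),
        StructField("tags", ArrayType(StringType()), True),
    ])

    schema_as_json = schema.json()                               # serialize to a JSON string
    restored = StructType.fromJson(json.loads(schema_as_json))   # parse it back

    # df = spark.read.schema(restored).json("/path/to/input")    # usage sketch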

Efficiently Indexing JSON Strings from Spark directly into Elasticsearch

Can JSON strings be directly indexed from Spark to Elasticsearch without using Scala case classes or POJOS? I am currently working with Spark, Scala, and Elastic 5.5. This is my current code snippet: val s = xml .map { x => import org.jso ...

Encountered a problem while trying to retrieve JSON data from Cassandra DB using Java and sparkSession

I am currently working on a project that involves reading data from a Cassandra table using Java with sparkSession. The goal is to format the output as JSON. Here is the structure of my database: CREATE TABLE user ( user_id uuid, email ...

Using Spark Streaming to produce JSON results without deprecation warnings

Here is a code snippet in which the df5 dataframe successfully prints JSON data, but isStream is false and the approach is deprecated in Spark 2.2.0. I attempted another approach in the last two lines of code to handle this; however, it fails to read the JSON correctly. An ...

Tips for dynamically changing the field name in a JSON object using Spark

Having a JSON log file with a JSON delimiter (/n), I am looking to convert it into Spark struct type. However, the first field name in every JSON varies within my text file. Is there a way to achieve this? val elementSchema = new StructType() .add("n ...

Having difficulty sending a Spark job from a Windows development environment to a Linux cluster

Recently, I came across findspark and was intrigued by its potential. Up until now, my experience with Spark was limited to using spark-submit, which isn't ideal for interactive development in an IDE. To test it out, I ran the following code on Window ...
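
A minimal sketch of the findspark pattern, with a hypothetical SPARK_HOME and master URL; submitting from Windows to a Linux cluster additionally requires matching Spark and Python versions on both sides and network access to the master and executors.

    import findspark
    findspark.init("C:/spark-2.4.4-bin-hadoop2.7")   # hypothetical local Spark install

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("spark://linux-master:7077")         # hypothetical cluster master
        .appName("ide-dev-session")
        .getOrCreate()
    )
    print(spark.range(10).count())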

What steps can be taken to resolve the ERROR Executor - Exception in task 0.0 in stage 20.0 (TID 20) issue?

Although a similar question has been briefly answered before, I couldn't add my additional query due to lack of minimum reputation. Hence, I am posting it here. I am trying to process Twitter data using Apache Spark + Kafka. I have created a pattern ...

Tips for importing an Excel file into Databricks using PySpark

I am currently facing an issue with importing my Excel file into PySpark on Azure-DataBricks machine so that I can convert it to a PySpark Dataframe. However, I am encountering errors while trying to execute this task. import pandas data = pandas.read_exc ...
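
One hedged route is to let pandas read the workbook on the driver and then convert it to a Spark dataframe. The path and sheet name below are hypothetical, and pandas needs an Excel engine such as openpyxl or xlrd installed on the cluster.

    import pandas as pd

    # read the workbook on the driver with pandas
    pdf = pd.read_excel("/dbfs/FileStore/tables/sales.xlsx", sheet_name="Sheet1")

    # convert to a PySpark dataframe; on Databricks, `spark` is the predefined session
    sdf = spark.createDataFrame(pdf)
    sdf.printSchema()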

What is the best way to transfer variables in Spark SQL with Python?

I've been working on some python spark code and I'm wondering how to pass a variable in a spark.sql query. q25 = 500 Q1 = spark.sql("SELECT col1 from table where col2>500 limit $q25 , 1") Unfortunately, the code above doesn't seem to work as i ...
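
spark.sql does not expand $q25 on its own; the query string has to carry the value before it is submitted. A minimal sketch using the question's variable, assuming spark is an existing SparkSession:

    q25 = 500
    Q1 = spark.sql(f"SELECT col1 FROM table WHERE col2 > 500 LIMIT {q25}")

    # equivalent without f-strings
    Q1 = spark.sql("SELECT col1 FROM table WHERE col2 > 500 LIMIT {}".format(q25))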

Spark: transforming and merging many JSON files without duplicates

I am faced with a task involving 1000 JSON files that require individual transformations followed by merging into a single output file. The merged output must ensure no duplicate values are present after overlapping operations have been performed. My appr ...
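
A hedged outline of one approach, assuming the per-file transformation can be expressed on a dataframe and using hypothetical paths: read all files in one pass, apply the transformation, drop duplicates, and write a single merged output.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # read all 1000 files at once; each contributes rows to the same dataframe
    raw = spark.read.json("/data/input/*.json")

    # placeholder for the per-record transformation the question describes
    transformed = raw.withColumn("processed_at", F.current_timestamp())

    # remove duplicates introduced by overlapping files, then merge into one output file
    merged = transformed.dropDuplicates()
    merged.coalesce(1).write.mode("overwrite").json("/data/output/merged")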

Processing JSON data with an extensive number of columns using Spark

Currently, I am working on a Proof of Concept for a Spark application written in Scala running in LOCAL mode. Our task involves processing a JSON dataset with an extensive 300 columns, though only a limited number of records. Utilizing Spark SQL has been s ...

Guide to establishing a connection to CloudantDB through spark-scala and extracting JSON documents as a dataframe

I have been attempting to establish a connection with Cloudant using Spark and read the JSON documents as a dataframe. However, I am encountering difficulties in setting up the connection. I've tested the code below but it seems like the connection p ...

Error message encountered while using SPARK: read.json is triggering a java.io.IOException due to an excessive number

Encountering an issue while trying to read a large 6GB single-line JSON file: Error message: Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost): java.io.IOException: Too ...

Exploring the correlation between the number of nodes on AWS EMR and the execution time of

I am a newcomer to Spark. Recently, I attempted to run a basic application on Amazon EMR (Python pi approximation found here) initially with 1 worker node and then in a subsequent phase with 2 worker nodes (m4.large). Surprisingly, the elapsed time to co ...

Selecting an array of Spark structs

I'm facing a challenging issue as part of a complex problem I'm working on. Specifically, I'm stuck at a certain point in the process. To simplify the problem, let's assume I have created a dataframe from JSON data. The raw data looks something like this: ...

Optimizing hyperparameters for implicit matrix factorization model in pyspark.ml using CrossValidator

Currently, I am in the process of fine-tuning the parameters of an ALS matrix factorization model that utilizes implicit data. To achieve this, I am utilizing pyspark.ml.tuning.CrossValidator to iterate through a parameter grid and identify the optimal mod ...
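
A hedged sketch of the usual wiring, with hypothetical column names; note that RegressionEvaluator's RMSE is a questionable objective for implicit feedback, so the evaluator choice here is an assumption rather than a recommendation.

    from pyspark.ml.recommendation import ALS
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    als = ALS(userCol="userId", itemCol="itemId", ratingCol="clicks",
              implicitPrefs=True, coldStartStrategy="drop")

    grid = (ParamGridBuilder()
            .addGrid(als.rank, [10, 20])
            .addGrid(als.regParam, [0.01, 0.1])
            .addGrid(als.alpha, [1.0, 40.0])
            .build())

    evaluator = RegressionEvaluator(metricName="rmse", labelCol="clicks",
                                    predictionCol="prediction")

    cv = CrossValidator(estimator=als, estimatorParamMaps=grid,
                        evaluator=evaluator, numFolds=3)
    # model = cv.fit(ratings_df)   # ratings_df is a hypothetical training dataframe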

Tips for extracting a struct from an array of string JSON data prior to Spark 2.4

source data : ["a","b",...] convert into a dataframe : +----+ | tmp| +----+ |a | |b | +----+ Starting with Spark 2.4, we can utilize: explode(from_json($"column_name", ArrayType(StringType))) This method is very effecti ...

When using spark.read to load a parquet file into a dataframe, I noticed that

I'm a beginner in PySpark and to sum it up: I am faced with the challenge of reading a parquet file and using it with SPARK SQL. Currently, I have encountered the following issues: When I read the file with schema, it shows NULL values - using spark. ...

Transforming a Pyspark dataframe into the desired JSON structure

Is there a way to convert a Pyspark dataframe into a specific JSON format without the need to convert it to a Python pandas dataframe? Here is an example of our input Pyspark df: model year timestamp 0 i20 [2019, 2018 ...

Dividing a JSON array into two separate rows with Spark using Scala

Here is the structure of my dataframe: root |-- runKeyId: string (nullable = true) |-- entities: string (nullable = true) +--------+--------------------------------------------------------------------------------------------+ |runKeyId|entities ...

Retrieving data from a JSON object stored within a database column

There is a dataframe presented below: +-------+-------------------------------- |__key__|______value____________________| | 1 | {"name":"John", "age": 34} | | 2 | {"name":"Rose", "age" ...
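
Two common hedged options, assuming the dataframe shown is df with columns named key and value and that value holds JSON text like the sample: get_json_object for ad-hoc extraction, or from_json with a schema when proper columns are wanted.

    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    # Option 1: pull individual fields with a JSON path
    df1 = df.select(
        "key",
        F.get_json_object("value", "$.name").alias("name"),
        F.get_json_object("value", "$.age").cast("int").alias("age"),
    )

    # Option 2: parse the whole object into a struct and flatten it
    schema = StructType([
        StructField("name", StringType()),
        StructField("age", IntegerType()),
    ])
    df2 = df.withColumn("parsed", F.from_json("value", schema)).select("key", "parsed.*")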

Generating aggregated statistics from JSON logs using Apache Spark

Recently delving into the world of Apache Spark, I've encountered a task involving the conversion of a JSON log into a flattened set of metrics. Essentially, transforming it into a simple CSV format. For instance: "orderId":1, "orderData": { ...

Converting json file data into a case class using Spark and Spray Json

In my text file, I have JSON lines with a specific structure as illustrated below. {"city": "London","street": null, "place": "Pizzaria", "foo": "Bar"} To handle this data in Spark, I want to convert it into a case class using the Scala code provided be ...

Building a Custom Component in Laravel Spark

I'm attempting to integrate my custom component into my Laravel Spark environment, but I keep encountering the following error: Property or method "obj" is not defined on the instance but referenced during render. Everything works perfectly when I bind ...

Transform the text into cp500 encoding

I am currently working with a plain text file that contains 500 million rows and is approximately 27GB in size. This file is stored on AWS S3. I have been running the code below for the last 3 hours. I attempted to find encoding methods using PySpark, bu ...
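
Spark's built-in encode() only supports a handful of charsets, so cp500 (EBCDIC) typically needs a Python codec via a UDF. A hedged sketch with hypothetical S3 paths; the binary result is easiest to persist in a binary-friendly format such as Parquet.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import BinaryType

    spark = SparkSession.builder.getOrCreate()

    lines = spark.read.text("s3a://my-bucket/input/huge_file.txt")   # hypothetical path

    # Python's codec table includes cp500; Spark's built-in encode() does not
    to_cp500 = F.udf(lambda s: s.encode("cp500") if s is not None else None, BinaryType())

    encoded = lines.withColumn("value_cp500", to_cp500("value"))
    encoded.select("value_cp500").write.mode("overwrite").parquet("s3a://my-bucket/output/cp500/")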

Can the client's Python version impact the driver Python version when using Spark with PySpark?

Currently, I am utilizing python with pyspark for my projects. For testing purposes, I operate a standalone cluster on docker. I found this repository of code to be very useful. It is important to note that before running the code, you must execute this ...

Analyzing PySpark dataframes by tallying the null values across all rows and columns

I need help writing a PySpark query to count all the null values in a large dataframe. Here is what I have so far: import pyspark.sql.functions as F df_countnull_agg = df.agg(*[F.count(F.when(F.isnull(c), c)).alias(c) for c in df.columns]) df_countnull_agg.c ...
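
A self-contained sketch of the same idea, assuming df is the dataframe in question: first a per-column null count, then a single grand total across the whole dataframe.

    from pyspark.sql import functions as F

    # null count per column
    null_per_col = df.select(
        [F.count(F.when(F.isnull(c), c)).alias(c) for c in df.columns]
    )
    null_per_col.show()

    # single grand total across all columns
    total_nulls = null_per_col.select(
        sum(F.col(c) for c in null_per_col.columns).alias("total_nulls")
    )
    total_nulls.show()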

PySpark - Utilizing Conditional Logic

Just starting out with PySpark and seeking advice on translating the following SAS code into PySpark: SAS Code: If ColA > Then Do; If ColB Not In ('B') and ColC <= 0 Then Do; New_Col = Sum(ColA, ColR, ColP); End; Else ...
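
A hedged PySpark translation of the nested SAS logic, assuming the missing comparison value in "If ColA >" is some threshold (shown as a placeholder) and that the column names match the SAS excerpt; the Else branch is also truncated, so it is left as None here.

    from pyspark.sql import functions as F

    THRESHOLD = 0  # placeholder: the SAS excerpt's "If ColA >" comparison value is cut off

    df2 = df.withColumn(
        "New_Col",
        F.when(
            (F.col("ColA") > THRESHOLD)
            & (~F.col("ColB").isin("B"))
            & (F.col("ColC") <= 0),
            F.col("ColA") + F.col("ColR") + F.col("ColP"),
        ).otherwise(None),
    )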

Transfer the folder from the DBFS location to the user's workspace directory within Azure Databricks

I am in need of transferring a group of files (Python or Scala) from a DBFS location to my user workspace directory for testing purposes. Uploading each file individually to the user workspace directory is quite cumbersome. Is there a way to easily move f ...

Generating individual rows for every item belonging to a user within a Spark dataframe

I have a dataset in Spark that looks like this: User Item Purchased 1 A 1 1 B 2 2 A 3 2 C 4 3 A 3 3 B 2 3 D 6 only showing top 5 rows Each user has a row for an item they purchased, with the 'Purchased' colum ...

Breaking Down Arrays in Pyspark: Unraveling Multiple Array Columns into Individual

I need to transform a dataframe with one row and multiple columns into a new format. Some columns contain single values, while others contain lists of the same length. My goal is to split each list column into separate rows while preserving the non-list co ...
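
In Spark 2.4+, arrays_zip plus explode keeps the list columns aligned while the scalar columns are simply carried along. A hedged sketch with hypothetical column names:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [("u1", [1, 2, 3], ["a", "b", "c"])],
        ["id", "scores", "labels"],
    )

    exploded = (
        df.withColumn("zipped", F.explode(F.arrays_zip("scores", "labels")))
          .select("id",
                  F.col("zipped.scores").alias("score"),
                  F.col("zipped.labels").alias("label"))
    )
    exploded.show()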

A practical method for restructuring or dividing a string containing JSON entries

Within my dataset, I've got a string comprising JSON entries linked together similar to the following scenario. val docs = """ {"name": "Bilbo Baggins", "age": 50}{"name": "Gandalf", "age": 1000}{"name": "Thorin", "age": 195}{"name": "Balin", "age": 178 ...
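
One hedged approach: insert a separator between back-to-back objects with a regex, split into individual JSON strings, and hand them to spark.read.json. This assumes the objects themselves do not contain a "}{" sequence inside string values.

    import re
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    docs = """{"name": "Bilbo Baggins", "age": 50}{"name": "Gandalf", "age": 1000}{"name": "Thorin", "age": 195}"""

    # put a newline between adjacent objects, then split on it
    records = re.sub(r"\}\s*\{", "}\n{", docs).split("\n")

    df = spark.read.json(sc.parallelize(records))
    df.show()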

Loop through and collapse an Array containing different types of structures within a Dataset using Apache Spark in Java

My Dataset has the following Schema: root |-- collectorId: string (nullable = true) |-- generatedAt: long (nullable = true) |-- managedNeId: string (nullable = true) |-- neAlert: struct (nullable = true) | |-- advisory: array (nullable = true) | ...

Exploring the process of reading multiple CSV files from various directories within Spark SQL

Having trouble reading multiple csv files from various folders from pyspark.sql import * spark = SparkSession .builder .appName("example") .config("spark.some.config.option") .getOrCreate() folders = List(& ...
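
spark.read.csv accepts a list of paths (and glob patterns), so all folders can be passed in a single call. A hedged sketch with hypothetical folder names; note that in Python the paths go in a list literal, not Scala's List(...).

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("example")
             .getOrCreate())

    folders = ["/data/2019/", "/data/2020/", "/data/archive/"]   # hypothetical paths

    df = spark.read.csv(folders, header=True, inferSchema=True)
    df.printSchema()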

Transform Python code into Scala

After transitioning from Python to Scala, I found myself struggling with the conversion of a program. Specifically, I encountered difficulties with two lines of code pertaining to creating an SQL dataframe. The original Python code: fields = [StructField ...

Use regular expressions to filter a pyspark.RDD

I am working with a pyspark.RDD that contains dates which I need to filter out. The dates are in the following format within my RDD: data.collect() = ["Nujabes","Hip Hop","04:45 16 October 2018"] I have attempted filtering t ...
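
A hedged sketch, assuming the RDD (data in the question) holds plain strings like the sample and that the timestamps follow the "04:45 16 October 2018" shape; the regex is an assumption about that format.

    import re

    date_re = re.compile(r"^\d{2}:\d{2} \d{1,2} \w+ \d{4}$")   # e.g. "04:45 16 October 2018"

    # keep only the date-like elements
    dates_only = data.filter(lambda s: bool(date_re.match(s)))

    # or drop them instead
    no_dates = data.filter(lambda s: not date_re.match(s))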