Loop through and collapse an Array containing different types of structures within a Dataset using Apache Spark in Java

My Dataset has the following Schema:

 root
 |-- collectorId: string (nullable = true)
 |-- generatedAt: long (nullable = true)
 |-- managedNeId: string (nullable = true)
 |-- neAlert: struct (nullable = true)
 |    |-- advisory: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- equipmentType: string (nullable = true)
 |    |    |    |-- headlineName: string (nullable = true)
 |    |-- fieldNotice: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- caveat: string (nullable = true)
 |    |    |    |-- distributionCode: string (nullable = true)
 |    |-- hwEoX: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- bulletinName: string (nullable = true)
 |    |    |    |-- equipmentType: string (nullable = true)
 |    |-- swEoX: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- bulletinHeadline: string (nullable = true)
 |    |    |    |-- equipmentType: string (nullable = true)
 |-- partyId: string (nullable = true)
 |-- recordType: string (nullable = true)
 |-- sourceNeId: string (nullable = true)
 |-- sourcePartyId: string (nullable = true)
 |-- sourceSubPartyId: string (nullable = true)
 |-- wfid: string (nullable = true)

I am trying to access the fields inside the "element" structure by flattening the arrays.

Dataset<Row> alert = spark.read().option("multiLine", true).option("mode", "PERMISSIVE").json("C:\\Users\\LearningAndDevelopment\\\\merge\\data1\\sample.json");

Seq<String> droppedColumns = scala.collection.JavaConversions.asScalaBuffer(Arrays.asList("neAlert"));

Dataset<Row> alertjson = alert
    .withColumn("exploded_advisory", explode(col("neAlert.advisory")))
    .withColumn("exploded_fn", explode(col("neAlert.fieldNotice")))
    .withColumn("exploded_swEoX", explode(col("neAlert.swEoX")))
    .withColumn("exploded_hwEox", explode(col("neAlert.hwEoX")))
    .drop(droppedColumns);

alertjson.printSchema();

The resulting schema is as follows:

root
 |-- collectorId: string (nullable = true)
 |-- generatedAt: long (nullable = true)
 |-- managedNeId: string (nullable = true)
 |-- partyId: string (nullable = true)
 |-- recordType: string (nullable = true)
 |-- sourceNeId: string (nullable = true)
 |-- sourcePartyId: string (nullable = true)
 |-- sourceSubPartyId: string (nullable = true)
 |-- wfid: string (nullable = true)
 |-- exploded_advisory: struct (nullable = true)
 |    |-- equipmentType: string (nullable = true)
 |    |-- headlineName: string (nullable = true)
 |-- exploded_fn: struct (nullable = true)
 |    |-- caveat: string (nullable = true)
 |    |-- distributionCode: string (nullable = true)
 |-- exploded_swEoX: struct (nullable = true)
 |    |-- bulletinHeadline: string (nullable = true)
 |    |-- equipmentType: string (nullable = true)
 |-- exploded_hwEox: struct (nullable = true)
 |    |-- bulletinName: string (nullable = true)
 |    |-- equipmentType: string (nullable = true)

However, this approach produces duplicated records that contain data from only the first element of each JSON array. I am looking for a way to flatten the arrays while preserving the data from every element.

Answer №1

To access the nested JSON values, use the dot (.) operator to reach each nested array, then apply the explode function to each one.

Dataset<Row> alertjson = alert
    .withColumn("exploded_advisory", explode(col("neAlert.advisory")))
    .withColumn("exploded_fn", explode(col("neAlert.fieldNotice")))
    .withColumn("exploded_swEoX", explode(col("neAlert.swEoX")))
    .withColumn("exploded_hwEox", explode(col("neAlert.hwEoX")));
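
Be aware that chaining explode calls like this produces one row per combination of elements from the four arrays, so a record with 2 advisories, 3 field notices, 1 swEoX entry and 2 hwEoX entries becomes 2 x 3 x 1 x 2 = 12 rows, and a null or empty array removes the record altogether. The plain-Java sketch below imitates that behaviour to show where the repeated values come from. It does not use Spark, and the class name ExplodeRowCount, the method chainedExplode and the sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class ExplodeRowCount {
    // Mimics chained explode calls: every existing row is repeated once
    // per element of the next array, so the row count multiplies.
    // An empty array leaves no rows at all, as explode does.
    static List<String> chainedExplode(List<List<String>> arrays) {
        List<String> rows = new ArrayList<>();
        rows.add(""); // one input record, nothing exploded yet
        for (List<String> array : arrays) {
            List<String> next = new ArrayList<>();
            for (String row : rows) {
                for (String element : array) {
                    next.add(row.isEmpty() ? element : row + "|" + element);
                }
            }
            rows = next;
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical sample: array sizes 2, 3, 1 and 2.
        List<List<String>> arrays = List.of(
            List.of("adv1", "adv2"),
            List.of("fn1", "fn2", "fn3"),
            List.of("sw1"),
            List.of("hw1", "hw2"));
        List<String> rows = chainedExplode(arrays);
        System.out.println(rows.size()); // prints 12
        rows.forEach(System.out::println); // first line: adv1|fn1|sw1|hw1
    }
}
```

Each advisory value appears in 6 of the 12 rows, which is why the output looks duplicated even though no data was actually copied in the source.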

If you need to explode each array independently, create a separate Dataset for each one, each containing only that array's exploded data.

// for advisory
Dataset<Row> advisory = alert
    .withColumn("exploded_advisory", explode(col("neAlert.advisory")));

// for fieldNotice
Dataset<Row> fieldNotice = alert
    .withColumn("exploded_fn", explode(col("neAlert.fieldNotice")));

Then drop any columns you no longer need, such as the original neAlert column, before proceeding.
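
One way to put this together is sketched below: explode a single array per Dataset and promote the struct's fields to top-level columns next to a few key columns. The class name AlertFlattener, the helper flatten, the choice of managedNeId and wfid as key columns, and the use of explode_outer (available since Spark 2.2, it keeps records whose array is null or empty) are assumptions made for illustration, not part of the original answer.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode_outer;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AlertFlattener {
    // Explodes one array under neAlert and turns the struct's fields
    // into top-level columns, keeping two key columns (an assumption)
    // so the per-array results can be related back to the record.
    static Dataset<Row> flatten(Dataset<Row> alert, String arrayName) {
        return alert
            .select(col("managedNeId"), col("wfid"),
                    explode_outer(col("neAlert." + arrayName)).as("item"))
            .select("managedNeId", "wfid", "item.*");
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("flatten-alerts")
            .master("local[*]")
            .getOrCreate();

        // args[0] is the path to the multi-line JSON file.
        Dataset<Row> alert = spark.read()
            .option("multiLine", true)
            .json(args[0]);

        // One Dataset per array: row counts add up instead of multiplying.
        flatten(alert, "advisory").show(false);
        flatten(alert, "fieldNotice").show(false);
        flatten(alert, "hwEoX").show(false);
        flatten(alert, "swEoX").show(false);

        spark.stop();
    }
}
```

Because each array is exploded on its own, every element appears exactly once in its Dataset. Replace explode_outer with explode if records without any elements should be dropped instead of kept with null fields.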
