How can I define a custom output structure for Scrapy's JSON export, and what is the process for doing so?

As a beginner in Python and Scrapy, I am struggling with the Scrapy documentation. I have successfully written a spider for a school project, but I am having trouble with the structure of the JSON export. Here is a snippet of my code:

def parse_links(self, response):
    products = response.css('qwerty')
    for product in products:
        yield {
            'Title': response.xpath('/html/head/title/text()').get(),
            'URL': response.url,
            'Product': response.css('product').getall(),
            'Manufacturer': response.xpath('Manufacturer').getall(),
            'Description': response.xpath('Description').getall(),
            'Rating': response.css('rating').getall(),
        }

The current JSON export looks like this:

[{"Title": "x", "URL": "https://y.com", "Product": ["a", "e"], "Manufacturer": ["b", "f"], "Description": ["c", "g"], "Rating": ["d", "h"]}]

However, I want the data exported in this nested format instead:

[{"Products": [{"Title":"x","URL":"https://y.com", "Links":[{"Product":"a","Manufacturer":"b","Description":"c","Rating":"d"},{"Product":"e","Manufacturer":"f","Description":"g","Rating":"h"}]}]}]

I have tried various solutions from the web, but nothing has worked so far, and the explanations on the Scrapy site are hard for me to follow as a newcomer. Writing the scraper itself was easy, but I have been stuck on this particular issue for a day now. To clarify, I am not using any custom pipelines or items. Any assistance would be greatly appreciated.

Thank you in advance and have a wonderful day.

Answer №1

Instead of yielding a flat dictionary, build the nested structure yourself in the callback and yield that. For example:

def extract_data(self, response):
    # Selectors below are placeholders; adapt them to the real page
    links_list = []
    for item in response.css('item'):
        # One dictionary per entry, selected relative to that entry
        links_list.append({
            "Name": item.css('name').get(),
            "Brand": item.xpath('brand').get(),
            "Description": item.xpath('description').get(),
            "Price": item.css('price').get(),
        })
    info_dict = {
        "Info": response.xpath('/html/body/div/info/text()').get(),
        "Website": "https://example.com",
        "Links": links_list,
    }
    # Yield one nested object for the whole page
    yield {"Items": [info_dict]}
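
If the page only gives you parallel lists of values, as in the question's current export, one possible approach is to pair them up with zip() before yielding. The sketch below is not Scrapy-specific: build_export is a hypothetical helper name, and it assumes the four lists are in the same order and of equal length. It uses the field names and sample values from the question.

```python
import json


def build_export(title, url, products, manufacturers, descriptions, ratings):
    """Zip parallel value lists into the nested structure from the question."""
    keys = ("Product", "Manufacturer", "Description", "Rating")
    # zip() pairs the i-th product with the i-th manufacturer, and so on.
    # It stops at the shortest list, so uneven lists are silently truncated.
    links = [
        dict(zip(keys, row))
        for row in zip(products, manufacturers, descriptions, ratings)
    ]
    return {"Products": [{"Title": title, "URL": url, "Links": links}]}


# Sample values taken from the question's current export
page = build_export(
    "x", "https://y.com",
    ["a", "e"], ["b", "f"], ["c", "g"], ["d", "h"],
)
# Scrapy's JSON export wraps the yielded items in a list, mimicked here
print(json.dumps([page]))
```

Inside the spider callback you would yield build_export(...) once per page, passing response.url and the .getall() results, rather than yielding inside the loop over products.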
