Transform an array of JSON data into a structured dataframe

I am looking to transform the OWID Covid-19 JSON data (https://covid.ourworldindata.org/data/owid-covid-data.json) into a dataframe. The JSON holds one entry per country; each entry keeps its daily records in a "data" array, and I want to merge those records with the country-code index to build the desired dataframe.

{"AFG":{"continent":"Asia","location":"Afghanistan","population":39835428.0,"population_density":54.422,"median_age":18.6,"aged_65_older":2.581,"aged_70_older":1.337,"gdp_per_capita":1803.987,"cardiovasc_death_rate":597.029,"diabetes_prevalence":9.59,"handwashing_facilities":37.746,"hospital_beds_per_thousand":0.5,"life_expectancy":64.83,"human_development_index":0.511,"data":[{"date":"2020-02-24","total_cases":5.0,"new_cases":5.0,"total_cases_per_million":0.126,"new_cases_per_million":0.126,"stringency_index":8.33},{"da...

Currently, I've been directly loading the file into a dataframe:

df = pd.read_json('owid-covid-data.json', orient='index')

Then, I proceed to normalize the array in the following manner:

data = pd.concat([pd.json_normalize(records) for records in df['data']])

This method works fine, except it drops the index, leaving me without an identifier to link back to the static values. I suspect there might be a more efficient way to normalize the data than what I'm currently using. Any assistance would be greatly appreciated!

Answer №1

Although not the most efficient method, this solution gets the job done:

new_df = pd.DataFrame()
for index, row in df.iterrows():
    # flatten this country's list of daily records
    tmp = pd.json_normalize(row['data'])
    # keep the country code so each row can be linked back to the static values
    tmp['country_code'] = index
    new_df = pd.concat([new_df, tmp])
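As a side note, the pd.concat call inside the loop is what makes this slow, since each iteration re-copies everything accumulated so far; a minimal variant of the same idea (a sketch, not the original code) collects the pieces first and concatenates once at the end:

pieces = []
for index, row in df.iterrows():
    tmp = pd.json_normalize(row['data'])
    tmp['country_code'] = index
    pieces.append(tmp)

new_df = pd.concat(pieces, ignore_index=True)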

UPDATE:

I have found a more efficient approach: collect all records first and build the DataFrame in one go, instead of concatenating per country:

country_codes = []
datas = []
for index, data in zip(df.index, df['data']):
    # collect every daily record, and repeat the country code once per record
    datas.extend(data)
    country_codes.extend(len(data) * [index])

new_df = pd.DataFrame(datas)
new_df['country_code'] = country_codes

This optimization reduced the execution time from 9.38 s ± 856 ms per loop to 1.37 s ± 12 ms per loop.
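To link those daily rows back to the static per-country values (the goal stated in the question), one possible follow-up, sketched here rather than taken from the original answer, is to merge new_df with the non-list columns via country_code:

# drop the nested list column, keep only the static per-country values
static_cols = df.drop(columns='data')

# attach continent, population, etc. to every daily record
full = new_df.merge(static_cols, left_on='country_code', right_index=True)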

Answer №2

df = pd.read_json("https://covid.ourworldindata.org/data/owid-covid-data.json", orient='index')

# transform records/lists into new rows, convert to dictionary,
# utilize it to form a new DataFrame, and transpose it
data = pd.DataFrame(df['data'].explode().to_dict()).T

df = df.drop(columns='data').join(data)

Efficiency

Disregarding the data retrieval time

>>> %%timeit
... data = pd.DataFrame(df['data'].explode().to_dict()).T
... df.drop(columns='data').join(data)

84.4 ms ± 3.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

IMPORTANT NOTE

The solution above is actually incorrect. When converting with to_dict, most records are lost: after explode() the Series index contains many repeated country codes, and dictionary keys must be unique, so later records silently overwrite earlier ones. To fix this, we first reset the index so every key is unique, and only after building the new DataFrame do we restore the original index.

data = df['data'].explode()
data_df = pd.DataFrame(data.reset_index(drop=True).to_dict()).T
data_df.index = data.index

df = df.drop(columns='data').join(data_df)
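To see concretely why the first attempt dropped rows, here is a tiny made-up illustration of duplicate index labels collapsing in to_dict():

import pandas as pd

s = pd.Series([{'new_cases': 1}, {'new_cases': 2}], index=['AFG', 'AFG'])
print(s.to_dict())       # {'AFG': {'new_cases': 2}} -- the first record was silently overwritten
print(len(s.to_dict()))  # 1 instead of 2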

This takes longer than the previous (incorrect) version because it now produces 127,314 records, whereas the earlier attempt kept only 233 rows (one per unique country code). Even if we leave out the join step, which Bruno's solution (Answer №1) also omits, it still lags behind Bruno's approach.

>>> %%timeit 
... data = df['data'].explode()
... new_df = pd.DataFrame(data.reset_index(drop=True).to_dict()).T
... new_df.index = data.index

17.6 s ± 972 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Bruno's solution 
>>> %%timeit
... country_codes = []
... datas = []
... for index, data in zip(df.index, df['data']):
...     datas.extend(data)
...     country_codes.extend(len(data)*[index])
...     
... new_df = pd.DataFrame(datas)
... new_df['country_code'] = country_codes

1.86 s ± 32.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

UPDATE 2 - A more efficient method...

I have found a much simpler and better solution; I was overcomplicating things. It mirrors Bruno's approach:

data = df['data'].explode()
data_df = pd.DataFrame(data.tolist(), index=data.index)

df = df.drop(columns='data').join(data_df)

Its timing is comparable to Bruno's solution:

>>> %%timeit 
... data = df['data'].explode()
... pd.DataFrame(data.tolist(), index=data.index)

1.87 s ± 16.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
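One small follow-up on the flattened frame (not part of the original answer, just a hedged usage sketch): the date column comes back as plain strings, so convert it before doing any time-series work.

# assumes df already holds the joined result from the snippet above
df['date'] = pd.to_datetime(df['date'])

# example: daily and cumulative case counts for one country, indexed by date
afg = df.loc['AFG'].set_index('date')[['new_cases', 'total_cases']]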

Answer №3

To speed up the process, consider using pd.json_normalize() with record_path and meta, which proved to be much faster (the timing below is for a single country, AFG, from the full JSON file):

%%timeit
pd.json_normalize(
    data["AFG"],  # "data" here is the raw JSON file loaded as a dict (e.g. with json.load)
    record_path=["data"],
    meta=[
        "continent",
        "location",
        # ... additional metadata fields here ...
    ],
)

17.9 ms ± 1.68 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

Results:

 date  total_cases  new_cases  total_cases_per_million  new_cases_per_million  stringency_index  new_cases_smoothed  ...  gdp_per_capita  cardiovasc_death_rate  diabetes_prevalence  handwashing_facilities  hospital_beds_per_thousand  life_expectancy  human_development_index
0    ... output rows elided ...

[615 rows x 38 columns]
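To apply the same idea to every country rather than just AFG, a minimal sketch (assuming the raw file was loaded with the standard json module into a dict, and keeping only a couple of meta fields for brevity) could carry the country code along as an extra meta field:

import json
import pandas as pd

with open('owid-covid-data.json') as fh:
    raw = json.load(fh)

# one record per country, with the country code pulled up alongside the static fields
records = [{'country_code': code, **entry} for code, entry in raw.items()]

full = pd.json_normalize(
    records,
    record_path='data',
    meta=['country_code', 'continent', 'location'],
    errors='ignore',  # some entries (e.g. aggregates) may lack certain meta fields
)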

For an even quicker route, you can simply download the data in CSV format, which OWID provides alongside the JSON.
