Transform an array of JSON data into a structured dataframe

I am looking to transform the OWID Covid-19 JSON data (https://covid.ourworldindata.org/data/owid-covid-data.json) into a dataframe. The JSON holds one entry per country; each entry keeps its daily records in a "data" array, and I want to merge those records with the country-code index to build the desired dataframe.

{"AFG":{"continent":"Asia","location":"Afghanistan","population":39835428.0,"population_density":54.422,"median_age":18.6,"aged_65_older":2.581,"aged_70_older":1.337,"gdp_per_capita":1803.987,"cardiovasc_death_rate":597.029,"diabetes_prevalence":9.59,"handwashing_facilities":37.746,"hospital_beds_per_thousand":0.5,"life_expectancy":64.83,"human_development_index":0.511,"data":[{"date":"2020-02-24","total_cases":5.0,"new_cases":5.0,"total_cases_per_million":0.126,"new_cases_per_million":0.126,"stringency_index":8.33},{"da...

Currently, I've been directly loading the file into a dataframe:

df = pd.read_json('owid-covid-data.json', orient='index')

Then, I proceed to normalize the array in the following manner:

data = pd.concat([pd.json_normalize(records) for records in df['data']])

This method works fine, except it drops the index, leaving me without an identifier to link back to the static values. I suspect there might be a more efficient way to normalize the data than what I'm currently using. Any assistance would be greatly appreciated!

Answer №1

Although not the most efficient method, this solution gets the job done:

new_df = pd.DataFrame()
for index, row in df.iterrows():
    # flatten this country's list of daily records
    tmp = pd.json_normalize(row['data'])
    # keep the country code so each row can be linked back to the static values
    tmp['country_code'] = index
    new_df = pd.concat([new_df, tmp])
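As a side note, the pd.concat call inside the loop is what makes this slow, since each iteration re-copies everything accumulated so far; a minimal variant of the same idea (a sketch, not the original code) collects the pieces first and concatenates once at the end:

pieces = []
for index, row in df.iterrows():
    tmp = pd.json_normalize(row['data'])
    tmp['country_code'] = index
    pieces.append(tmp)

new_df = pd.concat(pieces, ignore_index=True)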

UPDATE:

I have found a more efficient approach: collect all records first and build the DataFrame in one go, instead of concatenating per country:

country_codes = []
datas = []
for index, data in zip(df.index, df['data']):
    # collect every daily record, and repeat the country code once per record
    datas.extend(data)
    country_codes.extend(len(data) * [index])

new_df = pd.DataFrame(datas)
new_df['country_code'] = country_codes

This optimization reduced the execution time from 9.38 s ± 856 ms per loop to 1.37 s ± 12 ms per loop.
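To link those daily rows back to the static per-country values (the goal stated in the question), one possible follow-up, sketched here rather than taken from the original answer, is to merge new_df with the non-list columns via country_code:

# drop the nested list column, keep only the static per-country values
static_cols = df.drop(columns='data')

# attach continent, population, etc. to every daily record
full = new_df.merge(static_cols, left_on='country_code', right_index=True)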

Answer №2

df = pd.read_json("https://covid.ourworldindata.org/data/owid-covid-data.json", orient='index')

# transform records/lists into new rows, convert to dictionary,
# utilize it to form a new DataFrame, and transpose it
data = pd.DataFrame(df['data'].explode().to_dict()).T

df = df.drop(columns='data').join(data)

Efficiency

Disregarding the data retrieval time

>>> %%timeit
... data = pd.DataFrame(df['data'].explode().to_dict()).T
... df.drop(columns='data').join(data)

84.4 ms ± 3.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

IMPORTANT NOTE

The solution above is actually incorrect. When converting with to_dict, most records are lost: after explode() the Series index contains many repeated country codes, and dictionary keys must be unique, so later records silently overwrite earlier ones. To fix this, we first reset the index so every key is unique, and only after building the new DataFrame do we restore the original index.

data = df['data'].explode()
data_df = pd.DataFrame(data.reset_index(drop=True).to_dict()).T
data_df.index = data.index

df = df.drop(columns='data').join(data_df)
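To see concretely why the first attempt dropped rows, here is a tiny made-up illustration of duplicate index labels collapsing in to_dict():

import pandas as pd

s = pd.Series([{'new_cases': 1}, {'new_cases': 2}], index=['AFG', 'AFG'])
print(s.to_dict())       # {'AFG': {'new_cases': 2}} -- the first record was silently overwritten
print(len(s.to_dict()))  # 1 instead of 2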

This takes longer than the previous (incorrect) version because it now produces 127,314 records, whereas the earlier attempt kept only 233 rows (one per unique country code). Even if we leave out the join step, which Bruno's solution (Answer №1) also omits, it still lags behind Bruno's approach.

>>> %%timeit 
... data = df['data'].explode()
... new_df = pd.DataFrame(data.reset_index(drop=True).to_dict()).T
... new_df.index = data.index

17.6 s ± 972 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Bruno's solution 
>>> %%timeit
... country_codes = []
... datas = []
... for index, data in zip(df.index, df['data']):
...     datas.extend(data)
...     country_codes.extend(len(data)*[index])
...     
... new_df = pd.DataFrame(datas)
... new_df['country_code'] = country_codes

1.86 s ± 32.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

UPDATE 2 - A more efficient method...

I have found a much simpler and better solution; I was overcomplicating things. It mirrors Bruno's approach:

data = df['data'].explode()
data_df = pd.DataFrame(data.tolist(), index=data.index)

df = df.drop(columns='data').join(data_df)

Its timing is comparable to Bruno's solution:

>>> %%timeit 
... data = df['data'].explode()
... pd.DataFrame(data.tolist(), index=data.index)

1.87 s ± 16.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
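One small follow-up on the flattened frame (not part of the original answer, just a hedged usage sketch): the date column comes back as plain strings, so convert it before doing any time-series work.

# assumes df already holds the joined result from the snippet above
df['date'] = pd.to_datetime(df['date'])

# example: daily and cumulative case counts for one country, indexed by date
afg = df.loc['AFG'].set_index('date')[['new_cases', 'total_cases']]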

Answer №3

To speed up the process, consider using pd.json_normalize() with record_path and meta, which proved to be much faster (the timing below is for a single country, AFG, from the full JSON file):

%%timeit
pd.json_normalize(
    data["AFG"],  # "data" here is the raw JSON file loaded as a dict (e.g. with json.load)
    record_path=["data"],
    meta=[
        "continent",
        "location",
        # ... additional metadata fields here ...
    ],
)

17.9 ms ± 1.68 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

Results:

 date  total_cases  new_cases  total_cases_per_million  new_cases_per_million  stringency_index  new_cases_smoothed  ...  gdp_per_capita  cardiovasc_death_rate  diabetes_prevalence  handwashing_facilities  hospital_beds_per_thousand  life_expectancy  human_development_index
0    ... output rows elided ...

[615 rows x 38 columns]
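To apply the same idea to every country rather than just AFG, a minimal sketch (assuming the raw file was loaded with the standard json module into a dict, and keeping only a couple of meta fields for brevity) could carry the country code along as an extra meta field:

import json
import pandas as pd

with open('owid-covid-data.json') as fh:
    raw = json.load(fh)

# one record per country, with the country code pulled up alongside the static fields
records = [{'country_code': code, **entry} for code, entry in raw.items()]

full = pd.json_normalize(
    records,
    record_path='data',
    meta=['country_code', 'continent', 'location'],
    errors='ignore',  # some entries (e.g. aggregates) may lack certain meta fields
)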

For an even quicker route, you can simply download the data in CSV format, which OWID provides alongside the JSON.
