Combine two data sets by matching their time columns

Please note: I previously posted a similar question with the same dataset on this link, but now I am exploring a different approach to merge the data frames.

I have two separate data frames that house diverse types of medical information for patients. The common identifiers in both data frames are the encounter ID (hadm_id) and the timestamp when the information was recorded ((n|c)e_charttime).

One data frame (ds) includes structured data, while the other data frame (dn) has a column containing clinical notes documented at specific times for each encounter. Despite having multiple encounters in each data frame, they share the common element of the encounter ID (hadm_id).

The following are samples of the data frames:

ds
    hadm_id ce_charttime    hr  sbp dbp
0   140694  2121-08-12 19:00:00 67.0    102.0   75.0
1   140694  2121-08-12 19:45:00 68.0    135.0   68.0
2   140694  2121-08-12 20:00:00 70.0    153.0   94.0
3   171544  2153-09-06 14:11:00 80.0    114.0   50.0
4   171544  2153-09-06 17:30:00 80.0    114.0   50.0
5   171544  2153-09-06 17:35:00 80.0    114.0   50.0
6   171544  2153-09-06 17:40:00 76.0    115.0   51.0
7   171544  2153-09-06 17:45:00 79.0    117.0   53.0
dn
    hadm_id ne_charttime    note
0   140694  2121-08-10 20:32:00 some text1
1   140694  2121-08-11 12:57:00 some text2
2   140694  2121-08-11 15:18:00 some text3
3   171544  2153-09-05 15:09:00 some text4
4   171544  2153-09-05 17:43:00 some text5
5   171544  2153-09-06 10:36:00 some text6
6   171544  2153-09-06 15:55:00 some text7
7   171544  2153-09-06 17:12:00 some text8

The actual dataset comprises nearly 10,000 encounters with over 250,000 rows of structured data and 50,000 rows of clinical notes.

My objective is to merge these datasets based on the charted time. Essentially, if we take one encounter from each dataframe and sort them by charttime, I aim to consolidate all the information into one resulting dataframe, accounting for missing values using NaN. For instance, given the input from the two data frames above, the resulting merged dataframe would appear as follows:

final
    hadm_id charttime   ce_charttime    hr  sbp dbp ne_charttime    note
0   140694  2121-08-10 20:32:00 NaT NaN NaN NaN 2121-08-10 20:32:00 some text1
1   140694  2121-08-11 12:57:00 NaT NaN NaN NaN 2121-08-11 12:57:00 some text2
2   140694  2121-08-11 15:18:00 NaT NaN NaN NaN 2121-08-11 15:18:00 some text3
3   140694  2121-08-12 19:00:00 2121-08-12 19:00:00 67.0    102.0   75.0    NaT NaN
4   140694  2121-08-12 19:45:00 2121-08-12 19:45:00 68.0    135.0   68.0    NaT NaN
5   140694  2121-08-12 20:00:00 2121-08-12 20:00:00 70.0    153.0   94.0    NaT NaN
6   171544  2153-09-05 15:09:00 NaT NaN NaN NaN 2153-09-05 15:09:00 some text4
7   171544  2153-09-05 17:43:00 NaT NaN NaN NaN 2153-09-05 17:43:00 some text5
8   171544  2153-09-06 10:36:00 NaT NaN NaN NaN 2153-09-06 10:36:00 some text6
9   171544  2153-09-06 14:11:00 2153-09-06 14:11:00 80.0    114.0   50.0    NaT NaN
10  171544  2153-09-06 15:55:00 NaT NaN NaN NaN 2153-09-06 15:55:00 some text7
11  171544  2153-09-06 17:12:00 NaT NaN NaN NaN 2153-09-06 17:12:00 some text8
12  171544  2153-09-06 17:30:00 2153-09-06 17:30:00 80.0    114.0   50.0    NaT NaN
13  171544  2153-09-06 17:35:00 2153-09-06 17:35:00 80.0    114.0   50.0    NaT NaN
14  171544  2153-09-06 17:40:00 2153-09-06 17:40:00 76.0    115.0   51.0    NaT NaN
15  171544  2153-09-06 17:45:00 2153-09-06 17:45:00 76.0    117.0   53.0    NaT NaN

I manually created this resulting dataframe and seek guidance on how to achieve this efficiently using Pandas. Eventually, I will eliminate ce_charttime and ne_charttime, retaining only the newly generated charttime column and appropriately filling in any missing values later. Any assistance provided would be greatly valued. Please let me know if further information is necessary.

Thank you.

Answer №1

My plan is to eliminate the columns ce_charttime and ne_charttime, and only keep the newly created charttime

An alternative approach would be to remove these columns prior to merging the two dataframes, then use the pandas concat method to combine them into one dataframe.

import pandas as pd
from datetime import datetime

def parse_datetime(strftime):
    datetime.strptime(strftime, '%Y-%m-%d %H:%M:%S')

# assuming both dataframes have a shared column named `charttime` on the same axis
data1 = pd.read_csv('data1.csv', parse_dates=True, date_parser=parse_datetime)
data2 = pd.read_csv('data2.csv', parse_dates=True, date_parser=parse_datetime)

print(data1.head(10), end='\n\n')
print(data2.head(10), end='\n\n')

data = pd.concat([data1, data2], axis=0, sort=True)
data.sort_values(by=['charttime'], inplace=True)
data.reset_index(drop=True, inplace=True)
print(data.head(20))

The following shows the output generated by the code above:

   hadm_id            charttime    hr    sbp   dbp
0   140694  2121-08-12 19:00:00  67.0  102.0  75.0
1   140694  2121-08-12 19:45:00  68.0  135.0  68.0
2   140694  2121-08-12 20:00:00  70.0  153.0  94.0
3   171544  2153-09-06 14:11:00  80.0  114.0  50.0
4   171544  2153-09-06 17:30:00  80.0  114.0  50.0
5   171544  2153-09-06 17:35:00  80.0  114.0  50.0
6   171544  2153-09-06 17:40:00  76.0  115.0  51.0
7   171544  2153-09-06 17:45:00  79.0  117.0  53.0

   hadm_id            charttime        note
0   140694  2121-08-10 20:32:00  some text1
1   140694  2121-08-11 12:57:00  some text2
2   140694  2121-08-11 15:18:00  some text3
3   171544  2153-09-05 15:09:00  some text4
4   171544  2153-09-05 17:43:00  some text5
5   171544  2153-09-06 10:36:00  some text6
6   171544  2153-09-06 15:55:00  some text7
7   171544  2153-09-06 17:12:00  some text8

              charttime   dbp  hadm_id    hr        note    sbp
0   2121-08-10 20:32:00   NaN   140694   NaN  some text1    NaN
1   2121-08-11 12:57:00   NaN   140694   NaN  some text2    NaN
2   2121-08-11 15:18:00   NaN   140694   NaN  some text3    NaN
3   2121-08-12 19:00:00  75.0   140694  67.0         NaN  102.0
4   2121-08-12 19:45:00  68.0   140694  68.0         NaN  135.0
5   2121-08-12 20:00:00  94.0   140694  70.0         NaN  153.0
6   2153-09-05 15:09:00   NaN   171544   NaN  some text4    NaN
7   2153-09-05 17:43:00   NaN   171544   NaN  some text5    NaN
8   2153-09-06 10:36:00   NaN   171544   NaN  some text6    NaN
9   2153-09-06 14:11:00  50.0   171544  80.0         NaN  114.0
10  2153-09-06 15:55:00   NaN   171544   NaN  some text7    NaN
11  2153-09-06 17:12:00   NaN   171544   NaN  some text8    NaN
12  2153-09-06 17:30:00  50.0   171544  80.0         NaN  114.0
13  2153-09-06 17:35:00  50.0   171544  80.0         NaN  114.0
14  2153-09-06 17:40:00  51.0   171544  76.0         NaN  115.0
15  2153-09-06 17:45:00  53.0   171544  79.0         NaN  117.0

Similar questions

If you have not found the answer to your question or you are interested in this topic, then look at other similar questions below or use the search

Setting up VirtualenvWrapper

After using sudo pip install virtualenvwrapper to install virtualenvwrapper, I encountered an error when trying to run source bash_profile. The error message I received was: bash: /usr/local/share/python/virtualenvwrapper.sh: No such file or directory Her ...

The radio button cannot be interacted with as Selenium cannot find any other unique identifier for it

While examining the code of the button, I landed on this specific section: <input type="radio" name="chosen" value="UniqueNameExample"> I am attempting to locate an element within this code that I can interact with by c ...

Unlocking the door: Navigating login hurdles with Selenium using multiple class methods

I have been encountering difficulty while attempting to transfer a username and password to a login page that features multiple classes within its structure. Despite my attempts at various XPATH solutions, none seem to work with this particular example. B ...

Error encountered when trying to import a file from a specific directory in Python: `ModuleNotFoundError`

Encountering a moduleNotFoundError issue while trying to import a file from the root directory. Here is the directory structure for my Flask App: index.py auth_ - server.py Pages - home.py I can successfully import home.py from the Pages directory, b ...

What is the Unicode representation for the character "你"?

I recently learned that the Unicode for 你 (meaning you) is \x4F\x60. Is there a way to retrieve it using my Python command console? >>> print("你") 你 >>> print(("你").encode("gbk")) b'\xc4\xe3' >&g ...

Issue encountered when storing and retrieving a pointer within a class using Boost.Python: incorrect data type detected

I am encountering an issue while using Boost.Python 1.54 on Windows with MSVC2010. I am trying to store a pointer to one class in another class from Python and then retrieve it, but it appears that the data type is getting altered somehow. Below are my cl ...

The positioning of the button's rectangle is inaccurate

For my school project, I am working on an idle clicker game. One issue I encountered is that the button intended for upgrading clicking power has its rectangle placed incorrectly. The rectangle assigned to the button does not align with the actual button i ...

Tips for organizing data based on the "grand total" value in a pivot table with the Google Sheets API

I have been using the code below to automatically generate a pivot table by referencing a sheet ID. However, I am now trying to figure out how to sort the pivot table based on the Grand Total column programmatically. While I can manually do this in the spr ...

CharField validation using RegexValidator to prevent any trailing white spaces

I need a regular expression validator for a model's character field that will trigger an error if the field contains leading and trailing white spaces. Example: "__regex validator": not valid "regex validator__": not valid "regex validator": valid ...

Quick explanation of the concept of "matrix multiplication" using Python

I am looking to redefine matrix multiplication by having each constant represented as another array, which will be convolved together instead of simply multiplying. Check out the image I created to better illustrate my concept: https://i.stack.imgur.com/p ...

Python drawing techniques: A step-by-step guide

I am trying to divide a triangle into two colors, with the left side in red and the right side in yellow. But despite my efforts, I can't seem to achieve this split. When you run my code, you will see two triangles instead of one divided triangle as d ...

How to import JSON file in Python without the 'u prefix in the key

Recently, while working on a Python project involving graphs, I encountered an issue with saving certain data structures in files for quick retrieval. One particular problem arose when I attempted to save a dictionary in JSON format using the json.dump fun ...

Enforcing the blocking of insecure private network requests with Selenium

chrome://flags/#block-insecure-private-network-requests I am looking to use Selenium web driver with Python in order to disable/block insecure private network requests. Can anyone advise on the specific flag I need to add using add_argument()? ...

Combining data on specified columns with missing values in the primary columns using pandas

Looking for a solution to merge two data frames with null values in key columns and select specific columns from one of the data frames. import pandas as pd import numpy as np data = { 'Email': ['example1@example.com', 'exampl ...

Guide to installing torch through python

Attempting to install PyTorch using pip3 install torch --no-cache-dir resulted in the following error after a few seconds: Collecting torch Downloading https://files.pythonhosted.org/packages/24/19/4804aea17cd136f1705a5e98a00618cb8f6ccc375ad8bfa4374 ...

The execution of certain functions in Python Selenium is contingent upon the output of the preceding function

I have a challenge where I need to check for the presence of element A or B on a page using Python. If it's element A, I want to run function A and if it's element B, I want to run function B. I'm thinking of using a dictionary and an "if" e ...

`finding elements in Selenium with Python`

Looking to automate an HTML code on a webpage and one line reads: <"option value = "1">ABC<"/option>" Is there a way to find this without using the value attribute, but by locating it through the text "ABC"? I have "ABC" saved as a variable a ...

Localization with a separate domain for i18n in Python files

I'm seeking clarification on how to handle i18n translations within py files. I have a string that requires translation, and the content is already included in the plone domain in plone.pot. Therefore, I want to specify the domain as 'plone' ...

Python script in PowerShell has arguments that include double quotes and whitespace

I am attempting to execute a python script using PowerShell. Within the python script, I am trying to include a command line argument with double quotes and whitespace, but it is not working as expected. It seems like there may be an issue with PowerShell. ...

execute all tests in specified directory

Within a specific directory, I have a collection of testcase files that are in yaml format. To test my application, I am utilizing pyresttest. Is there a way to efficiently run all the tests located in this directory without having to write a script manu ...