Example of how to import the MEDS data format
In [ ]:
import pandas as pd
import numpy as np
from datetime import datetime
from twinweaver import (
convert_meds_to_dtc,
DataManager,
DataSplitterEvents,
ConverterInstruction,
Config,
)
Synthetic example
Here we provide synthetic example data as generated by Gemini.
In [2]:
code_metadata_list = [
# Static Measurements
{"code": "GENDER/Female", "description": "Female sex"},
{"code": "GENDER/Male", "description": "Male sex"},
{"code": "GENETIC/BRCA1_pos", "description": "BRCA1 gene mutation"},
# Visit and Administrative Codes
{
"code": "ADMISSION/Outpatient",
"description": "Admission for an outpatient clinic visit",
},
{
"code": "ADMISSION/Inpatient",
"description": "Admission to the hospital for an inpatient stay",
},
{
"code": "DISCHARGE/Outpatient",
"description": "Discharge from an outpatient clinic visit",
},
{
"code": "DISCHARGE/Inpatient",
"description": "Discharge from an inpatient hospital stay",
},
{
"code": "NOTE/FollowUp",
"description": "Clinical note for a follow-up appointment",
},
# Diagnosis Codes (ICD-10-CM)
{
"code": "ICD10CM/C34.90",
"description": "Malignant neoplasm of unspecified part of unspecified bronchus or lung",
},
{"code": "ICD10CM/C61", "description": "Malignant neoplasm of prostate"},
# Symptom Codes
{"code": "SYMPTOM/Cough", "description": "Patient reports a persistent cough"},
# Procedure Codes (CPT)
{
"code": "CPT/71250",
"description": "Procedure code for a CT scan of the thorax without contrast",
},
{
"code": "CPT/32408",
"description": "Procedure code for a core needle biopsy of the lung or mediastinum",
},
{
"code": "CPT/55700",
"description": "Procedure code for a needle biopsy of the prostate",
},
{
"code": "CPT/55840",
"description": "Procedure code for a radical retropubic prostatectomy",
},
# Lab Codes (LOINC)
{
"code": "LOINC/6690-2",
"description": "Leukocytes [#/volume] in Blood by Automated count (White Blood Cell Count)",
},
{
"code": "LOINC/2039-6",
"description": "Carcinoembryonic Ag [Mass/volume] in Serum or Plasma (CEA Tumor Marker)",
},
{
"code": "LOINC/59261-8",
"description": "Comprehensive metabolic 2014 panel - Serum or Plasma",
},
{
"code": "LOINC/2857-1",
"description": "Prostate specific Ag [Mass/volume] in Serum or Plasma (PSA Test)",
},
# Medication Codes
{
"code": "RX/Cisplatin",
"description": "Administration of Cisplatin chemotherapy agent",
},
# Death
{"code": "DEATH", "description": "Death"},
]
code_metadata_df = pd.DataFrame(code_metadata_list)
# Patient Events DataFrame
patient_events_list = [
# Patient 101: Jane Doe (Lung Cancer) - Assigned to 'train' split
# Static data
{
"subject_id": 101,
"time": pd.NaT,
"code": "GENDER/Female",
"numeric_value": np.nan,
"text_value": "Female",
},
{
"subject_id": 101,
"time": pd.NaT,
"code": "GENETIC/BRCA1_pos",
"numeric_value": 1,
"text_value": "Positive",
},
# Visit 1 (Week 2, 2024): Diagnosis
{
"subject_id": 101,
"time": datetime(2024, 1, 8),
"code": "ADMISSION/Outpatient",
"numeric_value": np.nan,
"text_value": np.nan,
},
{
"subject_id": 101,
"time": datetime(2024, 1, 8),
"code": "SYMPTOM/Cough",
"numeric_value": np.nan,
"text_value": "Persistent for 2 months",
},
{
"subject_id": 101,
"time": datetime(2024, 1, 8),
"code": "LOINC/6690-2",
"numeric_value": 12.5,
"text_value": np.nan,
},
{
"subject_id": 101,
"time": datetime(2024, 1, 8),
"code": "CPT/71250",
"numeric_value": np.nan,
"text_value": "Nodule found in right lung",
},
{
"subject_id": 101,
"time": datetime(2024, 1, 8),
"code": "CPT/32408",
"numeric_value": np.nan,
"text_value": np.nan,
},
{
"subject_id": 101,
"time": datetime(2024, 1, 8),
"code": "ICD10CM/C34.90",
"numeric_value": np.nan,
"text_value": "Primary Diagnosis",
},
{
"subject_id": 101,
"time": datetime(2024, 1, 8),
"code": "DISCHARGE/Outpatient",
"numeric_value": np.nan,
"text_value": np.nan,
},
# Visit 2 (Week 4, 2024): Treatment
{
"subject_id": 101,
"time": datetime(2024, 1, 22),
"code": "ADMISSION/Inpatient",
"numeric_value": np.nan,
"text_value": np.nan,
},
{
"subject_id": 101,
"time": datetime(2024, 1, 22),
"code": "LOINC/59261-8",
"numeric_value": np.nan,
"text_value": "All values within normal limits",
},
{
"subject_id": 101,
"time": datetime(2024, 1, 22),
"code": "RX/Cisplatin",
"numeric_value": np.nan,
"text_value": "Cisplatin",
},
{
"subject_id": 101,
"time": datetime(2024, 1, 22),
"code": "DISCHARGE/Inpatient",
"numeric_value": np.nan,
"text_value": np.nan,
},
# Visit 3 (Week 8, 2024): Follow-up
{
"subject_id": 101,
"time": datetime(2024, 2, 19),
"code": "ADMISSION/Outpatient",
"numeric_value": np.nan,
"text_value": np.nan,
},
{
"subject_id": 101,
"time": datetime(2024, 2, 19),
"code": "NOTE/FollowUp",
"numeric_value": np.nan,
"text_value": "Patient tolerated first cycle well.",
},
{
"subject_id": 101,
"time": datetime(2024, 2, 19),
"code": "LOINC/2039-6",
"numeric_value": 50.2,
"text_value": np.nan,
},
{
"subject_id": 101,
"time": datetime(2024, 2, 19),
"code": "DISCHARGE/Outpatient",
"numeric_value": np.nan,
"text_value": np.nan,
},
# Patient 202: John Smith (Prostate Cancer) - Assigned to 'held_out' split
# Static data
{
"subject_id": 202,
"time": pd.NaT,
"code": "GENDER/Male",
"numeric_value": np.nan,
"text_value": "Male",
},
# Visit 1 (Week 10, 2024): Diagnosis
{
"subject_id": 202,
"time": datetime(2024, 3, 4),
"code": "ADMISSION/Outpatient",
"numeric_value": np.nan,
"text_value": np.nan,
},
{
"subject_id": 202,
"time": datetime(2024, 3, 4),
"code": "LOINC/2857-1",
"numeric_value": 15.1,
"text_value": np.nan,
},
{
"subject_id": 202,
"time": datetime(2024, 3, 4),
"code": "CPT/55700",
"numeric_value": np.nan,
"text_value": "Biopsy taken",
},
{
"subject_id": 202,
"time": datetime(2024, 3, 4),
"code": "ICD10CM/C61",
"numeric_value": np.nan,
"text_value": "Primary Diagnosis",
},
{
"subject_id": 202,
"time": datetime(2024, 3, 4),
"code": "DISCHARGE/Outpatient",
"numeric_value": np.nan,
"text_value": np.nan,
},
# Visit 2 (Week 14, 2024): Treatment (Surgery)
{
"subject_id": 202,
"time": datetime(2024, 4, 1),
"code": "ADMISSION/Inpatient",
"numeric_value": np.nan,
"text_value": np.nan,
},
{
"subject_id": 202,
"time": datetime(2024, 4, 1),
"code": "CPT/55840",
"numeric_value": np.nan,
"text_value": "Surgical procedure completed.",
},
{
"subject_id": 202,
"time": datetime(2024, 4, 1),
"code": "LOINC/6690-2",
"numeric_value": 8.2,
"text_value": np.nan,
},
{
"subject_id": 202,
"time": datetime(2024, 4, 1),
"code": "DISCHARGE/Inpatient",
"numeric_value": np.nan,
"text_value": np.nan,
},
# Visit 3 (Week 20, 2024): Follow-up
{
"subject_id": 202,
"time": datetime(2024, 5, 13),
"code": "ADMISSION/Outpatient",
"numeric_value": np.nan,
"text_value": np.nan,
},
{
"subject_id": 202,
"time": datetime(2024, 5, 13),
"code": "LOINC/2857-1",
"numeric_value": 0.1,
"text_value": np.nan,
},
{
"subject_id": 202,
"time": datetime(2024, 5, 13),
"code": "NOTE/FollowUp",
"numeric_value": np.nan,
"text_value": "PSA levels are undetectable post-op.",
},
{
"subject_id": 202,
"time": datetime(2024, 5, 13),
"code": "DISCHARGE/Outpatient",
"numeric_value": np.nan,
"text_value": np.nan,
},
{
"subject_id": 202,
"time": datetime(2025, 5, 13),
"code": "DEATH",
"numeric_value": np.nan,
"text_value": np.nan,
},
]
patient_events_df = pd.DataFrame(patient_events_list)
patient_events_df["time"] = pd.to_datetime(patient_events_df["time"])
patient_events_df["subject_id"] = patient_events_df["subject_id"].astype(str)
# Subject Splits DataFrame
subject_splits_list = [
{"subject_id": 101, "split": "train"},
{
"subject_id": 202,
"split": "held_out",
}, # 'held_out' is often used for the final test set
]
subject_splits_df = pd.DataFrame(subject_splits_list)
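Before converting, it can help to confirm that the three tables agree with each other. The following is a minimal sketch, not part of twinweaver: `check_meds_tables` is a hypothetical helper, the toy tables are stand-ins so the block runs on its own, and `LOINC/0000-0` is a made-up placeholder code used only to trigger a warning.

```python
import pandas as pd


# Hypothetical helper (not part of twinweaver): basic consistency checks
# for the three MEDS-style tables used in this example.
def check_meds_tables(df_data, df_codes, df_split):
    problems = []

    # Columns this notebook relies on; real MEDS datasets may carry more.
    required = {"subject_id", "time", "code", "numeric_value", "text_value"}
    missing = required - set(df_data.columns)
    if missing:
        problems.append(f"missing event columns: {sorted(missing)}")
        return problems

    # Every event code should have a description in the metadata table.
    unknown = set(df_data["code"]) - set(df_codes["code"])
    if unknown:
        problems.append(f"codes without metadata: {sorted(unknown)}")

    # Compare IDs as strings so that 101 and "101" count as the same subject.
    event_ids = set(df_data["subject_id"].astype(str))
    split_ids = set(df_split["subject_id"].astype(str))
    unassigned = event_ids - split_ids
    if unassigned:
        problems.append(f"subjects without a split: {sorted(unassigned)}")

    return problems


# Tiny stand-in tables so the sketch runs on its own.
toy_events = pd.DataFrame(
    {
        "subject_id": ["101", "101", "202"],
        "time": pd.to_datetime([None, "2024-01-08", "2024-03-04"]),
        "code": ["GENDER/Female", "SYMPTOM/Cough", "LOINC/0000-0"],  # last code is a placeholder
        "numeric_value": [float("nan"), float("nan"), 1.0],
        "text_value": ["Female", "Persistent", None],
    }
)
toy_codes = pd.DataFrame(
    {"code": ["GENDER/Female", "SYMPTOM/Cough"], "description": ["Female sex", "Cough"]}
)
toy_splits = pd.DataFrame({"subject_id": [101], "split": ["train"]})

problems = check_meds_tables(toy_events, toy_codes, toy_splits)
print(problems)
```

Called on the notebook's own tables, `check_meds_tables(patient_events_df, code_metadata_df, subject_splits_df)` is expected to return an empty list, since every code has a description and both subjects are assigned to a split.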
In [3]:
patient_events_df
Out[3]:
| | subject_id | time | code | numeric_value | text_value |
|---|---|---|---|---|---|
| 0 | 101 | NaT | GENDER/Female | NaN | Female |
| 1 | 101 | NaT | GENETIC/BRCA1_pos | 1.0 | Positive |
| 2 | 101 | 2024-01-08 | ADMISSION/Outpatient | NaN | NaN |
| 3 | 101 | 2024-01-08 | SYMPTOM/Cough | NaN | Persistent for 2 months |
| 4 | 101 | 2024-01-08 | LOINC/6690-2 | 12.5 | NaN |
| 5 | 101 | 2024-01-08 | CPT/71250 | NaN | Nodule found in right lung |
| 6 | 101 | 2024-01-08 | CPT/32408 | NaN | NaN |
| 7 | 101 | 2024-01-08 | ICD10CM/C34.90 | NaN | Primary Diagnosis |
| 8 | 101 | 2024-01-08 | DISCHARGE/Outpatient | NaN | NaN |
| 9 | 101 | 2024-01-22 | ADMISSION/Inpatient | NaN | NaN |
| 10 | 101 | 2024-01-22 | LOINC/59261-8 | NaN | All values within normal limits |
| 11 | 101 | 2024-01-22 | RX/Cisplatin | NaN | Cisplatin |
| 12 | 101 | 2024-01-22 | DISCHARGE/Inpatient | NaN | NaN |
| 13 | 101 | 2024-02-19 | ADMISSION/Outpatient | NaN | NaN |
| 14 | 101 | 2024-02-19 | NOTE/FollowUp | NaN | Patient tolerated first cycle well. |
| 15 | 101 | 2024-02-19 | LOINC/2039-6 | 50.2 | NaN |
| 16 | 101 | 2024-02-19 | DISCHARGE/Outpatient | NaN | NaN |
| 17 | 202 | NaT | GENDER/Male | NaN | Male |
| 18 | 202 | 2024-03-04 | ADMISSION/Outpatient | NaN | NaN |
| 19 | 202 | 2024-03-04 | LOINC/2857-1 | 15.1 | NaN |
| 20 | 202 | 2024-03-04 | CPT/55700 | NaN | Biopsy taken |
| 21 | 202 | 2024-03-04 | ICD10CM/C61 | NaN | Primary Diagnosis |
| 22 | 202 | 2024-03-04 | DISCHARGE/Outpatient | NaN | NaN |
| 23 | 202 | 2024-04-01 | ADMISSION/Inpatient | NaN | NaN |
| 24 | 202 | 2024-04-01 | CPT/55840 | NaN | Surgical procedure completed. |
| 25 | 202 | 2024-04-01 | LOINC/6690-2 | 8.2 | NaN |
| 26 | 202 | 2024-04-01 | DISCHARGE/Inpatient | NaN | NaN |
| 27 | 202 | 2024-05-13 | ADMISSION/Outpatient | NaN | NaN |
| 28 | 202 | 2024-05-13 | LOINC/2857-1 | 0.1 | NaN |
| 29 | 202 | 2024-05-13 | NOTE/FollowUp | NaN | PSA levels are undetectable post-op. |
| 30 | 202 | 2024-05-13 | DISCHARGE/Outpatient | NaN | NaN |
| 31 | 202 | 2025-05-13 | DEATH | NaN | NaN |
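In the table above, static measurements are the rows whose `time` is `NaT`, while all other rows are time-stamped events grouped by visit date. A small sketch of that distinction in pandas, using a stand-in `events` frame with a few of the rows above so that it runs on its own:

```python
import pandas as pd

# Stand-in for a few rows of the events table above.
events = pd.DataFrame(
    {
        "subject_id": ["101", "101", "101", "101"],
        "time": pd.to_datetime([None, "2024-01-08", "2024-01-08", "2024-01-22"]),
        "code": [
            "GENDER/Female",
            "ADMISSION/Outpatient",
            "SYMPTOM/Cough",
            "RX/Cisplatin",
        ],
    }
)

# Rows with a missing timestamp (NaT) are the static measurements.
static_rows = events[events["time"].isna()]
timed_rows = events[events["time"].notna()]

# Number of events per subject and visit date.
events_per_visit = timed_rows.groupby(["subject_id", "time"]).size()

print(len(static_rows), len(timed_rows))
print(events_per_visit)
```

The same two filters can be applied to `patient_events_df` directly to inspect which rows are static before conversion.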
Conversion to DTC format
In [4]:
# Here we set a demo mapping for the event_category column - if not provided it uses a default
# This is useful especially for cases when generating custom training data for LLMs
demo_mapping = {
"SYMPTOM/Cough": "symptom",
"ICD10CM/C34.90": "diagnosis",
"DEATH": "death",
"RX/Cisplatin": "lot",
}
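To illustrate what a code-to-category mapping with a fallback looks like, here is a minimal pandas sketch. It is not the implementation used by `convert_meds_to_dtc`; the fallback name `"other"` is a placeholder assumption, since the library's actual default is not shown in this notebook.

```python
import pandas as pd

# A few codes from the example, one of which is not in the mapping.
codes = pd.Series(["SYMPTOM/Cough", "RX/Cisplatin", "LOINC/6690-2"])

demo_mapping = {
    "SYMPTOM/Cough": "symptom",
    "ICD10CM/C34.90": "diagnosis",
    "DEATH": "death",
    "RX/Cisplatin": "lot",
}

# Placeholder fallback; the real default category may differ.
fallback_category = "other"

# Unmapped codes become NaN after .map() and are then filled with the fallback.
event_category = codes.map(demo_mapping).fillna(fallback_category)
print(event_category.tolist())
```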
In [5]:
# Run the actual conversion
df_converted_constant, df_converted_constant_description, df_converted_events = convert_meds_to_dtc(
df_codes=code_metadata_df,
df_data=patient_events_df,
df_split=subject_splits_df,
prefer_text_value_over_numeric=True,
event_category_mapping=demo_mapping,
no_value_default="observed",
)
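The names `prefer_text_value_over_numeric` and `no_value_default` suggest a simple precedence rule for choosing each event's value. The sketch below spells out one plausible reading, inferred from the instruction text printed at the end of this notebook (for example, BRCA1 is reported as "Positive" rather than 1, and events without any value appear as "observed"). `resolve_value` is a hypothetical function, not the library's code.

```python
import math


def resolve_value(numeric_value, text_value, prefer_text=True, no_value_default="observed"):
    """Pick one value for an event (assumed semantics, not the library's code)."""

    def present(value):
        # Treat None and NaN as missing.
        return value is not None and not (isinstance(value, float) and math.isnan(value))

    # Try the preferred field first, then the other one.
    ordered = [text_value, numeric_value] if prefer_text else [numeric_value, text_value]
    for value in ordered:
        if present(value):
            return value

    # Neither field carries a value.
    return no_value_default


print(resolve_value(1, "Positive"))                  # both present, text preferred
print(resolve_value(12.5, float("nan")))             # only numeric present
print(resolve_value(float("nan"), float("nan")))     # nothing present
```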
In [ ]:
# Keep the constant column names (everything except the patient ID) for later use
constant_columns = df_converted_constant.columns.tolist()
constant_columns = [x for x in constant_columns if x not in ["patientid"]]
Example usage in digital_twin_converter package
Here we show an example for inference (i.e., using a pretrained model). See the other examples if you need to generate training data, for instance.
In [ ]:
# Set basics
indication = "meds_demo"
config = Config() # Override values here to customize pipeline
config.constant_columns_to_use = constant_columns
config.constant_birthdate_column = None # Not using in demo
config.lot_name_col = None # Setting for LoTs
config.event_value_lot_start = None
In [ ]:
# Setup basics
dm = DataManager(config=config)
dm.load_indication_data(
df_events=df_converted_events,
df_constant=df_converted_constant,
df_constant_description=df_converted_constant_description,
)
dm.process_indication_data()
dm.setup_unique_mapping_of_events()
dm.setup_dataset_splits()
data_splitter_events = DataSplitterEvents(dm, config=config)
data_splitter_events.setup_variables()
converter = ConverterInstruction(
dm.data_frames["constant_description"],
nr_tokens_budget_total=8192,
config=config,
dm=dm,
)
In [11]:
# Set example patient
patientid = 101
# Get data
patient_data = dm.get_patient_data(patientid)
patient_data["events"] = patient_data["events"].sort_values("date")
# Use the date of the last recorded event as the split date
split_date = patient_data["events"]["date"].iloc[-1]
# Generate splits to predict whether death will occur in the next 52 weeks
events_splits = data_splitter_events.get_splits_from_patient(
patient_data,
max_nr_samples_per_split=1,
override_split_dates=[split_date],
override_category="death",
override_observation_time_delta=pd.Timedelta(weeks=52),
)
events_split = events_splits[0][0]
# No forecasting split in this example
forecast_split = None
forecasting_times_to_predict = None
# Convert to instruction
converted = converter.forward_conversion_inference(
forecasting_split=forecast_split,
forecasting_future_weeks_per_variable=forecasting_times_to_predict,
event_split=events_split,
custom_tasks=None,
)
print(converted["instruction"])
The following is a patient, starting with the demographic data, following visit by visit everything that the patient experienced. All lab codes refer to LOINC codes.
Starting with demographic data: Female sex is Female, BRCA1 gene mutation is Positive, No description available is train.
On the first visit, the patient experienced the following: Malignant neoplasm of unspecified part of unspecified bronchus or lung is primary diagnosis, Admission for an outpatient clinic visit is observed, Procedure code for a core needle biopsy of the lung or mediastinum is observed, Procedure code for a CT scan of the thorax without contrast is nodule found in right lung, Discharge from an outpatient clinic visit is observed, Leukocytes [#/volume] in Blood by Automated count (White Blood Cell Count) is 12.5, Patient reports a persistent cough is persistent for 2 months.
2 weeks later, the patient visited and experienced the following: Admission to the hospital for an inpatient stay is observed, Discharge from an inpatient hospital stay is observed, Comprehensive metabolic 2014 panel - Serum or Plasma is all values within normal limits, Administration of Cisplatin chemotherapy agent is cisplatin.
4 weeks later, the patient visited and experienced the following: Admission for an outpatient clinic visit is observed, Discharge from an outpatient clinic visit is observed, Carcinoembryonic Ag [Mass/volume] in Serum or Plasma (CEA Tumor Marker) is 50.2, Clinical note for a follow-up appointment is patient tolerated first cycle well..
Here we repeat the last observed values of each genetic event in the input data: No genetic data available.
The most recent line of therapy: Cisplatin
You will now have multiple tasks to complete. Please answer for each task in the same order as they are presented. Before every response state the task nr, e.g. 'Task 2:'.
Task 1 is time to event prediction: Your task is to predict whether the following event was censored 52 weeks from the last clinical visit and whether the event occurred or not: death. Please provide your prediction in the following format: 'Here is the prediction: the event (<name of event>) was [not] censored and [did not occur]/[occurred].'
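The instruction asks the model to answer in a fixed sentence pattern, so a response can be turned back into structured fields with a regular expression. The following is a minimal sketch under that assumption; `parse_prediction` is a hypothetical helper (not a twinweaver function), and the `example` string is an invented response written to match the requested format, not real model output.

```python
import re

# Matches the answer template requested in the instruction above.
PATTERN = re.compile(
    r"the event \((?P<event>[^)]+)\) was (?P<censored>not censored|censored)"
    r" and (?P<outcome>did not occur|occurred)",
    flags=re.IGNORECASE,
)


def parse_prediction(text):
    """Return the event name and two booleans, or None if the pattern is absent."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return {
        "event": match.group("event").strip(),
        "censored": match.group("censored").lower() == "censored",
        "occurred": match.group("outcome").lower() == "occurred",
    }


# Invented response, for illustration only.
example = "Task 1: Here is the prediction: the event (death) was not censored and did not occur."
parsed = parse_prediction(example)
print(parsed)
```

Returning `None` for text that does not follow the template makes malformed responses easy to count separately from valid predictions.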