Example: converting a single patient with the instruction setup and a custom dataset
Start by loading all the libraries.
import pandas as pd
from twinweaver import (
    DataSplitterForecasting,
    DataManager,
    DataSplitterEvents,
    ConverterInstruction,
    Config,
)
Basic Setup
Load the custom dataset, here taken from the example data.
df_events = pd.read_csv("../../example_data/events.csv")
df_constant = pd.read_csv("../../example_data/constant.csv")
df_constant_description = pd.read_csv("../../example_data/constant_description.csv")
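Because the config further down selects constant columns by name, a quick check that those columns are present can catch typos early. The following is a minimal sketch in plain Python: `missing_columns` is a hypothetical helper and not part of twinweaver, and the `available` list (including the `patientid` column) is a stand-in for `list(df_constant.columns)`, not the actual contents of the example file.

```python
def missing_columns(available, required):
    """Return the required column names that are absent, in their given order."""
    available_set = set(available)
    return [name for name in required if name not in available_set]


# Column names taken from the config used in this example.
required = ["birthyear", "gender", "histology", "smoking_history"]

# Assumed stand-in for list(df_constant.columns); the real file may differ.
available = ["patientid", "birthyear", "gender", "histology"]

missing = missing_columns(available, required)
print(missing)  # → ['smoking_history']
```

An empty list means every requested column is available.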
Set up the config and the data manager, which holds the patient data.
config = Config() # Override values here to customize pipeline
config.constant_columns_to_use = ["birthyear", "gender", "histology", "smoking_history"] # Manually set from constant
config.constant_birthdate_column = "birthyear"
dm = DataManager(config=config)
dm.load_indication_data(df_events=df_events, df_constant=df_constant, df_constant_description=df_constant_description)
dm.process_indication_data()
dm.setup_unique_mapping_of_events()
dm.setup_dataset_splits()
dm.infer_var_types()
data_splitter_events = DataSplitterEvents(dm, config=config)
data_splitter_events.setup_variables()
data_splitter_forecasting = DataSplitterForecasting(
    data_manager=dm,
    config=config,
)
# If you want to manually override the variables selected for forecasting, you can skip the next line.
data_splitter_forecasting.setup_statistics()
converter = ConverterInstruction(
    dm.data_frames["constant_description"],
    nr_tokens_budget_total=8192,
    config=config,
    dm=dm,
    variable_stats=data_splitter_forecasting.variable_stats,  # Optional, needed for forecasting QA tasks
)
Examine patient data
From the data manager we can get a patient ID, for example the third one (index 2).
patientid = dm.all_patientids[2]
patientid
Let's check out the patient's data. patient_data is a dictionary with two keys:
- "events": A pandas DataFrame containing all time-series events (original events and molecular data combined and sorted by date).
- "constant": A pandas DataFrame containing the static (constant) data for the patient.
patient_data = dm.get_patient_data(patientid)
patient_data["events"].head(20)
patient_data["constant"]
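To make the "combined and sorted by date" description of the events table concrete, here is a small pandas sketch that builds a dictionary of the same shape from toy data. The column names (`date`, `event_name`) and all values are assumptions chosen for illustration; they are not the schema of the example files, and this is not how the data manager is implemented.

```python
import pandas as pd

# Hypothetical column names and values, for illustration only.
clinical = pd.DataFrame({
    "date": ["2020-03-01", "2020-01-15"],
    "event_name": ["hemoglobin", "therapy_start"],
})
molecular = pd.DataFrame({
    "date": ["2020-02-10"],
    "event_name": ["biomarker_test"],
})

# Stack both tables, parse the dates, and sort chronologically.
events = pd.concat([clinical, molecular], ignore_index=True)
events["date"] = pd.to_datetime(events["date"])
events = events.sort_values("date", kind="stable").reset_index(drop=True)

# Same two keys as patient_data: time-series events plus static data.
patient_like = {
    "events": events,
    "constant": pd.DataFrame({"birthyear": [1960]}),
}
print(patient_like["events"])
```

The stable sort keeps the original relative order of events that share a date.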
Convert patient data to string
We start by generating random "splits" in the patient trajectory. Each trajectory can yield multiple relevant samples (e.g. depending on when the therapy started), and each sample can target different variables (e.g. neutrophils/hemoglobin/... for forecasting, death/progression/metastases/next treatment for events).
Here we generate these random splits. We can also manually override them (see other examples on inference).
processed_splits_fc, split_dates = data_splitter_forecasting.get_splits_from_patient(
    patient_data,
    nr_samples_per_split=4,
    filter_outliers=False,
    include_metadata=True,
    max_num_splits_per_lot=2,
)
processed_splits_ev = data_splitter_events.get_splits_from_patient(
    patient_data,
    reference_split_dates=split_dates,
    max_nr_samples_per_split=3,
)
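The idea of a split can be illustrated without the library: pick a date in the trajectory, treat everything up to it as history, and everything after it as the part to predict. The sketch below is a toy version under that assumption only. `toy_splits` is a hypothetical helper, it samples dates uniformly at random, and it does not reproduce the selection logic of the data splitters (for example, it ignores lines of therapy).

```python
import random
from datetime import date


def toy_splits(event_dates, nr_splits, seed=0):
    """Pick random split dates and divide the events into history and future."""
    ordered = sorted(event_dates)
    if not ordered:
        return []
    rng = random.Random(seed)
    # Only dates before the last event qualify, so each split has a future part.
    candidates = sorted({d for d in ordered if d < ordered[-1]})
    chosen = sorted(rng.sample(candidates, k=min(nr_splits, len(candidates))))
    splits = []
    for split_date in chosen:
        splits.append({
            "split_date": split_date,
            "history": [d for d in ordered if d <= split_date],
            "future": [d for d in ordered if d > split_date],
        })
    return splits


# Toy trajectory with made-up dates.
dates = [date(2020, 1, 15), date(2020, 2, 10), date(2020, 3, 1), date(2020, 4, 20)]
splits = toy_splits(dates, nr_splits=2)
for s in splits:
    print(s["split_date"], len(s["history"]), len(s["future"]))
```

Fixing the seed makes the sampled split dates reproducible between runs.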
Now for each split, we can generate these strings.
split_idx = 0
p_converted = converter.forward_conversion(
    forecasting_splits=processed_splits_fc[split_idx],
    event_splits=processed_splits_ev[split_idx],
    override_mode_to_select_forecasting="forecasting_qa",
)
p_converted is a dictionary containing the final formatted data:
- 'instruction': The complete input string for the model (context + multi-task prompt).
- 'answer': The complete target string for the model (multi-task answer).
- 'meta': A dictionary holding metadata including patient ID, structured constant and history data used, split date, combined metadata from sub-converters, and a list of detailed metadata for each individual task generated ('target_meta_detailed').
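If the instruction and answer strings are meant to be stored as training samples, one common option is a JSON Lines file with one record per split. The sketch below assumes only the three keys described above; `to_jsonl_line` is a hypothetical helper, the dictionary holds placeholder strings rather than real converter output, and 'meta' is left out because it may contain objects that JSON cannot encode.

```python
import io
import json


def to_jsonl_line(converted):
    """Serialise only the two text fields of a converted sample."""
    record = {
        "instruction": converted["instruction"],
        "answer": converted["answer"],
    }
    return json.dumps(record, ensure_ascii=False)


# Placeholder stand-in; real values would come from forward_conversion.
toy_converted = {
    "instruction": "<instruction text>",
    "answer": "<answer text>",
    "meta": {"patientid": "p0"},
}

buffer = io.StringIO()  # swap for open("samples.jsonl", "w") to write a file
buffer.write(to_jsonl_line(toy_converted) + "\n")

# Read the line back to confirm it round-trips.
restored = json.loads(buffer.getvalue().splitlines()[0])
print(restored)
```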
print(p_converted["instruction"])
print(p_converted["answer"])
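# Reverse the conversion of the answer string, passing the first split date.
date = split_dates["date"][0]
return_list = converter.reverse_conversion(p_converted["answer"], dm, date)
# Inspect the result stored in the third entry of the returned list.
return_list[2]["result"]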