Example: converting a single patient using the unified DataSplitter API with a custom dataset¶
Start by loading all required libraries.
import pandas as pd
from twinweaver import (
DataSplitterForecasting,
DataSplitterEvents,
DataSplitter,
DataManager,
ConverterInstruction,
Config,
)
Basic Setup¶
Set up the config. Here we show how to use a custom dataset loaded from the example data.
df_events = pd.read_csv("../../example_data/events.csv")
df_constant = pd.read_csv("../../example_data/constant.csv")
df_constant_description = pd.read_csv("../../example_data/constant_description.csv")
Set up the data manager and the unified DataSplitter, which combines the event and forecasting splitters.
config = Config() # Override values here to customize pipeline
# <---------------------- CRITICAL CONFIGURATION ---------------------->
# 1. Event category used for data splitting (e.g., split data around Lines of Therapy 'lot')
# Has to be set for all instruction tasks
config.split_event_category = "lot"
# 2. List of event categories we want to forecast (e.g., forecasting 'lab' values)
# Only needs to be set if you want to forecast variables
config.event_category_forecast = ["lab"]
# 3. Mapping of specific time to events to predict (e.g., we want to predict 'death' and 'progression')
# Only needs to be set if you want to do time to event prediction
config.event_category_events_prediction_with_naming = {
"death": "death",
"progression": "next progression", # Custom name in prompt: "next progression" instead of "progression"
}
# Constant setup
config.constant_columns_to_use = ["birthyear", "gender", "histology", "smoking_history"] # Manually set from constant
config.constant_birthdate_column = "birthyear"
dm = DataManager(config=config)
dm.load_indication_data(df_events=df_events, df_constant=df_constant, df_constant_description=df_constant_description)
dm.process_indication_data()
dm.setup_unique_mapping_of_events()
dm.setup_hold_out_sets(validation_split=0.1, test_split=0.1)
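For reference, `setup_hold_out_sets(validation_split=0.1, test_split=0.1)` reserves 10% of patients each for validation and test. A minimal, self-contained sketch of this kind of patient-level split (a hypothetical helper for illustration, not the `DataManager` internals):

```python
import random

def split_patient_ids(patient_ids, validation_split=0.1, test_split=0.1, seed=0):
    """Shuffle patient IDs and carve off validation/test fractions.

    Splitting at the patient level keeps all events of one patient in
    exactly one set, avoiding leakage between train and hold-out sets.
    """
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    n_val = int(len(ids) * validation_split)
    n_test = int(len(ids) * test_split)
    val = ids[:n_val]
    test = ids[n_val:n_val + n_test]
    train = ids[n_val + n_test:]
    return train, val, test

train, val, test = split_patient_ids(range(100))
# 80 train / 10 validation / 10 test patients
```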
dm.infer_var_types()
data_splitter_events = DataSplitterEvents(
dm,
config=config,
max_length_to_sample=pd.Timedelta(weeks=104),
min_length_to_sample=pd.Timedelta(weeks=1),
)
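The sampling bounds above are pandas `Timedelta` objects: sampled context windows are constrained to lengths between one week and 104 weeks. A quick illustration of how such bounds compare against an actual window length:

```python
import pandas as pd

min_len = pd.Timedelta(weeks=1)
max_len = pd.Timedelta(weeks=104)

# Length of a hypothetical sampled window (dates are illustrative)
window = pd.Timestamp("2020-06-01") - pd.Timestamp("2020-01-01")

in_range = min_len <= window <= max_len  # 152 days lies within [7, 728] days
```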
data_splitter_events.setup_variables()
data_splitter_forecasting = DataSplitterForecasting(
data_manager=dm,
config=config,
max_forecasted_trajectory_length=pd.Timedelta(days=90),
)
# If you want to manually override the variable selection for forecasting, skip the next line.
data_splitter_forecasting.setup_statistics()
# Use the unified DataSplitter API that combines both splitters
data_splitter = DataSplitter(
data_splitter_events=data_splitter_events,
data_splitter_forecasting=data_splitter_forecasting,
)
converter = ConverterInstruction(
nr_tokens_budget_total=8192,
config=config,
dm=dm,
variable_stats=data_splitter_forecasting.variable_stats, # Optional, needed for forecasting QA tasks
)
Examine patient data¶
From the data manager we can get a patient, for example the third patient ID.
patientid = dm.all_patientids[2]
patientid
Let's check out the patient's data. patient_data is a dictionary containing the patient's data, with two keys:
- "events": A pandas DataFrame containing all time-series events (original events and molecular data combined and sorted by date).
- "constant": A pandas DataFrame containing the static (constant) data for the patient.
patient_data = dm.get_patient_data(patientid)
patient_data["events"].head(20)
patient_data["constant"]
Convert patient data to string¶
We start by generating random "splits" in the patient trajectory using the unified DataSplitter.get_splits_from_patient_with_target method. This ensures that the forecasting and event splits share the same anchor points in time. We can draw multiple relevant samples from each patient trajectory (e.g. depending on when the therapy started) and predict different variables (e.g. neutrophils/hemoglobin/... for forecasting; death/progression/metastases/next treatment for events).
We can also override the splits manually (see the other examples on inference).
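Conceptually, each split anchors the trajectory at a reference date: events up to the anchor form the visible context, and events after it form the prediction target. A toy illustration with pandas (illustrative data and column names, not the library's internal split logic):

```python
import pandas as pd

events = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-05", "2020-02-10", "2020-03-20", "2020-05-01"]),
    "category": ["lab", "lot", "lab", "progression"],
})

# Anchor the split at a 'lot' event (e.g. the start of a line of therapy)
split_date = events.loc[events["category"] == "lot", "date"].iloc[0]

context = events[events["date"] <= split_date]  # shown to the model
target = events[events["date"] > split_date]    # to be predicted
```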
forecasting_splits, events_splits, reference_dates = data_splitter.get_splits_from_patient_with_target(
patient_data,
forecasting_nr_samples_per_split=4,
forecasting_filter_outliers=False,
max_num_splits_per_split_event=2,
events_max_nr_samples_per_split=3,
)
Now, for each split, we can generate the corresponding instruction and answer strings.
split_idx = 0
p_converted = converter.forward_conversion(
forecasting_splits=forecasting_splits[split_idx],
event_splits=events_splits[split_idx],
)
p_converted is a dictionary containing the final formatted data:
- 'instruction': The complete input string for the model (context + multi-task prompt).
- 'answer': The complete target string for the model (multi-task answer).
- 'meta': A dictionary of metadata, including the patient ID, the structured constant and history data used, the split date, combined metadata from the sub-converters, and a list of detailed per-task metadata ('target_meta_detailed').
print(p_converted["instruction"])
print(p_converted["answer"])
date = reference_dates["date"][0]
return_list = converter.reverse_conversion(p_converted["answer"], dm, date)
return_list[0]["result"]