Forecasting-Only Example: Training Data Generation with the Unified DataSplitter API
Start by importing the required libraries.
import pandas as pd
from twinweaver import (
    DataSplitterForecasting,
    DataSplitter,
    DataManager,
    ConverterInstruction,
    Config,
)
Basic Setup
Load the input tables. The example data is used here to show how to plug in a custom dataset.
df_events = pd.read_csv("../../example_data/events.csv")
df_constant = pd.read_csv("../../example_data/constant.csv")
df_constant_description = pd.read_csv("../../example_data/constant_description.csv")
Set up the data manager and the forecasting-only pipeline using the unified DataSplitter API. By passing only data_splitter_forecasting, the unified interface handles the forecasting-only case automatically.
config = Config() # Override values here to customize pipeline
# <---------------------- CRITICAL CONFIGURATION ---------------------->
# 1. Event category used for data splitting (e.g., split data around Lines of Therapy 'lot')
# Has to be set for all instruction tasks
config.split_event_category = "lot"
# 2. List of event categories we want to forecast (e.g., forecasting 'lab' values)
# Only needs to be set if you want to forecast variables
config.event_category_forecast = ["lab"]
# No time-to-event settings are needed in the forecasting-only case
# Constant setup
config.constant_columns_to_use = ["birthyear", "gender", "histology", "smoking_history"] # Manually set from constant
config.constant_birthdate_column = "birthyear"
dm = DataManager(config=config)
dm.load_indication_data(df_events=df_events, df_constant=df_constant, df_constant_description=df_constant_description)
dm.process_indication_data()
dm.setup_unique_mapping_of_events()
dm.setup_hold_out_sets(validation_split=0.1, test_split=0.1)
dm.infer_var_types()
data_splitter_forecasting = DataSplitterForecasting(
    data_manager=dm,
    config=config,
    max_forecasted_trajectory_length=pd.Timedelta(days=90),
)
# If you want to manually override the variables selected for forecasting, you can skip the next line.
data_splitter_forecasting.setup_statistics()
# Use the unified DataSplitter API with only the forecasting splitter
data_splitter = DataSplitter(data_splitter_forecasting=data_splitter_forecasting)
converter = ConverterInstruction(
    nr_tokens_budget_total=8192,
    config=config,
    dm=dm,
    variable_stats=data_splitter_forecasting.variable_stats,  # Optional, needed for forecasting QA tasks
)
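The setup above reserves 10% of the data for validation and 10% for testing. One common way to build such hold-out sets is to split at the patient level, so that no patient contributes samples to more than one set. The sketch below illustrates that idea with NumPy only. It is not TwinWeaver's implementation, and the helper name `split_patient_ids`, the seed, and the example IDs are hypothetical.

```python
import numpy as np


def split_patient_ids(patient_ids, validation_split=0.1, test_split=0.1, seed=0):
    """Hypothetical helper: shuffle unique patient IDs and cut them into three disjoint sets."""
    ids = np.array(sorted(set(patient_ids)))
    rng = np.random.default_rng(seed)
    rng.shuffle(ids)  # in-place shuffle, reproducible for a fixed seed

    n_val = int(round(len(ids) * validation_split))
    n_test = int(round(len(ids) * test_split))

    validation = ids[:n_val].tolist()
    test = ids[n_val:n_val + n_test].tolist()
    train = ids[n_val + n_test:].tolist()
    return {"train": train, "validation": validation, "test": test}


# Made-up patient IDs, for illustration only
example_ids = [f"p{i:03d}" for i in range(50)]
splits = split_patient_ids(example_ids)
print({name: len(ids) for name, ids in splits.items()})
```

Splitting by patient rather than by row avoids leaking one patient's history into both the training and the evaluation data.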
Examine patient data
From the data manager we can get a patient, for example the third patient ID.
patientid = dm.all_patientids[2]
patientid
Let's check out the patient's data. patient_data is a dictionary with two keys:
- "events": A pandas DataFrame containing all time-series events (original events and molecular data combined and sorted by date).
- "constant": A pandas DataFrame containing the static (constant) data for the patient.
patient_data = dm.get_patient_data(patientid)
patient_data["events"].head(20)
patient_data["constant"]
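To make the shape of this dictionary concrete, here is a minimal stand-in built with pandas alone. The two keys match the description above, but the event column names (`date`, `event_category`, `event_name`, `event_value`) and all values are hypothetical; inspect the real DataFrames to see the actual schema.

```python
import pandas as pd

# Hypothetical column names and values, for illustration only
events = pd.DataFrame(
    {
        "date": pd.to_datetime(["2020-01-05", "2020-01-05", "2020-02-01"]),
        "event_category": ["lot", "lab", "lab"],
        "event_name": ["line_1", "hemoglobin", "hemoglobin"],
        "event_value": ["started", "12.1", "11.4"],
    }
)
constant = pd.DataFrame({"birthyear": [1956], "gender": ["female"]})

# Same two-key layout as described above: time-series events plus static data
patient_data_sketch = {"events": events, "constant": constant}

# Example: pull out only the lab events, sorted by date
lab_events = (
    patient_data_sketch["events"]
    .query("event_category == 'lab'")
    .sort_values("date")
)
print(lab_events)
```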
Convert patient data to string
We start by generating random "splits" in the patient trajectory using the unified DataSplitter.get_splits_from_patient_with_target method. Each patient trajectory can yield multiple samples (e.g. depending on when a therapy started), and each sample can target different variables (e.g. neutrophils or hemoglobin for forecasting).
Since we only provided a forecasting splitter, events_splits will be None.
forecasting_splits, events_splits, reference_dates = data_splitter.get_splits_from_patient_with_target(
    patient_data,
    forecasting_nr_samples_per_split=4,
    forecasting_filter_outliers=False,
    max_num_splits_per_split_event=2,
)
# Note: events_splits is None here because only a forecasting splitter was provided
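Conceptually, a forecasting split divides a trajectory at a split date into the history the model sees and the future values it has to predict, with the future capped by a maximum horizon (90 days in the setup above). The pandas sketch below shows that idea only. It is not how TwinWeaver computes its splits; the function name, the `date` column, and the choice to include the split date in the history are all assumptions.

```python
import pandas as pd


def split_around_date(events, split_date, max_horizon=pd.Timedelta(days=90), date_col="date"):
    """Hypothetical helper: return (history, targets) around a split date."""
    # Assumption: events on the split date itself count as history
    history = events[events[date_col] <= split_date]
    in_window = (events[date_col] > split_date) & (
        events[date_col] <= split_date + max_horizon
    )
    return history, events[in_window]


# Made-up lab values, for illustration only
example_events = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2020-01-01", "2020-01-15", "2020-02-10", "2020-03-30", "2020-06-01"]
        ),
        "hemoglobin": [12.5, 12.1, 11.4, 11.0, 10.8],
    }
)

history, targets = split_around_date(example_events, pd.Timestamp("2020-01-15"))
print(len(history), "history rows;", len(targets), "target rows")
```

The final measurement falls outside the 90-day window, so it appears in neither the history nor the targets.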
Now we can generate the formatted strings for each split. Because events_splits is None (we only provided a forecasting splitter), we pass None for event_splits.
split_idx = 0
p_converted = converter.forward_conversion(
    forecasting_splits=forecasting_splits[split_idx],
    event_splits=None,  # Not needed for forecasting-only splitter
)
p_converted is a dictionary containing the final formatted data:
- 'instruction': The complete input string for the model (context + multi-task prompt).
- 'answer': The complete target string for the model (multi-task answer).
- 'meta': A dictionary holding metadata including patient ID, structured constant and history data used, split date, combined metadata from sub-converters, and a list of detailed metadata for each individual task generated ('target_meta_detailed').
print(p_converted["instruction"])
print(p_converted["answer"])
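To turn every split into a training record rather than just the first one, the conversion can be looped over all splits. The sketch below is one way to do that, assuming that forecasting_splits is a sequence (it is indexed with split_idx above) and that each conversion returns a dictionary with 'instruction' and 'answer' keys. The helpers `build_records` and `to_jsonl` are hypothetical, and a stand-in converter is used so the example runs without TwinWeaver.

```python
import json


def build_records(forecasting_splits, convert):
    """Hypothetical helper: convert each split and keep the prompt/target pair."""
    records = []
    for split in forecasting_splits:
        converted = convert(split)
        records.append(
            {"instruction": converted["instruction"], "answer": converted["answer"]}
        )
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)


def fake_convert(split):
    """Stand-in for the real converter call, for illustration only."""
    return {
        "instruction": f"context for {split}",
        "answer": f"target for {split}",
        "meta": {},
    }


records = build_records(["split_a", "split_b"], fake_convert)
print(to_jsonl(records))
```

With the objects from this notebook, `convert` could be `lambda s: converter.forward_conversion(forecasting_splits=s, event_splits=None)`, mirroring the call shown above.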