Forecasting-Only Example: Training Data Generation with the Unified DataSplitter API

Start by importing the required libraries.

import pandas as pd

from twinweaver import (
    DataSplitterForecasting,
    DataSplitter,
    DataManager,
    ConverterInstruction,
    Config,
)

Basic Setup

Load the example data. This shows how to use a custom dataset, here read from the example_data directory.

df_events = pd.read_csv("../../example_data/events.csv")
df_constant = pd.read_csv("../../example_data/constant.csv")
df_constant_description = pd.read_csv("../../example_data/constant_description.csv")

Set up the data manager and the forecasting-only pipeline using the unified DataSplitter API. By passing only data_splitter_forecasting, the unified interface handles the forecasting-only case automatically.

config = Config()  # Override values here to customize pipeline

# <---------------------- CRITICAL CONFIGURATION ---------------------->
# 1. Event category used for data splitting (e.g., split data around Lines of Therapy 'lot')
# Has to be set for all instruction tasks
config.split_event_category = "lot"

# 2. List of event categories we want to forecast (e.g., forecasting 'lab' values)
# Only needs to be set if you want to forecast variables
config.event_category_forecast = ["lab"]

# No time-to-event tasks are configured in this forecasting-only example

# Constant setup
config.constant_columns_to_use = ["birthyear", "gender", "histology", "smoking_history"]  # Manually set from constant
config.constant_birthdate_column = "birthyear"

dm = DataManager(config=config)
dm.load_indication_data(df_events=df_events, df_constant=df_constant, df_constant_description=df_constant_description)
dm.process_indication_data()
dm.setup_unique_mapping_of_events()
dm.setup_hold_out_sets(validation_split=0.1, test_split=0.1)
dm.infer_var_types()


data_splitter_forecasting = DataSplitterForecasting(
    data_manager=dm,
    config=config,
    max_forecasted_trajectory_length=pd.Timedelta(days=90),
)
# If you want to manually override the variables selected for forecasting, you can skip the next line.
data_splitter_forecasting.setup_statistics()

# Use the unified DataSplitter API with only the forecasting splitter
data_splitter = DataSplitter(data_splitter_forecasting=data_splitter_forecasting)

converter = ConverterInstruction(
    nr_tokens_budget_total=8192,
    config=config,
    dm=dm,
    variable_stats=data_splitter_forecasting.variable_stats,  # Optional, needed for forecasting QA tasks
)
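
The call to setup_hold_out_sets above requests a 10% validation and a 10% test hold-out. How the library performs this split internally is not shown here. The block below is only a conceptual sketch of a patient-level hold-out, assuming the split is made over patient IDs; split_patient_ids and its seed parameter are hypothetical and not part of twinweaver.

```python
import random


def split_patient_ids(patient_ids, validation_split=0.1, test_split=0.1, seed=0):
    """Hypothetical helper: partition patient IDs into train/validation/test lists."""
    # De-duplicate and sort first so the shuffle is reproducible for a given seed
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)

    n_val = round(len(ids) * validation_split)
    n_test = round(len(ids) * test_split)

    val_ids = ids[:n_val]
    test_ids = ids[n_val:n_val + n_test]
    train_ids = ids[n_val + n_test:]
    return train_ids, val_ids, test_ids


# Toy patient IDs (made up for illustration)
toy_ids = [f"p{i}" for i in range(50)]
train_ids, val_ids, test_ids = split_patient_ids(toy_ids)
print(len(train_ids), len(val_ids), len(test_ids))
```

Splitting by patient rather than by individual sample keeps all samples derived from one patient in the same partition, which avoids leakage between training and evaluation data.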

Examine patient data

From the data manager we can get a patient ID, for example the third entry of dm.all_patientids.

patientid = dm.all_patientids[2]
patientid

Let's check out the patient's data. patient_data is a dictionary with two keys:

"events": A pandas DataFrame containing all time-series events (original events and molecular data combined and sorted by date).
"constant": A pandas DataFrame containing the static (constant) data for the patient.

patient_data = dm.get_patient_data(patientid)
patient_data["events"].head(20)

patient_data["constant"]

Convert patient data to string

We start by generating random "splits" in the patient trajectory using the unified DataSplitter.get_splits_from_patient_with_target method. Each patient trajectory can yield multiple relevant samples (e.g. depending on when a therapy started), and each sample can target different variables (e.g. neutrophils, hemoglobin, ... for forecasting).

Since we only provided a forecasting splitter, events_splits will be None.
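
To make the idea of a split concrete, here is a conceptual sketch that divides a list of dated events into a history part and a bounded forecasting target part. It is not how twinweaver implements splitting. The helper split_events, the tuple layout, the toy values, and the choice to count events on the split date as history are all assumptions made for illustration; the 90-day window mirrors the max_forecasted_trajectory_length used earlier.

```python
from datetime import date, timedelta


def split_events(events, split_date, max_horizon=timedelta(days=90)):
    """Hypothetical helper: divide (date, variable, value) tuples around a split date."""
    # Assumption: events on the split date itself belong to the history
    history = [e for e in events if e[0] <= split_date]
    # Targets are events strictly after the split date, up to the horizon end
    horizon_end = split_date + max_horizon
    targets = [e for e in events if split_date < e[0] <= horizon_end]
    return history, targets


# Toy events (values made up for illustration)
events = [
    (date(2020, 1, 5), "hemoglobin", 13.1),
    (date(2020, 2, 1), "hemoglobin", 12.4),
    (date(2020, 3, 15), "hemoglobin", 11.9),
    (date(2020, 9, 1), "hemoglobin", 12.8),
]
history, targets = split_events(events, split_date=date(2020, 2, 1))
print(len(history), len(targets))
```

Choosing a different split date for the same patient yields a different history and target pair, which is why one trajectory can produce several training samples.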

forecasting_splits, events_splits, reference_dates = data_splitter.get_splits_from_patient_with_target(
    patient_data,
    forecasting_nr_samples_per_split=4,
    forecasting_filter_outliers=False,
    max_num_splits_per_split_event=2,
)
# Note: events_splits is None here because only a forecasting splitter was provided

Now for each split, we can generate the formatted strings. Note that events_splits is None since we only provided a forecasting splitter, so we pass None for event_splits.

split_idx = 0
p_converted = converter.forward_conversion(
    forecasting_splits=forecasting_splits[split_idx],
    event_splits=None,  # Not needed for forecasting-only splitter
)

p_converted is a dictionary containing the final formatted data:

'instruction': The complete input string for the model (context + multi-task prompt).
'answer': The complete target string for the model (multi-task answer).
'meta': A dictionary holding metadata including patient ID, structured constant and history data used, split date, combined metadata from sub-converters, and a list of detailed metadata for each individual task generated ('target_meta_detailed').

print(p_converted["instruction"])

print(p_converted["answer"])
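
Since the goal is training data generation, one way to persist the converted samples is a JSON Lines file with one instruction/answer pair per line. The sketch below is an assumption about a convenient storage format, not something the library prescribes. write_jsonl and the toy records are hypothetical, and the 'meta' entry is left out because it may hold objects that are not JSON-serializable.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Hypothetical helper: write instruction/answer pairs, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            # Keep only the two text fields; 'meta' is skipped on purpose
            row = {"instruction": rec["instruction"], "answer": rec["answer"]}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


# Toy records standing in for dictionaries returned by forward_conversion
toy_records = [
    {"instruction": "context and prompt 1", "answer": "target 1", "meta": {}},
    {"instruction": "context and prompt 2", "answer": "target 2", "meta": {}},
]
out_path = os.path.join(tempfile.gettempdir(), "train_samples.jsonl")
write_jsonl(toy_records, out_path)

# Read the file back to confirm the round trip
with open(out_path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(len(loaded))
```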