Data Splitting¶
Data splitting is the process of dividing a patient's longitudinal timeline into input (history) and target (future) segments. Each split produces a training example for the LLM: the history becomes the prompt context, and the target becomes the expected completion.
TwinWeaver provides specialized splitters for two complementary clinical prediction tasks:
| Splitter | Task | Example |
|---|---|---|
DataSplitterForecasting |
Forecasting continuous or categorical variables | Predict hemoglobin values over the next 90 days |
DataSplitterEvents |
Landmark event prediction (time-to-event) | Did the patient progress within 52 weeks? |
A unified DataSplitter interface combines both, ensuring they share the same split dates for multi-task training.
Core Concept: Split Dates¶
Every split revolves around a split date — the moment in time that separates what the model can see (input) from what it must predict (target).
Patient timeline
──────────────────────────────────────────────────────► time
LoT Start Split Date Forecast Horizon
│ │ │
▼ ▼ ▼
INPUT (history) │ TARGET (future) │
events ≤ split │ events > split │
How Split Dates Are Chosen¶
Split dates are anchored to split events — a configurable event category (typically Line of Therapy, "lot"). The framework:
- Finds all split-event start dates in the patient's history (e.g., every LoT start).
- Identifies candidate dates within a window around each split event (controlled by
max_split_length_after_split_event, default 90 days). - Randomly samples one or more candidate dates per split event (
max_num_splits_per_split_event).
This anchoring ensures that training examples are centered on clinically meaningful time points rather than arbitrary dates.
Forecasting Splits¶
The DataSplitterForecasting generates tasks where the model must predict future values of specific variables (e.g., lab results, biomarker levels).
How It Works¶
flowchart TD
A[Patient Data] --> B[Find valid split dates<br>anchored to split events]
B --> C[For each date, identify<br>forecastable variables]
C --> D{Variable has enough<br>data before AND after?}
D -->|Yes| E[Add to candidate pool]
D -->|No| F[Skip variable]
E --> G[Sample 1 or multiple variables<br>weighted by score]
G --> H[Create split:<br>history + target values]
H --> I[Optional: filter<br>outliers 3-sigma]
For each candidate split date, the forecasting splitter:
- Checks variable eligibility: A variable is valid at a given date only if it has at least
min_nr_variable_seen_previouslyoccurrences in the lookback window andmin_nr_variable_seen_afteroccurrences in the forecast window. - Samples variables: Between
min_nr_variables_to_sampleandmax_nr_variables_to_samplevariables are selected per task, using weighted proportional sampling based on pre-computed statistics (optionally uniform sampling). - Creates the split: Events before the split date form the input; future values of the sampled variables (within
max_forecast_time_for_value) form the target. - Filters future LoT overlap: Target events occurring after the next Line of Therapy start are excluded to avoid data leakage.
Variable Statistics & Sampling¶
Before generating splits, calling setup_statistics() computes a baseline predictability score for each variable using a simple copy-forward strategy (predicting the next value as the previous one). The computed metrics include:
| Metric | Description |
|---|---|
| R² | Coefficient of determination for the copy-forward baseline |
| NRMSE | Normalized Root Mean Squared Error |
| MAPE | Mean Absolute Percentage Error |
score_log_nrmse_n_samples |
Combined score used for weighted sampling (default) |
Variables with higher variability (harder to predict with copy-forward) receive higher sampling weights, encouraging the model to focus on learning patterns for more dynamic biomarkers. Categorical variables are sampled uniformly using the mean score of numeric variables.
Numeric vs. Categorical Variables
TwinWeaver automatically detects variable types via DataManager.infer_var_types(). Numeric variables get full statistical analysis; categorical variables receive placeholder statistics and uniform sampling weights.
Outlier Filtering¶
When filter_outliers=True, the 3-sigma strategy clips target values to the \([\mu - 3\sigma, \mu + 3\sigma]\) range based on training-set statistics. This prevents extreme outliers from dominating the training signal.
Key Parameters¶
data_splitter_forecasting = DataSplitterForecasting(
data_manager=dm,
config=config,
max_split_length_after_split_event=pd.Timedelta(days=90), # Window after split event
max_lookback_time_for_value=pd.Timedelta(days=90), # Lookback for variable history
max_forecast_time_for_value=pd.Timedelta(days=90), # Forecast horizon
min_nr_variable_seen_previously=1, # Min past occurrences
min_nr_variable_seen_after=1, # Min future occurrences
min_nr_variables_to_sample=1, # Min variables per task
max_nr_variables_to_sample=3, # Max variables per task
filtering_strategy="3-sigma", # Outlier handling
sampling_strategy="proportional", # Weighted or uniform sampling
)
Event Prediction Splits¶
The DataSplitterEvents generates landmark event prediction tasks — predicting whether a discrete clinical event (e.g., death, disease progression) occurs within a randomly sampled future time window.
How It Works¶
flowchart TD
A[Patient Data] --> B[Find valid split dates<br>anchored to split events]
B --> C[For each date, sample<br>an event category]
C --> D[Randomly sample a<br>prediction window]
D --> E{Event occurred<br>within window <br> and before censoring event?}
E -->|Yes| F[occurred = True]
E -->|No| G{Censored by<br>next LoT or data end?}
G -->|Next LoT| H[censored = new_therapy_start]
G -->|End of data| I[censored = end_of_data]
G -->|No censoring| J[censored = None<br>Event truly did not occur]
F --> K[Create DataSplitterEventsOption]
H --> K
I --> K
J --> K
For each candidate split date, the event splitter:
- Samples an event category from the configured mapping (e.g.,
"death"or"progression"), avoiding duplicate categories per split. - Samples a prediction window of random duration between
min_length_to_sample(default: 1 week) andmax_length_to_sample(default: 104 weeks). This trains the model to handle variable-length horizons. - Determines the outcome:
- Occurred: The event was observed within the window before any censoring events.
- Censored: The observation was cut short by a new therapy start, end of data, or a data cutoff date.
- Not occurred: The event genuinely did not happen within the window (e.g., the patient is known to be alive at the end of the window).
- Handles backup categories: If the exact event category is absent, the splitter can fall back to a backup (e.g., using
"death"as a proxy for"progression"events).
Key Parameters¶
data_splitter_events = DataSplitterEvents(
data_manager=dm,
config=config,
max_length_to_sample=pd.Timedelta(weeks=104), # Max prediction window
min_length_to_sample=pd.Timedelta(weeks=1), # Min prediction window
unit_length_to_sample="weeks", # Window sampling unit
max_split_length_after_split_event=pd.Timedelta(days=90), # Window after split event
)
Configuration¶
The event-to-prediction mapping is configured via:
config.data_splitter_events_variables_category_mapping = {
"death": "death", # event_category → descriptive name in prompt
"progression": "next progression", # custom prompt label
}
Combined Splitting with DataSplitter¶
The DataSplitter class provides a unified interface that coordinates both splitters. This is the recommended approach for generating multi-task training data, as it ensures forecasting and event prediction tasks share the same split dates.
Training Workflow¶
from twinweaver import DataSplitter
data_splitter = DataSplitter(data_splitter_events, data_splitter_forecasting)
# Generate aligned splits for both tasks
forecasting_splits, events_splits, reference_dates = \
data_splitter.get_splits_from_patient_with_target(patient_data)
Internally, get_splits_from_patient_with_target:
- Calls
DataSplitterForecasting.get_splits_from_patient()to determine split dates and generate forecasting tasks. - Passes those same split dates (
reference_dates) toDataSplitterEvents.get_splits_from_patient()to generate aligned event prediction tasks.
This alignment is critical: both task types see the same patient history up to the same point in time, enabling consistent multi-task learning.
Inference Workflow¶
For inference, use get_splits_from_patient_inference, which anchors the split at the last available date in the patient's record:
forecast_split, events_split = data_splitter.get_splits_from_patient_inference(
patient_data,
inference_type="both", # "forecasting", "events", or "both"
forecasting_override_variables_to_predict=["HGB", "WBC"],
events_override_category="death",
events_override_observation_time_delta=pd.Timedelta(weeks=52),
)
How Multiple Training Examples Are Generated¶
A single patient can yield many training examples through several sources of variation:
| Source of Variation | Controlled By | Effect |
|---|---|---|
| Multiple split events (e.g., LoTs) | Patient history | One split per LoT by default |
| Multiple dates per split event | max_num_splits_per_split_event |
Random dates within the LoT window |
| Different variable subsets | min/max_nr_variables_to_sample |
Different forecasting questions per date |
| Different event categories | data_splitter_events_variables_category_mapping |
Death vs. progression predictions |
| Different prediction windows | min/max_length_to_sample |
1-week to 104-week horizons |
This diversity encourages the model to generalize across time points, variables, and prediction tasks.
End-to-End Example¶
import pandas as pd
from twinweaver import (
DataManager, Config,
DataSplitterForecasting, DataSplitterEvents,
DataSplitter, ConverterInstruction,
)
# 1. Configure
config = Config()
config.split_event_category = "lot"
config.event_category_forecast = ["lab"]
config.data_splitter_events_variables_category_mapping = {
"death": "death",
"progression": "next progression",
}
# 2. Load and process data
dm = DataManager(config=config)
dm.load_indication_data(df_events=df_events, df_constant=df_constant,
df_constant_description=df_constant_description)
dm.process_indication_data()
dm.setup_unique_mapping_of_events()
dm.setup_dataset_splits()
dm.infer_var_types()
# 3. Initialize splitters
data_splitter_events = DataSplitterEvents(dm, config=config)
data_splitter_events.setup_variables()
data_splitter_forecasting = DataSplitterForecasting(data_manager=dm, config=config)
data_splitter_forecasting.setup_statistics() # Compute variable scores
data_splitter = DataSplitter(data_splitter_events, data_splitter_forecasting)
# 4. Generate splits for a patient
patient_data = dm.get_patient_data(dm.all_patientids[0])
forecasting_splits, events_splits, reference_dates = \
data_splitter.get_splits_from_patient_with_target(patient_data)
# 5. Convert to text
converter = ConverterInstruction(
nr_tokens_budget_total=8192, config=config, dm=dm,
variable_stats=data_splitter_forecasting.variable_stats,
)
result = converter.forward_conversion(
forecasting_splits=forecasting_splits[0],
event_splits=events_splits[0],
override_mode_to_select_forecasting="both",
)
print(result["instruction"][:500])
print(result["answer"])
What's Next?¶
- Dataset Format: Understand the expected input data structure
- Framework Overview: Learn about TwinWeaver's architecture and task types
- Data Preparation Tutorial: Step-by-step notebook walkthrough
- Custom Splitting (Training): Advanced splitting with individual splitters
- API Reference — Data Splitters: Full API documentation