Example for converting patient data to instruction data¶
This notebook demonstrates the core workflow for converting raw clinical data into Instruction Tuning examples for the TwinWeaver model.
We will walk through the process of:
- Loading Data: Importing raw tabular data (longitudinal events and static demographics).
- Configuration: Setting up the pipeline to match your data schema.
- Splitting: Generating "Splits" (input/output samples) from a patient's timeline.
- Conversion: Transforming these splits into text-based Instruction (Input) and Answer (Target) pairs suitable for fine-tuning an LLM.
import pandas as pd
from twinweaver import (
    DataManager,
    Config,
    DataSplitterForecasting,
    DataSplitterEvents,
    ConverterInstruction,
    DataSplitter,
)
Basic Setup¶
Load Data¶
We require three standardized dataframes to construct the patient digital twin:
- Events (df_events): The longitudinal history in 'long' format (one row per event; a minimal illustration follows this list). Required columns:
  - patientid: Unique patient identifier.
  - date: Date of the event.
  - event_category: High-level grouping (e.g., 'lab', 'drug', 'condition', 'lot').
  - event_name: Specific variable name (e.g., 'Hemoglobin', 'Metformin').
  - event_value: The result/value (e.g., '12.5', 'Start').
  - event_descriptive_name: Natural language description used in the text prompt.
- Constant (df_constant): Static patient information (one row per patient). Contains demographics like birth year, gender, and histology.
- Constant Description (df_constant_description): Metadata mapping constant columns to natural language descriptions. Columns: variable, comment.
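For orientation, here is a minimal sketch of the expected events schema, built inline with pandas. The values are purely illustrative; the notebook itself loads generated example data from CSV below.

# Illustrative only: a tiny df_events matching the required schema
example_events = pd.DataFrame(
    {
        "patientid": ["P001", "P001"],
        "date": ["2020-01-15", "2020-02-03"],
        "event_category": ["lab", "drug"],
        "event_name": ["Hemoglobin", "Metformin"],
        "event_value": ["12.5", "Start"],
        "event_descriptive_name": ["Hemoglobin (g/dL)", "Metformin therapy"],
    }
)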
# Load data - generated example data
df_events = pd.read_csv("./example_data/events.csv")
df_constant = pd.read_csv("./example_data/constant.csv")
df_constant_description = pd.read_csv("./example_data/constant_description.csv")
Configuration and Data Manager¶
We initialize the Config object, which serves as the central control for column mapping, token limits, and prompt templates. You can override defaults here to match your specific dataset schema (e.g., specifying which columns in df_constant to include).
The DataManager then ingests the raw dataframes, handling preprocessing steps like date parsing, unique event mapping, and train/test splitting at the patient level.
config = Config()  # Override values here to customize pipeline
config.constant_columns_to_use = [
    "birthyear",
    "gender",
    "histology",
    "smoking_history",
]  # Manually set from constant DF
config.constant_birthdate_column = "birthyear"
dm = DataManager(config=config)
dm.load_indication_data(
    df_events=df_events,
    df_constant=df_constant,
    df_constant_description=df_constant_description,
)
dm.process_indication_data()  # Preprocessing, including date parsing
dm.setup_unique_mapping_of_events()  # Unique event mapping
dm.setup_dataset_splits()  # Train/test split at the patient level
dm.infer_var_types()  # Infer variable types
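As a quick sanity check, we can look at how many patients were ingested (assuming here that dm.all_patientids, which we also use later in this notebook, behaves as a standard sequence):

# Number of patients available after preprocessing
len(dm.all_patientids)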
Initialize Splitters and Converter¶
To generate diverse training examples, we use specialized Data Splitters:
- DataSplitterEvents: Identifies time points for predicting discrete outcomes (e.g., progression, death).
- DataSplitterForecasting: Identifies time points for forecasting continuous variables (e.g., future lab values).
The ConverterInstruction is the core engine that transforms these data points into tokenized text. It respects a token budget (e.g., 8192 tokens) to ensure the generated prompts fit within the model's context window.
# This data splitter handles event prediction tasks
data_splitter_events = DataSplitterEvents(dm, config=config)
data_splitter_events.setup_variables()
# This data splitter handles forecasting tasks
data_splitter_forecasting = DataSplitterForecasting(
    data_manager=dm,
    config=config,
)
# If you don't want to do forecasting QA, proportional sampling, or 3-sigma filtering, you can skip this step
data_splitter_forecasting.setup_statistics()
# We will also use the easier interface that combines both data splitters
data_splitter = DataSplitter(data_splitter_events, data_splitter_forecasting)
# Set up the converter instruction
converter = ConverterInstruction(
    nr_tokens_budget_total=8192,
    config=config,
    dm=dm,
    variable_stats=data_splitter_forecasting.variable_stats,  # Optional, needed for forecasting QA tasks
)
Examine patient data¶
From the data manager we can retrieve a patient, for example the one with this patientid.
patientid = dm.all_patientids[4]
patientid
Let's check out the patient's data. patient_data is a dictionary with two keys:
- "events": A pandas DataFrame containing all time-series events (original events and molecular data combined and sorted by date).
- "constant": A pandas DataFrame containing the static (constant) data for the patient.
patient_data = dm.get_patient_data(patientid)
patient_data["events"].head(20)
patient_data["constant"]
Convert patient data to string¶
Generate Training Splits¶
A single patient's timeline can yield multiple training examples. A Split represents a specific point in time (the "split date") where we divide the data:
- Input: History before the split date.
- Target: Future events or values after the split date.
The get_splits_from_patient_with_target method samples valid split points (anchored to randomly sampled times around the lines of therapy) and determines appropriate targets (forecasting vs. event prediction). This allows the model to learn from various stages of a patient's journey.
forecasting_splits, events_splits, reference_dates = data_splitter.get_splits_from_patient_with_target(
    patient_data,
)
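Before converting, it can be useful to see how many splits were sampled for this patient. This assumes, as the indexing in the next cell suggests, that the returned split containers are standard sequences:

# How many split points were sampled, and their reference dates
print(len(forecasting_splits), len(events_splits))
reference_dates["date"]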
Now, for each split, we can generate the instruction/answer strings. We pick the first one as an example.
split_idx = 0
p_converted = converter.forward_conversion(
    forecasting_splits=forecasting_splits[split_idx],
    event_splits=events_splits[split_idx],
    override_mode_to_select_forecasting="both",
)
Inspect the Output¶
The forward_conversion method returns p_converted, a dictionary containing the final LLM training example:
- instruction: The full text prompt. It includes the patient's history (demographics + events) followed by the specific task questions.
- answer: The target completion string containing the correct answers for the tasks.
- meta: Structured metadata used to generate the text, useful for debugging or evaluation.
print(p_converted["instruction"])
print(p_converted["answer"])
Reverse Conversion: Text to Structured Data¶
Finally, we demonstrate the Reverse Conversion process. This is the inverse of the instruction generation step. It takes the text string (which would be generated by the model during inference) and parses it back into structured pandas DataFrames.
This capability is crucial for:
- Evaluation: Comparing the model's text predictions against ground truth data programmatically.
- Integration: Converting the model's narrative outputs back into downstream clinical systems or dashboards.
In this example, we take the answer string we just generated and confirm it can be reconstructed into a structured dataframe using the reverse_conversion method.
date = reference_dates["date"][0]
return_list = converter.reverse_conversion(p_converted["answer"], dm, date)
return_list[0]["result"]
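Putting it all together, a full instruction-tuning dataset would repeat this per-patient workflow. The sketch below only reuses the calls demonstrated above and assumes each split container is a sequence aligned one-to-one across forecasting and event splits; it is an illustration, not a prescribed TwinWeaver API:

# Sketch: build (instruction, answer) pairs over all patients and splits.
# Assumes forecasting_splits and events_splits align one-to-one per split index.
training_examples = []
for pid in dm.all_patientids:
    p_data = dm.get_patient_data(pid)
    f_splits, e_splits, _ = data_splitter.get_splits_from_patient_with_target(p_data)
    for f_split, e_split in zip(f_splits, e_splits):
        converted = converter.forward_conversion(
            forecasting_splits=f_split,
            event_splits=e_split,
            override_mode_to_select_forecasting="both",
        )
        training_examples.append(
            {"instruction": converted["instruction"], "answer": converted["answer"]}
        )
len(training_examples)

In practice you would restrict this loop to the training partition produced by setup_dataset_splits, so that held-out patients remain untouched for evaluation.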