Data Preprocessing: From Raw Clinical Data to TwinWeaver Format¶
This tutorial demonstrates how to transform raw clinical data into the standardized TwinWeaver format required for training digital twin models.
Key Principles:
- Include as much data as possible - We aim to capture all available clinical information first, then trim down if needed during data generation.
- Prefer events over constants - Longitudinal data (events) provides richer temporal context than static data (constants). Put as much as possible into the events dataframe.
We will cover:
- Creating synthetic raw clinical data (simulating real-world EHR exports)
- Transforming raw data into the three required TwinWeaver dataframes:
- df_events: Longitudinal patient events in long format
- df_constant: Static patient demographics
- df_constant_description: Metadata describing constant columns
- Using preprocessing helper functions for data aggregation and column classification
- Converting the processed data into instruction-tuning format
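Before diving in, it helps to fix the target schema in mind. The following toy sketch (invented values, schema only) shows the shape of the three dataframes we will build in this tutorial:

```python
import pandas as pd

# Toy illustration of the three TwinWeaver dataframes (schema only, not real data)
df_events_example = pd.DataFrame(
    {
        "patientid": ["PT001", "PT001"],
        "date": pd.to_datetime(["2020-03-20", "2020-04-17"]),
        "event_category": ["lab", "lab"],
        "event_name": ["hemoglobin", "hemoglobin"],
        "event_value": ["13.5", "13.2"],
        "event_descriptive_name": ["hemoglobin", "hemoglobin"],
    }
)

df_constant_example = pd.DataFrame(
    {"patientid": ["PT001"], "birthyear": [1958], "gender": ["Male"]}
)

df_constant_description_example = pd.DataFrame(
    {
        "variable": ["patientid", "birthyear", "gender"],
        "comment": ["Unique patient identifier", "Year of birth", "Gender of the patient"],
    }
)
```

Every row of `df_events` is one observation at one point in time (long format), while `df_constant` has exactly one row per patient.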
import pandas as pd
from twinweaver import (
DataManager,
Config,
DataSplitterForecasting,
DataSplitterEvents,
ConverterInstruction,
DataSplitter,
identify_constant_and_changing_columns,
aggregate_events_to_weeks,
)
1. Create Synthetic Raw Clinical Data¶
In real-world scenarios, you would receive data exports from electronic health records (EHR), clinical trial databases, or other clinical data sources. These typically come as wide-format tables with mixed static and longitudinal information.
We'll create two raw dataframes simulating a typical oncology dataset:
- Raw Patient Demographics: Contains static information like birth year, gender, and diagnosis details
- Raw Clinical Observations: Contains longitudinal data like lab results, treatments, and clinical assessments
# Raw Patient Demographics DataFrame
# This simulates a typical patient registry export with static information
raw_demographics = pd.DataFrame(
{
"patient_id": ["PT001", "PT002", "PT003", "PT004", "PT005"],
"birth_year": [1958, 1965, 1972, 1949, 1961],
"sex": ["Male", "Female", "Male", "Female", "Male"],
"cancer_type": ["NSCLC", "NSCLC", "NSCLC", "NSCLC", "NSCLC"],
"histology": [
"Adenocarcinoma",
"Squamous Cell Carcinoma",
"Adenocarcinoma",
"Adenocarcinoma",
"Squamous Cell Carcinoma",
],
"smoking_status": ["Former", "Never", "Current", "Former", "Current"],
"diagnosis_date": ["2020-03-15", "2020-06-22", "2021-01-10", "2019-11-05", "2020-09-18"],
"stage_at_diagnosis": ["IIIB", "IV", "IIIA", "IV", "IIIB"],
"egfr_status": ["Wild Type", "Wild Type", "L858R Mutation", "Wild Type", "Wild Type"],
"alk_status": ["Wild Type", "Wild Type", "Wild Type", "Rearrangement", "Wild Type"],
"pdl1_expression": ["50-100%", "1-49%", "<1%", "1-49%", "50-100%"],
# Death information: some patients died, others are censored (alive at last follow-up)
"death_status": ["Deceased", "Alive", "Alive", "Deceased", "Alive"],
"death_date": ["2021-02-10", None, None, "2020-08-15", None], # None for alive patients
}
)
print("Raw Demographics DataFrame:")
raw_demographics
# Raw Clinical Observations DataFrame
# This simulates longitudinal clinical data with labs, vitals, treatments, and outcomes
raw_observations = pd.DataFrame(
{
"patient_id": (
["PT001"] * 16  # Patient PT001 - multiple visits
+ ["PT002"] * 14  # Patient PT002 - multiple visits
+ ["PT003"] * 12  # Patient PT003 - multiple visits
+ ["PT004"] * 14  # Patient PT004 - multiple visits
+ ["PT005"] * 10  # Patient PT005 - multiple visits
),
"visit_date": (
# PT001: Baseline, Cycle 1, Cycle 2, Cycle 3 (4 observations each)
["2020-03-20"] * 4 + ["2020-04-17"] * 4 + ["2020-05-15"] * 4 + ["2020-06-12"] * 4
# PT002: Baseline (3), Cycle 1 (4), Cycle 2 (3), Cycle 3 (4)
+ ["2020-06-25"] * 3 + ["2020-07-23"] * 4 + ["2020-08-20"] * 3 + ["2020-09-17"] * 4
# PT003: Baseline, Cycle 1, Cycle 2, Cycle 3 (3 observations each)
+ ["2021-01-15"] * 3 + ["2021-02-12"] * 3 + ["2021-03-12"] * 3 + ["2021-04-09"] * 3
# PT004: Baseline (4), Cycle 1 (3), Cycle 2 (3), Cycle 3 (4)
+ ["2019-11-10"] * 4 + ["2019-12-08"] * 3 + ["2020-01-05"] * 3 + ["2020-02-02"] * 4
# PT005: Baseline (3), Cycle 1 (3), Cycle 2 (4)
+ ["2020-09-22"] * 3 + ["2020-10-20"] * 3 + ["2020-11-17"] * 4
),
"observation_type": [
# PT001
"hemoglobin",
"platelets",
"ecog",
"treatment_start",
"hemoglobin",
"platelets",
"ecog",
"drug_admin",
"hemoglobin",
"platelets",
"ecog",
"drug_admin",
"hemoglobin",
"platelets",
"ecog",
"response_assessment",
# PT002
"hemoglobin",
"platelets",
"treatment_start",
"hemoglobin",
"platelets",
"ecog",
"drug_admin",
"hemoglobin",
"platelets",
"drug_admin",
"hemoglobin",
"platelets",
"ecog",
"response_assessment",
# PT003
"hemoglobin",
"platelets",
"treatment_start",
"hemoglobin",
"platelets",
"drug_admin",
"hemoglobin",
"platelets",
"drug_admin",
"hemoglobin",
"platelets",
"response_assessment",
# PT004
"hemoglobin",
"platelets",
"ecog",
"treatment_start",
"hemoglobin",
"platelets",
"drug_admin",
"hemoglobin",
"platelets",
"drug_admin",
"hemoglobin",
"platelets",
"ecog",
"response_assessment",
# PT005
"hemoglobin",
"platelets",
"treatment_start",
"hemoglobin",
"platelets",
"drug_admin",
"hemoglobin",
"platelets",
"ecog",
"response_assessment",
],
"observation_value": [
# PT001 - stable patient
"13.5",
"285",
"1",
"Carboplatin/Pemetrexed/Pembrolizumab",
"13.2",
"278",
"1",
"Carboplatin/Pemetrexed/Pembrolizumab",
"12.8",
"265",
"1",
"Carboplatin/Pemetrexed/Pembrolizumab",
"12.9",
"270",
"0",
"Partial Response",
# PT002 - declining hemoglobin
"14.1",
"310",
"Carboplatin/Paclitaxel/Pembrolizumab",
"13.5",
"295",
"1",
"Carboplatin/Paclitaxel/Pembrolizumab",
"12.8",
"280",
"Carboplatin/Paclitaxel/Pembrolizumab",
"12.2",
"268",
"1",
"Stable Disease",
# PT003 - EGFR+ patient on targeted therapy
"14.8",
"245",
"Osimertinib",
"14.5",
"250",
"Osimertinib",
"14.3",
"248",
"Osimertinib",
"14.6",
"252",
"Partial Response",
# PT004 - ALK+ patient
"11.8",
"198",
"2",
"Alectinib",
"12.1",
"210",
"Alectinib",
"12.5",
"225",
"Alectinib",
"12.8",
"235",
"1",
"Partial Response",
# PT005 - IO monotherapy
"15.2",
"320",
"Pembrolizumab",
"14.9",
"315",
"Pembrolizumab",
"14.6",
"308",
"0",
"Complete Response",
],
"observation_unit": (
# PT001: g/dL for hemoglobin, 10^9/L for platelets, blank for the rest
["g/dL", "10^9/L", "", ""] * 4
# PT002
+ ["g/dL", "10^9/L", ""] + ["g/dL", "10^9/L", "", ""] + ["g/dL", "10^9/L", ""] + ["g/dL", "10^9/L", "", ""]
# PT003
+ ["g/dL", "10^9/L", ""] * 4
# PT004
+ ["g/dL", "10^9/L", "", ""] + ["g/dL", "10^9/L", ""] + ["g/dL", "10^9/L", ""] + ["g/dL", "10^9/L", "", ""]
# PT005
+ ["g/dL", "10^9/L", ""] * 2 + ["g/dL", "10^9/L", "", ""]
),
}
)
print("Raw Clinical Observations DataFrame:")
raw_observations.head(20)
2. Use Preprocessing Helpers to Understand Your Data¶
Before transforming the data, let's use the preprocessing helper functions to:
- Identify constant vs. changing columns - This helps decide what goes into df_constant vs df_events
- Aggregate events to weeks - This reduces noise from multiple observations on nearby days
# First, let's check which columns in our demographics data are truly constant
# We'll merge demographics with a simplified observations view to check
# Create a merged view for analysis
merged_for_analysis = raw_observations.merge(
raw_demographics[["patient_id", "birth_year", "sex", "histology", "smoking_status"]], on="patient_id", how="left"
)
# Identify constant vs changing columns
constant_cols, changing_cols = identify_constant_and_changing_columns(
merged_for_analysis, date_column="visit_date", patientid_column="patient_id"
)
print("Constant columns (same value across all visits for each patient):")
print(constant_cols)
print("\nChanging columns (values vary over time):")
print(changing_cols)
Why Put Most Data into Events?¶
Key Insight: Even data that appears "constant" (like biomarker status) is often better represented as events because:
- It has a specific date when it was measured
- It could potentially change over time (e.g., acquired resistance mutations)
- The temporal context of when information was known is clinically relevant
Rule of thumb: Only truly immutable patient characteristics (birth year, biological sex) should go in df_constant. Everything else should be an event!
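The distinction can be checked mechanically: a column is even a candidate for df_constant only if every patient has exactly one distinct value for it across all visits. A minimal hand-rolled pandas sketch of that check (the library's identify_constant_and_changing_columns helper implements this idea for you, with date handling on top):

```python
import pandas as pd

# Toy data: one column constant per patient, one changing over time
df = pd.DataFrame(
    {
        "patient_id": ["PT001", "PT001", "PT002", "PT002"],
        "birth_year": [1958, 1958, 1965, 1965],  # constant per patient
        "hemoglobin": [13.5, 13.2, 14.1, 13.5],  # changes over time
    }
)

# A column is "constant" if no patient ever has more than one distinct value
nunique_per_patient = df.groupby("patient_id").nunique()
constant_cols = [c for c in nunique_per_patient.columns if (nunique_per_patient[c] <= 1).all()]
changing_cols = [c for c in nunique_per_patient.columns if c not in constant_cols]

print(constant_cols)  # ['birth_year']
print(changing_cols)  # ['hemoglobin']
```

Even when this check flags a column as constant, remember the rule of thumb above: dated measurements like biomarkers still belong in events.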
def transform_to_events(raw_obs: pd.DataFrame, raw_demo: pd.DataFrame) -> pd.DataFrame:
"""
Transform raw clinical data into TwinWeaver events format.
The events dataframe has these required columns:
- patientid: Unique patient identifier
- date: Date of the event
- event_category: High-level grouping (e.g., 'lab', 'drug', 'lot', 'death')
- event_name: Specific variable name
- event_value: The result/value
- event_descriptive_name: Natural language description for prompts
- meta_data: Additional metadata (optional)
- source: Data source identifier (optional)
"""
events_list = []
# --- Process clinical observations ---
for _, row in raw_obs.iterrows():
patient_id = row["patient_id"]
visit_date = row["visit_date"]
obs_type = row["observation_type"]
obs_value = row["observation_value"]
obs_unit = row["observation_unit"]
# Map observation types to TwinWeaver categories
if obs_type == "hemoglobin":
events_list.append(
{
"patientid": patient_id,
"date": visit_date,
"event_category": "lab",
"event_name": "hemoglobin_-_718-7",
"event_value": obs_value,
"event_descriptive_name": "hemoglobin - 718-7",
"meta_data": f"Test: hemoglobin, Cleaned lab units: {obs_unit}",
"source": "clinical_observations",
}
)
elif obs_type == "platelets":
events_list.append(
{
"patientid": patient_id,
"date": visit_date,
"event_category": "lab",
"event_name": "platelets_-_26515-7",
"event_value": obs_value,
"event_descriptive_name": "platelets - 26515-7",
"meta_data": f"Test: platelets, Cleaned lab units: {obs_unit}",
"source": "clinical_observations",
}
)
elif obs_type == "ecog":
events_list.append(
{
"patientid": patient_id,
"date": visit_date,
"event_category": "ecog",
"event_name": "ecog",
"event_value": obs_value,
"event_descriptive_name": "ECOG Performance Status",
"meta_data": None,
"source": "clinical_observations",
}
)
elif obs_type == "treatment_start":
# Treatment start creates a Line of Therapy (LoT) event
events_list.append(
{
"patientid": patient_id,
"date": visit_date,
"event_category": "lot",
"event_name": "line_number",
"event_value": "1",
"event_descriptive_name": "line number",
"meta_data": None,
"source": "clinical_observations",
}
)
events_list.append(
{
"patientid": patient_id,
"date": visit_date,
"event_category": "lot",
"event_name": "line_name",
"event_value": obs_value,
"event_descriptive_name": "line of therapy",
"meta_data": None,
"source": "clinical_observations",
}
)
# Also add individual drug LoT start events
for drug in obs_value.split("/"):
events_list.append(
{
"patientid": patient_id,
"date": visit_date,
"event_category": "lot",
"event_name": drug.lower(),
"event_value": "LoT Start",
"event_descriptive_name": "LoT",
"meta_data": None,
"source": "clinical_observations",
}
)
elif obs_type == "drug_admin":
# Drug administration events
for drug in obs_value.split("/"):
events_list.append(
{
"patientid": patient_id,
"date": visit_date,
"event_category": "drug",
"event_name": drug.lower(),
"event_value": "administered",
"event_descriptive_name": drug.lower(),
"meta_data": obs_value,
"source": "clinical_observations",
}
)
elif obs_type == "response_assessment":
events_list.append(
{
"patientid": patient_id,
"date": visit_date,
"event_category": "response",
"event_name": "recist_response",
"event_value": obs_value,
"event_descriptive_name": "RECIST Response",
"meta_data": None,
"source": "clinical_observations",
}
)
# --- Process diagnosis and biomarker data from demographics ---
# These are events because they have a specific date and could change over time
for _, row in raw_demo.iterrows():
patient_id = row["patient_id"]
diagnosis_date = row["diagnosis_date"]
# Initial diagnosis event
events_list.append(
{
"patientid": patient_id,
"date": diagnosis_date,
"event_category": "main_diagnosis",
"event_name": "initial_diagnosis",
"event_value": row["cancer_type"],
"event_descriptive_name": "initial cancer diagnosis",
"meta_data": row["cancer_type"],
"source": "demographics",
}
)
# Stage at diagnosis
events_list.append(
{
"patientid": patient_id,
"date": diagnosis_date,
"event_category": "staging",
"event_name": "stage",
"event_value": row["stage_at_diagnosis"],
"event_descriptive_name": "Cancer Stage",
"meta_data": None,
"source": "demographics",
}
)
# Biomarker results (these go into events, not constants!)
events_list.append(
{
"patientid": patient_id,
"date": diagnosis_date,
"event_category": "basic_biomarker",
"event_name": "EGFR",
"event_value": row["egfr_status"],
"event_descriptive_name": "EGFR",
"meta_data": "NGS",
"source": "demographics",
}
)
events_list.append(
{
"patientid": patient_id,
"date": diagnosis_date,
"event_category": "basic_biomarker",
"event_name": "ALK",
"event_value": row["alk_status"],
"event_descriptive_name": "ALK",
"meta_data": "NGS",
"source": "demographics",
}
)
events_list.append(
{
"patientid": patient_id,
"date": diagnosis_date,
"event_category": "biomarker_ihc",
"event_name": "PD-L1",
"event_value": row["pdl1_expression"],
"event_descriptive_name": "PD-L1 Expression (TPS)",
"meta_data": "IHC 22C3",
"source": "demographics",
}
)
# --- Process death events ---
# Death is a time-to-event outcome that occurs at a specific date
if row["death_status"] == "Deceased" and pd.notna(row["death_date"]):
events_list.append(
{
"patientid": patient_id,
"date": row["death_date"],
"event_category": "death",
"event_name": "death",
"event_value": "Yes",
"event_descriptive_name": "Death",
"meta_data": None,
"source": "demographics",
}
)
# Create DataFrame and sort by patient and date
df_events = pd.DataFrame(events_list)
df_events["date"] = pd.to_datetime(df_events["date"])
df_events = df_events.sort_values(["patientid", "date"]).reset_index(drop=True)
return df_events
# Transform the data
df_events = transform_to_events(raw_observations, raw_demographics)
print(f"Created events DataFrame with {len(df_events)} events")
print(f"Unique patients: {df_events['patientid'].nunique()}")
print(f"\nEvent categories: {df_events['event_category'].unique().tolist()}")
df_events.head(15)
3.2 Create df_constant (Static Patient Information)¶
Only truly immutable characteristics should go here. We keep this minimal!
def transform_to_constant(raw_demo: pd.DataFrame) -> pd.DataFrame:
"""
Extract truly constant patient information.
Only include immutable characteristics that:
1. Never change over time
2. Don't have a meaningful "measurement date"
"""
df_constant = raw_demo[["patient_id", "birth_year", "sex", "histology", "smoking_status"]].copy()
# Rename columns to match TwinWeaver format
df_constant = df_constant.rename(
columns={
"patient_id": "patientid",
"birth_year": "birthyear",
"sex": "gender",
}
)
return df_constant
df_constant = transform_to_constant(raw_demographics)
print("Constant DataFrame (static patient information):")
df_constant
3.3 Create df_constant_description (Metadata for Constants)¶
This provides human-readable descriptions for each column in df_constant.
def create_constant_description(df_constant: pd.DataFrame) -> pd.DataFrame:
"""
Create descriptions for each constant column.
These descriptions are used in prompt generation.
"""
descriptions = {
"patientid": "Unique patient identifier",
"birthyear": "Year of birth of the patient",
"gender": "Gender of the patient",
"histology": "Histological subtype of NSCLC",
"smoking_status": "Smoking status at diagnosis",
}
# Create description for each column that exists
rows = []
for col in df_constant.columns:
rows.append({"variable": col, "comment": descriptions.get(col, f"Description for {col}")})
return pd.DataFrame(rows)
df_constant_description = create_constant_description(df_constant)
print("Constant Description DataFrame:")
df_constant_description
4. Apply Weekly Aggregation (Optional Preprocessing)¶
If your data has multiple observations on nearby days (e.g., labs taken daily), you may want to aggregate them to reduce noise. The aggregate_events_to_weeks function handles this automatically.
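Conceptually, weekly aggregation buckets each (patient, variable) series into week-long bins and keeps one value per bin. A minimal pandas sketch of the idea (the actual aggregate_events_to_weeks helper may differ in details such as tie-breaking, which is why it takes a random_state argument):

```python
import pandas as pd

# Four hemoglobin draws spread over two calendar weeks
df = pd.DataFrame(
    {
        "patientid": ["PT001"] * 4,
        "date": pd.to_datetime(["2020-03-02", "2020-03-04", "2020-03-10", "2020-03-11"]),
        "event_name": ["hemoglobin"] * 4,
        "event_value": ["13.5", "13.4", "13.1", "13.0"],
    }
)

# Bucket dates into week bins, then keep the first observation per (patient, variable, week)
df["week"] = df["date"].dt.to_period("W")
aggregated = (
    df.sort_values("date")
    .groupby(["patientid", "event_name", "week"], as_index=False)
    .first()
)
print(len(aggregated))  # 2 rows: one per week
```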
# Demonstrate weekly aggregation on lab values
df_labs_only = df_events[df_events["event_category"] == "lab"].copy()
print(f"Before aggregation: {len(df_labs_only)} lab events")
# Aggregate to weekly values
df_labs_aggregated = aggregate_events_to_weeks(
df_labs_only,
patientid_column="patientid",
date_column="date",
event_name_column="event_name",
event_value_column="event_value",
random_state=42, # For reproducibility
)
print(f"After aggregation: {len(df_labs_aggregated)} lab events")
print("\nAggregated lab events (first 10):")
df_labs_aggregated.sort_values(by=["patientid", "date"]).head(10)
5. Validate the TwinWeaver Format¶
Let's verify our data is in the correct format before proceeding.
def validate_twinweaver_format(df_events, df_constant, df_constant_description):
"""Validate that dataframes conform to TwinWeaver requirements."""
issues = []
# Check df_events required columns
events_required = ["patientid", "date", "event_category", "event_name", "event_value", "event_descriptive_name"]
for col in events_required:
if col not in df_events.columns:
issues.append(f"df_events missing required column: {col}")
# Check df_constant has patientid
if "patientid" not in df_constant.columns:
issues.append("df_constant missing required column: patientid")
# Check df_constant_description structure
if "variable" not in df_constant_description.columns:
issues.append("df_constant_description missing required column: variable")
if "comment" not in df_constant_description.columns:
issues.append("df_constant_description missing required column: comment")
# Check patient ID consistency
events_patients = set(df_events["patientid"].unique())
constant_patients = set(df_constant["patientid"].unique())
if events_patients != constant_patients:
missing_in_events = constant_patients - events_patients
missing_in_constant = events_patients - constant_patients
if missing_in_events:
issues.append(f"Patients in constant but not in events: {missing_in_events}")
if missing_in_constant:
issues.append(f"Patients in events but not in constant: {missing_in_constant}")
if issues:
print("❌ Validation issues found:")
for issue in issues:
print(f" - {issue}")
else:
print("✅ All dataframes are in valid TwinWeaver format!")
print(f" - Events: {len(df_events)} rows, {df_events['patientid'].nunique()} patients")
print(f" - Constants: {len(df_constant)} rows, {len(df_constant.columns)} columns")
print(f" - Descriptions: {len(df_constant_description)} variable descriptions")
return len(issues) == 0
validate_twinweaver_format(df_events, df_constant, df_constant_description)
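One more check worth adding in practice: confirm the date column actually parsed to datetimes. Date strings that were never converted are a common silent failure, since they compare lexicographically rather than chronologically. A small self-contained sketch:

```python
import pandas as pd

# Dates that arrive as strings look fine but are not datetimes yet
df_events_check = pd.DataFrame({"patientid": ["PT001"], "date": ["2020-03-20"]})
assert not pd.api.types.is_datetime64_any_dtype(df_events_check["date"])

# Convert explicitly before any time-based logic
df_events_check["date"] = pd.to_datetime(df_events_check["date"])
assert pd.api.types.is_datetime64_any_dtype(df_events_check["date"])
```

Our transform_to_events function above already does this conversion; the check simply guards against skipping it in your own pipeline.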
6. Convert to Instruction-Tuning Format¶
Now we can use our processed data with the TwinWeaver pipeline to generate instruction-tuning examples, just like in the 01_data_preparation_for_training tutorial.
# Configure TwinWeaver
config = Config()
# Set the event category used for data splitting (split around Lines of Therapy)
config.split_event_category = "lot"
# Define which event categories to forecast
config.event_category_forecast = ["lab"]
# Map specific time-to-event outcomes to predict (e.g., we want to predict 'death' and 'progression')
# Only needs to be set if you want to do time-to-event prediction
config.event_category_events_prediction_with_naming = {
"death": "death",
"progression": "next progression", # Custom name in prompt: "next progression" instead of "progression"
}
# Define which static columns to include in prompts
config.constant_columns_to_use = [
"birthyear",
"gender",
"histology",
"smoking_status",
]
# Specify the birth year column for age calculation
config.constant_birthdate_column = "birthyear"
# Initialize the DataManager and load our processed data
dm = DataManager(config=config)
dm.load_indication_data(df_events=df_events, df_constant=df_constant, df_constant_description=df_constant_description)
dm.process_indication_data()
dm.setup_unique_mapping_of_events()
dm.setup_hold_out_sets(validation_split=0.1, test_split=0.1)
dm.infer_var_types()
print(f"Loaded {len(dm.all_patientids)} patients into DataManager")
# Initialize data splitters and converter
data_splitter_events = DataSplitterEvents(
dm,
config=config,
max_length_to_sample=pd.Timedelta(weeks=104),
min_length_to_sample=pd.Timedelta(weeks=1),
)
data_splitter_events.setup_variables()
data_splitter_forecasting = DataSplitterForecasting(
data_manager=dm,
config=config,
max_forecasted_trajectory_length=pd.Timedelta(days=90),
)
data_splitter_forecasting.setup_statistics()
data_splitter = DataSplitter(data_splitter_events, data_splitter_forecasting)
converter = ConverterInstruction(
nr_tokens_budget_total=8192,
config=config,
dm=dm,
variable_stats=data_splitter_forecasting.variable_stats,
)
# Generate instruction-tuning examples for a patient
patientid = dm.all_patientids[0]
patient_data = dm.get_patient_data(patientid)
print(f"Patient: {patientid}")
print(f"Number of events: {len(patient_data['events'])}")
print("\nPatient events:")
patient_data["events"]
# Generate training splits
forecasting_splits, events_splits, reference_dates = data_splitter.get_splits_from_patient_with_target(
patient_data,
)
print(f"Generated {len(forecasting_splits)} training splits for patient {patientid}")
# Convert first split to instruction format
if len(forecasting_splits) > 0:
split_idx = 0
p_converted = converter.forward_conversion(
forecasting_splits=forecasting_splits[split_idx],
event_splits=events_splits[split_idx],
)
print("=" * 80)
print("INSTRUCTION (Model Input):")
print("=" * 80)
print(p_converted["instruction"])
else:
print("No training splits generated for this patient.")
if len(forecasting_splits) > 0:
print("=" * 80)
print("ANSWER (Target Output):")
print("=" * 80)
print(p_converted["answer"])
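To build a full training set, you would loop over all patients, convert every split as above, and persist the instruction/answer pairs. The file format is up to you; JSONL is a common choice for instruction tuning. A minimal sketch, assuming each converted example is a dict with "instruction" and "answer" keys (the sample values and filename are hypothetical):

```python
import json
from pathlib import Path

# Hypothetical converted examples, shaped like the converter output above
examples = [
    {"instruction": "Patient history ...", "answer": "hemoglobin: 12.9 g/dL"},
    {"instruction": "Patient history ...", "answer": "death: not within 90 days"},
]

out_path = Path("train_instructions.jsonl")  # hypothetical filename
with out_path.open("w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

print(f"Wrote {len(examples)} examples to {out_path}")
```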
Summary: Key Takeaways¶
Data Format Requirements¶
TwinWeaver requires three dataframes:
- df_events (Longitudinal data in long format)
  - Required columns: patientid, date, event_category, event_name, event_value, event_descriptive_name
  - Optional columns: meta_data, source
- df_constant (Static patient information)
  - Required column: patientid
  - Additional columns for immutable characteristics (birthyear, gender, etc.)
- df_constant_description (Metadata for constants)
  - Required columns: variable, comment
Best Practices¶
Put as much as possible into events - Even data that seems "constant" often has temporal context:
- Biomarker results → events (they have a test date)
- Staging information → events (stage at diagnosis date)
- Demographics like birth year, biological sex → constants (truly immutable)
Include all available data first - Start with everything, then trim during data generation if needed:
- Use the token budget in ConverterInstruction to control output length
- The framework automatically prioritizes recent and relevant events
Use preprocessing helpers wisely:
- identify_constant_and_changing_columns() - Helps decide what goes where
- aggregate_events_to_weeks() - Reduces noise from frequent measurements
Validate your data before training to catch format issues early.