Dataset Format
TwinWeaver expects three primary dataframes (or CSV files) as input. Example files can be found in examples/example_data/.
1. Longitudinal Events (events.csv)
Contains time-varying clinical data where each row represents a single event.
| Column | Description |
|---|---|
| patientid | Unique identifier for the patient |
| date | Date of the event (processable by pandas.to_datetime) |
| event_descriptive_name | Human-readable name used in the text output |
| event_category | (Optional) Category (e.g., lab, drug), used for determining splits and tasks |
| event_name | (Optional) Specific event identifier |
| event_value | Value associated with the event |
| meta_data | (Optional) Additional metadata |
| source | (Optional) Modality of the data; defaults to "events", alternatively "genetic" |
Example:
patientid,date,event_descriptive_name,event_category,event_name,event_value,meta_data,source
patient_001,2024-01-15,Hemoglobin,lab,HGB,12.5,,clinical
patient_001,2024-01-15,White Blood Cells,lab,WBC,7.2,,clinical
patient_001,2024-02-01,Chemotherapy Started,treatment,CHEMO,1,,clinical
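As a quick sanity check, the example rows above can be read with pandas and inspected before they are handed to TwinWeaver. This is a minimal sketch that only uses pandas; the `required` list is an assumption based on which columns the table above does not mark as optional.

```python
import io

import pandas as pd

# Sample rows copied from the example above.
EVENTS_CSV = """patientid,date,event_descriptive_name,event_category,event_name,event_value,meta_data,source
patient_001,2024-01-15,Hemoglobin,lab,HGB,12.5,,clinical
patient_001,2024-01-15,White Blood Cells,lab,WBC,7.2,,clinical
patient_001,2024-02-01,Chemotherapy Started,treatment,CHEMO,1,,clinical
"""

df_events = pd.read_csv(io.StringIO(EVENTS_CSV))

# The date column only needs to be parseable by pandas.to_datetime.
df_events["date"] = pd.to_datetime(df_events["date"])

# Assumption: the columns not marked "(Optional)" in the table are required.
required = ["patientid", "date", "event_descriptive_name", "event_value"]
missing = [col for col in required if col not in df_events.columns]

# Each row is one event, so several rows can share a patient and a date.
events_per_day = df_events.groupby(["patientid", "date"]).size()
```

Here `missing` lists any absent columns, and `events_per_day` shows how many events fall on each visit date.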
2. Patient Constants (constant.csv)
Contains static patient information (demographics, baseline characteristics). One row per patient.
| Column | Description |
|---|---|
| patientid | Unique identifier for the patient |
| birthyear | (example) Patient's year of birth |
| gender | (example) Patient's gender |
| ... | Any other static patient attributes |
Example:
patientid,birthyear,gender,diagnosis_stage
patient_001,1965,Female,Stage II
patient_002,1978,Male,Stage III
3. Constant Descriptions (constant_description.csv)
Maps columns in the constant table to human-readable descriptions for the text prompt.
| Column | Description |
|---|---|
| variable | Name of the column in the constant table |
| comment | Description of the variable for the text prompt |
Example:
variable,comment
birthyear,Year of birth
gender,Patient gender
diagnosis_stage,Cancer stage at diagnosis
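Because the description table is keyed by column name, it can be worth checking that the two tables agree. The sketch below uses plain pandas on the example data above; treating patientid as not needing a description is an assumption, not a documented TwinWeaver rule.

```python
import pandas as pd

# Example data copied from the two tables above.
df_constant = pd.DataFrame({
    "patientid": ["patient_001", "patient_002"],
    "birthyear": [1965, 1978],
    "gender": ["Female", "Male"],
    "diagnosis_stage": ["Stage II", "Stage III"],
})
df_constant_description = pd.DataFrame({
    "variable": ["birthyear", "gender", "diagnosis_stage"],
    "comment": ["Year of birth", "Patient gender", "Cancer stage at diagnosis"],
})

# Assumption: the identifier column itself does not need a description.
constant_columns = set(df_constant.columns) - {"patientid"}
described = set(df_constant_description["variable"])

# Columns without a description, and descriptions without a column.
undescribed = sorted(constant_columns - described)
unknown = sorted(described - constant_columns)
```

Both lists are empty for the example data; a non-empty list points at a column name that is misspelled or missing in one of the two files.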
Conceptual Overview of Text Transformation
The Converter transforms the structured data splits into natural language:
Constants → Demographics Text
Static patient information is converted into readable sentences:
| Input (DataFrame) | Output (Text) |
|---|---|
| birthyear: 1965 | "Year of birth is 1965" |
| gender: Female | "Patient gender is female" |
Events → Temporal Narrative
Longitudinal events are organized by visit date and converted to natural language with relative time references:
Input (Events DataFrame):
date | event_descriptive_name | event_value
2024-01-15 | Hemoglobin | 12.5
2024-01-15 | White Blood Cells | 7.2
2024-01-29 | Hemoglobin | 11.8
Output (Text):
On the first visit, the patient experienced the following:
Hemoglobin is 12.5,
White Blood Cells is 7.2.
2 weeks later, the patient visited and experienced the following:
Hemoglobin is 11.8.
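The transformation above can be approximated in a few lines of pandas. This is an illustrative sketch, not the Converter's actual implementation: the function name `events_to_narrative` is hypothetical, and rounding the gap down to whole weeks is an assumption.

```python
import pandas as pd


def events_to_narrative(df: pd.DataFrame) -> str:
    """Group events by visit date and describe each visit relative to the previous one."""
    df = df.assign(date=pd.to_datetime(df["date"])).sort_values("date", kind="stable")
    lines = []
    previous_date = None
    for visit_date, visit in df.groupby("date", sort=True):
        if previous_date is None:
            lines.append("On the first visit, the patient experienced the following:")
        else:
            # Assumption: gaps are rounded down to whole weeks.
            weeks = (visit_date - previous_date).days // 7
            unit = "week" if weeks == 1 else "weeks"
            lines.append(
                f"{weeks} {unit} later, the patient visited and experienced the following:"
            )
        items = [
            f"{name} is {value}"
            for name, value in zip(visit["event_descriptive_name"], visit["event_value"])
        ]
        lines.append(",\n".join(items) + ".")
        previous_date = visit_date
    return "\n".join(lines)


df_events = pd.DataFrame({
    "date": ["2024-01-15", "2024-01-15", "2024-01-29"],
    "event_descriptive_name": ["Hemoglobin", "White Blood Cells", "Hemoglobin"],
    "event_value": [12.5, 7.2, 11.8],
})
narrative = events_to_narrative(df_events)
```

For the three example rows, `narrative` matches the output text shown above.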
Relative Dating
TwinWeaver uses relative dating instead of absolute dates. Every calendar date in the input data is converted into a time delta relative to the previous visit (e.g., "2 weeks later") rather than being included as a raw date (e.g., "2024-01-29"). This serves two purposes. First, it removes identifiable calendar dates from the training text, which supports anonymization of the patient data. Second, it gives the model clinically meaningful temporal context (the time elapsed between visits) rather than arbitrary date strings. By default, time intervals are expressed in weeks; this can be changed to days using Config.set_delta_time_unit("days"). Cumulative deltas (time since the very first visit rather than since the previous visit) are also supported.
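The difference between per-visit and cumulative deltas, and between weeks and days, can be illustrated with pandas alone. The helper `visit_deltas` below is hypothetical and does not call TwinWeaver; it only shows the arithmetic under the assumption that a week is seven days.

```python
import pandas as pd


def visit_deltas(dates: pd.Series, unit: str = "weeks", cumulative: bool = False) -> list:
    """Return the time elapsed at each visit, in the requested unit."""
    days_per_unit = {"days": 1, "weeks": 7}[unit]
    dates = dates.sort_values().reset_index(drop=True)
    if cumulative:
        # Time since the very first visit.
        elapsed = dates - dates.iloc[0]
    else:
        # Time since the previous visit; the first visit has no predecessor.
        elapsed = dates.diff().fillna(pd.Timedelta(0))
    return (elapsed.dt.days / days_per_unit).tolist()


visit_dates = pd.Series(pd.to_datetime(["2024-01-15", "2024-01-29", "2024-02-19"]))

since_previous = visit_deltas(visit_dates)                  # weeks since previous visit
since_first = visit_deltas(visit_dates, cumulative=True)    # weeks since first visit
in_days = visit_deltas(visit_dates, unit="days")            # days since previous visit
```

With these three dates, the second visit is 2 weeks after the first and the third is 3 weeks after the second, so the cumulative value at the third visit is 5 weeks.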
Final Output Structure
For training, TwinWeaver produces input-target pairs:
Input:
[Preamble explaining data structure]
[Demographics section]
[Chronological event narrative]
[Task-specific prompt]
Target:
[Expected model response - predicted values or outcomes]
For inference, only the input portion is generated, and the model produces the target predictions.
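To make the layout concrete, the sketch below assembles the four input sections into one string and keeps the target separate. The `TrainingExample` class, the `build_example` function, and the blank-line separator are all hypothetical; TwinWeaver's own output format may differ in detail.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrainingExample:
    """Hypothetical container for one input-target pair."""
    input_text: str
    target_text: Optional[str] = None  # None at inference time


def build_example(preamble, demographics, narrative, task_prompt, target=None):
    # Assumption: sections are joined in the documented order with blank lines.
    input_text = "\n\n".join([preamble, demographics, narrative, task_prompt])
    return TrainingExample(input_text=input_text, target_text=target)


sections = (
    "[Preamble explaining data structure]",
    "[Demographics section]",
    "[Chronological event narrative]",
    "[Task-specific prompt]",
)
train_example = build_example(*sections, target="[Expected model response]")
inference_example = build_example(*sections)
```

The training and inference examples share the same input text; only the training example carries a target.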
Best Practices for Data Processing
When transforming raw clinical data into TwinWeaver format, following these principles will help you get the most out of your data.
1. Prefer Events Over Constants
Key principle: Put as much data as possible into the events table. Only truly immutable patient characteristics should go into constants.
Even data that appears "constant" is often better represented as events because:
- It has a specific date when it was measured (e.g., biomarker test date)
- It could change over time (e.g., acquired resistance mutations, re-staging)
- Temporal context matters clinically (when was this information known?)
Examples:
| Data Type | Recommended Table | Rationale |
|---|---|---|
| Birth year, biological sex | constant | Truly immutable |
| Biomarker results (EGFR, ALK, PD-L1) | events | Has test date, could change |
| Cancer stage | events | Stage at diagnosis date, may be re-staged |
| Diagnosis information | events | Occurred at a specific date |
| Lab values, vitals | events | Longitudinal measurements |
| Treatment administrations | events | Time-varying interventions |
| Death, progression | events | Time-to-event outcomes |
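As an illustration of this principle, the sketch below moves a stage column out of a wide per-patient table and into event rows. The input table and its `diagnosis_date` column are hypothetical, as are the chosen event names; only the output column names come from the events format described earlier.

```python
import pandas as pd

# Hypothetical raw table with one row per patient.
df_raw = pd.DataFrame({
    "patientid": ["patient_001", "patient_002"],
    "birthyear": [1965, 1978],
    "diagnosis_date": ["2023-11-02", "2023-12-10"],
    "diagnosis_stage": ["Stage II", "Stage III"],
})

# Only the truly immutable attribute stays in the constant table.
df_constant = df_raw[["patientid", "birthyear"]].copy()

# The stage becomes a dated event, so later re-staging can be added as new rows.
df_stage_events = pd.DataFrame({
    "patientid": df_raw["patientid"],
    "date": pd.to_datetime(df_raw["diagnosis_date"]),
    "event_descriptive_name": "Cancer stage",
    "event_category": "staging",
    "event_name": "stage",
    "event_value": df_raw["diagnosis_stage"],
})
```

The resulting rows can be concatenated with the rest of the events table using `pd.concat`.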
2. Include All Available Data First
Start by including everything, then trim during data generation if needed:
- The ConverterInstruction token budget automatically controls output length
- The framework prioritizes recent and relevant events
- You can always exclude data later, but you can't include what wasn't captured
3. Use Consistent Event Naming
Standardize your event names and categories:
# Good: Consistent naming convention
event_name = "hemoglobin_-_718-7" # Includes LOINC code for clarity
event_descriptive_name = "hemoglobin - 718-7" # Human-readable version
# Avoid: Inconsistent naming
event_name = "Hgb" # One record
event_name = "hemoglobin" # Another record
event_name = "HGB" # Yet another
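One way to enforce a naming convention is a lookup table applied to a normalized key. The sketch below is a pandas-only illustration; the mapping dictionaries are hypothetical and would need to be built for your own source vocabulary.

```python
import pandas as pd

# Hypothetical lookup from lower-cased raw names to the canonical event name.
NAME_MAP = {
    "hgb": "hemoglobin_-_718-7",
    "hemoglobin": "hemoglobin_-_718-7",
}
# Hypothetical lookup from canonical name to the human-readable version.
DESCRIPTIVE_MAP = {"hemoglobin_-_718-7": "hemoglobin - 718-7"}

df = pd.DataFrame({"event_name": ["Hgb", "hemoglobin", "HGB"]})

# Normalize case and whitespace before looking up the canonical name.
key = df["event_name"].str.strip().str.lower()

# Names without a mapping are left unchanged so they can be reviewed later.
df["event_name"] = key.map(NAME_MAP).fillna(df["event_name"])
df["event_descriptive_name"] = df["event_name"].map(DESCRIPTIVE_MAP)
```

After the mapping, all three spellings collapse to a single event name with a single descriptive name.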
4. Structure Event Categories Meaningfully
Choose event categories that align with your modeling objectives:
| Category | Description | Example Events |
|---|---|---|
| lab | Laboratory test results | hemoglobin, platelets, creatinine |
| drug | Drug administrations | pembrolizumab, carboplatin |
| lot | Line of therapy markers | treatment start, line number |
| death | Mortality events | death |
| response | Treatment response | RECIST response, progression |
| staging | Cancer staging | stage, TNM classification |
| basic_biomarker | Molecular markers | EGFR, ALK, KRAS |
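Before generating training data, it can help to confirm that the categories your tasks rely on actually occur in the events table. The sketch below uses plain pandas; the three task variables are hypothetical stand-ins for your own modeling choices, and the event rows are made up for illustration.

```python
import pandas as pd

# Made-up events for illustration.
df_events = pd.DataFrame({
    "patientid": ["patient_001"] * 4,
    "event_category": ["lab", "lab", "lot", "drug"],
    "event_name": ["hemoglobin", "platelets", "line_1_start", "carboplatin"],
})

# Hypothetical modeling choices: split on lines of therapy, forecast labs,
# and predict death and progression as time-to-event outcomes.
split_category = "lot"
forecast_categories = ["lab"]
event_prediction_categories = ["death", "progression"]

category_counts = df_events["event_category"].value_counts()
needed = [split_category, *forecast_categories, *event_prediction_categories]

# Categories that a task refers to but that never occur in the data.
absent = [category for category in needed if category not in category_counts.index]
```

An absent outcome category is not necessarily an error (for example, a cohort with no recorded deaths), but an absent split or forecast category usually indicates a naming mismatch.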
5. Use Preprocessing Helper Functions
TwinWeaver provides helper functions to analyze and prepare your data:
from twinweaver import (
    identify_constant_and_changing_columns,
    aggregate_events_to_weeks,
)

# Identify which columns are truly constant vs. changing over time
constant_cols, changing_cols = identify_constant_and_changing_columns(
    df, date_column="visit_date", patientid_column="patient_id"
)

# Aggregate frequent measurements to reduce noise
df_aggregated = aggregate_events_to_weeks(
    df_events,
    patientid_column="patientid",
    date_column="date",
    event_name_column="event_name",
    event_value_column="event_value",
)
6. Validate Your Data Before Training
Always validate your data format before proceeding:
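def validate_twinweaver_format(df_events, df_constant, df_constant_description):
    """Validate that dataframes conform to TwinWeaver requirements."""
    issues = []

    # Check required columns in the events table
    events_required = ["patientid", "date", "event_category", "event_name",
                       "event_value", "event_descriptive_name"]
    for col in events_required:
        if col not in df_events.columns:
            issues.append(f"df_events missing column: {col}")

    # Check required columns in the constant and description tables
    if "patientid" not in df_constant.columns:
        issues.append("df_constant missing column: patientid")
    for col in ["variable", "comment"]:
        if col not in df_constant_description.columns:
            issues.append(f"df_constant_description missing column: {col}")

    # Check patient ID consistency (only possible when both ID columns exist)
    if "patientid" in df_events.columns and "patientid" in df_constant.columns:
        events_patients = set(df_events["patientid"].unique())
        constant_patients = set(df_constant["patientid"].unique())
        if events_patients != constant_patients:
            issues.append("Patient IDs don't match between events and constants")

    return len(issues) == 0, issues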
7. Handle Time-to-Event Outcomes Properly
Death, progression, and other time-to-event outcomes should be represented as events with a specific date:
# Death event
{
    "patientid": "PT001",
    "date": "2021-02-10",  # Date of death
    "event_category": "death",
    "event_name": "death",
    "event_value": "Yes",
    "event_descriptive_name": "Death",
}
Censored Patients
For patients who are alive (censored), simply don't include a death event. The absence of a death event indicates the patient was alive at last follow-up.
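Under this convention, the outcome status of each patient can be derived from the events table alone. The sketch below is a pandas-only illustration with made-up rows; using the latest event date as the last follow-up date is an assumption.

```python
import pandas as pd

# Made-up events: PT001 has a death event, PT002 does not (censored).
df_events = pd.DataFrame({
    "patientid": ["PT001", "PT001", "PT002"],
    "date": pd.to_datetime(["2020-11-03", "2021-02-10", "2021-03-01"]),
    "event_category": ["lab", "death", "lab"],
    "event_name": ["hemoglobin", "death", "hemoglobin"],
})

# Assumption: the latest recorded event marks the last follow-up.
last_event_date = df_events.groupby("patientid")["date"].max()

is_death = df_events["event_category"] == "death"
death_date = df_events[is_death].groupby("patientid")["date"].min()

status = pd.DataFrame({"last_event_date": last_event_date})
# Assignment aligns on patientid; patients without a death event get NaT.
status["death_date"] = death_date
status["died"] = status["death_date"].notna()
```

Patients whose `died` flag is false are treated as censored at `last_event_date`.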
Loading Data
Data can be loaded as pandas DataFrames directly:
import pandas as pd
from twinweaver import DataManager, Config
# Load your data
df_events = pd.read_csv("events.csv")
df_constant = pd.read_csv("constant.csv")
df_constant_description = pd.read_csv("constant_description.csv")
# Initialize configuration
config = Config()
# <---------------------- CRITICAL CONFIGURATION ---------------------->
# 1. Event category used for data splitting (e.g., split data around Lines of Therapy 'lot')
# Has to be set for all instruction tasks
config.split_event_category = "lot"
# 2. List of event categories we want to forecast (e.g., forecasting 'lab' values)
# Only needs to be set if you want to forecast variables
config.event_category_forecast = ["lab"]
# 3. Mapping of specific time-to-event outcomes to predict (e.g., we want to predict 'death' and 'progression')
# Only needs to be set if you want to do time-to-event prediction
config.event_category_events_prediction_with_naming = {
    "death": "death",
    "progression": "next progression",  # Custom name in prompt
}
# Initialize DataManager
dm = DataManager(config=config)
dm.load_indication_data(
    df_events=df_events,
    df_constant=df_constant,
    df_constant_description=df_constant_description,
)
Configuration Parameters
- split_event_category: The event category used to anchor split points for generating training samples (required for instruction tuning)
- event_category_forecast: Which event categories to forecast as time-series values
- event_category_events_prediction_with_naming: Maps event names to prediction tasks (e.g., survival, progression)
See the Raw Data Preprocessing Tutorial for transforming raw clinical data into TwinWeaver format, or the Data Preparation Tutorial for a complete walkthrough of instruction-tuning data generation.