# Tutorials & Examples

The `examples/` directory provides comprehensive tutorials to help you get up and running with TwinWeaver.
## 🔰 Core Tutorials
These notebooks cover the primary workflows for most users:
### 0. Raw Data Preprocessing

`examples/data_preprocessing/raw_data_preprocessing.ipynb`

Start here if you have raw clinical data. This tutorial demonstrates how to transform it (e.g., EHR exports, clinical trial databases) into the standardized TwinWeaver format.
What you'll learn:
- Creating the three required TwinWeaver dataframes (`df_events`, `df_constant`, `df_constant_description`)
- Best practices for deciding what goes into events vs. constants
- Handling time-to-event outcomes like death and progression
- Using preprocessing helpers for data aggregation and column classification
- Validating your data format before training
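As a rough sketch of what these three dataframes might look like, they can be assembled with plain pandas. Note that the column names below are illustrative assumptions for this sketch, not TwinWeaver's required schema — consult the notebook for the exact format:

```python
import pandas as pd

# Hypothetical long-format event table: one row per observation.
df_events = pd.DataFrame({
    "patient_id": [1, 1, 1, 2],
    "day": [0, 0, 30, 0],  # days since a patient-level anchor date
    "event_category": ["lab", "vitals", "lab", "lab"],
    "event_name": ["hemoglobin", "weight", "hemoglobin", "hemoglobin"],
    "value": [13.2, 71.5, 12.8, 11.9],
})

# Time-invariant attributes: one row per patient.
df_constant = pd.DataFrame({
    "patient_id": [1, 2],
    "sex": ["F", "M"],
    "age_at_baseline": [62, 57],
})

# Human-readable descriptions of the constant columns.
df_constant_description = pd.DataFrame({
    "column": ["sex", "age_at_baseline"],
    "description": ["Biological sex", "Age in years at the anchor date"],
})

# Minimal sanity checks before handing the data to TwinWeaver:
assert df_events["patient_id"].isin(df_constant["patient_id"]).all()
assert set(df_constant_description["column"]) <= set(df_constant.columns)
```

The split between time-stamped events and per-patient constants is the key modeling decision; the notebook discusses how to make it for your data.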
### 1. Data Preparation for Training

`examples/01_data_preparation_for_training.ipynb`
Demonstrates how to convert raw patient data (events, constants, genetics) into the instruction-tuning text format used by TwinWeaver. This is the core step for preparing data for fine-tuning.
What you'll learn:
- Loading patient data into the `DataManager`
- Configuring variables and data splits
- Converting data to text format for LLM training
For an in-depth explanation of the splitting logic, see the Data Splitting page.
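To build intuition for what "converting data to text" means, here is a toy serializer that renders one patient's events as day-stamped lines. This is not the `DataManager` API — the real conversion adds demographics, task instructions, and much more — just a schematic of the idea:

```python
import pandas as pd

def events_to_text(df_events: pd.DataFrame, patient_id: int) -> str:
    """Toy serializer: render one patient's events as day-stamped lines.
    TwinWeaver's real conversion is far richer than this sketch."""
    rows = df_events[df_events["patient_id"] == patient_id].sort_values("day")
    lines = [f"Day {int(r.day)}: {r.event_name} = {r.value}" for r in rows.itertuples()]
    return "\n".join(lines)

df_events = pd.DataFrame({
    "patient_id": [1, 1],
    "day": [0, 30],
    "event_name": ["hemoglobin", "hemoglobin"],
    "value": [13.2, 12.8],
})
print(events_to_text(df_events, 1))
# Day 0: hemoglobin = 13.2
# Day 30: hemoglobin = 12.8
```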
### 2. Inference Prompt Preparation

`examples/02_inference_prompt_preparation.ipynb`
Shows how to run inference using the TwinWeaver framework, including setting up the data manager and generating prompts.
What you'll learn:
- Setting up prompts for inference
- Generating predictions with trained models
- Processing model outputs
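Output processing depends on the task. As a heavily simplified illustration (the helper below is hypothetical, not part of TwinWeaver), a numeric forecast can be pulled out of a free-text completion like this:

```python
import re
from typing import Optional

def parse_numeric_prediction(generated: str) -> Optional[float]:
    """Toy post-processor: pull the first number out of a model completion.
    Real output parsing is task-specific; this is illustrative only."""
    m = re.search(r"[-+]?\d+(?:\.\d+)?", generated)
    return float(m.group()) if m else None

print(parse_numeric_prediction("Predicted hemoglobin: 12.4 g/dL"))  # 12.4
```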
### 3. End-to-End LLM Fine-Tuning

`examples/03_end_to_end_llm_finetuning.ipynb`
A complete guide covering the entire pipeline from data ingestion to LLM fine-tuning.
**Installation Note:** Install the required packages using the command given in the notebook; the torch CUDA version might need to be adapted to your system.

## 🚀 Advanced Usage
For users needing custom behavior or specific integrations:
### Pretraining Data Conversion

`examples/advanced/pretraining/prepare_pretraining_data.py`
A script illustrating how to convert data for the pretraining phase, using template-based generation. Useful if you want to pretrain on your own large-scale unlabeled clinical data.
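Template-based generation, at its simplest, means filling a fixed text template per event. A toy version (the template wording here is made up, not TwinWeaver's) looks like:

```python
# Hypothetical pretraining template; TwinWeaver's actual templates differ.
TEMPLATE = "Patient {patient_id}: on day {day}, {event_name} was {value}."

def render_events(events: list[dict]) -> str:
    """Fill the template once per event and join into a text block."""
    return "\n".join(TEMPLATE.format(**e) for e in events)

print(render_events([
    {"patient_id": 1, "day": 0, "event_name": "hemoglobin", "value": 13.2},
    {"patient_id": 1, "day": 30, "event_name": "hemoglobin", "value": 12.8},
]))
# Patient 1: on day 0, hemoglobin was 13.2.
# Patient 1: on day 30, hemoglobin was 12.8.
```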
### End-to-End LLM Training with Pretraining Data

`examples/advanced/pretraining/end_to_end_llm_training_with_pretrain.ipynb`
A complete notebook demonstrating how to train LLMs on full patient histories without a specific task. This approach can be used to develop models that generate synthetic patients or embeddings.
**Installation Note:** Install the required packages using the command given in the notebook. Requires a GPU with at least 30GB of memory.

### Custom Splitting
- Inference: `examples/advanced/custom_splitting/inference_individual_splitters.py` — Example script for inference using individual splitters.
- Training: `examples/advanced/custom_splitting/training_individual_splitters.ipynb` — Notebook demonstrating training data generation with individual splitters.
- Custom Split Events: `examples/advanced/custom_splitting/training_custom_split_events.ipynb` — Notebook showing how to customize split events and forecast different event categories (e.g., using genetic events as split points and forecasting vitals).
- Forecasting QA: `examples/advanced/custom_splitting/training_forecasting_qa.ipynb` — Demonstrates the Forecasting QA mode, which bins continuous target values into discrete categories for classification-style prediction. Compares all three forecasting modes (`"forecasting"`, `"forecasting_qa"`, `"both"`).
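The binning idea behind Forecasting QA can be sketched in a few lines of NumPy. The cut points and labels below are invented for illustration; the notebook shows how the bins are actually configured:

```python
import numpy as np

def bin_targets(values, edges, labels):
    """Map continuous target values to discrete categories, as in a
    forecasting-QA setup. edges are the bin boundaries; there is one
    more label than boundary."""
    idx = np.digitize(values, edges)  # index 0..len(edges) per value
    return [labels[i] for i in idx]

values = [11.2, 13.5, 16.1]
edges = [12.0, 15.0]               # two cut points -> three bins
labels = ["low", "normal", "high"]
print(bin_targets(values, edges, labels))  # ['low', 'normal', 'high']
```

Framing forecasting as classification over such bins lets standard QA-style evaluation (accuracy per category) be applied to continuous targets.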
### TTE Probability Inference

`examples/advanced/tte_inference/tte_probability_inference.ipynb`
Demonstrates how to estimate probabilities for time-to-event outcomes (e.g., death, disease progression) using a fine-tuned LLM served via vLLM. Instead of generating free-text answers, the pipeline scores three mutually exclusive completions (censored, occurred, not occurred) and derives softmax probabilities from length-normalised log-probabilities.
What you'll learn:
- Building events-only instruction prompts for TTE inference
- Launching a vLLM server and scoring completions via the OpenAI-compatible API
- Computing length-normalised softmax probabilities with `compute_length_normalized_probabilities()`
- Evaluating predictions across multiple time horizons
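The length-normalisation step itself is simple: divide each completion's total log-probability by its token count, then apply a softmax. A self-contained sketch of the computation (the real `compute_length_normalized_probabilities()` signature may differ; the numbers are illustrative):

```python
import math

def length_normalized_softmax(total_logprobs, token_counts):
    """Per-token (length-normalised) log-probs, then a numerically
    stable softmax over the candidate completions."""
    normalized = [lp / n for lp, n in zip(total_logprobs, token_counts)]
    m = max(normalized)
    exps = [math.exp(x - m) for x in normalized]
    total = sum(exps)
    return [e / total for e in exps]

# Total log-probs and token counts for three candidate completions
# (censored, occurred, not occurred) -- values are made up.
probs = length_normalized_softmax([-12.0, -6.0, -9.0], [4, 3, 3])
print([round(p, 3) for p in probs])  # [0.212, 0.576, 0.212]
```

Normalising by length prevents longer completions from being penalised simply for containing more tokens.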
**Requirements:** A fine-tuned model is needed for meaningful results; an off-the-shelf instruction model may produce essentially random probabilities. Also requires a GPU with enough memory for vLLM (≥ 16 GB for a 4-bit 8B model).
For a detailed explanation of the mechanism and future research directions, see the TTE Probability Inference documentation page.
### Custom Text Generation

`examples/advanced/custom_output/customizing_text_generation.ipynb`
A comprehensive tutorial on customizing every textual component of the instruction generation pipeline. TwinWeaver provides extensive configuration options to tailor generated text prompts to your specific use case.
What you'll learn:
- Customizing preamble and introduction text
- Modifying demographics section formatting
- Changing event day and time interval descriptions
- Switching time units between days and weeks
- Customizing genetic data tags and placeholder text
- Modifying forecasting, time-to-event, and QA task prompts
- Configuring multi-task instruction formatting
- Fine-grained control over specific event categories with overrides
### Custom Summarized Row

`examples/advanced/custom_output/custom_summarized_row.ipynb`

Shows how to customize the summarized row section of the instruction prompt using `set_custom_summarized_row_fn()`. The summarized row is a compact summary inserted just before the task questions, ensuring the LLM sees the most critical information; by default it includes recent genetic info, Line of Therapy starts, and last known target values.
What you'll learn:
- Generating output with the default summarized row
- Writing and applying a custom summarized row function
- Building an advanced summary with event counts, latest treatments, and trend indicators
- Error handling for invalid function signatures and runtime errors
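As a sketch of the kind of function the hook expects — the actual signature required by `set_custom_summarized_row_fn()` may differ, and the column names below are assumptions — a custom summary might report each variable's latest value:

```python
import pandas as pd

def my_summarized_row(df_events: pd.DataFrame) -> str:
    """Hypothetical custom summarized-row function: one compact line
    with the last observed value of each event variable."""
    last = df_events.sort_values("day").groupby("event_name").tail(1)
    parts = [f"{r.event_name}={r.value} (day {int(r.day)})" for r in last.itertuples()]
    return "Latest values: " + ", ".join(parts)

df_events = pd.DataFrame({
    "patient_id": [1, 1, 1],
    "day": [0, 30, 30],
    "event_name": ["hemoglobin", "hemoglobin", "weight"],
    "value": [13.2, 12.8, 70.0],
})
print(my_summarized_row(df_events))
```

Keeping the summary short matters: it sits immediately before the task questions, so it competes for the model's attention with the question text itself.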
## 🔗 Integrations

### MEDS Data Import

`examples/integrations/meds_data_import.ipynb`
A tutorial on importing data in the Medical Event Data Standard (MEDS) format and converting it into TwinWeaver's internal format. Includes a synthetic data example.
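The core MEDS schema is a long table with `subject_id`, `time`, `code`, and `numeric_value` columns. A toy conversion into an events-style table — the target column names are assumptions for this sketch, and the notebook shows the real import path — might anchor each subject at their first timestamp:

```python
import pandas as pd

# Minimal MEDS-style records (core fields: subject_id, time, code, numeric_value).
meds = pd.DataFrame({
    "subject_id": [1, 1, 1],
    "time": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-31"]),
    "code": ["LAB//HGB", "DEMOGRAPHICS//SEX//F", "LAB//HGB"],
    "numeric_value": [13.2, None, 12.8],
})

# Toy conversion: express event times as day offsets from each
# subject's first observation.
anchor = meds.groupby("subject_id")["time"].transform("min")
df_events = pd.DataFrame({
    "patient_id": meds["subject_id"],
    "day": (meds["time"] - anchor).dt.days,
    "event_name": meds["code"],
    "value": meds["numeric_value"],
})
print(df_events)
```

Non-numeric MEDS codes (like the demographics row above) typically belong in the constants table rather than the events table; the tutorial covers that routing.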