🏆 Challenge 1: Data Preparation for Training¶
Difficulty: ⭐⭐ (Intermediate) | Time: 45-60 minutes
🎯 Learning Objectives¶
By completing this challenge, you will:
- Understand the data format required by TwinWeaver
- Configure the pipeline for different prediction tasks
- Generate training splits from patient timelines
- Convert structured data to instruction-tuning format
📋 Rules¶
- Complete all `# TODO:` sections
- Answer quiz questions before proceeding
- Run checkpoint cells to validate your solutions
- No peeking at the original tutorial!
Part 1: Understanding the Data¶
Before we start coding, let's understand what data we're working with.
import pandas as pd

from twinweaver import (
    DataManager,
    Config,
)
# Load the example data
df_events = pd.read_csv("../example_data/events.csv")
df_constant = pd.read_csv("../example_data/constant.csv")
df_constant_description = pd.read_csv("../example_data/constant_description.csv")
🔍 Exercise 1.1: Explore the Data¶
Before configuring the pipeline, you need to understand your data. Explore the three dataframes to answer the quiz questions below.
# TODO: Explore df_events - what columns does it have? What are the unique event categories?
# Write your exploration code here
# TODO: Explore df_constant - what patient-level information is available?
# TODO: Explore df_constant_description - how does this map to df_constant?
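If you're unsure where to start, these generic pandas inspection patterns cover most of what the quiz asks for. The tiny stand-in dataframe below is purely illustrative (its columns and values are made up) so the snippet is runnable on its own; apply the same calls to `df_events`, `df_constant`, and `df_constant_description`.

```python
import pandas as pd

# Stand-in dataframe so the patterns below run on their own;
# swap in df_events / df_constant when doing the exercise.
df = pd.DataFrame(
    {
        "patientid": [1, 1, 2],
        "event_category": ["lab", "drug", "lab"],
        "value": [5.4, None, 6.1],
    }
)

print(df.head())                      # first rows: overall shape of the data
print(df.dtypes)                      # column types (dates often load as strings)
print(df.columns.tolist())            # all column names
print(df["event_category"].unique())  # distinct categories in a column
print(df["patientid"].nunique())      # number of unique patients
```

`unique()` answers "what categories exist?" while `nunique()` answers "how many distinct values?" — between them and `columns.tolist()` you can answer every question in Quiz 1.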
❓ Quiz 1: Data Understanding¶
Answer these questions based on your exploration:
Q1.1: What column in df_events contains the type of medical event (lab, drug, condition, etc.)?
Q1.2: List all unique event categories in the dataset:
Q1.3: How many unique patients are in the dataset?
Q1.4: What column in df_constant could be used to calculate a patient's age?
Write your answers in the cell below:
Your Answers:
Q1.1:
Q1.2:
Q1.3:
Q1.4:
Part 2: Configuration Challenge¶
Now you need to configure the TwinWeaver pipeline. This is where understanding your data pays off!
🎯 Your Task¶
Configure the pipeline to:
- Split patient histories around Lines of Therapy (treatment changes)
- Forecast lab values into the future
- Predict time-to-event for death and progression
config = Config()
# TODO: Set the event category used for splitting patient timelines
# HINT: Look at your answer to Q1.2 - which category represents treatment lines?
config.split_event_category = None # Replace None with the correct value
# TODO: Set which event categories should be forecasted as time-series
# HINT: We want to predict future lab values
config.event_category_forecast = None # Replace None with a list
# TODO: Configure time-to-event prediction targets
# HINT: This should be a dictionary mapping event names to display names
# Example: {"original_name": "display name in prompt"}
config.event_category_events_prediction_with_naming = None # Replace with dict
🏁 Checkpoint 2.1: Validate Configuration¶
# Run this cell to check your configuration
def validate_config_part1(config):
    errors = []

    if config.split_event_category is None:
        errors.append("❌ split_event_category is not set")
    elif config.split_event_category not in df_events["event_category"].unique():
        errors.append(f"❌ split_event_category '{config.split_event_category}' not found in data")
    else:
        print(f"✅ split_event_category: '{config.split_event_category}'")

    if config.event_category_forecast is None:
        errors.append("❌ event_category_forecast is not set")
    elif not isinstance(config.event_category_forecast, list):
        errors.append("❌ event_category_forecast should be a list")
    elif any(cat not in df_events["event_category"].unique() for cat in config.event_category_forecast):
        errors.append("❌ At least one of the event_category_forecast values not found in data")
    else:
        print(f"✅ event_category_forecast: {config.event_category_forecast}")

    if config.event_category_events_prediction_with_naming is None:
        errors.append("❌ event_category_events_prediction_with_naming is not set")
    elif not isinstance(config.event_category_events_prediction_with_naming, dict):
        errors.append("❌ event_category_events_prediction_with_naming should be a dict")
    elif any(
        cat not in df_events["event_category"].unique()
        for cat in config.event_category_events_prediction_with_naming.keys()
    ):
        errors.append("❌ At least one key in event_category_events_prediction_with_naming not found in data")
    else:
        print(f"✅ Event mapping: {config.event_category_events_prediction_with_naming}")

    if errors:
        print("\n" + "\n".join(errors))
        print("\n💡 Hint: Review Part 1 exploration to find the correct values")
    else:
        print("\n🎉 Part 2.1 Complete! Configuration looks good.")

    return len(errors) == 0

validate_config_part1(config)
🔧 Exercise 2.2: Configure Static Variables¶
Now configure which patient demographics to include in the prompts.
# TODO: Look at df_constant columns and decide which ones to include
# Consider: Which variables are clinically relevant for predictions?
# First, explore what's available
print("Available columns in df_constant:")
print(df_constant.columns.tolist())
# TODO: Select which constant columns to use (list of column names)
config.constant_columns_to_use = [] # Fill in the list
# TODO: Specify which column contains birth year/date for age calculation
config.constant_birthdate_column = None # Set the column name
Part 3: Initialize the Pipeline¶
With configuration complete, let's initialize the data processing components.
# TODO: Initialize DataManager and load data
# The DataManager needs to:
# 1. Be created with your config
# 2. Load the indication data (events, constant, constant_description)
# 3. Process the indication data
# 4. Setup unique mapping of events
# 5. Setup dataset splits
# 6. Infer variable types
dm = DataManager(config=config)
# TODO: Call the required methods in the correct order
# dm.????
# dm.????
# dm.????
# dm.????
# dm.????
🏁 Checkpoint 3.1: Validate DataManager¶
# Run this to verify DataManager is set up correctly
try:
    n_patients = len(dm.all_patientids)
    print(f"✅ DataManager initialized with {n_patients} patients")

    # Check if we can get patient data
    test_patient = dm.all_patientids[0]
    patient_data = dm.get_patient_data(test_patient)
    print(f"✅ Successfully retrieved data for patient {test_patient}")
    print(f"   - Events: {len(patient_data['events'])} rows")
    print(f"   - Constant: {len(patient_data['constant'])} rows")
    print("\n🎉 Part 3 Complete!")
except Exception as e:
    print(f"❌ Error: {e}")
    print("\n💡 Hint: Make sure you called all DataManager methods in the correct order")
Part 4: Create Splitters and Converter¶
❓ Quiz 2: Understanding Splitters¶
Before creating the splitters, answer these conceptual questions:
Q2.1: What is the purpose of splitting a patient's timeline? Why not use the entire history?
Q2.2: What's the difference between DataSplitterEvents and DataSplitterForecasting?
Q2.3: Why do we need a token budget for the converter?
Your Answers:
Q2.1:
Q2.2:
Q2.3:
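For intuition on Q2.3 (not TwinWeaver's actual tokenization), a common rough rule of thumb for English text is about four characters per token. This hypothetical estimator is only a heuristic for reasoning about prompt budgets; a real pipeline would count tokens with the target model's tokenizer.

```python
def rough_token_estimate(text: str) -> int:
    """Crude token-count estimate (~4 characters per token for English text).

    Heuristic only, for thinking about prompt budgets; a real pipeline
    would use the tokenizer of the target model instead.
    """
    return max(1, len(text) // 4)

prompt = "Patient history: 120 lab events, 14 drug events, 3 lines of therapy ..."
print(rough_token_estimate(prompt))
```

A patient with years of history can easily exceed a model's context window, which is why the converter needs a budget to decide what to keep.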
# TODO: Initialize DataSplitterEvents
# This handles event prediction tasks (death, progression)
data_splitter_events = None # Create the splitter
# TODO: Don't forget to call setup_variables() on it!
# TODO: Initialize DataSplitterForecasting
# This handles continuous variable forecasting (lab values)
data_splitter_forecasting = None # Create the splitter
# TODO: Call setup_statistics() for forecasting QA and filtering
# TODO: Combine both splitters using DataSplitter wrapper
data_splitter = None # Create the combined splitter
# TODO: Initialize ConverterInstruction
# Parameters needed:
# - nr_tokens_budget_total: How many tokens can the prompt be? (try 8192)
# - config: Your configuration object
# - dm: Your DataManager
# - variable_stats: Statistics from forecasting splitter (optional but recommended)
converter = None # Create the converter
🏁 Checkpoint 4.1: Validate Pipeline Components¶
# Validate all components are created
components_valid = True

if data_splitter_events is None:
    print("❌ data_splitter_events is not initialized")
    components_valid = False
else:
    print("✅ data_splitter_events initialized")

if data_splitter_forecasting is None:
    print("❌ data_splitter_forecasting is not initialized")
    components_valid = False
else:
    print("✅ data_splitter_forecasting initialized")

if data_splitter is None:
    print("❌ data_splitter is not initialized")
    components_valid = False
else:
    print("✅ data_splitter initialized")

if converter is None:
    print("❌ converter is not initialized")
    components_valid = False
else:
    print("✅ converter initialized")

if components_valid:
    print("\n🎉 Part 4 Complete! All components ready.")
Part 5: Generate a Training Example¶
With all components ready, pick a patient and generate splits from their timeline.
# Select a patient to work with
patientid = dm.all_patientids[4]
print(f"Working with patient: {patientid}")

# Get patient data
patient_data = dm.get_patient_data(patientid)

# TODO: Generate splits from this patient's data
# Use data_splitter.get_splits_from_patient_with_target()
# This returns: forecasting_splits, events_splits, reference_dates
forecasting_splits, events_splits, reference_dates = None, None, None  # Replace with the actual call
🔍 Exercise 5.1: Analyze the Splits¶
Before converting, understand what the splitter produced.
# TODO: Answer these questions by exploring the splits:
# 1. How many splits were generated for this patient?
# 2. What dates are the reference points (split dates)?
# 3. What does each split contain?
print("Number of splits: ???") # Fill in
print("Reference dates: ???") # Fill in
# TODO: Convert the first split to instruction format
# Use converter.forward_conversion()
# Parameters:
# - forecasting_splits: the forecasting split for one time point
# - event_splits: the event split for one time point
split_idx = 0
p_converted = None # Replace with actual conversion call
🔍 Exercise 5.2: Examine the Output¶
# TODO: Print and examine the instruction (input prompt)
# What information is included? What's the structure?
# TODO: Print and examine the answer (target output)
# What format is the answer in? What predictions are being made?
❓ Quiz 3: Output Analysis¶
Q3.1: What sections can you identify in the instruction prompt?
Q3.2: How are the forecasting predictions formatted in the answer?
Q3.3: How are the time-to-event predictions formatted?
Your Answers:
Q3.1:
Q3.2:
Q3.3:
Part 6: Reverse Conversion¶
An important capability is converting model outputs back to structured data.
# TODO: Use reverse_conversion to parse the answer back to structured data
# You'll need:
# - The answer string from p_converted
# - The data manager (dm)
# - The reference date for this split
date = reference_dates["date"][split_idx]
return_list = None # Call converter.reverse_conversion()
# TODO: Examine what the reverse conversion produced
# What structure does return_list have? What's in each element?
🌟 Bonus Challenge 1: Custom Configuration¶
+15 points
Modify the configuration to predict only drug-related events instead of death and progression. Generate a new training example and compare the output.
# BONUS: Implement your custom configuration here
🌟 Bonus Challenge 2: Multi-Patient Dataset¶
+25 points
Write a function that generates training examples for ALL patients in the dataset and returns a pandas DataFrame with columns: patientid, split_idx, instruction, answer.
# BONUS: Implement the multi-patient dataset generator
def generate_training_dataset(dm, data_splitter, converter):
    """
    Generate training examples for all patients.

    Returns:
        pd.DataFrame with columns: patientid, split_idx, instruction, answer
    """
    # TODO: Implement this function
    pass
# Test your function
# df_training = generate_training_dataset(dm, data_splitter, converter)
# print(f"Generated {len(df_training)} training examples")
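If you're unsure how to accumulate the results, the standard pandas pattern (independent of TwinWeaver; the loop bounds below are stand-ins for your real patient IDs and splits) is to collect one dict per example and build the DataFrame in a single call, which is much faster than repeated `pd.concat`:

```python
import pandas as pd

# Generic accumulation pattern: one dict per training example,
# then a single DataFrame construction at the end.
rows = []
for patientid in ["p1", "p2"]:      # stand-in for dm.all_patientids
    for split_idx in range(2):      # stand-in for the generated splits
        rows.append(
            {
                "patientid": patientid,
                "split_idx": split_idx,
                "instruction": f"prompt for {patientid}/{split_idx}",
                "answer": f"target for {patientid}/{split_idx}",
            }
        )

df_training = pd.DataFrame(rows)
print(df_training.shape)  # (4, 4)
```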
🏆 Challenge Complete!¶
Congratulations on completing Challenge 1! You've learned how to:
- ✅ Explore and understand clinical data formats
- ✅ Configure the TwinWeaver pipeline for different tasks
- ✅ Generate training splits from patient timelines
- ✅ Convert data to instruction-tuning format
- ✅ Reverse convert predictions back to structured data
Ready for the next challenge? Move on to Challenge 2: End-to-End LLM Fine-tuning!