🏆 Challenge 1: Data Preparation for Training¶
Difficulty: ⭐⭐ (Intermediate) | Time: 45-60 minutes
🎯 Learning Objectives¶
By completing this challenge, you will:
- Understand the data format required by TwinWeaver
- Configure the pipeline for different prediction tasks
- Generate training splits from patient timelines
- Convert structured data to instruction-tuning format
📋 Rules¶
- Complete all `# TODO:` sections
- Answer quiz questions before proceeding
- Run checkpoint cells to validate your solutions
- No peeking at the original tutorial!
Part 1: Understanding the Data¶
Before we start coding, let's understand what data we're working with.
import pandas as pd

from twinweaver import (
    DataManager,
    Config,
)
# Load the example data
df_events = pd.read_csv("../example_data/events.csv")
df_constant = pd.read_csv("../example_data/constant.csv")
df_constant_description = pd.read_csv("../example_data/constant_description.csv")
🔍 Exercise 1.1: Explore the Data¶
Before configuring the pipeline, you need to understand your data. Explore the three dataframes to answer the quiz questions below.
# TODO: Explore df_events - what columns does it have? What are the unique event categories?
# Write your exploration code here
# TODO: Explore df_constant - what patient-level information is available?
# TODO: Explore df_constant_description - how does this map to df_constant?
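If you're unsure where to start, these generic pandas inspection patterns cover most of what the quiz asks for. The tiny stand-in dataframe below is purely illustrative (its columns and values are made up) so the snippet is runnable on its own; apply the same calls to `df_events`, `df_constant`, and `df_constant_description`.

```python
import pandas as pd

# Stand-in dataframe so the patterns below run on their own;
# swap in df_events / df_constant when doing the exercise.
df = pd.DataFrame(
    {
        "patientid": [1, 1, 2],
        "event_category": ["lab", "drug", "lab"],
        "value": [5.4, None, 6.1],
    }
)

print(df.head())                      # first rows: overall shape of the data
print(df.dtypes)                      # column types (dates often load as strings)
print(df.columns.tolist())            # all column names
print(df["event_category"].unique())  # distinct categories in a column
print(df["patientid"].nunique())      # number of unique patients
```

`unique()` answers "what categories exist?" while `nunique()` answers "how many distinct values?" — between them and `columns.tolist()` you can answer every question in Quiz 1.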
❓ Quiz 1: Data Understanding¶
Answer these questions based on your exploration:
Q1.1: What column in df_events contains the type of medical event (lab, drug, condition, etc.)?
Q1.2: List all unique event categories in the dataset:
Q1.3: How many unique patients are in the dataset?
Q1.4: What column in df_constant could be used to calculate a patient's age?
Write your answers in the cell below:
Your Answers:
Q1.1:
Q1.2:
Q1.3:
Q1.4:
Part 2: Configuration Challenge¶
Now you need to configure the TwinWeaver pipeline. This is where understanding your data pays off!
🎯 Your Task¶
Configure the pipeline to:
- Split patient histories around Lines of Therapy (treatment changes)
- Forecast lab values into the future
- Predict time-to-event for death and progression
config = Config()
# TODO: Set the event category used for splitting patient timelines
# HINT: Look at your answer to Q1.2 - which category represents treatment lines?
config.split_event_category = None # Replace None with the correct value
# TODO: Set which event categories should be forecasted as time-series
# HINT: We want to predict future lab values
config.event_category_forecast = None # Replace None with a list
# TODO: Configure time-to-event prediction targets
# HINT: This should be a dictionary mapping event names to display names
# Example: {"original_name": "display name in prompt"}
config.event_category_events_prediction_with_naming = None # Replace with dict
🏁 Checkpoint 2.1: Validate Configuration¶
# Run this cell to check your configuration
def validate_config_part1(config):
    errors = []

    if config.split_event_category is None:
        errors.append("❌ split_event_category is not set")
    elif config.split_event_category not in df_events["event_category"].unique():
        errors.append(f"❌ split_event_category '{config.split_event_category}' not found in data")
    else:
        print(f"✅ split_event_category: '{config.split_event_category}'")

    if config.event_category_forecast is None:
        errors.append("❌ event_category_forecast is not set")
    elif not isinstance(config.event_category_forecast, list):
        errors.append("❌ event_category_forecast should be a list")
    elif any(cat not in df_events["event_category"].unique() for cat in config.event_category_forecast):
        errors.append("❌ At least one of the event_category_forecast values not found in data")
    else:
        print(f"✅ event_category_forecast: {config.event_category_forecast}")

    if config.event_category_events_prediction_with_naming is None:
        errors.append("❌ event_category_events_prediction_with_naming is not set")
    elif not isinstance(config.event_category_events_prediction_with_naming, dict):
        errors.append("❌ event_category_events_prediction_with_naming should be a dict")
    elif any(
        cat not in df_events["event_category"].unique()
        for cat in config.event_category_events_prediction_with_naming.keys()
    ):
        errors.append("❌ At least one key in event_category_events_prediction_with_naming not found in data")
    else:
        print(f"✅ Event mapping: {config.event_category_events_prediction_with_naming}")

    if errors:
        print("\n" + "\n".join(errors))
        print("\n💡 Hint: Review Part 1 exploration to find the correct values")
    else:
        print("\n🎉 Part 2.1 Complete! Configuration looks good.")

    return len(errors) == 0

validate_config_part1(config)
🔧 Exercise 2.2: Configure Static Variables¶
Now configure which patient demographics to include in the prompts.
# TODO: Look at df_constant columns and decide which ones to include
# Consider: Which variables are clinically relevant for predictions?
# First, explore what's available
print("Available columns in df_constant:")
print(df_constant.columns.tolist())
# TODO: Select which constant columns to use (list of column names)
config.constant_columns_to_use = [] # Fill in the list
# TODO: Specify which column contains birth year/date for age calculation
config.constant_birthdate_column = None # Set the column name
Part 3: Initialize the Pipeline¶
With configuration complete, let's initialize the data processing components.
# TODO: Initialize DataManager and load data
# The DataManager needs to:
# 1. Be created with your config
# 2. Load the indication data (events, constant, constant_description)
# 3. Process the indication data
# 4. Setup unique mapping of events
# 5. Setup dataset splits
# 6. Infer variable types
dm = DataManager(config=config)
# TODO: Call the required methods in the correct order
# dm.????
# dm.????
# dm.????
# dm.????
# dm.????
🏁 Checkpoint 3.1: Validate DataManager¶
# Run this to verify DataManager is set up correctly
try:
    n_patients = len(dm.all_patientids)
    print(f"✅ DataManager initialized with {n_patients} patients")

    # Check if we can get patient data
    test_patient = dm.all_patientids[0]
    patient_data = dm.get_patient_data(test_patient)
    print(f"✅ Successfully retrieved data for patient {test_patient}")
    print(f"   - Events: {len(patient_data['events'])} rows")
    print(f"   - Constant: {len(patient_data['constant'])} rows")
    print("\n🎉 Part 3 Complete!")
except Exception as e:
    print(f"❌ Error: {e}")
    print("\n💡 Hint: Make sure you called all DataManager methods in the correct order")
Part 4: Create Splitters and Converter¶
❓ Quiz 2: Understanding Splitters¶
Before creating the splitters, answer these conceptual questions:
Q2.1: What is the purpose of splitting a patient's timeline? Why not use the entire history?
Q2.2: What's the difference between DataSplitterEvents and DataSplitterForecasting?
Q2.3: Why do we need a token budget for the converter?
Your Answers:
Q2.1:
Q2.2:
Q2.3:
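For intuition on Q2.3 (not TwinWeaver's actual tokenization), a common rough rule of thumb for English text is about four characters per token. This hypothetical estimator is only a heuristic for reasoning about prompt budgets; a real pipeline would count tokens with the target model's tokenizer.

```python
def rough_token_estimate(text: str) -> int:
    """Crude token-count estimate (~4 characters per token for English text).

    Heuristic only, for thinking about prompt budgets; a real pipeline
    would use the tokenizer of the target model instead.
    """
    return max(1, len(text) // 4)

prompt = "Patient history: 120 lab events, 14 drug events, 3 lines of therapy ..."
print(rough_token_estimate(prompt))
```

A patient with years of history can easily exceed a model's context window, which is why the converter needs a budget to decide what to keep.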
# TODO: Initialize DataSplitterEvents
# This handles event prediction tasks (death, progression)
data_splitter_events = None # Create the splitter
# TODO: Don't forget to call setup_variables() on it!
# TODO: Initialize DataSplitterForecasting
# This handles continuous variable forecasting (lab values)
data_splitter_forecasting = None # Create the splitter
# TODO: Call setup_statistics() for forecasting QA and filtering
# TODO: Combine both splitters using DataSplitter wrapper
data_splitter = None # Create the combined splitter
# TODO: Initialize ConverterInstruction
# Parameters needed:
# - nr_tokens_budget_total: How many tokens can the prompt be? (try 8192)
# - config: Your configuration object
# - dm: Your DataManager
# - variable_stats: Statistics from forecasting splitter (optional but recommended)
converter = None # Create the converter
🏁 Checkpoint 4.1: Validate Pipeline Components¶
# Validate all components are created
components_valid = True

if data_splitter_events is None:
    print("❌ data_splitter_events is not initialized")
    components_valid = False
else:
    print("✅ data_splitter_events initialized")

if data_splitter_forecasting is None:
    print("❌ data_splitter_forecasting is not initialized")
    components_valid = False
else:
    print("✅ data_splitter_forecasting initialized")

if data_splitter is None:
    print("❌ data_splitter is not initialized")
    components_valid = False
else:
    print("✅ data_splitter initialized")

if converter is None:
    print("❌ converter is not initialized")
    components_valid = False
else:
    print("✅ converter initialized")

if components_valid:
    print("\n🎉 Part 4 Complete! All components ready.")
Part 5: Generate a Training Example¶
With all components ready, pick a patient and generate splits from their timeline.
# Select a patient to work with
patientid = dm.all_patientids[4]
print(f"Working with patient: {patientid}")

# Get patient data
patient_data = dm.get_patient_data(patientid)

# TODO: Generate splits from this patient's data
# Use data_splitter.get_splits_from_patient_with_target()
# This returns: forecasting_splits, events_splits, reference_dates
forecasting_splits, events_splits, reference_dates = None, None, None  # Replace with the actual call
🔍 Exercise 5.1: Analyze the Splits¶
Before converting, understand what the splitter produced.
# TODO: Answer these questions by exploring the splits:
# 1. How many splits were generated for this patient?
# 2. What dates are the reference points (split dates)?
# 3. What does each split contain?
print("Number of splits: ???") # Fill in
print("Reference dates: ???") # Fill in
# TODO: Convert the first split to instruction format
# Use converter.forward_conversion()
# Parameters:
# - forecasting_splits: the forecasting split for one time point
# - event_splits: the event split for one time point
split_idx = 0
p_converted = None # Replace with actual conversion call
🔍 Exercise 5.2: Examine the Output¶
# TODO: Print and examine the instruction (input prompt)
# What information is included? What's the structure?
# TODO: Print and examine the answer (target output)
# What format is the answer in? What predictions are being made?
❓ Quiz 3: Output Analysis¶
Q3.1: What sections can you identify in the instruction prompt?
Q3.2: How are the forecasting predictions formatted in the answer?
Q3.3: How are the time-to-event predictions formatted?
Your Answers:
Q3.1:
Q3.2:
Q3.3:
Part 6: Reverse Conversion¶
An important capability is converting model outputs back to structured data.
# TODO: Use reverse_conversion to parse the answer back to structured data
# You'll need:
# - The answer string from p_converted
# - The data manager (dm)
# - The reference date for this split
date = reference_dates["date"][split_idx]
return_list = None # Call converter.reverse_conversion()
# TODO: Examine what the reverse conversion produced
# What structure does return_list have? What's in each element?
🌟 Bonus Challenge 1: Custom Configuration¶
+15 points
Modify the configuration to predict only drug-related events instead of death and progression. Generate a new training example and compare the output.
# BONUS: Implement your custom configuration here
🌟 Bonus Challenge 2: Multi-Patient Dataset¶
+25 points
Write a function that generates training examples for ALL patients in the dataset and returns a pandas DataFrame with columns: patientid, split_idx, instruction, answer.
# BONUS: Implement the multi-patient dataset generator
def generate_training_dataset(dm, data_splitter, converter):
    """
    Generate training examples for all patients.

    Returns:
        pd.DataFrame with columns: patientid, split_idx, instruction, answer
    """
    # TODO: Implement this function
    pass
# Test your function
# df_training = generate_training_dataset(dm, data_splitter, converter)
# print(f"Generated {len(df_training)} training examples")
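If you're unsure how to accumulate the results, the standard pandas pattern (independent of TwinWeaver; the loop bounds below are stand-ins for your real patient IDs and splits) is to collect one dict per example and build the DataFrame in a single call, which is much faster than repeated `pd.concat`:

```python
import pandas as pd

# Generic accumulation pattern: one dict per training example,
# then a single DataFrame construction at the end.
rows = []
for patientid in ["p1", "p2"]:      # stand-in for dm.all_patientids
    for split_idx in range(2):      # stand-in for the generated splits
        rows.append(
            {
                "patientid": patientid,
                "split_idx": split_idx,
                "instruction": f"prompt for {patientid}/{split_idx}",
                "answer": f"target for {patientid}/{split_idx}",
            }
        )

df_training = pd.DataFrame(rows)
print(df_training.shape)  # (4, 4)
```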
🏆 Challenge Complete!¶
Congratulations on completing Challenge 1! You've learned how to:
- ✅ Explore and understand clinical data formats
- ✅ Configure the TwinWeaver pipeline for different tasks
- ✅ Generate training splits from patient timelines
- ✅ Convert data to instruction-tuning format
- ✅ Reverse convert predictions back to structured data
Ready for the next challenge? Move on to Challenge 2: End-to-End LLM Fine-tuning!