Config¶

twinweaver.common.config ¶

Classes¶

Config ¶

Centralized configuration repository for data processing, prompt generation, and constants.

This class consolidates various configuration settings essential for the data processing pipeline. It defines standardized column names, specific values for event categories (like 'line of therapy', 'death'), data source identifiers, file paths, table names, and text templates (prompts) used for different language model tasks such as text conversion, forecasting (value prediction, time-to-event), quality assurance (QA) via binning, and setting up multi-task prompts. Default values are provided but can be overridden to adapt to specific datasets, model requirements, or experimental setups.

Attributes:

Name	Type	Description
`date_cutoff`	`str \| None`	If set, only use data before this date (format: "YYYY-MM-DD"), censored after. Default: None.
`delta_time_unit`	`str`	Unit of time used to express intervals between patient visits in the generated text. Options are "days" or "weeks". Default: "weeks".
`numeric_detect_min_fraction`	`float`	Fraction of values that must be numeric to classify a variable as numeric. Defaults to 0.99.
`date_col`	`str`	Standardized column name for date information across datasets. Default: "date".
`patient_id_col`	`str`	Standardized column name for unique patient identifiers. Default: "patientid".
`event_category_col`	`str`	Standardized column name for the category of a recorded event (e.g., 'lab', 'diagnosis'). Default: "event_category".
`event_name_col`	`str`	Standardized column name for the specific name of an event within its category (e.g., 'Glucose level', 'Type 2 Diabetes'). Default: "event_name".
`event_descriptive_name_col`	`str`	Standardized column name for a more human-readable or descriptive name of the event. Default: "event_descriptive_name".
`event_value_col`	`str`	Standardized column name for the value associated with an event (e.g., a lab result, 'present'). Default: "event_value".
`source_col`	`str`	Standardized column name indicating the origin or type of the data record. Default: "source".
`meta_data_col`	`str`	Standardized column name for storing additional metadata related to an event. Default: "meta_data".
`constant_split_col`	`str`	Standardized column name for data split information (train/test/val) in the constant dataframe. Default: "data_split".
`event_category_default_value`	`str`	Default value to assign to `event_category_col` if it is missing in the data. Default: "general".
`event_meta_default_value`	`Any`	Default value to assign to `meta_data_col` if it is missing. Default: pd.NA.
`source_col_default_value`	`str`	Default value to assign to `source_col` if it is missing. Default: "events".
`lot_date_col`	`str`	Column name specifically used for dates related to line of therapy (LoT) events. Default: "lot_date".
`lot_name_col`	`str`	Column name for the name or identifier of the line of therapy (e.g., "First Line"). Default: "lot".
`event_value_lot_start`	`str`	Specific string value used in `event_value_col` to denote the start of a line of therapy. Default: "LoT Start".
`skip_future_lot_filtering`	`bool`	Flag indicating whether to skip filtering out future line of therapy events. Default: False. Useful in case you accidentially overlap LoTs which are actually the same, use with caution.
`lot_concatenate_descriptive_and_value`	`bool`	Flag indicating whether to concatenate the descriptive name and value for line of therapy events. Default: False.
`lot_concatenate_string`	`str`	String used to concatenate the descriptive name and value for line of therapy events when `lot_concatenate_descriptive_and_value` is True. Default: " - ".
`warning_for_splitters_patient_without_lots`	`bool`	Whether to warn if a patient has no LoT events in DataSplitterEvents. Default: True.
`event_category_lot`	`str`	Specific string value used in `event_category_col` to identify 'line of therapy' events. Default: "lot".
`event_category_death`	`str`	Specific string value used in `event_category_col` to identify 'death' events. Default: "death".
`event_category_labs`	`str`	Specific string value used in `event_category_col` to identify 'lab result' events. Default: "lab".
`event_category_forecast`	`list[str]`	List of event categories to be considered for forecasting tasks. Default: ["lab"].
`source_genetic`	`str`	Specific string value used in `source_col` to identify data originating from genetic testing. Default: "genetic".
`source_standard_events`	`str`	Source identifier for standard clinical events. Default: "events".
`genetic_skip_text_value`	`str`	A specific event value (often for genetic data) that might be skipped during text generation to avoid redundancy if its presence is implied elsewhere. Default: "present".
`genetic_tag_opening`	`str`	Opening tag used to demarcate genetic information within generated text. Default: "".
`genetic_tag_closing`	`str`	Closing tag used to demarcate genetic information within generated text. Default: "".
`event_table_name`	`str`	The base name (without extension) for the primary file or table containing event data. Default: "events".
`train_split_name`	`str`	Identifier for the training dataset split (e.g., used in file naming or data loading). Default: "train".
`validation_split_name`	`str`	Identifier for the validation dataset split. Default: "validation".
`test_split_name`	`str`	Identifier for the test dataset split. Default: "test".
`bins_split_name`	`str`	Identifier for a data split used for binning tasks, often related to QA. Default: "5_equal_sized_bins".
`preamble_text`	`str`	Introductory text inserted at the beginning of the textual representation of a patient's record. Default: Explains structure and LOINC codes.
`constant_text`	`str`	Text used to introduce the section containing static demographic data in the textual patient record. Default: "\n\nStarting with demographic data:\n".
`genetic_empty_text`	`str`	Text to use when no genetic data is available for a patient. Default: "No genetic data available.".
`first_day_text`	`str`	Text used to introduce the events that occurred on the patient's very first recorded visit day. Default: "\nOn the first visit, the patient experienced the following: \n".
`event_day_preamble`	`str`	Text inserted before the description of events for visits subsequent to the first one. Default: "\n".
`event_day_text`	`str`	Template text used to introduce events on subsequent visit days, indicating the time elapsed since the previous visit. Default: " self.delta_time_unit : later, the patient visited and experienced the following: \n".
`post_event_text`	`str`	Text appended after listing all events for a specific visit day. Default: ".\n".
`forecasting_fval_prompt_start`	`str`	Initial text for prompts instructing a language model to predict future numerical values of specified variables over time. Default: Instructs prediction per cumulative week.
`forecasting_prompt_var_time`	`str`	Text segment used within forecasting prompts to specify the time frame (e.g., future weeks) for prediction. Default: " the future weeks ".
`forecasting_prompt_summarized_start`	`str`	Initial text for prompts that include a summary of the last known values of variables being forecasted. Default: "\nThe last values of the variables in the input data are:\n".
`forecasting_firstday_override`	`str`	Alternative introductory text for forecasting prompts, possibly used when only a subset of initial data is presented, hinting at omissions. Default: Mentions included events, potential omissions.
`forecasting_prompt_summarized_genetic`	`str`	Text used to introduce a summary section listing the last observed genetic event statuses within a forecasting prompt. Default: "\n\n\n\nHere we repeat the last observed values of each genetic event in the input data:\n".
`forecasting_prompt_summarized_lot`	`str`	Text used to introduce a summary section describing the most recent line of therapy within a forecasting prompt. Default: "\nThe most recent line of therapy:\n".
`forecasting_tte_prompt_start`	`str`	Initial text for prompts instructing a language model to predict time-to-event (TTE) outcomes, specifically focusing on whether an event is censored. Default: Asks for censoring prediction.
`forecasting_tte_prompt_mid`	`str`	Middle text segment for TTE prompts, specifying the prediction horizon (in weeks) and asking about event occurrence status. Default: Specifies weeks and asks about occurrence.
`forecasting_tte_prompt_end`	`str`	Concluding text for TTE prompts, detailing the required output format for the prediction (censoring and occurrence). Default: Specifies format like "'Here is the prediction: the event () was [not] censored and [did not occur]/[occurred].'".
`target_prompt_start`	`str`	Template string used to begin constructing the target (ground truth) output string for TTE tasks, includes placeholder for event name. Default: "\nHere is the prediction: the event ({event_name}) was ".
`target_prompt_censor_true`	`str`	Text segment used in the TTE target output to indicate that the event was censored within the observation period. Default: "censored.".
`target_prompt_censor_false`	`str`	Text segment used in the TTE target output to indicate that the event was not censored. Default: "not censored ".
`target_prompt_before_occur`	`str`	Conjunction used in the TTE target output between the censoring status and the occurrence status. Default: "and ".
`target_prompt_occur`	`str`	Text segment used in the TTE target output to indicate that the event did occur. Default: "occurred.".
`target_prompt_not_occur`	`str`	Text segment used in the TTE target output to indicate that the event did not occur. Default: "did not occur.".
`qa_prompt_start`	`str`	Initial text for prompts instructing a model to perform a Quality Assurance (QA) task, specifically predicting value bins for future variable values. Default: Asks for bin prediction per week.
`qa_bins_start`	`str`	Text used within QA prompts to introduce the list of possible bins the model should choose from. Default: "\tThe possible bins are: ".
`task_prompt_start`	`str`	Introductory text for multi-task prompts, explaining that multiple tasks follow and instructing the model on the required response format (e.g., prefixing each answer with 'Task X:'). Default: Explains multi-task format.
`task_prompt_each_task`	`str`	Template string used to introduce each individual task within a multi-task prompt, includes placeholder for task number. Default: "Task {task_nr} is ".
`task_prompt_end`	`str`	Concluding text for the overall multi-task prompt setup. Default: "" (empty string).
`task_prompt_forecasting`	`str`	Identifier text appended to `task_prompt_each_task` to specify a forecasting sub-task. Default: "forecasting:".
`task_prompt_forecasting_qa`	`str`	Identifier text appended to `task_prompt_each_task` to specify a forecasting QA (binning) sub-task. Default: "forecasting QA:".
`task_prompt_events`	`str`	Identifier text appended to `task_prompt_each_task` to specify a time-to-event prediction sub-task. Default: "time to event prediction:".
`task_prompt_custom`	`str`	Identifier text appended to `task_prompt_each_task` to specify a custom-defined sub-task. Default: " a custom task:".
`task_target_start`	`str`	Template string used to begin the target (ground truth) output corresponding to a specific task number in a multi-task setting. Default: "Task {task_nr} is ".
`task_target_end`	`str`	Concluding text for the target output of a specific task within a multi-task response. Default: "" (empty string).
`decimal_precision`	`int`	Number of decimal places to use when rounding numerical values (e.g., lab results) during text conversion. Default: 2.
`event_category_preamble_mapping_override`	`dict \| None`	Optional dictionary to override the introductory text used before listing events of a specific category on a given day. Structure: `{<event_category>: <preamble_string>}`. Default: None.
`event_category_and_name_replace_override`	`dict \| None`	Optional nested dictionary to define specific replacements for event descriptions based on category and name. Allows replacing the entire event string and defining a value for reverse mapping. Structure: `{<event_category>: {<event_name>: {"full_replacement_string": <str>, "reverse_string_value": <str>}}}`. Default: None.
`always_keep_first_visit`	`bool`	Flag indicating whether the events from the very first visit should always be included in the patient history, regardless of token budget constraints. Default: True.
`seed`	`int`	Seed value for random number generators to ensure reproducibility in processes like data splitting or sampling. Default: 768921.
`nr_tokens_budget_padding`	`int`	Number of tokens reserved as a buffer when calculating token budgets, ensuring outputs don't exceed limits. May need adjustment based on model/task. Default: 200.
`tokenizer_to_use`	`str`	Identifier string for the tokenizer model to be used for counting tokens (e.g., for budget calculations). Should correspond to a model available in the environment (e.g., from Hugging Face). Default: 'microsoft/Phi-4-mini-instruct'.
`constant_columns_to_use`	`list[str]`	List of column names from the constant (demographic) data source to be included in the processing and text conversion. Note: Age might be handled separately. Default: ["race", "gender", "ethnicity", "indication"].
`constant_birthdate_column`	`str \| None`	Column name in the constant table representing the patient's birth date or birth year. If provided, age calculation is performed relative to the first event date. Default: None.
`constant_birthdate_column_format`	`str`	Format of the birthdate column, either "date" or "age". Default: "date".
`data_splitter_events_variables_category_mapping`	`dict`	Mapping defining which event categories correspond to specific prediction types in DataSplitterEvents. Keys are event categories (e.g., 'death', 'progression'), values are descriptive names for the target variable.
`data_splitter_events_backup_category_mapping`	`dict`	Fallback mapping for event categories in DataSplitterEvents. Used if the primary category variables are not found. Keys are the missing categories, values are the backup categories to use.

Source code in twinweaver/common/config.py

class Config:
    """
    Centralized configuration repository for data processing, prompt generation, and constants.

    This class consolidates various configuration settings essential for the data processing
    pipeline. It defines standardized column names, specific values for event categories (like
    'line of therapy', 'death'), data source identifiers, file paths, table names, and text
    templates (prompts) used for different language model tasks such as text conversion,
    forecasting (value prediction, time-to-event), quality assurance (QA) via binning, and
    setting up multi-task prompts. Default values are provided but can be overridden to
    adapt to specific datasets, model requirements, or experimental setups.

    Attributes
    ----------
    date_cutoff : str | None
        If set, only use data before this date (format: "YYYY-MM-DD"), censored after. Default: None.
    delta_time_unit : str
        Unit of time used to express intervals between patient visits in the generated text. Options are "days" or
        "weeks". Default: "weeks".
    numeric_detect_min_fraction: float
        Fraction of values that must be numeric to classify a variable as numeric. Defaults to 0.99.
    date_col : str
        Standardized column name for date information across datasets. Default: "date".
    patient_id_col : str
        Standardized column name for unique patient identifiers. Default: "patientid".
    event_category_col : str
        Standardized column name for the category of a recorded event (e.g., 'lab', 'diagnosis'). Default:
        "event_category".
    event_name_col : str
        Standardized column name for the specific name of an event within its category (e.g., 'Glucose level',
        'Type 2 Diabetes'). Default: "event_name".
    event_descriptive_name_col : str
        Standardized column name for a more human-readable or descriptive name of the event. Default:
        "event_descriptive_name".
    event_value_col : str
        Standardized column name for the value associated with an event (e.g., a lab result, 'present').
        Default: "event_value".
    source_col : str
        Standardized column name indicating the origin or type of the data record. Default: "source".
    meta_data_col : str
        Standardized column name for storing additional metadata related to an event. Default: "meta_data".
    constant_split_col : str
        Standardized column name for data split information (train/test/val) in the constant dataframe.
        Default: "data_split".
    event_category_default_value : str
        Default value to assign to `event_category_col` if it is missing in the data. Default: "general".
    event_meta_default_value : Any
        Default value to assign to `meta_data_col` if it is missing. Default: pd.NA.
    source_col_default_value : str
        Default value to assign to `source_col` if it is missing. Default: "events".
    lot_date_col : str
        Column name specifically used for dates related to line of therapy (LoT) events. Default: "lot_date".
    lot_name_col : str
        Column name for the name or identifier of the line of therapy (e.g., "First Line"). Default: "lot".
    event_value_lot_start : str
        Specific string value used in `event_value_col` to denote the start of a line of therapy. Default: "LoT Start".
    skip_future_lot_filtering : bool
        Flag indicating whether to skip filtering out future line of therapy events. Default: False.
        Useful in case you accidentially overlap LoTs which are actually the same, use with caution.
    lot_concatenate_descriptive_and_value : bool
        Flag indicating whether to concatenate the descriptive name and value for line of therapy events.
        Default: False.
    lot_concatenate_string : str
        String used to concatenate the descriptive name and value for line of therapy events when
        `lot_concatenate_descriptive_and_value` is True. Default: " - ".
    warning_for_splitters_patient_without_lots : bool
        Whether to warn if a patient has no LoT events in DataSplitterEvents. Default: True.
    event_category_lot : str
        Specific string value used in `event_category_col` to identify 'line of therapy' events. Default: "lot".
    event_category_death : str
        Specific string value used in `event_category_col` to identify 'death' events. Default: "death".
    event_category_labs : str
        Specific string value used in `event_category_col` to identify 'lab result' events. Default: "lab".
    event_category_forecast : list[str]
        List of event categories to be considered for forecasting tasks. Default: ["lab"].
    source_genetic : str
        Specific string value used in `source_col` to identify data originating from genetic testing.
        Default: "genetic".
    source_standard_events : str
        Source identifier for standard clinical events. Default: "events".
    genetic_skip_text_value : str
        A specific event value (often for genetic data) that might be skipped during text generation to avoid
        redundancy if its presence is implied elsewhere. Default: "present".
    genetic_tag_opening : str
        Opening tag used to demarcate genetic information within generated text. Default: "<genetic>".
    genetic_tag_closing : str
        Closing tag used to demarcate genetic information within generated text. Default: "</genetic>".
    event_table_name : str
        The base name (without extension) for the primary file or table containing event data. Default: "events".
    train_split_name : str
        Identifier for the training dataset split (e.g., used in file naming or data loading). Default: "train".
    validation_split_name : str
        Identifier for the validation dataset split. Default: "validation".
    test_split_name : str
        Identifier for the test dataset split. Default: "test".
    bins_split_name : str
        Identifier for a data split used for binning tasks, often related to QA. Default: "5_equal_sized_bins".
    preamble_text : str
        Introductory text inserted at the beginning of the textual representation of a patient's record.
        Default: Explains structure and LOINC codes.
    constant_text : str
        Text used to introduce the section containing static demographic data in the textual patient record.
        Default: "\\n\\nStarting with demographic data:\\n".
    genetic_empty_text : str
        Text to use when no genetic data is available for a patient. Default: "No genetic data available.".
    first_day_text : str
        Text used to introduce the events that occurred on the patient's very first recorded visit day.
        Default: "\\nOn the first visit, the patient experienced the following: \\n".
    event_day_preamble : str
        Text inserted before the description of events for visits subsequent to the first one. Default: "\\n".
    event_day_text : str
        Template text used to introduce events on subsequent visit days, indicating the time elapsed since the previous
        visit. Default: " self.delta_time_unit : later, the patient visited and experienced the following: \\n".
    post_event_text : str
        Text appended after listing all events for a specific visit day. Default: ".\\n".
    forecasting_fval_prompt_start : str
        Initial text for prompts instructing a language model to predict future numerical values of specified
        variables over time. Default: Instructs prediction per cumulative week.
    forecasting_prompt_var_time : str
        Text segment used within forecasting prompts to specify the time frame (e.g., future weeks) for prediction.
        Default: " the future weeks ".
    forecasting_prompt_summarized_start : str
        Initial text for prompts that include a summary of the last known values of variables being forecasted.
        Default: "\\nThe last values of the variables in the input data are:\\n".
    forecasting_firstday_override : str
        Alternative introductory text for forecasting prompts, possibly used when only a subset of initial data is
        presented, hinting at omissions. Default: Mentions included events, potential omissions.
    forecasting_prompt_summarized_genetic : str
        Text used to introduce a summary section listing the last observed genetic event statuses within a forecasting
        prompt.
        Default: "\\n\\n\\n\\nHere we repeat the last observed values of each genetic event in the input data:\\n".
    forecasting_prompt_summarized_lot : str
        Text used to introduce a summary section describing the most recent line of therapy within a forecasting
        prompt. Default: "\\nThe most recent line of therapy:\\n".
    forecasting_tte_prompt_start : str
        Initial text for prompts instructing a language model to predict time-to-event (TTE) outcomes, specifically
        focusing on whether an event is censored. Default: Asks for censoring prediction.
    forecasting_tte_prompt_mid : str
        Middle text segment for TTE prompts, specifying the prediction horizon (in weeks) and asking about event
        occurrence status. Default: Specifies weeks and asks about occurrence.
    forecasting_tte_prompt_end : str
        Concluding text for TTE prompts, detailing the required output format for the prediction (censoring and
        occurrence). Default: Specifies format like "'Here is the prediction: the event (<name>) was [not] censored
        and [did not occur]/[occurred].'".
    target_prompt_start : str
        Template string used to begin constructing the target (ground truth) output string for TTE tasks, includes
        placeholder for event name. Default: "\\nHere is the prediction: the event ({event_name}) was ".
    target_prompt_censor_true : str
        Text segment used in the TTE target output to indicate that the event *was* censored within the observation
        period. Default: "censored.".
    target_prompt_censor_false : str
        Text segment used in the TTE target output to indicate that the event *was not* censored. Default:
        "not censored ".
    target_prompt_before_occur : str
        Conjunction used in the TTE target output between the censoring status and the occurrence status.
        Default: "and ".
    target_prompt_occur : str
        Text segment used in the TTE target output to indicate that the event *did* occur. Default: "occurred.".
    target_prompt_not_occur : str
        Text segment used in the TTE target output to indicate that the event *did not* occur.
        Default: "did not occur.".
    qa_prompt_start : str
        Initial text for prompts instructing a model to perform a Quality Assurance (QA) task, specifically predicting
        value bins for future variable values. Default: Asks for bin prediction per week.
    qa_bins_start : str
        Text used within QA prompts to introduce the list of possible bins the model should choose from.
        Default: "\\tThe possible bins are: ".
    task_prompt_start : str
        Introductory text for multi-task prompts, explaining that multiple tasks follow and instructing the model on the
        required response format (e.g., prefixing each answer with 'Task X:'). Default: Explains multi-task format.
    task_prompt_each_task : str
        Template string used to introduce each individual task within a multi-task prompt, includes placeholder for
        task number. Default: "Task {task_nr} is ".
    task_prompt_end : str
        Concluding text for the overall multi-task prompt setup. Default: "" (empty string).
    task_prompt_forecasting : str
        Identifier text appended to `task_prompt_each_task` to specify a forecasting sub-task. Default: "forecasting:".
    task_prompt_forecasting_qa : str
        Identifier text appended to `task_prompt_each_task` to specify a forecasting QA (binning) sub-task.
        Default: "forecasting QA:".
    task_prompt_events : str
        Identifier text appended to `task_prompt_each_task` to specify a time-to-event prediction sub-task.
        Default: "time to event prediction:".
    task_prompt_custom : str
        Identifier text appended to `task_prompt_each_task` to specify a custom-defined sub-task.
        Default: " a custom task:".
    task_target_start : str
        Template string used to begin the target (ground truth) output corresponding to a specific task number in a
        multi-task setting. Default: "Task {task_nr} is ".
    task_target_end : str
        Concluding text for the target output of a specific task within a multi-task response.
        Default: "" (empty string).
    decimal_precision : int
        Number of decimal places to use when rounding numerical values (e.g., lab results) during text conversion.
        Default: 2.
    event_category_preamble_mapping_override : dict | None
        Optional dictionary to override the introductory text used before listing events of a specific category on a
        given day. Structure: `{<event_category>: <preamble_string>}`. Default: None.
    event_category_and_name_replace_override : dict | None
        Optional nested dictionary to define specific replacements for event descriptions based on category and name.
        Allows replacing the entire event string and defining a value for reverse mapping.
        Structure:
        `{<event_category>: {<event_name>: {"full_replacement_string": <str>, "reverse_string_value": <str>}}}`.
        Default: None.
    always_keep_first_visit : bool
        Flag indicating whether the events from the very first visit should always be included in the patient history,
        regardless of token budget constraints. Default: True.
    seed : int
        Seed value for random number generators to ensure reproducibility in processes like data splitting or sampling.
        Default: 768921.
    nr_tokens_budget_padding : int
        Number of tokens reserved as a buffer when calculating token budgets, ensuring outputs don't exceed limits.
        May need adjustment based on model/task. Default: 200.
    tokenizer_to_use : str
        Identifier string for the tokenizer model to be used for counting tokens (e.g., for budget calculations).
        Should correspond to a model available in the environment (e.g., from Hugging Face).
        Default: 'microsoft/Phi-4-mini-instruct'.
    constant_columns_to_use : list[str]
        List of column names from the constant (demographic) data source to be included in the processing and text
        conversion. *Note: Age might be handled separately.* Default: ["race", "gender", "ethnicity", "indication"].
    constant_birthdate_column : str | None
        Column name in the constant table representing the patient's birth date or birth year.
        If provided, age calculation is performed relative to the first event date. Default: None.
    constant_birthdate_column_format : str
        Format of the birthdate column, either "date" or "age". Default: "date".
    data_splitter_events_variables_category_mapping : dict
        Mapping defining which event categories correspond to specific prediction types in DataSplitterEvents.
        Keys are event categories (e.g., 'death', 'progression'), values are descriptive names for the target variable.
    data_splitter_events_backup_category_mapping : dict
         Fallback mapping for event categories in DataSplitterEvents. Used if the primary category variables are not
         found. Keys are the missing categories, values are the backup categories to use.
    """

    def __init__(self):
        # --- Import data parameters ---
        self.date_cutoff = None  # If set, only use data before this date (format: "YYYY-MM-DD"), censored after
        self.delta_time_unit: str = (
            "weeks"  # Either "days" or "weeks" - if you change this, you need to call set_delta_time_unit
        )
        self.numeric_detect_min_fraction: float = (
            0.99  # Fraction of numeric values required to consider an event as numeric
        )

        # --- Core Column Names ---
        self.date_col: str = "date"
        self.patient_id_col: str = "patientid"
        self.event_category_col: str = "event_category"
        self.event_name_col: str = "event_name"
        self.event_descriptive_name_col: str = "event_descriptive_name"
        self.event_value_col: str = "event_value"
        self.source_col: str = "source"
        self.meta_data_col: str = "meta_data"
        self.constant_split_col: str = "data_split"

        # --- Specific Category/Type Column Names ---
        self.event_category_default_value = "general"  # Default value for event category if not present
        self.event_meta_default_value = pd.NA  # Default value for event meta data if not present
        self.source_col_default_value: str = "events"  # Default value for source column if not present
        self.lot_date_col: str = "lot_date"
        self.lot_name_col: str = "lot"
        self.event_value_lot_start: str = "LoT Start"
        self.skip_future_lot_filtering: bool = False  # Whether to skip filtering future LoT events, by default False.
        self.lot_concatenate_descriptive_and_value: bool = (
            False  # If true, concatenate descriptive name and value for LoT events, by default False (only event_vale.)
        )
        self.lot_concatenate_string: str = (
            " - "  # String used to concatenate descriptive name and value for LoT events, by default " - ".
        )

        # Warnings and logs
        self.warning_for_splitters_patient_without_lots: bool = (
            True  # Whether to warn if a patient has no LoT events in DataSplitterEvents
        )

        # --- Specific Event Categories / Values / Sources ---
        self.event_category_lot: str = "lot"
        self.event_category_death: str = "death"
        self.event_category_labs: str = "lab"
        self.event_category_forecast: list = ["lab"]  # List of event categories to be used for forecasting

        self.source_genetic: str = "genetic"
        self.source_standard_events: str = "events"
        self.genetic_skip_text_value: str = "present"
        self.genetic_tag_opening: str = "<genetic>"
        self.genetic_tag_closing: str = "</genetic>"

        # --- Data Paths, Tables, and Splits ---
        self.event_table_name: str = "events"
        self.train_split_name: str = "train"
        self.validation_split_name: str = "validation"
        self.test_split_name: str = "test"
        self.bins_split_name: str = "5_equal_sized_bins"

        # --- Text Conversion Prompts ---
        self.preamble_text: str = (
            "The following is a patient, starting with the demographic data, "
            "following visit by visit everything that the patient experienced. "
            "All lab codes refer to LOINC codes."
        )
        self.constant_text: str = "\n\nStarting with demographic data:\n"
        self.first_day_text: str = "\nOn the first visit, the patient experienced the following: \n"
        self.event_day_preamble: str = "\n"
        self._event_day_text_template: str = " {unit} later, the patient visited and experienced the following: \n"
        self.event_day_text: str = self._event_day_text_template.format(unit=self.delta_time_unit)
        self.post_event_text: str = ".\n"
        self.genetic_empty_text: str = "No genetic data available."

        # --- Forecasting Prompts (General & Summarization) ---
        self._forecasting_fval_prompt_start_template: str = (
            "\nYour task is to predict the future values of the following variables "
            "for each cumulative {unit} starting from the last visit:\n"
        )
        self.forecasting_fval_prompt_start: str = self._forecasting_fval_prompt_start_template.format(
            unit=self.delta_time_unit
        )

        self._forecasting_prompt_var_time_template: str = " the future {unit} "
        self.forecasting_prompt_var_time: str = self._forecasting_prompt_var_time_template.format(
            unit=self.delta_time_unit
        )
        self.forecasting_prompt_summarized_start: str = "\nThe last values of the variables in the input data are:\n"
        self.forecasting_firstday_override: str = (
            "\nThe following events are included in the input data, though "
            "potentially there are more which were omitted. Starting with: \n"
        )
        self.forecasting_prompt_summarized_genetic: str = (
            "\n\n\n\nHere we repeat the last observed values of each genetic event in the input data:\n"
        )
        self.forecasting_prompt_summarized_lot: str = "\nThe most recent line of therapy:\n"

        # --- Forecasting Prompts (Time-to-Event Specific) ---
        self.forecasting_tte_prompt_start: str = "\nYour task is to predict whether the following event was censored "
        self._forecasting_tte_prompt_mid_template: str = (
            " {unit} from the last clinical visit and whether the event occurred or not: "
        )
        self.forecasting_tte_prompt_mid: str = self._forecasting_tte_prompt_mid_template.format(
            unit=self.delta_time_unit
        )
        self.forecasting_tte_prompt_end: str = (
            ".\nPlease provide your prediction in the following format: "
            "'Here is the prediction: the event (<name of event>) was [not] censored "
            "and [did not occur]/[occurred].'"
        )

        # --- Target Output Prompts (Time-to-Event) ---
        self.target_prompt_start: str = "\nHere is the prediction: the event ({event_name}) was "
        self.target_prompt_censor_true: str = "censored."
        self.target_prompt_censor_false: str = "not censored "
        self.target_prompt_before_occur: str = "and "
        self.target_prompt_occur: str = "occurred."
        self.target_prompt_not_occur: str = "did not occur."

        # --- QA Prompts (Binning) ---
        self._qa_prompt_start_template: str = (
            "\nYour task is to predict the appropriate bins for the future values of "
            "the following variables for each cumulative {unit} starting from the date of the last visit:"
        )
        self.qa_prompt_start = self._qa_prompt_start_template.format(unit=self.delta_time_unit)
        self.qa_bins_start: str = "\tThe possible bins are: "

        # --- Multi-Task Prompts ---
        self.task_prompt_start: str = (
            "\nYou will now have multiple tasks to complete. Please answer for each "
            "task in the same order as they are presented. Before every response state the "
            "task nr, e.g. 'Task 2:'.\n\n"
        )
        self.task_prompt_each_task: str = "Task {task_nr} is "
        self.task_prompt_end: str = ""
        self.task_prompt_forecasting: str = "forecasting:"
        self.task_prompt_forecasting_qa: str = "forecasting QA:"
        self.task_prompt_events: str = "time to event prediction:"
        self.task_prompt_custom: str = " a custom task:"
        self.task_target_start: str = "Task {task_nr} is "
        self.task_target_end: str = ""

        # --- Overrides -----
        self.decimal_precision = 2  # Number of decimal places to round values to, by default 2.
        self.event_category_preamble_mapping_override = None
        # Override for the event category preamble mapping (default is None).
        # Structure is {<event_category>: <preamble_string>}

        self.event_category_and_name_replace_override = None
        # dict, optional
        #    Override for the event category and name replace mapping (default is None).
        #    Structure is {<event_category>: {
        #        <event_name>: {
        #            "full_replacement_string": <full_replacement_string>,
        #            "reverse_string_value": <reverse_string_value>
        #            }
        #        }
        #    }

        self.always_keep_first_visit: bool = (
            True  # Whether to always keep the first visit in the patient history, by default True.
        )

        # Seeds
        self.seed = 768921  # I like both of these numbers

        # Token budgets
        self.nr_tokens_budget_padding: int = 200  # Might need to be set to 500 for pretrain

        # Tokenizers for counting
        self.tokenizer_to_use: str = "microsoft/Phi-4-mini-instruct"

        # --- Processing of constant ---
        self.constant_columns_to_use: list = [
            "race",
            "gender",
            "ethnicity",
            "indication",
        ]  # Which columns to use from the constant data
        self.constant_birthdate_column: str = None  # If set, use this column for age calculation
        self.constant_birthdate_column_format: str = "date"  # Either "date" or "age"

        # --- Data splitter events setup ---
        # Used to identify which variables correspond to which event categories for
        # different event types, by default set for the 4 used in GDT
        self.data_splitter_events_variables_category_mapping = {
            "death": "death",
            "progression": "next progression",
            "lot": "next line of therapy",
            "metastasis": "next metastasis",
        }

        # Used to backup event categories for event types if no variables are found
        # e.g. progression -> death
        self.data_splitter_events_backup_category_mapping = {
            "progression": "death",
        }

    def set_delta_time_unit(self, unit: str, unit_sing=None):
        """
        Set the time unit for delta time representation in text conversion. Possible to set either
        "days" (and "day(s)") or "weeks" (and "week(s)"). Optionally, a singular form can be provided
        for use in specific prompts. If not provided, the plural form will be used.
        """
        assert unit in ("days", "weeks", "day(s)", "week(s)"), "unit must be either 'days' or 'weeks'"
        assert unit_sing in (None, "day", "week"), "unit_sing must be either None, 'day' or 'week'"
        self.delta_time_unit = unit
        if unit_sing is None:
            unit_sing = unit

        self.event_day_text = self._event_day_text_template.format(unit=unit)
        self.forecasting_fval_prompt_start = self._forecasting_fval_prompt_start_template.format(unit=unit_sing)
        self.forecasting_prompt_var_time = self._forecasting_prompt_var_time_template.format(unit=unit)
        self.forecasting_tte_prompt_mid = self._forecasting_tte_prompt_mid_template.format(unit=unit)
        self.qa_prompt_start = self._qa_prompt_start_template.format(unit=unit_sing)

Functions¶

set_delta_time_unit ¶

set_delta_time_unit(unit, unit_sing=None)

Set the time unit for delta time representation in text conversion. Possible to set either "days" (and "day(s)") or "weeks" (and "week(s)"). Optionally, a singular form can be provided for use in specific prompts. If not provided, the plural form will be used.

Source code in twinweaver/common/config.py

def set_delta_time_unit(self, unit: str, unit_sing=None):
    """
    Set the time unit for delta time representation in text conversion. Possible to set either
    "days" (and "day(s)") or "weeks" (and "week(s)"). Optionally, a singular form can be provided
    for use in specific prompts. If not provided, the plural form will be used.
    """
    assert unit in ("days", "weeks", "day(s)", "week(s)"), "unit must be either 'days' or 'weeks'"
    assert unit_sing in (None, "day", "week"), "unit_sing must be either None, 'day' or 'week'"
    self.delta_time_unit = unit
    if unit_sing is None:
        unit_sing = unit

    self.event_day_text = self._event_day_text_template.format(unit=unit)
    self.forecasting_fval_prompt_start = self._forecasting_fval_prompt_start_template.format(unit=unit_sing)
    self.forecasting_prompt_var_time = self._forecasting_prompt_var_time_template.format(unit=unit)
    self.forecasting_tte_prompt_mid = self._forecasting_tte_prompt_mid_template.format(unit=unit)
    self.qa_prompt_start = self._qa_prompt_start_template.format(unit=unit_sing)