Config
twinweaver.common.config
Classes
Config
Centralized configuration repository for data processing, prompt generation, and constants.
This class consolidates various configuration settings essential for the data processing pipeline. It defines standardized column names, specific values for event categories (like 'line of therapy', 'death'), data source identifiers, file paths, table names, and text templates (prompts) used for different language model tasks such as text conversion, forecasting (value prediction, time-to-event), quality assurance (QA) via binning, and setting up multi-task prompts. Default values are provided but can be overridden to adapt to specific datasets, model requirements, or experimental setups.
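The override pattern described above can be sketched as follows. This is only an illustration under stated assumptions: it does not import twinweaver, `ConfigSketch` is a hypothetical stand-in that copies a handful of attribute names and defaults from the attribute table, and whether the real `Config` is customized through attribute assignment or constructor arguments is not confirmed here.

```python
from dataclasses import dataclass


@dataclass
class ConfigSketch:
    """Hypothetical stand-in mirroring a few documented defaults; not the real Config."""

    date_col: str = "date"
    patient_id_col: str = "patientid"
    delta_time_unit: str = "weeks"
    decimal_precision: int = 2
    seed: int = 768921


# Start from the documented defaults, then override what a dataset needs.
config = ConfigSketch()
config.patient_id_col = "subject_id"  # hypothetical column name in a custom dataset
config.decimal_precision = 1

print(config.date_col, config.patient_id_col, config.decimal_precision)
```

Keeping every column name and prompt fragment on one object means downstream components can read them from a single place instead of hard-coding strings.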
Attributes:
| Name | Type | Description |
|---|---|---|
| date_cutoff | str \| None | If set, only data before this date (format: "YYYY-MM-DD") is used; later data is censored. Default: None. |
| delta_time_unit | str | Unit of time used to express intervals between patient visits in the generated text. Options are "days" or "weeks". Default: "weeks". |
| numeric_detect_min_fraction | float | Fraction of values that must be numeric to classify a variable as numeric. Default: 0.99. |
| date_col | str | Standardized column name for date information across datasets. Default: "date". |
| patient_id_col | str | Standardized column name for unique patient identifiers. Default: "patientid". |
| event_category_col | str | Standardized column name for the category of a recorded event (e.g., 'lab', 'diagnosis'). Default: "event_category". |
| event_name_col | str | Standardized column name for the specific name of an event within its category (e.g., 'Glucose level', 'Type 2 Diabetes'). Default: "event_name". |
| event_descriptive_name_col | str | Standardized column name for a more human-readable or descriptive name of the event. Default: "event_descriptive_name". |
| event_value_col | str | Standardized column name for the value associated with an event (e.g., a lab result, 'present'). Default: "event_value". |
| source_col | str | Standardized column name indicating the origin or type of the data record. Default: "source". |
| meta_data_col | str | Standardized column name for storing additional metadata related to an event. Default: "meta_data". |
| constant_split_col | str | Standardized column name for data split information (train/test/val) in the constant dataframe. Default: "data_split". |
| event_category_default_value | str | Default value to assign to |
| event_meta_default_value | Any | Default value to assign to |
| source_col_default_value | str | Default value to assign to |
| lot_date_col | str | Column name specifically used for dates related to line of therapy (LoT) events. Default: "lot_date". |
| lot_name_col | str | Column name for the name or identifier of the line of therapy (e.g., "First Line"). Default: "lot". |
| event_value_lot_start | str | Specific string value used in |
| skip_future_lot_filtering | bool | Flag indicating whether to skip filtering out future line of therapy events. Useful when LoTs that are actually the same have accidentally been recorded as overlapping; use with caution. Default: False. |
| lot_concatenate_descriptive_and_value | bool | Flag indicating whether to concatenate the descriptive name and value for line of therapy events. Default: False. |
| lot_concatenate_string | str | String used to concatenate the descriptive name and value for line of therapy events when |
| warning_for_splitters_patient_without_lots | bool | Whether to warn if a patient has no LoT events in DataSplitterEvents. Default: True. |
| event_category_lot | str | Specific string value used in |
| event_category_death | str | Specific string value used in |
| event_category_labs | str | Specific string value used in |
| event_category_forecast | list[str] | List of event categories to be considered for forecasting tasks. Default: ["lab"]. |
| source_genetic | str | Specific string value used in |
| source_standard_events | str | Source identifier for standard clinical events. Default: "events". |
| genetic_skip_text_value | str | A specific event value (often for genetic data) that might be skipped during text generation to avoid redundancy if its presence is implied elsewhere. Default: "present". |
| genetic_tag_opening | str | Opening tag used to demarcate genetic information within generated text. Default: " |
| genetic_tag_closing | str | Closing tag used to demarcate genetic information within generated text. Default: "". |
| event_table_name | str | The base name (without extension) for the primary file or table containing event data. Default: "events". |
| train_split_name | str | Identifier for the training dataset split (e.g., used in file naming or data loading). Default: "train". |
| validation_split_name | str | Identifier for the validation dataset split. Default: "validation". |
| test_split_name | str | Identifier for the test dataset split. Default: "test". |
| bins_split_name | str | Identifier for a data split used for binning tasks, often related to QA. Default: "5_equal_sized_bins". |
| preamble_text | str | Introductory text inserted at the beginning of the textual representation of a patient's record. Default: Explains structure and LOINC codes. |
| constant_text | str | Text used to introduce the section containing static demographic data in the textual patient record. Default: "\n\nStarting with demographic data:\n". |
| genetic_empty_text | str | Text to use when no genetic data is available for a patient. Default: "No genetic data available.". |
| first_day_text | str | Text used to introduce the events that occurred on the patient's very first recorded visit day. Default: "\nOn the first visit, the patient experienced the following: \n". |
| event_day_preamble | str | Text inserted before the description of events for visits subsequent to the first one. Default: "\n". |
| event_day_text | str | Template text used to introduce events on subsequent visit days, indicating the time elapsed since the previous visit. Default: " self.delta_time_unit : later, the patient visited and experienced the following: \n". |
| post_event_text | str | Text appended after listing all events for a specific visit day. Default: ".\n". |
| forecasting_fval_prompt_start | str | Initial text for prompts instructing a language model to predict future numerical values of specified variables over time. Default: Instructs prediction per cumulative week. |
| forecasting_prompt_var_time | str | Text segment used within forecasting prompts to specify the time frame (e.g., future weeks) for prediction. Default: " the future weeks ". |
| forecasting_prompt_summarized_start | str | Initial text for prompts that include a summary of the last known values of variables being forecasted. Default: "\nThe last values of the variables in the input data are:\n". |
| forecasting_firstday_override | str | Alternative introductory text for forecasting prompts, possibly used when only a subset of initial data is presented, hinting at omissions. Default: Mentions included events, potential omissions. |
| forecasting_prompt_summarized_genetic | str | Text used to introduce a summary section listing the last observed genetic event statuses within a forecasting prompt. Default: "\n\n\n\nHere we repeat the last observed values of each genetic event in the input data:\n". |
| forecasting_prompt_summarized_lot | str | Text used to introduce a summary section describing the most recent line of therapy within a forecasting prompt. Default: "\nThe most recent line of therapy:\n". |
| forecasting_tte_prompt_start | str | Initial text for prompts instructing a language model to predict time-to-event (TTE) outcomes, specifically focusing on whether an event is censored. Default: Asks for censoring prediction. |
| forecasting_tte_prompt_mid | str | Middle text segment for TTE prompts, specifying the prediction horizon (in weeks) and asking about event occurrence status. Default: Specifies weeks and asks about occurrence. |
| forecasting_tte_prompt_end | str | Concluding text for TTE prompts, detailing the required output format for the prediction (censoring and occurrence). Default: Specifies format like "'Here is the prediction: the event ( |
| target_prompt_start | str | Template string used to begin constructing the target (ground truth) output string for TTE tasks; includes a placeholder for the event name. Default: "\nHere is the prediction: the event ({event_name}) was ". |
| target_prompt_censor_true | str | Text segment used in the TTE target output to indicate that the event was censored within the observation period. Default: "censored.". |
| target_prompt_censor_false | str | Text segment used in the TTE target output to indicate that the event was not censored. Default: "not censored ". |
| target_prompt_before_occur | str | Conjunction used in the TTE target output between the censoring status and the occurrence status. Default: "and ". |
| target_prompt_occur | str | Text segment used in the TTE target output to indicate that the event did occur. Default: "occurred.". |
| target_prompt_not_occur | str | Text segment used in the TTE target output to indicate that the event did not occur. Default: "did not occur.". |
| qa_prompt_start | str | Initial text for prompts instructing a model to perform a Quality Assurance (QA) task, specifically predicting value bins for future variable values. Default: Asks for bin prediction per week. |
| qa_bins_start | str | Text used within QA prompts to introduce the list of possible bins the model should choose from. Default: "\tThe possible bins are: ". |
| task_prompt_start | str | Introductory text for multi-task prompts, explaining that multiple tasks follow and instructing the model on the required response format (e.g., prefixing each answer with 'Task X:'). Default: Explains multi-task format. |
| task_prompt_each_task | str | Template string used to introduce each individual task within a multi-task prompt; includes a placeholder for the task number. Default: "Task {task_nr} is ". |
| task_prompt_end | str | Concluding text for the overall multi-task prompt setup. Default: "" (empty string). |
| task_prompt_forecasting | str | Identifier text appended to |
| task_prompt_forecasting_qa | str | Identifier text appended to |
| task_prompt_events | str | Identifier text appended to |
| task_prompt_custom | str | Identifier text appended to |
| task_target_start | str | Template string used to begin the target (ground truth) output corresponding to a specific task number in a multi-task setting. Default: "Task {task_nr} is ". |
| task_target_end | str | Concluding text for the target output of a specific task within a multi-task response. Default: "" (empty string). |
| decimal_precision | int | Number of decimal places to use when rounding numerical values (e.g., lab results) during text conversion. Default: 2. |
| event_category_preamble_mapping_override | dict \| None | Optional dictionary to override the introductory text used before listing events of a specific category on a given day. Structure: |
| event_category_and_name_replace_override | dict \| None | Optional nested dictionary to define specific replacements for event descriptions based on category and name. Allows replacing the entire event string and defining a value for reverse mapping. Structure: |
| always_keep_first_visit | bool | Flag indicating whether the events from the very first visit should always be included in the patient history, regardless of token budget constraints. Default: True. |
| seed | int | Seed value for random number generators to ensure reproducibility in processes like data splitting or sampling. Default: 768921. |
| nr_tokens_budget_padding | int | Number of tokens reserved as a buffer when calculating token budgets, ensuring outputs don't exceed limits. May need adjustment based on model/task. Default: 200. |
| tokenizer_to_use | str | Identifier string for the tokenizer model to be used for counting tokens (e.g., for budget calculations). Should correspond to a model available in the environment (e.g., from Hugging Face). Default: 'microsoft/Phi-4-mini-instruct'. |
| constant_columns_to_use | list[str] | List of column names from the constant (demographic) data source to be included in the processing and text conversion. Note: Age might be handled separately. Default: ["race", "gender", "ethnicity", "indication"]. |
| constant_birthdate_column | str \| None | Column name in the constant table representing the patient's birth date or birth year. If provided, age calculation is performed relative to the first event date. Default: None. |
| constant_birthdate_column_format | str | Format of the birthdate column, either "date" or "age". Default: "date". |
| data_splitter_events_variables_category_mapping | dict | Mapping defining which event categories correspond to specific prediction types in DataSplitterEvents. Keys are event categories (e.g., 'death', 'progression'), values are descriptive names for the target variable. |
| data_splitter_events_backup_category_mapping | dict | Fallback mapping for event categories in DataSplitterEvents. Used if the primary category variables are not found. Keys are the missing categories, values are the backup categories to use. |
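
The `target_prompt_*` attributes listed among the attributes above describe fragments of a time-to-event target string. As a rough illustration of how such fragments could fit together, the sketch below joins the documented default strings in the order their descriptions suggest. `build_tte_target` is a hypothetical helper, not a twinweaver function, and ending the string after "censored." is an assumption about how the library treats censored events.

```python
# Documented default strings for the TTE target fragments.
TARGET_PROMPT_START = "\nHere is the prediction: the event ({event_name}) was "
TARGET_PROMPT_CENSOR_TRUE = "censored."
TARGET_PROMPT_CENSOR_FALSE = "not censored "
TARGET_PROMPT_BEFORE_OCCUR = "and "
TARGET_PROMPT_OCCUR = "occurred."
TARGET_PROMPT_NOT_OCCUR = "did not occur."


def build_tte_target(event_name: str, censored: bool, occurred: bool) -> str:
    """Hypothetical helper: join the fragments in the order their descriptions suggest."""
    text = TARGET_PROMPT_START.format(event_name=event_name)
    if censored:
        # Assumption: a censored event carries no occurrence status, so stop here.
        return text + TARGET_PROMPT_CENSOR_TRUE
    outcome = TARGET_PROMPT_OCCUR if occurred else TARGET_PROMPT_NOT_OCCUR
    return text + TARGET_PROMPT_CENSOR_FALSE + TARGET_PROMPT_BEFORE_OCCUR + outcome


print(build_tte_target("death", censored=False, occurred=True))
```

The trailing spaces in "not censored " and "and " are what let the fragments concatenate into a readable sentence without extra joining logic.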
Source code in twinweaver/common/config.py
Functions
set_delta_time_unit
Set the time unit for delta time representation in text conversion. The unit can be set to either "days" (and "day(s)") or "weeks" (and "week(s)"). Optionally, a singular form can be provided for use in specific prompts; if it is not provided, the plural form is used.
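
The fallback described above (use the plural form when no singular form is given) can be sketched as below. The actual signature of `set_delta_time_unit` is not shown on this page, so `resolve_delta_time_unit` and its parameter names `unit` and `unit_singular` are hypothetical, and rejecting values other than "days" and "weeks" is an assumption based on the options documented for `delta_time_unit`.

```python
from typing import Optional, Tuple

ALLOWED_UNITS = ("days", "weeks")  # the two options documented for delta_time_unit


def resolve_delta_time_unit(unit: str, unit_singular: Optional[str] = None) -> Tuple[str, str]:
    """Hypothetical sketch of the documented fallback; parameter names are assumed."""
    if unit not in ALLOWED_UNITS:
        raise ValueError(f"unit must be one of {ALLOWED_UNITS}, got {unit!r}")
    # Fall back to the plural form when no singular form is supplied.
    singular = unit_singular if unit_singular is not None else unit
    return unit, singular


print(resolve_delta_time_unit("days"))
print(resolve_delta_time_unit("weeks", "week"))
```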