Config
twinweaver.common.config
Classes
Config
Centralized configuration repository for data processing, prompt generation, and constants.
This class consolidates various configuration settings essential for the data processing pipeline. It defines standardized column names, specific values for event categories (like 'line of therapy', 'death'), data source identifiers, file paths, table names, and text templates (prompts) used for different language model tasks such as text conversion, forecasting (value prediction, time-to-event), quality assurance (QA) via binning, and setting up multi-task prompts. Default values are provided but can be overridden to adapt to specific datasets, model requirements, or experimental setups.
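The override pattern described above can be sketched as follows. This is a minimal illustration only: `SketchConfig` is a hypothetical stand-in that mirrors a few of the documented attributes and their defaults, not the library's `Config` class, and setting attributes after construction is an assumption about how overrides are applied.

```python
from dataclasses import dataclass, field


@dataclass
class SketchConfig:
    """Hypothetical stand-in mirroring a few documented Config attributes."""

    date_col: str = "date"
    patient_id_col: str = "patientid"
    delta_time_unit: str = "weeks"
    decimal_precision: int = 2
    # Mutable defaults need a factory so instances do not share one list
    event_category_forecast: list = field(default_factory=lambda: ["lab"])


# Start from the defaults, then adapt them to a specific dataset
config = SketchConfig()
config.patient_id_col = "subject_id"  # dataset uses a different ID column
config.decimal_precision = 3          # keep one more decimal for lab values
```

Attributes that are not overridden keep their documented defaults, so only dataset-specific differences need to be stated.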
Attributes:
| Name | Type | Description |
|---|---|---|
| `date_cutoff` | `str \| None` | If set, only data before this date (format: "YYYY-MM-DD") is used; later data is censored. Default: None. |
| `delta_time_unit` | `str` | Unit of time used to express intervals between patient visits in the generated text. Options are "days" or "weeks". Default: "weeks". |
| `numeric_detect_min_fraction` | `float` | Fraction of values that must be numeric to classify a variable as numeric. Default: 0.99. |
| `date_col` | `str` | Standardized column name for date information across datasets. Default: "date". |
| `patient_id_col` | `str` | Standardized column name for unique patient identifiers. Default: "patientid". |
| `event_category_col` | `str` | Standardized column name for the category of a recorded event (e.g., 'lab', 'diagnosis'). Default: "event_category". |
| `event_name_col` | `str` | Standardized column name for the specific name of an event within its category (e.g., 'Glucose level', 'Type 2 Diabetes'). Default: "event_name". |
| `event_descriptive_name_col` | `str` | Standardized column name for a more human-readable or descriptive name of the event. Default: "event_descriptive_name". |
| `event_value_col` | `str` | Standardized column name for the value associated with an event (e.g., a lab result, 'present'). Default: "event_value". |
| `source_col` | `str` | Standardized column name indicating the origin or type of the data record. Default: "source". |
| `meta_data_col` | `str` | Standardized column name for storing additional metadata related to an event. Default: "meta_data". |
| `constant_split_col` | `str` | Standardized column name for data split information (train/test/val) in the constant dataframe. Default: "data_split". |
| `event_category_default_value` | `str` | Default value to assign to |
| `event_meta_default_value` | `Any` | Default value to assign to |
| `source_col_default_value` | `str` | Default value to assign to |
| `lot_date_col` | `str` | Column name specifically used for dates related to line of therapy (LoT) events. Default: "lot_date". |
| `lot_name_col` | `str` | Column name for the name or identifier of the line of therapy (e.g., "First Line"). Default: "lot". |
| `event_value_lot_start` | `str` | Specific string value used in |
| `skip_future_lot_filtering` | `bool` | Flag indicating whether to skip filtering out future line of therapy events. Default: False. Useful if LoTs that are actually the same accidentally overlap; use with caution. |
| `lot_concatenate_descriptive_and_value` | `bool` | Flag indicating whether to concatenate the descriptive name and value for line of therapy events. Default: False. |
| `lot_concatenate_string` | `str` | String used to concatenate the descriptive name and value for line of therapy events when |
| `warning_for_splitters_patient_without_lots` | `bool` | Whether to warn if a patient has no LoT events in DataSplitterEvents. Default: True. |
| `event_category_lot` | `str` | Specific string value used in |
| `event_category_death` | `str` | Specific string value used in |
| `event_category_labs` | `str` | Specific string value used in |
| `event_category_forecast` | `list[str]` | List of event categories to be considered for forecasting tasks. Default: ["lab"]. |
| `source_genetic` | `str` | Specific string value used in |
| `source_standard_events` | `str` | Source identifier for standard clinical events. Default: "events". |
| `genetic_skip_text_value` | `str` | A specific event value (often for genetic data) that might be skipped during text generation to avoid redundancy if its presence is implied elsewhere. Default: "present". |
| `genetic_tag_opening` | `str` | Opening tag used to demarcate genetic information within generated text. Default: " |
| `genetic_tag_closing` | `str` | Closing tag used to demarcate genetic information within generated text. Default: "". |
| `event_table_name` | `str` | The base name (without extension) for the primary file or table containing event data. Default: "events". |
| `train_split_name` | `str` | Identifier for the training dataset split (e.g., used in file naming or data loading). Default: "train". |
| `validation_split_name` | `str` | Identifier for the validation dataset split. Default: "validation". |
| `test_split_name` | `str` | Identifier for the test dataset split. Default: "test". |
| `bins_split_name` | `str` | Identifier for a data split used for binning tasks, often related to QA. Default: "5_equal_sized_bins". |
| `preamble_text` | `str` | Introductory text inserted at the beginning of the textual representation of a patient's record. Default: Explains structure and LOINC codes. |
| `constant_text` | `str` | Text used to introduce the section containing static demographic data in the textual patient record. Default: "\n\nStarting with demographic data:\n". |
| `genetic_empty_text` | `str` | Text to use when no genetic data is available for a patient. Default: "No genetic data available.". |
| `first_day_text` | `str` | Text used to introduce the events that occurred on the patient's very first recorded visit day. Default: "\nOn the first visit, the patient experienced the following: \n". |
| `event_day_preamble` | `str` | Text inserted before the description of events for visits subsequent to the first one. Default: "\n". |
| `event_day_text` | `str` | Template text used to introduce events on subsequent visit days, indicating the time elapsed since the previous visit. Default: " self.delta_time_unit : later, the patient visited and experienced the following: \n". |
| `post_event_text` | `str` | Text appended after listing all events for a specific visit day. Default: ".\n". |
| `forecasting_fval_prompt_start` | `str` | Initial text for prompts instructing a language model to predict future numerical values of specified variables over time. Default: Instructs prediction per cumulative week. |
| `forecasting_prompt_var_time` | `str` | Text segment used within forecasting prompts to specify the time frame (e.g., future weeks) for prediction. Default: " the future weeks ". |
| `forecasting_prompt_summarized_start` | `str` | Initial text for prompts that include a summary of the last known values of variables being forecasted. Default: "\nThe last values of the variables in the input data are:\n". |
| `forecasting_firstday_override` | `str` | Alternative introductory text for forecasting prompts, possibly used when only a subset of initial data is presented, hinting at omissions. Default: Mentions included events, potential omissions. |
| `forecasting_prompt_summarized_genetic` | `str` | Text used to introduce a summary section listing the last observed genetic event statuses within a forecasting prompt. Default: "\n\n\n\nHere we repeat the last observed values of each genetic event in the input data:\n". |
| `forecasting_prompt_summarized_lot` | `str` | Text used to introduce a summary section describing the most recent line of therapy within a forecasting prompt. Default: "\nThe most recent line of therapy:\n". |
| `forecasting_tte_prompt_start` | `str` | Initial text for prompts instructing a language model to predict time-to-event (TTE) outcomes, specifically focusing on whether an event is censored. Default: Asks for censoring prediction. |
| `forecasting_tte_prompt_mid` | `str` | Middle text segment for TTE prompts, specifying the prediction horizon (in weeks) and asking about event occurrence status. Default: Specifies weeks and asks about occurrence. |
| `forecasting_tte_prompt_end` | `str` | Concluding text for TTE prompts, detailing the required output format for the prediction (censoring and occurrence). Default: Specifies format like "'Here is the prediction: the event ( |
| `target_prompt_start` | `str` | Template string used to begin constructing the target (ground truth) output string for TTE tasks; includes a placeholder for the event name. Default: "\nHere is the prediction: the event ({event_name}) was ". |
| `target_prompt_censor_true` | `str` | Text segment used in the TTE target output to indicate that the event was censored within the observation period. Default: "censored.". |
| `target_prompt_censor_false` | `str` | Text segment used in the TTE target output to indicate that the event was not censored. Default: "not censored ". |
| `target_prompt_before_occur` | `str` | Conjunction used in the TTE target output between the censoring status and the occurrence status. Default: "and ". |
| `target_prompt_occur` | `str` | Text segment used in the TTE target output to indicate that the event did occur. Default: "occurred.". |
| `target_prompt_not_occur` | `str` | Text segment used in the TTE target output to indicate that the event did not occur. Default: "did not occur.". |
| `qa_prompt_start` | `str` | Initial text for prompts instructing a model to perform a Quality Assurance (QA) task, specifically predicting value bins for future variable values. Default: Asks for bin prediction per week. |
| `qa_bins_start` | `str` | Text used within QA prompts to introduce the list of possible bins the model should choose from. Default: "\tThe possible bins are: ". |
| `task_prompt_start` | `str` | Introductory text for multi-task prompts, explaining that multiple tasks follow and instructing the model on the required response format (e.g., prefixing each answer with 'Task X:'). Default: Explains multi-task format. |
| `task_prompt_each_task` | `str` | Template string used to introduce each individual task within a multi-task prompt; includes a placeholder for the task number. Default: "Task {task_nr} is ". |
| `task_prompt_end` | `str` | Concluding text for the overall multi-task prompt setup. Default: "" (empty string). |
| `task_prompt_forecasting` | `str` | Identifier text appended to |
| `task_prompt_forecasting_qa` | `str` | Identifier text appended to |
| `task_prompt_events` | `str` | Identifier text appended to |
| `task_prompt_custom` | `str` | Identifier text appended to |
| `task_target_start` | `str` | Template string used to begin the target (ground truth) output corresponding to a specific task number in a multi-task setting. Default: "Task {task_nr} is ". |
| `task_target_end` | `str` | Concluding text for the target output of a specific task within a multi-task response. Default: "" (empty string). |
| `decimal_precision` | `int` | Number of decimal places to use when rounding numerical values (e.g., lab results) during text conversion. Default: 2. |
| `event_category_preamble_mapping_override` | `dict \| None` | Optional dictionary to override the introductory text used before listing events of a specific category on a given day. Structure: |
| `event_category_and_name_replace_override` | `dict \| None` | Optional nested dictionary to define specific replacements for event descriptions based on category and name. Allows replacing the entire event string and defining a value for reverse mapping. Structure: |
| `always_keep_first_visit` | `bool` | Flag indicating whether the events from the very first visit should always be included in the patient history, regardless of token budget constraints. Default: True. |
| `seed` | `int` | Seed value for random number generators to ensure reproducibility in processes like data splitting or sampling. Default: 768921. |
| `nr_tokens_budget_padding` | `int` | Number of tokens reserved as a buffer when calculating token budgets, ensuring outputs don't exceed limits. May need adjustment based on model/task. Default: 200. |
| `tokenizer_to_use` | `str` | Identifier string for the tokenizer model to be used for counting tokens (e.g., for budget calculations). Should correspond to a model available in the environment (e.g., from Hugging Face). Default: 'microsoft/Phi-4-mini-instruct'. |
| `constant_columns_to_use` | `list[str]` | List of column names from the constant (demographic) data source to be included in the processing and text conversion. Note: Age might be handled separately. Default: ["race", "gender", "ethnicity", "indication"]. |
| `constant_birthdate_column` | `str \| None` | Column name in the constant table representing the patient's birth date or birth year. If provided, age calculation is performed relative to the first event date. Default: None. |
| `constant_birthdate_column_format` | `str` | Format of the birthdate column, either "date" or "age". Default: "date". |
| `data_splitter_events_variables_category_mapping` | `dict` | Mapping defining which event categories correspond to specific prediction types in DataSplitterEvents. Keys are event categories (e.g., 'death', 'progression'); values are descriptive names for the target variable. |
| `data_splitter_events_backup_category_mapping` | `dict` | Fallback mapping for event categories in DataSplitterEvents. Used if the primary category variables are not found. Keys are the missing categories; values are the backup categories to use. |
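The `target_prompt_*` attributes read as fragments of a single sentence. The following is a minimal sketch of how they might be combined into a TTE target string, using the documented default values. The helper `build_tte_target` and the assembly order are assumptions inferred from the attribute descriptions, not the library's implementation; in particular, it assumes a censored event ends the sentence without an occurrence clause.

```python
# Documented defaults of the target_prompt_* attributes
TARGET_PROMPT_START = "\nHere is the prediction: the event ({event_name}) was "
TARGET_PROMPT_CENSOR_TRUE = "censored."
TARGET_PROMPT_CENSOR_FALSE = "not censored "
TARGET_PROMPT_BEFORE_OCCUR = "and "
TARGET_PROMPT_OCCUR = "occurred."
TARGET_PROMPT_NOT_OCCUR = "did not occur."


def build_tte_target(event_name: str, censored: bool, occurred: bool) -> str:
    """Hypothetical helper: join the template fragments into one target string."""
    text = TARGET_PROMPT_START.format(event_name=event_name)
    if censored:
        # Assumption: a censored event carries no occurrence statement
        return text + TARGET_PROMPT_CENSOR_TRUE
    text += TARGET_PROMPT_CENSOR_FALSE + TARGET_PROMPT_BEFORE_OCCUR
    text += TARGET_PROMPT_OCCUR if occurred else TARGET_PROMPT_NOT_OCCUR
    return text


print(build_tte_target("death", censored=False, occurred=True))
```

Because the trailing spaces and periods live in the fragments themselves, overriding one fragment means keeping its spacing and punctuation consistent with its neighbors.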
Source code in twinweaver/common/config.py
Functions
set_delta_time_unit
Set the time unit for delta time representation in text conversion. The unit can be set to either "days" (and "day(s)") or "weeks" (and "week(s)"). Optionally, a singular form can be provided for use in specific prompts; if it is not provided, the plural form is used.
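The fallback behavior described here can be sketched as below. This is an illustration under stated assumptions, not the library's method: the class `DeltaTimeSketch`, the parameter name `unit_singular`, the attribute `delta_time_unit_singular`, and the validation against "days" and "weeks" are all hypothetical.

```python
from typing import Optional


class DeltaTimeSketch:
    """Hypothetical stand-in; the real method's signature may differ."""

    def __init__(self) -> None:
        # Documented default unit is "weeks"
        self.delta_time_unit = "weeks"
        self.delta_time_unit_singular = "weeks"

    def set_delta_time_unit(self, unit: str, unit_singular: Optional[str] = None) -> None:
        # Assumption: only the two documented options are accepted
        if unit not in ("days", "weeks"):
            raise ValueError(f"Unsupported time unit: {unit!r}")
        self.delta_time_unit = unit
        # Fall back to the plural form when no singular form is given
        self.delta_time_unit_singular = unit if unit_singular is None else unit_singular


sketch = DeltaTimeSketch()
sketch.set_delta_time_unit("days", "day(s)")
```

Calling `sketch.set_delta_time_unit("days")` without the second argument would leave both attributes set to "days".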