
Config

twinweaver.common.config

Classes

Config

Centralized configuration repository for data processing, prompt generation, and constants.

This class consolidates various configuration settings essential for the data processing pipeline. It defines standardized column names, specific values for event categories (like 'line of therapy', 'death'), data source identifiers, file paths, table names, and text templates (prompts) used for different language model tasks such as text conversion, forecasting (value prediction, time-to-event), quality assurance (QA) via binning, and setting up multi-task prompts. Default values are provided but can be overridden to adapt to specific datasets, model requirements, or experimental setups.
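As a rough illustration of the override pattern described above, the sketch below uses a small stand-in class. `ConfigSketch` is hypothetical and is not the real `twinweaver.common.config.Config`; it only copies a handful of the documented defaults. The helper `standardize_row`, the raw column names (`SUBJ`, `VISIT_DT`), and the preamble string are likewise invented for the example.

```python
class ConfigSketch:
    """Hypothetical stand-in mirroring a few documented Config defaults."""

    def __init__(self):
        self.date_col = "date"
        self.patient_id_col = "patientid"
        self.decimal_precision = 2
        self.event_category_preamble_mapping_override = None


def standardize_row(raw_row, column_map):
    """Rename the keys of one raw record to the configured column names."""
    return {column_map.get(key, key): value for key, value in raw_row.items()}


config = ConfigSketch()

# Override defaults on the instance to match a specific dataset.
config.patient_id_col = "subject_id"
config.decimal_precision = 1
# Documented structure: {<event_category>: <preamble_string>}
config.event_category_preamble_mapping_override = {"lab": "Lab results: "}

# Hypothetical raw column names mapped onto the configured names.
column_map = {"SUBJ": config.patient_id_col, "VISIT_DT": config.date_col}
row = standardize_row(
    {"SUBJ": "p001", "VISIT_DT": "2020-01-15", "value": 7.26}, column_map
)

# Round a numeric value with the configured precision.
rounded = round(row["value"], config.decimal_precision)
```

With the real class, the same idea would presumably amount to creating a `Config()` instance and assigning new values to its attributes before passing it on to the rest of the pipeline.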

Attributes:

Name Type Description
date_cutoff str | None

If set, only data before this date (format: "YYYY-MM-DD") is used; anything after it is treated as censored. Default: None.

delta_time_unit str

Unit of time used to express intervals between patient visits in the generated text. Options are "days" or "weeks". Default: "weeks".

numeric_detect_min_fraction float

Fraction of values that must be numeric to classify a variable as numeric. Default: 0.99.

date_col str

Standardized column name for date information across datasets. Default: "date".

patient_id_col str

Standardized column name for unique patient identifiers. Default: "patientid".

event_category_col str

Standardized column name for the category of a recorded event (e.g., 'lab', 'diagnosis'). Default: "event_category".

event_name_col str

Standardized column name for the specific name of an event within its category (e.g., 'Glucose level', 'Type 2 Diabetes'). Default: "event_name".

event_descriptive_name_col str

Standardized column name for a more human-readable or descriptive name of the event. Default: "event_descriptive_name".

event_value_col str

Standardized column name for the value associated with an event (e.g., a lab result, 'present'). Default: "event_value".

source_col str

Standardized column name indicating the origin or type of the data record. Default: "source".

meta_data_col str

Standardized column name for storing additional metadata related to an event. Default: "meta_data".

constant_split_col str

Standardized column name for data split information (train/test/val) in the constant dataframe. Default: "data_split".

event_category_default_value str

Default value to assign to event_category_col if it is missing in the data. Default: "general".

event_meta_default_value Any

Default value to assign to meta_data_col if it is missing. Default: pd.NA.

source_col_default_value str

Default value to assign to source_col if it is missing. Default: "events".

lot_date_col str

Column name specifically used for dates related to line of therapy (LoT) events. Default: "lot_date".

lot_name_col str

Column name for the name or identifier of the line of therapy (e.g., "First Line"). Default: "lot".

event_value_lot_start str

Specific string value used in event_value_col to denote the start of a line of therapy. Default: "LoT Start".

skip_future_lot_filtering bool

Flag indicating whether to skip filtering out future line of therapy events. Default: False. Useful if LoTs that are actually the same accidentally overlap; use with caution.

lot_concatenate_descriptive_and_value bool

Flag indicating whether to concatenate the descriptive name and value for line of therapy events. Default: False.

lot_concatenate_string str

String used to concatenate the descriptive name and value for line of therapy events when lot_concatenate_descriptive_and_value is True. Default: " - ".

warning_for_splitters_patient_without_lots bool

Whether to warn if a patient has no LoT events in DataSplitterEvents. Default: True.

event_category_lot str

Specific string value used in event_category_col to identify 'line of therapy' events. Default: "lot".

event_category_death str

Specific string value used in event_category_col to identify 'death' events. Default: "death".

event_category_labs str

Specific string value used in event_category_col to identify 'lab result' events. Default: "lab".

event_category_forecast list[str]

List of event categories to be considered for forecasting tasks. Default: ["lab"].

source_genetic str

Specific string value used in source_col to identify data originating from genetic testing. Default: "genetic".

source_standard_events str

Source identifier for standard clinical events. Default: "events".

genetic_skip_text_value str

A specific event value (often for genetic data) that might be skipped during text generation to avoid redundancy if its presence is implied elsewhere. Default: "present".

genetic_tag_opening str

Opening tag used to demarcate genetic information within generated text. Default: "<genetic>".

genetic_tag_closing str

Closing tag used to demarcate genetic information within generated text. Default: "</genetic>".

event_table_name str

The base name (without extension) for the primary file or table containing event data. Default: "events".

train_split_name str

Identifier for the training dataset split (e.g., used in file naming or data loading). Default: "train".

validation_split_name str

Identifier for the validation dataset split. Default: "validation".

test_split_name str

Identifier for the test dataset split. Default: "test".

bins_split_name str

Identifier for a data split used for binning tasks, often related to QA. Default: "5_equal_sized_bins".

preamble_text str

Introductory text inserted at the beginning of the textual representation of a patient's record. Default: Explains structure and LOINC codes.

constant_text str

Text used to introduce the section containing static demographic data in the textual patient record. Default: "\n\nStarting with demographic data:\n".

genetic_empty_text str

Text to use when no genetic data is available for a patient. Default: "No genetic data available.".

first_day_text str

Text used to introduce the events that occurred on the patient's very first recorded visit day. Default: "\nOn the first visit, the patient experienced the following: \n".

event_day_preamble str

Text inserted before the description of events for visits subsequent to the first one. Default: "\n".

event_day_text str

Template text used to introduce events on subsequent visit days, indicating the time elapsed since the previous visit. Default: " self.delta_time_unit : later, the patient visited and experienced the following: \n".

post_event_text str

Text appended after listing all events for a specific visit day. Default: ".\n".

forecasting_fval_prompt_start str

Initial text for prompts instructing a language model to predict future numerical values of specified variables over time. Default: Instructs prediction per cumulative week.

forecasting_prompt_var_time str

Text segment used within forecasting prompts to specify the time frame (e.g., future weeks) for prediction. Default: " the future weeks ".

forecasting_prompt_summarized_start str

Initial text for prompts that include a summary of the last known values of variables being forecasted. Default: "\nThe last values of the variables in the input data are:\n".

forecasting_firstday_override str

Alternative introductory text for forecasting prompts, possibly used when only a subset of initial data is presented, hinting at omissions. Default: Mentions included events, potential omissions.

forecasting_prompt_summarized_genetic str

Text used to introduce a summary section listing the last observed genetic event statuses within a forecasting prompt. Default: "\n\n\n\nHere we repeat the last observed values of each genetic event in the input data:\n".

forecasting_prompt_summarized_lot str

Text used to introduce a summary section describing the most recent line of therapy within a forecasting prompt. Default: "\nThe most recent line of therapy:\n".

forecasting_tte_prompt_start str

Initial text for prompts instructing a language model to predict time-to-event (TTE) outcomes, specifically focusing on whether an event is censored. Default: Asks for censoring prediction.

forecasting_tte_prompt_mid str

Middle text segment for TTE prompts, specifying the prediction horizon (in weeks) and asking about event occurrence status. Default: Specifies weeks and asks about occurrence.

forecasting_tte_prompt_end str

Concluding text for TTE prompts, detailing the required output format for the prediction (censoring and occurrence). Default: Specifies format like "'Here is the prediction: the event (<name>) was [not] censored and [did not occur]/[occurred].'".

target_prompt_start str

Template string used to begin constructing the target (ground truth) output string for TTE tasks, includes placeholder for event name. Default: "\nHere is the prediction: the event ({event_name}) was ".

target_prompt_censor_true str

Text segment used in the TTE target output to indicate that the event was censored within the observation period. Default: "censored.".

target_prompt_censor_false str

Text segment used in the TTE target output to indicate that the event was not censored. Default: "not censored ".

target_prompt_before_occur str

Conjunction used in the TTE target output between the censoring status and the occurrence status. Default: "and ".

target_prompt_occur str

Text segment used in the TTE target output to indicate that the event did occur. Default: "occurred.".

target_prompt_not_occur str

Text segment used in the TTE target output to indicate that the event did not occur. Default: "did not occur.".

qa_prompt_start str

Initial text for prompts instructing a model to perform a Quality Assurance (QA) task, specifically predicting value bins for future variable values. Default: Asks for bin prediction per week.

qa_bins_start str

Text used within QA prompts to introduce the list of possible bins the model should choose from. Default: "\tThe possible bins are: ".

task_prompt_start str

Introductory text for multi-task prompts, explaining that multiple tasks follow and instructing the model on the required response format (e.g., prefixing each answer with 'Task X:'). Default: Explains multi-task format.

task_prompt_each_task str

Template string used to introduce each individual task within a multi-task prompt, includes placeholder for task number. Default: "Task {task_nr} is ".

task_prompt_end str

Concluding text for the overall multi-task prompt setup. Default: "" (empty string).

task_prompt_forecasting str

Identifier text appended to task_prompt_each_task to specify a forecasting sub-task. Default: "forecasting:".

task_prompt_forecasting_qa str

Identifier text appended to task_prompt_each_task to specify a forecasting QA (binning) sub-task. Default: "forecasting QA:".

task_prompt_events str

Identifier text appended to task_prompt_each_task to specify a time-to-event prediction sub-task. Default: "time to event prediction:".

task_prompt_custom str

Identifier text appended to task_prompt_each_task to specify a custom-defined sub-task. Default: " a custom task:".

task_target_start str

Template string used to begin the target (ground truth) output corresponding to a specific task number in a multi-task setting. Default: "Task {task_nr} is ".

task_target_end str

Concluding text for the target output of a specific task within a multi-task response. Default: "" (empty string).

decimal_precision int

Number of decimal places to use when rounding numerical values (e.g., lab results) during text conversion. Default: 2.

event_category_preamble_mapping_override dict | None

Optional dictionary to override the introductory text used before listing events of a specific category on a given day. Structure: {<event_category>: <preamble_string>}. Default: None.

event_category_and_name_replace_override dict | None

Optional nested dictionary to define specific replacements for event descriptions based on category and name. Allows replacing the entire event string and defining a value for reverse mapping. Structure: {<event_category>: {<event_name>: {"full_replacement_string": <str>, "reverse_string_value": <str>}}}. Default: None.

always_keep_first_visit bool

Flag indicating whether the events from the very first visit should always be included in the patient history, regardless of token budget constraints. Default: True.

seed int

Seed value for random number generators to ensure reproducibility in processes like data splitting or sampling. Default: 768921.

nr_tokens_budget_padding int

Number of tokens reserved as a buffer when calculating token budgets, ensuring outputs don't exceed limits. May need adjustment based on model/task. Default: 200.

tokenizer_to_use str

Identifier string for the tokenizer model to be used for counting tokens (e.g., for budget calculations). Should correspond to a model available in the environment (e.g., from Hugging Face). Default: 'microsoft/Phi-4-mini-instruct'.

constant_columns_to_use list[str]

List of column names from the constant (demographic) data source to be included in the processing and text conversion. Note: Age might be handled separately. Default: ["race", "gender", "ethnicity", "indication"].

constant_birthdate_column str | None

Column name in the constant table representing the patient's birth date or birth year. If provided, age calculation is performed relative to the first event date. Default: None.

constant_birthdate_column_format str

Format of the birthdate column, either "date" or "age". Default: "date".

data_splitter_events_variables_category_mapping dict

Mapping defining which event categories correspond to specific prediction types in DataSplitterEvents. Keys are event categories (e.g., 'death', 'progression'), values are descriptive names for the target variable.

data_splitter_events_backup_category_mapping dict

Fallback mapping for event categories in DataSplitterEvents. Used if the primary category variables are not found. Keys are the missing categories, values are the backup categories to use.
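The `target_prompt_*` defaults listed above suggest how a time-to-event target string might be pieced together. The sketch below copies those documented default strings into module-level constants and composes them in one plausible order. The function `build_tte_target` is hypothetical, and the assumption that a censored target ends right after "censored." is inferred from the trailing period in the default, not confirmed by the library.

```python
# Documented defaults of the target_prompt_* attributes.
TARGET_PROMPT_START = "\nHere is the prediction: the event ({event_name}) was "
TARGET_PROMPT_CENSOR_TRUE = "censored."
TARGET_PROMPT_CENSOR_FALSE = "not censored "
TARGET_PROMPT_BEFORE_OCCUR = "and "
TARGET_PROMPT_OCCUR = "occurred."
TARGET_PROMPT_NOT_OCCUR = "did not occur."


def build_tte_target(event_name, censored, occurred):
    """Compose a TTE target string (hypothetical helper, assumed ordering)."""
    text = TARGET_PROMPT_START.format(event_name=event_name)
    if censored:
        # Assumption: a censored event carries no occurrence clause.
        return text + TARGET_PROMPT_CENSOR_TRUE
    text += TARGET_PROMPT_CENSOR_FALSE + TARGET_PROMPT_BEFORE_OCCUR
    return text + (TARGET_PROMPT_OCCUR if occurred else TARGET_PROMPT_NOT_OCCUR)


censored_target = build_tte_target("death", censored=True, occurred=False)
occurred_target = build_tte_target("death", censored=False, occurred=True)
not_occurred_target = build_tte_target("death", censored=False, occurred=False)
```

Note the trailing spaces in "not censored " and "and ": the fragments are meant to be concatenated directly, so any override should preserve that spacing.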

Source code in twinweaver/common/config.py
class Config:
    """
    Centralized configuration repository for data processing, prompt generation, and constants.

    This class consolidates various configuration settings essential for the data processing
    pipeline. It defines standardized column names, specific values for event categories (like
    'line of therapy', 'death'), data source identifiers, file paths, table names, and text
    templates (prompts) used for different language model tasks such as text conversion,
    forecasting (value prediction, time-to-event), quality assurance (QA) via binning, and
    setting up multi-task prompts. Default values are provided but can be overridden to
    adapt to specific datasets, model requirements, or experimental setups.

    Attributes
    ----------
    date_cutoff : str | None
        If set, only data before this date (format: "YYYY-MM-DD") is used; anything after it is treated as censored.
        Default: None.
    delta_time_unit : str
        Unit of time used to express intervals between patient visits in the generated text. Options are "days" or
        "weeks". Default: "weeks".
    numeric_detect_min_fraction : float
        Fraction of values that must be numeric to classify a variable as numeric. Default: 0.99.
    date_col : str
        Standardized column name for date information across datasets. Default: "date".
    patient_id_col : str
        Standardized column name for unique patient identifiers. Default: "patientid".
    event_category_col : str
        Standardized column name for the category of a recorded event (e.g., 'lab', 'diagnosis'). Default:
        "event_category".
    event_name_col : str
        Standardized column name for the specific name of an event within its category (e.g., 'Glucose level',
        'Type 2 Diabetes'). Default: "event_name".
    event_descriptive_name_col : str
        Standardized column name for a more human-readable or descriptive name of the event. Default:
        "event_descriptive_name".
    event_value_col : str
        Standardized column name for the value associated with an event (e.g., a lab result, 'present').
        Default: "event_value".
    source_col : str
        Standardized column name indicating the origin or type of the data record. Default: "source".
    meta_data_col : str
        Standardized column name for storing additional metadata related to an event. Default: "meta_data".
    constant_split_col : str
        Standardized column name for data split information (train/test/val) in the constant dataframe.
        Default: "data_split".
    event_category_default_value : str
        Default value to assign to `event_category_col` if it is missing in the data. Default: "general".
    event_meta_default_value : Any
        Default value to assign to `meta_data_col` if it is missing. Default: pd.NA.
    source_col_default_value : str
        Default value to assign to `source_col` if it is missing. Default: "events".
    lot_date_col : str
        Column name specifically used for dates related to line of therapy (LoT) events. Default: "lot_date".
    lot_name_col : str
        Column name for the name or identifier of the line of therapy (e.g., "First Line"). Default: "lot".
    event_value_lot_start : str
        Specific string value used in `event_value_col` to denote the start of a line of therapy. Default: "LoT Start".
    skip_future_lot_filtering : bool
        Flag indicating whether to skip filtering out future line of therapy events. Default: False.
        Useful if LoTs that are actually the same accidentally overlap; use with caution.
    lot_concatenate_descriptive_and_value : bool
        Flag indicating whether to concatenate the descriptive name and value for line of therapy events.
        Default: False.
    lot_concatenate_string : str
        String used to concatenate the descriptive name and value for line of therapy events when
        `lot_concatenate_descriptive_and_value` is True. Default: " - ".
    warning_for_splitters_patient_without_lots : bool
        Whether to warn if a patient has no LoT events in DataSplitterEvents. Default: True.
    event_category_lot : str
        Specific string value used in `event_category_col` to identify 'line of therapy' events. Default: "lot".
    event_category_death : str
        Specific string value used in `event_category_col` to identify 'death' events. Default: "death".
    event_category_labs : str
        Specific string value used in `event_category_col` to identify 'lab result' events. Default: "lab".
    event_category_forecast : list[str]
        List of event categories to be considered for forecasting tasks. Default: ["lab"].
    source_genetic : str
        Specific string value used in `source_col` to identify data originating from genetic testing.
        Default: "genetic".
    source_standard_events : str
        Source identifier for standard clinical events. Default: "events".
    genetic_skip_text_value : str
        A specific event value (often for genetic data) that might be skipped during text generation to avoid
        redundancy if its presence is implied elsewhere. Default: "present".
    genetic_tag_opening : str
        Opening tag used to demarcate genetic information within generated text. Default: "<genetic>".
    genetic_tag_closing : str
        Closing tag used to demarcate genetic information within generated text. Default: "</genetic>".
    event_table_name : str
        The base name (without extension) for the primary file or table containing event data. Default: "events".
    train_split_name : str
        Identifier for the training dataset split (e.g., used in file naming or data loading). Default: "train".
    validation_split_name : str
        Identifier for the validation dataset split. Default: "validation".
    test_split_name : str
        Identifier for the test dataset split. Default: "test".
    bins_split_name : str
        Identifier for a data split used for binning tasks, often related to QA. Default: "5_equal_sized_bins".
    preamble_text : str
        Introductory text inserted at the beginning of the textual representation of a patient's record.
        Default: Explains structure and LOINC codes.
    constant_text : str
        Text used to introduce the section containing static demographic data in the textual patient record.
        Default: "\\n\\nStarting with demographic data:\\n".
    genetic_empty_text : str
        Text to use when no genetic data is available for a patient. Default: "No genetic data available.".
    first_day_text : str
        Text used to introduce the events that occurred on the patient's very first recorded visit day.
        Default: "\\nOn the first visit, the patient experienced the following: \\n".
    event_day_preamble : str
        Text inserted before the description of events for visits subsequent to the first one. Default: "\\n".
    event_day_text : str
        Template text used to introduce events on subsequent visit days, indicating the time elapsed since the previous
        visit. Default: " self.delta_time_unit : later, the patient visited and experienced the following: \\n".
    post_event_text : str
        Text appended after listing all events for a specific visit day. Default: ".\\n".
    forecasting_fval_prompt_start : str
        Initial text for prompts instructing a language model to predict future numerical values of specified
        variables over time. Default: Instructs prediction per cumulative week.
    forecasting_prompt_var_time : str
        Text segment used within forecasting prompts to specify the time frame (e.g., future weeks) for prediction.
        Default: " the future weeks ".
    forecasting_prompt_summarized_start : str
        Initial text for prompts that include a summary of the last known values of variables being forecasted.
        Default: "\\nThe last values of the variables in the input data are:\\n".
    forecasting_firstday_override : str
        Alternative introductory text for forecasting prompts, possibly used when only a subset of initial data is
        presented, hinting at omissions. Default: Mentions included events, potential omissions.
    forecasting_prompt_summarized_genetic : str
        Text used to introduce a summary section listing the last observed genetic event statuses within a forecasting
        prompt.
        Default: "\\n\\n\\n\\nHere we repeat the last observed values of each genetic event in the input data:\\n".
    forecasting_prompt_summarized_lot : str
        Text used to introduce a summary section describing the most recent line of therapy within a forecasting
        prompt. Default: "\\nThe most recent line of therapy:\\n".
    forecasting_tte_prompt_start : str
        Initial text for prompts instructing a language model to predict time-to-event (TTE) outcomes, specifically
        focusing on whether an event is censored. Default: Asks for censoring prediction.
    forecasting_tte_prompt_mid : str
        Middle text segment for TTE prompts, specifying the prediction horizon (in weeks) and asking about event
        occurrence status. Default: Specifies weeks and asks about occurrence.
    forecasting_tte_prompt_end : str
        Concluding text for TTE prompts, detailing the required output format for the prediction (censoring and
        occurrence). Default: Specifies format like "'Here is the prediction: the event (<name>) was [not] censored
        and [did not occur]/[occurred].'".
    target_prompt_start : str
        Template string used to begin constructing the target (ground truth) output string for TTE tasks, includes
        placeholder for event name. Default: "\\nHere is the prediction: the event ({event_name}) was ".
    target_prompt_censor_true : str
        Text segment used in the TTE target output to indicate that the event *was* censored within the observation
        period. Default: "censored.".
    target_prompt_censor_false : str
        Text segment used in the TTE target output to indicate that the event *was not* censored. Default:
        "not censored ".
    target_prompt_before_occur : str
        Conjunction used in the TTE target output between the censoring status and the occurrence status.
        Default: "and ".
    target_prompt_occur : str
        Text segment used in the TTE target output to indicate that the event *did* occur. Default: "occurred.".
    target_prompt_not_occur : str
        Text segment used in the TTE target output to indicate that the event *did not* occur.
        Default: "did not occur.".
    qa_prompt_start : str
        Initial text for prompts instructing a model to perform a Quality Assurance (QA) task, specifically predicting
        value bins for future variable values. Default: Asks for bin prediction per week.
    qa_bins_start : str
        Text used within QA prompts to introduce the list of possible bins the model should choose from.
        Default: "\\tThe possible bins are: ".
    task_prompt_start : str
        Introductory text for multi-task prompts, explaining that multiple tasks follow and instructing the model on the
        required response format (e.g., prefixing each answer with 'Task X:'). Default: Explains multi-task format.
    task_prompt_each_task : str
        Template string used to introduce each individual task within a multi-task prompt, includes placeholder for
        task number. Default: "Task {task_nr} is ".
    task_prompt_end : str
        Concluding text for the overall multi-task prompt setup. Default: "" (empty string).
    task_prompt_forecasting : str
        Identifier text appended to `task_prompt_each_task` to specify a forecasting sub-task. Default: "forecasting:".
    task_prompt_forecasting_qa : str
        Identifier text appended to `task_prompt_each_task` to specify a forecasting QA (binning) sub-task.
        Default: "forecasting QA:".
    task_prompt_events : str
        Identifier text appended to `task_prompt_each_task` to specify a time-to-event prediction sub-task.
        Default: "time to event prediction:".
    task_prompt_custom : str
        Identifier text appended to `task_prompt_each_task` to specify a custom-defined sub-task.
        Default: " a custom task:".
    task_target_start : str
        Template string used to begin the target (ground truth) output corresponding to a specific task number in a
        multi-task setting. Default: "Task {task_nr} is ".
    task_target_end : str
        Concluding text for the target output of a specific task within a multi-task response.
        Default: "" (empty string).
    decimal_precision : int
        Number of decimal places to use when rounding numerical values (e.g., lab results) during text conversion.
        Default: 2.
    event_category_preamble_mapping_override : dict | None
        Optional dictionary to override the introductory text used before listing events of a specific category on a
        given day. Structure: `{<event_category>: <preamble_string>}`. Default: None.
    event_category_and_name_replace_override : dict | None
        Optional nested dictionary to define specific replacements for event descriptions based on category and name.
        Allows replacing the entire event string and defining a value for reverse mapping.
        Structure:
        `{<event_category>: {<event_name>: {"full_replacement_string": <str>, "reverse_string_value": <str>}}}`.
        Default: None.
    always_keep_first_visit : bool
        Flag indicating whether the events from the very first visit should always be included in the patient history,
        regardless of token budget constraints. Default: True.
    seed : int
        Seed value for random number generators to ensure reproducibility in processes like data splitting or sampling.
        Default: 768921.
    nr_tokens_budget_padding : int
        Number of tokens reserved as a buffer when calculating token budgets, ensuring outputs don't exceed limits.
        May need adjustment based on model/task. Default: 200.
    tokenizer_to_use : str
        Identifier string for the tokenizer model to be used for counting tokens (e.g., for budget calculations).
        Should correspond to a model available in the environment (e.g., from Hugging Face).
        Default: 'microsoft/Phi-4-mini-instruct'.
    constant_columns_to_use : list[str]
        List of column names from the constant (demographic) data source to be included in the processing and text
        conversion. *Note: Age might be handled separately.* Default: ["race", "gender", "ethnicity", "indication"].
    constant_birthdate_column : str | None
        Column name in the constant table representing the patient's birth date or birth year.
        If provided, age calculation is performed relative to the first event date. Default: None.
    constant_birthdate_column_format : str
        Format of the birthdate column, either "date" or "age". Default: "date".
    data_splitter_events_variables_category_mapping : dict
        Mapping defining which event categories correspond to specific prediction types in DataSplitterEvents.
        Keys are event categories (e.g., 'death', 'progression'), values are descriptive names for the target variable.
    data_splitter_events_backup_category_mapping : dict
        Fallback mapping for event categories in DataSplitterEvents. Used if the primary category variables are not
        found. Keys are the missing categories, values are the backup categories to use.
    """

    def __init__(self):
        # --- Import data parameters ---
        self.date_cutoff = None  # If set, only use data before this date (format: "YYYY-MM-DD"), censored after
        self.delta_time_unit: str = (
            "weeks"  # Either "days" or "weeks" - if you change this, you need to call set_delta_time_unit
        )
        self.numeric_detect_min_fraction: float = (
            0.99  # Fraction of numeric values required to consider an event as numeric
        )

        # --- Core Column Names ---
        self.date_col: str = "date"
        self.patient_id_col: str = "patientid"
        self.event_category_col: str = "event_category"
        self.event_name_col: str = "event_name"
        self.event_descriptive_name_col: str = "event_descriptive_name"
        self.event_value_col: str = "event_value"
        self.source_col: str = "source"
        self.meta_data_col: str = "meta_data"
        self.constant_split_col: str = "data_split"

        # --- Specific Category/Type Column Names ---
        self.event_category_default_value = "general"  # Default value for event category if not present
        self.event_meta_default_value = pd.NA  # Default value for event meta data if not present
        self.source_col_default_value: str = "events"  # Default value for source column if not present
        self.lot_date_col: str = "lot_date"
        self.lot_name_col: str = "lot"
        self.event_value_lot_start: str = "LoT Start"
        self.skip_future_lot_filtering: bool = False  # Whether to skip filtering future LoT events, by default False.
        self.lot_concatenate_descriptive_and_value: bool = (
            False  # If True, concatenate descriptive name and value for LoT events, by default False (only event_value).
        )
        self.lot_concatenate_string: str = (
            " - "  # String used to concatenate descriptive name and value for LoT events, by default " - ".
        )

        # Warnings and logs
        self.warning_for_splitters_patient_without_lots: bool = (
            True  # Whether to warn if a patient has no LoT events in DataSplitterEvents
        )

        # --- Specific Event Categories / Values / Sources ---
        self.event_category_lot: str = "lot"
        self.event_category_death: str = "death"
        self.event_category_labs: str = "lab"
        self.event_category_forecast: list = ["lab"]  # List of event categories to be used for forecasting

        self.source_genetic: str = "genetic"
        self.source_standard_events: str = "events"
        self.genetic_skip_text_value: str = "present"
        self.genetic_tag_opening: str = "<genetic>"
        self.genetic_tag_closing: str = "</genetic>"

        # --- Data Paths, Tables, and Splits ---
        self.event_table_name: str = "events"
        self.train_split_name: str = "train"
        self.validation_split_name: str = "validation"
        self.test_split_name: str = "test"
        self.bins_split_name: str = "5_equal_sized_bins"

        # --- Text Conversion Prompts ---
        self.preamble_text: str = (
            "The following is a patient, starting with the demographic data, "
            "following visit by visit everything that the patient experienced. "
            "All lab codes refer to LOINC codes."
        )
        self.constant_text: str = "\n\nStarting with demographic data:\n"
        self.first_day_text: str = "\nOn the first visit, the patient experienced the following: \n"
        self.event_day_preamble: str = "\n"
        self._event_day_text_template: str = " {unit} later, the patient visited and experienced the following: \n"
        self.event_day_text: str = self._event_day_text_template.format(unit=self.delta_time_unit)
        self.post_event_text: str = ".\n"
        self.genetic_empty_text: str = "No genetic data available."

        # --- Forecasting Prompts (General & Summarization) ---
        self._forecasting_fval_prompt_start_template: str = (
            "\nYour task is to predict the future values of the following variables "
            "for each cumulative {unit} starting from the last visit:\n"
        )
        self.forecasting_fval_prompt_start: str = self._forecasting_fval_prompt_start_template.format(
            unit=self.delta_time_unit
        )

        self._forecasting_prompt_var_time_template: str = " the future {unit} "
        self.forecasting_prompt_var_time: str = self._forecasting_prompt_var_time_template.format(
            unit=self.delta_time_unit
        )
        self.forecasting_prompt_summarized_start: str = "\nThe last values of the variables in the input data are:\n"
        self.forecasting_firstday_override: str = (
            "\nThe following events are included in the input data, though "
            "potentially there are more which were omitted. Starting with: \n"
        )
        self.forecasting_prompt_summarized_genetic: str = (
            "\n\n\n\nHere we repeat the last observed values of each genetic event in the input data:\n"
        )
        self.forecasting_prompt_summarized_lot: str = "\nThe most recent line of therapy:\n"

        # --- Forecasting Prompts (Time-to-Event Specific) ---
        self.forecasting_tte_prompt_start: str = "\nYour task is to predict whether the following event was censored "
        self._forecasting_tte_prompt_mid_template: str = (
            " {unit} from the last clinical visit and whether the event occurred or not: "
        )
        self.forecasting_tte_prompt_mid: str = self._forecasting_tte_prompt_mid_template.format(
            unit=self.delta_time_unit
        )
        self.forecasting_tte_prompt_end: str = (
            ".\nPlease provide your prediction in the following format: "
            "'Here is the prediction: the event (<name of event>) was [not] censored "
            "and [did not occur]/[occurred].'"
        )

        # --- Target Output Prompts (Time-to-Event) ---
        self.target_prompt_start: str = "\nHere is the prediction: the event ({event_name}) was "
        self.target_prompt_censor_true: str = "censored."
        self.target_prompt_censor_false: str = "not censored "
        self.target_prompt_before_occur: str = "and "
        self.target_prompt_occur: str = "occurred."
        self.target_prompt_not_occur: str = "did not occur."

        # --- QA Prompts (Binning) ---
        self._qa_prompt_start_template: str = (
            "\nYour task is to predict the appropriate bins for the future values of "
            "the following variables for each cumulative {unit} starting from the date of the last visit:"
        )
        self.qa_prompt_start = self._qa_prompt_start_template.format(unit=self.delta_time_unit)
        self.qa_bins_start: str = "\tThe possible bins are: "

        # --- Multi-Task Prompts ---
        self.task_prompt_start: str = (
            "\nYou will now have multiple tasks to complete. Please answer for each "
            "task in the same order as they are presented. Before every response state the "
            "task nr, e.g. 'Task 2:'.\n\n"
        )
        self.task_prompt_each_task: str = "Task {task_nr} is "
        self.task_prompt_end: str = ""
        self.task_prompt_forecasting: str = "forecasting:"
        self.task_prompt_forecasting_qa: str = "forecasting QA:"
        self.task_prompt_events: str = "time to event prediction:"
        self.task_prompt_custom: str = " a custom task:"
        self.task_target_start: str = "Task {task_nr} is "
        self.task_target_end: str = ""

        # --- Overrides -----
        self.decimal_precision = 2  # Number of decimal places to round values to, by default 2.
        self.event_category_preamble_mapping_override = None
        # Override for the event category preamble mapping (default is None).
        # Structure is {<event_category>: <preamble_string>}

        self.event_category_and_name_replace_override = None
        # dict, optional
        #    Override for the event category and name replace mapping (default is None).
        #    Structure is {<event_category>: {
        #        <event_name>: {
        #            "full_replacement_string": <full_replacement_string>,
        #            "reverse_string_value": <reverse_string_value>
        #            }
        #        }
        #    }

        self.always_keep_first_visit: bool = (
            True  # Whether to always keep the first visit in the patient history, by default True.
        )

        # Seeds
        self.seed = 768921  # I like both of these numbers

        # Token budgets
        self.nr_tokens_budget_padding: int = 200  # Might need to be set to 500 for pretrain

        # Tokenizers for counting
        self.tokenizer_to_use: str = "microsoft/Phi-4-mini-instruct"

        # --- Processing of constant ---
        self.constant_columns_to_use: list = [
            "race",
            "gender",
            "ethnicity",
            "indication",
        ]  # Which columns to use from the constant data
        self.constant_birthdate_column: str | None = None  # If set, use this column for age calculation
        self.constant_birthdate_column_format: str = "date"  # Either "date" or "age"

        # --- Data splitter events setup ---
        # Used to identify which variables correspond to which event categories for
        # different event types, by default set for the 4 used in GDT
        self.data_splitter_events_variables_category_mapping = {
            "death": "death",
            "progression": "next progression",
            "lot": "next line of therapy",
            "metastasis": "next metastasis",
        }

        # Fallback event categories, used when no variables are found for an event type,
        # e.g. progression -> death
        self.data_splitter_events_backup_category_mapping = {
            "progression": "death",
        }

    def set_delta_time_unit(self, unit: str, unit_sing=None):
        """
        Set the time unit used to express time deltas in the text conversion. Accepted values are
        "days", "weeks", "day(s)" and "week(s)". Optionally, a singular form ("day" or "week") can be
        provided for the prompts that use it. If not provided, the value of `unit` is used instead.
        """
        assert unit in ("days", "weeks", "day(s)", "week(s)"), (
            "unit must be one of 'days', 'weeks', 'day(s)' or 'week(s)'"
        )
        assert unit_sing in (None, "day", "week"), "unit_sing must be either None, 'day' or 'week'"
        self.delta_time_unit = unit
        if unit_sing is None:
            unit_sing = unit

        self.event_day_text = self._event_day_text_template.format(unit=unit)
        self.forecasting_fval_prompt_start = self._forecasting_fval_prompt_start_template.format(unit=unit_sing)
        self.forecasting_prompt_var_time = self._forecasting_prompt_var_time_template.format(unit=unit)
        self.forecasting_tte_prompt_mid = self._forecasting_tte_prompt_mid_template.format(unit=unit)
        self.qa_prompt_start = self._qa_prompt_start_template.format(unit=unit_sing)
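
The time-to-event target fragments defined in `__init__` (`target_prompt_start` through `target_prompt_not_occur`) look designed to be concatenated into a sentence matching the format requested in `forecasting_tte_prompt_end`. The following is a minimal sketch of that assembly. The helper `build_tte_target` is hypothetical, and the concatenation order is an assumption inferred from the trailing spaces and periods of the defaults, not a copy of the library's own code.

```python
# Default fragments copied from Config.__init__ above.
TARGET_PROMPT_START = "\nHere is the prediction: the event ({event_name}) was "
TARGET_PROMPT_CENSOR_TRUE = "censored."
TARGET_PROMPT_CENSOR_FALSE = "not censored "
TARGET_PROMPT_BEFORE_OCCUR = "and "
TARGET_PROMPT_OCCUR = "occurred."
TARGET_PROMPT_NOT_OCCUR = "did not occur."


def build_tte_target(event_name: str, censored: bool, occurred: bool) -> str:
    """Hypothetical helper: assemble a time-to-event target string.

    Assumption: a censored event ends the sentence right after "censored.",
    while an uncensored event continues with "and occurred." or
    "and did not occur.".
    """
    text = TARGET_PROMPT_START.format(event_name=event_name)
    if censored:
        return text + TARGET_PROMPT_CENSOR_TRUE
    outcome = TARGET_PROMPT_OCCUR if occurred else TARGET_PROMPT_NOT_OCCUR
    return text + TARGET_PROMPT_CENSOR_FALSE + TARGET_PROMPT_BEFORE_OCCUR + outcome


uncensored = build_tte_target("death", censored=False, occurred=True)
censored = build_tte_target("death", censored=True, occurred=False)
print(repr(uncensored))
print(repr(censored))
```

If any of these fragments is overridden on a `Config` instance, the trailing spaces and periods need to be kept consistent so that the assembled sentence still reads correctly.
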
Functions
set_delta_time_unit
set_delta_time_unit(unit, unit_sing=None)

Set the time unit used to express time deltas in the text conversion. Accepted values are "days", "weeks", "day(s)" and "week(s)". Optionally, a singular form ("day" or "week") can be provided for the prompts that use it. If not provided, the value of `unit` is used instead.
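
To illustrate why the setter should be called instead of assigning `delta_time_unit` directly, here is a small sketch. `MiniConfig` is a hypothetical stand-in that copies only two of the templates from `Config`, so that the snippet runs without the package installed. The real class refreshes three more prompts in the same way.

```python
class MiniConfig:
    """Hypothetical stand-in holding two of the delta-time templates from Config."""

    def __init__(self):
        self.delta_time_unit = "weeks"
        self._event_day_text_template = (
            " {unit} later, the patient visited and experienced the following: \n"
        )
        self._forecasting_fval_prompt_start_template = (
            "\nYour task is to predict the future values of the following variables "
            "for each cumulative {unit} starting from the last visit:\n"
        )
        # Prompts are rendered once here, using the default unit.
        self.event_day_text = self._event_day_text_template.format(unit=self.delta_time_unit)
        self.forecasting_fval_prompt_start = self._forecasting_fval_prompt_start_template.format(
            unit=self.delta_time_unit
        )

    def set_delta_time_unit(self, unit, unit_sing=None):
        # Same checks and fallback as in the source listing.
        assert unit in ("days", "weeks", "day(s)", "week(s)")
        assert unit_sing in (None, "day", "week")
        self.delta_time_unit = unit
        if unit_sing is None:
            unit_sing = unit
        # Re-render the prompts: plural form here, singular form below.
        self.event_day_text = self._event_day_text_template.format(unit=unit)
        self.forecasting_fval_prompt_start = self._forecasting_fval_prompt_start_template.format(
            unit=unit_sing
        )


config = MiniConfig()
config.set_delta_time_unit("days", unit_sing="day")

# Direct assignment changes the attribute but leaves the rendered prompts stale.
stale = MiniConfig()
stale.delta_time_unit = "days"
```

After the call, `config.event_day_text` uses "days" and `config.forecasting_fval_prompt_start` uses "day", whereas `stale.event_day_text` still mentions "weeks".
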

Source code in twinweaver/common/config.py
def set_delta_time_unit(self, unit: str, unit_sing=None):
    """
    Set the time unit used to express time deltas in the text conversion. Accepted values are
    "days", "weeks", "day(s)" and "week(s)". Optionally, a singular form ("day" or "week") can be
    provided for the prompts that use it. If not provided, the value of `unit` is used instead.
    """
    assert unit in ("days", "weeks", "day(s)", "week(s)"), (
        "unit must be one of 'days', 'weeks', 'day(s)' or 'week(s)'"
    )
    assert unit_sing in (None, "day", "week"), "unit_sing must be either None, 'day' or 'week'"
    self.delta_time_unit = unit
    if unit_sing is None:
        unit_sing = unit

    self.event_day_text = self._event_day_text_template.format(unit=unit)
    self.forecasting_fval_prompt_start = self._forecasting_fval_prompt_start_template.format(unit=unit_sing)
    self.forecasting_prompt_var_time = self._forecasting_prompt_var_time_template.format(unit=unit)
    self.forecasting_tte_prompt_mid = self._forecasting_tte_prompt_mid_template.format(unit=unit)
    self.qa_prompt_start = self._qa_prompt_start_template.format(unit=unit_sing)
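
The method validates its arguments with `assert`, and Python skips `assert` statements when run with the `-O` flag. As a possible alternative, not part of the library, the same checks can be expressed with explicit exceptions. The function `validate_delta_time_unit` and the two constants below are hypothetical names used only for this sketch.

```python
# Accepted values, taken from the assertions in set_delta_time_unit.
ALLOWED_UNITS = ("days", "weeks", "day(s)", "week(s)")
ALLOWED_SINGULAR = (None, "day", "week")


def validate_delta_time_unit(unit, unit_sing=None):
    """Hypothetical helper: check the arguments and resolve the singular form.

    Returns a (unit, unit_sing) tuple in which unit_sing falls back to unit,
    mirroring the fallback in set_delta_time_unit.
    """
    if unit not in ALLOWED_UNITS:
        raise ValueError(f"unit must be one of {ALLOWED_UNITS}, got {unit!r}")
    if unit_sing not in ALLOWED_SINGULAR:
        raise ValueError(f"unit_sing must be one of {ALLOWED_SINGULAR}, got {unit_sing!r}")
    return unit, (unit if unit_sing is None else unit_sing)


resolved_default = validate_delta_time_unit("weeks")
resolved_singular = validate_delta_time_unit("days", unit_sing="day")
```
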