
Converter Forecasting

twinweaver.instruction.converter_forecasting

Classes

ConverterForecasting

Bases: ConverterBase

Handles the conversion between structured patient data splits and text-based formats suitable for forecasting language models.

This class focuses specifically on generating prompts that ask a model to predict future values of specific variables (e.g., lab results) at given future time points (days or weeks relative to a split date). It also formats the actual future data (the target) into a corresponding text string and handles the reverse conversion from model-generated text back to a structured DataFrame. Warning: names that contain "weeks" (for example, future_weeks_to_forecast) hold values in days when the converter is configured to use days.
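As a rough illustration of the round trip described above, the sketch below renders a target string in the format shown in the `_generate_target_string` docstring example and parses it back. The helpers `render_target` and `parse_target`, and the constants `TIME_DIVISOR` and `UNIT`, are hypothetical stand-ins and not part of twinweaver; the library delegates its own parsing to the base class's `_extract_event_data`. Setting the divisor to 1 instead of 7 is the situation the warning refers to: offsets would then count days.

```python
import re
from datetime import date, timedelta

TIME_DIVISOR = 7  # assumption: 7 -> offsets in weeks, 1 -> offsets in days
UNIT = "weeks"    # assumption: label used inside the "[N weeks later]" markers


def render_target(events, split_date):
    """Render (date, name, value) tuples as "[N weeks later]" blocks."""
    lines = []
    last_offset = None
    for day, name, value in sorted(events):
        offset = (day - split_date).days / TIME_DIVISOR
        if offset != last_offset:
            # Start a new time block whenever the offset changes
            lines.append(f"[{offset:g} {UNIT} later]")
            last_offset = offset
        lines.append(f"\t{name}: {value}")
    return "\n".join(lines)


def parse_target(text, split_date):
    """Parse the rendered text back into (date, name, value) tuples."""
    events = []
    current_day = None
    header_pattern = r"\[([0-9.]+) " + UNIT + r" later\]"
    for line in text.splitlines():
        stripped = line.strip()
        header = re.fullmatch(header_pattern, stripped)
        if header:
            # Convert the relative offset back into an absolute date
            days = float(header.group(1)) * TIME_DIVISOR
            current_day = split_date + timedelta(days=days)
        elif stripped and current_day is not None:
            name, value = stripped.split(": ", 1)
            events.append((current_day, name, float(value)))
    return events


split_date = date(2024, 1, 1)
events = [
    (date(2024, 1, 15), "Lab X", 10.5),
    (date(2024, 1, 29), "Lab Y", 2.1),
]
text = render_target(events, split_date)
round_trip = parse_target(text, split_date)
```

Here `text` reproduces the docstring example and `round_trip` equals the original `events` list. A real forecast may contain malformed lines, which this sketch does not attempt to handle.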

Attributes:

constant_description (DataFrame): DataFrame holding descriptions for constant variables.
nr_tokens_budget_total (int): The target token budget for the combined input and output strings.
forecasting_prompt_start (str): The initial text segment of the forecasting prompt.
forecasting_prompt_var_time (str): The text segment used in the prompt to link a variable to its prediction times.
forecasting_prompt_summarized_start (str): Starting text for summarized prompts (not used in core methods here).
forecasting_prompt_summarized_genetic (str): Text for genetic info in summarized prompts (not used here).
forecasting_prompt_summarized_lot (str): Text for LoT info in summarized prompts (not used here).
tokenizer (AutoTokenizer): Tokenizer instance for calculating token counts.
nr_tokens_budget_padding (int): Padding added to token budget calculations.
always_keep_first_visit (bool): Flag indicating if the first visit's data should always be kept during token budget trimming.
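To show how `forecasting_prompt_start` and `forecasting_prompt_var_time` fit together, here is a minimal sketch of the prompt assembly, modeled on the `_generate_prompt` source in the listing that follows. The template strings passed in are invented placeholders (the real ones come from `Config`), and this `round_and_strip` is a hypothetical stand-in for the library helper of the same name, whose exact behavior is not shown on this page.

```python
def round_and_strip(value, precision):
    """Hypothetical stand-in: round, then drop trailing zeros ("2.00" -> "2")."""
    text = f"{round(float(value), precision):.{precision}f}"
    return text.rstrip("0").rstrip(".") if "." in text else text


def build_prompt(times_per_variable, name_mapping, prompt_start, var_time_sep, precision=2):
    """Assemble one prompt line per variable, sorted by variable name."""
    prompt = prompt_start
    for variable in sorted(times_per_variable):
        times = [round_and_strip(t, precision) for t in times_per_variable[variable]]
        # Fall back to the internal name when no descriptive name is known
        descriptive_name = name_mapping.get(variable, variable)
        prompt += "\n\t" + descriptive_name + var_time_sep + ", ".join(times)
    return prompt


prompt = build_prompt(
    {"lab_b": [2.0, 4.5], "lab_a": [1.0]},
    {"lab_a": "Lab A"},
    prompt_start="Forecast the following variables:",  # placeholder text
    var_time_sep=" at weeks: ",                         # placeholder text
)
```

Sorting by variable name keeps the prompt order stable across runs, which matches the intent stated in the source comments.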

Source code in twinweaver/instruction/converter_forecasting.py
class ConverterForecasting(ConverterBase):
    """
    Handles the conversion between structured patient data splits and
    text-based formats suitable for forecasting language models.

    This class focuses specifically on generating prompts that ask a model
    to predict future values of specific variables (e.g., lab results) at
    given future time points (days/weeks relative to a split date). It also
    formats the actual future data (target) into a corresponding text string
    and handles the reverse conversion from model-generated text back to a
    structured DataFrame.
    Warning: names that contain "weeks" (e.g., `future_weeks_to_forecast`) hold
    values in days when the converter is configured to use days.

    Attributes
    ----------
    constant_description : pd.DataFrame
        DataFrame holding descriptions for constant variables.
    nr_tokens_budget_total : int
        The target token budget for the combined input and output strings.
    forecasting_prompt_start : str
        The initial text segment of the forecasting prompt.
    forecasting_prompt_var_time : str
        The text segment used in the prompt to link a variable to its prediction
        times.
    forecasting_prompt_summarized_start : str
        Starting text for summarized prompts (not used in core methods here).
    forecasting_prompt_summarized_genetic : str
        Text for genetic info in summarized prompts (not used here).
    forecasting_prompt_summarized_lot : str
        Text for LoT info in summarized prompts (not used here).
    tokenizer : AutoTokenizer
        Tokenizer instance for calculating token counts.
    nr_tokens_budget_padding : int
        Padding added to token budget calculations.
    always_keep_first_visit : bool
        Flag indicating if the first visit's data should always be kept during token
        budget trimming.
    """

    def __init__(
        self,
        constant_description: pd.DataFrame,
        nr_tokens_budget_total: int,
        config: Config,
        dm: DataManager,
    ) -> None:
        """
        Initializes the ConverterForecasting instance.

        Sets up the converter with necessary configurations, including descriptions
        for constant patient features, the total token budget for generated text,
        and various prompt templates defined in the Config object.

        Parameters
        ----------
        constant_description : pd.DataFrame
            DataFrame containing descriptions for constant
            patient attributes (e.g., 'Sex: Male'). Used potentially by base
            class methods for input string generation.
        nr_tokens_budget_total : int
            The target maximum number of tokens for the
            combined input (history) and output (forecast) text.
        config : Config
            A Config object containing shared configuration settings like
            prompt templates, column names, tokenizer details, etc.
        dm : DataManager
            DataManager object containing the variable types and data frames.
        """
        # Initialize the base class with the shared config
        super().__init__(config)

        # Store specific attributes for this class
        self.constant_description = constant_description
        self.nr_tokens_budget_total = nr_tokens_budget_total
        self.dm = dm

        # Set forecasting prompt templates from the config
        self.forecasting_prompt_start = self.config.forecasting_fval_prompt_start
        self.forecasting_prompt_var_time = self.config.forecasting_prompt_var_time
        self.forecasting_prompt_summarized_start = self.config.forecasting_prompt_summarized_start
        self.forecasting_prompt_summarized_genetic = self.config.forecasting_prompt_summarized_genetic
        self.forecasting_prompt_summarized_lot = self.config.forecasting_prompt_summarized_lot

        # Read token budget padding and first-visit retention from the config
        self.nr_tokens_budget_padding = self.config.nr_tokens_budget_padding
        self.always_keep_first_visit = self.config.always_keep_first_visit

    def _generate_target_string(self, patient_split: DataSplitterForecastingOption) -> tuple[str, dict]:
        """
        Generates the target forecast string and associated metadata.

        Takes a patient data split containing future events, processes them,
        and formats them into a structured text string representing the forecast
        target. It also computes metadata detailing which variables are forecasted
        at which future days/weeks relative to the split date, and includes the raw
        and processed target data.

        Parameters
        ----------
        patient_split : DataSplitterForecastingOption
            A DataSplitterForecastingOption containing the data for a single split.

        Returns
        -------
        target_str : str
            The formatted string representing the target forecast
            (e.g., "[2 weeks later]\\n\\tLab X: 10.5\\n[4 weeks later]\\n\\tLab Y: 2.1").
        target_meta : dict
            A dictionary containing metadata about the target,
            including: 'dates_to_forecast', 'future_weeks_to_forecast',
            'dates_per_variable', 'future_prediction_time_per_variable',
            'variable_name_mapping', 'target_data_raw', 'target_data_processed',
            'lot_date', 'last_observed_values' (DataFrame of last known values
            for forecasted variables from the input history).
        """

        #: preprocess:
        target_data = patient_split.target_events_after_split.copy()
        target_cleaned = self._preprocess_events(target_data.copy())

        #: get delta between split and first target
        target_first_day = target_cleaned[self.config.date_col].min()
        split_date = patient_split.split_date_included_in_input
        delta_days = (target_first_day - split_date).days / self._time_divisor

        #: convert to string using default approach
        target_str = self._get_event_string(
            target_cleaned,
            events_delta_0=delta_days,
            use_accumulative_dates=False,
            add_first_day_preamble=False,
        )

        #: get meta, i.e. which dates to forecast and the relative future days/weeks grouped by variable
        dates_per_variable = target_cleaned.groupby(self.config.event_name_col)[self.config.date_col].unique()
        future_prediction_time_per_variable = {}
        variable_name_mapping = {}
        for variable, dates in dates_per_variable.items():
            # Get days/weeks to predict (relative to split date).
            # Normalize to a DatetimeIndex so the subtraction yields a
            # TimedeltaIndex, which exposes `.days` directly.
            dates = pd.DatetimeIndex(dates)
            future_prediction_time_per_variable[variable] = (dates - split_date).days / self._time_divisor

            # Get descriptive name
            curr_var = target_cleaned[target_cleaned[self.config.event_name_col] == variable]
            # Handle case where variable might not be found or has no descriptive name
            if not curr_var.empty and self.config.event_descriptive_name_col in curr_var.columns:
                descriptive_name = curr_var[self.config.event_descriptive_name_col].iloc[0]
            else:
                descriptive_name = variable  # Fallback to variable name itself
            variable_name_mapping[variable] = descriptive_name

        dates_to_forecast = target_cleaned[self.config.date_col].unique()
        # Ensure dates_to_forecast is pd.DatetimeIndex for subtraction
        if not isinstance(dates_to_forecast, pd.DatetimeIndex):
            dates_to_forecast = pd.to_datetime(dates_to_forecast)
        future_weeks_to_forecast = (dates_to_forecast - split_date).days / self._time_divisor

        #: add last observed values of each variable from input history
        input_history = patient_split.events_until_split
        last_observed_values_dict = {}
        for variable in future_prediction_time_per_variable.keys():
            # Check if variable is in input history by event_name
            matches = input_history[input_history[self.config.event_name_col] == variable]
            if not matches.empty:
                last_observed = matches.sort_values(by=self.config.date_col).iloc[-1]
            else:
                # Fallback to looking at descriptive name if event_name not found
                descriptive_name_to_match = variable_name_mapping.get(variable, variable)
                matches = input_history[
                    input_history[self.config.event_descriptive_name_col] == descriptive_name_to_match
                ]
                if not matches.empty:
                    last_observed = matches.sort_values(by=self.config.date_col).iloc[-1]
                else:
                    last_observed = None  # Variable not found in history

            #: add to dict if found
            if last_observed is not None:
                last_observed_values_dict[variable] = last_observed

        # Convert dict of Series to DataFrame
        if last_observed_values_dict:
            last_observed_values = pd.DataFrame(last_observed_values_dict).T
        else:
            # Create an empty DataFrame with expected columns if nothing was observed
            expected_cols = [
                self.config.date_col,
                self.config.event_category_col,
                self.config.event_name_col,
                self.config.event_descriptive_name_col,
                self.config.event_value_col,
                self.config.source_col,
            ]
            last_observed_values = pd.DataFrame(columns=expected_cols)

        #: setup return metadata dictionary
        target_meta = {
            "dates_to_forecast": dates_to_forecast,
            "future_weeks_to_forecast": future_weeks_to_forecast,
            "dates_per_variable": dates_per_variable,
            "future_prediction_time_per_variable": future_prediction_time_per_variable,
            "variable_name_mapping": variable_name_mapping,
            "target_data_raw": target_data,
            "target_data_processed": target_cleaned,
            "lot_date": patient_split.lot_date,
            "last_observed_values": last_observed_values,
        }

        return target_str, target_meta

    def _generate_prompt(self, target_meta: dict) -> str:
        """
        Generates the forecasting prompt based on target metadata.

        Constructs the text prompt that instructs the language model what to
        forecast. It uses the metadata about which variables need prediction
        at which future days/weeks.

        Parameters
        ----------
        target_meta : dict
            A dictionary containing metadata about the target forecast,
            as generated by `_generate_target_string`. Key required elements are:
            - "future_prediction_time_per_variable": Dict mapping variable names to lists
              of future days/weeks (relative to split date) they should be predicted at.
            - "variable_name_mapping": Dict mapping internal variable names to
              their descriptive names for use in the prompt.

        Returns
        -------
        str
            The fully constructed prompt string asking the model to perform the
            specified forecast.
        """

        # Start with an empty prompt
        ret_prompt = ""

        #: make sure to round days/weeks from future_prediction_time_per_variable to predict to specified precision
        future_prediction_time_per_variable = target_meta["future_prediction_time_per_variable"]
        future_prediction_time_per_variable_rounded = {}
        for k, v in future_prediction_time_per_variable.items():
            # Check if v is iterable, handle potential scalar values if necessary
            if hasattr(v, "__iter__"):
                future_prediction_time_per_variable_rounded[k] = [
                    round_and_strip(v2, self.decimal_precision) for v2 in v
                ]
            else:
                # Scalar value: wrap it in a one-element list
                future_prediction_time_per_variable_rounded[k] = [round_and_strip(v, self.decimal_precision)]

        #: add base prompt start text
        ret_prompt += self.forecasting_prompt_start

        #: sort alphabetically by variable name for consistent prompt order
        future_prediction_time_per_variable_sorted = dict(
            sorted(future_prediction_time_per_variable_rounded.items(), key=lambda item: item[0])
        )

        #: create prompt for which variables to predict and when
        for variable, weeks in future_prediction_time_per_variable_sorted.items():
            #: need event_descriptive_name from mapping
            variable_desc_name = target_meta["variable_name_mapping"].get(
                variable, variable
            )  # Fallback to variable name

            # Create prompt line for this variable
            ret_prompt += "\n" + "\t" + variable_desc_name
            ret_prompt += self.forecasting_prompt_var_time
            # Format the list of days/weeks nicely
            ret_prompt += ", ".join(map(str, weeks))  # Use map for clean conversion to string

        #: return the fully constructed prompt
        return ret_prompt

    def forward_conversion(self, patient_split: DataSplitterForecastingOption) -> tuple[str, str, dict]:
        """
        Performs the primary conversion from a data split to prompt and target strings.

        This method orchestrates the generation of the target string (what the model
        should predict) and the corresponding prompt string (the instruction asking
        for the prediction) based on a patient data split.

        Note: This method generates the *prompt* and *target* strings. The generation
        of the *input* string (patient history) is typically handled separately, often
        by the base class's `_get_input_string` method, using the "events_until_split"
        and "constant_data" from the `patient_split`.

        Parameters
        ----------
        patient_split : DataSplitterForecastingOption
            DataSplitterForecastingOption containing the data for a single split.

        Returns
        -------
        prompt_str : str
            The generated prompt instructing the model on the
            forecasting task.
        target_str : str
            The formatted string representing the target forecast values.
        target_meta : dict
            Metadata associated with the generated target string
            (output from `_generate_target_string`).
        """

        #: generate target string and associated metadata
        target_str, target_meta = self._generate_target_string(patient_split)

        #: generate prompt string based on the target metadata
        prompt_str = self._generate_prompt(target_meta)

        # Note: The input string (patient history text) generation is handled by the base class's
        # `_get_input_string` method, which would typically be called elsewhere
        # in the pipeline that uses this forward_conversion result along with the input events.
        # This method focuses on generating the *prompt* and *target* based on the split.

        return prompt_str, target_str, target_meta

    def forward_conversion_inference(
        self, patient_split: DataSplitterForecastingOption, future_weeks_per_variable: dict
    ) -> tuple[str, dict]:
        """
        Generates only the prompt string for inference scenarios.

        Used when the goal is to get a prediction from the model, not for training.
        It takes the patient's historical data and a specific set of variables and
        future days/weeks to predict, then constructs the appropriate prompt string.
        It does not generate a target string, as the target is what the model
        is expected to produce.

        Parameters
        ----------
        patient_split : DataSplitterForecastingOption
            DataSplitterForecastingOption containing the data for a single split.
        future_weeks_per_variable : dict
            Dictionary explicitly defining the forecasting task.
            Format: {<variable_name>: [<future_week_1>, <future_week_2>, ...], ...}
            These weeks are relative to the `split_date_included_in_input`.

        Returns
        -------
        prompt_str : str
            The generated prompt for the inference task.
        target_pseudo_meta : dict
            A dictionary containing the metadata constructed
            specifically for generating this inference prompt (includes the provided
            `future_prediction_time_per_variable`, derived `dates_per_variable`, and looked-up
            `variable_name_mapping`).
        """

        #: generate target_pseudo meta data needed for prompt generation
        target_pseudo_meta = {}

        #: Use the provided future_weeks_per_variable directly
        target_pseudo_meta["future_prediction_time_per_variable"] = future_weeks_per_variable

        # Make dates per variable based on the provided days/weeks and split date
        dates_per_variable = {}
        split_date = patient_split.split_date_included_in_input
        for variable, weeks in future_weeks_per_variable.items():
            # Ensure weeks is iterable
            if not hasattr(weeks, "__iter__"):
                weeks = [weeks]
            dates_per_variable[variable] = [
                split_date + pd.Timedelta(days=float(w) * self._time_divisor) for w in weeks
            ]
        target_pseudo_meta["dates_per_variable"] = dates_per_variable

        #: make target_meta["variable_name_mapping"] by looking up in input events
        variable_name_mapping = {}
        input_events = patient_split.events_until_split
        for variable in future_weeks_per_variable.keys():
            # Get descriptive name from input history
            curr_var = input_events[input_events[self.config.event_name_col] == variable]
            if not curr_var.empty and self.config.event_descriptive_name_col in curr_var.columns:
                descriptive_name = curr_var[self.config.event_descriptive_name_col].iloc[0]
            else:
                # Fallback: the variable does not appear in the input history,
                # so use the variable name itself as the descriptive name.
                descriptive_name = variable
            variable_name_mapping[variable] = descriptive_name

        target_pseudo_meta["variable_name_mapping"] = variable_name_mapping

        #: generate prompt (including when to generate what) based on constructed metadata
        prompt_str = self._generate_prompt(target_pseudo_meta)

        return prompt_str, target_pseudo_meta

    def generate_target_manual(
        self,
        target_events_after_split: pd.DataFrame,
        split_date_included_in_input: datetime,
        events_until_split: pd.DataFrame,
        lot_date: datetime = None,
    ) -> tuple[str, dict]:
        """
        Manually generates the target string and metadata from explicit components.

        Provides an alternative way to generate the target string and metadata
        by directly passing the necessary DataFrames and the split date, rather
        than relying on a pre-packaged `patient_split` object. This might be
        useful in scenarios where the data components are sourced differently.

        Parameters
        ----------
        target_events_after_split : pd.DataFrame
            DataFrame containing the future events
            that constitute the target forecast.
        split_date_included_in_input : datetime
            The reference date marking the end of
            the input history and the start of the forecast period.
        events_until_split : pd.DataFrame
            DataFrame containing the historical events up to
            the `split_date_included_in_input`. Used to find the last observed
            values for context in the metadata.
        lot_date : datetime, optional
            The start date of the relevant Line of Therapy, if applicable.
            Defaults to None.

        Returns
        -------
        target_str : str
            The formatted string representing the target forecast.
        target_meta : dict
            Metadata associated with the generated target string
            (same structure as output from `_generate_target_string`).
        """
        # `_generate_target_string` reads its input via attribute access, so wrap
        # the components in a lightweight namespace rather than a plain dict.
        from types import SimpleNamespace

        patient_split = SimpleNamespace(
            target_events_after_split=target_events_after_split,
            split_date_included_in_input=split_date_included_in_input,
            events_until_split=events_until_split,
            lot_date=lot_date,  # Pass the provided lot_date
        )
        # Call the internal method to generate target string and metadata
        return self._generate_target_string(patient_split)

    def aggregate_multiple_responses(self, responses_dfs: list[pd.DataFrame]) -> tuple[pd.DataFrame, dict]:
        """
        Aggregates structured data from multiple model responses (e.g., trajectories).

        Takes a list of DataFrames, where each DataFrame represents the structured
        output converted from a single model response (e.g., using `reverse_conversion`).
        It concatenates these DataFrames and aggregates the event values per date
        and event identifier: numeric variables are averaged, and categorical
        variables are returned as a dictionary of value counts. This is useful for
        combining results from multiple stochastic samples from a model.

        Parameters
        ----------
        responses_dfs : list[pd.DataFrame]
            A list of DataFrames, each converted from a separate
            model forecast output string. Expected to have consistent columns
            as defined in the config (date, event name, value, etc.).

        Returns
        -------
        resulting_df : pd.DataFrame
            A DataFrame containing the aggregated results per date/event:
            the mean for numeric variables, a value-count dictionary for
            categorical variables, and NA when the variable type is unknown.
        meta : dict
            A dictionary containing metadata, currently includes
            "all_trajectory_data" which holds the original list of input DataFrames.
        """

        #: concat all response DataFrames together
        if not responses_dfs:  # Handle empty list case
            return pd.DataFrame(), {"all_trajectory_data": []}

        concatenated = pd.concat(responses_dfs, axis=0, ignore_index=True)

        #: get average of values per date/event identifier, keeping relevant columns
        # Define columns to group by
        grouping_cols = [
            self.config.date_col,
            self.config.event_name_col,
            self.config.event_descriptive_name_col,
            self.config.event_category_col,
            self.config.patient_id_col,
            self.config.source_col,
        ]

        # Ensure all grouping columns exist, handle potential missing ones gracefully
        valid_grouping_cols = [col for col in grouping_cols if col in concatenated.columns]
        if not valid_grouping_cols:
            # Cannot group if no valid columns are present
            # Consider logging a warning or raising an error
            return pd.DataFrame(), {"all_trajectory_data": responses_dfs.copy()}

        # APPLY DIFFERENT LOGIC FOR NUMERIC VS CATEGORICAL
        var_types_lookup = self.dm.variable_types

        # Helper function to aggregate based on variable type
        def _agg_by_precomputed_type(group_df: pd.DataFrame):
            """
            Helper function for .apply()
            Aggregates the 'event_value_col' based on the type stored in
            self.dm.variable_types.
            """
            event_name = group_df[self.config.event_name_col].iloc[0]
            var_type = var_types_lookup.get(event_name)
            value_series = group_df[self.config.event_value_col]

            if var_type == "numeric":
                numeric_values = pd.to_numeric(value_series, errors="coerce")
                return numeric_values.mean()
            elif var_type == "categorical":  # count occurrences of each value
                # Get counts of each unique value in the group (unique date, event)
                value_counts = value_series.value_counts()

                # Return as a dictionary
                return value_counts.to_dict()
            else:  # if var_type is None or unrecognized
                logging.warning(
                    f"Variable type for '{event_name}' not found in var_types_lookup "
                    f"(Type: '{var_type}'). Returning NA for this group."
                )
                return pd.NA

        # Perform the groupby and aggregation
        # Use observed=True for potentially better performance with categorical data
        resulting = (
            concatenated.groupby(valid_grouping_cols, observed=True)[
                [self.config.event_name_col, self.config.event_value_col]
            ]
            .apply(_agg_by_precomputed_type)
            .reset_index(name=self.config.event_value_col)
        )

        #: prepare metadata dictionary
        meta = {
            "all_trajectory_data": responses_dfs.copy(),  # Keep original list
        }
        return resulting, meta

    def reverse_conversion(
        self, text_to_convert: str, unique_events: pd.DataFrame, split_date: datetime
    ) -> pd.DataFrame:
        """
        Converts a formatted forecast text string back into a structured DataFrame.

        Parses a text string (assumed to be generated by a forecasting model in
        a specific format, e.g., using "[X] weeks later..." markers) and extracts
        the forecasted event data. It uses the `split_date` as the reference point
        for calculating absolute dates from relative week offsets in the text.
        Relies on the base class's `_extract_event_data` method for the core parsing logic.

        Parameters
        ----------
        text_to_convert : str
            The text string generated by the model containing the forecast.
        unique_events : pd.DataFrame
            A DataFrame defining known event types (names, categories)
            to help with parsing and validation within the base class method.
        split_date : datetime
            The date relative to which the week offsets (e.g., "[X] weeks later")
            in the `text_to_convert` should be interpreted.

        Returns
        -------
        converted_data : pd.DataFrame
            A DataFrame containing the structured event data parsed from the input text,
            sorted by date, category, and event name for consistency.
        """

        #: Prepend the event day preamble if it's missing at the start, as the base
        #  method might expect it for splitting days/visits.
        if self.event_day_preamble and not text_to_convert.startswith(self.event_day_preamble):
            text_to_convert = self.event_day_preamble + text_to_convert

        #: Call the base class's extraction method. Assuming the forecast output format
        #  is compatible with the general event extraction logic (e.g., uses the
        #  same "[X] weeks later..." structure). Set `only_contains_events=True`
        #  as we don't expect demographic data in the forecast output.
        converted_data = self._extract_event_data(
            text_to_convert,
            unique_events,
            init_date=split_date,
            only_contains_events=True,
        )

        #: sort for consistency based on config columns
        sort_columns = [
            col
            for col in [
                self.config.date_col,
                self.config.event_category_col,
                self.config.event_name_col,
            ]
            if col in converted_data.columns
        ]
        if sort_columns:
            converted_data = converted_data.sort_values(by=sort_columns).reset_index(drop=True)

        return converted_data
Functions
__init__
__init__(
    constant_description, nr_tokens_budget_total, config, dm
)

Initializes the ConverterForecasting instance.

Sets up the converter with necessary configurations, including descriptions for constant patient features, the total token budget for generated text, and various prompt templates defined in the Config object.

Parameters:

Name Type Description Default
constant_description DataFrame

DataFrame containing descriptions for constant patient attributes (e.g., 'Sex: Male'). May be used by base class methods for input string generation.

required
nr_tokens_budget_total int

The target maximum number of tokens for the combined input (history) and output (forecast) text.

required
config Config

A Config object containing shared configuration settings like prompt templates, column names, tokenizer details, etc.

required
dm DataManager

DataManager object containing the variable types and data frames.

required
Source code in twinweaver/instruction/converter_forecasting.py
def __init__(
    self,
    constant_description: pd.DataFrame,
    nr_tokens_budget_total: int,
    config: Config,
    dm: DataManager,
) -> None:
    """
    Initializes the ConverterForecasting instance.

    Sets up the converter with necessary configurations, including descriptions
    for constant patient features, the total token budget for generated text,
    and various prompt templates defined in the Config object.

    Parameters
    ----------
    constant_description : pd.DataFrame
        DataFrame containing descriptions for constant
        patient attributes (e.g., 'Sex: Male'). Used potentially by base
        class methods for input string generation.
    nr_tokens_budget_total : int
        The target maximum number of tokens for the
        combined input (history) and output (forecast) text.
    config : Config
        A Config object containing shared configuration settings like
        prompt templates, column names, tokenizer details, etc.
    dm : DataManager
        DataManager object containing the variable types and data frames.
    """
    # Initialize base class with config
    super().__init__(config)

    # Store specific attributes for this class
    self.constant_description = constant_description
    self.nr_tokens_budget_total = nr_tokens_budget_total
    self.dm = dm

    # Set forecasting prompts from the config
    self.forecasting_prompt_start = self.config.forecasting_fval_prompt_start
    self.forecasting_prompt_var_time = self.config.forecasting_prompt_var_time
    self.forecasting_prompt_summarized_start = self.config.forecasting_prompt_summarized_start
    self.forecasting_prompt_summarized_genetic = self.config.forecasting_prompt_summarized_genetic
    self.forecasting_prompt_summarized_lot = self.config.forecasting_prompt_summarized_lot

    # Set token budget padding and the first-visit retention flag
    self.nr_tokens_budget_padding = self.config.nr_tokens_budget_padding
    self.always_keep_first_visit = self.config.always_keep_first_visit
aggregate_multiple_responses
aggregate_multiple_responses(responses_dfs)

Aggregates structured data from multiple model responses (e.g., trajectories).

Takes a list of DataFrames, where each DataFrame represents the structured output converted from a single model response (e.g., using reverse_conversion). It concatenates these DataFrames and groups them by date and event identifiers: numeric event values are averaged, categorical values are returned as a dictionary of value counts, and variables of unknown type yield NA. This is useful for combining results from multiple stochastic samples from a model.

Parameters:

Name Type Description Default
responses_dfs list[DataFrame]

A list of DataFrames, each converted from a separate model forecast output string. Expected to have consistent columns as defined in the config (date, event name, value, etc.).

required

Returns:

Name Type Description
resulting_df DataFrame

A DataFrame containing the aggregated results, typically with the event values averaged per date/event.

meta dict

A dictionary containing metadata, currently includes "all_trajectory_data" which holds the original list of input DataFrames.
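
The type-dependent aggregation described above can be illustrated with a minimal, standalone sketch. It does not call the library: the column names (`date`, `event_name`, `event_value`) and the `variable_types` lookup are hypothetical stand-ins for the values that would come from the Config and DataManager objects, and it uses a plain loop over groups rather than the `groupby(...).apply(...)` call in the actual source.

```python
import pandas as pd

# Hypothetical column names and type lookup (assumptions for this sketch only).
DATE, NAME, VALUE = "date", "event_name", "event_value"
variable_types = {"hemoglobin": "numeric", "ecog": "categorical"}


def aggregate_values(event_name: str, values: pd.Series):
    """Return the mean for numeric variables and value counts for categorical ones."""
    var_type = variable_types.get(event_name)
    if var_type == "numeric":
        return pd.to_numeric(values, errors="coerce").mean()
    if var_type == "categorical":
        return values.value_counts().to_dict()
    return pd.NA  # unknown variable type


# Three sampled trajectories for the same date.
samples = [
    pd.DataFrame({DATE: ["2024-01-08"] * 2, NAME: ["hemoglobin", "ecog"], VALUE: ["10.0", "1"]}),
    pd.DataFrame({DATE: ["2024-01-08"] * 2, NAME: ["hemoglobin", "ecog"], VALUE: ["11.0", "1"]}),
    pd.DataFrame({DATE: ["2024-01-08"] * 2, NAME: ["hemoglobin", "ecog"], VALUE: ["12.0", "2"]}),
]
concatenated = pd.concat(samples, axis=0, ignore_index=True)

# One output row per (date, event) group.
rows = []
for (date, name), group in concatenated.groupby([DATE, NAME]):
    rows.append({DATE: date, NAME: name, VALUE: aggregate_values(name, group[VALUE])})
aggregated = pd.DataFrame(rows)
print(aggregated)
```

Note that the resulting value column mixes floats and dictionaries, so downstream code would need to branch on the variable type again before using it.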

Source code in twinweaver/instruction/converter_forecasting.py
def aggregate_multiple_responses(self, responses_dfs: list[pd.DataFrame]) -> tuple[pd.DataFrame, dict]:
    """
    Aggregates structured data from multiple model responses (e.g., trajectories).

    Takes a list of DataFrames, where each DataFrame represents the structured
    output converted from a single model response (e.g., using `reverse_conversion`).
    It concatenates these DataFrames and groups them by date and event identifiers:
    numeric event values are averaged, categorical values are returned as a
    dictionary of value counts, and variables of unknown type yield NA. This is
    useful for combining results from multiple stochastic samples from a model.

    Parameters
    ----------
    responses_dfs : list[pd.DataFrame]
        A list of DataFrames, each converted from a separate
        model forecast output string. Expected to have consistent columns
        as defined in the config (date, event name, value, etc.).

    Returns
    -------
    resulting_df : pd.DataFrame
        A DataFrame containing the aggregated
        results, typically with the event values averaged per date/event.
    meta : dict
        A dictionary containing metadata, currently includes
        "all_trajectory_data" which holds the original list of input DataFrames.
    """

    #: concat all response DataFrames together
    if not responses_dfs:  # Handle empty list case
        return pd.DataFrame(), {"all_trajectory_data": []}

    concatenated = pd.concat(responses_dfs, axis=0, ignore_index=True)

    #: get average of values per date/event identifier, keeping relevant columns
    # Define columns to group by
    grouping_cols = [
        self.config.date_col,
        self.config.event_name_col,
        self.config.event_descriptive_name_col,
        self.config.event_category_col,
        self.config.patient_id_col,
        self.config.source_col,
    ]

    # Ensure all grouping columns exist, handle potential missing ones gracefully
    valid_grouping_cols = [col for col in grouping_cols if col in concatenated.columns]
    if not valid_grouping_cols:
        # Cannot group if no valid columns are present
        # Consider logging a warning or raising an error
        return pd.DataFrame(), {"all_trajectory_data": responses_dfs.copy()}

    # APPLY DIFFERENT LOGIC FOR NUMERIC VS CATEGORICAL
    var_types_lookup = self.dm.variable_types

    # Helper function to aggregate based on variable type
    def _agg_by_precomputed_type(group_df: pd.DataFrame):
        """
        Helper function for .apply()
        Aggregates the 'event_value_col' based on the type stored in
        self.dm.variable_types.
        """
        event_name = group_df[self.config.event_name_col].iloc[0]
        var_type = var_types_lookup.get(event_name)
        value_series = group_df[self.config.event_value_col]

        if var_type == "numeric":
            numeric_values = pd.to_numeric(value_series, errors="coerce")
            return numeric_values.mean()
        elif var_type == "categorical":  # count occurrences of each value
            # Get counts of each unique value in the group (unique date, event)
            value_counts = value_series.value_counts()

            # Return as a dictionary
            return value_counts.to_dict()
        else:  # if var_type is None or unrecognized
            logging.warning(
                f"Variable type for '{event_name}' not found in var_types_lookup "
                f"(Type: '{var_type}'). Returning NA for this group."
            )
            return pd.NA

    # Perform the groupby and aggregation
    # observed=True keeps only group combinations that actually occur (relevant for Categorical groupers)
    resulting = (
        concatenated.groupby(valid_grouping_cols, observed=True)[
            [self.config.event_name_col, self.config.event_value_col]
        ]
        .apply(_agg_by_precomputed_type)
        .reset_index(name=self.config.event_value_col)
    )

    #: prepare metadata dictionary
    meta = {
        "all_trajectory_data": responses_dfs.copy(),  # Keep original list
    }
    return resulting, meta
forward_conversion
forward_conversion(patient_split)

Performs the primary conversion from a data split to prompt and target strings.

This method orchestrates the generation of the target string (what the model should predict) and the corresponding prompt string (the instruction asking for the prediction) based on a patient data split.

Note: This method generates the prompt and target strings. The generation of the input string (patient history) is typically handled separately, often by the base class's _get_input_string method, using the "events_until_split" and "constant_data" from the patient_split.

Parameters:

Name Type Description Default
patient_split DataSplitterForecastingOption

DataSplitterForecastingOption containing the data for a single split.

required

Returns:

Name Type Description
prompt_str str

The generated prompt instructing the model on the forecasting task.

target_str str

The formatted string representing the target forecast values.

target_meta dict

Metadata associated with the generated target string (output from _generate_target_string).

Source code in twinweaver/instruction/converter_forecasting.py
def forward_conversion(self, patient_split: DataSplitterForecastingOption) -> tuple[str, str, dict]:
    """
    Performs the primary conversion from a data split to prompt and target strings.

    This method orchestrates the generation of the target string (what the model
    should predict) and the corresponding prompt string (the instruction asking
    for the prediction) based on a patient data split.

    Note: This method generates the *prompt* and *target* strings. The generation
    of the *input* string (patient history) is typically handled separately, often
    by the base class's `_get_input_string` method, using the "events_until_split"
    and "constant_data" from the `patient_split`.

    Parameters
    ----------
    patient_split : DataSplitterForecastingOption
        DataSplitterForecastingOption containing the data for a single split.

    Returns
    -------
    prompt_str : str
        The generated prompt instructing the model on the
        forecasting task.
    target_str : str
        The formatted string representing the target forecast values.
    target_meta : dict
        Metadata associated with the generated target string
        (output from `_generate_target_string`).
    """

    #: generate target string and associated metadata
    target_str, target_meta = self._generate_target_string(patient_split)

    #: generate prompt string based on the target metadata
    prompt_str = self._generate_prompt(target_meta)

    # Note: The input string (patient history text) generation is handled by the base class's
    # `_get_input_string` method, which would typically be called elsewhere
    # in the pipeline that uses this forward_conversion result along with the input events.
    # This method focuses on generating the *prompt* and *target* based on the split.

    return prompt_str, target_str, target_meta
forward_conversion_inference
forward_conversion_inference(
    patient_split, future_weeks_per_variable
)

Generates only the prompt string for inference scenarios.

Used when the goal is to get a prediction from the model, not for training. It takes the patient's historical data and a specific set of variables and future days/weeks to predict, then constructs the appropriate prompt string. It does not generate a target string, as the target is what the model is expected to produce.

Parameters:

Name Type Description Default
patient_split DataSplitterForecastingOption

DataSplitterForecastingOption containing the data for a single split.

required
future_weeks_per_variable dict

Dictionary explicitly defining the forecasting task. Format: {<variable_name>: [<future_week_1>, <future_week_2>, ...], ...} These weeks are relative to the split_date_included_in_input.

required

Returns:

Name Type Description
prompt_str str

The generated prompt for the inference task.

target_pseudo_meta dict

A dictionary containing the metadata constructed specifically for generating this inference prompt (includes the provided future_prediction_time_per_variable, derived dates_per_variable, and looked-up variable_name_mapping).
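
As a rough illustration of how the relative offsets map to absolute dates, the sketch below mirrors the date arithmetic in the source listing without using the library. The helper name `offsets_to_dates` and the constant `TIME_DIVISOR_DAYS` are hypothetical; the value 7 assumes offsets are expressed in weeks (it would be 1 if the converter were configured for days).

```python
import pandas as pd

# Assumption: 7 days per unit when offsets are weeks, 1 when they are days.
TIME_DIVISOR_DAYS = 7


def offsets_to_dates(split_date, offsets_per_variable, time_divisor=TIME_DIVISOR_DAYS):
    """Convert relative offsets per variable into absolute timestamps."""
    dates = {}
    for variable, offsets in offsets_per_variable.items():
        # Accept a single number as well as a list of numbers.
        if not hasattr(offsets, "__iter__"):
            offsets = [offsets]
        dates[variable] = [
            split_date + pd.Timedelta(days=float(offset) * time_divisor) for offset in offsets
        ]
    return dates


split_date = pd.Timestamp("2024-01-01")
dates = offsets_to_dates(split_date, {"hemoglobin": [1, 4], "platelets": 2})
print(dates)
```

Here `hemoglobin` is requested 1 and 4 weeks after the split date and `platelets` 2 weeks after it.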

Source code in twinweaver/instruction/converter_forecasting.py
def forward_conversion_inference(
    self, patient_split: DataSplitterForecastingOption, future_weeks_per_variable: dict
) -> tuple[str, dict]:
    """
    Generates only the prompt string for inference scenarios.

    Used when the goal is to get a prediction from the model, not for training.
    It takes the patient's historical data and a specific set of variables and
    future days/weeks to predict, then constructs the appropriate prompt string.
    It does not generate a target string, as the target is what the model
    is expected to produce.

    Parameters
    ----------
    patient_split : DataSplitterForecastingOption
        DataSplitterForecastingOption containing the data for a single split.
    future_weeks_per_variable : dict
        Dictionary explicitly defining the forecasting task.
        Format: {<variable_name>: [<future_week_1>, <future_week_2>, ...], ...}
        These weeks are relative to the `split_date_included_in_input`.

    Returns
    -------
    prompt_str : str
        The generated prompt for the inference task.
    target_pseudo_meta : dict
        A dictionary containing the metadata constructed
        specifically for generating this inference prompt (includes the provided
        `future_prediction_time_per_variable`, derived `dates_per_variable`, and looked-up
        `variable_name_mapping`).
    """

    #: generate target_pseudo meta data needed for prompt generation
    target_pseudo_meta = {}

    #: Use the provided future_weeks_per_variable directly
    target_pseudo_meta["future_prediction_time_per_variable"] = future_weeks_per_variable

    # Make dates per variable based on the provided days/weeks and split date
    dates_per_variable = {}
    split_date = patient_split.split_date_included_in_input
    for variable, weeks in future_weeks_per_variable.items():
        # Ensure weeks is iterable
        if not hasattr(weeks, "__iter__"):
            weeks = [weeks]
        dates_per_variable[variable] = [
            split_date + pd.Timedelta(days=float(w) * self._time_divisor) for w in weeks
        ]
    target_pseudo_meta["dates_per_variable"] = dates_per_variable

    #: make target_meta["variable_name_mapping"] by looking up in input events
    variable_name_mapping = {}
    input_events = patient_split.events_until_split
    for variable in future_weeks_per_variable.keys():
        # Get descriptive name from input history
        curr_var = input_events[input_events[self.config.event_name_col] == variable]
        if not curr_var.empty and self.config.event_descriptive_name_col in curr_var.columns:
            descriptive_name = curr_var[self.config.event_descriptive_name_col].iloc[0]
        else:
            # Fallback: If not found by event_name, try descriptive name directly,
            # or just use the variable name itself. This part might need refinement
            # depending on how variables are guaranteed to be present/identifiable.
            descriptive_name = variable  # Simple fallback
        variable_name_mapping[variable] = descriptive_name

    target_pseudo_meta["variable_name_mapping"] = variable_name_mapping

    #: generate prompt (including when to generate what) based on constructed metadata
    prompt_str = self._generate_prompt(target_pseudo_meta)

    return prompt_str, target_pseudo_meta
generate_target_manual
generate_target_manual(
    target_events_after_split,
    split_date_included_in_input,
    events_until_split,
    lot_date=None,
)

Manually generates the target string and metadata from explicit components.

Provides an alternative way to generate the target string and metadata by directly passing the necessary DataFrames and the split date, rather than relying on a pre-packaged patient_split dictionary. This might be useful in scenarios where the data components are sourced differently.

Parameters:

Name Type Description Default
target_events_after_split DataFrame

DataFrame containing the future events that constitute the target forecast.

required
split_date_included_in_input datetime

The reference date marking the end of the input history and the start of the forecast period.

required
events_until_split DataFrame

DataFrame containing the historical events up to the split_date_included_in_input. Used to find the last observed values for context in the metadata.

required
lot_date datetime

The start date of the relevant Line of Therapy, if applicable. Defaults to None.

None

Returns:

Name Type Description
target_str str

The formatted string representing the target forecast.

target_meta dict

Metadata associated with the generated target string (same structure as output from _generate_target_string).

Source code in twinweaver/instruction/converter_forecasting.py
def generate_target_manual(
    self,
    target_events_after_split: pd.DataFrame,
    split_date_included_in_input: datetime,
    events_until_split: pd.DataFrame,
    lot_date: datetime = None,
) -> tuple[str, dict]:
    """
    Manually generates the target string and metadata from explicit components.

    Provides an alternative way to generate the target string and metadata
    by directly passing the necessary DataFrames and the split date, rather
    than relying on a pre-packaged `patient_split` dictionary. This might be
    useful in scenarios where the data components are sourced differently.

    Parameters
    ----------
    target_events_after_split : pd.DataFrame
        DataFrame containing the future events
        that constitute the target forecast.
    split_date_included_in_input : datetime
        The reference date marking the end of
        the input history and the start of the forecast period.
    events_until_split : pd.DataFrame
        DataFrame containing the historical events up to
        the `split_date_included_in_input`. Used to find the last observed
        values for context in the metadata.
    lot_date : datetime, optional
        The start date of the relevant Line of Therapy, if applicable.
        Defaults to None.

    Returns
    -------
    target_str : str
        The formatted string representing the target forecast.
    target_meta : dict
        Metadata associated with the generated target string
        (same structure as output from `_generate_target_string`).
    """
    # Construct the dictionary expected by _generate_target_string
    patient_dic = {
        "target_events_after_split": target_events_after_split,
        "split_date_included_in_input": split_date_included_in_input,
        "events_until_split": events_until_split,
        "lot_date": lot_date,  # Pass the provided lot_date
    }
    # Call the internal method to generate target string and metadata
    return self._generate_target_string(patient_dic)
reverse_conversion
reverse_conversion(
    text_to_convert, unique_events, split_date
)

Converts a formatted forecast text string back into a structured DataFrame.

Parses a text string (assumed to be generated by a forecasting model in a specific format, e.g., using "[X] weeks later..." markers) and extracts the forecasted event data. It uses the split_date as the reference point for calculating absolute dates from relative week offsets in the text. Relies on the base class's _extract_event_data method for the core parsing logic.

Parameters:

Name Type Description Default
text_to_convert str

The text string generated by the model containing the forecast.

required
unique_events DataFrame

A DataFrame defining known event types (names, categories) to help with parsing and validation within the base class method.

required
split_date datetime

The date relative to which the week offsets (e.g., "[X] weeks later") in the text_to_convert should be interpreted.

required

Returns:

Name Type Description
converted_data DataFrame

A DataFrame containing the structured event data parsed from the input text, sorted by date, category, and event name for consistency.
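
To make the idea of reverse conversion concrete, here is a toy sketch that turns relative week offsets in text back into dated rows and sorts them. It is not the library's `_extract_event_data` logic: the line format (`<N> weeks later: <event name> is <value>`), the function `parse_forecast`, and the column names are all assumptions made for illustration.

```python
import re

import pandas as pd

# Hypothetical line format, for illustration only.
LINE_PATTERN = re.compile(
    r"^(?P<weeks>\d+(?:\.\d+)?) weeks later: (?P<name>.+?) is (?P<value>.+)$"
)
COLUMNS = ["date", "event_name", "event_value"]


def parse_forecast(text: str, split_date: pd.Timestamp) -> pd.DataFrame:
    """Parse forecast lines into rows with absolute dates, sorted for consistency."""
    rows = []
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match is None:
            continue  # skip lines that do not follow the assumed format
        rows.append(
            {
                "date": split_date + pd.Timedelta(weeks=float(match["weeks"])),
                "event_name": match["name"],
                "event_value": match["value"],
            }
        )
    parsed = pd.DataFrame(rows, columns=COLUMNS)
    return parsed.sort_values(by=["date", "event_name"]).reset_index(drop=True)


text = "4 weeks later: platelets is 250\n1 weeks later: hemoglobin is 11.2\nunparseable line"
parsed = parse_forecast(text, pd.Timestamp("2024-01-01"))
print(parsed)
```

Values are kept as strings in this sketch; converting them to numbers would depend on the variable type.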

Source code in twinweaver/instruction/converter_forecasting.py
def reverse_conversion(
    self, text_to_convert: str, unique_events: pd.DataFrame, split_date: datetime
) -> pd.DataFrame:
    """
    Converts a formatted forecast text string back into a structured DataFrame.

    Parses a text string (assumed to be generated by a forecasting model in
    a specific format, e.g., using "[X] weeks later..." markers) and extracts
    the forecasted event data. It uses the `split_date` as the reference point
    for calculating absolute dates from relative week offsets in the text.
    Relies on the base class's `_extract_event_data` method for the core parsing logic.

    Parameters
    ----------
    text_to_convert : str
        The text string generated by the model containing the forecast.
    unique_events : pd.DataFrame
        A DataFrame defining known event types (names, categories)
        to help with parsing and validation within the base class method.
    split_date : datetime
        The date relative to which the week offsets (e.g., "[X] weeks later")
        in the `text_to_convert` should be interpreted.

    Returns
    -------
    converted_data : pd.DataFrame
        A DataFrame containing the structured event data parsed from the input text,
        sorted by date, category, and event name for consistency.
    """

    #: Prepend the event day preamble if it's missing at the start, as the base
    #  method might expect it for splitting days/visits.
    if self.event_day_preamble and not text_to_convert.startswith(self.event_day_preamble):
        text_to_convert = self.event_day_preamble + text_to_convert

    #: Call the base class's extraction method. Assuming the forecast output format
    #  is compatible with the general event extraction logic (e.g., uses the
    #  same "[X] weeks later..." structure). Set `only_contains_events=True`
    #  as we don't expect demographic data in the forecast output.
    converted_data = self._extract_event_data(
        text_to_convert,
        unique_events,
        init_date=split_date,
        only_contains_events=True,
    )

    #: sort for consistency based on config columns
    sort_columns = [
        col
        for col in [
            self.config.date_col,
            self.config.event_category_col,
            self.config.event_name_col,
        ]
        if col in converted_data.columns
    ]
    if sort_columns:
        converted_data = converted_data.sort_values(by=sort_columns).reset_index(drop=True)

    return converted_data

Functions