Preprocessing Helpers
twinweaver.utils.preprocessing_helpers
Functions
aggregate_events_to_weeks
aggregate_events_to_weeks(
df,
patientid_column="patientid",
date_column="date",
event_name_column="event_name",
event_value_column="event_value",
random_state=None,
)
Aggregates a long-format events DataFrame into weekly bins relative to each patient's first visit.
This function assigns each event date to a 7-day week counted from the patient's
first visit date, then aggregates multiple events that fall within the same week.
For identical events (same event_name) within the same week:
- Numerical values are averaged.
- Categorical values use the mode (most frequent value), with random selection
as a tiebreaker if multiple modes exist.
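The aggregation rule above can be sketched as follows. This is a minimal illustration, not the library's implementation: `aggregate_values` is a hypothetical helper, and it assumes numeric values are detected by dtype and that ties are broken with a seeded NumPy generator.

```python
import numpy as np
import pandas as pd


def aggregate_values(values, rng):
    """Hypothetical helper: mean for numeric values, mode with a random tiebreak otherwise."""
    values = pd.Series(values)
    if pd.api.types.is_numeric_dtype(values):
        return values.mean()
    modes = values.mode()  # every value tied for most frequent
    if modes.empty:
        return np.nan  # nothing but missing values in this group
    if len(modes) == 1:
        return modes.iloc[0]
    # Several modes: pick one at random, reproducibly if rng is seeded.
    return modes.iloc[rng.integers(len(modes))]


rng = np.random.default_rng(0)  # plays the role of random_state
numeric_result = aggregate_values([100, 110], rng)
single_mode = aggregate_values(["low", "high", "high"], rng)
tied_mode = aggregate_values(["low", "high"], rng)
```

Here `numeric_result` is the mean of the two values, `single_mode` is the most frequent label, and `tied_mode` is one of the two tied labels, chosen by the seeded generator.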
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input DataFrame containing longitudinal patient events in TwinWeaver format. Expected columns: patientid, date, event_category, event_name, event_value, event_descriptive_name, meta_data, source. | required |
| `patientid_column` | `str` | The name of the column containing patient identifiers. | `"patientid"` |
| `date_column` | `str` | The name of the column containing date/time information. | `"date"` |
| `event_name_column` | `str` | The name of the column containing event names (used to identify identical events). | `"event_name"` |
| `event_value_column` | `str` | The name of the column containing event values to aggregate. | `"event_value"` |
| `random_state` | `int` | Random seed for reproducibility when breaking ties in mode selection. If None, results may vary for tied modes. | `None` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | A new DataFrame with dates assigned to weeks and events aggregated. The date column will contain dates representing the start of each week relative to the patient's first visit. |
Examples:
>>> import pandas as pd
>>> from twinweaver.utils.preprocessing_helpers import aggregate_events_to_weeks
>>> data = {
... 'patientid': ['p1', 'p1', 'p1', 'p1', 'p2', 'p2'],
... 'date': ['2024-01-01', '2024-01-03', '2024-01-08', '2024-01-09',
... '2024-02-01', '2024-02-05'],
... 'event_category': ['lab', 'lab', 'lab', 'lab', 'lab', 'lab'],
... 'event_name': ['glucose', 'glucose', 'glucose', 'glucose', 'glucose', 'glucose'],
... 'event_value': [100, 110, 120, 130, 90, 95],
... 'event_descriptive_name': ['Glucose', 'Glucose', 'Glucose', 'Glucose',
... 'Glucose', 'Glucose'],
... 'meta_data': [None] * 6,
... 'source': ['events'] * 6,
... }
>>> df = pd.DataFrame(data)
>>> df['date'] = pd.to_datetime(df['date'])
>>> result = aggregate_events_to_weeks(df)
>>> # p1: Jan 1 and Jan 3 are in week 0 -> averaged to 105
>>> #     Jan 8 and Jan 9 are in week 1 -> averaged to 125
>>> # p2: Feb 1 and Feb 5 are in week 0 -> averaged to 92.5
Notes
- Weeks are calculated as 7-day intervals from each patient's first visit.
- A date is assigned to week N if it falls within [first_visit + N*7 days, first_visit + (N+1)*7 days).
- The output date for each week is the first day of that week interval.
- Non-grouping columns (like event_descriptive_name, meta_data, source) take the first value within each aggregation group.
- Empty DataFrames are returned as-is.
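The week assignment described in these notes can be reproduced with plain pandas. The sketch below is an illustration under stated assumptions, not the function's source: it takes each patient's first visit to be the earliest date in the data, and it groups only by patient and week (the documented function also groups by event name).

```python
import pandas as pd

df = pd.DataFrame({
    "patientid": ["p1", "p1", "p1", "p1", "p2", "p2"],
    "date": pd.to_datetime([
        "2024-01-01", "2024-01-03", "2024-01-08", "2024-01-09",
        "2024-02-01", "2024-02-05",
    ]),
    "event_value": [100, 110, 120, 130, 90, 95],
})

# Assumption: the first visit is the earliest date per patient.
first_visit = df.groupby("patientid")["date"].transform("min")

# Week N covers [first_visit + N*7 days, first_visit + (N+1)*7 days).
week_index = (df["date"] - first_visit).dt.days // 7

# The output date is the first day of the week interval.
week_start = first_visit + pd.to_timedelta(week_index * 7, unit="D")

weekly = (
    df.assign(date=week_start)
    .groupby(["patientid", "date"], as_index=False)["event_value"]
    .mean()
)
print(weekly)
```

With this data, p1 ends up with one row for the week starting 2024-01-01 and one for the week starting 2024-01-08, and p2 with a single row for the week starting 2024-02-01, matching the averages described in the example above.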
Source code in twinweaver/utils/preprocessing_helpers.py
identify_constant_and_changing_columns
Identifies which columns remain constant and which change over time for each patient.
This function analyzes a DataFrame to determine which columns have values that stay constant across all time points for each patient, and which columns have values that change over time for at least one patient.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input DataFrame containing patient data with multiple time points. | required |
| `date_column` | `str` | The name of the column containing date/time information. | required |
| `patientid_column` | `str` | The name of the column containing patient identifiers. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `constant_columns` | list of str | A list of column names that remain constant across all time points for every patient. |
| `changing_columns` | list of str | A list of column names that change over time for at least one patient. |
Examples:
>>> import pandas as pd
>>> from twinweaver.utils.preprocessing_helpers import identify_constant_and_changing_columns
>>> data = {
... 'patient_id': [1, 1, 2, 2],
... 'date': ['2024-01-01', '2024-02-01', '2024-01-01', '2024-02-01'],
... 'age': [30, 30, 45, 45],
... 'weight': [70, 72, 80, 80],
... 'gender': ['M', 'M', 'F', 'F']
... }
>>> df = pd.DataFrame(data)
>>> constant, changing = identify_constant_and_changing_columns(
... df, date_column='date', patientid_column='patient_id'
... )
>>> print(constant)
['age', 'gender']
>>> print(changing)
['weight']
Notes
- The date_column and patientid_column are excluded from the analysis.
- A column is considered constant if, for every patient, all of that patient's values are identical (NaN values are treated as equal to each other).
- A column is considered changing if at least one patient has different values across their time points.
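One way to express the rule in these notes is to count distinct values per patient while keeping NaN as a value of its own. This is a sketch of the idea only, assuming a per-group `nunique(dropna=False)` check; the `allergy` column is made up for illustration, and the library's actual implementation may differ.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "patient_id": [1, 1, 2, 2],
    "date": ["2024-01-01", "2024-02-01", "2024-01-01", "2024-02-01"],
    "age": [30, 30, 45, 45],
    "weight": [70, 72, 80, 80],
    "allergy": [np.nan, np.nan, "pollen", "pollen"],  # illustrative column
})

# The identifier and date columns are excluded from the analysis.
value_columns = [c for c in df.columns if c not in ("patient_id", "date")]

# Distinct values per patient and column; dropna=False counts NaN once,
# so a patient whose values are all NaN still has exactly one distinct value.
distinct_counts = df.groupby("patient_id")[value_columns].agg(
    lambda s: s.nunique(dropna=False)
)

# Constant means at most one distinct value for every patient.
is_constant = (distinct_counts <= 1).all()
constant = [c for c in value_columns if is_constant[c]]
changing = [c for c in value_columns if not is_constant[c]]

print(constant)
print(changing)
```

In this data `weight` changes for patient 1 only, which is enough to classify it as changing, while `allergy` stays constant even though patient 1's values are all NaN.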
Source code in twinweaver/utils/preprocessing_helpers.py