Machine Learning for Energy-Efficient Buildings: HVAC Cooling Load Forecasting

Published February 16, 2026


Source: The poster is taken from ELIAS (European Lighthouse of AI for Sustainability).

Introduction

Modern buildings are no longer just concrete, glass, and steel. They are living systems—constantly reacting to weather, occupancy, and human behavior. At the heart of these systems lies HVAC infrastructure, quietly responsible for comfort, productivity, and a surprisingly large share of global energy consumption. Across the European Union, buildings are one of the biggest energy and climate levers: the European Commission notes that buildings account for about 40% of energy consumption and 36% of greenhouse-gas emissions in the EU. The picture is similarly striking in Germany—the Umweltbundesamt reports that building operation (running the building stock day to day) causes roughly 35% of final energy consumption and around 30% of CO₂ emissions.

Optimizing HVAC is not as simple as adjusting a thermostat. Because heat is stored in building mass (walls, floors, furniture) and in HVAC working fluids, real systems exhibit thermal inertia: demand responds with delays rather than instantly to weather or occupancy changes. Combined with noisy, incomplete, and time-shifted telemetry, this makes traditional rule-based control hard to tune for consistently efficient performance.

These challenges sit at the core of the AI-Based Modeling for Energy-Efficient Buildings competition. Using more than a year of measurements from a large commercial campus, the task was deceptively simple: predict the chilled-water return temperature (a proxy for cooling load) three hours ahead under strict causality constraints. The rest of this blog walks through the approach—signal analysis, time-causal feature engineering, and model design—and what the results imply for practical, energy-aware building operation.

At a Glance

This work addresses the problem of modeling cooling load in a large, real-world commercial building using machine learning. The objective is to predict the chilled water return temperature measured by sensor B205WC000.AM02, which serves as a direct proxy for the building’s cooling demand. The dataset consists of hundreds of sensor signals collected at 10-minute intervals over more than a year at the Bosch Budapest Campus. The evaluation focuses on unseen summer months, where the target signal is entirely removed and must be inferred from other measurements under strict time-causality constraints. Model performance is measured using mean squared error, with additional emphasis on robustness, interpretability, and suitability for HVAC control.

Project Duration

The competition itself ran over roughly three months, but this project was completed on a much tighter timeline: the challenge was discovered late, leaving about three weeks to go from raw telemetry to a competitive solution. That constraint shaped the entire workflow. There was no detailed system documentation available (no piping & instrumentation diagrams, no control schematics, no “how the plant is wired” notes), so the campus HVAC had to be treated as a practical black box: infer structure from signals, validate assumptions visually, and iterate fast. The focus was therefore on “smart work” over brute force—rapid signal diagnostics (correlation/lag checks, heatmaps, and driver–target alignment plots), disciplined time-causal feature engineering, and a small set of high-leverage modeling choices that could be tested quickly and communicated clearly through plots rather than lengthy guesswork.



Background

Modern HVAC systems operate as tightly coupled, dynamic networks that respond continuously to weather conditions, internal heat loads, equipment states, and control logic. In large facilities, these systems can account for up to half of total energy consumption, making accurate cooling load modeling a critical lever for energy efficiency and operational optimization. However, the complexity of such systems — characterized by delayed responses, nonlinear interactions, and cross-subsystem dependencies — makes manual modeling impractical.

The competition challenges participants to learn these dynamics directly from measurement data. Rather than forecasting a future value from its own history, the task requires understanding how cooling load emerges from interactions between sensors across the chilled water plant. During training, data from January to May 2025 includes all variables, including the target. During evaluation, covering June and July 2025, the target sensor is fully withheld. Predictions must therefore be derived from related signals while respecting a three-hour lead time, ensuring that only physically plausible, time-causal information is used. This framing shifts the focus from short-term prediction to system-level reasoning, aligning machine learning models with the realities of real-world building operation and control.

Target signal

The figure shows the target signal B205WC000.AM02 (chilled-water return temperature) over Jan 2024 → May 2025. For most of the period the series sits in a fairly stable operating band around ~9–12 °C, with a clear regime shift in early/mid-summer 2024 where the baseline rises and variability increases (roughly ~12–14 °C), consistent with higher cooling load and more active control. Superimposed on these regimes are several sharp, isolated spikes—notably around spring 2024, late Oct/Nov 2024, and spring 2025—where the temperature briefly jumps well above the normal range (peaking around ~25–28 °C). These events are visually distinct from typical dynamics and are best treated as rare operational excursions or measurement artifacts in the absence of maintenance/operations metadata. The plot also contains short discontinuities / gaps (and a few abrupt drops), which were handled during preprocessing by cubic spline interpolation for short missing segments to maintain a smooth, physically plausible signal without introducing step artifacts. Overall, the series combines slow seasonal drift, frequent small transients (control actions), and a handful of extreme events—exactly the kind of real-world behavior that motivates a time-causal forecasting setup with lagged/rolling features and robust handling of outliers and missingness.

Fig 1: B205 chilled-water return temperature (2024–2025), showing seasonal variations.

Sensor screening and selection

To keep the modeling causal and competition-compliant, each candidate sensor was evaluated under the 3-hour availability rule: predicting the target at time t may only use information available up to t−3h. Practically, every candidate driver was shifted by three hours and aligned to the target on the 10-minute grid, i.e., driver(t−3h) vs target(t). Sensors were then ranked using Mutual Information (MI), which is well suited for HVAC systems because relationships are often non-linear and can involve thresholds, saturation effects, and regime changes.
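
To make this screening concrete, the sketch below shows one way the causal MI ranking can be implemented with pandas and scikit-learn. It assumes a DataFrame `df` on the 10-minute grid that contains the target column and all candidate sensors; the minimum-overlap threshold and other parameters are illustrative, not the exact competition configuration.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

TARGET = "B205WC000.AM02"
LEAD_STEPS = 18  # 3 hours at 10-minute sampling

def rank_drivers_by_mi(df: pd.DataFrame, target: str = TARGET) -> pd.Series:
    """Rank candidate sensors by mutual information with the target,
    using only values available 3 hours before each target timestamp."""
    y = df[target]
    scores = {}
    for col in df.columns:
        if col == target:
            continue
        driver = df[col].shift(LEAD_STEPS)            # driver(t - 3h) aligned to target(t)
        pair = pd.concat([driver, y], axis=1).dropna()
        if len(pair) < 100:                           # skip sensors with too little overlap
            continue
        mi = mutual_info_regression(
            pair.iloc[:, [0]].values, pair.iloc[:, 1].values, random_state=0
        )[0]
        scores[col] = mi
    return pd.Series(scores).sort_values(ascending=False)

# Usage (illustrative): mi_ranking = rank_drivers_by_mi(df); print(mi_ranking.head(10))
```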

The plots below illustrate the idea using the strongest drivers selected by mutual information (MI). In each figure, the target is the temperature to be predicted, and the driver (input) is the corresponding candidate sensor shifted by −3 hours. When the driver shows changes that consistently appear before similar changes in the target, it supports the assumption that the sensor carries predictive signal under the causality constraint. These plots are not intended to compare absolute magnitudes across signals (each sensor has its own unit and operating range), but to emphasize time alignment and consistent co-movement.


The prediction target B205WC000.AM02 is measured in degrees Celsius (°C). Among the selected drivers, B205WC001.AM01 is also in °C, B205HP110.AC61 is in percent (%), and B205HW110.AM02 is in cubic meters per hour (m³/h).

B205WC001.AM01 (°C, Supply Temperature Chilled Water).
This signal exhibits clear step-like operating patterns and recurring transitions that align well with subsequent changes in the target. The repeated on/off plateaus and event-like changes provide a strong, structured driver that can be converted into predictive features (lags, rolling means, change rates) without violating causality.


Fig 2: Cause–effect time-alignment plot showing B205WC000.AM02(t) (target) and B205WC001.AM01(t−3h) (driver), illustrating a causal relationship under a 3-hour lag.

B205HP110.AC61 (%, Heat-pump, Speed Cooling Water).
This sensor shows distinct operating phases and bursts that coincide with later shifts in the target. These patterns are typical of equipment control behavior: once the heat pump moves into a different regime, the cooling loop response becomes visible downstream with delay. This makes the signal valuable for capturing regime changes and transient periods.

Fig 3: Cause–effect time-alignment plot showing B205WC000.AM02(t) (target) and B205HP110.AC61(t−3h) (driver), illustrating a causal relationship under a 3-hour lag.

B205HW110.AM02 (m³/h, Outlet Temperature Heat Exchanger).
Although it operates on a different physical scale, this sensor displays stable baseline behavior with deviations that align with later target movements. Including it helps the model account for coupled thermal effects across subsystems (for example, heat-exchange interactions that indirectly influence return temperature).

Fig 4: Cause–effect time-alignment plot showing B205WC000.AM02(t) (target) and B205HW110.AM02(t−3h) (driver), illustrating a causal relationship under a 3-hour lag.

Based on the MI ranking and these lag-alignment checks, the final feature set emphasized the highest-signal sensors while remaining compact and interpretable. Feature engineering was then applied primarily around these drivers (multi-horizon lags, rolling statistics, and interaction terms), ensuring the model learned from signals that are both predictive and available at inference time.



Time-causal temporal feature engineering

HVAC systems have thermal inertia: the return temperature does not respond instantly to a valve movement or a flow change—it responds with delay. For that reason, raw sensor values were not used directly. Instead, each selected B205 signal was expanded into a time-aware feature set on the 10-minute grid. This includes lagged values (to capture delayed propagation), rolling statistics such as mean/standard deviation/min/max over multiple windows (to represent operating regimes and stability), and change features such as first differences and percentage change (to capture control actions and sharp transitions). In addition, physically meaningful circuit features were added automatically by pairing sensors with the same equipment prefix and computing sums and deltas—simple but powerful constructs that often encode supply/return gaps or two-point consistency checks inside a loop.
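
A minimal sketch of this per-sensor expansion is shown below, assuming a pandas DataFrame on the 10-minute grid. The lag and window choices are examples to illustrate the structure, not the exact set used in the final model.

```python
import pandas as pd

def expand_sensor(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Expand one raw B205 signal into a time-aware feature block."""
    s = df[col]
    out = pd.DataFrame(index=df.index)
    # Lagged values: delayed propagation through the loop (30 min, 1 h, 3 h)
    for steps in (3, 6, 18):
        out[f"{col}_lag{steps}"] = s.shift(steps)
    # Rolling statistics: operating regime and stability (1 h and 6 h windows)
    for window in (6, 36):
        out[f"{col}_mean{window}"] = s.rolling(window).mean()
        out[f"{col}_std{window}"] = s.rolling(window).std()
        out[f"{col}_min{window}"] = s.rolling(window).min()
        out[f"{col}_max{window}"] = s.rolling(window).max()
    # Change features: control actions and sharp transitions
    out[f"{col}_diff1"] = s.diff()
    out[f"{col}_pct1"] = s.pct_change()
    return out
```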

Because building operation is also strongly shaped by human schedules, calendar features were added to capture recurring demand patterns that are not fully explained by sensors alone. These include time-of-day (to reflect daily setpoint schedules), day-of-week (weekday vs weekend operation), and indicators for holidays / non-working days, which often produce distinctly different HVAC regimes due to reduced occupancy and altered control strategies. Where helpful, cyclic encodings (e.g., sine/cosine for hour-of-day) were used so the model can learn smooth periodic behavior rather than treating “23:50” and “00:00” as unrelated.
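
A small sketch of the calendar block, assuming a DatetimeIndex; the holiday list is a placeholder that would come from an actual site or national calendar.

```python
import numpy as np
import pandas as pd

def calendar_features(index: pd.DatetimeIndex, holidays=()) -> pd.DataFrame:
    """Cyclic time-of-day, day-of-week, and holiday indicators."""
    out = pd.DataFrame(index=index)
    hour = index.hour + index.minute / 60.0
    out["hour_sin"] = np.sin(2 * np.pi * hour / 24.0)   # smooth across midnight
    out["hour_cos"] = np.cos(2 * np.pi * hour / 24.0)
    out["day_of_week"] = index.dayofweek
    out["is_weekend"] = (index.dayofweek >= 5).astype(int)
    out["is_holiday"] = index.normalize().isin(pd.to_datetime(list(holidays))).astype(int)
    return out
```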

Most importantly, every engineered feature was made strictly deployable under the competition’s causality rule. After all temporal and calendar features were computed, the entire feature matrix was shifted forward by three hours. This means that when predicting the target at time t, the model only sees information that would have been available at or before t − 3h. This single design choice prevents future leakage, aligns the pipeline with real HVAC operation, and forces the model to learn how system states propagate forward in time rather than exploiting accidental timing artifacts.
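
The shift itself is small in code but central to the design, so it is worth showing explicitly. The sketch below assumes `features` is the fully engineered feature matrix and `target` the raw target series, both on the same 10-minute DatetimeIndex; the names are illustrative.

```python
import pandas as pd

LEAD_STEPS = 18  # 3 hours on the 10-minute grid

def apply_causality_shift(features: pd.DataFrame, target: pd.Series):
    """Shift every feature forward by 3 hours so that the row aligned with
    target time t contains only information from t - 3h or earlier."""
    X = features.shift(LEAD_STEPS)
    aligned = pd.concat([X, target.rename("target")], axis=1).dropna()  # drop warm-up rows
    return aligned.drop(columns="target"), aligned["target"]
```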



Cross-building feature engineering: capturing indirect thermal influence

Although the model’s target is the chilled-water return temperature in B205, the underlying dynamics are not purely “local.” On a campus, HVAC behaves like a coupled system: weather pushes thermal load up and down, and neighboring buildings can influence shared circuits and operating modes through common infrastructure. The key complication in this challenge is that these external influences can persist physically even when their measurements are not provided as input features during evaluation. Directly relying on other buildings’ sensors can therefore create a train–test feature-availability mismatch: the model learns dependencies during training that it cannot use at test time.

To address this, the external buildings were characterized by the type of information they contribute. B106 functions primarily as a weather station, providing ambient conditions such as temperature and wind-related effects. B201 represents campus HVAC operation, with sensors linked to major subsystems such as AH (air handling), FC (fan coil), RC (refrigeration/cooling circuits), and WD (water distribution). These signals are informative drivers of thermal demand, but they cannot be treated as required inputs because their availability is not guaranteed in the evaluation feature set (this is a structural availability constraint, not merely frequent NaNs).

The solution was to encode those cross-building effects indirectly using B205-only proxy features: features computed only from sensors that are always present in B205, but that still contain the “fingerprints” of external forcing through delayed and structured responses. In practice, this means engineering features that capture how B205 reacts when weather changes or when the wider campus HVAC network shifts operating mode, without directly consuming B106/B201 signals.

For weather influence (B106), proxy features were designed to reflect load changes that typically follow atmospheric disturbances with short delays. Signals inside B205 often show consistent patterns after rapid ambient shifts—e.g., characteristic ramps and stabilization behavior as the control system compensates. These responses were captured using short-horizon lags and rolling statistics (roughly in the 1–3 hour range), enabling the model to learn “weather-like” forcing from B205’s own observable state.

For neighboring-building interaction (B201), the proxy design focused on system-level coupling visible in B205 sensors—especially behaviors consistent with changes in refrigeration circuits, coordinated air-handling demand, and water-loop dynamics. Here, longer horizons matter more because thermal mass and shared infrastructure propagate effects more slowly, so longer-window statistics and delayed features (up to several hours) were emphasized. In addition, a small set of coordination indicators was added to detect synchronized shifts across multiple B205 sensors—useful for capturing campus-wide events where many subsystems move together.
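
One simple way to build such a coordination indicator is sketched below: it counts how many B205 sensors change by more than their own recent variability at the same time. The sensor list, window, and threshold are illustrative assumptions rather than the exact configuration used.

```python
import pandas as pd

def coordination_indicator(df: pd.DataFrame, cols, window: int = 36) -> pd.Series:
    """Fraction of selected B205 sensors whose current change exceeds one
    rolling standard deviation — a proxy for campus-wide, synchronized shifts."""
    flags = []
    for col in cols:
        change = df[col].diff().abs()
        threshold = df[col].diff().rolling(window).std()
        flags.append((change > threshold).astype(int))
    return pd.concat(flags, axis=1).mean(axis=1)
```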

Overall, this cross-building feature block keeps the pipeline deployable: it does not assume external sensor availability, yet it still accounts for external influences by learning their signatures as they appear inside B205. The result is a model that predicts the target based on broader site dynamics rather than treating B205 as an isolated system.


Evidence for cross-building coupling (heatmaps)

Fig 5: Multiple B106 weather sensors correlate strongly with B205 signals, showing how external conditions propagate into the building’s thermal behavior and influence cooling demand.

This heatmap quantifies cross-building alignment between B106 weather-station channels (rows: B106WS01.AM50 … B106WS01.AM57, including B106WS01.AM54_2) and three key B205 loop/target sensors (columns: B205WC000.AM02, B205HW001, B205WC001.AM71). Each cell reports R² (squared correlation) over the analyzed period, so higher values indicate stronger co-movement in a linear sense.

Several weather channels show clear coupling with B205 behavior. For example, B106WS01.AM53, a weather-station signal from building B106 that in this context represents the relative humidity of the ambient outdoor air, reaches R² = 0.89 with the target B205WC000.AM02, meaning that a large fraction of the target’s variance aligns with that weather signal’s variability. Conceptually, weather does not “control” the plant directly; it acts as a boundary condition that modulates the thermal load the HVAC system must handle.

Fig 6: Several B201 HVAC subsystem sensors show strong correlations with B205 signals, indicating tightly coupled operation across buildings due to shared infrastructure and thermal inertia.

This heatmap is even more diagnostic: several B201 → B205 pairs show very strong alignment, with high R² (squared correlation) values across multiple cells. The B201 signals shown—B201AH132.AM04, B201AH164.AM13, and B201AH602.AM46—belong to the air-handling (AH) subsystem in building B201. The B205 signals—B205HW001, B205HWM002.AM71, B205HWM110, B205HWM110.AM04, B205WC000.AM71, and B205WC001.AM71—cover hot-water loop measurements (HW) and chilled-water loop measurements (WC), i.e., variables that reflect the operating state of the B205 plant room.

High cells indicate pairs that move together in a stable, repeatable way. For example, B201AH132.AM04 aligns strongly with B205HWM002.AM71 (R² = 0.964) and also with chilled-water related signals such as B205WC001.AM71 (R² = 0.920) and B205WC000.AM71 (R² = 0.850). In plain terms, parts of B201 and B205 behave like coupled subsystems: when B201’s air-handling activity shifts, corresponding loop variables in B205 often shift in a predictable way. In a shared HVAC environment this is expected—coordinated schedules, shared plant operation, and thermal inertia create consistent cross-building patterns. As with the weather analysis, these relationships are used as diagnostic evidence of coupling and motivation for proxy features, not as a claim of direct causality.


Correlation calculation (what we did; a minimal code sketch follows the list below):

  1) Time-align the data: For a given sensor pair (source ↔ target), we aligned both time series to the same timestamps (same sampling grid). If timestamps didn’t match perfectly, the data was resampled/merged onto a common time index and then matched point-by-point.
  2) Keep only overlapping points: We only used timestamps where both sensors have valid values (missing values were excluded for that pair).
  3) Compute the correlation strength: We computed the Pearson correlation between the two aligned signals (this captures how consistently one signal moves with the other in a linear way).
  4) Convert to an “R² correlation” score: We then squared that correlation value to get a number between 0 and 1:
     • 0 = no consistent relationship
     • 1 = extremely consistent relationship
     Squaring focuses purely on strength, not direction (so a strong “moves opposite” relationship still appears strong).
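
Below is a minimal sketch of this computation, assuming two pandas Series indexed by timestamps; the resampling frequency and column names are illustrative.

```python
import pandas as pd

def r2_correlation(source: pd.Series, target: pd.Series, freq: str = "10min") -> float:
    """Squared Pearson correlation between two sensors after time alignment."""
    # 1) Align both series onto a common sampling grid
    src = source.resample(freq).mean()
    tgt = target.resample(freq).mean()
    pair = pd.concat([src, tgt], axis=1, keys=["src", "tgt"])
    # 2) Keep only timestamps where both sensors have valid values
    pair = pair.dropna()
    # 3) Pearson correlation: linear co-movement of the aligned signals
    r = pair["src"].corr(pair["tgt"])
    # 4) Square it so only strength matters, not direction
    return r ** 2

# Usage (illustrative): r2_correlation(df_b106["B106WS01.AM53"], df_b205["B205WC000.AM02"])
```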


How cross-building influence was turned into B205-only features

B106 and B201 measurements are not provided as input features in the test/evaluation set. As a result, the final model cannot use weather-station signals from B106 or subsystem signals from B201 at inference time, even though their physical influence on campus HVAC behavior can still be present. The solution is therefore to translate cross-building effects into stable B205-only signals by engineering proxy features from B205 sensors that reflect the same underlying operating conditions (weather-driven load changes and campus-wide operating modes).

This is done through the following B205-only feature blocks:

  • Lag features (time-causal): B205 signals were shifted and lagged (e.g., 1–6 hours depending on subsystem dynamics) to reflect delayed thermal propagation.
  • Rolling statistics: rolling mean/variance/quantiles capture load intensity and regime stability (useful for weather-driven and neighbor-driven demand changes).
  • Rates of change and deltas: slopes and short-term differences capture fast transients and control actions that often follow external disturbances.
  • Interaction features: combinations such as (temperature × flow), or differences between related temperature sensors, capture physically meaningful behavior (mixing, transfer, and control response).
  • Regime/context indicators: features that summarize “operating mode” (high load / low load, stable / dynamic) help the model generalize across seasons without requiring external building signals.

Together, these engineered features allow the final predictor to behave as if it “knows” about weather and neighboring-building coupling—without ever needing external sensors during evaluation—because the relevant information is already reflected inside B205 measurements.
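
As a concrete illustration, the sketch below shows how interaction and regime-indicator features of this kind can be built from a temperature sensor and a flow sensor inside B205. The pairings, windows, and thresholds are assumptions for demonstration rather than the exact competition features.

```python
import pandas as pd

def circuit_features(df: pd.DataFrame, temp_col: str, flow_col: str) -> pd.DataFrame:
    """Interaction and regime indicators from one temperature/flow pair."""
    out = pd.DataFrame(index=df.index)
    # Interaction: temperature × flow roughly tracks transported thermal power
    out[f"{temp_col}_x_{flow_col}"] = df[temp_col] * df[flow_col]
    # Regime indicator: high-load flag relative to a one-day rolling 75th percentile of flow
    q75 = df[flow_col].rolling(144).quantile(0.75)
    out["high_load"] = (df[flow_col] > q75).astype(int)
    # Stability indicator: low short-term variability marks a steady operating mode
    out["stable_mode"] = (df[temp_col].rolling(18).std() < 0.1).astype(int)
    return out
```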


Why feature engineering was necessary (not optional)

These two heatmaps together justify the core training decision: don’t treat B205 as an isolated time series, but as the observable surface of a coupled system. During training, the external drivers (weather) and the neighbor-building interactions (B201) are encoded into proxy features derived from B205-only signals: lags, rolling statistics, regime flags, and interaction terms that mirror how the campus HVAC responds over time. This makes the final model more robust in the test period, because it no longer depends on “missing buildings”; it depends on B205 measurements that already contain the fingerprints of those buildings. In short, the heatmaps are evidence that cross-building coupling is real, structured, and strong enough to be worth modeling, so proxy-based feature engineering becomes a principled design choice rather than a hack.

Why XGBoost Fits This HVAC Modeling Task

XGBoost is a gradient-boosted decision tree method that builds an ensemble of small decision trees sequentially, where each tree corrects the errors of the previous ones. This is a strong match for HVAC sensor modeling because the relationships are non-linear and rule-like (control logic, thresholds, switching behavior, regime changes), and XGBoost captures these patterns naturally without requiring a carefully designed neural architecture.

This competition effectively becomes a tabular learning problem after feature engineering: raw time series are converted into lagged values, rolling statistics, and interaction features under a strict 3-hour causality constraint. In that setting, XGBoost tends to be more data-efficient and stable than neural networks, trains quickly on CPU, and remains robust under common real-world issues such as missing values, noisy sensors, and heterogeneous feature scales. It also supports practical interpretability: feature importance and error analysis can be used to verify that the model is driven by physically meaningful signals rather than spurious correlations, which is critical when hundreds of sensors and engineered features are involved.
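
A minimal training sketch along these lines is shown below. It assumes `X` and `y` are the engineered, already 3-hour-shifted feature matrix and target from the previous sections; the hyperparameters are illustrative defaults rather than the tuned competition values.

```python
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_squared_error

def train_and_validate(X: pd.DataFrame, y: pd.Series):
    """Fit a gradient-boosted tree model on the tabular features with a
    chronological train/validation split."""
    split = int(len(X) * 0.85)  # last weeks of the training period act as validation
    model = xgb.XGBRegressor(
        n_estimators=800,
        learning_rate=0.05,
        max_depth=6,
        subsample=0.8,
        colsample_bytree=0.8,
        objective="reg:squarederror",
    )
    model.fit(X.iloc[:split], y.iloc[:split])
    val_mse = mean_squared_error(y.iloc[split:], model.predict(X.iloc[split:]))
    # Interpretability check: which engineered features drive the prediction?
    importance = pd.Series(model.feature_importances_, index=X.columns)
    return model, val_mse, importance.sort_values(ascending=False)
```

A chronological split (rather than a random one) is used here because the task is forecasting: shuffling rows would leak future information into training and overstate validation quality.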


Training: Building a Reliable Foundation for HVAC Modeling

Accurate HVAC modeling starts long before model training. The main challenge in this competition was not a lack of data, but the scale, heterogeneity, and time dependence of the sensor streams. The training set was built by integrating multiple repositories across 2024 and January–May 2025, yielding a continuous, seasonally diverse period that covers heating and cooling regimes, transitions, and varying load conditions—crucial for learning robust patterns instead of overfitting to a short operating window.

Integration focused on building B205, where the target sensor is located. From hundreds of signals, 28 core sensors were selected across three subsystems—heat pumps, hot-water circuits, and chilled-water systems—to represent the dominant physical drivers of cooling demand. To avoid discarding informative but intermittent sensors, a relaxed coverage threshold was used, paired with stronger data-quality handling.

All signals were aligned to a 10-minute grid, and missingness was treated as a normal property of operational telemetry. Short gaps were interpolated, resampling/shift boundaries were handled with forward/backward fills, and any remaining missing values were imputed using per-feature medians computed strictly on the training period to preserve continuity without introducing leakage.
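
The ordering of these steps matters: interpolation and limited fills handle short, local gaps, while the train-only medians close whatever remains without leaking information from the evaluation period. A hedged sketch of that order is shown below, assuming DataFrames with a DatetimeIndex; limits and methods are illustrative (the target itself used cubic spline interpolation for short gaps).

```python
import pandas as pd

def fill_gaps(train: pd.DataFrame, test: pd.DataFrame):
    """Close gaps in order: short interpolation, boundary fills, train-only medians."""
    # Short gaps: time-based interpolation, here capped at 6 steps (1 hour)
    train = train.interpolate(method="time", limit=6)
    test = test.interpolate(method="time", limit=6)
    # Boundary effects from resampling / shifting: limited forward then backward fill
    train = train.ffill(limit=3).bfill(limit=3)
    test = test.ffill(limit=3).bfill(limit=3)
    # Remaining gaps: per-feature medians computed on the training period only,
    # reused unchanged on validation/test to avoid leakage
    medians = train.median()
    return train.fillna(medians), test.fillna(medians)
```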

Feature engineering explicitly captured HVAC “memory.” From the base sensors, a large set of time-aware features was created: lags, rolling statistics at multiple horizons, and change features (differences/percent changes) to represent propagation delays, operating regimes, and control actions. Cross-building influence was encoded via B205-only proxy features: although B106 weather and B201 sensors are not available in the test period, their effects can still appear in B205 through shared infrastructure and boundary conditions; proxies with distinct lag structures captured faster weather-driven responses and slower thermal-coupling effects.

To ensure deployability, the entire pipeline was strictly time-causal: engineered features were shifted by three hours, and all normalization/clipping/scaling statistics were computed on training data only and reused unchanged for validation/test. Because the resulting feature space was high-dimensional, mutual information was used to select features with genuine predictive signal while retaining time/context features and enforcing subsystem diversity to avoid over-reliance on any single circuit. Finally, integrity checks (outages, stuck values, feature activation, and test-availability consistency) ensured the dataset preserved real system complexity while remaining consistent, causal, and suitable for operational forecasting.

Validation and Results

Because the organizer’s true test labels were not available locally, model quality was assessed in two complementary ways: (1) a rigorous internal validation analysis using held-out periods from the historical dataset, and (2) the official Kaggle leaderboard score on the hidden test split. This combination provides both interpretability (why the model behaves the way it does) and an external reality check (how well it generalizes to unseen evaluation data).

Internal validation: high-fidelity tracking in winter and summer operating regimes

To make the validation interpretable for an HVAC audience, performance is visualized using representative 48-hour windows selected from two very different seasonal regimes: a winter-like period and a summer-like period. Each window overlays the measured chilled-water return temperature (target) against the model prediction at the competition cadence (10-minute sampling, i.e., 6 points per hour). Presenting a compact window rather than the full year avoids “spaghetti plots” and highlights what matters operationally: does the model track real dynamics smoothly, without drift, and without systematic bias?
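
For reference, the window-level statistics reported on these plots (RMSE, MAE, and mean bias) can be computed as sketched below for any aligned slice of measurements and predictions; the function name is illustrative.

```python
import numpy as np

def window_metrics(y_true, y_pred):
    """RMSE, MAE and mean bias for an aligned window of measurements and predictions."""
    resid = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return {
        "rmse": float(np.sqrt(np.mean(resid ** 2))),
        "mae": float(np.mean(np.abs(resid))),
        "bias": float(np.mean(resid)),  # near zero means no systematic over/under-estimation
    }
```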

Winter operating regime (January)

The winter window shows stable, high-fidelity tracking: the predicted curve stays close to the measured target over long stretches while following short-term changes in operating conditions. This matters because winter operation often consists of steady regimes punctuated by discrete control actions; the model captures these transitions without introducing spurious oscillations. The summary statistics on the plot support this interpretation (RMSE = 0.103 °C, low MAE, and near-zero bias), indicating that the model is not consistently over- or under-estimating the target.

At the same time, it is important to interpret very small residual differences in the context of measurement uncertainty. The dataset does not specify the exact sensor type and measurement chain for B205WC000.AM02, and in HVAC water loops the practical accuracy (sensor + transmitter + installation + calibration) is often on the order of ~0.1 °C. Therefore, an RMSE around 0.1 °C may already be close to the effective measurement-noise floor of the target, meaning part of the remaining error could reflect label uncertainty rather than model limitations.

Fig 7: Representative winter operating window (January).


Summer operating regime (August)

A second window taken from late summer demonstrates that the same feature-engineered model remains reliable when the system operates under cooling-dominated conditions. Summer behavior is typically shaped by stronger external forcing (weather and occupancy schedules) and more frequent load changes. In this window, the prediction continues to follow the measured target closely, including gradual ramps and short disturbances—exactly the behavior required for 3-hour-ahead operational forecasting. The summary box reports low bias and RMSE = 0.238 °C, indicating that performance is not driven by a systematic offset but by genuine tracking of the underlying thermal dynamics. As in winter, small residual differences should be interpreted in the context of measurement uncertainty, but the overall alignment shows that the learned relationships generalize beyond a single operating regime.

Together, these two windows communicate the main validation message in an intuitive way: the model behaves consistently across distinct seasonal regimes, rather than being tuned to one narrow set of conditions.

Fig 8: Representative summer operating window (August).


Why June–July was treated differently

The competition’s evaluation focused on June and July 2025 conditions in the organizer’s hidden test period. For this reason, June–July 2024 was treated as valuable training material rather than a primary validation block, ensuring the model was exposed to cooling-season behavior during learning. Validation windows were therefore selected from other months (winter and late-summer proxies) to preserve a clean internal check while still training the model to handle the regime that mattered most in the final evaluation.

External validation: robust Kaggle test performance

The final confirmation comes from the official Kaggle scoring on the hidden test split. On the competition test data, the achieved MSE was 0.309 on 51% of the samples and 0.320 on 49%, showing only a small spread between the two portions of the evaluation. This consistency is a strong robustness signal: performance does not depend on a single subset or a narrow operating regime, and there is no sharp degradation between splits. In practice, such near-matching scores indicate that the model’s feature engineering and learning strategy generalize well to unseen conditions—an outcome aligned with reliable deployment goals in real HVAC forecasting.

Outlook: from forecasting to control

The next step after forecasting is to use the 3-hour prediction to operate the HVAC system more proactively. Instead of reacting only after load has already changed, the forecast can serve as an early warning signal: rising demand can be anticipated through smoother setpoint adjustments, pump control, or chiller staging, while falling demand allows the system to avoid unnecessary overcooling and inefficient cycling. A practical rollout would begin in shadow mode, where the model generates control recommendations without actively influencing the plant, enabling a direct comparison with existing rule-based strategies before any gradual deployment.

Conclusion

This project demonstrates that combining domain knowledge with carefully engineered, time-causal machine learning models enables reliable, physically plausible HVAC forecasting—even under real-world data constraints. More broadly, it shows how data-driven approaches can form a robust bridge from monitoring and prediction toward intelligent, energy-efficient building operation.

We are happy to support and advise on similar AI-based modeling and optimization projects for energy systems and building operations—feel free to reach out via our contact form.
