TimeXer for Time Series Forecasting with Exogenous Features
Mar 30, 2025
While a lot of research effort goes into foundation forecasting models like Time-MoE, Moirai, or TimeGPT, new data-specific models are still being actively developed and released.
One of the most recently proposed methods is TimeXer, introduced in the paper TimeXer: Empowering Transformers for Time Series Forecasting with Exogenous Variables in February 2024 [1].
As the title suggests, TimeXer is a transformer-based model, just like PatchTST and iTransformer, but one that also takes exogenous features into account. Thus, it does not rely only on past values of the series, but also on external information that may help forecast our target.
This is especially useful with time series that are impacted by external factors. For example, the demand for a product can change if there is a holiday, or demand for electricity is impacted by the temperature outside.
In those situations, we can use TimeXer, as it was specifically built to handle exogenous features.
In this article, we first explore the architecture and inner workings of TimeXer, and then we apply it in a small experiment to compare its performance against other models.
For more details, make sure to read the original paper.
Learn the latest time series forecasting techniques with my free time series cheat sheet in Python! Get the implementation of statistical and deep learning techniques, all in Python and TensorFlow!
Let’s get started!
Discover TimeXer
As mentioned before, TimeXer is a transformer-based model. The motivation behind TimeXer comes from the realization that current transformer-based models lack some key capabilities.
For example, PatchTST performs univariate forecasting, meaning that it cannot model the interdependency of multiple series. Also, it does not support exogenous features.
iTransformer, on the other hand, can perform both univariate and multivariate forecasting, but it does not support exogenous features either.
In the case of TimeXer, the model can perform univariate and multivariate forecasting, as well as use exogenous features to inform predictions.
Architecture of TimeXer
In the figure below, we can see the general architecture of TimeXer.

At a glance, we can see that TimeXer largely reuses the original Transformer architecture without modifying its components.
The main change lies in how the attention mechanisms are used: self-attention learns temporal dependencies in the target series, while cross-attention captures the effect of exogenous features on it.
Of course, there is much more to unpack, so let’s take a deeper look at each component.
Endogenous embedding
The first step of TimeXer is creating an embedding from the input time series. While the vanilla Transformer would tokenize each time step individually, TimeXer adopts patching to group data points together and tokenize those groups.

In the figure above, we can see that with patching, we group many data points, five in this case, and we then tokenize them before feeding them to the self-attention mechanism.
This strategy was first proposed in PatchTST, and iTransformer can be seen as its extreme version, where an entire series becomes a single token.
The main advantages of patching are that it reduces the number of tokens, and thus the space and time complexity of the model, and that it helps capture temporal dependencies.
Thus, TimeXer uses the same patching strategy, using non-overlapping patches, to embed the input series and feed it to the self-attention mechanism.
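To make this concrete, here is a minimal sketch of patch embedding in PyTorch. The layer names and dimensions are illustrative assumptions, not the exact implementation from the paper or from neuralforecast.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split a univariate series into non-overlapping patches and project each patch to a token."""
    def __init__(self, patch_len: int, d_model: int):
        super().__init__()
        self.patch_len = patch_len
        self.proj = nn.Linear(patch_len, d_model)  # one token per patch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, input_size), where input_size is a multiple of patch_len
        batch, input_size = x.shape
        patches = x.reshape(batch, input_size // self.patch_len, self.patch_len)
        return self.proj(patches)  # (batch, n_patches, d_model)

# Example: 168 hourly observations split into 7 patches of 24 steps each
emb = PatchEmbedding(patch_len=24, d_model=64)
tokens = emb(torch.randn(32, 168))
print(tokens.shape)  # torch.Size([32, 7, 64])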
Before we get to the attention mechanisms themselves, let's take a look at the exogenous embedding.
Exogenous embedding
Just like the input series, exogenous features must also go through an embedding step before being fed to the cross-attention mechanism.
To that end, the authors opted for a variate-level representation, as shown below.

Here, each exogenous series is embedded as a unique token. This strategy allows the model to handle features with missing values or different frequencies than the target series.
Plus, it also reduces the computational complexity of the model, since we get one token per variate. This is basically an extreme case of patching, where the whole series is patched and turned into a single token (just like in the iTransformer).
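Again, a minimal sketch of what a variate-level embedding could look like in PyTorch; the names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class VariateEmbedding(nn.Module):
    """Embed each exogenous series as a single token (variate-level embedding)."""
    def __init__(self, input_size: int, d_model: int):
        super().__init__()
        self.proj = nn.Linear(input_size, d_model)  # whole series -> one token

    def forward(self, x_exog: torch.Tensor) -> torch.Tensor:
        # x_exog: (batch, n_exog, input_size)
        return self.proj(x_exog)  # (batch, n_exog, d_model)

# Example: two exogenous series of 168 steps each become two tokens
emb = VariateEmbedding(input_size=168, d_model=64)
exog_tokens = emb(torch.randn(32, 2, 168))
print(exog_tokens.shape)  # torch.Size([32, 2, 64])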
Now that we understand the embedding strategy for both the endogenous and exogenous series, let’s take a look at the attention mechanisms to capture dependencies.
Endogenous self-attention
For the endogenous series, or the target series, the self-attention mechanism is used, as shown below.

There, the attention mechanism learns temporal dependencies from the input series.
Now, the key innovation here is the use of a global token, depicted as the cube with dashed edges in the figure above.
As such, there are three main attention operations being conducted:
- Patch-to-patch: this is the standard self-attention mechanism, where the model learns relationships between different temporal segments (patches).
- Patch-to-global: this is where the global token attends to all temporal tokens and thus learns global patterns from the series.
- Global-to-patch: this is where each temporal token attends to the global token, meaning that they receive broad information from the entire series, as learned by the global token.
Using this strategy, the model is able to capture both local and global patterns in the series, which should, in principle, lead to more accurate forecasts.
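As a rough illustration, appending a learnable global token to the patch tokens lets a single self-attention call realize the three interactions described above. This is a simplified sketch under my own assumptions, not the paper's exact code.
import torch
import torch.nn as nn

d_model, n_heads = 64, 4
batch, n_patches = 32, 7

# Learnable global token, shared across the batch
global_token = nn.Parameter(torch.randn(1, 1, d_model))
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

patch_tokens = torch.randn(batch, n_patches, d_model)  # from the patch embedding
tokens = torch.cat([patch_tokens, global_token.expand(batch, -1, -1)], dim=1)

# One self-attention pass over patches + global token covers patch-to-patch,
# patch-to-global and global-to-patch interactions at once.
out, _ = attn(tokens, tokens, tokens)
print(out.shape)  # torch.Size([32, 8, 64])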
Exogenous-to-endogenous cross-attention
The last key element of TimeXer is the cross-attention mechanism, which captures relationships between external features and the target series, as shown below.

Here, we notice that the global token (shown as the cube with dashed edges) plays another key role in this step.
The cross-attention mechanism effectively draws relevant information from the external factors.
Since exogenous variables are encoded as variate-level tokens, the global token aggregates information from them and passes it on to the temporal patches during the global-to-patch operation detailed in the previous section.
This is how TimeXer is able to capture both temporal dependencies, and relationships between the target series and external variables.
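Conceptually, the cross-attention step can be sketched as the endogenous global token querying the variate-level exogenous tokens. Again, this is an illustrative simplification with assumed dimensions, not the actual implementation.
import torch
import torch.nn as nn

d_model, n_heads = 64, 4
batch, n_exog = 32, 2

cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

global_token = torch.randn(batch, 1, d_model)      # endogenous global token (query)
exog_tokens = torch.randn(batch, n_exog, d_model)  # variate-level exogenous tokens (keys/values)

# The global token queries the exogenous tokens and aggregates their information;
# it then shares it with the temporal patches through the global-to-patch interaction.
updated_global, _ = cross_attn(global_token, exog_tokens, exog_tokens)
print(updated_global.shape)  # torch.Size([32, 1, 64])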
Final steps and output projection
Once the tokenized series have gone through both attention mechanisms, the output is sent through normalization, and a feed-forward layer, before entering a final normalization layer.
The normalization layers help stabilize the training of the model, while the feed-forward layer further learns from the deep abstract representation that results from the attention mechanisms.
The final step is then to project the output to a dimension that is consistent with the forecasting task.
If only one series is being predicted, we get a 1D vector whose length equals the forecast horizon.
If multiple series are being predicted, we get a 2D matrix whose dimensions are the number of series and the forecast horizon.
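A minimal sketch of this projection step for a single series, assuming the encoder outputs one token per patch (whether the global token is also flattened into the projection head is an implementation detail left out here):
import torch
import torch.nn as nn

d_model, n_patches, horizon = 64, 7, 24
batch = 32

# Flatten the encoded patch tokens of one series and project them to the forecast horizon
head = nn.Linear(n_patches * d_model, horizon)

encoded = torch.randn(batch, n_patches, d_model)  # encoder output for one series
forecast = head(encoded.reshape(batch, -1))
print(forecast.shape)  # torch.Size([32, 24]): one 24-step forecast per sample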
Now that we have a deep understanding of the inner workings of TimeXer, let’s apply it in our own small experiment using Python.
Forecasting with TimeXer
For this portion, we apply TimeXer on the popular EPF benchmark dataset.
It contains information on the price of electricity in five different European markets. Most importantly, the dataset also comes with known exogenous features. The whole dataset is accessible on GitHub under the MIT License.
For this small experiment, we will compare the performance of TimeXer against NHITS and TSMixerx, as they are very robust models that also support exogenous features.
Here, we use the implementation available in neuralforecast, as I believe this is the easiest and fastest way to use deep learning models for time series forecasting.
As always, the full source code for this experiment is available on GitHub.
Let’s get started!
Initial setup
First, let’s import the required packages for this experiment. We import the usual packages for data manipulation and visualization, as well as neuralforecast and utilsforecast for fitting the models and evaluating them.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from neuralforecast.core import NeuralForecast
from neuralforecast.models import TimeXer, NHITS, TSMixerx
from utilsforecast.evaluation import evaluate
from utilsforecast.losses import mae, mse
Then, we read in the data and format it as expected by neuralforecast. Mainly, we create a unique_id column to identify each series, rename the timestamp column to ds, and rename the target column to y. We also remove the extra whitespace that precedes the names of the exogenous features.
BE_url = "https://raw.githubusercontent.com/thuml/TimeXer/refs/heads/main/dataset/EPF/BE.csv"
DE_url = "https://raw.githubusercontent.com/thuml/TimeXer/refs/heads/main/dataset/EPF/DE.csv"

BE_df = pd.read_csv(BE_url, parse_dates=["date"])
BE_df["unique_id"] = "BE"
BE_df = BE_df.rename(columns={
    "date": "ds",
    " Generation forecast": "Generation forecast",
    " System load forecast": "System load forecast",
    "OT": "y"
})

DE_df = pd.read_csv(DE_url, parse_dates=["date"])
DE_df["unique_id"] = "DE"
DE_df = DE_df.rename(columns={
    "date": "ds",
    " Wind power forecast": "Wind power forecast",
    " Ampirion zonal load forecast": "Ampirion zonal load forecast",
    "OT": "y"
})
Note that, for simplicity, we forecast only two out of the five markets available in the dataset. Of course, feel free to forecast all of them, as the steps remain the same; just make sure to use the correct column names.
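If you do want to forecast all five markets, a small helper like the one below avoids repeating the renaming logic. Stripping the whitespace from every column name is an assumption that generalizes what we did above, so double-check the actual column names of each file.
def load_market(market: str) -> pd.DataFrame:
    """Load one EPF market file and format it for neuralforecast.

    Assumes every file has a 'date' column, an 'OT' target column, and
    exogenous columns that may start with a leading whitespace.
    """
    url = f"https://raw.githubusercontent.com/thuml/TimeXer/refs/heads/main/dataset/EPF/{market}.csv"
    df = pd.read_csv(url, parse_dates=["date"])
    df.columns = df.columns.str.strip()  # remove leading whitespace in column names
    df = df.rename(columns={"date": "ds", "OT": "y"})
    df["unique_id"] = market
    return df

# Example usage (hypothetical market code):
# FR_df = load_market("FR")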
Finally, let’s set some constants that will be reused throughout the experiment.
HORIZON = 24
INPUT_SIZE = 168
FREQ = "h"
BE_EXOG_LIST = ["Generation forecast", "System load forecast"]
DE_EXOG_LIST = ["Wind power forecast", "Ampirion zonal load forecast"]
Here, we use a horizon of 24 hours and an input size of 168 hours (7 days). The data has an hourly frequency, and we define the names of the exogenous features for each market.
At this point, we are ready to fit our models.
Training each model
In neuralforecast, we can initialize a list of models to be trained on a dataset. Here, we use TimeXer, NHITS and TSMixerx.
Below, we initialize the models for the Belgian market.
models = [
    TimeXer(
        h=HORIZON,
        input_size=INPUT_SIZE,
        n_series=1,
        futr_exog_list=BE_EXOG_LIST,
        patch_len=HORIZON,
        max_steps=1000
    ),
    NHITS(
        h=HORIZON,
        input_size=INPUT_SIZE,
        futr_exog_list=BE_EXOG_LIST,
        max_steps=1000
    ),
    TSMixerx(
        h=HORIZON,
        input_size=INPUT_SIZE,
        n_series=1,
        futr_exog_list=BE_EXOG_LIST,
        max_steps=1000
    )
]
In the code block above, we specify the names of our exogenous features in the futr_exog_list parameter, which expects a list of columns to be treated as external variables.
Also notice that each model uses the same input size and the same maximum number of training steps.
Finally, TSMixerx and TimeXer use the n_series parameter because they are multivariate models, meaning that they can learn the interdependency of multiple series. However, since we are modeling one market at a time, we set it to 1.
NHITS, on the other hand, is a univariate model, so it does not use this parameter.
Then, all models can be trained on the dataset using cross-validation. That way, we get multiple windows of predictions that we can compare directly to the actual values.
In this case, we use ten non-overlapping cross-validation windows, as shown below.
nf = NeuralForecast(models=models, freq=FREQ)
BE_cv_preds = nf.cross_validation(BE_df, step_size=HORIZON, n_windows=10)
BE_cv_preds.head()
This entire process was done for the Belgian market only, so we now repeat the same steps for the German market.
models = [
    TimeXer(
        h=HORIZON,
        input_size=INPUT_SIZE,
        n_series=1,
        futr_exog_list=DE_EXOG_LIST,
        patch_len=HORIZON,
        max_steps=1000
    ),
    NHITS(
        h=HORIZON,
        input_size=INPUT_SIZE,
        futr_exog_list=DE_EXOG_LIST,
        max_steps=1000
    ),
    TSMixerx(
        h=HORIZON,
        input_size=INPUT_SIZE,
        n_series=1,
        futr_exog_list=DE_EXOG_LIST,
        max_steps=1000
    )
]
nf = NeuralForecast(models=models, freq=FREQ)
DE_cv_preds = nf.cross_validation(DE_df, step_size=HORIZON, n_windows=10)
DE_cv_preds.head()
Once the training process is done for both datasets, we can then evaluate the performance of our models.
Evaluation
Before calculating performance metrics, we can first visualize the predictions made for both series.
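For instance, here is a minimal plotting sketch, assuming the cross-validation output contains a ds column, a y column, and one column named after each model:
fig, ax = plt.subplots(figsize=(10, 4))

# Plot the actual values and each model's forecasts over the cross-validation windows
ax.plot(BE_cv_preds["ds"], BE_cv_preds["y"], label="Actual", color="black")
for model_name in ["TimeXer", "NHITS", "TSMixerx"]:
    ax.plot(BE_cv_preds["ds"], BE_cv_preds[model_name], label=model_name, alpha=0.8)

ax.set_xlabel("Time")
ax.set_ylabel("Price")
ax.legend()
plt.tight_layout()
plt.show()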
Below, we can see the predictions made for the Belgian market.

From the figure above, we can see that NHITS misses the granular variations in the data. However, TSMixerx and TimeXer seem to make better forecasts.
Similarly, we can then visualize the forecasts for the German market.

In the figure above, TimeXer is now, surprisingly, the worst model: it completely misses the early dip, while NHITS and TSMixerx generally follow the shape of the actual values.
This visual inspection is further confirmed by the table of performance metrics below.

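The metrics in this table can be computed with the evaluate function imported earlier. A minimal sketch, assuming the cross-validation DataFrames from above (the cutoff column is dropped so it is not mistaken for a model column):
BE_eval = evaluate(
    BE_cv_preds.drop(columns=["cutoff"]),
    metrics=[mae, mse],
)
DE_eval = evaluate(
    DE_cv_preds.drop(columns=["cutoff"]),
    metrics=[mae, mse],
)

print(BE_eval)
print(DE_eval)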
Interestingly, TimeXer performs best on the BE dataset and NHITS performs worst there, but the situation is completely flipped for the DE dataset, where NHITS performs best, and TimeXer achieves the worst performance.
It is very hard for me to explain why that is the case. However, when testing TimeXer on other datasets, I did notice that it benefits from longer training compared to NHITS or TSMixerx, so it might be that for the DE dataset, training for 1000 steps was simply not enough.
Of course, this experiment is not a thorough benchmark. The main objective is to show how you can implement TimeXer on your own datasets.
As such, keep in mind that:
- TimeXer is implemented as a multivariate model in neuralforecast, so it can model the relationships between multiple series
- TimeXer benefits from longer training, so consider setting the maximum number of training steps higher than 1000
- the model supports exogenous features, which are passed through the futr_exog_list argument; this means the future values of those features must be known and available at the time of forecasting
Conclusion
TimeXer is a transformer-based model that combines a self-attention mechanism with a cross-attention mechanism.
The self-attention models temporal dependencies in the target series, while cross-attention captures relationships between the endogenous series and external variables.
When working with TimeXer, it is recommended to train it for more than 1000 steps as it seems to benefit from longer training times.
In our small experiment, TimeXer achieved the best performance on one dataset, but the worst on another. Once again, keep in mind that this was not a complete benchmark. Still, the results are interesting enough to make TimeXer worth trying in your own projects.
As always, each problem requires its own solution, and now you can test if TimeXer is the best answer to your scenario.
Thanks for reading! I hope that you enjoyed it and that you learned something new!
Cheers!
References
[1] Y. Wang et al., “TimeXer: Empowering Transformers for Time Series Forecasting with Exogenous Variables,” arXiv.org, 2024. https://arxiv.org/abs/2402.19072
[2] Official implementation of TimeXer, GitHub. https://github.com/thuml/TimeXer