Thanks to Yingli for the question.
Here are the equations that you originally sent through email.
> This calculates the average of the Percentages.
> This gives you the correct MAPE weighted by volume.
Averaging percentages can give you strange numbers, because a few small-volume items can distort the result. This is not advised. So equation 1 for MAPE is not the recommended solution, although many academics use it as a model diagnostic.
Equation 2 gives you the correct MAPE as used by supply chain practitioners. It weights the MAPE by volume, so small numbers do not heavily influence the calculation. So that was the easy part.
MAPE can be defined as the volume-weighted absolute error relative to the total actual demand. In other words, this is the Percent Mean Absolute Deviation, or PMAD. It can also be intuitively explained as the average absolute deviation relative to the average unit demand. Please see a downloadable presentation at DemandPlanning.Net.
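Here is a minimal Python sketch of the contrast. Since the original equations are not reproduced above, their forms here are assumptions: equation 1 is taken as the simple average of the absolute percentage errors, and equation 2 as the volume-weighted form (PMAD); the demand numbers are purely illustrative.

```python
import numpy as np

actual = np.array([1000.0, 500.0, 10.0])     # unit demand per SKU
forecast = np.array([950.0, 530.0, 25.0])    # forecasts per SKU

# Equation 1 (assumed form): simple average of the absolute percentage
# errors. The 10-unit SKU carries a 150% error and dominates the mean.
mape_avg = np.mean(np.abs(actual - forecast) / actual)        # ~53.7%

# Equation 2 (assumed form): volume-weighted MAPE, a.k.a. PMAD:
# total absolute deviation relative to total actual demand.
pmad = np.abs(actual - forecast).sum() / actual.sum()         # ~6.3%

print(f"Average of percentages (eq. 1): {mape_avg:.1%}")
print(f"Volume-weighted MAPE / PMAD (eq. 2): {pmad:.1%}")
```

The one 10-unit SKU drags the simple average above 50%, while the portfolio-level error is only about 6%.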
Equations 3 and 4 describe a family of measures used to calculate forecast bias. Forecast bias is just one component of total forecast error, or MAPE; the other component is SKU mix error. If you over-forecast every SKU in your product portfolio, then your forecast bias will equal the MAPE.
So let us examine equation 3 now. I have defined this in my lectures and workshops as the classic measure of cross-sectional forecast bias.
Equation 4 also measures forecast bias, but somewhat weakly. We call equation 4 simply MPE, since it averages the percent errors, and small-volume SKUs may heavily influence the calculation.
In essence, if you are measuring forecast performance across a portfolio of products, you would use equation 2 for MAPE and equation 3 for forecast bias over the other two calculations. On the other hand, if you are measuring forecast error over time for the same SKU, equations 1 and 4 are also acceptable, but 2 and 3 are much stronger calculations.
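To make the bias side concrete as well, here is a similar sketch. Again, the forms are assumptions since the equations are not shown above: equation 3 is taken as the volume-weighted (cross-sectional) bias, and equation 4 as the simple average of the signed percentage errors (MPE).

```python
import numpy as np

actual = np.array([1000.0, 500.0, 10.0])
forecast = np.array([950.0, 530.0, 25.0])

# Equation 3 (assumed form): cross-sectional forecast bias: the signed
# total error relative to total actual demand.
bias = (forecast.sum() - actual.sum()) / actual.sum()         # ~ -0.3%

# Equation 4 (assumed form): MPE, the simple average of the signed
# percentage errors. The small SKU swings it, just as with equation 1.
mpe = np.mean((forecast - actual) / actual)                   # ~ +50.3%

print(f"Volume-weighted bias (eq. 3): {bias:+.1%}")
print(f"MPE (eq. 4): {mpe:+.1%}")
```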
Abstract

Many accuracy measures have been proposed in the past for time series forecasting comparisons. However, many of these measures suffer from one or more issues such as poor resistance to outliers and scale dependence. In this paper, while summarising commonly used accuracy measures, a special review is made on the symmetric mean absolute percentage error. Moreover, a new accuracy measure called the Unscaled Mean Bounded Relative Absolute Error (UMBRAE), which combines the best features of various alternative measures, is proposed to address the common issues of existing measures. A comparative evaluation on the proposed and related measures has been made with both synthetic and real-world data. The results indicate that the proposed measure, with user selectable benchmark, performs as well as or better than other measures on selected criteria. Though it has been commonly accepted that there is no single best accuracy measure, we suggest that UMBRAE could be a good choice to evaluate forecasting methods, especially for cases where measures based on geometric mean of relative errors, such as the geometric mean relative absolute error, are preferred.
Introduction

Forecasting has always been an attractive research area since it plays an important role in daily life. As one of the most popular research domains, time series forecasting has received particular attention from researchers. Many comparative studies have been conducted with the aim of identifying the most accurate methods for time series forecasting. However, research findings indicate that the performance of forecasting methods varies according to the accuracy measure being used. Various accuracy measures have been proposed as the best to use over the past decades. However, many of these measures are not generally applicable due to issues such as being infinite or undefined under certain circumstances, which may produce misleading results. The criteria required for accuracy measures have been explicitly addressed by Armstrong and Collopy and further discussed by Fildes and by Clements and Hendry.
As discussed, a good accuracy measure should provide an informative and clear summary of the error distribution. The criteria should also include reliability, construct validity, computational complexity, outlier protection, scale-independence, sensitivity to changes and interpretability. It has been suggested by many researchers that no single measure can be superior to all others on these criteria.

The evolution of accuracy measures can be seen through the measures used in the major comparative studies of forecasting methods. Root Mean Square Error (RMSE) and Mean Absolute Percentage Error (MAPE) can be considered the earliest and most popular accuracy measures. They were the primary measures used in the original M-Competition. Despite well-known issues such as their high sensitivity to outliers, they are still widely used. With these measures, errors that appear small and therefore good, such as 0.1 by RMSE or 1% by MAPE, can often be obtained, even though such numbers mean little without a point of comparison.
For example, one study employed RMSE as the performance indicator in research on stock price forecasting. The average error obtained was 84, and it was claimed to be superior to some previous models. However, without a comparison, the error 84 as a number is not easy to interpret. In fact, the average fluctuation of the stock indices used was 83, which is smaller than the error of the proposed model. A similar case can be found regarding MAPE.
Esfahanipour and Aghamiri proposed a model with an error of 1.3%, which appears to be good. Yet this error was larger than the average daily fluctuation of the stock price, which was approximately 1.2%. The poor interpretability here is mainly due to the lack of a comparable benchmark in the accuracy measure.

Armstrong and Collopy recommended the use of relative absolute errors as a potential solution to this issue. Accuracy measures based on relative errors, such as the Mean Relative Absolute Error (MRAE), can provide a better indication of how well the evaluated forecasting method performs compared to the benchmark method. However, when the benchmark error is small or equal to zero, the relative error can become extremely large or infinite. This may lead to an undefined mean, or at least a distortion of the result. Thus, Armstrong and Collopy suggested a method named 'winsorizing' to overcome this problem by trimming extreme values.
However, this process adds some complexity to the calculation, and an appropriate trimming level has to be specified.
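As an illustration of the idea, here is a hedged Python sketch. It is a simplification of winsorizing that clips error ratios at a fixed limit rather than at a chosen percentile; the cap of 10 is an arbitrary illustrative choice standing in for a properly specified trimming level.

```python
import numpy as np

def winsorized_mrae(e, e_star, limit=10.0):
    """MRAE with extreme error ratios capped before averaging.

    A sketch of the winsorizing idea only: ratios above `limit` are
    clipped, so a near-zero benchmark error cannot make the mean
    explode. In practice an appropriate trimming level must be chosen.
    """
    rae = np.abs(np.asarray(e, float) / np.asarray(e_star, float))
    return np.clip(rae, None, limit).mean()

e = np.array([2.0, -1.5, 3.0, 0.5])
e_star = np.array([4.0, 3.0, 0.001, 1.0])   # one near-zero benchmark error
print(winsorized_mrae(e, e_star))           # 2.875 instead of ~750
```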
Similarly, MAPE has the issue of being infinite or undefined due to zeros in the denominator. The symmetric mean absolute percentage error (sMAPE) was first proposed by Armstrong as a modified MAPE which could be a simple way to fix this issue. It was then used in the M3-Competition as an alternative primary measure to MAPE. However, Goodwin and Lawton pointed out that sMAPE is not as symmetric as its name suggests: in fact, it penalizes under-estimates more heavily than over-estimates. Thus, the use of sMAPE in the M3-Competition was later widely criticized by researchers. In an unpublished working paper, Chen and Yang defined a modified sMAPE, called msMAPE, by adding an additional component to the denominator of sMAPE.
The added component can efficiently avoid the inflation of sMAPE caused by zero-valued observations. However, it does not address the asymmetry of sMAPE.

Hyndman and Koehler proposed the Mean Absolute Scaled Error (MASE) as a generally applicable measure of forecasting accuracy without the problems seen in the other accuracy measures. However, this measure can still be dominated by a single large error, though infinite and undefined values are well avoided in most cases. Davydenko and Fildes proposed an altered version of MASE, the average relative MAE (AvgRelMAE), which uses the geometric mean to average the relative efficiencies of adjustments across time series.
Although the geometric mean is appropriate for averaging benchmark ratios, the appropriateness of AvgRelMAE still depends on its component measure, RelMAE, for each time series.

In this paper, a new accuracy measure is proposed to address the issues mentioned above. Specifically, by introducing a newly defined bounded relative absolute error, the new measure addresses the asymmetry issue of sMAPE while maintaining its other properties, such as scale-independence and outlier resistance.
Further, we believe that the new measure improves interpretability by basing itself on relative errors with a selectable benchmark, rather than on percentage errors computed against the observed values as sMAPE does. Given that it has been claimed that measures based on relative errors are the most reliable, we believe our measure is reliable in this sense.

Review of accuracy measures

Many accuracy measures have been proposed to evaluate the performance of forecasting methods during the past couple of decades.
A table of the most commonly used measures was given in the review of 25 years of time series forecasting, and there was also a thorough review of accuracy measures by Hyndman and Koehler. In this section, we mainly focus on new insights and new measures that have been introduced since 2006.

For a time series with $n$ observations, let $Y_t$ denote the observation at time $t$ and $F_t$ denote the forecast of $Y_t$. The forecasting error $e_t$ is then defined as $e_t = Y_t - F_t$.
Let $e_t^*$ denote the forecasting error at time $t$ obtained by some benchmark method; that is, $e_t^* = Y_t - F_t^*$, where $F_t^*$ is the forecast at time $t$ by the benchmark method.
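The sketches below use this notation. As a small running setup in Python, the naive (random-walk) forecast is assumed as the benchmark method; this is a common default, though the benchmark is in principle selectable.

```python
import numpy as np

# Illustrative observations and forecasts; the naive (random-walk)
# forecast F*_t = Y_{t-1} is assumed as the benchmark method here.
Y = np.array([12.0, 14.0, 13.0, 17.0, 16.0, 19.0])   # observations Y_t
F = np.array([11.0, 15.0, 13.5, 15.0, 17.0, 18.0])   # forecasts F_t

e = (Y - F)[1:]                    # forecasting errors e_t
e_star = (Y - np.roll(Y, 1))[1:]   # benchmark errors e*_t = Y_t - Y_{t-1}
# (t = 1 is dropped since the naive benchmark has no forecast there)
```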
$$\text{MAE} = \frac{1}{n}\sum_{t=1}^{n}\lvert e_t \rvert \tag{3}$$

MAE had been cited in the very early forecasting literature as a primary measure of performance for forecasting models. As shown in (3), MAE directly calculates the arithmetic mean of the absolute errors. Hence, it is very easy to compute and to understand. However, it may produce biased results when extremely large outliers exist in the data; even a single large error can sometimes dominate the result of MAE.

MSE, which calculates the arithmetic mean of the squared errors, was used in the first M-Competition. However, its use was later widely criticized as inappropriate. MSE is more vulnerable to outliers since it gives extra weight to large errors. Also, the squared errors are on a different scale from the original data. Thus, RMSE, which is the square root of MSE, is often preferred to MSE as it is on the same scale as the data. However, RMSE is also sensitive to forecasting outliers.

$$\text{sMAPE} = \frac{1}{n}\sum_{t=1}^{n}\frac{2\,\lvert Y_t - F_t \rvert}{\lvert Y_t \rvert + \lvert F_t \rvert} \tag{5}$$

It should be noted that absolute values are used in the denominator of sMAPE as defined in (5).
This definition is different from, but equivalent to, the definitions in Makridakis and in Makridakis and Hibon when forecasts and actual values are all non-negative. The absolute values in the denominator avoid negative sMAPE values, as pointed out by Hyndman and Koehler.

MAPE, defined as

$$\text{MAPE} = \frac{1}{n}\sum_{t=1}^{n}\left\lvert \frac{e_t}{Y_t} \right\rvert \times 100\% \tag{4}$$

was used as one of the major accuracy measures in the original M-Competition. However, the percentage errors can be excessively large or undefined when the target time series has values close to or equal to zero. Moreover, Armstrong pointed out that MAPE has a bias favouring estimates that are below the actual values. This was illustrated by extremes: "a forecast of 0 can never be off by more than 100%, but there is no limit to errors on the high side".
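Armstrong's point is easy to reproduce numerically. A minimal sketch, using the MAPE form in (4) on illustrative numbers:

```python
import numpy as np

def mape(y, f):
    # MAPE as in (4): mean absolute percentage error, in percent.
    y, f = np.asarray(y, float), np.asarray(f, float)
    return np.mean(np.abs((y - f) / y)) * 100

print(mape([100.0], [0.0]))     # 100.0: a forecast of 0 is capped at 100%
print(mape([100.0], [500.0]))   # 400.0: errors on the high side are unbounded
```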
Makridakis discussed the asymmetry of MAPE with another example, which involves two forecasts on different actual values. However, we believe that this example goes beyond Armstrong's original 1985 point.

$$\text{MRAE} = \frac{1}{n}\sum_{t=1}^{n}\left\lvert \frac{e_t}{e_t^*} \right\rvert \tag{7}$$

MRAE can provide a clearer intuition of the performance improvement over the benchmark method. However, MRAE has a limitation similar to that of MAPE, in that it too can be excessively large or undefined when $e_t^*$ is close to or equal to zero.

GMRAE, the geometric mean of the relative absolute errors, is favoured since it is generally acknowledged that the geometric mean is more appropriate for averaging relative quantities than the arithmetic mean. It can equivalently be written as the exponential of the arithmetic mean of the log-scaled error ratios, $\text{GMRAE} = \exp\!\left(\frac{1}{n}\sum_{t=1}^{n}\ln\left\lvert \frac{e_t}{e_t^*} \right\rvert\right)$, and this log-scaling makes GMRAE more resistant to outliers than MRAE, which uses the arithmetic mean of the raw error ratios.
However, GMRAE is still sensitive to outliers. More specifically, GMRAE can be dominated not only by a single large outlier but also by an extremely small error close to zero, because there is neither an upper nor a lower bound on the log-scaled error ratios used by GMRAE. It should also be noticed that zero errors, in both $e_t$ and $e_t^*$, have to be excluded from the analysis.
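A small Python sketch of MRAE and GMRAE, following (7) and the log-mean representation above, with one near-zero benchmark error to show the difference in outlier sensitivity; the numbers are illustrative.

```python
import numpy as np

def mrae(e, e_star):
    # MRAE as in (7): arithmetic mean of the relative absolute errors.
    return np.mean(np.abs(np.asarray(e) / np.asarray(e_star)))

def gmrae(e, e_star):
    # GMRAE: exponential of the mean log error ratio. Zero errors in
    # either series must be excluded beforehand, as noted above.
    return np.exp(np.mean(np.log(np.abs(np.asarray(e) / np.asarray(e_star)))))

e = np.array([2.0, 1.0, 1.5, 0.5])
e_star = np.array([4.0, 2.0, 0.01, 1.0])   # one near-zero benchmark error
print(mrae(e, e_star))    # ~37.9: dominated by the single extreme ratio
print(gmrae(e, e_star))   # ~2.08: damped by log-scaling, yet still inflated
```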
Thus, GMRAE may not be sufficiently informative.

Rather than averaging relative errors, one can also take the ratio of average errors obtained by a base measure. For example, when the base measure is RMSE, the relative RMSE (RelRMSE) is defined as

$$\text{RelRMSE} = \frac{\text{RMSE}}{\text{RMSE}^*} \tag{8}$$

RelRMSE is a commonly used measure proposed by Armstrong and Collopy, where $\text{RMSE}^*$ denotes the RMSE produced by a benchmark method. Similar measures, such as RelMAE and RelMAPE, can easily be defined. These are collectively called relative measures.
An advantage of relative measures is their interpretability. However, the performance of a relative measure is restricted by its component measure. For example, RelMAPE is undefined whenever MAPE is undefined. Further, RelMAPE can easily be dominated by extremely large outliers, since MAPE is not resistant to outliers.
Thus, it makes little sense to compute RelMAPE if MAPE, as the component, is skewed. Another disadvantage of relative measures is that they are only available when there are several forecasts for the same series. As a related idea, MASE does not have this issue. It is defined as

$$\text{MASE} = \frac{\frac{1}{n}\sum_{t=1}^{n}\lvert e_t \rvert}{\frac{1}{n-1}\sum_{t=2}^{n}\lvert Y_t - Y_{t-1} \rvert}$$

where the denominator is the in-sample MAE of the one-step naive forecast. AvgRelMAE is defined as

$$\text{AvgRelMAE} = \left(\prod_{i=1}^{m} r_i^{\,n_i}\right)^{1/\sum_{i=1}^{m} n_i}, \qquad r_i = \frac{\text{MAE}_i}{\text{MAE}_i^*} \tag{11}$$

It should be noticed that AvgRelMAE uses the out-of-sample $\text{MAE}_i^*$ as the scaling factor, while MASE uses the in-sample MAE. Though AvgRelMAE has been shown to have many advantages, such as interpretability and robustness, it still has the same issue as MASE, since both are based on RelMAE. As mentioned above, the accuracy of RelMAE is constrained by the accuracy of MAE. Since MAE can be dominated by extreme outliers, the MAE ratio $r_i$ does not necessarily represent an advisable comparison of forecasting methods based on the errors of the majority of forecasts for the $i$-th time series.
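A sketch of both measures under the definitions above; the in-sample/out-of-sample split is simplified here, with the same series standing in for both training and test data.

```python
import numpy as np

def mase(y, f):
    # MASE: MAE of the forecasts scaled by the in-sample MAE of the
    # one-step naive forecast. For brevity the same series is used as
    # both training and test data here.
    y, f = np.asarray(y, float), np.asarray(f, float)
    return np.mean(np.abs(y - f)) / np.mean(np.abs(np.diff(y)))

def avg_rel_mae(ratios, lengths):
    # AvgRelMAE as in (11): geometric mean of the per-series MAE ratios
    # r_i = MAE_i / MAE*_i, weighted by the series lengths n_i.
    r, n = np.asarray(ratios, float), np.asarray(lengths, float)
    return np.exp(np.sum(n * np.log(r)) / np.sum(n))

y = np.array([12.0, 14.0, 13.0, 17.0, 16.0, 19.0])
f = np.array([11.0, 15.0, 13.5, 15.0, 17.0, 18.0])
print(mase(y, f))                          # ~0.49: better than naive
print(avg_rel_mae([0.8, 0.9], [20, 30]))   # < 1: beats the benchmark overall
```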
A new accuracy measure

The criteria for a useful accuracy measure have been explicitly addressed in the literature. As reviewed in the previous section, many measures have been proposed, with various advantages and disadvantages.
However, most of these measures suffer from one or more issues. In this section, we propose a new accuracy measure which adopts the advantages of other measures such as sMAPE and MRAE without having their common issues.
Specifically, the proposed measure is expected to have the following properties: (i) informative: it provides an informative result without the need to trim errors; (ii) resistant to outliers: it can hardly be dominated by a single forecasting outlier; (iii) symmetric: over-estimates and under-estimates are treated fairly; (iv) scale-independent: it can be applied to data sets on different scales; (v) interpretable: it is easy to understand and provides intuitive results.

As mentioned in the review above, sMAPE is resistant to outliers because the error it uses is bounded. We would like to propose a new measure in a similar fashion to sMAPE, but without its issues. Since relative errors are more general than percentage errors in providing intuitive results, we use the Relative Absolute Error (RAE) as the base from which to derive our new measure. The Bounded Relative Absolute Error is defined as

$$\text{BRAE}_t = \frac{\lvert e_t \rvert}{\lvert e_t \rvert + \lvert e_t^* \rvert}$$

and its arithmetic mean over the series gives the Mean Bounded Relative Absolute Error:

$$\text{MBRAE} = \frac{1}{n}\sum_{t=1}^{n} \text{BRAE}_t \tag{14}$$

Though MBRAE is adequate for comparing forecasting methods, it is a scaled error that cannot be directly interpreted as a normal error ratio reflecting the error size. In fact, the process of calculating GMRAE also contains a mean of log-scaled error ratios which is not easily interpretable, but there that issue is addressed by converting the log-scaled error back to a normal ratio with the exponential function.
Similarly, a transformation can be made to MBRAE to obtain a more interpretable measure, which is termed the unscaled MBRAE (UMBRAE):

$$\text{UMBRAE} = \frac{\text{MBRAE}}{1 - \text{MBRAE}}$$
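Putting the definitions together, a minimal Python sketch of UMBRAE as described above; zero handling is omitted (when $e_t$ and $e_t^*$ are both zero the ratio is undefined, and such points would need separate treatment).

```python
import numpy as np

def umbrae(e, e_star):
    """UMBRAE from forecast errors e_t and benchmark errors e*_t.

    Follows the definitions above: BRAE_t = |e_t| / (|e_t| + |e*_t|) is
    bounded in [0, 1]; MBRAE is its arithmetic mean; the unscaling step
    maps the result back to an interpretable error ratio.
    """
    e, e_star = np.abs(np.asarray(e, float)), np.abs(np.asarray(e_star, float))
    mbrae = np.mean(e / (e + e_star))   # assumes no t with e_t = e*_t = 0
    return mbrae / (1.0 - mbrae)

e = np.array([1.0, -0.5, 2.0, 1.5])
e_star = np.array([2.0, 1.0, 2.0, 3.0])
print(umbrae(e, e_star))   # 0.6: smaller errors than the benchmark overall
```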
[Figure: Evaluation of the resistance of accuracy measures to a single forecasting outlier. A: synthetic time series data, where $Y_t$ is the target series and the forecasts $F_t^n$ differ only in their forecast of observation $Y_8$. B: results, showing that UMBRAE is less sensitive than the other measures to a single forecasting outlier.]

The second group of time series data is created to evaluate whether over-estimates and under-estimates are treated 'fairly' by the accuracy measures. Here $Y_t$ is the same time series as used in the single-outlier evaluation. In this scenario, $F_t^1$ makes a 10% over-estimate error on every observation of $Y_t$, while $F_t^2$ makes a 10% under-estimate. The results show that all the accuracy measures except sMAPE give the same error for $F_t^1$ and $F_t^2$; sMAPE produces a larger error for $F_t^2$, which indicates that it puts a heavier penalty on under-estimates than on over-estimates.
[Figure: Evaluation of the symmetry of accuracy measures to over-estimates and under-estimates. A: synthetic time series data, where $F_t^1$ makes a 10% over-estimate on all observations of $Y_t$ and $F_t^2$ makes a 10% under-estimate. B: results, showing that UMBRAE and all the other accuracy measures except sMAPE are symmetric.]

Davydenko and Fildes suggested another scenario to examine the symmetry of a measure: the reward given for improving on the benchmark should balance the penalty given for falling short of the benchmark by the same quantity. We also use this scenario to examine UMBRAE. Suppose that a time series has only two observations, and that one forecasting method is to be compared with a benchmark method. The benchmark method makes forecasts with errors $(y - f)$ of 1 and 2 respectively.
In contrast, the forecasting method produces errors of 2 and 1 respectively. As expected, the forecasting method has an error of exactly 1 as measured by UMBRAE against the benchmark method. Thus, UMBRAE is also symmetric in this case.
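This two-observation check is easy to reproduce; a self-contained sketch:

```python
import numpy as np

def umbrae(e, e_star):
    e = np.abs(np.asarray(e, float))
    e_star = np.abs(np.asarray(e_star, float))
    mbrae = np.mean(e / (e + e_star))
    return mbrae / (1.0 - mbrae)

# Benchmark errors (1, 2); the method's errors are (2, 1), as above.
print(umbrae([2.0, 1.0], [1.0, 2.0]))   # 1.0: reward and penalty balance
```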
Normally, the scale-dependence of accuracy measures concerns their ability to evaluate forecasting performance across data series on different scales. Accuracy measures based on percentages or relative ratios are clearly suited to such evaluations, and no synthetic data are made for this. However, the scale-dependence issue also exists within a data series. Thus, a third group of synthetic data is made to evaluate how the accuracy measures deal with data on different scales within a single time series. In this data set, $Y_t$ is a time series generated by the Fibonacci sequence from 2 to 144. As forecasts of $Y_t$, all values of $F_t^1$ are set to have a 20% over-estimate error relative to the corresponding observation of $Y_t$. In contrast, $F_t^2$ has the same mean absolute error as $F_t^1$, but its errors are on percentage scales ranging from 1440% down to 0.2%. Specifically, $F_n^2$ has the same absolute error as $F_{11-n}^1$; for instance, $F_1^2$ has the same absolute error as $F_{10}^1$, which is 28.8. In the results, MAE, RMSE, MASE and even GMRAE do not show any difference between the two forecasts. MRAE and MAPE, however, produce substantially different results for the two cases: the errors they measure for $F_t^2$ are approximately ten times larger than those for $F_t^1$. In contrast, UMBRAE and sMAPE give a moderate difference between the two forecasts.

[Figure: Evaluation of the scale dependency of accuracy measures. A: synthetic time series data, where $Y_t$ is the target series and $F_t^1$ and $F_t^2$ have the same mean absolute error but errors on different percentage scales relative to the corresponding values of $Y_t$. B: results, showing that MAE, RMSE, MASE and even GMRAE find no difference between $F_t^1$ and $F_t^2$; MRAE and MAPE produce substantially different errors; sMAPE and UMBRAE reasonably distinguish the two forecasts.]
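The third data set is easy to reconstruct from the description above. A sketch; the direction of the $F_t^2$ errors is not specified in the text, so over-estimates are assumed.

```python
import numpy as np

# Y_t: Fibonacci sequence from 2 to 144, as described above.
Y = np.array([2, 3, 5, 8, 13, 21, 34, 55, 89, 144], dtype=float)

F1 = 1.2 * Y                  # F_t^1: a 20% over-estimate everywhere
F2 = Y + (0.2 * Y)[::-1]      # F_n^2: the absolute error of F_{11-n}^1

print(np.abs(Y - F1).mean(), np.abs(Y - F2).mean())   # same MAE: 7.48
print(np.abs((Y - F2) / Y).max())                     # 14.4, i.e. 1440%
```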
The rank correlations between the rankings produced by the accuracy measures are as follows:

| Accuracy Measure | MAE | RMSE | MASE | AvgRelMAE | MRAE | GMRAE | MAPE | sMAPE | UMBRAE |
|---|---|---|---|---|---|---|---|---|---|
| MAE | − | 0.650 | 0.409 | 0.463 | 0.278 | 0.493 | 0.895 | 0.821 | 0.476 |
| RMSE | 0.650 | − | −0.110 | −0.254 | 0.357 | −0.229 | 0.449 | 0.168 | −0.260 |
| MASE | 0.409 | −0.110 | − | 0.711 | 0.220 | 0.687 | 0.510 | 0.590 | 0.679 |
| AvgRelMAE | 0.463 | −0.254 | 0.711 | − | 0.054 | 0.985 | 0.481 | 0.687 | 0.990 |
| MRAE | 0.278 | 0.357 | 0.220 | 0.054 | − | 0.010 | 0.079 | 0.110 | 0.026 |
| GMRAE | 0.493 | −0.229 | 0.687 | 0.985 | 0.010 | − | 0.532 | 0.702 | 0.995 |
| MAPE | 0.895 | 0.449 | 0.510 | 0.481 | 0.079 | 0.532 | − | 0.841 | 0.513 |
| sMAPE | 0.821 | 0.168 | 0.590 | 0.687 | 0.110 | 0.702 | 0.841 | − | 0.706 |
| UMBRAE | 0.476 | −0.260 | 0.679 | 0.990 | 0.026 | 0.995 | 0.513 | 0.706 | − |
| Average | 0.561 | 0.096 | 0.462 | 0.515 | 0.142 | 0.522 | 0.538 | 0.578 | 0.516 |

To eliminate the influence of outliers and extreme errors, we also use trimmed means to evaluate the accuracy measures.
A 3% trimming level is used in our study. After trimming, most of the errors measured by MAE, RMSE, MASE, MRAE and MAPE differ significantly from their untrimmed values, and the rankings of forecasting methods made by these measures also change significantly. In contrast, the errors and rankings produced by the other measures change less. In particular, the value of UMBRAE is almost invariant to trimming, with differences appearing only after the third decimal place for most of the forecasting methods, and the rankings made by UMBRAE remain unchanged. In general, all the measures except MRAE produce similar rankings.
As shown below, after trimming the rank correlations between UMBRAE and the other measures are much higher on average:

| Accuracy Measure | MAE | RMSE | MASE | AvgRelMAE | MRAE | GMRAE | MAPE | sMAPE | UMBRAE |
|---|---|---|---|---|---|---|---|---|---|
| MAE | − | 0.951 | 0.805 | 0.828 | 0.256 | 0.820 | 0.940 | 0.970 | 0.839 |
| RMSE | 0.951 | − | 0.707 | 0.710 | 0.345 | 0.708 | 0.929 | 0.909 | 0.720 |
| MASE | 0.805 | 0.707 | − | 0.952 | 0.239 | 0.912 | 0.631 | 0.751 | 0.948 |
| AvgRelMAE | 0.828 | 0.710 | 0.952 | − | 0.135 | 0.980 | 0.688 | 0.771 | 0.996 |
| MRAE | 0.256 | 0.345 | 0.239 | 0.135 | − | 0.077 | 0.143 | 0.156 | 0.133 |
| GMRAE | 0.820 | 0.708 | 0.912 | 0.980 | 0.077 | − | 0.684 | 0.753 | 0.985 |
| MAPE | 0.940 | 0.929 | 0.631 | 0.688 | 0.143 | 0.684 | − | 0.950 | 0.697 |
| sMAPE | 0.970 | 0.909 | 0.751 | 0.771 | 0.156 | 0.753 | 0.950 | − | 0.781 |
| UMBRAE | 0.839 | 0.720 | 0.948 | 0.996 | 0.133 | 0.985 | 0.697 | 0.781 | − |
| Average | 0.801 | 0.747 | 0.743 | 0.758 | 0.186 | 0.740 | 0.708 | 0.755 | 0.762 |

To show the error distributions, we use the errors produced by the forecasting method ForecastPro as an example. The figures show the distributions of the eight underlying error measurements used in the nine accuracy measures mentioned in this paper. In each figure, the top plot shows a kernel density estimate of the errors, illustrating the error distribution, while the bottom shows a box-and-whisker plot, which more clearly highlights the outliers. From these figures, it can be seen that the error measurements used in UMBRAE are more evenly distributed, with fewer outliers, than those of the other measures.
Conclusion

We have proposed a new accuracy measure, UMBRAE, based on bounded relative errors. As discussed in the review of sMAPE, one advantage of a bounded error is that it gives less significance to outliers, since it cannot become excessively large or infinite. The proposed measure, along with related measures, has been evaluated on both synthetic and real-world data.
We have shown that UMBRAE combines the best features of various alternative measures without having their common drawbacks. UMBRAE, with its selectable benchmark, provides an informative and interpretable result based on bounded relative errors. It is less sensitive to forecasting outliers than other measures, and it is also symmetric and scale-independent. Though it has been commonly accepted that there cannot be any single best accuracy measure, we suggest that UMBRAE is a good choice for general use when evaluating the performance of forecasting methods. Since UMBRAE, in our study, performs similarly to GMRAE without the need to trim zero-error forecasts, we particularly recommend UMBRAE as an alternative for the cases where GMRAE is preferred.

Although we have shown that UMBRAE has many advantages, its statistical properties have not yet been well studied.
For example, how UMBRAE reflects the properties of the error distribution is unclear. Moreover, one possible drawback of UMBRAE is that the bounded error reaches its maximum value of 1.0 whenever the benchmark error $(Y_t - F_t^*)$ is equal to zero, even if the forecast is good. This may produce a biased estimate, especially when the benchmark method produces a large number of zero errors. Although this drawback may not be relevant for the majority of real-world data, we would like to address this issue in future work.