Comparing K Nearest Neighbours Methods and Linear Regression – Is There Reason To Select One Over the Other?

Arto Haara, Annika Kangas


Non-parametric k nearest neighbours (k-nn) techniques are increasingly used in forestry problems, especially in remote sensing. Parametric regression analysis has the advantage of well-known statistical theory behind it, whereas the statistical properties of k-nn are less studied. In this study, we compared the relative performance of k-nn and linear regression in a simulation experiment. We examined the effect of three properties of the data and the problem: 1) increasing non-linearity, 2) the balance of the modelling and test data, and 3) the correctness of the assumed model form. To isolate the effect of these three aspects, we used simulated data and simple modelling problems. K-nn and linear regression gave fairly similar results with respect to the average RMSEs. With both methods, a balanced modelling dataset gave better results than an unbalanced one. When the results were examined within diameter classes, the k-nn results were less biased than the regression model results, especially at extreme values of diameter. The differences increased with increasing non-linearity of the model and increasing unbalance of the data. The difference between the methods was more obvious when the assumed model form was not exactly correct. This result, however, requires that the modelling and test datasets have a similar distribution: if the distributions differ, a regression model may be more robust than k-nn.
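The kind of comparison described above can be illustrated with a minimal sketch. It is not the authors' simulation design: the data-generating function, the coefficients, the noise level, the diameter range, the class limits and the value of k are all hypothetical choices made for illustration, and the helper names (`fit_line`, `knn_predict`, `bias_by_class`, `simulate`) are invented here. The sketch fits a straight line (a deliberately simplified model form) and a one-dimensional k-nn predictor to simulated data with a quadratic term, then reports the overall RMSE and the mean error within diameter classes.

```python
import math
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def knn_predict(x_train, y_train, x_new, k=5):
    """Predict each target as the mean response of its k nearest training points."""
    pairs = list(zip(x_train, y_train))
    preds = []
    for x0 in x_new:
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x0))[:k]
        preds.append(sum(p[1] for p in nearest) / len(nearest))
    return preds


def rmse(pred, obs):
    """Root mean squared error of predictions against observations."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


def bias_by_class(x, pred, obs, edges):
    """Mean (predicted - observed) within each class [edges[i], edges[i+1])."""
    out = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        errs = [p - o for xi, p, o in zip(x, pred, obs) if lo <= xi < hi]
        out.append(sum(errs) / len(errs) if errs else float("nan"))
    return out


def simulate(n, curvature, rng):
    """Hypothetical data: diameter d between 5 and 45, response with a quadratic term."""
    d = [rng.uniform(5, 45) for _ in range(n)]
    y = [2.0 + 0.5 * di + curvature * di ** 2 + rng.gauss(0, 1.0) for di in d]
    return d, y


# Assumed setup: modelling and test data drawn from the same distribution.
rng = random.Random(1)
d_train, y_train = simulate(300, curvature=0.02, rng=rng)
d_test, y_test = simulate(300, curvature=0.02, rng=rng)

a, b = fit_line(d_train, y_train)
pred_lm = [a + b * di for di in d_test]
pred_knn = knn_predict(d_train, y_train, d_test, k=5)

edges = [5, 15, 25, 35, 45]  # hypothetical diameter class limits
bias_lm = bias_by_class(d_test, pred_lm, y_test, edges)
bias_knn = bias_by_class(d_test, pred_knn, y_test, edges)

print("RMSE line:", round(rmse(pred_lm, y_test), 2))
print("RMSE k-nn:", round(rmse(pred_knn, y_test), 2))
print("bias by class, line:", [round(v, 2) for v in bias_lm])
print("bias by class, k-nn:", [round(v, 2) for v in bias_knn])
```

Setting `curvature` to 0 makes the straight line the correct model form, and drawing `d_train` from a narrower or skewed range than `d_test` gives an unbalanced or mismatched modelling dataset, so the same sketch can be used to explore the three factors listed in the abstract.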

Keywords: modelling, regression, imputation, balance of data





© 2008 Mathematical and Computational Forestry & Natural-Resource Sciences