University of Auckland
Objectively, house prices are constrained by many factors, which is why house price prediction remains a classic and challenging problem in data analysis. House price data contain many redundant features, and in practice it is hard to determine which features matter. To address this, this paper proposes a new data pre-processing approach and predicts prices by iteratively fitting two models. The initial data are first pre-processed with respect to data meaning, data form, and data relevance, and a model suited to the data is then selected for training. In traditional machine learning, Random Forest (RF) and XGBoost (XGB) are two commonly used methods. Through its Bagging process, the RF model can accurately identify "redundant" features, whereas the XGB model, although it improves prediction performance, suffers from reduced generalisation ability and cannot reflect feature importance stably. This paper therefore uses the RF model to handle redundant data and then fits the XGB model to the resulting dataset to improve prediction. Experiments were conducted on the Kaggle competition dataset "House Prices - Advanced Regression Techniques". The test results show that the final XGB regression model reached an R² of 87%, compared with 79.2% and 78.7% for the RF model and the XGB model used alone, respectively. The experiments show that the proposed method clearly improves house price prediction. To further demonstrate the fitting quality and predictive ability of the model, the "house price" target was also converted into a discrete variable with two classes, "high" and "low"; the resulting predictions reached a precision of 93% and a recall of 93%.
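The evaluation quantities named in the abstract (R², and precision and recall after the price is split into "high" and "low") can be illustrated with a minimal sketch. It is not the paper's code: the prices are made up, the helper names `r2_score`, `binarize`, and `precision_recall` are hypothetical, and the median is only an assumed cut-off, since the abstract does not say how the two classes were defined.

```python
from statistics import mean, median


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def binarize(prices, threshold):
    """Label a price "high" if it exceeds the threshold, else "low"."""
    return ["high" if p > threshold else "low" for p in prices]


def precision_recall(true_labels, pred_labels, positive="high"):
    """Precision and recall for the positive class from confusion-matrix counts."""
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up prices for illustration only; not taken from the paper or from Kaggle.
y_true = [100.0, 150.0, 200.0, 250.0, 300.0, 350.0]
y_pred = [110.0, 140.0, 230.0, 240.0, 310.0, 340.0]

# Assumed cut-off: the median of the true prices (the paper's rule is not given here).
threshold = median(y_true)

r2 = r2_score(y_true, y_pred)
precision, recall = precision_recall(
    binarize(y_true, threshold), binarize(y_pred, threshold)
)
print(f"R^2 = {r2:.3f}, precision = {precision:.2f}, recall = {recall:.2f}")
```

In this toy data one "low" house is predicted just above the cut-off, so precision for the "high" class drops while recall does not. This shows why the two figures are reported separately even though they coincide at 93% in the paper.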
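Cite this article as: Tao Ran (陶然). Optimization of house price prediction based on XGBoost [J]. Journal of Sichuan University (Natural Science Edition), 2022, 59: 037001.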