DEVELOPMENT OF AN ENHANCED C4.5 DECISION TREE ALGORITHM USING A MEMOIZED MAPREDUCE MODEL
Abstract
Classification technique in data mining concentrates on the prediction of categorical or discrete target variables which is designed to be handled by the classical C4.5 decision tree algorithm, an algorithm whose aim is to produce a tree which accurately predicts the target variable for a new unseen data. However its recursive nature poses a limitation when huge volume of dataset is involved; making computation more complex and resulting in an inefficient implementation of the algorithm in terms of computing time, memory utilization and data complexity. Meanwhile, several researches have been done to control these limitations. One of such improvements is the parallelizing of the algorithm using the MapReduce model. This involves dividing the large dataset into smaller units and sharing them on multiple computers for parallel processing, but the recursive nature of the algorithm makes the cost of computing large number of repeated calculations quite high, which is our concern in this work. . This research is aimed at reducing computation time further, by using a memoized MapReduce model, which involves the saving of the result of previous calculations in a cache; hence, when same calculations are encountered again, the cached result is returned, thus re-computation is avoided. The cached result is considered a reduced cost compared to the computational cost of re-computation.
References
Badgujar, G., & Sawant, K. (2017). Improved C4.5 Decision Tree Classifier Algorithm for Analysis of Data Mining Application. International Journal for Research in Engineering Application & Management, 2(10), 18-24.
Becklas, A. (2018). FIFA World Cup. Kaggle Repository. Kaggle Inc. Retrieved October 17, 2019, from https://www.kaggle.com/abecklas/fifa-world-cup
Cherfi, A., Nouira, K., & Ferchichi, A. (2018). Very Fast C4.5 Decision Tree Algorithm. Applied Artificial Intelligence, 32(2), 119–137. doi:10.1080/08839514.2018.1447479
Dai, W., & Ji, W. (2014). A MapReduce Implementation of C4.5 Decision Tree Algorithm. International Journal of Database Theory Application, 7(1), 49-60.
Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters. Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation -, 6, pp. 137-150. San Francisco, CA.
Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From Data mining to Knowledge Discovery in Databases. AI Magazine, 17(3), 37-54.
Frawley, W. J., Piatetsky-Shapiro, G., & Matheus, C. J. (1992). Knowledge discovery in databases: an overview. AI Magazine, 13(3), 57-70.
Hallmark, E. (2018). NBA Historical Stats and Betting Data. Kaggle Repository. Kaggle Inc. Retrieved August 13, 2019, from https://www.kaggle.com/ehallmar/nba-historical-stats-and-betting-data
Han, J., & Kamber, M. (2006). Data Mining: Concepts and Techniques (2nd ed.). San Francisco, CA, USA: Morgan Kaufmann.
Jaffar, J., Santosa, A. E., & Voicu, R. (2008). Efficient Memoization for Dynamic Programming with Ad-Hoc Constraints. Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, (pp. 297-303). Chicago, USA.
Jaseena , K. U., & David, J. M. (2014). Issues, Challenges and Solutions : Big Data Mining. Sixth International Conference on Networks & Communications, (pp. 131–140). doi:10.5121/csit.2014.41311
Kahn, M. (n.d.). Diabetes Data Set. UCL Machine Learning Repository. St. Louis: Center for Machine Learning and Intelligent Systems. Retrieved August 13, 2019, from https://archive.ics.uci.edu/ml/datasets/Diabetes
Ilter, N., & Guvenir, H. A. (1998, January 1). Dermatology Data Set. UCI Machine Learning Repository. Ankara, Turkey: Center for Machine Learning and Intelligent Systems. Retrieved August 13, 2019, from https://archive.ics.uci.edu/ml/datasets/Dermatology
Mu, Y., Liu, X., Yang, Z., & Liu, X. (2017). A parallel C4.5 decision tree algorithm based on MapReduce. Concurrency and Computation. Concurrency and Computation: Practice and Experience, 29(8), 1-12. doi:10.1002/cpe.4015
Muslim, M. A., Rukmana, S. H., Sugiharti, E., Prasetiyo, B., & Alimah, S. (2018). Optimization of C4.5 algorithm-based particle swarm optimization for breast cancer diagnosis. Journal of Physics: Conf. Series, 983(2018), 1-8. doi:10.1088/1742- 6596/983/1/012063.
Piatetsky-Shapiro, G. (1991). Knowledge Discovery in Real Databases. AI Magazine, 11(5), 68-70.
Quinlan, J. (1993). C4.5: Programs for machine learning (Vol. 16). San-Mateo, CA: Morgan Kaufman.
Rajeshinigo, D., & Jebamalar, J. P. (2017). Accuracy Improvement of C4.5 using K Means Clustering. International Journal of Science and Research, 6(6), 2755-2758.
Seema, S., Agrawal, J., & Sharma, S. (2013). Classification through Machine Learning. International Journal of Computer Applications, 82(16), 20- 27
Sniedovich, M., & Lew, A. (2006). Dynamic Programming : an overview. Control and Cybernatics, 35(3), 513-533.
Suresh, A., Swamy, B. N., Rohou, E., & Seznec, A. (2015). Intercepting Functions for Memoization: A Case Study Using Transcendental Functions. ACM Transsactions on Architecture and Code Optimization, 12(2), 18:1-18:23.
Wang, B., Huang, S., Qiu, J., Liu, Y., & Wang, G. (2014). Parallel online sequential extreme learning machine based on MapReduce. Neurocomputing, 149, 224-232.
Wang, Z., Wang, J. Huo, Y. Tuo, Y., & Yang, Y. (2016) “A Searching Method of Candidate Segmentation Point in SPRINT Classification,” Journal of Electrical and Computer Engineering, vol. 2016, pp. 1-5. doi:10.1155/2016/2168478
Wu, X., Kumar, V., Quinlan, J. R., Ghosh, J., Yang, Q., Motoda, H.. Geoffery, J.M., Ng, A., Liu, B., Yu, P.S., Zhou, Z., Steinbach, M., Hand, D.J., & Steinberg, D. (2008) “Top 10 algorithms in data mining,” Knowledge Information System, 14, pp. 1–37.
Zhu, F., Tang, M., Xie, L., & Zhu, H. (2018). A Classification Algorithm of CART Decision Tree based on MapReduce Attribute Weights. International Journal of Performability Engineering, 14(1), 17-25. doi:10.23940/ijpe.18.01.p3.1725
Zwitter, M., & Soklic, M. (1988, July 11). Breast Cancer Data Set. UCI Machine Learning Repository. Ljubljana, Yugoslavia: Center for Machine Learning and Intelligent Systems. Retrieved August 13, 2019, from http://mlr.cs.umass.edu/ml/datasets/Breast+Cancer
Copyright (c) 2023 FUDMA JOURNAL OF SCIENCES
This work is licensed under a Creative Commons Attribution 4.0 International License.
FUDMA Journal of Sciences