Randomized algorithms based on interval pattern structures for classification and regression problems in credit risk management: dissertation and abstract topic, VAK RF specialty 05.13.18, Candidate of Sciences Masyutin, Alexey Alexandrovich

  • Masyutin, Alexey Alexandrovich
  • Candidate of Sciences
  • 2018, Moscow
  • VAK RF specialty 05.13.18
  • Number of pages: 102
Masyutin, Alexey Alexandrovich. Randomized algorithms based on interval pattern structures for classification and regression problems in credit risk management: Candidate of Sciences dissertation: 05.13.18 - Mathematical modeling, numerical methods and software complexes. Moscow. 2018. 102 p.

Table of contents of the dissertation, Candidate of Sciences Masyutin, Alexey Alexandrovich

1 Introduction

1.1 Ph.D. Thesis Relevance

2 Overview of Data Analysis Methods in Commercial Banks

2.1 Mathematical Modeling in Commercial Banks

2.2 Neural Networks in Credit Scoring

2.3 Classification Tasks in Marketing Campaign Management

2.4 Loan Default Prediction in Banking: Scorecards

3 Formal Concept Analysis in Classification Problem

3.1 Formal Concept Analysis

3.2 Lazy Classification with Pattern Structures

3.3 Query-Based Classification Algorithm

3.4 Voting Schemes

3.5 Experiments with Top-10 Bank Data

3.6 Experiments with open data

3.7 QBCA. Alternative approaches

3.8 Interpretability: visualization of premises

3.9 Computational time analysis

4 FCA in regression problem

4.1 Problem description

4.2 Augmented interval pattern structures

4.3 Query-based regression algorithm with continuous target attribute

4.4 Data and experiments

5 Conclusion

6 Acknowledgements

References

Appendix

Introduction of the dissertation (part of the abstract) on the topic "Randomized algorithms based on interval pattern structures for classification and regression problems in credit risk management"

1.1 Ph.D. Thesis Relevance

Although the biggest recent bank failures are mostly attributed to an inability to predict key market factors and to a lack of banking regulation (the financial crisis of 2007-2008), history knows a number of failures driven purely by credit risk mismanagement. For instance, Long-Term Credit Bank of Japan (LTCB) was one of the top three banks in Japan responsible for postwar economic growth, and in 1989 it was considered the 9th largest company in the world by asset value. At the time LTCB had more than $19.2 billion in bad debt. In 1998, the Japanese government nationalized LTCB and then restructured it as a commercial bank named Shinsei Bank. The Bank of New England (BNE), along with its two sister banks, Maine National Bank and Connecticut Bank and Trust, failed on January 6, 1991. BNE was the largest bank in the New England area. Together with its sister banks, it had assets totaling $21.8 billion and deposits totaling $19 billion. Bad loans led to its downfall. Los Angeles-based IndyMac used to be the largest loan originator in the USA. Founded in 1995 as Countrywide Mortgage Investment, IndyMac fueled its aggressive growth through risky loan products such as Alt-A mortgages, concentrating on inflated real estate markets such as California and Florida and relying heavily on borrowed funds, especially from the FHLB (Federal Home Loan Bank). As of July 2008, IndyMac had total assets of $32.01 billion. Moreover, the financial crisis of 2007-2008 was at its root caused by inappropriate credit risk assessment of mortgage loans.

As far as the Russian financial market is concerned, in 2016-2017 several banks proved unable to control the quality of their loan portfolios [86], [87], [88]. The Central Bank of Russia strengthened its focus on the control of banks' asset quality and installed its own management teams on the executive boards of problem banks. To a considerable extent the instability is caused by inconsistent risk management, which leads to significant losses. The greatest part of losses in Russian banks (70%) is due to credit risk. Credit risk is the risk that borrowers will not repay the granted loan amount in time. The first step in managing credit risk is the ability to assess it. In the context of credit risk assessment there are three key parameters: probability of default (PD), loss given default (LGD) and exposure at default (EAD) [89]. Multiplied together, they provide an estimate of expected loss (EL). The majority of decisions in the credit process, such as whether to grant a loan, sell the loan or initiate a legal bankruptcy procedure, are made on the basis of expected loss estimates.
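The product of the three parameters can be illustrated with a short Python sketch. The function name and the example figures are illustrative assumptions rather than values from the thesis; PD and LGD are assumed to be fractions between 0 and 1, and EAD is assumed to be a non-negative amount in currency units.

```python
def expected_loss(pd_: float, lgd: float, ead: float) -> float:
    """Return expected loss EL = PD * LGD * EAD for a single exposure.

    Assumptions of this sketch (not taken from the thesis):
    pd_ and lgd are fractions in [0, 1], ead is a non-negative amount.
    """
    if not 0.0 <= pd_ <= 1.0:
        raise ValueError("PD must be in [0, 1]")
    if not 0.0 <= lgd <= 1.0:
        raise ValueError("LGD must be in [0, 1]")
    if ead < 0.0:
        raise ValueError("EAD must be non-negative")
    return pd_ * lgd * ead


if __name__ == "__main__":
    # Hypothetical loan: 5% default probability, 40% loss given default,
    # 10,000 currency units of exposure at default.
    print(expected_loss(0.05, 0.40, 10_000.0))
```

For the hypothetical loan in the usage example, the expected loss is about 200 currency units, which is the kind of figure a credit decision would weigh against the expected income from the loan.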

Mathematical models have been widely used to make precise predictions of the levels of PD, EAD and LGD [1]. Models are usually calibrated on historical data on borrower performance. From a data science standpoint, PD estimation is a binary classification problem, while EAD and LGD estimation are regression problems. As the banking industry has become more and more regulated, the requirements for the development and validation of mathematical models have become stricter and more detailed [89].

One of the serious trade-offs in credit risk modeling is prediction accuracy versus model interpretability. As will be shown later, some regulators require banks to provide reject reasons to borrowers; in addition, when central banks examine a bank's models, they want to understand the economic intuition behind them in order to be confident that the models will show expected and stable performance. Both requirements can typically be met if the model is interpretable. At the same time, interpretable data analysis algorithms usually belong to the simplest class, such as logistic regression or decision trees, which cannot always provide the desired accuracy. We will give an overview of applications of more complex algorithms in credit scoring, such as neural networks, which are capable of describing non-linear interdependencies within the data but cannot provide the bank with a reason for rejecting or accepting a loan application.

As stated above, accurate credit risk estimation is the key tool for risk management, and banks are naturally eager to increase the accuracy of their algorithms while keeping them interpretable.

The relevance of this Ph.D. thesis is that it offers data analysis algorithms whose accuracy is superior to that of the simple algorithms widely adopted within banks (such as logistic regression, decision trees and scorecards) and which still remain interpretable, in the sense that they provide a decision maker with a set of rules applicable to the assessment of borrower creditworthiness.

To achieve this goal, several novelties were introduced within the methods of formal concept analysis (FCA) and interval pattern structures [51]. The reasons why FCA methods are suitable for credit risk assessment under the interpretability requirements will be explained in the following sections.

The novelty brought to the well-developed tools of FCA consists of two parts. The first is that FCA is adapted to the classification problem on numerical data with the step of concept lattice construction omitted (query-based, or "lazy", classification). This allows one to work with datasets with an arbitrary number of observations, which is vital for banks, since historical data is typically large.
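The idea of classifying a query without building the concept lattice can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the thesis's exact algorithm: it uses the similarity operation commonly defined for interval pattern structures (the component-wise convex hull of intervals), and the names `lazy_vote`, `n_iter`, `sample_size` and `alpha` are hypothetical. For each class, the sketch repeatedly draws a random subsample of training objects, intersects their descriptions with the query, and counts the resulting pattern as a premise for that class if it covers at most a share `alpha` of the opposite class.

```python
import random
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]
Pattern = List[Interval]


def to_pattern(row: Sequence[float]) -> Pattern:
    """Describe a numeric object as a vector of degenerate intervals [x, x]."""
    return [(x, x) for x in row]


def meet(p: Pattern, q: Pattern) -> Pattern:
    """Similarity of two interval patterns: component-wise convex hull."""
    return [(min(a[0], b[0]), max(a[1], b[1])) for a, b in zip(p, q)]


def covers(p: Pattern, row: Sequence[float]) -> bool:
    """True if every feature value of the object lies inside the pattern."""
    return all(lo <= x <= hi for (lo, hi), x in zip(p, row))


def lazy_vote(query, positives, negatives,
              n_iter=50, sample_size=2, alpha=0.0, seed=0):
    """Count premises found for each class (hypothetical voting scheme)."""
    rng = random.Random(seed)

    def count_premises(own, other):
        found = 0
        for _ in range(n_iter):
            sample = rng.sample(own, min(sample_size, len(own)))
            pattern = to_pattern(query)
            for row in sample:
                pattern = meet(pattern, to_pattern(row))
            # Objects of the opposite class covered by the pattern falsify it.
            falsified = sum(covers(pattern, r) for r in other)
            if falsified <= alpha * len(other):
                found += 1
        return found

    return count_premises(positives, negatives), count_premises(negatives, positives)


if __name__ == "__main__":
    # Toy data with two numeric features; values are purely illustrative.
    pos = [(1.0, 1.0), (2.0, 2.0), (1.5, 2.5)]
    neg = [(5.0, 5.0), (6.0, 6.0), (5.5, 4.5)]
    print(lazy_vote((1.2, 1.2), pos, neg))
```

Because each query only touches random subsamples of the training data, the cost per query in this sketch is governed by `n_iter` and `sample_size` rather than by the size of a full concept lattice. Each accepted pattern is a set of feature intervals, which is what makes the resulting premises readable as rules.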

The second is that we introduce a modification of the FCA method based on interval pattern structures which allows one to solve the regression problem. To our knowledge, FCA methods have not been applied to this type of data analysis problem before. The crucial difference in the regression problem is that the target variable is continuously distributed.

The goal of this thesis is to provide algorithms for PD and LGD estimation while keeping them interpretable. At the same time, the methods should provide higher accuracy than the basic algorithms widespread in the banking industry (such as logistic regression, decision trees and scorecards). One should also note that PD and LGD are the main drivers of EL, since EAD is modeled only for revolving loans such as credit cards and credit tranches [1].

The work consists of a general overview of data analysis algorithms and mathematical models in banking, definitions of FCA terms, a detailed description of the modifications to FCA and interval pattern structures, the data, benchmarks and experimental results, and an appendix with an overview of the software implementation of the discussed algorithms.

Conclusion of the dissertation on the topic "Mathematical modeling, numerical methods and software complexes", Masyutin, Alexey Alexandrovich

• First and foremost, I want to thank my scientific advisor, Doctor of Science Prof. Sergei O. Kuznetsov. It has been an honor to be his Ph.D. student. He has taught me, both consciously and unconsciously, absolutely new ways to look at common data analysis problems. I appreciate all his contributions of time and ideas to make my Ph.D. experience productive;

• This work would not have been possible without Yury Kashnitsky, who gave me a sound introduction to the formal concept analysis tool set and who became one of my co-authors and a contributor to my work;

• I also thank Alexander Ageev, currently a master's student at HSE, who has helped me a lot with additional data experiments and with his own Python implementation of the algorithms;

• I thank Ivan Medvedev and Evgeny Zinchenko, who became my first boss and mentor in the area of risk management at RCI Banque and who sparked my interest in risk modeling;

• I give special thanks to my current boss Roman Tikhonov, Head of the Validation department at Sberbank, who has given me the opportunity to devote considerable time to the Ph.D. thesis, including academic work and conference attendance;

• Last but not least, I would like to thank my family: my parents, my brother and my sweetheart, for their unconditional support and for their understanding of the lack of attention from my side.

List of references of the dissertation research, Candidate of Sciences Masyutin, Alexey Alexandrovich, 2018

[1] Edelman, D.B. and J.N. Crook. (2002). Credit scoring and its applications. Society for Industrial Mathematics.

[2] Van Gestel, T. and B. Baesens. Credit Risk Management: Oxford University Press.

[3] Hand, D.J. and W.E. Henley. (1997). Statistical classification methods in consumer credit scoring: a review. Journal of the Royal Statistical Society: Series A (Statistics in Society). 160(3), 523-541.

[4] Thomas, L.C. (2000). A survey of credit and behavioural scoring: forecasting financial risk of lending to consumers. International Journal of Forecasting. 16(2), 149-172.

[5] Kamleitner, B. and E. Kirchler. (2007). Consumer credit use: a process model and literature review. Revue Européenne de Psychologie Appliquée/European Review of Applied Psychology. 57(4), 267-283.

[6] Abdou, H.A. and J. Pointon. (2011). Credit scoring, statistical techniques and evaluation criteria: a review of the literature. Intelligent Systems in Accounting, Finance and Management.

[7] Leung, K., et al. (2008). A comparison of variable selection techniques for credit scoring.

[8] Cios, K.J., et al. (1998). Data mining methods for knowledge discovery. Kluwer Academic Publishers.

[9] Bing Liu, et al. (1998), Integrating Classification and Association Rule Mining, KDD-98 Proceedings

[10] X. Wu et al. (2008), Top 10 algorithms in data mining, Springer-Verlag (14), 1-37

[11] Bing Liu, Zhiyuan Chen, (2016) Lifelong Machine Learning, Morgan & Claypool Publishers, 29-53

[12] Tsai, C.F. and M.-L. Chen. (2010). Credit rating by hybrid machine learning techniques. Applied Soft Computing. 10(2), 374-380.

[13] Tan, P.N., M. Steinbach, and V. Kumar. (2006). Introduction to data mining. Pearson Addison Wesley Boston.

[14] West, D., S. Dellana, and J. Qian. (2005). Neural network ensemble strategies for financial decision applications. Computers & Operations Research. 32(10), 2543-2559.

[15] Wang, G., et al. (2011). A comparative assessment of ensemble learning for credit scoring. Expert Systems with Applications. 38(1), 223-230.

[16] Paleologo, G., A. Elisseeff, and G. Antonini. (2010). Subagging for credit scoring models. European Journal of Operational Research. 201(2). 490-499.

[17] Hussein A, A. (2009). Genetic programming for credit scoring: The case of Egyptian public sector banks. Expert Systems with Applications. 36(9), 11402-11417.

[18] Sustersic, M., D. Mramor, and J. Zupan. (2009). Consumer credit scoring models with limited data. Expert Systems with Applications. 36(3, Part 1), 4736-4744.

[19] David, W. (2000). Neural network credit scoring models. Computers & Operations Research. 27(11-12), 1131-1152.

[20] Abdou, H., J. Pointon, and A. El-Masry. (2008). Neural nets versus conventional techniques in credit scoring in Egyptian banking. Expert Systems with Applications. 35(3), 1275-1292.

[21] Arie, B.D. (2008) Rule effectiveness in rule-based systems: A credit scoring case study. Expert Systems with Applications. 34(4), 2783-2788.

[22] Ben-David, A. and E. Frank. (2009). Accuracy of machine learning models versus "hand crafted" expert systems - A credit scoring case study. Expert Systems with Applications. 36(3, Part 1), 5264-5271.

[23] Huang, Y.M., C.M. Hung, and H.C. Jiau. (2006). Evaluation of neural networks and data mining methods on a credit assessment task for class imbalance problem. Nonlinear Analysis: Real World Applications. 7(4), 720-747.

[24] Wang, J., K. Guo, and S. Wang. (2010). Rough set and Tabu search based feature selection for credit scoring. Procedia Computer Science. 1(1), 2425-2432.

[25] Hoffmann, F., et al. (2007). Inferring descriptive and approximate fuzzy rules for credit scoring using evolutionary algorithms. European Journal of Operational Research. 177(1), 540-555.

[26] Lee, T.S. and I.F. Chen. (2005). A two-stage hybrid credit scoring model using artificial neural networks and multivariate adaptive regression splines. Expert Systems with Applications. 28(4), 743-752.

[27] Huang, C.L., M.C. Chen, and C.J. Wang. (2007). Credit scoring with a data mining approach based on support vector machines. Expert Systems with Applications. 33(4), 847-856.

[28] Li, S.T., W. Shiue, and M.-H. Huang. (2006). The evaluation of consumer loans using support vector machines. Expert Systems with Applications. 30(4), 772-782.

[29] Ong, C.S., J.-J. Huang, and G.-H. Tzeng. (2005). Building credit scoring models using genetic programming. Expert Systems with Applications. 29(1), 41-47.

[30] Yingxu, Y. (2007). Adaptive credit scoring with kernel learning methods. European Journal of Operational Research. 183(3), 1521-1536.

[31] Bellotti, T. and J. Crook. (2009). Support vector machines for credit scoring and discovery of significant features. Expert Systems with Applications. 36(2, Part 2), 3302-3308.

[32] Xu, X., C. Zhou, and Z. Wang. (2009). Credit scoring algorithm based on link analysis ranking with support vector machine. Expert Systems with Applications. 36(2, Part 2), 2625-2632.

[33] Luo, S.T., B.-W. Cheng, and C.-H. Hsieh. (2009). Prediction model building with clustering-launched classification and support vector machines in credit scoring. Expert Systems with Applications. 36(4), 7562-7566.

[34] Chen, W., C. Ma, and L. Ma. (2009). Mining the customer credit using hybrid support vector machine technique. Expert Systems with Applications. 36(4), 7611-7616.

[35] Chen, F.L. and F.C. Li. (2010). Combination of feature selection approaches with SVM in credit scoring. Expert Systems with Applications. 37(7), 4902-4909.

[36] Nanni, L. and A. Lumini. (2009). An experimental comparison of ensemble of classifiers for bankruptcy prediction and credit scoring. Expert Systems with Applications. 36(2, Part 2), 3028-3033.

[37] Lee, T.S., et al. (2006). Mining the customer credit using classification and regression tree and multivariate adaptive regression splines. Computational Statistics & Data Analysis. 50(4), 1113-1130.

[38] Martens, D., et al. (2007). Comprehensible credit scoring models using rule extraction from support vector machines. European Journal of Operational Research. 183(3), 1466-1476.

[39] Steven, F. (2009). Are we modelling the right thing? The impact of incorrect problem specification in credit scoring. Expert Systems with Applications. 36(5), 9065-9071.

[40] Ping, Y. and L. Yongheng. (2011). Neighborhood rough set and SVM based hybrid credit scoring classifier. Expert Systems with Applications. 38(9), 11300-11304.

[41] Hens, A.B. and M.K. Tiwari. (2012). Computational time reduction for credit scoring: An integrated approach based on support vector machine and stratified sampling method. Expert Systems with Applications.

[42] Wang, J., et al. (2012). Rough set and scatter search metaheuristic based feature selection for credit scoring. Expert Systems with Applications.

[43] Yap, B.W., S.H. Ong, and N.H.M. Husain. (2011). Using data mining to improve assessment of credit worthiness via credit scoring models. Expert Systems with Applications. 38(10), 13274-13283.

[44] Brown, I. and C. Mues. (2012). An experimental comparison of classification algorithms for imbalanced credit scoring data sets. Expert Systems with Applications. 39(3), 3446-3453.

[45] Crone, S.F. and S. Finlay. (2012). Instance sampling in credit scoring: An empirical study of sample size and balancing. International Journal of Forecasting. 28(1), 224-238.

[46] Lee, T.-S., et al. (2002). Credit scoring using the hybrid neural discriminant technique. Expert Systems with Applications. 23(3), 245-254.

[47] Wang, G., et al. (2012). Two credit scoring models based on dual strategy ensemble trees. Knowledge-Based Systems. 26(0), 61-68.

[48] Sohn, S.Y. and J.W. Kim. (2012). Decision tree-based technology credit scoring for start-up firms: Korean case. Expert Systems with Applications. 39(4), 4007-4012.

[49] Thomas, L.C. (2009). Consumer credit models: pricing, profit, and portfolios. Oxford University Press, USA.

[50] Vapnik, V.N. (2000). The nature of statistical learning theory. Springer Verlag.

[51] Bernhard Ganter and Sergei Kuznetsov, "Pattern structures and their projections," in Conceptual Structures: Broadening the Base, Harry Delugach and Gerd Stumme, Eds., vol. 2120 of Lecture Notes in Computer Science, pp. 129-142. Springer, Berlin/Heidelberg, 2001.

[52] Ganter, B., Wille, R.: Formal concept analysis: Mathematical foundations. Springer, Berlin, 1999.

[53] Sergei O. Kuznetsov, "Scalable knowledge discovery in complex data with pattern structures.," in PReMI, Pradipta Maji, Ashish Ghosh, M. Narasimha Murty, Kuntal Ghosh, and Sankar K. Pal, Eds. 2013, vol. 8251 of Lecture Notes in Computer Science, pp. 30-39, Springer.

[54] Thomas L., Edelman D., Crook J. (2002) Credit Scoring and Its Applications, Monographs on Mathematical Modeling and Computation, SIAM: Philadelphia, pp. 107-117

[55] Biggs, D., Ville, B., and Suen, E. (1991). A Method of Choosing Multiway Partitions for Classification and Decision Trees. Journal of Applied Statistics, 18, 1, 49-62.

[56] Naeem Siddiqi, Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring, Wiley, ISBN: 978-0-471-75451-0, 2005

[57] B Baesens, T Van Gestel, S Viaene, M Stepanova, J Suykens, Benchmarking state-of-the-art classification algorithms for credit scoring, Journal of the Operational Research Society 54 (6), 627-635, 2003

[58] Ghodselahi A., A Hybrid Support Vector Machine Ensemble Model for Credit Scoring, International Journal of Computer Applications (0975 - 8887), Volume 17- No.5, March 2011

[59] Yu, L., Wang, S. and Lai, K. K. 2009. An intelligent agent-based fuzzy group decision making model for financial multicriteria decision support: the case of credit scoring. European journal of operational research. vol. 195. pp.942-959.

[60] Gestel, T. V., Baesens, B., Suykens, J. A., Van den Poel, D., Baestaens, D.-E. and Willekens, B. 2006. Bayesian kernel based classification for financial distress detection. European journal of operational research. vol. 172. pp. 979-1003.

[61] P. Ravi Kumar and V. Ravi, "Bankruptcy Prediction in Banks and Firms via Statistical and Intelligent Techniques-A Review," European Journal of Operational Research, Vol. 180, No. 1, 2007, pp. 1-28.

[62] Sergei O. Kuznetsov and Mikhail V. Samokhin, "Learning closed sets of labeled graphs for chemical applications," in ILP, Stefan Kramer and Bernhard Pfahringer, Eds. 2005, vol. 3625 of Lecture Notes in Computer Science, pp. 190-208, Springer

[63] SAS Institute Inc. (2012), Developing Credit Scorecards Using Credit Scoring for SAS® Enterprise Miner™ 12.1, Cary, NC: SAS Institute Inc.

[64] Hocking, R. R. (1976) "The Analysis and Selection of Variables in Linear Regression," Biometrics, 32.

[65] Mehdi Kaytoue, Sergei O. Kuznetsov, Amedeo Napoli, and Sebastien Duplessis, "Mining gene expression data with pattern structures in formal concept analysis," Information Sciences, vol. 181, no. 10, pp. 1989-2001, 2011.

[66] Veloso, A. & Jr., W. M. (2011), Demand-Driven Associative Classification., Springer.

[67] Biggs, D., Ville, B., and Suen, E. A Method of Choosing Multiway Partitions for Classification and Decision Trees. Journal of Applied Statistics, 18, 1, 49-62, 1991

[68] Bonferroni, C. E. Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze 8, pp. 3-62, 1936

[69] Naeem Siddiqi, Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring, Wiley, ISBN: 978-0-471-75451-0, 2005

[70] Thomas, L. C., Edelman, D. B., Crook, J. N. Credit Scoring and Its Applications. Philadelphia: SIAM, 2002. 250 p.

[71] B Baesens, T Van Gestel, S Viaene, M Stepanova, J Suykens, Benchmarking state-of-the-art classification algorithms for credit scoring, Journal of the Operational Research Society 54 (6), 627-635, 2003

[72] Silvia, F., Cribari-Neto F. Beta Regression for Modelling Rates and Proportions // Journal of Applied Statistics. 2004. Vol. 31, Issue 7. p. 799-815.

[73] Yu, L., Wang, S. and Lai, K. K. 2009. An intelligent agent-based fuzzy group decision making model for financial multicriteria decision support: the case of credit scoring. European journal of operational research. vol. 195. pp.942-959.

[74] Gestel, T. V., Baesens, B., Suykens, J. A., Van den Poel, D., Baestaens, D.-E. and Willekens, B. 2006. Bayesian kernel based classification for financial distress detection. European journal of operational research. vol. 172. pp. 979-1003.

[75] P. Ravi Kumar and V. Ravi, "Bankruptcy Prediction in Banks and Firms via Statistical and Intelligent Techniques-A Review," European Journal of Operational Research, Vol. 180, No. 1, 2007, pp. 1-28.

[76] Ganter, B. and Wille, R. Formal Concept Analysis: Mathematical Foundations // Springer-Verlag New York, Inc. 1997.

[77] B. Ganter, and S. O. Kuznetsov. Pattern Structures and Their Projections, Conceptual Structures: Broadening the Base, Lecture Notes in Computer Science, Springer, Berlin/Heidelberg. 2001 Vol. 2120. p. 129-142.

[78] Kuznetsov S.O. Scalable Knowledge Discovery in Complex Data with Pattern Structures // PReMI, Lecture Notes in Computer Science, Springer. 2013 Vol. 8251. p 30-39.

[79] Grigoriev P., Sword-systems or JSM-systems for chains, using statistical considerations of STI. Series 2, 1996

[80] Kuznetsov S.O. Fitting Pattern Structures to Knowledge Discovery in Big Data // Lecture Notes in Computer Science, Springer. 2013 Vol. 7880. p 254-266.

[81] M. Kaytoue, S. Duplessis, S. O. Kuznetsov, and A. Napoli. Mining Gene Expression Data with Pattern Structures in Formal Concept Analysis. // Information Sciences. Spec.Iss.: Lattices. 2011.

[82] David W. Aha (Ed.). Lazy Learning. Kluwer Academic Publishers, Norwell, MA, USA. 1997.

[83] X. Li and Y. Zhong. An Overview of Personal Credit Scoring: Techniques and Future Work // International Journal of Intelligence Science. 2012 Vol. 2, Issue 4A. p. 182-189.

[84] Masyutin A., Kashnitsky Y., Kuznetsov S. O. Lazy Classification with Interval Pattern Structures: Application to Credit Scoring, in: Proceedings of the International Workshop "What can FCA do for Artificial Intelligence?" (FCA4AI at IJCAI 2015) / Ed.: Sergei O. Kuznetsov, A. Napoli, S. Rudolph. Buenos Aires, 2015. P. 43-54.

[85] Kaytoue, M., Kuznetsov S.O., Napoli, A.: Revisiting numerical pattern mining with formal concept analysis. In: IJCAI, pp. 1342-1347 (2011).

[86] https://www.investopedia.com/slide-show/top-bank-failures/

[87] http://www.banki.ru/news/lenta/?id=10257511

[88] http://www.bbc.com/russian/news-42363290

[89] https://www.bis.org/publ/bcbsca.htm