![](/uploads/1/2/7/5/127563340/174437089.jpg)
![German Credit Data Set Arff Fire German Credit Data Set Arff Fire](/uploads/1/2/7/5/127563340/557217285.png)
Auto-WEKA: Sample Datasets

Below are some sample datasets that have been used with Auto-WEKA. Each zip has two files, test.arff and train.arff, in WEKA's native format.
Data mining is a critical step in knowledge discovery, involving theories, methodologies, and tools for revealing patterns in data. It is important to understand the rationale behind the methods so that the tools and methods chosen fit both the data and the objective of pattern recognition. Several tools may be available for a given dataset.

When a bank receives a loan application, it has to decide, based on the applicant’s profile, whether or not to approve the loan. Two types of risk are associated with the bank’s decision. If the applicant is a good credit risk, i.e. is likely to repay the loan, then not approving the loan results in a loss of business to the bank. If the applicant is a bad credit risk, i.e. is not likely to repay the loan, then approving the loan results in a financial loss to the bank.

Objective of Analysis: Minimization of risk and maximization of profit on behalf of the bank.

To minimize loss from the bank’s perspective, the bank needs a decision rule regarding whom to approve for a loan and whom to refuse.
An applicant’s demographic and socio-economic profiles are considered by loan managers before a decision is taken regarding his/her loan application.

The German Credit Data contains data on 20 variables, together with the classification of each of 1000 loan applicants as a Good or a Bad credit risk.

Before getting into any sophisticated analysis, the first step is to do an exploratory data analysis (EDA) and data cleaning. Since the number of predictors in this problem is not very high, it is possible to look into the dependency of the response (Creditability) on each of them individually. The following table summarizes the chi-square p-values for each contingency table.
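The per-predictor chi-square screening described above might be sketched in R as follows. This is only an illustration on synthetic stand-in data: the data frame `demo` and the predictor columns `Purpose` and `Housing` are assumptions made for the example, not the actual German Credit variables, so the p-values it prints say nothing about the real dataset.

```r
set.seed(42)  # arbitrary seed, only for reproducibility of the sketch

# Synthetic stand-in data (NOT the German Credit data): 1000 applicants,
# roughly 70% labelled "Good", with two hypothetical nominal predictors
n <- 1000
demo <- data.frame(
  Creditability = factor(rbinom(n, 1, 0.7), labels = c("Bad", "Good")),
  Purpose       = factor(sample(c("car", "furniture", "education"), n, replace = TRUE)),
  Housing       = factor(sample(c("rent", "own", "free"), n, replace = TRUE))
)

predictors <- setdiff(names(demo), "Creditability")

# One contingency table and one chi-square test of independence per predictor
p_values <- sapply(predictors, function(v) {
  tab <- table(demo[[v]], demo$Creditability)
  chisq.test(tab)$p.value
})

print(round(p_values, 4))
```

With the real data, predictors with small p-values would be the candidates to carry forward into the models.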
Note that among the sample of size 1000, 700 were Creditable and 300 Non-Creditable. This classification is based on the Bank’s opinion on the actual applicants.

Only significant predictors are to be included in the logistic regression model. Since there are 1000 observations, a 50:50 cross-validation scheme (a single random split into equal training and test halves) is tried.

Model Building with 50:50 Cross-validation

Sample R code for 50:50 cross-validation data creation:

`indexes = sample(1:nrow(German.Credit), size=0.5*nrow(German.Credit)) # Random sample of 50% of row numbers created`

`Train50 = … 0.5)`

`Threshold50i…`
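A minimal sketch of how the 50:50 split and a thresholded logistic regression could fit together is given below. It runs on synthetic stand-in data rather than on `German.Credit`. The data frame `credit_demo`, the predictors `Duration` and `Amount`, and the names `Test50`, `fit50`, `prob50`, `pred50` and `conf50` are all assumptions for illustration. Only `indexes`, `Train50` and the 0.5 cut-off echo the fragments above.

```r
set.seed(1)  # arbitrary seed so the split is reproducible

# Synthetic stand-in data (NOT the German Credit data)
n <- 1000
credit_demo <- data.frame(
  Creditability = rbinom(n, 1, 0.7),                # 1 = good, 0 = bad
  Duration      = sample(4:72, n, replace = TRUE),  # hypothetical predictor
  Amount        = round(runif(n, 250, 18000))       # hypothetical predictor
)

# Random sample of 50% of the row numbers
indexes <- sample(1:nrow(credit_demo), size = 0.5 * nrow(credit_demo))
Train50 <- credit_demo[indexes, ]
Test50  <- credit_demo[-indexes, ]

# Logistic regression fitted on the training half
fit50 <- glm(Creditability ~ Duration + Amount, data = Train50, family = binomial)

# Predicted probabilities on the test half, classified at a 0.5 threshold
prob50 <- predict(fit50, newdata = Test50, type = "response")
pred50 <- ifelse(prob50 > 0.5, 1, 0)

# Predicted vs. actual classes on the held-out half
conf50 <- table(Predicted = pred50, Actual = Test50$Creditability)
print(conf50)
```

Because the synthetic response is generated independently of the predictors, the resulting table is uninformative; the sketch only shows the mechanics of the split, fit and threshold steps.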
For discriminant analysis, not all the predictors are used. Only the continuous and ordinal variables are used: for the nominal variables there is no concept of group means, and the linear discriminants would be difficult to interpret. The predictors are assumed to have a multivariate normal distribution.

Sample R code for Discriminant Analysis: `library(MASS)`, `ldafit…`

Sample R code for Random Forest: `library(randomForest)`, `rf50…`

Ultimately these statistical decisions must be translated into profit considerations for the bank. Let us assume that a correct decision by the bank would result in a 35% profit at the end of 5 years.
A correct decision here means that the bank predicts an application to be good, or creditworthy, and it actually turns out to be creditworthy. When the opposite is true, i.e. the bank predicts the application to be good but it turns out to be bad credit, then the loss is 100%. If the bank predicts an application to be non-creditworthy, then the loan facility is not extended to that applicant and the bank does not incur any loss (opportunity loss is not considered here). The cost matrix, therefore, is as follows:

| | Actual Good | Actual Bad |
|---|---|---|
| Predicted Good | +0.35 | −1.00 |
| Predicted Bad | 0 | 0 |

Out of 1000 applicants, 70% are creditworthy. A loan manager without any model, who approves every application, would incur 0.7 × 0.35 + 0.3 × (−1) = −0.055, i.e. a loss of 0.055 per unit lent. If the average loan amount is 3200 DM (approximately), then the per applicant loss is 0.055 × 3200 = 176 DM and the total loss over 1000 applicants is 176,000 DM.

Logistic regression model performance:

Tree-based classification and random forest show a per-unit profit; the other methods do not perform well.
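The baseline arithmetic can be checked with a few lines of R. This is a sketch under the assumptions stated in the text (35% profit on a good loan, 100% loss on a bad one, 70% good applicants, an average loan of about 3200 DM). The variable names and the helper `unit_profit()` are illustrative and not part of the original analysis.

```r
# Cost/profit assumptions taken from the text
p_good       <- 0.70   # share of creditworthy applicants
profit_good  <- 0.35   # profit per unit lent to a good applicant
loss_bad     <- -1.00  # loss per unit lent to a bad applicant
avg_loan     <- 3200   # approximate average loan amount in DM
n_applicants <- 1000

# Baseline: no model, every application is approved
baseline_unit          <- p_good * profit_good + (1 - p_good) * loss_bad
baseline_per_applicant <- baseline_unit * avg_loan
baseline_total         <- baseline_per_applicant * n_applicants

# Hypothetical helper: per-applicant unit profit of a classifier, given how many
# applicants it predicted good that were actually good / actually bad
# (applicants predicted bad contribute 0 under the cost matrix)
unit_profit <- function(pred_good_actual_good, pred_good_actual_bad, n) {
  (pred_good_actual_good * profit_good + pred_good_actual_bad * loss_bad) / n
}

print(baseline_unit)
print(baseline_per_applicant)
print(baseline_total)
```

Calling `unit_profit(700, 300, 1000)` reproduces the approve-everyone baseline. Passing the counts from a model's confusion matrix on the test half gives that model's per-unit profit for comparison.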