Benchmark Study for the University of California Irvine (UCI) Machine Learning Repository:
Prediction of persons with more than $50K income based on U.S. Census Bureau data.

Many researchers who analyze real data sets, e.g. in medicine or security, need algorithms and knowledge-extraction tools with an intuitive, understandable interface. Many research algorithms and methods that provide good classification and prediction fail to convey their results to people with little experience in data mining techniques. The same applies to prediction problems and to validating the outcome of any data mining endeavor. Many tools have been developed for this purpose, such as regression, discriminant analysis, decision trees and Bayes classifiers, but they all require a definite and full set of true answers to understand the results.

We offer data classification and prediction tools based on the feature-extraction neural nets developed by Kohonen, the Self-Organizing Feature Map (SOM). Look at the potential of the approach and see what SOM can extract for you from a minimal available data set. Is it possible to learn about income without known income, using only census data? The answer is YES. Let us show you the possibilities of SOM in the following example.

One of the main tasks in business planning is selecting a target group from commonly available information: for example, using U.S. Census Bureau data (age, class of worker, education, marital status, occupation code, race, sex, capital gains, capital losses, working hours per week and native country) to select persons with "adjusted gross income" above $50K. We have used data sets from the Current Population Survey (CPS) database, provided by the U.S. Census Bureau and posted on the University of California Irvine (UCI) Repository, to predict whether a person's income is over or under $50K. The data is publicly available and free of charge.

The first data set ("adult data set") was extracted from 1994 CPS data.
The 48,842 instances were divided into two files: a training file and a testing file. Fourteen attributes were chosen, eight categorical and six continuous: age, work class, weight, education, years of education, marital status, occupation, relationship, race, sex, capital gain, capital loss, hours per week and native country. The six continuous attributes were quantized into quintiles before running the algorithm.

The second data set ("Census Income Database"), with 1,999,523 instances, was extracted from 1994 and 1995 CPS data and contains 41 demographic and economic variables. These include most, but not all, of the 14 attributes in the adult data set. Another difference between the two data sets is the decision variable provided for the classification problem: for the adult data set the decision was drawn from the "adjusted gross income", whereas the census income database uses the "total personal income".

The SOM tool organizes the set of records into classes ordinated on a plane in such a manner that persons with similar records are mapped into neighboring classes on the ordination plane.

Let us consider the distribution of age values on the ordination plane. The color gradient spans from green (young) to red (declining years):
[Figure: age distribution on the ordination plane. Legend: young, middle-aged, declining years]
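The quantization of a continuous attribute into quintiles, mentioned above, can be sketched as follows. This is a minimal illustration assuming NumPy; the function name `quintile_codes` and the sample ages are made up for the example, and the exact binning rule used in the study is not stated.

```python
import numpy as np

def quintile_codes(values):
    """Map each value to a quintile index 0..4 (assumed binning rule)."""
    values = np.asarray(values, dtype=float)
    # Inner cut points that split the sample into five roughly equal groups.
    edges = np.percentile(values, [20, 40, 60, 80])
    # side="right": a value equal to a cut point goes to the upper bin.
    return np.searchsorted(edges, values, side="right")

# Made-up ages, for illustration only (not taken from the census files).
ages = [19, 23, 28, 31, 35, 38, 42, 47, 53, 61]
codes = quintile_codes(ages)
print(codes)
```

For attributes with many identical values, several cut points may coincide, so fewer than five distinct codes can appear.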
Another picture gives an idea of how the different classes of worker are distributed. The color gradient ranges from green (no persons in this class of worker) to red (only persons in this class of worker):
[Figure: three maps, one each for local government workers, state government workers and federal government workers. Legend: low proportion, considerable proportion, high proportion]
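The kind of mapping behind these pictures can be sketched with a small self-organizing map. This is a generic textbook SOM written with NumPy, not the implementation of the tool used in the study; the grid size, the linear decay schedules, the function names `train_som` and `map_records`, and the random input data are all assumptions made for illustration. The sketch assumes every attribute has already been scaled to the range 0 to 1.

```python
import numpy as np

def train_som(data, rows=6, cols=6, epochs=10, lr0=0.5, sigma0=3.0, seed=0):
    """Train a small rectangular SOM; returns weights of shape (rows, cols, dim)."""
    rng = np.random.default_rng(seed)
    n, dim = data.shape
    weights = rng.random((rows, cols, dim))  # random start in [0, 1)
    # grid[r, c] = (r, c): position of each node on the ordination plane.
    grid = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    ).astype(float)
    total_steps = epochs * n
    step = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)              # learning rate decays linearly
            sigma = sigma0 * (1.0 - frac) + 0.5  # neighborhood radius shrinks to 0.5
            x = data[i]
            # Best-matching unit: the node whose weight vector is closest to x.
            dist = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(np.argmin(dist), dist.shape)
            # Gaussian neighborhood measured on the grid, centered on the BMU.
            grid_d2 = np.sum((grid - np.asarray(bmu, dtype=float)) ** 2, axis=-1)
            h = np.exp(-grid_d2 / (2.0 * sigma ** 2))
            # Pull the BMU and its grid neighbors toward the record.
            weights += lr * h[..., None] * (x - weights)
            step += 1
    return weights

def map_records(weights, data):
    """Return the (row, col) node of the best-matching unit for each record."""
    rows, cols, dim = weights.shape
    flat = weights.reshape(-1, dim)
    d = np.linalg.norm(data[:, None, :] - flat[None, :, :], axis=-1)
    winners = np.argmin(d, axis=1)
    return np.stack(np.unravel_index(winners, (rows, cols)), axis=1)

# Synthetic records with three attributes in [0, 1), for illustration only.
demo_rng = np.random.default_rng(1)
data = demo_rng.random((200, 3))
weights = train_som(data)
nodes = map_records(weights, data)
print(nodes[:5])
```

Because neighboring nodes are updated together, records with similar attribute values tend to end up in nearby nodes, which is what produces the smooth color patterns. Note that no income values enter the training.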
One can see some regularity in how the mapping changes from local to federal government workers. The education level shows a clearer ordination. The color gradient ranges from green (1 year spent on education) to red (16 years spent on education):
[Figure: education level on the ordination plane. Legend: not educated, secondary education, doctorate]
This picture shows that the level of education forms a solid gradient on the map, from the upper left-hand corner (uneducated persons) to the lower right-hand corner (highly educated persons). Let us now look at the connection between the selected features (positions on the ordination plane) and income. The resulting mapping was calibrated, and each node received the probability of income above $50K for the persons attributed to that node. The color legend is dark green where the probability lies below 5%, green where it lies between 5% and 75%, light red where it lies between 75% and 90%, and red where it lies above 90%.
[Figure: probability of income above $50K on the ordination plane. Legend: < 5%, 5% - 75%, 75% - 90% and > 90% of high income]
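The calibration step described above can be sketched as follows, assuming that each record has already been assigned to a node and that nodes are numbered from 0. The function names, the sample assignments and the handling of values that fall exactly on a legend boundary are assumptions made for this illustration; the study does not specify them.

```python
import numpy as np

def calibrate_nodes(node_ids, high_income, n_nodes):
    """Share of high-income records per node; NaN for nodes with no records."""
    node_ids = np.asarray(node_ids)
    high_income = np.asarray(high_income, dtype=float)
    count = np.bincount(node_ids, minlength=n_nodes)
    hits = np.bincount(node_ids, weights=high_income, minlength=n_nodes)
    prob = np.full(n_nodes, np.nan)
    seen = count > 0
    prob[seen] = hits[seen] / count[seen]
    return prob

def legend_class(prob):
    """0: below 5%, 1: 5% to 75%, 2: 75% to 90%, 3: 90% and above, -1: empty node.

    A value exactly on a boundary is placed in the upper class (an assumption).
    """
    prob = np.asarray(prob, dtype=float)
    classes = np.digitize(prob, [0.05, 0.75, 0.90])
    return np.where(np.isnan(prob), -1, classes)

# Made-up node assignments and income flags (1 = above $50K), for illustration only.
node_ids = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
high_income = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
prob = calibrate_nodes(node_ids, high_income, n_nodes=4)
classes = legend_class(prob)
print(prob, classes)
```

A new record can then be scored by finding its node on the map and reading off that node's probability.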
We can conclude that a high probability of income above $50K correlates directly with the level of education, but that there are more complex dependencies on age and class of worker. To classify persons by target group, a probability cut level was defined. A cut level near 0.5 gives the maximum total accuracy (82%) on the testing set, but selects only 50% of the target persons; the balanced cut level of about 0.3 gives equal prediction accuracy for both classes and selects 75% of the desired records.

The results obtained with the SOM algorithm can be compared with those of other methods reported in the UCI Repository (http://www.ics.uci.edu/~mlearn/MLSummary.html); see the table below. Taking into account that the prediction was done without an answer set, the total error of 18% obtained with the SOM algorithm seems satisfactory in comparison with other classifiers used for the same purpose. The tools also make it possible to control the outcome according to the specific aims of the investigation. For example, the relative error can be chosen depending on what is more important: not missing a person in the above-$50K income bracket, or not increasing the size of the group selected as the target.

| Method used | Error (%) | Method used | Error (%) |
| FSS Naive Bayes | 14.05 | Voted ID3 (0.6) | 15.64 |
| NBTree | 14.10 | CN2 | 16.00 |
| C4.5-auto | 14.46 | Naive-Bayes | 16.12 |
| IDTM (Decision table) | 14.46 | Voted ID3 (0.8) | 16.47 |
| HOODG | 14.82 | T2 | 16.84 |
| C4.5 rules | 14.94 | 1R | 19.54 |
| OC1 | 15.04 | Nearest-neighbor (3) | 20.35 |
| C4.5 | 15.54 | Nearest-neighbor (1) | 21.42 |
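The effect of the probability cut level can be sketched as follows. The function `evaluate_cut` and the eight sample records are hypothetical and are not the study's data; they only show how total accuracy, the share of target persons selected and the accuracy on the remaining persons move as the cut level changes.

```python
import numpy as np

def evaluate_cut(prob, high_income, cut):
    """Return (total accuracy, share of target persons selected,
    share of other persons correctly left out) for one cut level."""
    prob = np.asarray(prob, dtype=float)
    y = np.asarray(high_income, dtype=bool)
    selected = prob >= cut
    total_accuracy = np.mean(selected == y)
    target_selected = np.mean(selected[y])   # true high-income persons found
    other_accuracy = np.mean(~selected[~y])  # remaining persons not selected
    return total_accuracy, target_selected, other_accuracy

# Made-up node probabilities and true income flags, for illustration only.
prob = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.80, 0.95]
y = [0, 0, 0, 1, 0, 1, 1, 1]

at_050 = evaluate_cut(prob, y, 0.5)
at_030 = evaluate_cut(prob, y, 0.3)
print(at_050, at_030)
```

In this toy sample, lowering the cut level from 0.5 to 0.3 finds every target person but also admits one person who is not in the target group, which is the trade-off described above.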
Novel possibilities are available for processing the answers. We used <data base field transform> to recode them for the high income probability. The resulting probability distribution of income above $50K for the persons attributed to each node is shown below. As before, the color legend is dark green where the probability lies below 5%, green where it lies between 5% and 75%, light red where it lies between 75% and 90%, and red where it lies above 90%.
[Figure: recoded probability of income above $50K on the ordination plane. Legend: < 5%, 5% - 75%, 75% - 90% and > 90% of high income]
We have now obtained more detailed information, e.g. classes populated with wealthy people (more than 95% of them have an income above $50K). The total accuracy of the prediction has increased to 85%. If the records with <data permit> are used, the average influence of each variable on the probability can be shown, as has been demonstrated for the predicted probabilities. The influences attributed to working class, education and hours-per-week only are shown below. The color gradient runs from green (low probability predicted by this variable) to red (high probability predicted by this variable).
[Figure: three maps, one each for working class, education and hours-per-week. Legend, high income probability predicted: low, average, high]
The patterns shown can be used to correlate a person's income with specific variables and to formulate hypotheses to be tested with the usual statistical tools. For more information on this study, please contact Anatoly Saveliev:
info(at)itcsoftware.com