Traditional statistical methods deal with corroborating given hypotheses on a given body of data. However, generating the hypothesis itself is a matter of intuition and ingenuity. It is clearly impossible to test all hypotheses on a database with millions of records and hundreds of fields.
There have been attempts to bridge this gap through data mining. Association generation is a method of creating such statistical hypotheses for binary data. For quantitative databases the situation is less satisfactory. Several methods are known. One reduces the data to binary form by dividing each field into intervals and then generating associations; this method is computationally expensive. Another generates only those associations that are statistically interesting. It has been tried only on small databases and applies only to binary relations, e.g., in certain ranges of field X, field Y lies significantly outside its average.
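The interval-based reduction mentioned above can be illustrated with a minimal sketch. It is not taken from any of the cited methods: the equal-width binning, the function names `to_interval_items` and `pair_rules`, and the support and confidence thresholds are all assumptions chosen for illustration. Each numeric field is cut into intervals, each (field, interval) pair becomes a binary item, and two-item rules are kept when they pass both thresholds.

```python
from itertools import combinations


def to_interval_items(records, fields, n_bins=3):
    """Map each numeric field to a binary item (field, bin_index).

    Equal-width bins are an assumption made for this sketch.
    """
    bounds = {}
    for f in fields:
        values = [r[f] for r in records]
        lo, hi = min(values), max(values)
        width = (hi - lo) / n_bins or 1.0  # constant field: avoid zero width
        bounds[f] = (lo, width)

    transactions = []
    for r in records:
        items = set()
        for f in fields:
            lo, width = bounds[f]
            # Clamp so the maximum value falls in the last bin.
            idx = min(int((r[f] - lo) / width), n_bins - 1)
            items.add((f, idx))
        transactions.append(items)
    return transactions


def pair_rules(transactions, min_support=0.3, min_confidence=0.8):
    """Return rules (lhs, rhs, support, confidence) over pairs of items."""
    n = len(transactions)
    item_count, pair_count = {}, {}
    for t in transactions:
        for item in t:
            item_count[item] = item_count.get(item, 0) + 1
        for pair in combinations(sorted(t), 2):
            pair_count[pair] = pair_count.get(pair, 0) + 1

    rules = []
    for (a, b), c in pair_count.items():
        support = c / n
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = c / item_count[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules
```

Even in this toy form the cost is visible: the number of candidate item pairs grows with the square of (fields times intervals), which is consistent with the remark that the approach is computationally expensive.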
We suggest a method that addresses some of the problems with current techniques. Our idea is to use visualization techniques and image processing ideas to rank subsets of fields according to the relation between them in the database. This ranking suggests the hypotheses to be statistically investigated.
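To make the idea of ranking field subsets concrete, here is a rough sketch restricted to pairs of fields. It is not the method of the talk, whose details are not given here. As a stand-in, each pair of fields is rendered as a small two-dimensional histogram (a grey-level image), and the image is scored by the mutual information of its bin counts. The names `histogram_image`, `mutual_information` and `rank_field_pairs`, the equal-width binning, and the choice of score are all assumptions.

```python
import math
from itertools import combinations


def histogram_image(xs, ys, n_bins=8):
    """Bin two fields into an n_bins x n_bins grid of counts."""
    def to_bins(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / n_bins or 1.0  # constant field: avoid zero width
        return [min(int((v - lo) / width), n_bins - 1) for v in values]

    image = [[0] * n_bins for _ in range(n_bins)]
    for i, j in zip(to_bins(xs), to_bins(ys)):
        image[i][j] += 1
    return image


def mutual_information(image):
    """Mutual information (in nats) of the joint distribution in the image.

    Used here as a hypothetical stand-in for an image-based relation score.
    """
    total = sum(map(sum, image))
    row = [sum(r) for r in image]
    col = [sum(c) for c in zip(*image)]
    mi = 0.0
    for i, r in enumerate(image):
        for j, c in enumerate(r):
            if c:
                mi += (c / total) * math.log(c * total / (row[i] * col[j]))
    return mi


def rank_field_pairs(table, n_bins=8):
    """Rank all pairs of fields in `table` (field name -> list of values)."""
    scores = [
        (mutual_information(histogram_image(table[a], table[b], n_bins)), a, b)
        for a, b in combinations(table, 2)
    ]
    return sorted(scores, reverse=True)
```

The highest-ranked pairs would then be the candidates for a conventional statistical test. Any other score computed from the image could be substituted for `mutual_information` without changing the rest of the sketch.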
Our method has the following advantages:
In this talk we present an algorithmic methodology and the results of applying it to the Census Bureau databases cpsm93p and nhis93ac.
(Joint work with A. Amir and N. Netanyahu)