An Introduction To Programming With Processing

6. Clustering and Data Mining

Data mining is the process of identifying patterns within the data you have acquired. Its purpose is to express the relationships between the various aspects of your data in a more mathematical form that can ultimately be used programmatically (in the program you are developing). Data mining can be done manually or it can be automated. When dealing with a small data set with a high ratio of inconsistent data types, manual data mining can be more effective and save you some time.
For example, let's take a hypothetical situation where the owner of a small café has asked you to determine which items a customer is likely to purchase together. The data set you have acquired consists of all the items customers have purchased in the store over the past year. From that data you could determine that goods can be divided into food, drinks, magazines, stationery, etc., and then into even smaller groups such as fruit, vegetables, soft drinks, etc. Identifying these groups is the first step of data mining, a process also known as clustering. If the café offers a large variety of items, the clusters making up the data could be numerous, yet the data set as a whole is actually quite small, consisting of only a single year of purchases.
Now suppose a more established café that has been operating for several years poses the same question to you. With the small café, a given grouping of purchased items has a lower probability of repeating within the shorter period its data covers. The more established café, whose data covers a longer period, has a better chance of the same groups of items being purchased together more than once. In the former case, manually mining the data set could be more effective because of the lower probability of the same items being purchased together in a relatively short space of time.
In this scenario, most of your time would be spent on clustering and populating the resulting groups, after eliminating the majority of the items purchased in that year because they do not fall into any cluster. In the case of the established café, however, although there may be just as many clusters, the values that populate them have a higher probability of being repeated. It might therefore be more efficient to have a computer program count, cluster and mine the data. Regardless of whether you mine your data manually or have a software program do the work for you, you should have a set of data at the end of the process that can be manipulated programmatically.
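
To make the automated option more concrete, here is a minimal sketch of how a program might count which items are purchased together. It is written in plain Java, the language Processing is built on, and it is only one possible approach. The class name PairCounter, the method countPairs and the sample purchases are all hypothetical, invented for illustration, and are not taken from any real café data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: counts how often two items appear in the same purchase.
public class PairCounter {

    // Returns a map from "itemA + itemB" to the number of purchases containing both.
    static Map<String, Integer> countPairs(List<List<String>> purchases) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> purchase : purchases) {
            // Sort and de-duplicate the items so that "coffee, muffin"
            // and "muffin, coffee" produce the same key.
            List<String> items = new ArrayList<>(new TreeSet<>(purchase));
            for (int i = 0; i < items.size(); i++) {
                for (int j = i + 1; j < items.size(); j++) {
                    String key = items.get(i) + " + " + items.get(j);
                    counts.merge(key, 1, Integer::sum); // add 1, or start at 1
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample data: each inner list is one customer's purchase.
        List<List<String>> purchases = Arrays.asList(
            Arrays.asList("coffee", "muffin"),
            Arrays.asList("muffin", "coffee", "magazine"),
            Arrays.asList("tea", "magazine"));

        for (Map.Entry<String, Integer> entry : countPairs(purchases).entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

In this sample only the coffee and muffin pair occurs more than once, which mirrors the point made above: with little data most combinations never repeat, so an automated count pays off mainly when the data set covers a longer period.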

Clustering Raw Data

The process of getting external data into a program, via clustering and data mining.