Statistical Hypothesis testing does an OK job at avoiding proving the presence of effects, but it does a mediocre job (or worse) at disproving them. There are a lot of reasons for this, poor training among them, but it is largely systemic. I spent my Thanksgiving morning watching the “Vanishing of the Bees,” and my mind kept drifting to thoughts of Type II error. I know. I can grasp the obvious … maybe I need a break.

I don’t have any biological expertise in evaluating, in detail, the research on either side of the fascinating Colony Collapse Disorder debate, but I am always suspicious of negative findings of any kind unless I can read the research. In the case of this documentary, they claim (a claim that is perhaps biased) that pesticides were determined to be safe after administering a fairly large dose to an adult bee, and determining that the adult bee did not die during the research period. Was that enough? I can’t speak to the biology/ecology research, but it got me thinking about Type II.

We know well the magnitude of the risk we face in committing Type I, and it is trained into us to the point of obsession. When meeting analysts wearing this obsession on their sleeve, reminding everyone who will listen, leveling their wrath on marketing researchers daring to use exploratory techniques, I am often tempted to ask about controlling for Type II. I am often underwhelmed with the reply. There are just so many things that can go wrong when you get a non-significant result. Although I wrote about something similar in my most recent post, I am compelled to reduce my thoughts to writing again:

1) The effect can be too small for the sample size. Ironically, the problem is usually the opposite. Often researchers don’t have enough data even though the effect is reasonably big. In this case, I was persuaded by the documentary’s argument that bee “birth defects” would be a serious effect. Maybe short term adult death was not subtle enough. Detecting something more subtle would require more data.

2) The effect can be delayed. My own work doesn’t involve bees, but what about the effect of marketing? Do we always know when a promotion will kick in? Are we still experiencing the effects of last quarter’s campaign? Does that cloud our ability to measure the current campaign? Might the effects overlap?

3) The effect could be hidden in an untested interaction (AKA your model is too simple). The bee documentary proposed an easy to grasp hypothesis - that the pesticide accumulates over time in the adult bee. Maybe a proximity * time interaction? We may never know, but was the sample size sufficient to test for interactions, or was Power Analysis done assuming only main effects? Since they were studying bee autopsies, the sample size was probably small. I don’t know the going rate for a bee autopsy, but they are probably a bit expensive since the expertise would seem rare.

4) Or it’s hidden in a tested interaction (AKA your model is too complex). I had a traumatic experience years ago when a friend asked me what “negative degrees of freedom” were. Since she was not able to produce a satisfactory answer to a query regarding her hypothesized interactions, her dissertation committee required her to “do all of them”. Enough said. It was horrible.

5) The effect might simply be, and what could be more obvious, not hypothesized. This, we might agree, is the real issue regarding the adult bee death hypothesis. It may not have been the real problem at all.

Statistics doesn’t help you find answers. Not really. It only helps you prove a hypothesis. When you are lucky, you might be able to disprove one. Often, we have to simply “fail to prove”. In any case, I recommend the documentary. Now that I’ve been able to vent a bit about Type II, I should watch it again and focus more of my attention on the bees.

When you get a statistical result, it is all too easy to jump immediately to the conclusion that the finding “is statistically significant” or “is not statistically significant.” While that is literally true, since we use those words to describe below .05 and above .05, it does not imply that there are only two conclusions to draw about our finding. Have we ruled out the possible ways that our statistical result might be tricking us?

Things to think about if it is below .05

Real: You might have a Real Finding on your hands. Congrats. Consider the other possibilities first, but then start thinking about who needs to know about your finding.

Small Effect: Your finding is Real, but is of no practical consequence. Did you definitively prove a result with an effect so small that there is no real world application of what you have found? Did you prove that a drug lowers cholesterol at the .001 level, but the drug only lowers it at a level so small that no Doctor or patient will care? Is your finding of a large enough magnitude to prompt action or to get attention?
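To make the Small Effect point concrete, here is a rough sketch under stated assumptions: a two-group comparison, a normal approximation, and an observed standardized difference that is exactly the value passed in. The function name and the numbers are hypothetical and only illustrate how sample size alone can push a trivial difference below .05.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p_from_effect(d, n_per_group):
    """Two-sided p value for a two-sample z-test (normal approximation)
    when the observed standardized mean difference is exactly d."""
    z = d * sqrt(n_per_group / 2)  # standard error of the difference is sqrt(2/n)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical numbers: the same tiny effect at two very different sample sizes.
tiny_effect = 0.02
p_large_n = two_sided_p_from_effect(tiny_effect, n_per_group=100_000)
p_small_n = two_sided_p_from_effect(tiny_effect, n_per_group=100)

print(f"n = 100,000 per group: p = {p_large_n:.2g}")
print(f"n = 100 per group:     p = {p_small_n:.2g}")
```

The effect is identical in both runs; only the amount of data changes. The p value says nothing about whether a difference of that size is worth acting on.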

Poor Sample: Your data does not represent the population. There is nothing you can do at this point. Are you sure you have a good sample? Did you start with a ‘Sampling Frame’ that accurately reflects the population? What was your response rate on this particular variable? Would the finding hold up if you had more complete data? Have you checked to see if the respondent and non-respondent status on this ‘significant’ variable is correlated with any other variable you have? Maybe you have a census, or you are Data Mining - are you sure you should be focused on p values?


Rare Event: You have encountered that 5% thing. It is going to happen. The good news is we know how often it is going to happen. If you are like everyone else, you probably are operating at 95% confidence, and then each test, by definition, has a 5% chance of coming in below .05 from random forces alone. So you have a dozen findings - which ones are real? Was choosing 95% Confidence a deliberate and thoughtful decision? Have you ensured that Type I error will be rare? If you have a modest sample size, did you choose a level of confidence that gave you enough Statistical Power (see below)? If you are doing lots of tests (perhaps Multiple Comparisons), did you take this into account or did you use 95% confidence out of habit?
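The "dozen findings" question can be put into numbers. This is a minimal sketch that assumes the tests are independent, which real analyses often are not. The function names are hypothetical, and Bonferroni is shown only as one simple adjustment, not as a recommendation.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n_tests independent
    tests when every null hypothesis is actually true."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_alpha(alpha, n_tests):
    """Per-test threshold that keeps the familywise rate at or below alpha."""
    return alpha / n_tests

fwer = familywise_error_rate(0.05, 12)   # about 0.46 for a dozen tests
per_test = bonferroni_alpha(0.05, 12)

print(f"Chance of at least one false alarm in 12 tests: {fwer:.3f}")
print(f"Bonferroni per-test threshold: {per_test:.4f}")
```

With a dozen tests at 95% confidence, the chance of at least one spurious "finding" is close to a coin flip.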

Too Liberal: You have violated an assumption which has made your result Liberal. Your p value only appears to be below .05. For instance, did you use the usual Pearson Chi-Sq when Continuity Correction would have been better? Maybe Pearson was .045, Likelihood Ratio was .049, Continuity Correction was .051. Did you choose wisely? Did you use Independent Samples T-Test when a non-parametric would have been better? Having good Stats books around can help, because they will often tell you that a particular assumption violation tends to produce Liberal results. You could always consider a Monte Carlo simulation or Exact Test, and make this problem go away. (An interesting ponderable is to ask if we are within a generation of abandoning distributional assumptions as ordinarily outfitted computers get more powerful?)
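As one illustration of the Exact Test option, here is a sketch of a two-sided Fisher exact test for a 2x2 table, written with the standard library only. The function name and the example counts are hypothetical. The two-sided rule used (sum every table that is no more likely than the observed one) is one common convention, and statistical packages may differ in the details.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p value for the 2x2 table [[a, b], [c, d]].

    With the margins held fixed, the top-left cell follows a hypergeometric
    distribution. The p value sums the probability of every possible table
    that is no more likely than the one observed."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # The small tolerance guards against floating point ties.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))

# Hypothetical small table where large-sample approximations are shaky.
p_exact = fisher_exact_two_sided(3, 1, 1, 3)
print(f"Exact two-sided p = {p_exact:.4f}")
```

No distributional approximation is involved here, so there is no continuity correction to argue about. The cost is computation, which matters less every year.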

Things to think about if it is above .05

Negative Finding: You might have disproven your hypothesis. (I know that you have ‘proven’ your ‘Null Hypothesis’, but does anyone talk that way outside of a classroom?) Congrats might be in order. Consider the other possibilities and then start thinking about who needs to know about your negative finding. If it is the real thing, a negative finding could be valuable. Be careful, however, before you shout that the literature was wrong. Make sure it is a bona fide finding.

Power: You may simply have lacked enough data. Did you do a Power Analysis before you began? Was your sample size commensurate with your number of Independent Variables? Did you begin with a reasonable amount of data, but attempt every interaction term under the sun? Did you thoughtlessly include effects like 5 way interactions without measuring the impact that it had on your ability to detect true effects? If you aren’t sure what a Power Analysis is, it is best that you describe your negative results using phrases like: “We failed to prove X”, not “We were able to prove that the claim of X, believed to be true for years, was disproved by our study (N=17)”. You can also Google Jacob Cohen’s wonderful “Things I have Learned (So Far)” to learn more about Power Analysis. I mention it in my Resources section, and it has influenced my thinking for years. Its influence is certainly present in this post.
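For a feel of what a Power Analysis reports, here is a rough sketch for the simplest case: a two-sided, two-group comparison under a normal approximation. The function name is hypothetical, the effect size of 0.5 follows Cohen's convention for a "medium" standardized difference, and reading the N=17 above as 17 per group is an assumption made only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test to detect a
    true standardized mean difference d with n_per_group in each group."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)
    # Probability that the test statistic lands in either rejection region.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

power_small_study = power_two_sample(0.5, n_per_group=17)
power_larger_study = power_two_sample(0.5, n_per_group=64)

print(f"17 per group: power = {power_small_study:.2f}")
print(f"64 per group: power = {power_larger_study:.2f}")
```

When power is well under one half, a non-significant result is the expected outcome even if the effect is real, which is why "we failed to prove X" is the honest phrasing.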

Poor Sample: Your data is not representative of the population. This one can get your p value to move, incorrectly, in either direction.

Too Conservative: You have violated an assumption which has made your result Conservative. Your p value only appears to be above .05. Did you use an adjusted test in an instance when no adjustment was needed? Did you use Scheffe for Multiple Comparisons, but aren’t quite sure how to justify your choice? Most assumptions make our tests lean Liberal, coming in too low, but the opposite can occur.

 

This list has served me well for a long time. Always best to report your findings thoughtfully. Statistics, at first, seems like a system of Rule Following. It is more subtle than that. It is about extracting meaning, and then persuading an audience with data. Without an audience, there would be no point. They deserve to know how certain (or uncertain) we are.

I will be speaking in Kuala Lumpur, Malaysia next week on the subject of Data Mining. I will be discussing Data Mining, in general, and then participants will get a chance to try it using the resources provided by the excellent tool neutral Elder, Miner, Nisbet book. I believe the event is at capacity, but there are already tentative plans to try this format again in January, 2012, also to be held in Kuala Lumpur, Malaysia. The event organizer stays in charge of the details, but if you are interested in finding out more about the January four day event please email me.

Essential Elements of Data Mining

This is my attempt to clarify what Data Mining is and what it isn’t. According to Wikipedia, “In philosophy, essentialism is the view that, for any specific kind of entity, there is a set of characteristics or properties all of which any entity of that kind must possess.” I do not seek the Platonic form of Data Mining, but I do seek clarity where it is often lacking. There is much confusion surrounding how Data Mining is distinct from related areas like Statistics and Business Intelligence. My primary goal is to clarify the characteristics that a project must have to be a Data Mining project. By implication, Statistical Analysis (hypothesis testing), Business Intelligence reporting, Exploratory Data Analysis, etc., do not have all of these defining properties. They are highly valuable, but have their own unique characteristics. I have come up with ten. It is quite appropriate to emphasize the first and the last. They are the bookends of the list, and they capture the heart of the matter.

1) A Question
2) History
3) A Flat File
4) Computers
5) Knowledge of the Domain
6) A lot of Time
7) Nothing to Prove
8) Proof that you are Right
9) Surprise
10) Something to Gain

1) A Question: Data Mining is not an unfocused search for anything interesting. It is a method for answering a specific question, meeting a particular need. Getting new customers is not the same as keeping the customers you already have. Of course, they are similar, but different in both big and subtle ways. The bottom line is that every decision that you make about the data that you select and assemble flows from the business question.

2) History: Data Mining is not primarily about the present tense, which distinguishes it from Business Intelligence reporting. It is about using the past to predict the future. How far into the past? Well, if your customers sign a 12 month contract then it is probably more than 12 months old. It must be old enough to have a cohort of customers that have started and ended a process that is ongoing. Did they renew? Did they churn? You need a group of records for which the outcome of the process is known historically. This outcome status is usually in the form of a Target or Dependent Variable. It is the cornerstone of the data set that one must create, and is the key to virtually all Data Mining projects.


3) A Flat File: Data Miners are not in the Dark Ages. They work with relational databases on a daily basis, but the algorithms that are used are designed to run on flat files. Software vendors are proud to tout “in database modeling,” and it is exciting for its speed, but you still have to build a flat file that has all of your records and characteristics in one table. The Data Miner and author Gordon Linoff calls this a “customer signature.” I rather prefer the idea of a customer “footprint” as it always involves an accumulation of facts over time. The resulting flat file will be unique to the project, specifically built to allow the particular questions of the Data Mining project to be answered.
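A toy sketch of what building such a flat file means: many transaction rows per customer are rolled up into one row per customer. The field names, the data, and the function name are all hypothetical, and a real project would do this in SQL or a data preparation tool with far more columns.

```python
from collections import defaultdict

# Hypothetical transaction-level rows, as they might come out of a relational table.
transactions = [
    {"customer_id": "A", "month": 1, "amount": 20.0},
    {"customer_id": "A", "month": 3, "amount": 35.0},
    {"customer_id": "B", "month": 2, "amount": 10.0},
    {"customer_id": "A", "month": 7, "amount": 15.0},
    {"customer_id": "B", "month": 2, "amount": 5.0},
]

def build_customer_signature(rows):
    """Accumulate each customer's history into a single flat record."""
    by_customer = defaultdict(list)
    for row in rows:
        by_customer[row["customer_id"]].append(row)

    signature = []
    for customer_id, history in sorted(by_customer.items()):
        months = [r["month"] for r in history]
        signature.append({
            "customer_id": customer_id,
            "n_purchases": len(history),
            "total_spend": sum(r["amount"] for r in history),
            "first_month": min(months),
            "last_month": max(months),
        })
    return signature

flat_file = build_customer_signature(transactions)
for row in flat_file:
    print(row)
```

Each output row is an accumulation of facts over time, which is the "footprint" idea. The Target variable from Element 2 would become one more column in the same table.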

4) Computers: Data Mining data sets are not always huge. Sometimes they are in the low thousands, and sometimes a carefully selected sample of a few percent of your data is plenty to find patterns. So, despite all the talk of Big Data, the size of the data file is not really a limiting factor on today’s machines. Statistics software packages were capable of running a plain vanilla regression on larger data sets decades ago. The real thing that separates Data Mining from R. A. Fisher and his barley data set is that Data Mining algorithms are highly iterative. Considerable computing power is needed to find the best predictors and try them in all possible combinations using a myriad of different strategies. Data Mining is not simply Statistics on Big Data. Data Mining algorithms were created in a post computing environment to solve post computing problems. They are qualitatively different from traditional statistical techniques in fundamental and important ways, and even when traditional techniques are used they are used in the service of substantively different purposes.

5) Knowledge of the Domain: A sales rep once told me a story, probably apocryphal, about the early days of the Data Mining software I use. A banking client wanted to put them to the test, so the client said: “Here are some unlabeled variables. We are going to keep the meaning of them secret. Tell us which are the best predictors of variable X. If you answer ‘correctly’, we will buy.” What a horrible idea! The Data Mining algorithms play an important role in guiding the model building process, but only the human partner in the process can be the final arbiter of what best meets the need of the business problem. There must be business context, and if the nature of the data requires it, that context might involve Doctors, Engineers, Call Center Managers, Insurance Auditors or a host of other specialists.

6) A Lot of Time: Data Mining projects take time, a lot of time. They take many weeks, and perhaps quite a few months. If someone asks a Data Miner if they can have something preliminary in a week, they are thinking about something other than Data Mining. Maybe they really mean generating a report, but they don’t mean Data Mining. Problem definition takes time because it involves a lot of people, assembled together, hashing out priorities, figuring out who is in charge of what. With this collaboration, the project lead can’t easily make up lost time by burning the midnight oil. Data Preparation takes much of the time. Perhaps you assume that you will be Mining the unaltered contents of your Data Warehouse. It was created to support BI Reporting, not to support Data Mining, so that is not going to happen. Finally, when you’ve got something interesting, you have to reconvene a lot of people again, and you aren’t done until you have deployed something, making it part of the decision management engines of the business. (See Element 10.)

7) Nothing to Prove: If you are verifying an outcome, certain that you are right, having carefully chosen predictors in advance, simply curious how well it fits, you aren’t doing Data Mining. Perhaps you are merely exploring the data in advance, biding your time, waiting until your deadline approaches and then using hypothesis testing to congratulate yourself on how successfully your model fits data that you explored. This is, of course, the worst possible combination of Statistics and Data Mining imaginable, and violates the most basic assumptions of hypothesis testing. Neither of these approaches is Data Mining.

8) Proof that you are Right: Data Mining, by its very nature, does not have a priori hypotheses, but it does need proof. A contradiction? The most fundamental requirement of Data Mining is that the same data which was used to uncover the pattern must never be used to prove that the pattern applies to future data. The standard way of doing this is to divide one’s data randomly into two portions, building the model on the Train data set, verifying the model on the Test data set. In this is found the essence of Data Mining because it gives one freedom to explore the Train data set, uncovering its mysteries, awaiting the eventual judgement of the Test data set.
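The random division itself is simple. Here is a minimal sketch using only the standard library; the function name, the 30% test fraction, and the seed are arbitrary assumptions, and Data Mining tools typically provide their own partitioning step.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=42):
    """Randomly divide records into a Train portion and a Test portion.
    A fixed seed makes the partition reproducible."""
    rng = random.Random(seed)
    shuffled = list(records)      # copy, so the caller's data is left untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(1000))       # stand-in for 1,000 customer rows
train, test = train_test_split(records)
print(len(train), len(test))
```

The discipline matters more than the code: explore Train as freely as you like, and consult Test only to judge the finished model.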

9) Surprise: A common mistake in Data Mining is being too frugal with predictors, leaving out this or that variable because “everyone knows” that it is not a key driver. Not wise. Even if this is true, it discounts the insight that an unanticipated interaction might provide. Even if true, it is a needless precaution because Data Mining algorithms are designed to be resilient to large numbers of related predictors. This is not to say that feature selection is not important - it is a key skill - but rather that Data Miners must be cautious when removing variables. Each of those variables costs the business money to record, and the insights they might offer have monetary value as well. Doing variable reduction well in Data Mining is in striking contrast with doing variable reduction well in Statistics.

10) Something to Gain: It might be somewhat controversial, but I think not overly so, to establish an equivalence: Data Mining Equals Deployment. Without deployment, you may have done something valuable, perhaps even accompanied with demonstrable ROI, but you have fallen short. You may have reached a milestone. You may even have met the specific requirements of your assignment, but it isn’t really Data Mining until it is deployed. The whole idea of Data Mining is taking a carefully crafted snapshot, a chunk of history, establishing a set of Best Practices, and inserting them in the flow of Decision Making of the business.

The issue of clarifying what Data Mining is (and what to call it) comes up in conversation often among Data Miners, so I hope the community of data analysts will find this a worthy enterprise. I intend to present this list to new Data Miners when I meet them in a tool neutral setting. Please do provide your feedback. Would you add to the list? Do you think that there are any properties listed here that are not required to call a project Data Mining?

Options for formal Data Mining training in university settings are exploding. Just during the week that I have been collecting resources and putting this together, new programs have been announced. Some years ago I looked into what was then a very thin selection of Data Mining university based programs. I took one online class, and was unimpressed. I gave up my search. As the years passed, and my portfolio of real world projects grew, it seemed to make less and less sense for me. You won’t find too many folks that have been doing this more than 10 years with a formal degree in Data Mining because the movement to certify and credential Data Miners is relatively new. I welcome it. We need more Data Miners. The selection is much broader now, and some of the offerings seem very promising. Here are some questions to ask if you are considering it.

Is the program really teaching Data Mining? How do they define it?

Here is my definition, and here is Meta Brown's great discussion of Tom Khabaza's nine laws.

Do I want software certification, a university certificate, or a Master’s?

Both SPSS and SAS have software certifications. Neither certification process will be sufficient to make you a competent data miner, but the investment in time and money is modest compared to a Masters. In both cases, it is an investment of a couple of hundred dollars for the exam, and then taking some corporate training classes. I am quite familiar with the training options for the SPSS certification exam. The classes are good, but focus mostly on the "point and click" aspect which is what the exam covers as well. The SAS certification exam options seem similar. IBM SPSS Modeler doesn't seem to have a strong self study option. SAS does seem to have done a better job with relatively inexpensive self study material dedicated to exam prep. 

Monster shows high demand for proficiency in both SPSS and SAS in general, so the IBM SPSS Modeler and SAS Enterprise Miner exams would seem to be good bets. See Bob Muenchen's discussion of "Software Popularity" for an analysis of this demand. Two recent Spotfire posts have addressed the same topic: Data Geek shortage and Data Geeks are "Hotter than Hot". It is worth noting that Meta Brown thinks that the problem isn't a shortage, but that recruiters aren't finding qualified analysts.

Some universities have added Data Mining "certificates". Stanford’s offering looks impressive, and it involves 3 courses. Central Connecticut State University was one of the first to offer a Master’s, and its Data Mining certificate program is a 5 course, 18 credit hour program. KDnuggets maintains a list. A certificate is obviously less money than a Masters, but it is too early to know how human resources departments and recruiters will respond to these. I think it is an interesting option for someone that already has a Stats degree, but wants to differentiate themselves. Is it worth as much as $10,000 to do that? Could you assist someone on a real world project instead?

A Master’s program, and there are now dozens, is going to be approximately 33 credit hours (11 classes), so it is obviously a greater commitment. KDnuggets maintains a list of all the university options. University of Tennessee at Knoxville just announced an Analytics Masters combined with an MBA. Northwestern's Masters in Predictive Analytics was announced earlier this year, and it is just about to start its first classes.

Questions to ask in evaluating a program:

1) What software will be used?

Last year IBM announced the creation of a partnership with DePaul. I can't find anything on the site that explicitly mentions IBM SPSS Modeler, but it is bound to be an IBM SPSS Modeler friendly place. SAS has a strong affiliation with the Institute for Advanced Analytics at North Carolina State University, which awards a Masters. There is a similarly affiliated certificate program at Oklahoma State. It is unlikely that use of anything other than SAS would be allowed on assignments because they are explicitly designed to teach SAS for university credit. I think it might be a good thing to be forced to learn another software package, thereby earning credit and a tangible skill. On the other hand, at midnight, with a deadline drawing close on a capstone project, you might regret that decision if the class denied you access to what you already know well. Northwestern's Master of Science in Predictive Analytics, which has a capstone project, explicitly allows the use of SPSS, SAS, or R. Students are expected to learn all three, but can use any of the three for their capstone project.

2) Who will be teaching you?

Here, I must admit, I get a bit skeptical. Are there enough university faculty with actual field experience in Data Mining? There are some, to be sure. Reviewing the CVs of the faculty, most of these programs have great faculty in their fields. But that is the catch. Are they competent in Stats AND Data Mining? Are they competent in Data Warehousing AND Data Mining? Am I being a bit unfair? Well, a Masters program might run you $30,000 - $40,000. Are you just trying to impress your future employer or do you really want to master your craft? There are lots of programs out there now. Be prepared to ask the tough questions. If you are required to take an online "Stats 101", it might be from an adjunct statistician that may or may not be a Data Miner. I am almost certain that the best of the programs will have some faculty that are Data Miners. Some of them, in fact, are pretty impressive. Why should a Stats professor in a Data Mining program be required to have done Data Mining in a corporate setting? For experienced Data Miners, the question almost answers itself. After all, you always have the option of a Stats Masters with a Data Mining course or two. Frankly, on this note, in reviewing faculty backgrounds, I don't think any of the other programs can compete with Stanford. The faculty in the three course certificate program truly are Data Miners.

3) Do you really need credit? Do you really need the degree?

These programs are popular precisely because earning a degree can increase your attractiveness to employers. If you can learn a marketable skill at the same time, all the better. However, what if you are already established in a related field? Maybe you already have a Masters. You might want to pursue just the skill. It is cheaper and quicker, but you don't get the degree. It is a big decision because some employers might favor a candidate with a Masters. If you can live without a Masters in Data Mining, then there are lots of corporate options. SPSS and SAS have their aforementioned corporate training options. Competitors like Salford Systems and Statsoft's Statistica have training programs. There are also tool neutral training vendors like Predictive Analytics World, Statistics.com, or The Modeling Agency. There is a big difference between being a customer and being a student. When you are a customer, the old adage that "the customer is always right" kicks in. If you have a bad experience you might be able to retake a class, or work out a complaint in another way. Years ago, when I gave a university class a try, I had a bad experience. The professor did not make himself available for questions, and was very slow to provide feedback. It was very difficult to pursue a complaint. There was literally no system in place. It was clear that in that particular venue the philosophy was "the professor is always right".

4) What will be the quality of the online experience?

Some asynchronous online classes might be nothing more than assigned reading with assignments. When CCSU first produced its Data Mining programs, it was in this format, and it seems that it still is. A video does not assure a good experience, but if it is going to be just readings, you will want to make sure the experience will work for you. What will your colleagues in class be like? Will you be interacting with them in meaningful ways? Most programs have video presentations now. It is remarkable to me that some of the sample lectures online are not very good. The lectures themselves are usually competent, but some are uninspiring, and many are very poorly produced. The sound can be poor, and the professors often walk out of the frame. When students in the room participate, they come across as mysterious invisible voices. There is lots of competition now, so you should be a critical consumer. I would ask enough questions to be certain that you are going to get: good lecture material in some form, meaningful assignments, rapid quality feedback, thoughtful exams, good customer service, software support, and job placement.

5) Is money a factor?

Money is probably always a factor. On the cheapest end of the scale are: a one day workshop at a conference, a single online class, or self study for an exam like SAS'. An option like this is going to be less than $1,000. That might get someone's attention on LinkedIn, but it probably won't be enough to be really competent. I have taught this kind of material to hundreds and hundreds of folks. I don't think one class does it. I had a good experience with a one day R workshop at Predictive Analytics World, but I brought more than a decade of experience to that workshop. I just didn't know R. It was fun, but I certainly didn't master R in a day. Having said that, I think you can learn a lot in the equivalent of a couple of weeks study, especially if you already work in a related field. So something like the SPSS or SAS class series leading up to their exams might work. If you take the classes publicly, you are looking at a few thousand dollars. The university options vary widely. University of California San Diego's Data Mining Certificate charges $625 per course for each of about 6 courses (20 credit hours). In contrast, at Stanford, you are looking at $11,000 for three classes. A Masters program is going to be tens of thousands, certainly, but will vary widely in cost. Also, a busy professional is probably looking at as much as 5 years to get the Masters.

 

 

I found myself entranced by this thread today: LinkedIN

I replied there, but with the promise of more detail here:

A sales rep I worked with faced a challenge some years ago. He had to define Statistics in one bullet on a PowerPoint slide to contrast it with Data Mining. The result was brilliant: “Stats is proving or disproving a hypothesis with a single data set”.

Here are my own two cents on the most striking differences. Disclaimer: By “Stats” I am really focusing on hypothesis testing using parametric techniques.

1) In Statistics your hypotheses predate the collection of data. In Data Mining, the data has already been collected during the normal course of doing business.

2) Data Mining assumes eventual deployment. You don’t always get there, but the whole process is built around identifying previously unknown patterns, proving that you are right, and deploying the results in the form of a transformed business process.

3) To be worth an effort of many person weeks, the search for patterns has to be exhaustive. It takes a lot of human effort, not only machine effort, to know where and how to look. It can not be fueled only by a priori hypotheses. To do so is to limit the search by the business experience of the data miner and their collaborators. No matter how extensive that experience is, it would be imprudent to let it impede the discovery of new patterns. Rather, that experience should be used during validation and deployment to ensure that the patterns can be made useful to the business.

4) Data Miners always have to validate the model using data, whereas statisticians use distributional assumptions as a proxy for a validation data set. It is always desirable to have a replication data set, but statisticians are not always that lucky. I am surprised how rarely the issue of money comes up when drawing the contrast between these two techniques. A competent statistician will often have to do a power analysis to determine sufficient sample size. One does not double (or increase less modestly) the sample size in order to create a validation data set. It would cost too much money. Imagine a heart study, for instance. Instead, one uses hypothesis testing and distributional assumptions to obviate the validation data set. In other words, the bell curve becomes a stand in for the validation data set. Data Miners don’t have to worry about this. They usually have enough data to allow them to divide it randomly into a Train data set, and a Test data set.
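To put rough numbers on the money argument, here is a sketch of the textbook normal-approximation sample size for comparing two groups. The function name, the effect sizes, and the cost per participant are hypothetical, and a real study would use a design-specific calculation.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate participants needed per group for a two-sided,
    two-sample comparison to detect a standardized difference d."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

cost_per_participant = 5_000      # hypothetical cost of enrolling one patient
for effect in (0.5, 0.2):
    total = 2 * n_per_group(effect)
    print(f"d = {effect}: {total} participants, "
          f"about ${total * cost_per_participant:,} "
          f"(${2 * total * cost_per_participant:,} with a same-size validation set)")
```

Holding back an equal-sized validation sample doubles a bill that is already large, which is the practical reason the bell curve stands in for it.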

I am updating my book reviews, and I have decided to review more books briefly, and direct everyone to Amazon for more detail. I didn't want to remove this material entirely, though. My Amazon review doesn't cover the flaws of this book in as much detail.

Let former statisticians who become Data Miners beware. The advantages of stats training for data miners are too numerous to mention here, but there are also dangers!

Here is the original review of Larose's first book: 

I have mixed feelings about this book! I don’t think I can recommend it, but there are some sections that I like. Here in a nutshell is what I don’t like about it: he strongly implies that the way to avoid overreaching with automated techniques is for the human to hand-pick variables after a rigorous bivariate analysis. The author, Larose, would not put it that way in a single sentence, but his logic forces you there. First, data mining is “easy to do badly” and “there are no automatic data mining tools that will solve your problems mechanically ‘while you wait’”. Second, almost 40 pages are spent on data cleaning and prep, but only a couple of pages are spent on hold-out validation. Third, if you screw up, “the wrong analysis is worse than no analysis, since it leads to policy recommendations that will turn out to be expensive failures.” Conclusion: if a well-prepared domain expert with good, clean data is at the reins, you have a good chance of a positive result, provided they liberally apply their business experience to the model. Although it seems impossible to disagree with this wisdom, it can actually be dangerous if taken literally. Read carefully, Larose says many of the right things, but if you aren’t careful, you will get the impression that the analyst’s business experience and exploration are driving the model.

In contrast, I would suggest that the critical piece is the hold-out validation! A good validation should be able to show you that the theory/model you developed works. At the risk of being too colorful, I don’t care if the insight comes to you in the shower, or is whispered to you by a tarot card reader; if you do a careful validation, you can confirm that the model works. The trick is to use data that the machine has never seen, or better, data that you have never worked directly with either (it must be clean).

Those, like Larose, who emphasize the human pre-processing of the data seem to neglect the very real danger that immersion in the data can prevent surprise if you don’t take the validation very seriously. It is the linchpin of data mining. What is surprise? The discovery of something that you didn’t expect, but that you can prove is real. Not merely a fluke, but a true, real discovery. If you aren’t careful, you could ‘clean’ your surprises right out of the data if you drop variables that have a weak relationship to the dependent variable, or remove all of your outliers. In Larose’s defense, he explicitly says not to do this, but only after dozens of pages on outliers and bivariate analyses.
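
Here is a small Python sketch of why I lean so hard on the hold-out. Everything in it is made up for illustration: the data is pure random noise, and the row and candidate counts are arbitrary. It searches many meaningless variables for the one most correlated with the target on the training half, then scores that same variable on rows it has never seen.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(42)
n_rows, n_candidates = 200, 500

# Target and candidate predictors are independent noise: there is nothing real to find.
target = [rng.gauss(0, 1) for _ in range(n_rows)]
candidates = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_candidates)]

# Rows are already in random order, so first half = Train, second half = Test.
half = n_rows // 2


def score(col, lo, hi):
    return pearson(col[lo:hi], target[lo:hi])


# "Discovery": pick the candidate that looks best on the training rows only.
best = max(range(n_candidates), key=lambda j: abs(score(candidates[j], 0, half)))
train_r = score(candidates[best], 0, half)
test_r = score(candidates[best], half, n_rows)

print(f"best candidate on train: r = {train_r:+.3f}")
print(f"same candidate on test:  r = {test_r:+.3f}")
```

The winning variable will look respectable on the training rows simply because 500 candidates were searched. On the held-out rows it would typically shrink toward zero, which is the validation telling you the "discovery" was a fluke. A real pattern would survive that check no matter where the idea came from.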

What do I like about the book? One major piece, but I like it A LOT. He has the brilliant insight to use tiny, tiny data sets to walk through every step of potentially complicated algorithms like neural nets and CART. These data sets have only a dozen or so cases, but that makes the walk-through easy to understand. I don’t think this is enough to warrant buying the book, since it is thin and expensive, but I have reread those sections several times, and I now say more about the details in lecture than I used to, because I have become convinced that they are easier to explain than I thought and provide real insight into the outcomes.

Overall, unless your book budget for this stuff is virtually unlimited, buy Berry and Linoff’s Data Mining Techniques. Note: I have since purchased two other books in this series, and this author's books have value in a large collection.

I have been going crazy trying to find this article. I had remembered it being in his excellent book "Data Prep for Data Mining". Instead it was offered in a magazine some years ago. It is great.

"This way failure lies"

I just found this, and it looks promising. The price is right, and the content seems good. It is asynchronous, but there is an instructor helping students along. And the O'Reilly name doesn't hurt. I may try it.

Python Certification

A series of mouse clicks delivered me here eventually.

Cross Validated

One of the posts sent me in this direction, an article walking through Data Mining from a Statistics point of view. The story has been told many times, but they win points for starting with Francis Bacon.

Data Mining from a Statistical Perspective