How statistics manipulate you

14.11.2006
Q: I just read a statistic which stated 48 percent of companies are planning to deploy a file classification strategy (2006 ITtoolbox Networking & Storage Survey). My company (please don't name us!) is a Fortune 500, and we have absolutely no plans to deploy anything like that. I sent an email to our NetApp users group in the area and not one person said they were rolling out any sort of classification system (though several were starting to look at what is out there, and two told me how they have added file virtualization products recently). I asked our Windows team the same thing, and they looked at me like I was from Mars. I am interested in keeping up to date on what is happening in the world, but is there any way to find real data? Do people really just make this up, and if so, why? -- N.R., N.Y.

A: My brother had the best line I've ever heard in this area: 62 percent of all statistics are made up. I'm fairly confident he stole it, but I have passed the plagiarism baton. Yes, it's all made up. No, you can't bet on any number you see, from anyone, even me.

The problem is that as a smart guy, I can make any number you want to see work. Research is an art as much as a science. I'd like to think Enterprise Strategy Group has mastered forward-looking research (they don't let me have anything to do with it!), but only because we have really smart folks who know how to ask the questions. They also know whom to ask. And they know that if a result seems too good to be true, it means we screwed something up.

I don't know what research you are referring to, but I do know that there is a better chance of me turning 6'4" and really good looking than of 48 percent of IT shops implementing a file classification strategy. You can't get 48 percent of IT shops to agree on lunch, let alone something as nebulous and hard as: 1. Finding out what file data exists in the enterprise, no matter where; 2. Creating corporate policies, with executive buy-in, on what categories/classes of file data should exist; 3. Deciding what to do about stuff you find that shouldn't exist; 4. Deciding what attributes should be assigned to each class of data; 5. Deciding when, or on what event, the current classification should be reviewed and changed if necessary; 6. Starting all over again once you've finished.

I don't think I could get half the folks surveyed to say yes to this if there were only two in the population -- and I'm good.

I don't even believe the backward-looking stuff I read. Market share data, for example, has long been used by the industry and IT folks as a basis for decisions, and I don't think I've ever seen any I can trust. In a smaller market, with two or three players, it's doable, but in a big market it gets infuriatingly goofy. How can one firm say the data management market is a $28 billion space and another say it's $3.8 billion? Easy: it's all in how you define it. It doesn't do you any good to find out after the fact that you just spent $487,000 on the market-leading product when your purchase just doubled the size of the market, does it?
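
To put rough numbers on that, here is a minimal Python sketch of the arithmetic. The two market sizes and the $487,000 purchase come from the paragraph above; the vendor revenue figure is purely hypothetical, invented for illustration, and the function name is my own.

```python
def market_share(vendor_revenue, market_size):
    """Return the vendor's share as a fraction of the stated market size."""
    if market_size <= 0:
        raise ValueError("market_size must be positive")
    return vendor_revenue / market_size


# The two sizes quoted for the "same" data management market.
BROAD_MARKET = 28e9     # $28 billion definition
NARROW_MARKET = 3.8e9   # $3.8 billion definition

# ASSUMPTION: hypothetical vendor revenue, not taken from any real report.
VENDOR_REVENUE = 1.9e9

broad_share = market_share(VENDOR_REVENUE, BROAD_MARKET)
narrow_share = market_share(VENDOR_REVENUE, NARROW_MARKET)

# If a purchase doubles the market, the market before it was exactly
# the size of the purchase.
PURCHASE = 487_000
market_before = PURCHASE
market_after = market_before + PURCHASE
purchase_share = PURCHASE / market_after

print(f"Share under the broad definition:  {broad_share:.1%}")
print(f"Share under the narrow definition: {narrow_share:.1%}")
print(f"Market before your purchase: ${market_before:,}")
print(f"Your purchase as a share of the market afterward: {purchase_share:.0%}")
```

Same vendor, same revenue: it is either a bit player or the owner of half the market, depending entirely on whose definition you accept. And a "market leader" in a market your own check just doubled is not telling you much.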