How statistics manipulate you

14.11.2006

Q: I just read a statistic which stated 48 percent of companies are planning to deploy a file classification strategy (2006 ITtoolbox Networking & Storage Survey). My company (please don't name us!) is a Fortune 500, and we have absolutely no plans to deploy anything like that. I sent an email to our NetApp users group in the area and not one person said they were rolling out any sort of classification system (though several were starting to look at what is out there, and two told me how they have added file virtualization products recently). I asked our Windows team the same thing, and they looked at me like I was from Mars. I am interested in keeping up to date on what is happening in the world, but is there any way to find real data? Do people really just make this up, and if so, why? -- N.R., N.Y.

A: My brother had the best line I've ever heard in this area: 62 percent of all statistics are made up. I'm fairly confident he stole it, but I have passed the plagiarism baton. Yes, it's all made up. No, you can't bet on any number you see, from anyone, even me.

The problem is that as a smart guy, I can make any number you want to see work. Research is an art as much as a science. I'd like to think Enterprise Strategy Group has mastered forward-looking research (they don't let me have anything to do with it!), but only because we have really smart folks who know how to ask the questions. They also know whom to ask them. They also know that if it seems too good to be true, it means we screwed something up.

I don't know what research you are referring to, but I do know that there is a better chance of me turning 6'4" and really good looking than of 48 percent of IT shops implementing a file classification strategy. You can't get 48 percent of IT shops to agree on lunch, let alone something as nebulous and hard as: 1. Finding out what file data exists in the enterprise, no matter where; 2. Creating corporate policies with executive buy-in on what categories/classes of file data should exist; 3. Deciding what to do about stuff you find that shouldn't exist; 4. Deciding what attributes should be assigned to each class of data; 5. Deciding when, or what event, should force a review of the current classification and a change if necessary; 6. When finished with that, starting all over again.
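To make those six steps a little more concrete, here is a minimal sketch of what a first pass might look like on paper. Everything in it is a hypothetical illustration: the class names, retention periods, review triggers, and the extension-based rules are assumptions, not anything from a real product or survey.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileClass:
    name: str
    retention_years: int   # an attribute assigned to the class (step 4)
    review_trigger: str    # the event that forces a re-review (step 5)


# Hypothetical classes agreed on with the business (step 2).
CLASSES = {
    "contracts": FileClass("contracts", 7, "contract renewal"),
    "scratch": FileClass("scratch", 0, "quarterly cleanup"),
}

# Hypothetical rules mapping a file extension to a class (step 1).
RULES = {".pdf": "contracts", ".tmp": "scratch"}


def classify(filename: str) -> Optional[FileClass]:
    """Return the class for a file name, or None when no rule matches."""
    lowered = filename.lower()
    for extension, class_name in RULES.items():
        if lowered.endswith(extension):
            return CLASSES[class_name]
    # Unmatched files are the "stuff that shouldn't exist" question (step 3).
    return None


print(classify("Deal.PDF"))
print(classify("notes.txt"))
```

The code is the easy part. Getting people to agree on what goes into the two tables at the top is where the work is, and step 6 means doing it all again.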

I don't think I could get half the folks surveyed to say yes to this if there were only two in the population -- and I'm good.

I don't even believe the backward-looking stuff I read. Market share data, for example, has long been used by the industry and IT folks to base decisions on, and I don't think I've ever seen any I can trust. In a smaller market, with two or three players, it's doable, but in a big market it gets infuriatingly goofy. How can one firm say the data management market is a $28 billion space and another say it's $3.8 billion? Easy, it's all in how you define it. It doesn't do you any good to find out after the fact that you just spent $487,000 on the market-leading product when your purchase just doubled the size of the market, does it?

Look at the Fibre Channel switch market; there have been three players for the last three years: Brocade, McData, and Cisco. While I'm no statistical whiz, I know that in your shop you run McData in the core, connected to your mainframes. You probably run Brocade switches. You might be talking to Cisco, or even have implemented their director in non-mainframe environments. So how come the Fibre Channel networking market is listed as a $2.7 billion space? If I add up the total revenues of Brocade and McData, I come up with about $1.3-1.4 billion. I know anecdotally that Cisco adds $300-400 million to that, so even if you call it $500 million, it's at most $1.9 billion. Where's the $800 million? Or are those the invisible switch ports?
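For anyone who wants to check that back-of-the-envelope math, here it is as a short sketch. The figures are the ones quoted above; the variable names are only illustrative, and the Cisco number is the deliberately generous rounded-up one.

```python
# Figures quoted in the column, in billions of US dollars.
reported_market = 2.7              # published size of the Fibre Channel networking market
brocade_plus_mcdata = (1.3, 1.4)   # combined revenue range for Brocade and McData
cisco_generous = 0.5               # Cisco's $300-400 million, rounded up to be generous

# Upper bound on vendor revenue: top of the range plus the generous Cisco figure.
vendor_total_max = brocade_plus_mcdata[1] + cisco_generous

# Whatever is left over is revenue that no named vendor accounts for.
unexplained = reported_market - vendor_total_max

print(f"Vendor total (upper bound): ${vendor_total_max:.1f} billion")
print(f"Unexplained: ${unexplained * 1000:.0f} million")
```

Using the low end of the Brocade and McData range, or Cisco's anecdotal figure without the rounding, only makes the gap bigger.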

Since we never say what we mean in the IT industry, why should the numbers be any different?

The numbers should be a guide, but if they are just stupid, you should ignore them. Don't do anything because of some statistic alone; use statistics as a reference to point you in a direction. Real references from real people are better ways to buy anyway. Real people don't speak in statistics (normally). They say things like "that product sucked. It caused a fire alarm" or "this was the best money I've spent in years."

Now, having said that, you should absolutely be thinking about creating classes of files (or blocks, for that matter), and putting some high-level assumptions for each class down on paper. Then you should socialize those assumptions with other IT folks, business line folks, and even mucky mucks if you have the juice. I guarantee it will be a worthy exercise, and you'll find out that there are a lot of misaligned assumptions. Just being able to get folks back on the same page makes it a worthy endeavor. The mucky mucks probably think you already have your file data completely under control, and that Mr. Spitzer will never be able to find fault in the system. Or, you may find they want to begin the exercise and start to solve some real problems. I figure you have about a 4 percent chance of that.

Send me your questions -- about anything, really -- to sinceuasked@computerworld.com.

Steve Duplessie founded Enterprise Strategy Group Inc. in 1999 and has become one of the most recognized voices in the IT world. He is a regularly featured speaker at shows such as Storage Networking World, where he takes on what's good, bad -- and more importantly -- what's next.