Transmitting data from the middle of nowhere

02.12.2008
To move survey data from one location to another, Seth Georgion has had to devise creative ways to transmit it from remote locations, including ships in the Pacific Ocean and taverns in the Australian outback, over T1 lines and satellite links. Georgion, manager of IT for marine survey firm Fugro, also had to shrink the amount of information he was transmitting by culling duplicate data, which accounted for as much as half of what he was sending.

In the geographical and marine survey business, mobility is key, Georgion said. Most of the time, that means setting up a full data center near any location being surveyed to process the terabytes of data collected by sonar and laser equipment. But for Fugro, with 250 offices in about 55 countries, building and staffing a data center for every job was extremely costly, and it made the company less competitive. His solution: setting up servers that not only capture data, but deduplicate it and then replicate it over distance.

For example, Fugro is now performing the largest hydrographic survey in California's history, producing massive amounts of data about the state's entire coastline to help the state determine whether its fisheries are healthy. One type of sonar unit aboard a research ship records not only the seabed, but -- at specific intervals -- information about the entire water column from the surface to the bottom. The sonar unit delivers data to the ship so fast that Georgion can't keep up with it using network-attached storage (NAS) running at gigabit speeds.

A full water column scan shows every substance in the water, displaying the kinds of gases and fish that are there as well as what is occurring in the soil underneath. The amount of data from 100 feet of water can be tremendous.

"We have to use iSCSI just to get the throughput performance to write it from the sonar head to the disk - about 75MB/sec," said Georgion, who noted that the California coastal survey project has been under way for about five months and has another eight months to go.

For Fugro, which often produces seabed imagery for oil exploration companies, military organizations and government conservation agencies, throughput is critical: the faster survey data is processed, the faster it can be acted upon. Fugro's data center in San Diego relies on six NAS arrays as storage for processing hydrographic data. That data is protected with backup software, and Georgion uses a tape library equipped with LTO-3 and LTO-4 tape drives for archiving.

But for fast collection, deduplication and transmission of data, Georgion said only one company's appliance fit the bill: one from Data Domain Inc.

Data Domain makes disk-based storage appliances for data backup and disaster recovery that use a deduplicating compression algorithm designed to avoid transmitting data that already exists on the receiving system.
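
Data Domain's algorithm itself is proprietary; the sketch below only illustrates the general idea described here -- fingerprint each piece of data and skip anything the remote side already holds. The fixed 4MB chunk size and SHA-256 fingerprints are assumptions for illustration, not details from the article.

    import hashlib

    CHUNK_SIZE = 4 * 1024 * 1024   # assumed fixed-size chunks; real appliances segment data more cleverly

    def chunks_to_send(data, known_fingerprints):
        """Yield only the chunks whose fingerprints the remote site doesn't already hold."""
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in known_fingerprints:
                known_fingerprints.add(digest)
                yield digest, chunk          # new data: ship the chunk along with its fingerprint
            # duplicate chunks are skipped -- only the fingerprint ever needs to travel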

"The main problem people have in this industry is moving data. Once generated, how do you get [it] from point A to point B? Anything that holds, let's say 5TB of storage, is not portable," Georgion said. "So data movement is the number one problem for anyone in the survey or field scientific world."

On average, Fugro achieves a 50% data reduction using Data Domain's compression algorithm, which Georgion said is "really good considering that our data is largely image based, meaning that it's not typically compressible." "It basically doubles the line speed for the cost. Or in far-flung locales doubles the line speed that's available," he said, adding that the appliances send data between 20 and 24 hours a day at 100% of bandwidth capacity.

For the California coastline survey, Georgion set up one Data Domain appliance on the research boat; it receives its data from a NetApp NAS array that collects transmissions from the sonar head. Once the data is copied onto the Data Domain box, the software automatically compares it against a second Data Domain box in the San Diego data center and, after deduplication, begins sending.
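
A quick sanity check of the "doubles the line speed" claim, assuming the 50% reduction means only half the bytes ever cross the link (the T1 rate below is an assumption, not a figure from the article):

    # If deduplication halves what crosses the wire, the same link moves source data twice as fast.
    link_mbit_per_sec = 1.544          # assumed nominal T1 rate
    reduction = 0.50                   # the 50% data reduction Fugro reports
    effective_rate = link_mbit_per_sec / (1 - reduction)
    print(f"Effective source-data rate: {effective_rate:.2f} Mbit/s")   # ~3.09 Mbit/s over a 1.544 Mbit/s line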

On an average day, Fugro's coastline survey systems create about 100GB of data, "but it just never stops," Georgion said. "It just keeps growing and growing. It's 100GB a day, every day."

At times, hydrographic data floods in at 5TB a day, and that data not only needs to be written from the sonar equipment to disk arrays aboard ship, but then replicated to an onshore data center for analysis. That requires either satellite transmission or T1 lines -- and either can be a bottleneck.
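
A rough calculation shows why the link, not the storage, is the limit at these volumes (nominal T1 rate assumed, protocol overhead ignored):

    # Transfer times for the daily volumes mentioned in the article over a raw T1.
    T1_BITS_PER_SEC = 1.544e6
    for label, gigabytes in (("typical day", 100), ("peak day", 5_000)):   # 100 GB and 5 TB
        hours = gigabytes * 8e9 / T1_BITS_PER_SEC / 3600
        print(f"{label}: {gigabytes} GB would take roughly {hours:,.0f} hours")
    # typical day: ~144 hours; peak day: ~7,200 hours -- hence the need to shrink the data before sending it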

Georgion had previously set up a data center on the ship using NetApp NAS arrays. That not only required greater bandwidth to transmit data -- the deduplication algorithm was not as efficient -- but it also required a crew of IT personnel to support the setup. Both were costly.

"There's a real shortage of surveyors in the world," he said. "So if you're going to have them spending three or four hours a day copying data onto disk, verifying it, mailing it back, copying it on the other end, doing all these pieces, you're basically going to be taking away 30% maybe 40% of your total efficiency. You can't survive doing that. Plus, you can't manage it."

According to Georgion, Data Domain figured out how to do something no other vendor has yet been able to do cost-effectively: in-line data deduplication, or compression with only a minuscule performance overhead.

"In this day and age when we're moving 10TB at a time, are we going to move that at 10Mit/sec or 50Mbit/sec per second? No."

Georgion said the Data Domain boxes automatically compensate for any network latency, so if you tell the appliance the data will take 200 milliseconds to get from the ship to the San Diego data center -- and let it know the bandwidth of your network -- it will automatically stream the data for maximum performance.
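
That behavior sounds like tuning for the bandwidth-delay product -- keeping enough data in flight to fill the pipe despite the round-trip delay. A sketch of the underlying arithmetic, using the 200-millisecond figure from the article and an assumed link rate:

    # Bandwidth-delay product: how much data must stay "in flight" so a long, slow link never sits idle.
    link_mbit_per_sec = 2.0      # assumed satellite/T1-class uplink, not a figure from the article
    round_trip_ms = 200          # latency quoted for the ship-to-San-Diego hop
    bdp_bytes = (link_mbit_per_sec * 1e6 / 8) * (round_trip_ms / 1000)
    print(f"Keep about {bdp_bytes / 1024:.0f} KB unacknowledged on the wire")   # ~49 KB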

Compensating for network latency was also key for another Fugro project: surveying 40% of Australia's coastline by airplane. The project involves flying a small plane over the coastline and shooting a laser from the plane, which penetrates the water's surface and maps the seabed. Because the area being surveyed -- Queensland, in northeastern Australia -- is so remote, the only place available to transmit the data to San Diego without setting up a field data center was a small room over an old bar in the Outback, said Georgion. He's using a T1 line that comes in over the bar's phone line.

"It's like one of those places that are not near anything for, like, 1,000 miles," Georgion says.

The Australian coastline survey project was split into two parts, with half taking place before this year's rainy season and half after. The first batch of survey data was collected and transmitted in a traditional manner, by setting up a data center with NetApp arrays and a StorageTek tape library in a hangar. Because of the cost involved, that phase was unprofitable, Georgion said. After the four-month rainy season ended, Georgion set up the Data Domain boxes in the ramshackle Outback apartment to transmit the data to San Diego for processing. "It's turned us from unprofitable to [profitable]," he said.