Posts Tagged ‘Big Data’
South of Iran, east of Saudi Arabia, and north of Oman is Dubai, an emirate (political territory) of the U.A.E. (the United Arab Emirates). In addition to being the location of Burj Khalifa (“Khalifa Tower” in English), the current tallest building in the world, Dubai also hosts an international airport unlike any other. The Dubai International Airport (DXB) holds the record for the world’s busiest airport. This “mega-airport” expects to serve a staggering 120 million customers this year. Compare that to the measly 94 million passengers that the Hartsfield-Jackson Atlanta International Airport in Georgia handled in 2013, and the 72 million that passed through London-Heathrow Airport in the United Kingdom in 2014.
Any traveler who has missed a connecting flight because the gates were too far apart, or who has ended up standing in the wrong line because the writing on the boarding pass or the announcements over the intercom were in a different language, has to wonder how an airport of any size could successfully handle 120 million people.
When asked about the likelihood of issues such as people getting lost and luggage being left behind, Dubai Airports CEO Paul Griffiths explained at the ATIS (Air Transport Industry Summit) that the efficient and intelligent analysis of real-time big data will keep the airport running as securely as a Boeing 717. “We keep increasing the size of the pipe but actually what our passengers want is to spend less time going through (the) process,” Griffiths said in an interview with Gulf Business. “This is where technology comes to the fore with more efficient operations. It’s about the quality of the personalized customer experience where people don’t have to walk more than 500 meters. That is the design goal and technology is central to that.”
The big data that Griffiths referred to is a massive collection of information about distances between airport gates, baggage handling efficiency, and flight durations, among other statistics. All of this information, interpreted by “intelligent systems,” will transform DXB and airports like it in three ways:
1. Increased Efficiency. The Dubai International Airport is an exception to the rule that larger airports are more difficult to traverse. Real-time calculations will allow the air traffic control tower to guide airplanes to terminals close to the connecting flights each passenger requires.
2. Improved Customer Experience. Everything’s getting smarter, including the boarding passes. Instead of printing everything only in Arabic and English, the analysis of big data information such as a person’s native language will result in better, readable, personalized boarding passes tailored to each individual.
3. Cost Reduction. An increase in customers, plus increased efficiency, plus an improved customer experience means that Dubai’s profits will soar. When the statistics are examined a year from now, the money saved by not having to reroute passengers and pay for missed flights and hotel stays will undoubtedly be the final proof that big data analytics tools are transformative.
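To make the first point concrete, here is a minimal sketch of how a gate could be chosen so that connecting passengers walk as little as possible. It is an illustration only, not a description of the system Dubai Airports uses: the function name, the gate labels, the passenger counts, and the distances are all hypothetical.

```python
def best_gate(free_gates, connections, distance_m):
    """Pick the free gate with the lowest total passenger walking distance.

    free_gates:  list of gate names available for the arriving flight
    connections: departure gate -> number of arriving passengers connecting there
    distance_m:  (arrival_gate, departure_gate) -> walking distance in meters
    All inputs are made-up illustration data, not real airport figures.
    """
    def total_walk(gate):
        # Sum of (passengers x meters) over every onward connection.
        return sum(count * distance_m[(gate, dep)]
                   for dep, count in connections.items())

    return min(free_gates, key=total_walk)


free_gates = ["A1", "B4"]
connections = {"B6": 120, "A2": 30}
distance_m = {
    ("A1", "B6"): 900, ("A1", "A2"): 100,
    ("B4", "B6"): 150, ("B4", "A2"): 800,
}
chosen = best_gate(free_gates, connections, distance_m)
print(chosen)
```

A real system would also have to weigh gate availability windows, aircraft size, and baggage routing; the sketch only captures the idea of letting passenger data drive the choice.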
“I believe that technology will take center stage in the future of aviation,” Griffiths said. “Airports, for too long, have been considered just infrastructure businesses. Actually, we have a vital role to play in enabling a level of customer service that certain airlines have already got right in the air but some airports have let them down with on the ground.”
The increasing demands found in data centers can be difficult for most people to keep up with. We now live in a world where data is being generated at an astounding pace, which has led experts to coin the phrase “big data.” All that generated data is also being collected, which creates even bigger demands for enterprise data storage. Consider all the different trends currently going around, from video and music streaming to the rise of business applications to detailed financial information and even visual medical records. It’s no wonder that storage demands have risen around 50 percent annually in the past few years, and there appears to be nothing on the horizon that will slow that growth. Companies have reason for concern as current data demands threaten to stretch their enterprise storage to its breaking point, but IT departments aren’t helpless in this struggle. This data deluge can be managed; all that’s needed are the right strategies and technologies to handle it.
It isn’t just that so much new data needs to be stored; all of that data must be stored securely while still allowing authorized personnel to access it efficiently. Combine that with a rapidly changing business environment where needs can evolve almost daily, and the demands for an agile and secure enterprise storage system can overwhelm organizations. The trick is to construct infrastructure that can manage these demands. A well-designed storage network can relieve many of the headaches that come with large amounts of data, but such a network requires added infrastructure support.
Luckily, IT departments have many options to choose from that can meet the demands of the data deluge. One of the most popular at the moment is storage virtualization. This technology works by combining multiple network storage devices so that they appear to be a single unit. Choosing the components for a virtualized storage system, however, can be a tough decision for companies to make. Network-attached storage (NAS), for instance, helps people within an organization access the same data at the same time. Storage area networks (SAN) help make planning and implementing storage solutions much easier. Both carry certain advantages over the more traditional direct-attached storage (DAS) deployments seen in many businesses. DAS simply comes with too many risks and downsides, making it a poor choice when confronting the current data challenges many companies face. Both NAS and SAN can simplify storage administration, an absolute must now that storage management has become so complex. They also reduce the amount of hardware needed thanks to converged infrastructure technology.
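The idea of several devices appearing as one unit can be sketched in a few lines. This is a toy model under stated assumptions, not how any particular NAS or SAN product works: the `Device` and `VirtualPool` classes, the device names, and the fill-in-order allocation policy are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """A hypothetical physical storage device with a fixed capacity in GB."""
    name: str
    capacity_gb: int
    used_gb: int = 0

    @property
    def free_gb(self) -> int:
        return self.capacity_gb - self.used_gb


@dataclass
class VirtualPool:
    """Presents several devices to callers as one logical unit."""
    devices: list = field(default_factory=list)

    @property
    def capacity_gb(self) -> int:
        return sum(d.capacity_gb for d in self.devices)

    @property
    def free_gb(self) -> int:
        return sum(d.free_gb for d in self.devices)

    def allocate(self, size_gb: int) -> dict:
        """Spread an allocation across devices; return GB placed per device."""
        if size_gb > self.free_gb:
            raise ValueError("pool does not have enough free space")
        placement = {}
        remaining = size_gb
        for device in self.devices:
            if remaining == 0:
                break
            # Fill each device in order until the request is satisfied.
            chunk = min(device.free_gb, remaining)
            if chunk > 0:
                device.used_gb += chunk
                placement[device.name] = chunk
                remaining -= chunk
        return placement


pool = VirtualPool([Device("array-a", 100), Device("array-b", 200)])
placement = pool.allocate(150)
print(pool.capacity_gb, placement, pool.free_gb)
```

The caller asks the pool for 150 GB and never needs to know that the request was split across two devices, which is the administrative simplification the paragraph describes.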
But these strategies aren’t the only ones companies can use to keep up with enterprise storage demands. Certain administrative tactics can be deployed to handle the growing volume and complexity of the current storage scene. Part of that strategy is avoiding certain mistakes, such as storing non-critical data on costly storage devices. There’s also the problem of storing too much. In some cases, business leaders ask IT workers to store multiple copies of information, even when the multiple copies aren’t needed. IT departments need to work closely with the business side of the company to devise the right strategy to avoid these unnecessary complications. By streamlining the process, it can become easier to manage storage.
Other options are also readily available to meet enterprise storage demands. Cloud storage, for example, has quickly become mainstream and comes with attractive advantages, such as easy scalability when businesses need it and the ability to access data from almost anywhere. Concerns over data security have made some businesses reluctant to adopt the cloud, but many cloud storage vendors are trying to address those worries with greater emphasis on security features. Hybrid storage solutions are also taking off in popularity in part because they mix many of the advantages found in other storage options.
With the demands large amounts of data are placing on enterprise storage, IT departments are searching for the answers that can help them keep up with these challenges. The options are there that help meet these demands, but it’s up to companies to fully deploy those solutions. Data continues to be generated at a breakneck pace, and that trend won’t be slowing down anytime soon. It’s up to organizations to have the right strategies and technology in place to take full advantage of this ongoing data deluge.
Most companies by now understand the inherent value found in big data. With more information at their fingertips, they can make better decisions regarding their businesses. That’s what makes the collection and analysis of big data so important today. Any company that doesn’t see the advantages that big data brings may quickly find themselves falling behind their competitors. To benefit even more from big data, many companies are employing big data strategies. They see that it is not enough to simply have the data at hand; it must be utilized in the most effective manner to maximize its potential. Coming up with the best big data strategy, however, can be difficult, especially since every organization has different needs, goals, and resources. When creating a big data strategy, it’s important for companies to consider several main issues that can greatly affect its implementation.
When first developing a big data strategy, businesses will need to look at the current company culture and change it if necessary. This essentially means to encourage employees throughout the whole organization to get into the spirit of embracing big data. That includes people on the business side of things along with those in the IT department. Big data can change the way things are done, and those who are resistant to those changes could be holding the company back. For that reason, they should be encouraged to be more open about the effect of big data and ready to accept any changes that come about. Organizations should also encourage their employees to be creative with their big data solutions, basically fostering an atmosphere of experimentation while being willing to take more risks.
As valuable as big data can be, simply collecting it for the sake of collecting it will often result in failure. Every big data strategy needs to account for specific business objectives and goals. By identifying precisely what they want to do with their data, companies can enact a strategy that drives toward those objectives. This makes the strategy more effective, allowing organizations to avoid wasting money and resources on efforts that won’t benefit the company. Knowing the business objectives of a big data strategy also helps companies identify what data sources to focus on and what sources to steer clear of.
It’s the value that big data brings to an organization that makes it so crucial to properly use it. When creating a big data strategy, businesses need to make sure they view big data as a company-wide asset, one which everyone can use and take advantage of. Too often big data is seen as something meant solely for the IT department, but it can, in fact, benefit the organization as a whole. Big data shouldn’t be exclusive to only one group within a company. On the contrary, the more departments and groups can use it, the more valuable it becomes. That’s why big data strategies need a bigger vision for how data can be used, looking ahead to the long-term and avoiding narrowly-defined plans. This allows companies to dedicate more money and resources toward using big data, which helps them to innovate and use it to create new opportunities.
Another point all organizations need to consider is the kind of talent present in their companies. Data scientists are sought by businesses the world over because they can provide a significant boost to accomplishing established big data business goals. Data scientists are different from data analysts since they can actually build new data models, whereas analysts can only use models that have been pre-made. As part of a big data strategy, the roles and responsibilities of data scientists need to be properly defined, giving them the opportunity to help the organization achieve the stated business objectives. Finding a good data scientist with skills involving big data platforms and ad hoc analysis that are appropriate for the industry can be difficult with demand so high, but the value they can add is well worth it.
An organized and thoughtful big data strategy can often mean the difference between successful use of big data and a lot of wasted time, effort, and resources. Companies have a number of key considerations to account for when crafting their own strategies, but with the right mindset, they’ll know they have the right plans in place. Only then can they truly gain value from big data and propel their businesses forward.
Right now, we live in the big data era. What was once looked at as a future trend is now very much our present reality. Businesses and organizations of all shapes and sizes have embraced big data as a way to improve their operations and find solutions to longstanding problems. It’s almost impossible to overstate just how much big data has impacted the world in such a short amount of time, affecting everyone’s life whether or not we truly comprehend how. That means we live in a world awash in data, and as companies pursue their own big data strategies, they’ve had to rethink how to store all that information. Traditional techniques have proven unable to handle the huge amount of data being generated and collected on a daily basis. What once was dominated by hard disk drives (HDD) is now rapidly changing into a world driven by solid-state drives (SSD), otherwise known as flash storage.
For years, when talking of big data analytics, the assumption was that a business was using disk. There were several reasons for this, the main one being cost. Hard disk drives were simply cheaper, and for the most part they could deal with the increasing workloads placed upon them. The more data measured and generated, however, the more the limitations of HDD were unmasked. This new big data world needed a storage system capable of handling the workload, and thus the migration to flash storage began.
Many, including Gartner, peg 2013 as the year the switch really gained steam. Solid-state arrays had already been a storage strategy up until then, but in 2013 flash storage manufacturers began constructing arrays with new features like thin provisioning, deduplication, and compression. Suddenly, the benefits gained from using flash storage outweighed some of the drawbacks, most notably the higher cost. In a single year, solid-state arrays saw a surge in sales, increasing by more than 180 percent from 2012. With the arrival of flash storage to the mainstream, organizations could begin to replace their hard disk drives with a system more capable of processing big data.
And that’s really a main reason why flash storage has caught on so quickly. SSDs provide much higher performance than traditional storage options. Of particular note is the reduction in the time it takes to process data. Just one example of this is the experience of the Coca-Cola Bottling Co., which began collecting big data but was soon met by long delays in production due to having to sort through loads of new information. When the company adopted flash storage solutions, the amount of time needed to process data was cut dramatically. For example, processing jobs that had taken 45 minutes now took only six. These kinds of results aren’t unique, which is why so many other businesses are seeking flash storage as their primary means of storing big data.
Many tech companies are responding to this increased demand by offering up more options in flash storage. SanDisk has recently unveiled new flash systems specifically intended to help organizations with their efforts in big data analytics. The new offerings are meant to be an alternative to the tiered storage often seen in data centers. Other major tech companies, such as Dell, Intel, and IBM, have shown similar support for flash storage, indicating the lucrative nature of offering flash solutions. The growth isn’t just being driven by private companies either; educational institutions have found a need for flash storage as well. MIT researchers announced last year that they would be switching from hard disks to flash storage in order to handle the demands of big data more effectively. The researchers determined that hard disk drives were too slow, so a better performing storage solution was needed.
As can be seen, flash storage has been quietly but surely taking over hard disk’s turf. That doesn’t mean hard disk drives will soon be gone for good. HDD will likely still be used for offline storage — mainly archiving purposes for data that doesn’t need to be accessed regularly. But it’s clear we’re moving into a world where solid-state drives are the most prevalent form of storage. The need to collect and process big data is making that happen, providing new, unique opportunities for all kinds of organizations out there.
Big data is where it’s at. At least, that’s what we’ve been told. So it should come as no surprise that businesses are busy imagining ways they can take advantage of big data analytics to grow their companies. Many of these uses are fairly well documented, like improving marketing efforts, or gaining a better understanding of their customers, or even figuring out better ways to detect and prevent fraud. The most common big data use cases have become an important part of industries the world over, but big data can be used for much more than that. In fact, many companies out there have come up with creative and unusual uses for big data analytics, showing just how versatile and helpful big data can be.
1. Parking Lot Analytics
Every business is trying to gauge how well they are doing, and big data is an important part of that. Perhaps some study the data that comes from their websites, or others look at how effective their marketing campaigns are. But can businesses measure their success by studying their parking lots? One startup is doing that very thing. Using satellite imagery and machine learning techniques, Orbital Insight is working with dozens of retail chains to analyze parking lots. From this data, the startup says it can assess the performance of each company without needing further information. Their algorithm uses deep learning to delve into the numbers and find unique insights.
2. Dating Driven By Data
Big data is changing the way people date. Many dating websites, like eHarmony, use the data they compile on their users to come up with better matches, increasing the odds they’ll find someone they’re compatible with. With open source tools like Hadoop, dating sites can gain detailed data on users through answers to personal questions as well as through behaviors and actions taken on the site. As dating sites collect more data on their customers, they’ll be able to more accurately predict who matches well with whom.
3. Data at the Australian Open
Many sports have adopted big data to get a better understanding of their respective games, but big data is also being used in a business sense in the sports world. The Australian Open relies heavily on big data during the tournament in response to the demands of tennis fans around the world. With big data, they can optimize tournament schedules and analyze information like social media conversations and player popularity. From there, the data is used to predict viewing demands on the tournament’s website, helping organizers determine how much computing power they need at any given time.
4. Dynamic Ticket Pricing
The NFL is also using big data analytics to boost their business. While it might seem like the NFL doesn’t need help in this regard, they still want to use big data to increase ticket sales. The goal is to institute variable ticket pricing, which has already been implemented by some teams. Using big data, NFL teams can determine the level of demand for specific games based on factors like where it falls in the season, who the opponent is, and how well the home team is playing. If it’s determined demand is high, ticket prices will go up. If demand is predicted to be low, prices will go down, hopefully increasing sales. With dynamic ticket pricing, fans wouldn’t have to pay high prices for games that are in low demand, creating more interest in the product, especially if a team is struggling.
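The pricing logic described above can be sketched as a simple function. This is an illustrative model only, not the method any NFL team actually uses: the factor names, the weights, the 0-to-1 scales, and the 30 percent cap are all assumptions made for the example.

```python
def estimate_demand(season_progress, opponent_strength, home_form):
    """Combine three factors (each scaled 0 to 1) into one demand score.

    The weights are hypothetical; a real model would be fitted to sales data.
    """
    return 0.2 * season_progress + 0.4 * opponent_strength + 0.4 * home_form


def dynamic_price(base_price, demand_score, max_swing=0.30):
    """Scale a base ticket price by predicted demand.

    A demand_score of 0.5 means average demand and leaves the price unchanged.
    max_swing caps the adjustment (0.30 = at most 30% above or below base).
    """
    if not 0.0 <= demand_score <= 1.0:
        raise ValueError("demand_score must be between 0 and 1")
    adjustment = (demand_score - 0.5) * 2 * max_swing
    return round(base_price * (1 + adjustment), 2)


# A late-season game against a strong opponent with the home team playing well.
hot_game = dynamic_price(100.0, estimate_demand(0.9, 0.9, 0.8))
# An early-season game against a weak opponent with the home team struggling.
cold_game = dynamic_price(100.0, estimate_demand(0.1, 0.2, 0.2))
print(hot_game, cold_game)
```

Capping the swing is a design choice that keeps prices from moving so far that fans feel gouged for big games or that revenue collapses for small ones.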
5. Ski Resorts and Big Data
Many ski resorts are truly embracing the possibilities of big data. This is done through basic ideas, like saving rental information, but it can also be used to prevent ticket fraud, which can take out a good chunk of revenue. Most impressive is how big data is used to increase customer engagement through gamification. With Radio Frequency Identification (RFID) systems, resorts can actually track skiers, compiling stats like number of runs made, number of feet skied, and how often they get to the slopes. This data can be accessed on a resort’s website where skiers can compete with their friends, earning better rankings and rewards which encourage them to spend more time on the slopes.
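As a rough sketch of the gamification idea, RFID scans at lift gates could be rolled up into a leaderboard like this. The data layout, the lift names, and the vertical-feet figures are hypothetical; real resort systems will differ.

```python
from collections import defaultdict


def build_leaderboard(lift_scans, vertical_feet):
    """Aggregate RFID lift scans into per-skier stats, ranked by feet skied.

    lift_scans:    list of (skier_id, lift_name) tuples, one per scan
    vertical_feet: lift_name -> vertical feet covered by one run from that lift
    Returns a list of (skier_id, runs, feet) tuples, highest feet first.
    """
    totals = defaultdict(lambda: {"runs": 0, "feet": 0})
    for skier, lift in lift_scans:
        totals[skier]["runs"] += 1
        totals[skier]["feet"] += vertical_feet[lift]
    rows = [(skier, t["runs"], t["feet"]) for skier, t in totals.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


scans = [("ana", "Summit"), ("ben", "Bunny"), ("ana", "Bunny"), ("ben", "Bunny")]
vertical = {"Summit": 2000, "Bunny": 500}
leaderboard = build_leaderboard(scans, vertical)
print(leaderboard)
```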
These cases show that with a bit of creative thinking, big data can help businesses in more ways than one. As companies become more familiar working with big data, it’s easy to see how unique and innovative solutions will likely become the norm. As unusual as some of these uses may be, they may represent only the beginning of many unique ventures in the future.
Big data is a boon to every industry. And as data volumes continue their exponential rise, the need to protect sensitive information from being compromised is greater than ever before. The recent data breach at Sony Pictures and new national threats from foreign factions serve as a cautionary tale for government and private enterprise to be constantly on guard and on the lookout for new and better solutions to keep sensitive information secure.
One security solution, “data masking”, is the subject of a November 2014 article on Nextgov.com.
In the article, Ted Girard, a vice president at Delphix Federal, defines data masking and describes its applications in the government sector. Because data masking also has non-government applications, organizations wondering whether they should consider this solution for their original production data should find the following takeaways from the Nextgov article helpful.
The information explosion
Girard begins by stating the plain and simple truth that in this day and age of exploding volumes of information, “data is central to everything we do.” That being said he warns that, “While the big data revolution presents immense opportunities, there are also profound implications and new challenges associated with it.” Among these challenges, according to Girard, are protecting privacy, enhancing security and improving data quality. “For many agencies just getting started with their big data efforts”, he adds, “these challenges can prove overwhelming.”
The role of data masking
Speaking specifically of governmental needs to protect sensitive health, education, and financial information, Girard explains that data masking is, “a technique used to ensure sensitive data does not enter nonproduction systems.” Furthermore, he explains that data masking is, “designed to protect the original production data from individuals or project teams that do not need real data to perform their tasks.” With data masking, so-called “dummy data”— a similar but obscured version of the real data—is substituted for tasks that do not depend on real data being present.
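A minimal sketch of the substitution step might look like the following. It is not the technique Delphix or any other vendor uses: the function name, the field names, the salt, and the choice of a truncated salted hash are assumptions for illustration, and a plain hash of low-entropy values such as ID numbers is not strong protection on its own.

```python
import hashlib


def mask_record(record, sensitive_fields, salt="demo-salt"):
    """Return a copy of record with sensitive fields replaced by dummy tokens.

    Each sensitive value is swapped for a short salted SHA-256 digest, so the
    same input always maps to the same token (keeping joins across tables
    consistent) while the real value no longer appears in the copy.
    """
    masked = {}
    for key, value in record.items():
        if key in sensitive_fields:
            digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
            masked[key] = "MASKED-" + digest[:10]
        else:
            masked[key] = value
    return masked


# Hypothetical production row; only the masked copy goes to nonproduction use.
production_row = {"name": "Jane Doe", "ssn": "123-45-6789", "state": "UT"}
test_row = mask_record(production_row, {"name", "ssn"})
print(test_row)
```

Fields that are not sensitive pass through unchanged, so developers and testers still get data with a realistic shape.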
The need for “agile” data masking solutions
As Girard points out, one of the problems associated with traditional data masking is that, “every request by users for new or refreshed data sets must go through the manual masking process each time.” This, he explains, “is a cumbersome and time-consuming process that promotes ‘cutting corners’– skipping the process altogether and using old, previously masked data sets or delivering teams unmasked versions.” As a result, new agile data masking solutions have been developed to meet the new demands associated with protecting larger volumes of information.
According to Girard, the advantage of agile data masking is that it, “combines the processes of masking and provisioning, allowing organizations to quickly and securely deliver protected data sets in minutes.”
The need for security and privacy
As a result of collecting, storing and processing sensitive information of all kinds, government agencies need to keep that information protected. Still, as Girard points out, “Information security and privacy considerations are daunting challenges for federal agencies and may be hindering their efforts to pursue big data programs.” The good news with “advance agile masking technology”, according to Girard, is that it helps agencies, “raise the level of security and privacy assurance and meet regulatory compliance requirements.” Thanks to this solution, Girard says that, “sensitive data is protected at each step in the life cycle automatically.”
Preserving data quality
Big data does not necessarily mean better data. According to Girard, a major cause of many big data project failures is poor data. In dealing with big data, Girard says that IT is faced with two major challenges:
1. “Creating better, faster and more robust means of accessing and analyzing large data sets…to keep pace.”
2. “Preserving value and maintaining integrity while protecting data privacy….”
Both of these challenges are formidable, especially with large volumes of data migrating across systems. As Girard explains, “…controls need to be in place to ensure no data is lost, corrupted or duplicated in the process.” He goes on to say that, “The key to effective data masking is making the process seamless to the user so that new data sets are complete and protected while remaining in sync across systems.”
The future of agile data masking
Like many experts, Girard predicts that big data projects will become a greater priority for government agencies over time. Although not mentioned in the article, the NSA’s recent installation of a massive $1.5 billion data center in Utah serves as a clear example of the government’s growing commitment to big data initiatives. In order to successfully analyze vast amounts of data securely and in real time going forward, Girard says that agencies will need to, “create an agile data management environment to process, analyze and manage data and information.”
In light of growing security threats, organizations looking to protect sensitive production data from being compromised in less-secure environments should consider data masking as an effective security tool for both on-premises and cloud-based big data platforms.
One of the hallmarks of the holiday season is Christmas television specials. This season I once again watched one of my favorite specials, Rudolph the Red-Nosed Reindeer. One of my favorite scenes is the Island of Misfit Toys, which is a sanctuary for defective and unwanted toys.
This year the Internet of Things became fit for things for tracking fitness. While fitness trackers were among the hottest tech gifts this season, it remains to be seen whether these gadgets are little more than toys at this point in their evolution. Right now, as J.C. Herz blogged, they are “hipster pet rocks, devices that gather reams of largely superficial information for young people whose health isn’t in question, or at risk.”
Some describe big data as reams of largely superficial information. Although fitness and activity trackers count things such as steps taken, calories burned, and hours slept, as William Bruce Cameron remarked, in a quote often misattributed to Albert Einstein, “Not everything that can be counted counts, and not everything that counts can be counted.”
One interesting count is the use of these trackers. “More than half of US consumers,” Herz reported, “who have owned an activity tracker no longer use it. A third of them took less than six months from unboxing the device to shoving it in a drawer or fobbing it off on a relative.” By contrast, “people with chronic diseases don’t suddenly decide that they’re over it and the novelty has worn off. Tracking and measuring—the quantified self—is what keeps them out of the hospital.”
Unfortunately, even though these are the people who could most benefit from fitness and activity trackers, manufacturers, Herz lamented, “seem more interested in helping the affluent and tech-savvy sculpt their abs and run 5Ks than navigating the labyrinthine world of the FDA, HIPAA, and the other alphabet soup bureaucracies.”
In the original broadcast of Rudolph the Red-Nosed Reindeer in 1964, Rudolph failed to keep his promise to return to the Island of Misfit Toys and rescue them. After the special aired, television stations were inundated with angry letters from children demanding that the Misfit Toys be helped. Beginning the next year, the broadcast concluded with a short scene showing Rudolph leading Santa back to the island to pick up the Misfit Toys and deliver them to new homes.
For fitness and activity trackers to avoid the Internet of Misfit Things, they need to make—and keep—a promise to evolve into wearable medical devices. Not only would this help the healthcare system allay the multi-trillion dollar annual cost of chronic disease, but it would also allow manufacturers to compete in the multi-billion dollar medical devices market. This could bring better health and bigger profits home for the holidays next year.
In terms of trends, few are as clear and popular as flash storage. While other technology trends might be more visible among the general public (think the explosion of mobile devices), the rise of flash storage among enterprises of all sizes has the potential to make just as big of an impact in the world, even if it happens beneath the surface. There’s little question that the trend is growing and looks to continue over the next few years, but the real question revolves around flash storage and the other mainstream storage option: hard disk drives (HDD). While HDD remains more widely used, flash storage is quickly gaining ground. The question then becomes, how long do we have to wait before flash storage not only overtakes hard drives but becomes the only game in town? A careful analysis reveals some intriguing answers and possibilities for the future, but one that emphasizes a number of obstacles that still need to be overcome.
First, it’s important to look at why flash storage has become so popular in the first place. One of the main selling points of flash storage, or solid-state drives (SSD), is its speed. Compared to hard drives, flash storage reads and writes data much faster. This is achieved by storing data on rewritable memory cells, which don’t require moving parts like hard disk drives and their rotating disks (this also means flash storage is more durable). Increased speed and better performance mean apps and programs can launch more quickly. The capabilities of flash storage have become sorely needed in the business world since companies are now dealing with large amounts of information in the form of big data. To properly process and analyze big data, more businesses are turning to flash, which has sped up its adoption.
While it’s clear that flash array storage features a number of advantages in comparison to HDD, these advantages don’t automatically mean it is destined to be the sole storage option in the future. For such a reality to come about, solutions to a number of flash storage problems need to be found. The biggest concern and largest drawback to flash storage is the price tag. Hard drives have been around a long time, which is part of the reason the cost to manufacture them is so low. Flash storage is a more recent technology, and the price to use it can be a major barrier limiting the number of companies that would otherwise gladly adopt it. A cheap hard drive can be purchased for around $0.03 per GB. Flash storage is much more expensive at roughly $0.80 per GB. While that may not seem like much, keep in mind that’s about 27 times more expensive. For businesses being run on a tight budget, hard drives seem to be the more practical solution.
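The price gap is easier to appreciate at scale. The short calculation below uses the two per-GB prices quoted above; the 10 TB capacity is a hypothetical figure chosen only for illustration.

```python
HDD_COST_PER_GB = 0.03    # USD per GB, as quoted in the text
FLASH_COST_PER_GB = 0.80  # USD per GB, as quoted in the text


def storage_cost(capacity_gb, cost_per_gb):
    """Raw media cost in USD for a given capacity."""
    return capacity_gb * cost_per_gb


ratio = FLASH_COST_PER_GB / HDD_COST_PER_GB
capacity_gb = 10_000  # hypothetical 10 TB deployment
hdd_total = storage_cost(capacity_gb, HDD_COST_PER_GB)
flash_total = storage_cost(capacity_gb, FLASH_COST_PER_GB)

print(f"Flash costs about {ratio:.1f}x as much per GB")
print(f"10 TB on HDD:   ${hdd_total:,.2f}")
print(f"10 TB on flash: ${flash_total:,.2f}")
```

At these prices the same 10 TB costs a few hundred dollars on hard drives and several thousand on flash, before accounting for controllers, redundancy, power, or the data-reduction features that can narrow the gap.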
Beyond the price, flash storage may also run into problems down the line. While it’s true that flash storage is faster than HDD, it also has a more limited lifespan. Flash cells can only be rewritten so many times, so the more heavily a business writes to a drive, the sooner its cells wear out and the more its reliability and performance can suffer. New technology has the potential to increase that lifespan, but it’s still a concern that enterprises will have to deal with in some fashion. Another problem is that many applications and systems that have been in use for years were designed with hard drives in mind. Apps and operating systems are starting to be created with SSD as the primary storage option, but more changes to existing programs need to happen before flash storage becomes the dominant storage solution.
So, getting back to the original question: when will flash storage be the new king of storage options? Is such a future even likely? Experts differ on what will happen within the next few years. Some believe that it will be a full decade before flash storage is more widely used than hard drives. Others say that treating hard drives and flash storage as competitors is the wrong perspective. They say the future lies not with one or the other but with both used in tandem in hybrid systems. The idea is to use flash storage for active data that is accessed frequently, while hard drives handle bulk storage and archiving. There are also experts who say that debating which storage option will win out is pointless because, within the next decade, better storage technologies like memristors, phase-change memory, and even atomic memory will become more mainstream. However the topic is approached, the current advantages of flash storage make it an easy choice for enterprises with the resources to use it. For now, the trend toward flash looks set to continue its impressive growth.
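The hybrid idea described above can be sketched as a simple placement rule. This is an illustration rather than how any particular storage product works: the `Dataset` type, the dataset names, and the 100-reads-per-day cutoff are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    reads_per_day: int


# Hypothetical cutoff: anything read at least this often counts as "active".
HOT_THRESHOLD_READS_PER_DAY = 100


def choose_tier(dataset: Dataset) -> str:
    """Place frequently read data on flash and everything else on HDD."""
    if dataset.reads_per_day >= HOT_THRESHOLD_READS_PER_DAY:
        return "flash"
    return "hdd"


# Made-up datasets, used only to exercise the rule.
datasets = [
    Dataset("orders_current_quarter", reads_per_day=5000),
    Dataset("clickstream_last_week", reads_per_day=800),
    Dataset("logs_2012_archive", reads_per_day=2),
]

placement = {d.name: choose_tier(d) for d in datasets}
for name, tier in placement.items():
    print(f"{name}: {tier}")
```

A real hybrid system would likely track access patterns over time and move data between tiers automatically. A fixed threshold is the simplest version of that idea.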
The second of the five biggest data myths debunked by Gartner is that the huge volume of data organizations now manage makes individual data quality flaws insignificant, a belief many IT leaders justify by appealing to the law of large numbers.
Their view is that individual data quality flaws don’t influence the overall outcome when the data is analyzed because each flaw is only a tiny part of the mass of big data. “In reality,” as Gartner’s Ted Friedman explained, “although each individual flaw has a much smaller impact on the whole dataset than it did when there was less data, there are more flaws than before because there is more data. Therefore, the overall impact of poor-quality data on the whole dataset remains the same. In addition, much of the data that organizations use in a big data context comes from outside, or is of unknown structure and origin. This means that the likelihood of data quality issues is even higher than before. So data quality is actually more important in the world of big data.”
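Friedman’s point can be illustrated with a little arithmetic. The sketch below assumes a fixed flaw rate of 2%, a figure invented for the example rather than taken from Gartner, and compares a small dataset with a large one. Each individual flaw becomes a smaller share of the whole, but the share of flawed records does not shrink.

```python
from typing import NamedTuple

# Hypothetical flaw rate: 2% of records are wrong regardless of dataset size.
FLAW_RATE = 0.02


class FlawSummary(NamedTuple):
    total_records: int
    flawed_records: int
    share_per_flaw: float   # weight of one flawed record in the dataset
    share_overall: float    # weight of all flawed records together


def summarize(total_records: int) -> FlawSummary:
    """Count flawed records and express them as shares of the dataset."""
    flawed = round(total_records * FLAW_RATE)
    return FlawSummary(
        total_records,
        flawed,
        1 / total_records,
        flawed / total_records,
    )


small = summarize(10_000)
large = summarize(10_000_000)

for s in (small, large):
    print(
        f"{s.total_records:>12,} records: {s.flawed_records:>9,} flawed, "
        f"each flaw = {s.share_per_flaw:.7%} of the data, "
        f"all flaws = {s.share_overall:.2%}"
    )
```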
“Convergence of social, mobile, cloud, and big data,” Gartner’s Svetlana Sicular blogged, “presents new requirements: getting the right information to the consumer quickly, ensuring reliability of external data you don’t have control over, validating the relationships among data elements, looking for data synergies and gaps, creating provenance of the data you provide to others, spotting skewed and biased data. In reality, a data scientist job is 80% of a data quality engineer, and just 20% of a researcher, dreamer, and scientist.”
This aligns with Steve Lohr of The New York Times reporting that data scientists are more often data janitors since they spend from 50 percent to 80 percent of their time mired in the more mundane labor of collecting and preparing unruly big data before it can be mined to discover the useful nuggets that provide business insights.
“As the amount and type of raw data sources increases exponentially,” Stefan Groschupf blogged, “data quality issues can wreak havoc on an organization. Data quality has become an important, if sometimes overlooked, piece of the big data equation. Until companies rethink their big data analytics workflow and ensure that data quality is considered at every step of the process—from integration all the way through to the final visualization—the benefits of big data will only be partly realized.”
So no matter what you heard or hoped, the truth is big data needs data quality too.
The third of the five biggest data myths debunked by Gartner is that big data technology will eliminate the need for data integration. The truth is that big data technology excels at data acquisition, not data integration.
This myth is rooted in what Gartner referred to as the schema on read approach used by big data technology to quickly acquire a variety of data from sources with multiple data formats.
This is best exemplified by the Hadoop Distributed File System (HDFS). Unlike the predefined, and therefore predictably structured, data formats required by relational databases, HDFS is schema-less. It just stores data files, and those data files can be in just about any format. Gartner explained that “many people believe this flexibility will enable end users to determine how to interpret any data asset on demand. It will also, they believe, provide data access tailored to individual users.”
While it was a great innovation to make data acquisition schema-less, more work has to be done to develop information because, as Gartner explained, “most information users rely significantly on schema on write scenarios in which data is described, content is prescribed, and there is agreement about the integrity of data and how it relates to the scenarios.”
It has always been true that whenever you acquire data in various formats, it has to be transformed into a common format before it can be further processed and put to use. After schema on read and before schema on write is the schema in between.
Data integration is the schema in between. It always has been. Big data technology has not changed this because, as I have previously blogged, data stored in HDFS is not automatically integrated. And it’s not just Hadoop. Data integration is not a natural by-product of any big data technology, which is one of the reasons why technology is only one aspect of a big data solution.
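To make the “schema in between” concrete, here is a minimal sketch of the step that schema-less storage does not do for you. It takes two raw payloads in different formats, as they might sit untouched in a file store, and maps them to one common record layout. The payloads, the field names, and the target layout are all hypothetical.

```python
import csv
import io
import json

# Two raw payloads with different formats and field names (both made up).
raw_json = '[{"customerId": "17", "fullName": "Ada Lovelace", "signup": "2015-03-01"}]'
raw_csv = "id,name,signup_date\n42,Grace Hopper,2014-11-15\n"


def from_json(text):
    """Interpret the JSON payload on read, then map it to the common fields."""
    for row in json.loads(text):
        yield {
            "customer_id": int(row["customerId"]),
            "name": row["fullName"],
            "signup_date": row["signup"],
        }


def from_csv(text):
    """Interpret the CSV payload on read, then map it to the common fields."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {
            "customer_id": int(row["id"]),
            "name": row["name"],
            "signup_date": row["signup_date"],
        }


# The integration step: one list of records that all share the same layout.
integrated = sorted(
    list(from_json(raw_json)) + list(from_csv(raw_csv)),
    key=lambda record: record["customer_id"],
)

for record in integrated:
    print(record)
```

The two mapping functions are where the integration work lives. Someone still has to decide that `customerId` and `id` mean the same thing, which is exactly the agreement the storage layer cannot supply.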
Just as it has always been, in between data acquisition and data usage there’s a lot that has to happen. Not just data integration, but data quality and data governance too. Big data technology doesn’t magically make any of these things happen. In fact, big data just makes us even more painfully aware there’s no magic behind data management’s curtain, just a lot of hard work.