
Hadoop: Case study big data analytics and How does Hadoop Work


Case study big data analytics and How does Hadoop Work

Let's start with how Hadoop works, then move on to big data analytics case studies. Hadoop MapReduce is a software framework for writing applications that process vast amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.


Hadoop programs perform two different tasks:

  1. The Map Task: This task takes input data and converts it into another set of data, where individual elements are broken down into key/value pairs.
  2. The Reduce Task: This task takes the output from a map task as input and combines those data tuples into a smaller set of key/value pairs.
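The two tasks above can be illustrated with a word count, the usual introductory MapReduce example. This is a minimal single-process sketch in plain Python, not Hadoop's API: the function names `map_task`, `shuffle`, `reduce_task` and `word_count` are made up for illustration, and the grouping step stands in for the work the framework does between the map and reduce phases.

```python
from collections import defaultdict


def map_task(line):
    """Map: break one input record into (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_task(key, values):
    """Reduce: combine all values for one key into a single pair."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over in-memory lines."""
    mapped = [pair for line in lines for pair in map_task(line)]
    return dict(reduce_task(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["big data", "Big clusters"])
```

On a real cluster the map calls run on many nodes at once and the grouped keys are partitioned across several reducers; the data flow, however, is the same as in this sketch.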

In classic MapReduce (Hadoop 1.x), the framework consists of a single master JobTracker and one slave TaskTracker per cluster node.


How does Hadoop Work and Case study big data analytics

    1. Stage 1

A user submits a job to Hadoop for processing by specifying the following items:

  • The location of the input and output files in the distributed file system.
  • The Java classes, in the form of a JAR file, that contain the implementation of the map and reduce functions.
  • The job configuration, set through different parameters specific to the job.

    2. Stage 2

The Hadoop job client then submits the job (JAR file or executable) and its configuration to the JobTracker, which takes responsibility for:

  • Distributing the software and configuration to the slaves.
  • Scheduling tasks and monitoring them.
  • Providing status and diagnostic information to the job client.

    3. Stage 3

  • The TaskTrackers on different nodes execute the tasks as per the MapReduce implementation, and the output of the reduce function is stored in the output files.
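The three stages can be mimicked in a few lines of Python. This is a toy model under loud assumptions, not Hadoop code: the `job` dictionary and its keys are hypothetical stand-ins for the job configuration, in-memory records replace files in the distributed file system, `run_job` plays the role of the JobTracker, and worker threads play the role of TaskTrackers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(split):
    """Map side: count words in one input split."""
    counts = Counter()
    for record in split:
        counts.update(record.split())
    return counts


def merge_counts(partials):
    """Reduce side: merge the partial counts from all map tasks."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


def run_job(job):
    """Toy JobTracker: split the input, schedule map tasks, then reduce."""
    records = job["input_records"]
    n = job["num_map_tasks"]
    # Stage 2: cut the input into n splits and hand them to workers.
    splits = [records[i::n] for i in range(n)]
    # Stage 3: each worker thread acts as a TaskTracker running one map task.
    with ThreadPoolExecutor(max_workers=n) as pool:
        partials = list(pool.map(job["mapper"], splits))
    return job["reducer"](partials)


# Stage 1: the user describes the job (hypothetical keys, not Hadoop's).
job = {
    "input_records": ["a b", "b c", "c c"],
    "num_map_tasks": 2,
    "mapper": count_words,
    "reducer": merge_counts,
}
result = run_job(job)
```

The sketch leaves out everything that makes the real system fault tolerant, such as re-running failed tasks and reporting status back to the client.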

Hadoop Architecture

The Hadoop framework includes the following four modules:

  • Hadoop Common: These are the Java libraries and utilities required by other Hadoop modules. These libraries provide filesystem and OS-level abstractions and contain the necessary Java files and scripts required to start Hadoop.
  • Hadoop YARN: This is a framework for job scheduling and cluster resource management.
  • Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
  • Hadoop MapReduce: This is a YARN-based system for parallel processing of large data sets.


7 real-life big data analytics case studies

    1. Case study big data analytics: Heart attack patient tests analysed with big data

“Patients in a New York hospital with suspicion of heart attack were submitted to series of tests, and the results were analyzed with use of big data – history of previous patients,” says Agnieszka Idzik, Senior Product Manager at SALESmanago. “Whether a patient was admitted or sent home depended on the algorithm, which was more efficient than human doctors.”

    2. Case study big data analytics: Identify warning signs of security breaches

“Data breaches like we saw with Target, Sony, and Anthem never just happen; there are typically early warning signs – unusual server pings, even suspicious emails, IMs or other forms of communication that could suggest internal collusion,” according to Kon Leong, CEO, ZL Technologies. “Fortunately, with the ability to now mine and correlate people, business, and machine-generated data all in one seamless analytics environment, we can get a far more complete picture of who is doing what and when, including the detection of collusion, bribery, or an Ed Snowden in progress even before he has left the building.”

    3. Case study big data analytics: Prevent hardware failure

A power company combined sensor data from the smart grid with a map of the network to predict which generators in the grid were likely to fail, and how that failure would affect the network as a whole. Using this information, they could react to problems before they happened.

How can banks better understand customers and markets?

The bank set up a single Hadoop cluster containing more than a petabyte of data collected from multiple enterprise data warehouses. With all of the information in one place, the bank added new sources of data, including customer call center recordings, chat sessions, emails to the customer service desk and others. Pattern matching techniques recognized the same customer across the different sources, even when there were some discrepancies in the stored identifying information. The bank applied techniques like text processing, sentiment analysis, graph creation, and automatic pattern matching to combine, digest and analyze the data.

The result of this analysis is a very clear picture of a customer’s financial situation, risk of default or late payment, and satisfaction with the bank and its services. The bank has demonstrated not just a reduction of cost from the existing system, but improved revenue from better risk management and customer retention.

While this application was specific to retail banking services, the techniques described (the collection and combination of structured and complex data from multiple silos, and powerful analytics that combine the data and look for patterns) apply broadly. A company with several lines of business often has only a fragmentary, incomplete picture of its customers, and can improve revenues and customer satisfaction by creating a single global view from those pieces.
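To make the idea of recognizing the same customer across sources concrete, here is a minimal sketch of fuzzy name matching using Python's standard `difflib`. It is an illustration only, not the bank's method: the function names, the sample names and the 0.85 similarity threshold are all assumptions, and a production system would compare many more fields than a name.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop punctuation and collapse runs of whitespace."""
    kept = "".join(c for c in name.lower() if c.isalnum() or c.isspace())
    return " ".join(kept.split())


def same_customer(a, b, threshold=0.85):
    """Treat two names as one customer if they are similar enough."""
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold


def link_records(warehouse_names, call_center_names, threshold=0.85):
    """Pair up names from two sources that likely refer to one customer."""
    links = []
    for w in warehouse_names:
        for c in call_center_names:
            if same_customer(w, c, threshold):
                links.append((w, c))
    return links


links = link_records(["John Smith", "Jane Doe"], ["jon smith.", "Maria Lopez"])
```

Comparing every pair of records is quadratic, which is one reason this kind of matching is a natural fit for a cluster once the data reaches the scale described above.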

    4. Case study big data analytics: Understand what people think about your company

Why do companies really lose customers?

A large mobile carrier needed to analyze multiple data sources to understand how and why customers decided to terminate their service contracts. Were customers actually leaving, or were they merely trading one service plan for another? Were they leaving the company entirely and moving to a competitor? Were pricing, coverage gaps, or device issues a factor? What other issues were important, and how could the provider improve satisfaction and retain customers?


The company used Hadoop to combine traditional transactional and event data with social network data. By examining call logs to see who spoke with whom, creating a graph of that social network, and analyzing it, the company was able to show that if people in the customer’s social network were leaving, then the customer was more likely to depart, too.
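The call-graph analysis described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the carrier's actual model: call logs are assumed to be `(caller, callee)` pairs, the customer names are invented, and the only signal computed is the share of each remaining customer's contacts who have already left.

```python
from collections import defaultdict


def build_call_graph(call_logs):
    """Build an undirected graph: who has spoken with whom."""
    graph = defaultdict(set)
    for caller, callee in call_logs:
        graph[caller].add(callee)
        graph[callee].add(caller)
    return graph


def churned_contact_share(graph, churned):
    """For each remaining customer, the fraction of contacts who left."""
    shares = {}
    for customer, contacts in graph.items():
        if customer in churned:
            continue  # only score customers who are still with the carrier
        shares[customer] = len(contacts & churned) / len(contacts)
    return shares


logs = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("dan", "ann")]
shares = churned_contact_share(build_call_graph(logs), churned={"bob", "cat"})
```

In this toy data, two of ann's three contacts have left, so she would rank as a higher churn risk than dan, whose only contact is still a customer.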

By combining coverage maps with customer account data, the company could see how gaps in coverage affected churn. Adding information about how often customers use their handsets, how frequently they replace them, and market data about the introduction of new devices by handset manufacturers allowed the company to predict whether a particular customer was likely to change plans or providers. Combining data in this way gave the provider a much better measure of the risk that a customer would leave, and improved planning for new products and network investments to raise customer satisfaction.

    5. Case study big data analytics: Understand when to sell certain products

How do retailers target promotions guaranteed to make you buy?

A large retailer doing Point-of-Sale transactional analysis needed to combine larger quantities of PoS transaction analysis data with new and interesting data sources to forecast demand and improve the return that it got on its promotional campaigns. The retailer built a Hadoop cluster to understand its customers better and increased its


The retailer loaded 20 years of sales transactions history into a Hadoop cluster. It built analytic applications on the SQL system for Hadoop, called Hive, to perform the same analyses that it had done in its data warehouse system—but over much larger quantities of data, and at much lower cost. The company is also exploring new techniques to analyze the Point of Sale data in new ways using new algorithms and the Hadoop MapReduce interface. Integration of novel data sources, like news and online comments from Twitter and elsewhere, is underway. The company could never have done this new analysis with its legacy data infrastructure. It would have been too expensive to store so much historical data, and the new data is complex and needs considerable preparation to allow it to be combined with the PoS transactions. Hadoop solves both problems, and runs much more sophisticated analyses than were possible in the older system.

    6. Case study big data analytics: How can organizations use machine generated data to identify potential trouble?

A very large public power company combined sensor data from the smart grid with a map of the network to predict which generators in the grid were likely to fail, and how that failure would affect the network as a whole.


The power company built a Hadoop cluster to capture and store the data streaming off all of the sensors in the network. It built a continuous analysis system that watched the performance of individual generators, looking for fluctuations that might suggest trouble. It also watched for problems among generators, such as differences in phase or voltage that might cause trouble on the grid as a whole. Hadoop was able to store the data from the sensors inexpensively, so that the power company could afford to keep long-term historical data around for forensic analysis. As a result, the power company can see, and react to, long-term trends and emerging problems in the grid that are not apparent in the instantaneous performance of any particular generator.

While this was a highly specialized project, it has an analog in data centers managing IT infrastructure grids. In a large data center with thousands of servers, understanding what the systems and applications are actually doing is difficult. Existing tools often don’t scale. IT infrastructure can capture system-level logs that describe the behavior of individual servers, routers, storage systems and more. Higher-level applications generally produce logs that describe the health and activity of application servers, web servers, databases and other services. Large data centers produce an enormous amount of this data. Understanding the relationships among applications and devices is hard. Combining all of that data into a single repository, and analyzing it together, can help IT organizations better understand their infrastructure and improve efficiencies across the network. Hadoop can store and analyze log data, and build a higher-level picture of the health of the data center as a whole.
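As an illustration of watching a single generator for fluctuations, the sketch below flags readings that deviate sharply from the recent past. The source does not say which method the power company used; the rolling-window test, the function name, the window size, the threshold and the sample readings are all assumptions chosen for a short example.

```python
from statistics import mean, pstdev


def flag_fluctuations(readings, window=5, threshold=3.0):
    """Return indexes of readings far from the trailing-window mean.

    A reading is flagged when it lies more than `threshold` standard
    deviations from the mean of the previous `window` readings.
    """
    flagged = []
    for i in range(window, len(readings)):
        history = readings[i - window:i]
        mu, sigma = mean(history), pstdev(history)
        # A perfectly flat history (sigma == 0) is skipped in this sketch.
        if sigma > 0 and abs(readings[i] - mu) > threshold * sigma:
            flagged.append(i)
    return flagged


# Hypothetical frequency readings (Hz) from one generator sensor.
readings = [50.0, 50.1, 49.9, 50.0, 50.1, 53.0, 50.0]
suspicious = flag_fluctuations(readings)
```

The same check applies unchanged to server metrics in the data center analogy, with one series of readings per machine.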

    7. Case study big data analytics: How can companies detect threats and fraudulent activity?

Businesses have struggled with theft, fraud and abuse since long before computers existed. Computers and on-line systems create new opportunities for criminals to act swiftly, efficiently and anonymously. On-line businesses use Hadoop to monitor and combat criminal behavior.


One of the largest users of Hadoop, and in particular of HBase, is a global developer of software and services to protect against computer viruses. Many detection systems compute a “signature” for a virus or other malware, and use that signature to spot instances of the virus in the wild. Over the decades, the company has built up an enormous library of malware indexed by signatures. HBase provides an inexpensive and high-performance storage system for this data. The vendor uses MapReduce to compare instances of malware to one another, and to build higher-level models of the threats that the different pieces of malware pose. The ability to examine all the data comprehensively allows the company to build much more robust tools for detecting known and emerging threats.

A large online email provider has a Hadoop cluster that provides a similar service. Instead of detecting viruses, though, the system recognizes spam messages. Email flowing through the system is examined automatically. New spam messages are properly flagged, and the system detects and reacts to new attacks as criminals create them.

Sites that sell goods and services over the internet are particularly vulnerable to fraud and theft. Many use web logs to monitor user behavior on the site. By tracking that activity, tracking IP addresses and using knowledge of the location of individual visitors, these sites are able to recognize and prevent fraudulent activity. The same techniques work for online advertisers battling click fraud. Recognizing patterns of activity by individuals permits the ad networks to detect and reject fraudulent activity.

Hadoop is a powerful platform for dealing with fraudulent and criminal activity like this. It is flexible enough to store all of the data that matters: message content, relationships among people and computers, and patterns of activity. It is powerful enough to run sophisticated detection and prevention algorithms and to create complex models from historical data to monitor real-time activity.
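The signature library mentioned in the antivirus example can be pictured as a lookup table keyed by signature. The sketch below is a deliberate simplification: it assumes the signature is a SHA-256 hash of the whole file, whereas real products use more elaborate signatures, and a plain dictionary stands in for the HBase table. The function names and sample payloads are invented for illustration.

```python
import hashlib


def signature(payload: bytes) -> str:
    """Compute a simple signature: the SHA-256 hex digest of the bytes."""
    return hashlib.sha256(payload).hexdigest()


def build_library(samples):
    """Index known malware samples by signature (dict stands in for HBase)."""
    return {signature(data): name for name, data in samples.items()}


def scan(payload: bytes, library):
    """Return the name of the matching known sample, or None."""
    return library.get(signature(payload))


library = build_library({"demo-malware": b"bad bytes"})
hit = scan(b"bad bytes", library)
miss = scan(b"harmless", library)
```

An exact hash only catches byte-identical copies, which is why the source describes using MapReduce to compare samples with one another and build higher-level models rather than relying on lookups alone.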