
Saturday, 30 December 2017

Introduction to Big Data

What is Data?
     Any piece of information can be considered data. We live in the age of data, where everything that surrounds us is linked to a data source and much of our lives is captured digitally. The physical world around us has turned into raw information: internet, video, call data records, customer transactions, healthcare records, news, literature, scientific publications, economic data, weather data, geospatial data, stock market data, and city and government records. This data comes in various forms and sizes, from small data to very big data. Let us look at how data is classified by size:
  • Any data that can reside in RAM (main memory) is considered small data. Small data is less than tens of GB.
  • Any data that can reside on a hard disk is considered medium data. Medium data ranges from tens to thousands of GB.
  • Any data that cannot reside on a hard disk or in a single system is considered Big Data. Its size is more than thousands of GB.
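     The classification above can be sketched as a small function. This is only an illustration: the text gives rough ranges ("tens" and "thousands" of GB), so the exact cut-offs of 10 GB and 1000 GB, and the function name `classify_by_size`, are assumptions made for the example.

```python
GB = 10**9  # one decimal gigabyte, in bytes

def classify_by_size(size_bytes, small_limit_gb=10, medium_limit_gb=1000):
    """Return 'small', 'medium' or 'big' for a dataset size given in bytes.

    The two limits are assumed cut-offs, not fixed industry definitions.
    """
    size_gb = size_bytes / GB
    if size_gb < small_limit_gb:
        return "small"    # fits comfortably in RAM
    if size_gb <= medium_limit_gb:
        return "medium"   # fits on a single hard disk
    return "big"          # needs more than one machine

print(classify_by_size(2 * GB))      # small
print(classify_by_size(500 * GB))    # medium
print(classify_by_size(5000 * GB))   # big
```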
What is Big Data? 

     Big Data is also data, but of a huge size: data that is very large is called Big Data. Normally we work with data measured in megabytes (MB) [Word documents, Excel sheets, etc.] or at most gigabytes (GB) [movies, songs, etc.], whereas data on the scale of petabytes (PB) is called Big Data.
     Big Data refers to huge sets of structured, semi-structured or unstructured data that organizations mine to identify new opportunities. About 80% of the data captured today is unstructured. It is collected from sources such as sensors that gather climate information, posts on social media websites (for example, tweets on Twitter), digital pictures and videos uploaded to websites like Facebook, purchase transaction records and other similar data. All of this is also Big Data.
The Data Explosion
  • Every day, 2.5 quintillion bytes (about 2.5 billion GB) of data are created.
  • 90% of the data in the world was created in the last 2 years.
  • As a business leader, it’s the consequences of this data explosion that you need to care about. 
 Two key consequences result:
1. Knowledge Gap: The difference between collecting data and understanding data.
2. Execution Gap: The difference between understanding data and acting on it.
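     As a quick unit check on the daily figure quoted above, the sketch below converts 2.5 quintillion bytes into gigabytes. It shows both the decimal gigabyte (10^9 bytes) and the binary gibibyte (2^30 bytes); which of the two the original statistic meant is not stated, so both are shown.

```python
BYTES_PER_DAY = 2.5e18   # 2.5 quintillion bytes, the figure quoted in the text

GB = 10**9    # decimal gigabyte
GIB = 2**30   # binary gigabyte (gibibyte)

daily_gb = BYTES_PER_DAY / GB
daily_gib = BYTES_PER_DAY / GIB

# Express both results in billions to keep the numbers readable.
print(f"{daily_gb / 1e9:.2f} billion GB per day")    # 2.50 billion GB per day
print(f"{daily_gib / 1e9:.2f} billion GiB per day")  # 2.33 billion GiB per day
```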
Big Data Big Sources
1. Social Networking Sites
     Sites such as Facebook, Google and LinkedIn generate huge amounts of data every day, as they have billions of users worldwide.
2. E-Commerce Sites
     Sites like Amazon, Flipkart and Snapdeal generate huge numbers of logs from which users' buying trends can be traced.
3. Weather Stations
     Weather stations and satellites produce huge amounts of data, which are stored and processed to forecast the weather.
4. Telecom Companies
     Telecom giants like Airtel and Vodafone study user trends and publish their plans accordingly; for this they store the data of their millions of users.
5. Share Market
     Stock exchanges across the world generate huge amounts of data through their daily transactions.
6. Search Engine Data
     Search engines retrieve lots of data from different databases.
Examples of Big data generation
  • Walmart serves 200 million weekly customers across 10,700 stores in 27 countries.
  • Walmart handles 1.5 million customer transactions every hour.
  • 3 PB of data are stored in Walmart's Hadoop cluster.
  • Facebook records 4.5 billion likes every day.
  • 350 million photos are uploaded to Facebook daily.
  • 250 billion photos are stored by Facebook.
  • 10 billion messages are sent on Facebook every day.
  • 1 trillion posts are held in Facebook's graph search database.
Advantages of Big Data
     As we know, data size is increasing very fast. Big Data is an emerging avenue for productivity and innovation. Such a huge volume of data, if processed properly, can lead to major changes in current business and related activities.
     Big Data is all about analyzing this variety of data and finding the needle of value in this huge volume of data. Businesses can use it to track patterns and perform advanced analytics that help significantly in decision making. A business can get a 360-degree view of its customers by analyzing customer data from different sources such as sales data, social media, etc. The systematic study of Big Data can lead to:

Understanding target customers better – Businesses today use Big Data to analyze the sentiments of their target customers and provide them with better services, which in turn grows the business.

Cutting down expenditure in various sectors – Analysis of such huge volumes of data has also helped businesses cut their expenditure wherever possible. Several billion dollars are being saved through improvements in operational efficiency and more.

Increasing operating margins in different sectors – Big Data also helps industries increase their operating margins. With the help of Big Data, a lot of manual labour can be converted into machine tasks, which helps increase operating margins.

The Four V's of Big Data
     Big Data is classified in terms of 
1. Volume
2. Velocity
3. Variety
4. Veracity

1. Volume
     Volume refers to the amount of data (the size of the data). Today data sizes have grown to terabytes in the form of records or transactions. 90% of all data ever created was created in the past 2 years. From now on, the amount of data in the world is expected to double every two years. By 2020, we will have 50 times the amount of data that we had in 2011.
     In the past, the creation of so much data would have caused serious problems. Nowadays, with decreasing storage costs, better storage solutions like Hadoop, and algorithms to create meaning from all that data, this is no longer a serious problem.
     Massive volumes of data are being generated, in the range of terabytes to zettabytes. The data is generated by machines, networks, devices, sensors and satellites, and by human interaction with systems: geospatial data, transaction-based data (stored over years), and text, images and videos from social media, etc.
  • A commercial aircraft generates 3 GB of flight sensor data in 1 hour.
  • An ERP system for a mid-size company grows by 1-2 TB annually.
  • A telecom operator typically generates 3 TB of call detail records (CDR) every day.
  • The 12 terabytes of tweets created each day can be turned into improved product sentiment analysis, revealing customers' views for better business.
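     The two growth figures in the Volume paragraph (doubling every two years, and 50 times more data in 2020 than in 2011) can be compared with a little arithmetic. The sketch below assumes smooth compound growth; the function names are made up for this example. The two figures do not match exactly, presumably because they come from different forecasts.

```python
import math

def growth_factor(years, doubling_period_years=2.0):
    """How many times larger the data becomes after `years` of steady doubling."""
    return 2 ** (years / doubling_period_years)

def implied_doubling_period(years, factor):
    """Doubling period (in years) needed to grow by `factor` in `years`."""
    return years / math.log2(factor)

span = 2020 - 2011  # 9 years

# Doubling every two years for 9 years gives roughly a 22.6-fold increase.
print(round(growth_factor(span), 1))                # 22.6

# A 50-fold increase in 9 years implies doubling about every 1.59 years.
print(round(implied_doubling_period(span, 50), 2))  # 1.59
```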
2. Velocity
     Velocity is the speed at which data is created, stored, analyzed and visualized. In the past, when batch processing was common practice, it was normal to receive an update from the database every night or even every week. Computers and servers required substantial time to process the data and update the databases. In the big data era, data is created in real time or near real time. With the availability of Internet-connected devices, wireless or wired, machines and devices can pass on their data the moment it is created. Think about how many SMS messages, Facebook status updates, or credit card swipes are being sent on a particular telecom carrier every minute of every day, and you’ll have a good appreciation of velocity.
     The speed at which data is currently created is almost unimaginable: every minute we upload 100 hours of video to YouTube. In addition, every minute over 200 million emails are sent, around 20 million photos are viewed and 30,000 uploaded on Flickr, almost 300,000 tweets are sent and around 4 to 5 million Google queries are performed.
     According to Gartner, velocity means both how fast the data is being produced and how fast the data must be processed to meet demand.
     The flow of data is massive and continuous. This real-time data can help businesses make decisions in real time.
     For example, when we search on Google for travel, shopping (electronics, apparel, shoes, watches, etc.), jobs, etc., it shows us relevant advertisements in real time while we browse.

3. Variety
     Variety refers to the many sources and types of data. In the past, all data that was created was structured data that fitted neatly into columns and rows, but those days are over. Nowadays, 90% of the data generated by organizations is unstructured. Data today comes in many different formats: structured data, semi-structured data, unstructured data and even complex structured data.
i. Structured Data 
     Any data that can be stored, accessed and processed in a fixed format is termed structured data. Structured data refers to kinds of data with a high level of organization, such as information in a relational database.
E.g. relational data
ii. Semi-structured Data 
     Semi-structured data is information that doesn’t reside in a relational database but does have some organizational properties that make it easier to analyze. With some processing it can be stored in a relational database. Examples of semi-structured data are XML and JSON documents; data held in NoSQL databases is also generally considered semi-structured.
iii. Unstructured Data 
     Any data with unknown form or unknown structure is classified as unstructured data. It often includes text and multimedia content. Examples include e-mail messages, word processing documents, videos, photos, audio files, presentations, webpages and many other kinds of business documents.
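     To make the point about semi-structured data concrete, the sketch below takes a small JSON document and flattens it into fixed-column rows of the kind a relational table could hold. The order data, field names and the `flatten_order` helper are invented for illustration; note how a missing field simply becomes an empty (None) column.

```python
import json

# Semi-structured input: nested fields, and the second order has no city.
orders_json = """
[
  {"order_id": 1, "customer": {"name": "Asha", "city": "Delhi"}, "items": ["pen", "ink"]},
  {"order_id": 2, "customer": {"name": "Ravi"}, "items": ["book"]}
]
"""

def flatten_order(order):
    """Turn one nested order into a flat row with a fixed set of columns."""
    customer = order.get("customer", {})
    return {
        "order_id": order["order_id"],
        "customer_name": customer.get("name"),
        "customer_city": customer.get("city"),   # None when the field is absent
        "item_count": len(order.get("items", [])),
    }

rows = [flatten_order(order) for order in json.loads(orders_json)]
for row in rows:
    print(row)
```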

4. Veracity
  • Big data veracity refers to the biases, noise, abnormalities, ambiguities and latency in data.
  • Is the data that is being stored and mined meaningful to the problem being analyzed?
  • Keep your data clean, and put processes in place to keep 'dirty data' from accumulating in your systems.
     Having a lot of data in different volumes coming in at high speed is worthless if that data is incorrect. Incorrect data can cause a lot of problems for organizations as well as for consumers. Therefore, organizations need to ensure that both the data and the analyses performed on it are correct. Especially in automated decision-making, where no human is involved anymore, you need to be sure that both the data and the analyses are correct.
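     One simple way to keep dirty data out is to validate each record before it is stored. The sketch below is a minimal example; the record layout and the two rules (a non-negative numeric amount and a non-empty city) are assumptions chosen for illustration, and real pipelines would apply their own business rules.

```python
def is_clean(record):
    """A record is clean if amount is a non-negative number and city is non-empty."""
    amount = record.get("amount")
    city = record.get("city")
    return (
        isinstance(amount, (int, float))
        and not isinstance(amount, bool)   # bool is a subclass of int; reject it
        and amount >= 0
        and isinstance(city, str)
        and city.strip() != ""
    )

records = [
    {"amount": 120.5, "city": "Delhi"},   # clean
    {"amount": -40, "city": "Mumbai"},    # negative amount
    {"amount": None, "city": "Pune"},     # missing amount
    {"amount": 75, "city": "  "},         # blank city
]

clean = [r for r in records if is_clean(r)]
dirty = [r for r in records if not is_clean(r)]
print(len(clean), len(dirty))  # 1 3
```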

Two Other Important V's of Big Data
1. Validity
     Validity refers to whether the data is correct and accurate for the decisions it is intended to support.
2. Volatility
     Big data volatility refers to how long data is valid and how long it should be stored. In this world of real-time data, you need to determine at what point the data is no longer relevant to the current analysis.

Conventional Approaches
  • RDBMS (Oracle, DB2, MySQL, etc.)
  • OS File-System
  • SQL Queries
  • Custom framework
Problems with Traditional/Conventional Approaches
1. Limited storage capacity
2. Limited processing capacity
3. No scalability
4. Single point of failure
5. Sequential processing
6. RDBMS can handle only structured data
7. Requires pre-processing of data

Challenges of Big Data 
     The major challenges associated with big data are as follows:
1. Capturing data
2. Storage
3. Searching
4. Sharing
5. Transfer
6. Security
7. Analysis
8. Presentation
9. IT Architecture
     To meet the above challenges, organizations normally take the help of enterprise servers.

Traditional Approach of Big Data
     In the traditional approach, an enterprise has a single computer to store and process big data. The data is stored in an RDBMS like Oracle Database, MS SQL Server or DB2, and sophisticated software can be written to interact with the database, process the required data and present it to users for analysis.
     This approach works well when the volume of data is small enough to be accommodated by standard database servers, or up to the limit of the processor that is processing the data. But when it comes to dealing with huge amounts of data, processing it through a traditional database server becomes a really tedious task.

Google’s Solution
     Google solved the above problem using a programming model called MapReduce. It divides the given task into small parts, assigns those parts to many computers connected over the network, and collects the results to form the final result dataset. These computers can be commodity hardware, ranging from single-CPU machines to servers with higher capacity.
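     The divide, assign and collect idea can be illustrated with the classic word-count example. The sketch below runs in a single Python process and only mimics the three stages (map, shuffle, reduce); it is not Google's or Hadoop's actual API, and the function names are chosen for this example. In a real cluster each chunk would be processed on a different machine.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)

# Each string stands in for the part of the input given to one machine.
chunks = ["big data is big", "data is everywhere"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
result = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(result)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```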
Typical Distributed System
  • Programs run on each app server
  • All the data is on a SAN (storage area network)
  • Before execution, each server gets data from the SAN
  • After execution, each server writes the output to the SAN
Problems with typical distributed system
  • Huge dependency on network and huge bandwidth demands
  • Scaling up and down is not a smooth process
  • Partial failures are difficult to handle
  • A lot of processing power is spent on transporting data
  • Data synchronization is required during exchange. 
Big data Use Cases
1. Credit Card Fraud Detection

     As millions of people use credit cards nowadays, it has become very necessary to protect them from fraud. It has become a challenge for credit card companies to identify whether a requested transaction is fraudulent or not. A credit card transaction takes hardly 3-4 seconds to complete. So the companies need an innovative solution that can identify potentially fraudulent transactions within this short time and thus protect their customers from becoming victims.
     Click fraud is a related example. An abnormal number of clicks from the same IP address, or a pattern in the access times, is the most obvious and easily identified form of click fraud, yet it is amazing how many fraudsters still use this method, particularly for quick attacks. They may choose to strike over a long weekend when they figure you may not be watching your log files carefully, clicking on your ad repeatedly so that when you return to work on Tuesday, your account is significantly depleted. Part of this activity might be unintentional, for example when a user tries to reload a page.
     Again, suppose you make a transaction in Delhi today and the very next minute there is a transaction on your card in Dubai. Chances are that the second transaction is fraudulent and was not made by you. So companies need to process the data in real time (Data in Motion analytics, DIM), analyze it against the individual's history in a very short span of time, and identify whether the transaction is actually fraudulent or not. Accordingly, companies can accept or decline the transaction based on the severity. To process such data streams, we need streaming engines like Apache Flink. A streaming engine can consume real-time data streams very efficiently and process the data with low latency.
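     The Delhi and Dubai example can be expressed as a simple "impossible travel" rule: if two consecutive transactions on the same card imply a travel speed no aircraft could reach, flag the second one. The sketch below is a toy, single-machine version of such a rule; the 900 km/h limit, the approximate city coordinates and the record layout are assumptions for illustration, and a real system would combine many more signals inside a streaming engine.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points on Earth, in kilometres."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def is_suspicious(prev, curr, max_speed_kmh=900.0):
    """Flag `curr` if reaching it from `prev` would need an implausible speed."""
    distance = haversine_km(prev["lat"], prev["lon"], curr["lat"], curr["lon"])
    hours = (curr["time_s"] - prev["time_s"]) / 3600
    if hours <= 0:
        return distance > 0   # same instant, different place
    return distance / hours > max_speed_kmh

# Approximate coordinates; times are seconds since the first transaction.
delhi = {"lat": 28.61, "lon": 77.21, "time_s": 0}
dubai_next_minute = {"lat": 25.20, "lon": 55.27, "time_s": 60}
dubai_four_hours_later = {"lat": 25.20, "lon": 55.27, "time_s": 4 * 3600}

print(is_suspicious(delhi, dubai_next_minute))       # True
print(is_suspicious(delhi, dubai_four_hours_later))  # False
```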

2. Sentiment Analysis 
     Sentiment Analysis is the process of determining whether a piece of writing is positive, negative or neutral. It’s also known as opinion mining, deriving the opinion or attitude of a speaker. In sentiment analysis, language is processed to identify and understand consumer feelings and attitudes towards brands or topics in online conversations i.e., what they are thinking about a particular product or service, whether they are happy or not with it, etc.
     If a company is launching a new product, sentiment analysis can identify users' opinions about it, and the product can be improved based on those opinions. Sentiment analysis enables a business to make early decisions rather than wait for sales reports. (If you launch your product today, by the end of the day you will know whether people are saying positive or negative things about it.)
     For example, after a launch the company can find out what its customers think about the product: whether they are satisfied with it, or whether they would like some modifications. The company can then act accordingly to modify or improve the product, increasing sales and keeping customers happy.
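     A very small lexicon-based classifier shows the basic idea of labelling text as positive, negative or neutral. The word lists and the `sentiment` function below are invented for this example; production systems use much larger lexicons or trained models and handle negation, sarcasm and context, which this sketch does not.

```python
# Tiny, hand-picked word lists used only for this illustration.
POSITIVE = {"good", "great", "love", "happy", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "lost", "terrible"}

def sentiment(text):
    """Label text by counting positive and negative words."""
    words = [word.strip(".,!?").lower() for word in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this phone, great battery!"))        # positive
print(sentiment("Terrible service, they lost my luggage."))  # negative
print(sentiment("The parcel arrived on Monday."))            # neutral
```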
Real example of sentiment analysis
     A large airline company started monitoring tweets about its flights to see how customers felt about upgrades, new planes, entertainment, etc. Nothing special there, except that it began feeding this information to its customer support platform and resolving issues in real time.
     One memorable instance occurred when a customer tweeted negatively about lost luggage before boarding his connecting flight. The airline picked up the tweet and offered him a free first-class upgrade on the way back. It also tracked the luggage and told him where it was and where it would be delivered. Needless to say, he was pretty shocked, and tweeted like a happy camper throughout the rest of his trip.
     With Hadoop, we can mine Twitter, Facebook and other social media conversations for sentiment data about you and your competition, and use it to make targeted, real-time decisions that increase market share. With quick analysis of customer sentiment through social media, a company can take decisions and act immediately; it need not wait for the sales report (which might take 6 months or more), as it had to earlier, to run its business better.

3. Retail - Data Processing 
     Let us now look at an application for a leading retail client in India. The client received about 100 GB of invoice data daily, in XML format. Generating a report from this data with the conventional method took about 10 hours, and the client had to wait that long to get the report.
     The conventional method was developed in C/Perl and took so long that it was not a feasible solution, and the client was not happy with it. The invoice data, being in XML format, needed to be transformed into a structured format before the report could be generated. This involved validation and verification of the data and the implementation of complex business rules.
     In today’s world, when things are expected to be available whenever required, waiting 10 hours was not an acceptable solution. So the client approached the Big Data team of a company with their problem, hoping for a better solution. The client would have been satisfied even if the time had been reduced from 10 hours to 5 hours or a little more.
     When the Big Data team came back with their solution, the client was amazed and could hardly believe that the report which used to take 10 hours could now be produced in just 10 minutes using Big Data and Hadoop. The team used a cluster of 10 nodes to process the data as it was generated. This illustrates the speed and efficiency that Big Data technology can offer.
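     The XML-to-structured transformation mentioned in this use case can be sketched on a tiny scale with Python's standard library. The invoice layout (element and attribute names) and the single validation rule below are hypothetical, since the client's real schema and business rules are not given; the actual solution ran this kind of work in parallel on a Hadoop cluster rather than in one process.

```python
import xml.etree.ElementTree as ET

# Hypothetical invoice XML; the real schema is not described in the text.
invoice_xml = """
<invoices>
  <invoice id="INV-1"><store>Pune</store><amount>250.00</amount></invoice>
  <invoice id="INV-2"><store>Delhi</store><amount>-5</amount></invoice>
  <invoice id="INV-3"><store>Mumbai</store><amount>99.50</amount></invoice>
</invoices>
"""

def parse_invoices(xml_text):
    """Convert invoice XML into a list of flat rows, skipping invalid amounts."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for invoice in root.findall("invoice"):
        amount = float(invoice.findtext("amount"))
        if amount < 0:
            continue   # example validation rule: drop negative amounts
        rows.append({
            "id": invoice.get("id"),
            "store": invoice.findtext("store"),
            "amount": amount,
        })
    return rows

rows = parse_invoices(invoice_xml)
for row in rows:
    print(row)
```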
4. Sears Holdings
     Sears has 4,000 stores with millions of products and 100 million customers, and had collected over 2 PB of data so far.
     Its legacy systems were incapable of analyzing such large amounts of data to personalize loyalty campaigns.
Conventional approach for analyzing data 

  • Analyzed just 10% of customer data for personalizing loyalty campaigns on mainframes, Teradata and SAS
  • Processing time to analyze 10% of data: 6 weeks
Big Data Approach
  • Shifted to Hadoop with 300 nodes of commodity servers
  • Time taken to process 100% of customer data now: 1 week
  • Interactive reports can be developed in 3 days instead of 6 to 12 weeks
  • Saved millions of dollars in mainframe and RDBMS cost and got 5000% better performance
  • Increased revenues through better analysis of customer data
5. Orbitz
     Orbitz is a leading travel company using the latest technologies to transform the way clients around the world plan their travel. It operates the customer travel planning sites Orbitz, Ebookers and CheapTickets. It generates 1.5 million flight searches and 1 million hotel searches daily, and the log data generated by this activity is approximately 500 GB in size. The raw logs were stored for only a few days because data warehousing was costly. Handling and storing such huge data with conventional data warehouse storage and analysis infrastructure was becoming more expensive and time consuming.
     For example, searching for hotels in the database using the conventional approach, which was developed in Perl/Bash, required extraction to be done serially. Processing and sorting hotels based on just the last 3 months of data took 2 hours, which is not an acceptable or feasible solution today, when customers expect results at a click.
     This was a big problem and needed a solution to protect the company from losing its customers. Orbitz needed an effective way to store and process this data, and it also needed to improve its hotel rankings. A Big Data approach with Hadoop was then tried. HDFS, MapReduce and Hive were used to solve the problem, and the results were impressive. A Hadoop cluster provided a very cost-effective way to store vast amounts of raw logs. The data is cleaned and analyzed, and machine learning algorithms are run on it.
     Where it earlier took about 2 hours to generate search results on the last 3 months of hotel data, the time was reduced to just 26 minutes with Big Data. Big Data was able to predict hotel and flight search trends much faster, more efficiently and more cheaply than the conventional approach.
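     The improvements quoted in these use cases can be put on a common scale by computing simple ratios from the figures given in the text. The helper names below are made up for this sketch, and the ratios only compare the quoted durations; they say nothing about hardware cost or cluster size.

```python
def speedup(old_minutes, new_minutes):
    """How many times faster the new run is, given both durations in minutes."""
    return old_minutes / new_minutes

def throughput_ratio(old_percent, old_weeks, new_percent, new_weeks):
    """Ratio of data processed per week, new versus old."""
    return (new_percent * old_weeks) / (old_percent * new_weeks)

retail = speedup(10 * 60, 10)            # 10 hours down to 10 minutes
orbitz = speedup(2 * 60, 26)             # 2 hours down to 26 minutes
sears = throughput_ratio(10, 6, 100, 1)  # 10% in 6 weeks vs 100% in 1 week

print(f"Retail report: {retail:.0f}x faster")          # Retail report: 60x faster
print(f"Orbitz hotel search: {orbitz:.1f}x faster")    # Orbitz hotel search: 4.6x faster
print(f"Sears customer analysis: {sears:.0f}x more data per week")  # 60x
```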

