In the traditional approach, an enterprise has a single computer to store and process big data. The data is stored in an RDBMS such as Oracle Database, MS SQL Server, or DB2, and sophisticated software can be written to interact with the database, process the required data, and present it to users for analysis.
This approach works well when the volume of data is small enough to be accommodated by a standard database server, or stays within the limit of the processor that is processing the data. But when it comes to huge amounts of data, pushing all of it through a single traditional database server becomes a bottleneck.
Google solved this problem using an algorithm called MapReduce. The algorithm divides a given task into small parts, assigns those parts to many computers connected over a network, and collects their results to form the final result dataset. The following diagram shows various commodity hardware, which could be single-CPU machines or servers with higher capacity.
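The divide, assign, and collect steps described above can be illustrated with a minimal word-count sketch. This is not Google's or Hadoop's implementation: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made for illustration, and worker threads stand in for the separate computers on the network.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of the input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group emitted values by key so that each key is reduced in one place."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values collected for each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks, workers=3):
    # Each worker thread plays the role of one machine processing one part.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    # Collect the partial results and merge them into the final dataset.
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    # Hypothetical input, already divided into three small parts.
    chunks = [
        "big data needs big clusters",
        "data moves over the network",
        "clusters process data",
    ]
    print(word_count(chunks))
```

Each part is mapped independently, which is what allows the work to be spread across many machines; only the grouped intermediate pairs need to be brought together for the reduce step.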
Typical Distributed System
- Programs run on each app server
- All the data is on SAN
- Before execution, each server gets data from SAN
- After execution, each server writes the output to SAN
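The read, execute, and write cycle listed above can be sketched as a small simulation. Everything here is hypothetical: `SharedStorage` is a stand-in for the SAN, `run_app_server` is an illustrative app server, and uppercasing the bytes stands in for the real program. The byte counter only serves to show that every input and every output crosses the network.

```python
class SharedStorage:
    """Hypothetical stand-in for a SAN: one central store reached over the network."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)
        self.bytes_transferred = 0  # total bytes moved to or from the store

    def read(self, name):
        data = self.blocks[name]
        self.bytes_transferred += len(data)
        return data

    def write(self, name, data):
        self.blocks[name] = data
        self.bytes_transferred += len(data)


def run_app_server(san, input_name, output_name):
    data = san.read(input_name)      # before execution: get data from the SAN
    result = data.upper()            # the program runs on the app server
    san.write(output_name, result)   # after execution: write output to the SAN
    return result


# Two app servers, each handling one part of the data.
san = SharedStorage({"part-0": b"alpha", "part-1": b"bravo!"})
for i in range(2):
    run_app_server(san, f"part-{i}", f"out-{i}")

print(san.bytes_transferred)  # 22: each part is moved once in and once out
```

Because the data lives only on the central store, the bytes transferred grow with both the input size and the output size, no matter how many app servers are added.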
Problems with a Typical Distributed System
- Huge dependency on network and huge bandwidth demands
- Scaling up and down is not a smooth process
- Partial failures are difficult to handle
- A lot of processing power is spent on transporting data
- Data synchronization is required during exchange