everywhere you turn people are talking about big data, hadoop and sharding. rightfully so. in today’s day and age managing a lot of data is not an easy task, as performance and scalability are key. traversing large data sets, dividing them into tiny sections and distributing the load among many machines (processors) is nothing new.
hadoop exists in order to solve specific problems and has emerged out of necessity. what hadoop does is provide the infrastructure to connect multiple (cheap) servers into a coherent environment with which high i/o and cpu problems (algorithms) are solved.
it all started in 2004 when doug cutting of google released his document indexing project called lucene and decided to have it possible to achieve the same goals in distributed environment. hadoop BTW is his sun’s yellow elephant toy. in 2006 yahoo hired doug to improve his project so it can index the entire internet and made the project open source. that day marked the start of the revolution.
at it’s core hadoop includes two projects: one for distributed storage and one for distributed computing. around those two projects a vast of projects have evolved (and still are).
HDFS: hadoop distributed file system
this file system is designed to store large files and enable large and effective r/w. this is done by dividing the file into sizable chunks, while each chunk is normally stored on 3 nodes which can be anywhere. there is a “name node” that runs the mapping between a document and it’s constituent pieces and the data nodes on which they are stored on.
an API to write programs that will run in a parallel. the developer really needs to write two simple functions: map and reduce that handle a single document (i.e. element of data) on multiple machines, when each node is responsible for the timing, handling errors and failures (network, i/o, etc). this allows for simple parallel batching, where a “job tracker” synchronizes the execution of the bach processes, when each one batch is sub divided into smaller tasks which are handled by the “task trackers”.
over time yahoo and facbook (to mention a few) wrote their own drivers over HDFS and mapReduce and have shared their work with the community. so hadoop is a code name for a set of technologies who harnesses the computing power of many machines to perform simple tasks in parallel. hadoop emerged from the world of un structured data where hundreds of millions of pages are analyzed. today big data is being implemented and researched in every facet of the economy, including healthcare.