What is Hadoop?
Since much of big data is "unstructured", it must be formatted to make it suitable for data mining and analysis. Hadoop addresses this problem and has become the core platform for structuring big data.
Big Data University describes Hadoop as follows:
- Hadoop is an open source project of the Apache Foundation.
- It is a framework written in Java, originally developed by Doug Cutting, who named it after his son's toy elephant.
- Hadoop uses Google’s MapReduce and Google File System technologies as its foundation.
- It is optimized to handle massive quantities of data in a variety of formats (structured, unstructured or semi-structured).
- It uses commodity hardware (that is, relatively inexpensive computers).
- It performs massively parallel processing with high throughput. However, it is a batch operation handling massive quantities of data, so the response time is not immediate.
- Hadoop replicates the data across computers. Thus, if one goes down, the data can be processed from one of its replicas.
- Hadoop is used for big data. It complements OnLine Transaction Processing (OLTP) and OnLine Analytical Processing (OLAP).
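The replication behavior described above can be sketched as a toy model. This is a simplified illustration, not HDFS's actual placement policy; the 128 MB block size, the five node names, and the replication factor of 3 are assumptions chosen to mirror common defaults.

```python
import random

def place_blocks(file_size_mb, block_size_mb=128,
                 nodes=("n1", "n2", "n3", "n4", "n5"), replicas=3):
    """Split a file into fixed-size blocks and assign each block to
    `replicas` distinct nodes -- a toy model of Hadoop-style replication."""
    num_blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    return {b: random.sample(nodes, replicas) for b in range(num_blocks)}

def survivors(placement, failed_node):
    """Drop a failed node; every block should still have live replicas."""
    return {b: [n for n in ns if n != failed_node]
            for b, ns in placement.items()}

placement = place_blocks(300)          # 300 MB file -> 3 blocks
after_failure = survivors(placement, "n1")
```

Because each block lives on three distinct nodes, losing any single node still leaves at least two copies of every block, which is why processing can continue on a replica.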
Forrester analyst Mike Gualtieri (quoted in InformationWeek) offers some additional insights to help you understand what Hadoop is.
- "Imagine you had a file that was larger than your PC's capacity. You could not store that file, right? Hadoop lets you store files bigger than what can be stored on one particular node or server. So you can store very, very large files. It also lets you store many, many files."
- "Mainstream business users don't need to know how Hadoop works." "But they do need to understand that the constraints they once had on storing and processing data are removed when Hadoop is installed."
- “To understand Hadoop, you have to understand two fundamental things about it: How Hadoop stores files, and how it processes data.”
- “The second characteristic of Hadoop is its ability to process that data, or at least (provide) a framework for processing that data. That's called MapReduce."
- “But rather than take the conventional step of moving data over a network to be processed by software, MapReduce uses a smarter approach tailor made for big data sets.” Moving data over a network "can be very, very slow, especially for really large data sets." "Imagine if you're opening a really, really big file on your laptop, it takes a long, long time. It takes much longer than if it's a short, tiny file." So rather than move the data to the software, MapReduce moves the processing software to the data.
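The map/shuffle/reduce flow Gualtieri describes can be sketched in a few lines. This is a single-process toy, not a real Hadoop job (production MapReduce programs are written against Hadoop's API, typically in Java, and run distributed across the cluster); the classic word-count example is used here as an assumed illustration.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit (key, value) pairs -- here, (word, 1) for each word.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values -- here, summing the counts.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big files", "big data sets"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
# counts["big"] == 3, counts["data"] == 2
```

In a real cluster, each map task runs on the node that already holds its block of input, which is the "move the software to the data" idea in the quote above.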
What are Hadoop’s Limitations?
- Hadoop is not suitable for OnLine Transaction Processing (OLTP), where structured data are randomly accessed, as is typical in a relational database.
- Hadoop is not suitable for OnLine Analytical Processing (OLAP) or Decision Support Systems, where structured data are accessed sequentially to generate business intelligence (BI) reports.
- It is NOT a replacement for a relational database system.
- It is not good for processing lots of small files.
- It is not good for intensive calculations using small data sets.
- It should not be used for cases when the work cannot be parallelized.
- It is not good for low-latency data access.
For more in-depth information and training on Hadoop, check out these online sites:
About Patti Gilchrist
Patti Gilchrist is a Sr. Technical Manager with 25 years' experience implementing strategic enterprise initiatives.
Patti has a reputation for effectively translating business problems into innovative solutions and creating strategic roadmaps to achieve business goals.