When To Use Big Data
With all the current excitement around big data, many are eager to get started with a big data project. But not all projects are big data projects. Remember, just having large amounts of data does not necessarily require big data technologies. Thus, project managers must be capable of advising management not only about how, but also when, the business can and should take advantage of big data technologies.
Do You Really Need a Big Data Solution?
Before beginning a big data project, companies must ask the following basic questions:
- Why do this at all?
Tamara Dull of the best practices team at SAS advises:
“Keep in mind that you don’t always need big data; you just need the right data” (see Building Off Best Practices).
“The first, most obvious question is ‘Why do this at all?’ There should be a compelling use case, a competitive driver, cost driver, or some other issue that has been identified where the application of big data technologies is in the critical path to solving the problem. Typical drivers include the information type (for example, under-utilized structured information sources), or the volume of information (retention of IP logs), but in any case, you need to identify exactly why you are pursuing this path.”
“One of the most important things you should look for is a compelling ROI (return on investment). That is to say, find something for which you can put a value on the cost of the problem before you plan a solution” (see Selecting Your First Big Data Project).
- Are you using the basic data you already have in a way that engages consumers?
“The biggest reason that investments in big data fail to pay off, though, is that most companies don’t do a good job with the information they already have. They don’t know how to manage it, analyze it in ways that enhance their understanding, and then make changes in response to new insights. Companies don’t magically develop those competencies just because they’ve invested in high-end analytics tools. They first need to learn how to use the data already embedded in their core operating systems, much the way people must master arithmetic before they tackle algebra. Until a company learns how to use data and analysis to support its operating decisions, it will not be in a position to benefit from big data” (see You May Not Need Big Data After All).
Additional Considerations to Help Determine if a Big Data Solution Is Required
Once you have answered the above questions, there are additional considerations. Remember, just because your data is too big for Excel does not necessarily mean your data is “big data”.
A Test to Determine If You Need a Big Data Solution
The following offers a 4-point test based on volume, velocity, variety, and variability of data to help determine if you really need a big data solution:
Points to Consider When Determining If You Need a Big Data Solution
The following excerpt from SyonCloud provides several points for consideration to help determine if a Big Data solution is required.
- If your relational databases do not scale to your traffic needs at an acceptable cost of hardware and/or licenses.
- If the normalized schema of your relational database has become too complex: too many tables hold just a tiny proportion of the overall data, and you can no longer print the ERD on a single A3 page.
- If your business applications generate lots of supporting and temporary data that does not really belong in the main data store. Such data includes customers' search results, visited pages, historical share prices, contents of abandoned shopping carts, and so on.
- Your database schema is already denormalized in order to improve response times of your applications.
- When joins in relational databases slow the system down to a crawl.
- Relational data doesn’t map well to typical programming structures that often consist of complex data types or hierarchical data. Data such as XML is especially difficult because of its hierarchical nature. Complex objects that contain objects and lists inside of them do not always map directly to a single row in a single table.
- If documents from different sources require a flexible schema or no schema at all, or if input data must be kept in its original formats.
- If ETL (Extract, Transform, Load) is required on source data. NoSQL engines or Map/Reduce can perform ETL steps and produce output suitable for loading into an RDBMS.
- If missing data can be ignored when the volume of data is large enough. The law of Big Data is “More data beats clever algorithms.”
- When flexibility is required for analytics. A flexible data store allows experimentation with which questions to ask before a fixed data model is defined.
- In some NoSQL databases, each data element or document is versioned. This enables queries for values at a specific time in history.
- When we need to utilize outputs from many existing systems. For example, in order to prepare a relevant offer for a customer, we need information from the billing system, from the customer's historical orders, and from the orders of similar customers, as well as from the stock system and the CRM system. Traditional integration of all these systems is expensive and not very flexible.
- When we need to analyze unstructured data such as documents, log files or semi-structured data such as CSV files, forms and exports from other systems.
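The point in the list above about complex objects not mapping to a single row in a single table can be made concrete with a small sketch. The order document, its field names, and the three target tables below are hypothetical and chosen only for illustration; they do not come from any particular system.

```python
# Hypothetical nested order document; all field names are assumptions.
order = {
    "order_id": 1001,
    "customer": {"id": 7, "name": "Ada"},
    "items": [
        {"sku": "A-1", "qty": 2},
        {"sku": "B-9", "qty": 1},
    ],
}


def to_relational_rows(doc):
    """Split one nested document into rows for three hypothetical tables."""
    customer_row = (doc["customer"]["id"], doc["customer"]["name"])
    order_row = (doc["order_id"], doc["customer"]["id"])
    # Each list element becomes its own row, linked back by order_id.
    item_rows = [(doc["order_id"], item["sku"], item["qty"]) for item in doc["items"]]
    return {
        "customers": [customer_row],
        "orders": [order_row],
        "order_items": item_rows,
    }


rows = to_relational_rows(order)
total_rows = sum(len(table_rows) for table_rows in rows.values())
print(f"1 document became {total_rows} rows in {len(rows)} tables")
```

A document store could keep the `order` object as a single record, whereas the relational form needs joins across the three tables to reassemble it. Whether that trade-off justifies a different technology depends on the considerations listed above.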
More General Tips to Consider Before Starting a Big Data Project
And here are a few more general tips to consider before undertaking a big data project:
- A big data analytics solution should be a business decision, not an IT decision.
Determine which problems you want to solve. Identify the issues your organization is facing and envision what the solutions to those problems might be (see 8 Proven Steps to Starting a Big Data Analytics Project).
- Look for business value before you start building your big data infrastructure.
If you do not have a clear and concise definition of your expectations before you start, you should not be doing a big data project (see A Checklist to Evaluate Your Environment).
- Consider the following dimensions (see How to Know if a Big Data Solution is Right for Your Organization):
- Business value from the insight that might be gained from analyzing the data
- Governance considerations for the new sources of data and how the data will be used
- People with relevant skills available and commitment of sponsors
- Volume of the data being captured
- Variety of data sources, data types, and data formats
- Velocity at which the data is generated, the speed with which it needs to be acted upon, or the rate at which it is changing
- Veracity of the data, or rather, the uncertainty or trustworthiness of the data
Limitations of Hadoop
Because much of big data is "unstructured", it must be processed into a form suitable for data mining and analysis. Hadoop addresses this problem and is a core platform for structuring big data.
To help you determine if a big data solution would be beneficial, you should also consider Hadoop’s limitations.
- Hadoop is NOT a replacement for a relational database system.
- Hadoop complements On-Line Transaction Processing and On-Line Analytical Processing.
- Hadoop is NOT suitable for On-Line Transaction Processing workloads where queries are performed on structured data (i.e., from a relational database).
- Hadoop is NOT suitable for On-Line Analytical Processing or Decision Support System workloads to generate business intelligence reports.
- Hadoop is NOT suitable when the work cannot be parallelized.
- Hadoop is NOT needed for processing a lot of small files. In fact, it does not provide sufficient performance for such tasks.
- Hadoop is NOT conducive to performing intensive calculations using small data sets.
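To illustrate what it means for work to be parallelizable, here is a minimal sketch of the map/reduce pattern in plain Python. It does not use Hadoop or any Hadoop API; the log lines, chunking, and function names are assumptions made up for the example.

```python
from collections import Counter
from functools import reduce


def map_chunk(lines):
    """Map step: count words in one chunk, independently of all other chunks."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


# Hypothetical log lines standing in for a much larger data set.
log_lines = [
    "error disk full",
    "info job started",
    "error network down",
    "info job finished",
]

# Split the input into chunks; in a cluster each chunk could go to a different worker.
chunks = [log_lines[:2], log_lines[2:]]
partials = [map_chunk(chunk) for chunk in chunks]
totals = reduce(reduce_counts, partials, Counter())
print(totals.most_common(3))
```

The approach works because each chunk can be counted without knowing anything about the others, and the partial results can be merged in any order. A computation in which every step depends on the result of the previous one cannot be split this way, which is the situation the list above warns about.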
About Patti Gilchrist
Patti Gilchrist is a Sr. Technical Manager with 25 years of experience implementing strategic enterprise initiatives.
Patti has a reputation for effectively translating business problems into innovative solutions and creating strategic roadmaps to achieve business goals.