What is Big Data?
Big Data is a term for collections of data sets so large and complex that they are difficult to process with traditional applications and tools. The term is commonly applied to data measured in terabytes or more. Because of the variety of data it encompasses, big data brings challenges related to both its volume and its complexity. One survey reports that about 80% of the data created in the world is unstructured. One challenge is how to give this unstructured data structure before we attempt to understand and capture the most important parts of it. Another challenge is how to store it. The technologies used to store and analyze Big Data fall into two categories: storage and querying/analysis.
There are many real-life examples of Big Data. Facebook generates 500+ terabytes of data per day, and the NYSE (New York Stock Exchange) generates about 1 terabyte of new trade data per day. A jet airliner collects 10 terabytes of sensor data for every 30 minutes of flying time. All of these are day-to-day examples of Big Data.
How is analysis of Big Data useful for organizations?
Effective analysis of Big Data gives organizations a business advantage by showing which areas deserve focus and which are less important. It can surface early key indicators that help a company avoid a huge loss or seize a great opportunity, and a precise analysis supports better decision making. For instance, people now rely heavily on Facebook and Twitter before buying a product or service, and the data from those interactions is part of the Big Data explosion.
How is Big Data Related to Hadoop?
Big Data is the concept of handling very large data sets. Hadoop is a framework that makes Big Data analysis possible.
Apache Hadoop is a free, Java-based software framework. It is designed to solve problems that involve analyzing very large data sets (e.g. petabytes). Its programming model is based on Google’s MapReduce. The framework runs in parallel on a cluster, which lets us process data across all of its nodes. The Hadoop Distributed File System (HDFS) is Hadoop’s storage system: it splits big data into blocks and distributes them across many nodes in the cluster. HDFS also replicates data within the cluster, which provides high availability.
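The MapReduce model mentioned above can be illustrated with a small word count. This is a minimal single-process sketch in plain Python, not Hadoop's actual Java API: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample chunks are hypothetical, and the list of chunks only stands in for data blocks that a real cluster would hold on different nodes.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # On a cluster each chunk would be mapped on a different node;
    # here the chunks are simply processed one after another.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: two chunks standing in for two data blocks.
chunks = ["big data needs big storage", "hadoop stores big data"]
counts = word_count(chunks)
print(counts["big"], counts["data"], counts["hadoop"])  # prints: 3 2 1
```

Because each map call only sees its own chunk and each reduce call only sees one key, both phases can in principle run in parallel, which is the property the framework relies on.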
HIVE & PIG
Hive is a data warehousing infrastructure built on Hadoop. Hadoop provides massive scale-out and fault-tolerance capabilities for data storage and processing (using the MapReduce programming paradigm) on commodity hardware. Hive is designed to enable easy data summarization, ad-hoc querying, and analysis of large volumes of data. It provides a simple query language called HiveQL, which is based on SQL and lets users familiar with SQL do all of this easily.
Pig is a high-level scripting platform that is used with Apache Hadoop. Pig enables data workers to write complex data transformations without knowing Java. Pig’s simple SQL-like scripting language is called Pig Latin, and it appeals to developers already familiar with scripting languages and SQL.
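To give a feel for the kind of SQL-style summarization described here, the sketch below runs a simple aggregate query. It is only an analogy: Python's built-in sqlite3 module stands in for Hive, so nothing here touches Hadoop, and the page_views table, its columns, and its rows are hypothetical. The query uses only basic SQL constructs (SELECT, COUNT, GROUP BY, ORDER BY).

```python
import sqlite3

# sqlite3 is a stand-in for Hive here; the table and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("u1", "home"), ("u2", "home"), ("u1", "cart"), ("u3", "home")],
)

# Summarize views per page, most viewed first.
query = """
    SELECT page, COUNT(*) AS views
    FROM page_views
    GROUP BY page
    ORDER BY views DESC
"""
summary = conn.execute(query).fetchall()
print(summary)  # prints: [('home', 3), ('cart', 1)]
conn.close()
```

The point of the example is that the analyst writes a declarative query and leaves the execution strategy to the engine, which on Hive means work distributed over the cluster.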
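A Pig Latin script is typically written as a sequence of named transformation steps. The sketch below mimics that step-by-step style in plain Python as an analogy only: it is not Pig Latin and does not run on Hadoop, the records are hypothetical, and the comments merely name the Pig Latin operators (FILTER, GROUP, FOREACH ... GENERATE) that each step loosely resembles.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input: (region, amount) tuples, like rows loaded from a file.
records = [("east", 120), ("west", 80), ("east", 300), ("west", 150)]

# FILTER-like step: keep only rows with an amount above 100.
large = [row for row in records if row[1] > 100]

# GROUP-like step: collect rows per region (groupby needs sorted input).
large.sort(key=itemgetter(0))
grouped = {
    region: list(rows) for region, rows in groupby(large, key=itemgetter(0))
}

# FOREACH ... GENERATE-like step: produce one output tuple per group.
totals = [
    (region, sum(amount for _, amount in rows))
    for region, rows in grouped.items()
]
print(totals)  # prints: [('east', 420), ('west', 150)]
```

Each intermediate name (large, grouped, totals) plays the role that a relation plays in a data-flow script, which makes the pipeline easy to read and to debug one step at a time.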
Demand for Big Data Skills
The demand for professionals with big data skills is increasing alongside the increased investment in big data. Companies today realize the value of data analytics. As a result, they are looking for skilled people to capture and make sense of it. Job portals like Indeed, Naukri, Glassdoor and many more provide a clear picture of the demand for big data jobs. Although big data is a broad term, job listings use it often. You will find it in job postings for data analysts, data scientists and other important roles within the data industry.
Technology changes frequently, and so do buzzwords. Big data, until recently one of the most-used terms, has been replaced with ‘real-time’. This doesn’t mean that the demand for big data skills is now low; it simply means that the keyword has changed. Textio, a Seattle-based startup, reported this after studying 500,000 job applications for big data related jobs.
According to the findings of IDG’s research on big data, organizations are set to invest heavily in the skill sets necessary for big data deployments in the next 12-18 months. This means more opportunities for data scientists, data architects, data analysts, data visualizers, research analysts, and business analysts.
Big Data Job Titles
Here is a list of job titles that require big data skills. These titles are useful in determining search terms for big data jobs.
- Data Engineer
- Big Data Engineer
- Machine Learning Scientist
- Business Analytics Specialist
- Data Visualization Developer
- Business Intelligence (BI) Engineer
- BI Solutions Architect
- BI Specialist
- Analytics Manager
- Machine Learning Engineer
Key Features of HDFS
Difference between RDBMS & Hadoop
Structured & Unstructured Data
Core components of Hadoop
Different Hadoop Services
Introduction to MapReduce Framework
Introduction to HIVE & HIVE architecture
Concept of Metastore, Query Compiler, Execution Engines & Thrift Server
HIVE datatypes and Operators
HQL (Hive Query Language)
Introduction to PIG
Features & Components of PIG
Advantages of Using PIG
PIG usage scenarios
Datatypes, Relations, Bags, Tuples, Fields in PIG
Pig Latin Statements & Pig Scripts