1. What is Big Data?
- Definition: Big Data refers to datasets so large, fast-growing, or varied that traditional tools (such as a single relational database server) cannot store or process them efficiently.
- Characteristics (5 V’s):
- Volume: Huge amount of data.
- Velocity: Speed of data generation.
- Variety: Different types of data (structured, unstructured, semi-structured).
- Veracity: Data accuracy and reliability.
- Value: Useful insights derived from data.
2. Examples of Big Data
- Social media posts (Twitter, Facebook).
- E-commerce transactions (Amazon, Flipkart).
- IoT devices (smart home sensors).
- Healthcare records.
3. Big Data Technologies
- Hadoop: Open-source framework for distributed storage and processing of large datasets.
- Key components:
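- HDFS (Hadoop Distributed File System): Stores data as blocks replicated across the nodes of a cluster.
- MapReduce: Processes data in parallel using a map phase followed by a reduce phase.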
- Spark: In-memory data processing engine.
- NoSQL Databases:
- Examples: MongoDB, Cassandra, HBase.
- Designed for unstructured and semi-structured data, with flexible schemas and horizontal scaling.
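The MapReduce idea listed above can be illustrated with a word count. This is a minimal single-machine sketch in plain Python, not Hadoop's actual API: the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample documents are assumptions made for illustration. In a real cluster the map and reduce steps would run on different nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the values of each key to get the final count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = []
    for doc in documents:  # on a cluster, each document could go to a different node
        pairs.extend(map_phase(doc))
    return reduce_phase(shuffle_phase(pairs))


# Hypothetical sample input
docs = ["big data is big", "data is everywhere"]
counts = word_count(docs)
print(counts)
```

The key point for exams: map produces key-value pairs, shuffle groups them by key, and reduce combines each group into one result.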
4. Big Data Tools
- Storage: HDFS, Amazon S3.
- Processing: Hadoop (MapReduce), Spark, Apache Flink (stream processing).
- Analysis: Hive, Pig.
- Visualization: Tableau, Power BI.
5. Types of Data
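- Structured Data: Data with a fixed schema of rows and columns (e.g., SQL tables).
- Unstructured Data: Data with no predefined model (e.g., images, videos, free text).
- Semi-structured Data: Data without a fixed schema but organized by tags or keys (e.g., JSON, XML).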
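A small example may make the three types easier to tell apart. The sketch below uses only Python's standard library; the sample records (names, cities, tags) and the byte string standing in for an image are made up for illustration.

```python
import csv
import io
import json

# Structured: every row follows the same fixed columns, like a SQL table
csv_text = "id,name,city\n1,Asha,Pune\n2,Ravi,Delhi\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: keys describe the data, but records may differ in shape
json_text = '[{"id": 1, "name": "Asha", "tags": ["prime"]}, {"id": 2, "name": "Ravi"}]'
records = json.loads(json_text)

# Unstructured: raw bytes with no fields at all (stand-in for an image or video)
blob = bytes([137, 80, 78, 71])

structured_columns = sorted(rows[0].keys())
semi_structured_keys = [sorted(record.keys()) for record in records]

print(structured_columns)    # same columns for every row
print(semi_structured_keys)  # key sets can vary from record to record
print(len(blob), "bytes with no built-in structure")
```

Note that the second JSON record has no "tags" key. That flexibility is what separates semi-structured data from a fixed-schema table.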
6. Key Big Data Concepts
- Distributed Computing: Data processing across multiple servers.
- Data Mining: Extracting useful patterns.
- Machine Learning: Predictive modeling and pattern recognition.
- Data Warehousing: Central repository of integrated data.
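Distributed computing can be pictured as "split the data, process the pieces in parallel, combine the partial results". The sketch below imitates this on one machine, with threads standing in for servers. The function names, the worker count, and the sample readings are assumptions for illustration only; a real system would send each chunk to a separate machine.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, num_workers):
    """Partition the data into roughly equal chunks, at most one per worker."""
    if not data:
        return []
    chunk_size = -(-len(data) // num_workers)  # ceiling division
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def process_chunk(chunk):
    """Work done independently by one worker: here, a partial sum."""
    return sum(chunk)


def distributed_sum(data, num_workers=4):
    """Process every chunk in parallel, then combine the partial results."""
    chunks = split_into_chunks(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(process_chunk, chunks))
    return sum(partial_sums)


# Hypothetical sensor readings 1..100
readings = list(range(1, 101))
total = distributed_sum(readings)
print(total)
```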
7. Big Data Analytics
- Descriptive Analytics: What happened?
- Predictive Analytics: What will happen?
- Prescriptive Analytics: What should we do?
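The three analytics types can be compared on one tiny dataset. This is a minimal sketch under stated assumptions: the monthly sales figures, the stock level, and the "order the shortfall" rule are all hypothetical, and a simple least-squares trend line stands in for a real predictive model.

```python
from statistics import mean

# Hypothetical units sold in four consecutive months
monthly_sales = [100, 120, 140, 160]

# Descriptive: what happened? Summarize the past.
average_sales = mean(monthly_sales)

# Predictive: what will happen? Fit a least-squares line and extend it one month.
n = len(monthly_sales)
xs = list(range(n))
x_mean = mean(xs)
slope = sum(
    (x - x_mean) * (y - average_sales) for x, y in zip(xs, monthly_sales)
) / sum((x - x_mean) ** 2 for x in xs)
intercept = average_sales - slope * x_mean
forecast = intercept + slope * n  # prediction for the next month

# Prescriptive: what should we do? Turn the forecast into an action.
stock_on_hand = 150  # hypothetical inventory level
order_quantity = max(0, forecast - stock_on_hand)

print("Average monthly sales:", average_sales)
print("Forecast for next month:", forecast)
print("Units to order:", order_quantity)
```

Descriptive analytics only summarizes, predictive analytics estimates a future value, and prescriptive analytics converts that estimate into a recommended decision.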
8. Challenges of Big Data
- Data storage and management.
- Ensuring data privacy and security.
- Analyzing real-time data.
- Lack of skilled professionals.
9. Applications of Big Data
- Healthcare: Personalized medicine, disease prediction.
- Finance: Fraud detection, risk management.
- Retail: Customer behavior analysis, recommendation systems.
- Transport: Traffic prediction, route optimization.
- Government: Smart cities, policy analysis.
10. Exam Quick Tips
- Remember the 5 V’s of Big Data.
- Focus on technologies like Hadoop and Spark.
- Differentiate between structured, unstructured, and semi-structured data.
- Know examples of Big Data applications.
- Understand key analytics types: descriptive, predictive, prescriptive.
Cheat Sheet Summary
- Frameworks: Hadoop (HDFS + MapReduce), Spark.
- Databases: MongoDB, Cassandra.
- Analysis Tools: Hive, Pig.
- Key Applications: Healthcare, finance, retail, transport.
Multiple Choice Questions on Big Data
1. What are the 5 V’s of Big Data?
A) Volume, Velocity, Variety, Veracity, Value
B) Volume, Value, Visualization, Variety, Variance
C) Value, Volume, Variety, Verification, Velocity
D) Visualization, Variety, Veracity, Value, Volume
Answer: A
2. Which of the following is an open-source framework for distributed storage and processing of Big Data?
A) Spark
B) Hadoop
C) Tableau
D) SQL Server
Answer: B
3. What does HDFS stand for in the context of Big Data?
A) High Distributed File Storage
B) Hadoop Distributed File System
C) Hybrid Data File Storage
D) Hadoop Data Flow System
Answer: B
4. What type of data are NoSQL databases primarily designed to handle?
A) Structured data only
B) Unstructured and semi-structured data
C) Processed and raw data
D) Financial data exclusively
Answer: B
5. What is the main purpose of Apache Spark in Big Data?
A) Data visualization
B) In-memory data processing
C) Data storage
D) Predictive analytics
Answer: B
6. Which of the following is NOT one of the 5 V’s of Big Data listed above?
A) Volume
B) Velocity
C) Variability
D) Variety
Answer: C
7. What is the role of MapReduce in Big Data?
A) Data visualization
B) Distributed processing of data
C) Managing databases
D) Analyzing data in real-time
Answer: B
8. Which database is commonly used in Big Data for unstructured data?
A) MySQL
B) Oracle
C) MongoDB
D) SQL Server
Answer: C
9. Which type of analytics focuses on “What should we do?”
A) Descriptive Analytics
B) Diagnostic Analytics
C) Predictive Analytics
D) Prescriptive Analytics
Answer: D
10. Which Big Data tool is used for data visualization?
A) Hive
B) Tableau
C) Pig
D) Cassandra
Answer: B
11. What is an example of semi-structured data?
A) Video files
B) SQL tables
C) JSON files
D) Text documents
Answer: C
12. Which Big Data technology is known for its distributed storage and scalability?
A) Tableau
B) Hadoop
C) Excel
D) Oracle
Answer: B
13. Which challenge is most closely associated with analyzing Big Data in real time?
A) Storage capacity
B) Privacy concerns
C) High latency
D) Lack of data integrity
Answer: C
14. Which application area uses Big Data for traffic prediction?
A) Retail
B) Transport
C) Healthcare
D) Government
Answer: B
15. What is distributed computing?
A) Running multiple computations on a single server
B) Processing data across multiple servers
C) Centralizing data for faster access
D) Encrypting data for secure storage
Answer: B