1. What is Big Data?
- Definition:
High-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of processing to support insight and decision-making.
2. Types of Big Data
Type | Description | Key Characteristics | Banking/Finance Examples |
---|---|---|---|
Structured Data | Organized, tabular data | Fixed schema, stored in RDBMS, easy to query with SQL | Customer account info, transactions, loan details, employee records |
Unstructured Data | Raw, unorganized data | No schema, qualitative, requires advanced tools (NLP, AI, image processing) | Call recordings, emails, scanned KYC docs, social media comments, CCTV footage |
Semi-Structured Data | Mix of structured + unstructured | Tags/metadata, flexible schema, often in XML/JSON | Web server logs, stock feeds (JSON/XML), SWIFT messages |
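
To make the "flexible schema" point about semi-structured data concrete, here is a minimal sketch that parses JSON records whose fields vary from message to message. The feed contents and the `parse_tick` helper are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical stock-feed messages: same feed, but the fields vary per record.
raw_messages = [
    '{"symbol": "ABC", "price": 101.5, "ts": "2024-01-02T09:15:00"}',
    '{"symbol": "XYZ", "price": 55.0, "ts": "2024-01-02T09:15:01", "volume": 1200}',
]

def parse_tick(message):
    """Parse one JSON message, tolerating a missing optional field."""
    record = json.loads(message)
    return {
        "symbol": record["symbol"],
        "price": float(record["price"]),
        "volume": record.get("volume"),  # None when the feed omits it
    }

ticks = [parse_tick(m) for m in raw_messages]
for tick in ticks:
    print(tick)
```

The tags (keys) give each record some structure, but nothing forces every record to carry the same set of keys, which is what separates this from a fixed-schema table.
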
3. Core Characteristics of Big Data (Vs)
V | Meaning | Banking Example |
---|---|---|
Volume | Large amounts of data | Millions of daily transactions |
Velocity | Speed of generation and processing | Real-time fraud detection |
Variety | Different forms of data | Structured account info + unstructured calls |
Veracity | Accuracy and trustworthiness | Clean and validated customer records |
Value | Business impact from data | Insights for new loan products |
Variability | Changing meaning/context | Changing social media sentiments |
Visualization | Graphical presentation for clarity | NPA dashboards |
4. Big Data Ecosystem & Architecture
A. Storage Layer (HDFS)
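- NameNode: Master node; stores file-system metadata (namespace and block locations).
- DataNode: Worker nodes; store the actual data blocks (128 MB by default, often configured to 256 MB).
- Replication: Each block is copied to several DataNodes (3 by default) to provide fault tolerance.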
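
The block and replication settings above translate into simple arithmetic. The sketch below assumes a 128 MB block size and a replication factor of 3 (the usual HDFS defaults); the helper name `hdfs_footprint` is made up for this example.

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Estimate block count and raw cluster storage for one file.

    Assumes the last block only occupies the space it actually needs,
    so raw storage is simply file size times the replication factor.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    raw_storage_mb = file_size_mb * replication
    return blocks, raw_storage_mb

# A 1 GB (1024 MB) transaction log.
blocks, raw_mb = hdfs_footprint(1024)
print(blocks, raw_mb)
```

With these assumptions a 1024 MB file is split into 8 blocks and occupies 3072 MB of raw disk across the cluster.
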
B. Resource Management Layer (YARN)
- ResourceManager: Allocates cluster resources.
- NodeManager: Manages resources on each node.
C. Processing Layer
Tool | Description |
---|---|
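MapReduce | Batch processing over key-value pairs; slower because intermediate results are written to disk. |
Apache Spark | In-memory processing; up to 100x faster than MapReduce on some in-memory workloads; supports streaming and ML. |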
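
The key-value model behind MapReduce can be sketched in plain Python. This is a single-process illustration of the map, shuffle, and reduce steps, not Hadoop's actual API; the transaction data and function names are assumptions for the example.

```python
from collections import defaultdict

# Hypothetical input: (account_id, amount) transaction records.
transactions = [("A1", 100.0), ("B2", 250.0), ("A1", 50.0), ("C3", 75.0), ("B2", 25.0)]

def map_phase(record):
    """Emit one (key, value) pair per input record."""
    account_id, amount = record
    yield account_id, amount

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

mapped = [pair for record in transactions for pair in map_phase(record)]
totals = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(totals)
```

In a real cluster the map and reduce calls run on different nodes and the shuffle moves data over the network, which is a large part of why batch MapReduce jobs are comparatively slow.
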
D. Data Ingestion
Tool | Use Case |
---|---|
Apache Sqoop | Import structured DB data to HDFS. |
Apache Flume | Capture streaming/unstructured data like logs or social media. |
E. Analysis & Query Tools
Tool | Use Case |
---|---|
Hive | SQL-like queries (batch reporting). |
Pig | High-level scripting (Pig Latin). |
Impala | Low-latency, interactive SQL queries. |
5. NoSQL Databases for Big Data
Type | Description | Examples | Use Case |
---|---|---|---|
Document Store | JSON-like documents, flexible schema | MongoDB, CouchDB | Unified customer profiles |
Column-Family Store | Rows grouped into column families; scales for high write throughput | Cassandra, HBase | Time-series data (stock ticks) |
Key-Value Store | Key-value pairs, very fast | Redis, DynamoDB | Caching sessions in mobile apps |
Graph Database | Relationship-based storage | Neo4j, JanusGraph | Detecting fraud and money-laundering networks |
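
As a rough illustration of the graph use case, the sketch below looks for a circular chain of transfers (money leaving an account and returning to it through intermediaries). A graph database would normally express this in its own query language; the plain-Python traversal, the account names, and the `find_cycle` helper are assumptions for the example.

```python
# Hypothetical transfers as a directed graph: sender -> list of receivers.
transfers = {
    "acct_A": ["acct_B"],
    "acct_B": ["acct_C"],
    "acct_C": ["acct_A", "acct_D"],
    "acct_D": [],
}

def find_cycle(graph, start):
    """Return one path that leaves `start` and comes back to it, or None."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        for neighbour in graph.get(node, []):
            if neighbour == start:
                return path + [start]
            if neighbour not in path:  # avoid revisiting accounts on this path
                stack.append((neighbour, path + [neighbour]))
    return None

cycle = find_cycle(transfers, "acct_A")
print(cycle)
```

Relationship-centric questions like this are awkward to express as SQL joins, which is why graph stores are listed for fraud and money-laundering networks.
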
6. Applications of Big Data in BFSI
Application | Technique | Example |
---|---|---|
Fraud Detection | Real-time anomaly detection (ML models) | Unusual high-value transfer alert |
Credit Risk Assessment | Predictive analytics | Using weather + satellite data for farm loans |
Customer Segmentation | 360° view, churn prediction | Targeted marketing campaigns |
Regulatory Compliance | AML pattern detection | Tracking suspicious fund transfers |
Algorithmic Trading | High-frequency data analytics | Automated trade execution |
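
To give a feel for the fraud-detection row, here is a deliberately simple stand-in for an ML model: a z-score rule that flags a transfer far outside a customer's usual amounts. The threshold of 3 standard deviations and the sample history are assumptions, and production systems use far richer features and models.

```python
from statistics import mean, stdev

def is_anomalous(history, amount, threshold=3.0):
    """Flag `amount` if it is more than `threshold` standard deviations
    away from the mean of the customer's past transaction amounts."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return amount != mu
    return abs(amount - mu) / sigma > threshold

# Hypothetical past transfer amounts for one customer.
history = [1200.0, 950.0, 1100.0, 1050.0, 990.0, 1150.0]

print(is_anomalous(history, 1000.0))    # a typical amount
print(is_anomalous(history, 250000.0))  # an unusually high-value transfer
```
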
7. Data Storage Approaches
Aspect | Data Warehouse | Data Lake | Data Lakehouse |
---|---|---|---|
Data Type | Structured | Raw (all types) | All types |
Schema | Schema-on-write (defined before loading) | Schema-on-read (applied at query time) | Supports both |
Users | Business analysts | Data scientists, developers | Both |
Purpose | Reporting, BI | Machine learning, exploration | Unified analytics |
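
The schema-on-write versus schema-on-read distinction in the table can be sketched as follows. The in-memory lists standing in for a warehouse table and a lake, the `REQUIRED_FIELDS` schema, and the function names are all hypothetical.

```python
import json

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"account_id": str, "balance": float}

def write_to_warehouse(table, record):
    """Schema-on-write: validate before storing and reject bad records."""
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(record.get(field), expected_type):
            raise ValueError(f"field {field!r} missing or not {expected_type.__name__}")
    table.append(record)

def write_to_lake(store, raw_line):
    """Schema-on-read: store the raw text untouched."""
    store.append(raw_line)

def read_from_lake(store):
    """Apply the schema only at read time; skip records that do not fit."""
    for raw_line in store:
        record = json.loads(raw_line)
        if all(isinstance(record.get(f), t) for f, t in REQUIRED_FIELDS.items()):
            yield record

warehouse, lake = [], []
good = {"account_id": "A1", "balance": 500.0}
bad = {"account_id": "B2"}  # balance is missing

write_to_warehouse(warehouse, good)
try:
    write_to_warehouse(warehouse, bad)
    rejected = False
except ValueError:
    rejected = True

write_to_lake(lake, json.dumps(good))
write_to_lake(lake, json.dumps(bad))
readable = list(read_from_lake(lake))
print(rejected, len(warehouse), len(lake), len(readable))
```

The warehouse never holds the malformed record, while the lake keeps everything and leaves interpretation to whoever reads it later.
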
8. Cloud & Big Data
Layer | AWS | Azure | GCP |
---|---|---|---|
Storage | S3 | Azure Blob Storage | Cloud Storage (GCS) |
Processing | EMR | Azure Databricks | Dataproc |
9. Cheat Sheet
- Frameworks: Hadoop (HDFS, MapReduce), Spark.
- Databases: MongoDB, Cassandra, Redis, Neo4j.
- Tools: Hive, Pig, Sqoop, Flume.
- Applications: Finance, healthcare, retail, transportation, gaming.
- Key Mnemonic: “3 Vs + 4 Extra Vs = 7 Vs (Volume, Velocity, Variety, Veracity, Value, Variability, Visualization)”
Multiple Choice Questions on Big Data
1. What are the 5 V’s of Big Data?
A) Volume, Velocity, Variety, Veracity, Value
B) Volume, Value, Visualization, Variety, Variance
C) Value, Volume, Variety, Verification, Velocity
D) Visualization, Variety, Veracity, Value, Volume
Answer: A
2. Which of the following is an open-source framework for distributed storage and processing of Big Data?
A) Spark
B) Hadoop
C) Tableau
D) SQL Server
Answer: B
3. What does HDFS stand for in the context of Big Data?
A) High Distributed File Storage
B) Hadoop Distributed File System
C) Hybrid Data File Storage
D) Hadoop Data Flow System
Answer: B
4. Which type of data are NoSQL databases primarily designed to handle?
A) Structured data only
B) Unstructured and semi-structured data
C) Processed and raw data
D) Financial data exclusively
Answer: B
5. What is the main purpose of Apache Spark in Big Data?
A) Data visualization
B) In-memory data processing
C) Data storage
D) Predictive analytics
Answer: B
6. Which of the following is NOT one of the original 3 V’s of Big Data?
A) Volume
B) Velocity
C) Variability
D) Variety
Answer: C
7. What is the role of MapReduce in Big Data?
A) Data visualization
B) Distributed processing of data
C) Managing databases
D) Analyzing data in real-time
Answer: B
8. Which database is commonly used in Big Data for unstructured or semi-structured data?
A) MySQL
B) Oracle
C) MongoDB
D) SQL Server
Answer: C
9. Which type of analytics focuses on “What should we do?”
A) Descriptive Analytics
B) Diagnostic Analytics
C) Predictive Analytics
D) Prescriptive Analytics
Answer: D
10. Which Big Data tool is used for data visualization?
A) Hive
B) Tableau
C) Pig
D) Cassandra
Answer: B
11. What is an example of semi-structured data?
A) Video files
B) SQL tables
C) JSON files
D) Text documents
Answer: C
12. Which Big Data technology is known for its distributed storage and scalability?
A) Tableau
B) Hadoop
C) Excel
D) Oracle
Answer: B
13. What challenge does Big Data face with real-time analysis?
A) Storage capacity
B) Privacy concerns
C) High latency
D) Lack of data integrity
Answer: C
14. Which application area uses Big Data for traffic prediction?
A) Retail
B) Transport
C) Healthcare
D) Government
Answer: B
15. What is distributed computing?
A) Running multiple computations on a single server
B) Processing data across multiple servers
C) Centralizing data for faster access
D) Encrypting data for secure storage
Answer: B