Big Data

1. What is Big Data?

  • Definition:
    High-Volume, high-Velocity, and high-Variety information assets that require cost-effective, innovative processing for insights and decision-making.

2. Types of Big Data

Type | Description | Key Characteristics | Banking/Finance Examples
Structured Data | Organized, tabular data | Fixed schema, stored in RDBMS, easy to query with SQL | Customer account info, transactions, loan details, employee records
Unstructured Data | Raw, unorganized data | No schema, qualitative, requires advanced tools (NLP, AI, image processing) | Call recordings, emails, scanned KYC docs, social media comments, CCTV footage
Semi-Structured Data | Mix of structured + unstructured | Tags/metadata, flexible schema, often in XML/JSON | Web server logs, stock feeds (JSON/XML), SWIFT messages
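
To make the "flexible schema" of semi-structured data concrete, here is a minimal Python sketch. The feed messages, field names, and the `to_row` helper are hypothetical and only for illustration; the point is that both messages are valid JSON even though their fields differ, and a fixed set of columns has to be imposed when the data is read.

```python
import json

# Hypothetical stock-feed messages: both are valid JSON, but the fields differ.
feed = [
    '{"symbol": "ABC", "price": 101.5, "ts": "2024-01-02T09:15:00"}',
    '{"symbol": "XYZ", "price": 55.0, "volume": 1200, "tags": ["nse", "bank"]}',
]

def to_row(message, columns=("symbol", "price", "volume")):
    """Project one semi-structured message onto a fixed set of columns."""
    record = json.loads(message)
    # Missing fields become None; extra fields (ts, tags) are dropped.
    return tuple(record.get(col) for col in columns)

rows = [to_row(m) for m in feed]
print(rows)
```

A structured (RDBMS) table would have rejected the second message or required a schema change first; here the tags in the data itself describe what each record contains.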

3. Core Characteristics of Big Data (Vs)

V | Meaning | Banking Example
Volume | Large amounts of data | Millions of daily transactions
Velocity | Speed of generation and processing | Real-time fraud detection
Variety | Different forms of data | Structured account info + unstructured calls
Veracity | Accuracy and trustworthiness | Clean and validated customer records
Value | Business impact from data | Insights for new loan products
Variability | Changing meaning/context | Changing social media sentiments
Visualization | Graphical presentation for clarity | NPA dashboards

4. Big Data Ecosystem & Architecture

A. Storage Layer (HDFS)

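  • NameNode: Master node; stores file system metadata (file names, block locations).
  • DataNode: Worker nodes; store the actual data blocks (128 MB by default, often configured to 256 MB).
  • Replication: Each block is copied to several DataNodes (3 by default), which provides fault tolerance.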
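
The block and replication figures above translate into simple arithmetic. The sketch below is a back-of-the-envelope estimate, not an HDFS API call; the function name `hdfs_footprint` is made up for illustration, and the defaults (128 MB blocks, replication factor 3) are assumptions that a real cluster may override.

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Estimate how a file is split and replicated (assumed defaults)."""
    # The last block may be partially filled, so round up.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is stored `replication` times across DataNodes.
    block_replicas = blocks * replication
    # A partial last block only occupies its actual size on disk.
    raw_storage_mb = file_size_mb * replication
    return blocks, block_replicas, raw_storage_mb

print(hdfs_footprint(1000))                     # 128 MB blocks
print(hdfs_footprint(1000, block_size_mb=256))  # 256 MB blocks
```

For a 1000 MB file this gives 8 blocks with 128 MB blocks versus 4 blocks with 256 MB blocks; raw storage is 3000 MB in both cases, which is why larger blocks mainly reduce NameNode metadata rather than disk usage.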

B. Resource Management Layer (YARN)

  • ResourceManager: Allocates cluster resources.
  • NodeManager: Manages resources on each node.

C. Processing Layer

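Tool | Description
MapReduce | Disk-based batch processing over key-value pairs; slower, especially for iterative jobs.
Apache Spark | In-memory processing, often cited as up to 100x faster than MapReduce for some workloads; supports stream processing and ML.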
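
The key-value model behind MapReduce can be sketched in plain Python. This is a single-process illustration of the map, shuffle, and reduce steps, not Hadoop code; the sample transactions and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical input records: (account, amount).
transactions = [
    ("ACC1", 250.0),
    ("ACC2", 75.0),
    ("ACC1", 100.0),
    ("ACC3", 40.0),
    ("ACC2", 25.0),
]

def map_phase(record):
    """Map: emit one (key, value) pair per input record."""
    account, amount = record
    yield account, amount

def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse each key's values into one result."""
    return key, sum(values)

mapped = [pair for record in transactions for pair in map_phase(record)]
totals = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(totals)
```

In a real cluster the map and reduce functions run on many nodes and the shuffle moves data over the network, with intermediate results written to disk; that disk I/O is the main reason MapReduce is slower than in-memory engines.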

D. Data Ingestion

Tool | Use Case
Apache Sqoop | Import structured DB data to HDFS.
Apache Flume | Capture streaming/unstructured data like logs or social media.

E. Analysis & Query Tools

Tool | Use Case
Hive | SQL-like queries (batch reporting).
Pig | High-level scripting (Pig Latin).
Impala | Low-latency, interactive SQL queries.

5. NoSQL Databases for Big Data

Type | Description | Examples | Use Case
Document Store | JSON-like documents, flexible schema | MongoDB, CouchDB | Unified customer profiles
Column-Family Store | Rows keyed by ID with columns grouped into families; suited to wide, sparse, write-heavy data | Cassandra, HBase | Time-series data (stock ticks)
Key-Value Store | Key-value pairs, very fast | Redis, DynamoDB | Caching sessions in mobile apps
Graph Database | Relationship-based storage (nodes and edges) | Neo4j, JanusGraph | Detecting fraud and money-laundering networks
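
To illustrate why relationship-based storage suits fraud and money-laundering detection, the sketch below models transfers as a directed graph and looks for funds that loop back to their origin. It uses a plain Python dict rather than a real graph database; the account names and the `find_cycle` helper are hypothetical.

```python
# Hypothetical transfer graph: account -> accounts it sent money to.
transfers = {
    "A": ["B"],
    "B": ["C"],
    "C": ["A", "D"],
    "D": [],
}

def find_cycle(graph, start):
    """Return a path that leaves `start` and returns to it, or None."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        for nxt in graph.get(node, []):
            if nxt == start:
                # Money has come back to where it started.
                return path + [start]
            if nxt not in path:
                stack.append((nxt, path + [nxt]))
    return None

print(find_cycle(transfers, "A"))
print(find_cycle(transfers, "D"))
```

A graph database runs this kind of traversal natively over stored relationships, whereas a relational store would need repeated self-joins for each additional hop.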

6. Applications of Big Data in BFSI

Application | Technique | Example
Fraud Detection | Real-time anomaly detection (ML models) | Unusual high-value transfer alert
Credit Risk Assessment | Predictive analytics | Using weather + satellite data for farm loans
Customer Segmentation | 360° view, churn prediction | Targeted marketing campaigns
Regulatory Compliance | AML pattern detection | Tracking suspicious fund transfers
Algorithmic Trading | High-frequency data analytics | Automated trade execution
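
As a rough illustration of the fraud-detection row, the sketch below flags transfers that sit far from a customer's usual amounts using a z-score. This is a simple statistical stand-in, not the ML models a bank would actually deploy; the history values, the threshold of 3.0, and the `flag_anomalies` name are all assumptions.

```python
from statistics import mean, stdev

def flag_anomalies(history, new_amounts, threshold=3.0):
    """Flag amounts more than `threshold` standard deviations from the mean."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # No spread in the history, so a z-score is undefined.
        return []
    return [amt for amt in new_amounts if abs(amt - mu) / sigma > threshold]

# Hypothetical past transfer amounts for one customer.
history = [120, 95, 130, 110, 105, 98, 125, 115]
flagged = flag_anomalies(history, [118, 5000, 90])
print(flagged)
```

Only the 5000 transfer is flagged; 118 and 90 fall within the customer's normal range. Production systems typically combine many such signals (location, device, timing) rather than relying on amount alone.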

7. Data Storage Approaches

Aspect | Data Warehouse | Data Lake | Data Lakehouse
Data Type | Structured | Raw (all types) | All types
Schema | Schema-on-Write (defined before loading) | Schema-on-Read (applied at query time) | Both
Users | Business analysts | Data scientists, developers | Both
Purpose | Reporting, BI | Machine learning, exploration | Unified analytics
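
The schema-on-write versus schema-on-read distinction can be sketched in a few lines of Python. The `SCHEMA` dict, field names, and both helper functions are hypothetical; real warehouses and lakes enforce this through their storage engines and query layers.

```python
import json

# Hypothetical warehouse schema: field name -> required Python type.
SCHEMA = {"account_id": str, "amount": float}

def write_to_warehouse(table, record):
    """Schema-on-write: validate before storing; reject bad records."""
    for field, ftype in SCHEMA.items():
        if not isinstance(record.get(field), ftype):
            raise ValueError(f"bad or missing field: {field}")
    table.append({f: record[f] for f in SCHEMA})

def read_from_lake(raw_lines):
    """Schema-on-read: store anything, interpret it at query time."""
    for line in raw_lines:
        record = json.loads(line)
        # Records that do not fit this particular query are skipped.
        if "account_id" in record and "amount" in record:
            yield record["account_id"], float(record["amount"])

warehouse = []
write_to_warehouse(warehouse, {"account_id": "ACC1", "amount": 250.0})

# The lake accepts raw lines as-is, including ones with no usable amount.
lake = [
    '{"account_id": "ACC1", "amount": "250.0", "channel": "mobile"}',
    '{"note": "free-text complaint, no amount"}',
]
lake_rows = list(read_from_lake(lake))
print(warehouse, lake_rows)
```

The warehouse pays the validation cost up front and guarantees clean rows; the lake defers that cost to each reader, which keeps ingestion cheap but pushes data-quality handling into every query.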

8. Cloud & Big Data

Layer | AWS | Azure | GCP
Storage | S3 | Blob Storage | Cloud Storage (GCS)
Processing | EMR | Azure Databricks | Dataproc

9. Cheat Sheet

  • Frameworks: Hadoop (HDFS, MapReduce), Spark.
  • Databases: MongoDB, Cassandra, Redis, Neo4j.
  • Tools: Hive, Pig, Sqoop, Flume.
  • Applications: Finance, healthcare, retail, transportation, gaming.
  • Key Mnemonic: “3 Vs + 4 Extra Vs = 7 Vs (Volume, Velocity, Variety, Veracity, Value, Variability, Visualization)”

Multiple Choice Questions on Big Data

1. What are the 5 V’s of Big Data?

A) Volume, Velocity, Variety, Veracity, Value
B) Volume, Value, Visualization, Variety, Variance
C) Value, Volume, Variety, Verification, Velocity
D) Visualization, Variety, Veracity, Value, Volume

Answer: A

2. Which of the following is an open-source framework for distributed storage and processing of Big Data?

A) Spark
B) Hadoop
C) Tableau
D) SQL Server

Answer: B

3. What does HDFS stand for in the context of Big Data?

A) High Distributed File Storage
B) Hadoop Distributed File System
C) Hybrid Data File Storage
D) Hadoop Data Flow System

Answer: B

4. What type of data are NoSQL databases primarily designed to handle?

A) Structured data only
B) Unstructured and semi-structured data
C) Processed and raw data
D) Financial data exclusively

Answer: B

5. What is the main purpose of Apache Spark in Big Data?

A) Data visualization
B) In-memory data processing
C) Data storage
D) Predictive analytics

Answer: B

6. Which of the following is NOT one of the original 3 Vs of Big Data?

A) Volume
B) Velocity
C) Variability
D) Variety

Answer: C

7. What is the role of MapReduce in Big Data?

A) Data visualization
B) Distributed processing of data
C) Managing databases
D) Analyzing data in real-time

Answer: B

8. Which database is commonly used in Big Data for unstructured data?

A) MySQL
B) Oracle
C) MongoDB
D) SQL Server

Answer: C

9. Which type of analytics focuses on “What should we do?”

A) Descriptive Analytics
B) Diagnostic Analytics
C) Predictive Analytics
D) Prescriptive Analytics

Answer: D

10. Which Big Data tool is used for data visualization?

A) Hive
B) Tableau
C) Pig
D) Cassandra

Answer: B

11. What is an example of semi-structured data?

A) Video files
B) SQL tables
C) JSON files
D) Text documents

Answer: C

12. Which Big Data technology is known for its distributed storage and scalability?

A) Tableau
B) Hadoop
C) Excel
D) Oracle

Answer: B

13. What challenge does Big Data face with real-time analysis?

A) Storage capacity
B) Privacy concerns
C) High latency
D) Lack of data integrity

Answer: C

14. Which application area uses Big Data for traffic prediction?

A) Retail
B) Transport
C) Healthcare
D) Government

Answer: B

15. What is distributed computing?

A) Running multiple computations on a single server
B) Processing data across multiple servers
C) Centralizing data for faster access
D) Encrypting data for secure storage

Answer: B