Big Data

🧠 1. What is Big Data?

Big Data refers to very large, complex, and fast-growing datasets that cannot be easily managed, processed, or analyzed using traditional tools such as spreadsheets or relational (SQL) databases alone.


🔍 In Simple Words:

  • Every time you use your ATM, UPI, mobile app, or card, data is generated.
  • This data grows very fast and comes from many sources — transactions, sensors, social media, websites, etc.
  • The huge and diverse nature of such data is called Big Data.

📊 Example:

| Source | Type of Data | Example |
| --- | --- | --- |
| Banking transactions | Structured | Account number, amount, balance |
| WhatsApp messages | Unstructured | Text, audio, video |
| IoT sensors in ATMs | Semi-structured | Temperature logs, alerts |
| Social media | Unstructured | Tweets, likes, comments |

📏 2. Characteristics of Big Data (The 5 Vs)

| V | Meaning | Explanation | Example |
| --- | --- | --- | --- |
| Volume | Size of data | Huge amounts of data generated daily | Millions of ATM transactions per day |
| Velocity | Speed of generation | Data is created and updated in real time | UPI transactions per second |
| Variety | Different forms of data | Structured, semi-structured, unstructured | Excel sheets, images, videos, JSON files |
| Veracity | Accuracy of data | Reliability and quality of data | Removing duplicate or wrong entries |
| Value | Usefulness of data | How much insight/benefit the data gives | Fraud detection, risk analysis |

💡 Sometimes exams ask: “What are the 3 Vs / 5 Vs of Big Data?” — remember these keywords.


⚙️ 3. Components of Big Data Architecture

Big Data systems work in three stages — Storage, Processing, and Analysis.

| Stage | Function | Technology Examples |
| --- | --- | --- |
| Storage | Store massive data sets safely | HDFS (Hadoop Distributed File System), HBase, S3 |
| Processing | Handle and compute data efficiently | Hadoop MapReduce, Apache Spark |
| Analysis | Extract insights and visualize | Hive, Pig, Tableau, Power BI |
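
The three stages above can be sketched end to end in plain Python. This is only a toy stand-in — the lists and records below are illustrative, and in a real system the three steps would be HDFS/S3, Spark/MapReduce, and Hive or a BI tool:

```python
from collections import Counter

# 1. Storage: raw transaction records land in a durable store
#    (a list stands in for HDFS/S3 here; data is made up).
raw_store = [
    "ATM withdrawal 500",
    "UPI transfer 1200",
    "ATM withdrawal 300",
    "UPI transfer 800",
]

# 2. Processing: parse and normalize every record (the map-style step).
parsed = [line.split() for line in raw_store]
channels = [p[0] for p in parsed]      # e.g. "ATM", "UPI"
amounts = [int(p[-1]) for p in parsed]

# 3. Analysis: aggregate into insights a dashboard would visualize.
by_channel = Counter(channels)   # transaction count per channel
total_value = sum(amounts)       # total value processed
print(by_channel, total_value)
```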

🧩 4. Key Technologies in Big Data

🔸 1. Hadoop Ecosystem

Hadoop = an open-source framework for distributed storage and parallel processing of large datasets across clusters of machines.

| Component | Purpose |
| --- | --- |
| HDFS (Hadoop Distributed File System) | Stores large data across multiple servers |
| MapReduce | Processes and analyzes data in parallel |
| YARN (Yet Another Resource Negotiator) | Manages cluster resources |
| Hive | SQL-like querying on Big Data |
| Pig | Data transformation (ETL scripting) |
| HBase | NoSQL database for large tables |
| Sqoop | Transfers data between Hadoop and RDBMS |
| Flume | Collects streaming data (like logs) |
| Oozie | Schedules and manages Hadoop workflows |
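
The MapReduce idea — map emits key-value pairs, a shuffle groups them by key, reduce aggregates each group — can be shown as a minimal sketch in plain Python. This is not Hadoop code; on a real cluster each phase runs in parallel across many nodes:

```python
from collections import defaultdict

# Toy MapReduce: word count over two tiny "documents".
docs = ["big data big value", "data drives value"]

# Map phase: emit a (key, 1) pair for each word.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle phase: group values by key (Hadoop does this between phases).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: aggregate each key's list of values.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'value': 2, 'drives': 1}
```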

🔸 2. Apache Spark

  • Faster alternative to Hadoop’s MapReduce.
  • Performs real-time analytics using in-memory processing.
  • Used for: Fraud detection, credit scoring, sentiment analysis.

🔸 3. NoSQL Databases

  • Handle unstructured and semi-structured data.
  • Examples: MongoDB, Cassandra, CouchDB, HBase

| Feature | RDBMS | NoSQL |
| --- | --- | --- |
| Structure | Tables & rows | Key-value or document format |
| Schema | Fixed | Flexible |
| Scalability | Vertical (add hardware) | Horizontal (add servers) |
| Example | Oracle, MySQL | MongoDB, HBase |
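
The "flexible schema" row is the key difference, and plain Python dicts can illustrate it. In a document store like MongoDB each record is a document that may carry different fields; the sample records here are made up:

```python
# NoSQL document stores (e.g. MongoDB) have no fixed schema:
# each record can carry a different set of fields.
customers = [
    {"id": 1, "name": "Asha", "upi_handle": "asha@upi"},
    {"id": 2, "name": "Ravi", "cards": ["credit", "debit"], "city": "Pune"},
]

# A relational table would need every column declared up front.
# Here missing fields are simply handled at read time (schema-on-read).
for doc in customers:
    print(doc["name"], doc.get("city", "unknown city"))
```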

🔸 4. Real-Time & Stream Processing Tools

| Tool | Use |
| --- | --- |
| Kafka | Real-time data streaming |
| Flink / Storm | Stream data analytics |
| Spark Streaming | Real-time event analysis (e.g., UPI transactions) |
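
What all these tools share is processing each event as it arrives rather than waiting for a complete batch. A minimal sketch, with a Python generator standing in for a Kafka topic or Spark Streaming source (amounts and the alert threshold are illustrative):

```python
# Stream processing sketch: handle events one by one as they arrive.
def upi_stream():
    for amount in [250, 900, 15000, 120]:
        yield amount  # in production, events keep arriving continuously

running_total = 0
alerts = []
for amount in upi_stream():
    running_total += amount
    if amount > 10000:        # simple real-time rule on each event
        alerts.append(amount)

print(running_total, alerts)  # 16270 [15000]
```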

💾 5. Types of Data in Big Data

| Type | Description | Example |
| --- | --- | --- |
| Structured | Organized in rows and columns | Core banking data |
| Semi-Structured | Not fixed format, has tags | XML, JSON |
| Unstructured | Raw, messy data | Images, videos, emails |

📈 6. Big Data Analytics Types

| Type | Answers | Example in Banking |
| --- | --- | --- |
| Descriptive Analytics | What happened? | Monthly transaction reports |
| Diagnostic Analytics | Why did it happen? | Fraud reason analysis |
| Predictive Analytics | What will happen? | Predicting loan default risk |
| Prescriptive Analytics | What should be done? | Recommending loan terms |
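
The difference between the analytics types can be made concrete with a toy calculation (the monthly figures and the naive forecast rule are purely illustrative, not a real forecasting method):

```python
# Toy monthly transaction counts for one branch.
monthly_txns = [100, 120, 90, 150]

# Descriptive: what happened? -> summarize the past.
average = sum(monthly_txns) / len(monthly_txns)

# Predictive: what will happen? -> naive forecast from the last 2 months.
forecast = sum(monthly_txns[-2:]) / 2

# Prescriptive: what should be done? -> a rule acting on the forecast.
action = "scale up servers" if forecast > average else "no change"

print(average, forecast, action)
```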

🏦 7. Big Data in Banking & Financial Sector

| Use Case | Description |
| --- | --- |
| Fraud Detection | Detect unusual or suspicious activity in real time using AI + Big Data |
| Customer Segmentation | Group customers by behavior for marketing |
| Credit Scoring | Include non-traditional data (like digital behavior) for loan risk analysis |
| Regulatory Compliance | Maintain audit trails, data lineage, KYC data |
| Risk Management | Predict potential NPAs and market risks |
| Personalized Offers | Recommend credit cards, loans based on customer history |
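
A simplified sketch of the fraud-detection idea: flag a transaction that sits far from a customer's usual spending pattern using a z-score rule. Real systems use ML models over far richer features; the spending history and the threshold of 3 standard deviations here are illustrative assumptions:

```python
import statistics

# A customer's usual spend amounts (toy data).
history = [450, 500, 520, 480, 510, 470]
mean = statistics.mean(history)
stdev = statistics.stdev(history)

def is_suspicious(amount, threshold=3.0):
    """Flag amounts more than `threshold` std deviations from the mean."""
    return abs(amount - mean) / stdev > threshold

print(is_suspicious(495))    # close to normal behaviour
print(is_suspicious(25000))  # far outside the usual pattern
```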

🏦 Example in Banking Context:

Banks such as SBI and HDFC use Big Data to track spending patterns and flag fraudulent credit card transactions instantly using real-time analytics.


🔐 8. Big Data Security & Governance

| Term | Meaning |
| --- | --- |
| Data Governance | Framework for managing data quality, access, and usage |
| Data Lineage | Tracking where data comes from and how it changes |
| Data Privacy | Protecting customer data (RBI and GDPR compliance) |
| Encryption | Securing data at rest and in transit |
| Anonymization | Hiding personal identity from datasets |
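
Anonymization can be sketched by replacing a direct identifier with a one-way hash, so records remain joinable for analytics without exposing the customer. The PAN value below is fake, and production systems would add salting/tokenization and follow RBI/GDPR requirements rather than use this bare sketch:

```python
import hashlib

def anonymize(pan_number: str) -> str:
    """Replace an identifier with an opaque, irreversible key."""
    return hashlib.sha256(pan_number.encode()).hexdigest()[:12]

record = {"pan": "ABCDE1234F", "balance": 52000}
safe_record = {
    "customer_key": anonymize(record["pan"]),  # PAN never leaves as-is
    "balance": record["balance"],
}
print(safe_record)
```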

☁️ 9. Big Data and Cloud Computing

Big Data is often stored and processed on the cloud for scalability and cost savings.

| Provider | Big Data Service |
| --- | --- |
| AWS | EMR, Redshift, S3 |
| Azure | HDInsight, Synapse |
| Google Cloud | BigQuery, Dataproc |

🇮🇳 10. Big Data in Indian Context

| Initiative | Description |
| --- | --- |
| NITI Aayog – National Strategy for AI (“AI for All”) | Promotes Big Data + AI in governance |
| NDAP (National Data and Analytics Platform) | Unified public data portal |
| MeghRaj (GI Cloud) | Government cloud project for data storage |
| UIDAI (Aadhaar) | One of the world’s largest Big Data projects |
| RBI IT Framework (2017) | Data governance and analytics guidelines for banks |

⚖️ 11. Advantages & Challenges

| Advantages | Challenges |
| --- | --- |
| Better decision-making | Data privacy concerns |
| Real-time insights | High infrastructure cost |
| Detect fraud quickly | Shortage of skilled professionals |
| Customer personalization | Data integration from multiple sources |

💡 12. Real-World Examples

| Institution | Big Data Use |
| --- | --- |
| RBI | Data analytics for fraud & risk reporting |
| SBI | Customer analytics and real-time fraud detection |
| HDFC Bank | “EVA” chatbot powered by AI + Big Data |
| SEBI | Market surveillance for insider trading |
| NABARD | Credit and rural data analytics for decision-making |

🧾 13. Quick Exam Pointers

Remember:

  • 5 Vs: Volume, Velocity, Variety, Veracity, Value
  • Hadoop = HDFS + MapReduce + YARN
  • Hive = SQL-like query tool
  • HBase = NoSQL database
  • Spark = Real-time analytics
  • Sqoop = Transfer data between RDBMS and Hadoop
  • Flume = Collect log data
  • Oozie = Scheduler
  • Kafka = Real-time streaming

Banking Uses:
Fraud detection, AML, compliance, risk scoring, credit monitoring.

Govt Initiatives:
NITI Aayog (AI for All), NDAP, MeghRaj.


🧭 14. Summary Chart

| Topic | Key Idea |
| --- | --- |
| Definition | Handling very large and complex datasets |
| 5 Vs | Volume, Velocity, Variety, Veracity, Value |
| Core Technology | Hadoop Ecosystem |
| Key Tool | Apache Spark |
| Banking Use | Fraud detection, risk analytics |
| Security | Data encryption, governance, privacy |
| Govt Initiative | NDAP, NITI Aayog “AI for All” |

🧠 Section 1: Basics of Big Data

  1. What is Big Data?
    A. Large amounts of structured and unstructured data generated daily
    B. Only data stored in Excel files
    C. Small datasets analyzed manually
    D. Data stored in floppy disks
    Answer: A
  2. Which of the following is NOT a characteristic of Big Data?
    A. Volume
    B. Velocity
    C. Variety
    D. Visibility
    Answer: D
  3. The term “3 Vs” in Big Data stands for:
    A. Volume, Velocity, Variety
    B. Virtualization, Visualization, Value
    C. Volume, Value, Visualization
    D. Version, Volume, Validation
    Answer: A
  4. Which two additional Vs are often added to the original 3Vs?
    A. Veracity and Value
    B. Validity and Volume
    C. Vision and Variety
    D. Velocity and Vacuum
    Answer: A
  5. What does ‘Volume’ represent in Big Data?
    A. Size or amount of data generated
    B. Type of data
    C. Speed of data
    D. Accuracy of data
    Answer: A
  6. ‘Velocity’ in Big Data refers to:
    A. Speed at which data is generated, processed and analyzed
    B. The variety of data formats
    C. The accuracy of data
    D. The cost of storage
    Answer: A
  7. ‘Variety’ refers to:
    A. Different forms of data — structured, semi-structured, unstructured
    B. Only structured data
    C. Data duplication
    D. Data redundancy
    Answer: A
  8. ‘Veracity’ means:
    A. Accuracy and trustworthiness of data
    B. Size of data
    C. Type of data
    D. None
    Answer: A
  9. ‘Value’ in Big Data means:
    A. Economic or business benefit derived from data
    B. Random number of bytes
    C. File system name
    D. Encryption code
    Answer: A
  10. Which statement about Big Data is TRUE?
    A. It cannot be stored in traditional systems efficiently
    B. It always comes from one source
    C. It is mostly static
    D. It does not require analytics
    Answer: A

⚙️ Section 2: Big Data Components & Architecture

  1. Which is the most popular open-source Big Data framework?
    A. Hadoop
    B. Oracle
    C. Access
    D. MySQL
    Answer: A
  2. Which language is Hadoop primarily written in?
    A. Java
    B. Python
    C. C++
    D. Go
    Answer: A
  3. HDFS stands for:
    A. Hadoop Distributed File System
    B. High Data File Storage
    C. High Definition File Server
    D. Hadoop Data File Set
    Answer: A
  4. The two main components of Hadoop are:
    A. HDFS and MapReduce
    B. Hive and Pig
    C. Spark and Kafka
    D. SQL and NoSQL
    Answer: A
  5. NameNode in Hadoop is responsible for:
    A. Storing metadata and directory tree of files
    B. Processing user data
    C. Executing MapReduce tasks
    D. Compressing data
    Answer: A
  6. DataNode in Hadoop:
    A. Stores actual data blocks
    B. Stores metadata
    C. Controls access rights
    D. Runs NameNode
    Answer: A
  7. MapReduce is used for:
    A. Parallel processing of data across distributed nodes
    B. Data visualization
    C. File encryption
    D. Network configuration
    Answer: A
  8. Map phase in MapReduce does:
    A. Splits and processes data
    B. Aggregates output
    C. Deletes logs
    D. Encrypts data
    Answer: A
  9. Reduce phase in MapReduce:
    A. Aggregates intermediate outputs and produces results
    B. Splits data
    C. Encrypts input
    D. None
    Answer: A
  10. YARN stands for:
    A. Yet Another Resource Negotiator
    B. Yearly Analysis Resource Node
    C. Yield Aggregation Random Network
    D. None
    Answer: A

💾 Section 3: Big Data Technologies

  1. Which of the following is NOT part of the Hadoop ecosystem?
    A. HDFS
    B. MapReduce
    C. Cassandra
    D. Hive
    Answer: C
  2. Which Hadoop component provides SQL-like queries?
    A. Hive
    B. Pig
    C. HBase
    D. Mahout
    Answer: A
  3. Apache Pig is used for:
    A. Data flow scripting and ETL (Extract, Transform, Load)
    B. Image processing
    C. File transfer
    D. Compression
    Answer: A
  4. Which Hadoop component is a NoSQL database?
    A. HBase
    B. Hive
    C. Sqoop
    D. Flume
    Answer: A
  5. Apache Sqoop is used for:
    A. Transferring data between Hadoop and RDBMS
    B. Data visualization
    C. System security
    D. Job scheduling
    Answer: A
  6. Apache Flume is used for:
    A. Collecting and moving streaming log data to HDFS
    B. Storing images
    C. Email filtering
    D. File encryption
    Answer: A
  7. Apache Oozie is:
    A. Workflow scheduler for Hadoop jobs
    B. Visualization tool
    C. Data encryption library
    D. Data cleaner
    Answer: A – Apache Oozie is a workflow scheduler for Hadoop that helps you run, manage, and automate big data jobs in a sequence. In simple words: It organizes and schedules your Hadoop tasks so they run in the right order.
  8. Mahout in Hadoop is used for:
    A. Machine Learning
    B. File Transfer
    C. Data Security
    D. Encryption
    Answer: A – Apache Mahout is an open-source library that provides machine learning algorithms (like clustering, classification, and recommendations) that can run on big data systems. In simple words: Mahout helps build ML models that work with large amounts of data.
  9. Which of the following is an in-memory Big Data processing engine?
    A. Apache Spark
    B. Hive
    C. Flume
    D. Oozie
    Answer: A – Apache Spark is a fast, open-source big data processing engine used for analyzing large datasets quickly. In simple words: Spark processes big data super fast, much faster than Hadoop MapReduce.
  10. Spark is written in which language?
    A. Scala
    B. Python
    C. Java
    D. All of the above
    Answer: A – Spark itself is written mainly in Scala, though it provides APIs for Scala, Java, Python, and R.

☁️ Section 4: Databases & Storage

  1. NoSQL databases are designed for:
    A. Non-relational, unstructured data storage
    B. Only structured data
    C. Relational joins
    D. Small datasets only
    Answer: A
  2. Which of the following is a NoSQL database?
    A. MongoDB
    B. Oracle
    C. MySQL
    D. PostgreSQL
    Answer: A
  3. CAP theorem states:
    A. Consistency, Availability, Partition Tolerance
    B. Capacity, Accessibility, Processing
    C. Clustering, Aggregation, Partitioning
    D. Cache, API, Protocol
    Answer: A – A distributed system cannot provide Consistency, Availability, and Partition tolerance all together—only any two at a time.
  4. In Big Data, data stored in HDFS is split into:
    A. Blocks
    B. Tables
    C. Arrays
    D. Streams
    Answer: A
  5. Default block size in Hadoop 2.x is:
    A. 64 MB
    B. 128 MB
    C. 512 MB
    D. 1 GB
    Answer: B
  6. Which system provides column-oriented storage?
    A. HBase
    B. Hive
    C. Flume
    D. Sqoop
    Answer: A – HBase is a distributed, NoSQL database built on top of Hadoop that stores large amounts of data in tables with rows and columns.
  7. Which database supports high write throughput and scalability?
    A. Cassandra
    B. MySQL
    C. Oracle
    D. MS Access
    Answer: A – Apache Cassandra is a highly scalable, distributed NoSQL database designed to handle large amounts of data across many servers without downtime.
  8. Which Big Data tool is ideal for real-time analytics?
    A. Apache Spark Streaming
    B. MapReduce
    C. Sqoop
    D. Oozie
    Answer: A – Apache Spark Streaming is a component of Spark used to process real-time data streams (like logs, sensor data, or live events).
  9. Which Big Data storage solution is often used by cloud providers like AWS?
    A. S3 (Simple Storage Service)
    B. HDD
    C. SSD
    D. Pendrive
    Answer: A
  10. Which Big Data format is used for efficient columnar storage?
    A. Parquet
    B. CSV
    C. JSON
    D. TXT
    Answer: A – Columnar storage is a way of storing data column-wise instead of row-wise.
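
The row-vs-columnar difference behind Parquet can be shown with a small sketch (the records are made up). Storing values column-wise lets a query read only the columns it needs instead of scanning every full record:

```python
# Row-oriented layout, like CSV: each record is kept together.
rows = [
    {"account": "A1", "amount": 500, "city": "Delhi"},
    {"account": "A2", "amount": 1200, "city": "Mumbai"},
    {"account": "A3", "amount": 300, "city": "Delhi"},
]

# Columnar layout, like Parquet: one contiguous list per column.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Summing 'amount' now touches a single column, not whole records.
print(sum(columns["amount"]))  # 2000
```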

🧮 Section 5: Data Analytics & Tools

  1. What does ETL stand for?
    A. Extract, Transform, Load
    B. Evaluate, Test, Learn
    C. Encode, Translate, Load
    D. Encrypt, Transfer, Log
    Answer: A
  2. Data Lake is:
    A. Centralized storage of raw data in its native format
    B. Traditional data warehouse
    C. Temporary cache
    D. File backup
    Answer: A – A Data Lake is a large storage system that holds raw data in any format—structured or unstructured. It’s a big storage pool where you dump all types of data without needing to organize it first.
  3. Data Warehouse stores:
    A. Processed and structured data
    B. Raw unprocessed data
    C. Only images
    D. Logs only
    Answer: A
  4. Which tool is widely used for data visualization?
    A. Tableau
    B. Flume
    C. Sqoop
    D. Oozie
    Answer: A – Tableau is a data visualization tool used to create interactive charts, dashboards, and reports.
  5. Which programming language is most widely used in Big Data Analytics?
    A. Python
    B. PHP
    C. COBOL
    D. Fortran
    Answer: A
  6. R language is mainly used for:
    A. Statistical and data analysis
    B. Image rendering
    C. File compression
    D. None
    Answer: A
  7. Which tool provides interactive dashboards for Big Data?
    A. Power BI
    B. Spark
    C. Pig
    D. Sqoop
    Answer: A – Power BI is Microsoft’s data visualization and business analytics tool that helps create interactive reports and dashboards.
  8. Apache Kafka is used for:
    A. Real-time data streaming and messaging
    B. File storage
    C. Scheduling
    D. Data cleaning
    Answer: A
  9. Which is an example of batch processing?
    A. Hadoop MapReduce
    B. Spark Streaming
    C. Kafka
    D. Flink
    Answer: A – Hadoop MapReduce is a programming model used to process large datasets in parallel across many computers.
  10. Which is an example of real-time stream processing?
    A. Spark Streaming
    B. Pig
    C. Hive
    D. Oozie
    Answer: A – Spark Streaming is a part of Apache Spark that processes real-time data as it arrives.

💡 Section 6: Big Data in Banking & Governance

  1. Which of the following is NOT a use case of Big Data in banking?
    A. Credit risk analysis
    B. Customer churn prediction
    C. Data-driven lending decisions
    D. Manual bookkeeping
    Answer: D
  2. Banks use Big Data primarily for:
    A. Fraud detection and risk management
    B. ATM cash loading
    C. Locker assignment
    D. Branch painting
    Answer: A
  3. Which Big Data system can detect fraudulent transactions in real-time?
    A. Spark Streaming + ML model
    B. Excel pivot table
    C. Manual entry system
    D. None
    Answer: A
  4. Customer segmentation is done using:
    A. Clustering algorithms
    B. Linear regression only
    C. File splitting
    D. Encryption
    Answer: A
  5. Big Data supports compliance by:
    A. Maintaining detailed logs and audit trails
    B. Reducing records
    C. Ignoring regulations
    D. Storing only emails
    Answer: A
  6. In credit scoring, Big Data can include:
    A. Social media and digital behavior data
    B. Only balance sheets
    C. Paper forms
    D. None
    Answer: A
  7. Which Indian bank uses Big Data for cross-selling and risk analytics?
    A. SBI
    B. HDFC
    C. ICICI
    D. All of these
    Answer: D
  8. Which global regulation affects data management in banking?
    A. GDPR (General Data Protection Regulation)
    B. NATO
    C. Basel II
    D. KYC
    Answer: A
  9. RBI encourages Big Data use for:
    A. Fraud analytics, AML, customer behavior prediction
    B. Manual auditing
    C. Teller operations
    D. None
    Answer: A
  10. Big Data helps regulators like SEBI in:
    A. Detecting insider trading using transaction analytics
    B. Manual monitoring
    C. File archiving only
    D. None
    Answer: A

🔐 Section 7: Security, Privacy & Challenges

  1. Big Data security involves protecting:
    A. Data at rest, in motion, and in use
    B. Only physical systems
    C. Only small datasets
    D. None
    Answer: A
  2. Data encryption ensures:
    A. Confidentiality of sensitive data
    B. Faster processing
    C. Data duplication
    D. None
    Answer: A
  3. Main challenge in Big Data analytics:
    A. Data integration from multiple sources
    B. Low storage
    C. Static datasets only
    D. None
    Answer: A
  4. Data governance ensures:
    A. Quality, integrity, and security of data assets
    B. More data duplication
    C. Less control
    D. None
    Answer: A
  5. Which process removes duplicate or incorrect data?
    A. Data cleansing
    B. Data encryption
    C. Data generation
    D. Data replication
    Answer: A
  6. In Big Data, data lineage means:
    A. Tracking the origin and movement of data
    B. Data encryption
    C. Data compression
    D. None
    Answer: A
  7. Which of the following ensures regulatory compliance?
    A. Data governance policies
    B. Random sampling
    C. Data deletion
    D. Data duplication
    Answer: A
  8. One major privacy issue with Big Data is:
    A. Unauthorized profiling and surveillance
    B. Faster results
    C. Lower storage
    D. None
    Answer: A
  9. In banks, Big Data platforms must follow:
    A. RBI’s IT Framework for NBFCs & Banks (2017)
    B. No regulation
    C. Telecom Act
    D. None
    Answer: A
  10. Which Big Data principle supports ethical AI models?
    A. Transparency & Fairness
    B. Secrecy & Isolation
    C. Bias & Speed
    D. None
    Answer: A

🧩 Section 8: Cloud & Big Data Integration

  1. Which cloud deployment models can be used for Big Data processing?
    A. Hybrid Cloud
    B. Public Cloud
    C. Private Cloud
    D. All of the above
    Answer: D
  2. Which cloud service provides on-demand data storage for analytics?
    A. AWS S3
    B. Google Sheets
    C. OneDrive Basic
    D. Excel
    Answer: A
  3. Which Google service handles Big Data queries?
    A. BigQuery
    B. Gmail
    C. Google Docs
    D. GDrive
    Answer: A
  4. Which Microsoft service is used for Big Data analytics?
    A. Azure Synapse Analytics
    B. Outlook
    C. Excel only
    D. MS Paint
    Answer: A
  5. Which AWS service provides distributed data warehousing?
    A. Amazon Redshift
    B. CloudWatch
    C. Lambda
    D. SNS
    Answer: A – Amazon Redshift is a fully managed cloud data warehouse service by AWS used for fast analytics on large datasets.

📈 Section 9: Analytics & Business Intelligence

  1. Descriptive analytics means:
    A. What happened
    B. Why it happened
    C. What will happen
    D. What should be done
    Answer: A
  2. Diagnostic analytics focuses on:
    A. Why it happened
    B. What happened
    C. Predicting future events
    D. None
    Answer: A
  3. Predictive analytics answers:
    A. What will happen next
    B. Why it happened
    C. None
    D. Who caused it
    Answer: A
  4. Prescriptive analytics provides:
    A. Suggested actions to take based on data
    B. Only reporting
    C. Historical summaries
    D. None
    Answer: A
  5. Real-time analytics means:
    A. Immediate analysis of live streaming data
    B. Manual report generation
    C. Batch processing only
    D. None
    Answer: A

🌍 Section 10: Emerging Trends and Government Initiatives

  1. India’s National Data and Analytics Platform (NDAP) is launched by:
    A. NITI Aayog
    B. RBI
    C. SEBI
    D. SBI
    Answer: A
  2. The National Digital Communications Policy supports:
    A. Data-driven innovation and Big Data analytics
    B. Only manual processes
    C. None
    D. Hardware assembly
    Answer: A
  3. Which technology is converging with Big Data for faster insights?
    A. Artificial Intelligence (AI)
    B. Blockchain
    C. IoT
    D. All of the above
    Answer: D
  4. IoT generates Big Data mainly from:
    A. Connected sensors and devices
    B. Only mobile apps
    C. Human typing
    D. Paper forms
    Answer: A
  5. Which of the following combines AI + Big Data + IoT?
    A. Smart Cities
    B. Gaming
    C. Manual billing
    D. None
    Answer: A
  6. Which initiative aims to build data centers and cloud infra in India?
    A. MeghRaj (GI Cloud)
    B. UIDAI
    C. DigiLocker
    D. MyGov
    Answer: A
  7. Which term describes converting large unstructured datasets into meaningful patterns?
    A. Data Mining
    B. Data Cleaning
    C. Data Encryption
    D. Data Segmentation
    Answer: A
  8. ETL is part of:
    A. Data Integration Process
    B. Data Destruction
    C. Data Visualization
    D. File Compression
    Answer: A
  9. Which open-source platform is often used with Python for Big Data analytics?
    A. Jupyter Notebook
    B. PowerPoint
    C. Photoshop
    D. Excel
    Answer: A
  10. Which term means discovering hidden patterns in large datasets?
    A. Data Mining
    B. Data Hiding
    C. Data Scrubbing
    D. Data Compression
    Answer: A

🏁 Section 11: Miscellaneous / Advanced

  1. The main goal of Big Data Analytics is:
    A. Extract actionable insights for better decisions
    B. Store unused data
    C. Create random reports
    D. Delete old files
    Answer: A
  2. Which technology layer processes Big Data in-memory for speed?
    A. Spark
    B. Pig
    C. Hive
    D. Sqoop
    Answer: A
  3. Which company originally developed Hadoop?
    A. Yahoo
    B. Google
    C. IBM
    D. Microsoft
    Answer: A
  4. Google File System (GFS) inspired:
    A. Hadoop Distributed File System (HDFS)
    B. Hive
    C. Spark
    D. Flume
    Answer: A
  5. The “Map” function in MapReduce:
    A. Transforms input data into key-value pairs
    B. Aggregates data
    C. Deletes records
    D. Stores metadata
    Answer: A
  6. The “Reduce” function:
    A. Summarizes intermediate key-value pairs
    B. Splits data
    C. Encrypts data
    D. Compresses output
    Answer: A
  7. In Big Data, a ‘cluster’ means:
    A. Group of connected servers/nodes working together
    B. A database column
    C. A single server
    D. File compression tool
    Answer: A
  8. Which term refers to analyzing data as it arrives?
    A. Stream processing
    B. Batch processing
    C. Archiving
    D. Logging
    Answer: A
  9. Which of these is NOT a Big Data challenge?
    A. Data Quality
    B. Data Volume
    C. Scalability
    D. Manual labor shortage
    Answer: D
  10. Which of the following describes “Big Data Analytics”?
    A. Process of examining large datasets to uncover hidden patterns and insights
    B. Deleting historical data
    C. Manual tallying
    D. Printing reports
    Answer: A

Big Data — MCQs (60 questions)

  1. What is “Big Data”?
    A. Data stored in a single Excel sheet only
    B. Extremely large volumes of data (structured/unstructured) that traditional systems cannot process efficiently
    C. Only video files
    D. Data stored on paper
    Answer: B
  2. Which of the following is not one of the “3 Vs” of Big Data (classic definition)?
    A. Volume
    B. Velocity
    C. Variety
    D. Validity
    Answer: D
  3. Many experts now add 2 more Vs to Big Data, making it “5 Vs”. Which are they?
    A. Value and Veracity
    B. Visualization and Verification
    C. Vacuum and Variation
    D. Variability and Velocity (again)
    Answer: A
  4. “Velocity” in Big Data refers to:
    A. The speed at which data is generated and processed
    B. The size of the data
    C. The type of data only
    D. The accuracy of data
    Answer: A
  5. Which of the following is a major challenge of Big Data?
    A. Low data volume
    B. Lack of tools for real-time processing
    C. Only structured data exists
    D. No use case
    Answer: B
  6. Which technology is commonly used for distributed processing of Big Data?
    A. Excel
    B. Hadoop MapReduce framework
    C. Basic desktop database
    D. Paper ledger
    Answer: B
  7. What is the role of Hadoop’s HDFS (Hadoop Distributed File System)?
    A. Store small data only
    B. Store large amounts of data across many machines (distributed storage)
    C. Only process data
    D. None of the above
    Answer: B
  8. “NoSQL” databases are often associated with Big Data because:
    A. They support rigid schema only
    B. They handle large volumes, variety, and horizontal scalability better than many RDBMS
    C. They cannot scale
    D. They only support text data
    Answer: B
  9. Which of the following is a NoSQL database commonly used for Big Data?
    A. MySQL only
    B. MongoDB, Cassandra
    C. Excel only
    D. MS Access
    Answer: B
  10. What is “data lake”?
    A. A small Excel file
    B. A large repository that stores raw data (structured/unstructured) in original form for later processing
    C. A temporary folder only
    D. Only for paper files
    Answer: B
  11. Which of the following is not a Big Data analytics type?
    A. Descriptive analytics
    B. Predictive analytics
    C. Prescriptive analytics
    D. Manual ledger entries with no analytics
    Answer: D
  12. In the context of Big Data analytics, what is “predictive analytics”?
    A. Predicting future trends using statistical and machine-learning models
    B. Just describing past data
    C. Only storing data
    D. Printing data
    Answer: A
  13. Why is Big Data important for banks?
    A. For storing paper files only
    B. For improving risk scoring, fraud detection, customer insights and operations
    C. Only for cash counting
    D. None of the above
    Answer: B
  14. Which Big Data use case is relevant in banking?
    A. Real-time transaction monitoring for fraud
    B. Chatbots only
    C. Only branch expansion
    D. Paper archival only
    Answer: A
  15. What is “streaming data processing”?
    A. Processing batches overnight only
    B. Processing data continuously in real-time as it arrives (e.g., high-speed transactions, sensors)
    C. No processing
    D. Only manual reports monthly
    Answer: B
  16. Which framework supports real-time stream processing in Big Data?
    A. Hadoop MapReduce only
    B. Apache Spark Streaming, Flink
    C. MS Word
    D. None
    Answer: B
  17. What is “ETL” in data warehousing / Big Data context?
    A. Extract-Transform-Load: process of moving data from sources to analytics systems
    B. Editor-Type-Loop
    C. Electron-Transmission-Link
    D. None
    Answer: A
  18. Which of the following is a benefit of Big Data for financial inclusion?
    A. Only for large enterprises
    B. Better credit scoring using alternate data from mobile, social networks
    C. No benefit
    D. Only for hardware costs
    Answer: B
  19. What is “data veracity”?
    A. Speed of data only
    B. The trustworthiness, quality and accuracy of the data
    C. The size of data only
    D. The location of data only
    Answer: B
  20. What does “horizontal scalability” in Big Data mean?
    A. Increasing size of a single machine
    B. Adding more machines (nodes) to handle more data/workload
    C. Reducing machine count only
    D. None
    Answer: B
  21. “Partitioning” in Big Data storage means:
    A. Splitting large dataset into smaller pieces stored across machines for parallel access
    B. Merging small files only
    C. Deleting data
    D. None
    Answer: A
  22. Which one is a characteristic of Big Data platforms in banks?
    A. Only relate to paper data
    B. Ability to handle high-volume, high-velocity data, and ensure compliance & security
    C. No security needed
    D. Only offline use
    Answer: B
  23. What is “in-memory computing” in Big Data context?
    A. Storing and processing data in RAM for faster operations rather than disk
    B. Using pen and paper
    C. Only offline batch
    D. No processing
    Answer: A
  24. Which of the following tools is used for Big Data visualization?
    A. Tableau, Power BI
    B. Notebooks only
    C. Paper charts only
    D. None
    Answer: A
  25. What is “data governance” in Big Data?
    A. Only storing data with no rules
    B. Framework of policies, roles, processes to ensure data quality, privacy, compliance and usage
    C. Ignoring data rules
    D. No audits
    Answer: B
  26. Which of the following is a security concern for Big Data in banks?
    A. No concern
    B. Data breaches, unauthorized access, ensuring encryption at rest/in transit, masking sensitive customer data
    C. Only hardware theft
    D. None
    Answer: B
  27. What does “Hadoop YARN” do?
    A. Only stores data
    B. Manages resources/scheduling of tasks across the cluster
    C. Only processes images
    D. None
    Answer: B
  28. What is “Spark” in Big Data?
    A. A fireworks brand
    B. An engine for fast, in-memory processing of large datasets (batch + stream)
    C. A database only
    D. None
    Answer: B
  29. What is “MapReduce”?
    A. A procedure to map functions only in spreadsheets
    B. Programming model: map (process) and reduce (aggregate) functions for large data sets in parallel
    C. Only for graphics
    D. None
    Answer: B
  30. What is “HBase”?
    A. A relational DB
    B. A NoSQL distributed database built on HDFS for Big Data
    C. Only a file system
    D. None
    Answer: B
  31. When large data arrives very fast and must be processed within seconds, this is termed:
    A. Batch processing
    B. Real-time or near-real-time processing
    C. Offline archiving
    D. None
    Answer: B
  32. What is “data lakehouse”?
    A. A traditional lake only
    B. Modern architecture combining data lake (raw storage) + data warehouse (governed structure)
    C. Only archive files
    D. None
    Answer: B
  33. Which of the following is not a Big Data architecture component?
    A. Ingestion layer
    B. Storage layer
    C. Processing/analytics layer
    D. Typewriter layer
    Answer: D
  34. What is “schema-on-read” compared to “schema-on-write”?
    A. Schema-on-write: structure defined when writing data; schema-on-read: structure defined when reading/querying data
    B. The same thing
    C. Only for manual data
    D. None
    Answer: A
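The schema-on-read idea in option A — store raw data as-is, impose structure only at query time — can be shown with a minimal sketch (the JSON records and field names below are hypothetical):

```python
import json

# Raw records are stored as-is (schema-on-read): nothing is enforced at write time
raw_lines = [
    '{"txn_id": 1, "amount": "250.00", "channel": "UPI"}',
    '{"txn_id": 2, "amount": "90.50"}',  # "channel" is missing — the write still succeeds
]

# The schema is applied only when reading/querying
def read_with_schema(line):
    rec = json.loads(line)
    return {
        "txn_id": int(rec["txn_id"]),
        "amount": float(rec["amount"]),
        "channel": rec.get("channel", "UNKNOWN"),  # gaps are filled at read time
    }

parsed = [read_with_schema(line) for line in raw_lines]
print(parsed[1]["channel"])  # UNKNOWN
```

Under schema-on-write (a traditional RDBMS), the second record would have been rejected or required a default at insert time instead.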
  35. In banking, “alternate data” used for credit-scoring in Big Data means:
    A. Only old paper records
    B. Non-traditional data like mobile behaviour, social network, utility payments which help assess creditworthiness
    C. No data at all
    D. Only cash payments
    Answer: B
  36. What is “metadata” in Big Data?
    A. Data about data (e.g., source, format, time stamp)
    B. Only raw numbers
    C. Only images
    D. None
    Answer: A
  37. What is a “data swamp”?
    A. A clean data store only
    B. A poorly managed data lake full of ungoverned/raw/unused data that degrades value
    C. Small datasets
    D. None
    Answer: B
  38. Which is a Big Data storage best practice?
    A. No backup
    B. Archival of cold data, tiered storage, appropriate indexing, encryption and governance
    C. Keep everything in one massive file only
    D. None
    Answer: B
  39. What is “feature extraction” in Big Data analytics/machine learning?
    A. Manual filing
    B. Deriving meaningful features from raw data to feed into models
    C. Deleting data fields
    D. None
    Answer: B
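Option B — deriving model-ready features from raw data — can be illustrated with a small sketch. The transaction tuples and feature names below are hypothetical examples of what a fraud or credit model might consume:

```python
from statistics import mean

# Hypothetical raw transactions for one customer: (amount, hour_of_day)
transactions = [(500, 10), (1200, 23), (300, 11), (2500, 2), (400, 12)]

# Feature extraction: turn raw records into numeric features for a model
features = {
    "txn_count": len(transactions),
    "avg_amount": mean(a for a, _ in transactions),
    "max_amount": max(a for a, _ in transactions),
    # share of transactions at unusual hours (before 6 am or after 10 pm)
    "night_txn_ratio": sum(1 for _, h in transactions if h < 6 or h >= 22)
                       / len(transactions),
}
print(features)
```

The model never sees the raw tuples — only the derived features, which capture behaviour (spend level, timing patterns) in a form algorithms can use.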
  40. What is “data anonymisation”?
    A. Revealing all customer identifiers
    B. Removing or masking personally identifying information so that privacy is protected
    C. Publishing names only
    D. None
    Answer: B
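The removing/masking idea in option B can be sketched as follows — a minimal illustration, with a hypothetical customer record; real anonymisation programmes also consider re-identification risk across combined fields:

```python
import hashlib

# Hypothetical customer record with personally identifying fields
record = {"name": "A. Kumar", "account_no": "1234567890", "balance": 52000}

def anonymise(rec):
    out = dict(rec)
    out.pop("name")  # remove the direct identifier outright
    # replace the account number with a one-way hash (pseudonymisation),
    # so records can still be linked without exposing the real number
    out["account_no"] = hashlib.sha256(rec["account_no"].encode()).hexdigest()[:12]
    out["balance"] = round(rec["balance"], -4)  # coarsen a quasi-identifier
    return out

anon = anonymise(record)
print("name" in anon, anon["account_no"] == record["account_no"])  # False False
```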
  41. Which of the following best describes “data mart”?
    A. A small marketplace
    B. Subset of a data warehouse focused on one business area/department
    C. Entire bank data store
    D. None
    Answer: B
  42. What is “Hadoop Hive”?
    A. Spreadsheet software
    B. Data-warehouse tool on Hadoop allowing SQL-like queries on big data
    C. A web browser only
    D. None
    Answer: B
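The “SQL-like queries on big data” flavour of option B can be illustrated by analogy with Python's built-in `sqlite3` — this is only an analogy with made-up data: Hive itself runs HiveQL over files on a Hadoop cluster, not over SQLite:

```python
import sqlite3

# A Hive-style aggregation, illustrated on an in-memory SQL engine
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txns (branch TEXT, amount REAL)")
conn.executemany("INSERT INTO txns VALUES (?, ?)",
                 [("DEL", 500), ("MUM", 900), ("DEL", 200)])

# In Hive this same query would be compiled into MapReduce/Spark jobs
rows = conn.execute(
    "SELECT branch, SUM(amount) FROM txns GROUP BY branch ORDER BY branch"
).fetchall()
print(rows)  # [('DEL', 700.0), ('MUM', 900.0)]
```

The point of Hive is exactly this: analysts write familiar SQL, and the engine translates it into distributed jobs over HDFS data.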
  43. Which of the following is NOT a Big Data value driver for banks?
    A. Improved customer insights
    B. Faster risk decisioning
    C. Real-time fraud detection
    D. Decreased data diversity only
    Answer: D
  44. What is “distributed file system”?
    A. File system on a single machine only
    B. File system where data is stored across multiple machines in a cluster
    C. Pen drive only
    D. None
    Answer: B
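The cluster storage idea in option B can be sketched as a toy block-placement scheme: a file is split into fixed-size blocks and each block is replicated on several nodes. The block size, node names, and round-robin placement below are illustrative assumptions (HDFS, for comparison, defaults to 128 MB blocks with replication factor 3):

```python
# Toy distributed-file-system layout (hypothetical tiny parameters)
BLOCK_SIZE = 4          # bytes — tiny, for illustration only
REPLICATION = 2
NODES = ["node1", "node2", "node3"]

data = b"ABCDEFGHIJ"    # a 10-byte "file"

# Split the file into fixed-size blocks
blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

# Place each block's replicas on distinct nodes, round-robin
placement = {}
for i, _block in enumerate(blocks):
    placement[i] = [NODES[(i + r) % len(NODES)] for r in range(REPLICATION)]

print(blocks)      # [b'ABCD', b'EFGH', b'IJ']
print(placement)   # {0: ['node1', 'node2'], 1: ['node2', 'node3'], 2: ['node3', 'node1']}
```

Because every block lives on more than one machine, the file survives a single node failure — the core fault-tolerance property of HDFS.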
  45. What does “terabyte” denote?
    A. 1024 bytes
    B. 1024 gigabytes
    C. 1024 megabytes
    D. None
    Answer: B
  46. What is “petabyte”?
    A. 1024 terabytes
    B. 1024 gigabytes
    C. 1024 kilobytes
    D. None
    Answer: A
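The unit ladder behind Q45 and Q46 is just repeated multiplication by 1024, which a few lines make concrete:

```python
# Binary storage units: each step is a factor of 1024
KB = 1024            # kilobyte = 1024 bytes
MB = 1024 * KB       # megabyte = 1024 kilobytes
GB = 1024 * MB       # gigabyte = 1024 megabytes
TB = 1024 * GB       # terabyte = 1024 gigabytes (Q45)
PB = 1024 * TB       # petabyte = 1024 terabytes (Q46)

print(TB // GB)  # 1024
print(PB // TB)  # 1024
print(PB)        # 1125899906842624 bytes
```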
  47. What is the purpose of “data ingestion” in big data pipelines?
    A. Collecting and importing data from various sources into storage/processing systems
    B. Writing reports only
    C. Deleting data only
    D. None
    Answer: A
  48. Which data format is widely used for big data interchange?
    A. CSV, JSON, Parquet, Avro
    B. Only DOCX
    C. Only PPT
    D. None
    Answer: A
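Two of the formats in option A, JSON and CSV, are plain text and can be produced with the standard library alone (Parquet and Avro are binary columnar/row formats and need external libraries); a quick sketch with a hypothetical record:

```python
import csv
import io
import json

# The same record serialised in two common text interchange formats
record = {"txn_id": 101, "amount": 2500.0, "channel": "UPI"}

# JSON: self-describing text, widely used for APIs and logs
as_json = json.dumps(record)

# CSV: compact tabular text; the header row carries the field names
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=record.keys())
writer.writeheader()
writer.writerow(record)
as_csv = buf.getvalue()

print(as_json)
print(as_csv)
```

JSON repeats field names in every record; CSV states them once, which is why columnar binary formats like Parquet go further still for analytics at scale.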
  49. In the Big Data context, what is “lambda architecture”?
    A. A big data architectural pattern combining batch + real-time processing layers
    B. Only real-time processing
    C. Only batch processing
    D. None
    Answer: A
  50. What is “kappa architecture”?
    A. Architecture with separate batch and stream layers
    B. Architecture that does only stream processing (unified real-time layer)
    C. Only batch processing
    D. None
    Answer: B
  51. What is “graph processing” used for in Big Data?
    A. Only charts
    B. Analyzing relationships and networks (e.g., social networks, fraud detection networks)
    C. Only arithmetic
    D. None
    Answer: B
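The fraud-network use in option B boils down to treating accounts as nodes and transfers as edges, then finding connected groups. A toy sketch with made-up edges, using breadth-first search:

```python
from collections import deque

# Hypothetical transfer graph: accounts are nodes, transfers are edges
edges = [("A", "B"), ("B", "C"), ("D", "E")]

# Build an undirected adjacency list
graph = {}
for u, v in edges:
    graph.setdefault(u, []).append(v)
    graph.setdefault(v, []).append(u)

def connected_component(start):
    """Breadth-first search: every account reachable from `start`."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nbr in graph.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return seen

print(sorted(connected_component("A")))  # ['A', 'B', 'C'] — one candidate ring
```

Production graph engines (e.g., Spark GraphX, Neo4j) apply the same idea — components, paths, centrality — to graphs with billions of edges.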
  52. Which of the following Big Data tools is used for managing and scheduling workflow jobs in the Hadoop ecosystem?
    A. Oozie
    B. Excel
    C. Handwritten document
    D. None
    Answer: A
  53. What is “YARN” in the Hadoop ecosystem?
    A. Yarn fibre
    B. Yet Another Resource Negotiator — resource manager/scheduler in Hadoop
    C. Only text format
    D. None
    Answer: B
  54. What is the key objective of Big Data analytics in the financial sector?
    A. Only archival of old records
    B. Risk mitigation, fraud detection, customer segmentation, regulatory compliance, operational efficiency
    C. Only printing paper statements
    D. None
    Answer: B
  55. Which of the following best describes “cold data” vs “hot data”?
    A. Cold data = rarely used, may be archived; Hot data = frequently accessed, needs fast storage/processing
    B. Cold data = only offline
    C. Hot data = paper records
    D. None
    Answer: A
  56. What is “data blending” in the Big Data context?
    A. Mixing data from multiple heterogeneous sources into one for integrated analytics
    B. Only one data source
    C. Only images
    D. None
    Answer: A
  57. What is “data provenance”?
    A. History of where data came from and how it was processed
    B. Random data
    C. No tracking
    D. None
    Answer: A
  58. In banks, Big Data for credit scoring may include:
    A. Only past credit history
    B. Alternate data like mobile phone usage, utility payments, social networks (to enhance scoring & inclusion)
    C. Paper only
    D. None
    Answer: B
  59. Which compliance issue becomes significant in big data analytics?
    A. Only hardware cost
    B. Data privacy laws, data localisation, audit trails, consent management
    C. No compliance at all
    D. None
    Answer: B
  60. What is “edge computing” in the context of Big Data?
    A. Computing near the data source (e.g., IoT devices) to reduce latency and bandwidth usage
    B. Only cloud data centres far away
    C. Manual data entry only
    D. None
    Answer: A