📘 1. What is Big Data?
Big Data refers to data sets so large, complex, and fast-growing that they cannot be managed, processed, or analyzed efficiently using traditional tools such as spreadsheets (Excel) or standalone SQL databases alone.
🔍 In Simple Words:
Every time you use an ATM, UPI, a mobile app, or a card, data is generated.
This data grows very fast and comes from many sources — transactions, sensors, social media, websites, etc.
The huge, fast, and diverse nature of such data is what we call Big Data.
📊 Example:
| Source | Type of Data | Example |
|---|---|---|
| Banking transactions | Structured | Account number, amount, balance |
| WhatsApp messages | Unstructured | Text, audio, video |
| IoT sensors in ATMs | Semi-structured | Temperature logs, alerts |
| Social media | Unstructured | Tweets, likes, comments |
📏 2. Characteristics of Big Data (The 5 Vs)
| V | Meaning | Explanation | Example |
|---|---|---|---|
| Volume | Size of data | Huge amounts of data generated daily | Millions of ATM transactions per day |
| Velocity | Speed of generation | Data is created and updated in real time | UPI transactions per second |
| Variety | Different forms of data | Structured, semi-structured, unstructured | Excel sheets, images, videos, JSON files |
| Veracity | Accuracy of data | Reliability and quality of data | Removing duplicate or wrong entries |
| Value | Usefulness of data | How much insight/benefit the data gives | Fraud detection, risk analysis |
💡 Sometimes exams ask: “What are the 3 Vs / 5 Vs of Big Data?” — remember these keywords.
⚙️ 3. Components of Big Data Architecture
Big Data systems work in three stages — Storage, Processing, and Analysis.
| Stage | Function | Technology Examples |
|---|---|---|
| Storage | Store massive data sets safely | HDFS (Hadoop Distributed File System), HBase, S3 |
| Processing | Handle and compute data efficiently | Hadoop MapReduce, Apache Spark |
| Analysis | Extract insights and visualize | Hive, Pig, Tableau, Power BI |
🧩 4. Key Technologies in Big Data
🔸 1. Hadoop Ecosystem
Hadoop = Open-source framework that allows distributed storage and parallel processing.
| Component | Purpose |
|---|---|
| HDFS (Hadoop Distributed File System) | Stores large data across multiple servers |
| MapReduce | Processes and analyzes data in parallel |
| YARN (Yet Another Resource Negotiator) | Manages cluster resources |
| Hive | SQL-like querying on Big Data |
| Pig | Data transformation (ETL scripting) |
| HBase | NoSQL database for large tables |
| Sqoop | Transfers data between Hadoop and RDBMS |
| Flume | Collects streaming data (like logs) |
| Oozie | Schedules and manages Hadoop workflows |
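The map-shuffle-reduce idea behind MapReduce can be sketched in plain Python. This is only an illustration of the programming model on invented data, not the Hadoop API: map emits key-value pairs, shuffle groups them by key, and reduce aggregates each group.

```python
from collections import defaultdict

def map_phase(record):
    # Emit (word, 1) for every word in one input line
    return [(word, 1) for word in record.split()]

def shuffle(pairs):
    # Group all values by key, as Hadoop does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Aggregate the grouped values for one key
    return key, sum(values)

lines = ["atm upi atm", "upi card"]
mapped = [pair for line in lines for pair in map_phase(line)]
result = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
# result == {"atm": 2, "upi": 2, "card": 1}
```

In a real cluster, the map and reduce calls run in parallel on many DataNodes, and the shuffle moves data across the network between them.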
🔸 2. Apache Spark
Faster alternative to Hadoop’s MapReduce.
Performs real-time analytics using in-memory processing.
Used for: Fraud detection, credit scoring, sentiment analysis.
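The speed advantage of in-memory processing can be shown with a tiny Python sketch: keep an intermediate result in RAM instead of recomputing it (or re-reading it from disk) on every access. Spark's `cache()`/`persist()` applies the same principle at cluster scale; the computation here is just a stand-in.

```python
import functools

@functools.lru_cache(maxsize=None)
def expensive_aggregate(n):
    # Stand-in for a heavy computation over a large dataset
    return sum(i * i for i in range(n))

first = expensive_aggregate(100_000)   # computed once
second = expensive_aggregate(100_000)  # served from memory, no recomputation
```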
🔸 3. NoSQL Databases
Handle unstructured and semi-structured data.
Examples: MongoDB, Cassandra, CouchDB, HBase
| Feature | RDBMS | NoSQL |
|---|---|---|
| Structure | Tables & rows | Key-value or document format |
| Schema | Fixed | Flexible |
| Scalability | Vertical (add hardware) | Horizontal (add servers) |
| Example | Oracle, MySQL | MongoDB, HBase |
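The "flexible schema" row can be illustrated with a toy document store in Python (names and data invented): documents in the same collection need not share a fixed set of fields, unlike rows in an RDBMS table. Real systems such as MongoDB add persistence, indexing, and a query language on top of this idea.

```python
customers = {}  # collection: _id -> document

def insert(doc_id, document):
    customers[doc_id] = document

# Two documents with different fields coexist in one collection
insert("C001", {"name": "Asha", "city": "Pune"})
insert("C002", {"name": "Ravi", "upi_handle": "ravi@upi", "kyc_done": True})

def find(predicate):
    # Simple query: return documents matching a condition
    return [d for d in customers.values() if predicate(d)]

with_kyc = find(lambda d: d.get("kyc_done", False))
```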
🔸 4. Real-Time & Stream Processing Tools
| Tool | Use |
|---|---|
| Kafka | Real-time data streaming |
| Flink / Storm | Stream data analytics |
| Spark Streaming | Real-time event analysis (e.g., UPI transactions) |
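A minimal Python sketch of the stream-processing idea behind these tools: flag a card that transacts too often inside a sliding time window. The thresholds and data are invented; Spark Streaming and Flink apply the same windowing logic at scale, across many machines.

```python
from collections import deque

WINDOW_SECONDS = 60  # size of the sliding window
MAX_TXNS = 3         # more than this inside one window is suspicious

def detect(events):
    """events: iterable of (timestamp_seconds, card_id); yields flagged cards."""
    recent = {}  # card_id -> deque of timestamps still inside the window
    for ts, card in events:
        window = recent.setdefault(card, deque())
        window.append(ts)
        while window and ts - window[0] > WINDOW_SECONDS:
            window.popleft()  # evict events that fell out of the window
        if len(window) > MAX_TXNS:
            yield card

stream = [(0, "A"), (10, "A"), (20, "A"), (30, "A"), (200, "B")]
flagged = list(detect(stream))
# card "A" made 4 transactions within 60 seconds, so it is flagged
```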
💾 5. Types of Data in Big Data
| Type | Description | Example |
|---|---|---|
| Structured | Organized in rows and columns | Core banking data |
| Semi-Structured | No fixed format, but carries tags/markers | XML, JSON |
| Unstructured | Raw, messy data | Images, videos, emails |
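The three types can be contrasted in a few lines of Python (sample values invented): structured data is addressed by a fixed column position, semi-structured data by its tags, and unstructured text needs parsing before any analysis.

```python
import json

structured_row = ("ACC1001", 2500.00, "INR")  # fixed column order

semi_structured = json.loads(
    '{"account": "ACC1001", "alerts": [{"type": "temp", "value": 41}]}'
)

unstructured = "Customer emailed: please block my card ending 4421"

# Structured data is queried positionally; semi-structured by tag name
amount = structured_row[1]
alert_type = semi_structured["alerts"][0]["type"]
```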
📈 6. Big Data Analytics Types
| Type | Answers | Example in Banking |
|---|---|---|
| Descriptive Analytics | What happened? | Monthly transaction reports |
| Diagnostic Analytics | Why did it happen? | Fraud reason analysis |
| Predictive Analytics | What will happen? | Predicting loan default risk |
| Prescriptive Analytics | What should be done? | Recommending loan terms |
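A toy numeric illustration of the four analytics types, using invented monthly transaction counts. Real systems use far richer models; this only shows which question each type answers.

```python
monthly_txns = [100, 120, 150, 180]

# Descriptive: what happened? -> summarize the past
total = sum(monthly_txns)

# Diagnostic: why did it happen? -> compare periods to locate the change
growth = [b - a for a, b in zip(monthly_txns, monthly_txns[1:])]

# Predictive: what will happen? -> naive forecast: last value + average growth
forecast = monthly_txns[-1] + sum(growth) / len(growth)

# Prescriptive: what should be done? -> turn the forecast into an action
action = "add server capacity" if forecast > 200 else "no change"
```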
🏦 7. Big Data in Banking & Financial Sector
| Use Case | Description |
|---|---|
| Fraud Detection | Detect unusual or suspicious activity in real time using AI + Big Data |
| Customer Segmentation | Group customers by behavior for marketing |
| Credit Scoring | Include non-traditional data (like digital behavior) for loan risk analysis |
| Regulatory Compliance | Maintain audit trails, data lineage, KYC data |
| Risk Management | Predict potential NPAs and market risks |
| Personalized Offers | Recommend credit cards, loans based on customer history |
🏦 Example in Banking Context:
Banks such as SBI and HDFC use Big Data to track your spending pattern and detect fraudulent credit-card transactions instantly using real-time analytics.
🔐 8. Big Data Security & Governance
| Term | Meaning |
|---|---|
| Data Governance | Framework for managing data quality, access, and usage |
| Data Lineage | Tracking where data comes from and how it changes |
| Data Privacy | Protecting customer data (RBI and GDPR compliance) |
| Encryption | Securing data at rest and in transit |
| Anonymization | Hiding personal identity from datasets |
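Two of the terms above can be sketched in Python: masking (hide most of an identifier) and pseudonymization via hashing (replace an identity with a consistent token, a common anonymization technique). Production systems add salting and key management; the identifiers below are invented.

```python
import hashlib

def mask_account(account_no):
    # Keep only the last 4 digits visible
    return "X" * (len(account_no) - 4) + account_no[-4:]

def pseudonymize(customer_id):
    # Same input always yields the same token, so joins on the token still work
    return hashlib.sha256(customer_id.encode()).hexdigest()[:12]

masked = mask_account("123456789012")
token = pseudonymize("CUST-42")
```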
☁️ 9. Big Data and Cloud Computing
Big Data is often stored and processed on the cloud for scalability and cost savings.
| Provider | Big Data Service |
|---|---|
| AWS | EMR, Redshift, S3 |
| Azure | HDInsight, Synapse |
| Google Cloud | BigQuery, Dataproc |
🇮🇳 10. Big Data in Indian Context
✅ Govt Initiatives:
- NITI Aayog – National Strategy for AI (“AI for All”)
- NDAP (National Data and Analytics Platform)
- MeghRaj (GI Cloud)
🧭 11. Summary Chart

| Topic | Key Idea |
|---|---|
| Definition | Handling very large and complex datasets |
| 5 Vs | Volume, Velocity, Variety, Veracity, Value |
| Core Technology | Hadoop Ecosystem |
| Key Tool | Apache Spark |
| Banking Use | Fraud detection, risk analytics |
| Security | Data encryption, governance, privacy |
| Govt Initiative | NDAP, NITI Aayog “AI for All” |
🧠 Section 1: Basics of Big Data
What is Big Data? A. Large amounts of structured and unstructured data generated daily B. Only data stored in Excel files C. Small datasets analyzed manually D. Data stored in floppy disks Answer: A
Which of the following is NOT a characteristic of Big Data? A. Volume B. Velocity C. Variety D. Visibility Answer: D
The term “3 Vs” in Big Data stands for: A. Volume, Velocity, Variety B. Virtualization, Visualization, Value C. Volume, Value, Visualization D. Version, Volume, Validation Answer: A
Which two additional Vs are often added to the original 3Vs? A. Veracity and Value B. Validity and Volume C. Vision and Variety D. Velocity and Vacuum Answer: A
What does ‘Volume’ represent in Big Data? A. Size or amount of data generated B. Type of data C. Speed of data D. Accuracy of data Answer: A
‘Velocity’ in Big Data refers to: A. Speed at which data is generated, processed and analyzed B. The variety of data formats C. The accuracy of data D. The cost of storage Answer: A
‘Variety’ refers to: A. Different forms of data — structured, semi-structured, unstructured B. Only structured data C. Data duplication D. Data redundancy Answer: A
‘Veracity’ means: A. Accuracy and trustworthiness of data B. Size of data C. Type of data D. None Answer: A
‘Value’ in Big Data means: A. Economic or business benefit derived from data B. Random number of bytes C. File system name D. Encryption code Answer: A
Which statement about Big Data is TRUE? A. It cannot be stored in traditional systems efficiently B. It always comes from one source C. It is mostly static D. It does not require analytics Answer: A
⚙️ Section 2: Big Data Components & Architecture
Which is the most popular open-source Big Data framework? A. Hadoop B. Oracle C. Access D. MySQL Answer: A
Which language is Hadoop primarily written in? A. Java B. Python C. C++ D. Go Answer: A
HDFS stands for: A. Hadoop Distributed File System B. High Data File Storage C. High Definition File Server D. Hadoop Data File Set Answer: A
The two main components of Hadoop are: A. HDFS and MapReduce B. Hive and Pig C. Spark and Kafka D. SQL and NoSQL Answer: A
NameNode in Hadoop is responsible for: A. Storing metadata and directory tree of files B. Processing user data C. Executing MapReduce tasks D. Compressing data Answer: A
DataNode in Hadoop: A. Stores actual data blocks B. Stores metadata C. Controls access rights D. Runs NameNode Answer: A
MapReduce is used for: A. Parallel processing of data across distributed nodes B. Data visualization C. File encryption D. Network configuration Answer: A
Map phase in MapReduce does: A. Splits and processes data B. Aggregates output C. Deletes logs D. Encrypts data Answer: A
Reduce phase in MapReduce: A. Aggregates intermediate outputs and produces results B. Splits data C. Encrypts input D. None Answer: A
YARN stands for: A. Yet Another Resource Negotiator B. Yearly Analysis Resource Node C. Yield Aggregation Random Network D. None Answer: A
💾 Section 3: Big Data Technologies
Which of the following is NOT part of the Hadoop ecosystem? A. HDFS B. MapReduce C. Cassandra D. Hive Answer: C
Which Hadoop component provides SQL-like queries? A. Hive B. Pig C. HBase D. Mahout Answer: A
Apache Pig is used for: A. Data flow scripting and ETL (Extract, Transform, Load) B. Image processing C. File transfer D. Compression Answer: A
Which Hadoop component is a NoSQL database? A. HBase B. Hive C. Sqoop D. Flume Answer: A
Apache Sqoop is used for: A. Transferring data between Hadoop and RDBMS B. Data visualization C. System security D. Job scheduling Answer: A
Apache Flume is used for: A. Collecting and moving streaming log data to HDFS B. Storing images C. Email filtering D. File encryption Answer: A
Apache Oozie is: A. Workflow scheduler for Hadoop jobs B. Visualization tool C. Data encryption library D. Data cleaner Answer: A – Apache Oozie is a workflow scheduler for Hadoop that helps you run, manage, and automate big data jobs in a sequence. In simple words: It organizes and schedules your Hadoop tasks so they run in the right order.
Mahout in Hadoop is used for: A. Machine Learning B. File Transfer C. Data Security D. Encryption Answer: A – Apache Mahout is an open-source library that provides machine learning algorithms (like clustering, classification, and recommendations) that can run on big data systems. In simple words: Mahout helps build ML models that work with large amounts of data.
Which of the following is an in-memory Big Data processing engine? A. Apache Spark B. Hive C. Flume D. Oozie Answer: A – Apache Spark is a fast, open-source big data processing engine used for analyzing large datasets quickly. In simple words: Spark processes big data super fast, much faster than Hadoop MapReduce.
Spark is written in which language? A. Scala B. Python C. Java D. All of the above Answer: A – Spark’s core engine is written mainly in Scala; Python, Java, R, and Scala are supported as API languages.
☁️ Section 4: Databases & Storage
NoSQL databases are designed for: A. Non-relational, unstructured data storage B. Only structured data C. Relational joins D. Small datasets only Answer: A
Which of the following is a NoSQL database? A. MongoDB B. Oracle C. MySQL D. PostgreSQL Answer: A
CAP theorem states: A. Consistency, Availability, Partition Tolerance B. Capacity, Accessibility, Processing C. Clustering, Aggregation, Partitioning D. Cache, API, Protocol Answer: A – A distributed system cannot provide Consistency, Availability, and Partition tolerance all together—only any two at a time.
In Big Data, data stored in HDFS is split into: A. Blocks B. Tables C. Arrays D. Streams Answer: A
Default block size in Hadoop 2.x is: A. 64 MB B. 128 MB C. 512 MB D. 1 GB Answer: B
Which system provides column-oriented storage? A. HBase B. Hive C. Flume D. Sqoop Answer: A – HBase is a distributed, NoSQL database built on top of Hadoop that stores large amounts of data in tables with rows and columns.
Which database supports high write throughput and scalability? A. Cassandra B. MySQL C. Oracle D. MS Access Answer: A – Apache Cassandra is a highly scalable, distributed NoSQL database designed to handle large amounts of data across many servers without downtime.
Which Big Data tool is ideal for real-time analytics? A. Apache Spark Streaming B. MapReduce C. Sqoop D. Oozie Answer: A – Apache Spark Streaming is a component of Spark used to process real-time data streams (like logs, sensor data, or live events).
Which Big Data storage solution is often used by cloud providers like AWS? A. S3 (Simple Storage Service) B. HDD C. SSD D. Pendrive Answer: A
Which Big Data format is used for efficient columnar storage? A. Parquet B. CSV C. JSON D. TXT Answer: A – Columnar storage is a way of storing data column-wise instead of row-wise.
🧮 Section 5: Data Analytics & Tools
What does ETL stand for? A. Extract, Transform, Load B. Evaluate, Test, Learn C. Encode, Translate, Load D. Encrypt, Transfer, Log Answer: A
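A minimal ETL sketch in Python (file contents invented): extract rows from a CSV source, transform them (enforce types, drop rows that fail validation), and load them into a list standing in for a warehouse table.

```python
import csv
import io

# In-memory stand-in for a source file; one row has a bad amount on purpose
source = io.StringIO("account,amount\nACC1,2500\nACC2,notanumber\nACC3,900\n")

def extract(fh):
    return list(csv.DictReader(fh))

def transform(rows):
    clean = []
    for row in rows:
        try:
            clean.append({"account": row["account"], "amount": float(row["amount"])})
        except ValueError:
            continue  # drop rows that fail type validation
    return clean

warehouse = []  # stand-in for the target warehouse table

def load(rows):
    warehouse.extend(rows)

load(transform(extract(source)))
```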
Data Lake is: A. Centralized storage of raw data in its native format B. Traditional data warehouse C. Temporary cache D. File backup Answer: A – A Data Lake is a large storage system that holds raw data in any format—structured or unstructured. It’s a big storage pool where you dump all types of data without needing to organize it first.
Data Warehouse stores: A. Processed and structured data B. Raw unprocessed data C. Only images D. Logs only Answer: A
Which tool is widely used for data visualization? A. Tableau B. Flume C. Sqoop D. Oozie Answer: A – Tableau is a data visualization tool used to create interactive charts, dashboards, and reports.
Which programming language is most widely used in Big Data Analytics? A. Python B. PHP C. COBOL D. Fortran Answer: A
R language is mainly used for: A. Statistical and data analysis B. Image rendering C. File compression D. None Answer: A
Which tool provides interactive dashboards for Big Data? A. Power BI B. Spark C. Pig D. Sqoop Answer: A – Power BI is Microsoft’s data visualization and business analytics tool that helps create interactive reports and dashboards.
Apache Kafka is used for: A. Real-time data streaming and messaging B. File storage C. Scheduling D. Data cleaning Answer: A
Which is an example of batch processing? A. Hadoop MapReduce B. Spark Streaming C. Kafka D. Flink Answer: A – Hadoop MapReduce is a programming model used to process large datasets in parallel across many computers.
Which is an example of real-time stream processing? A. Spark Streaming B. Pig C. Hive D. Oozie Answer: A – Spark Streaming is a part of Apache Spark that processes real-time data as it arrives.
💡 Section 6: Big Data in Banking & Governance
Which of the following is NOT a use case of Big Data in banking? A. Credit risk analysis B. Customer churn prediction C. Data-driven lending decisions D. Manual bookkeeping Answer: D
Banks use Big Data primarily for: A. Fraud detection and risk management B. ATM cash loading C. Locker assignment D. Branch painting Answer: A
Which Big Data system can detect fraudulent transactions in real-time? A. Spark Streaming + ML model B. Excel pivot table C. Manual entry system D. None Answer: A
Customer segmentation is done using: A. Clustering algorithms B. Linear regression only C. File splitting D. Encryption Answer: A
Big Data supports compliance by: A. Maintaining detailed logs and audit trails B. Reducing records C. Ignoring regulations D. Storing only emails Answer: A
In credit scoring, Big Data can include: A. Social media and digital behavior data B. Only balance sheets C. Paper forms D. None Answer: A
Which Indian bank uses Big Data for cross-selling and risk analytics? A. SBI B. HDFC C. ICICI D. All of these Answer: D
Which global regulation affects data management in banking? A. GDPR (General Data Protection Regulation) B. NATO C. Basel II D. KYC Answer: A
RBI encourages Big Data use for: A. Fraud analytics, AML, customer behavior prediction B. Manual auditing C. Teller operations D. None Answer: A
Big Data helps regulators like SEBI in: A. Detecting insider trading using transaction analytics B. Manual monitoring C. File archiving only D. None Answer: A
🔐 Section 7: Security, Privacy & Challenges
Big Data security involves protecting: A. Data at rest, in motion, and in use B. Only physical systems C. Only small datasets D. None Answer: A
Data encryption ensures: A. Confidentiality of sensitive data B. Faster processing C. Data duplication D. None Answer: A
Main challenge in Big Data analytics: A. Data integration from multiple sources B. Low storage C. Static datasets only D. None Answer: A
Data governance ensures: A. Quality, integrity, and security of data assets B. More data duplication C. Less control D. None Answer: A
Which process removes duplicate or incorrect data? A. Data cleansing B. Data encryption C. Data generation D. Data replication Answer: A
In Big Data, data lineage means: A. Tracking the origin and movement of data B. Data encryption C. Data compression D. None Answer: A
Which of the following ensures regulatory compliance? A. Data governance policies B. Random sampling C. Data deletion D. Data duplication Answer: A
One major privacy issue with Big Data is: A. Unauthorized profiling and surveillance B. Faster results C. Lower storage D. None Answer: A
In banks, Big Data platforms must follow: A. RBI’s IT Framework for NBFCs & Banks (2017) B. No regulation C. Telecom Act D. None Answer: A
Which Big Data principle supports ethical AI models? A. Transparency & Fairness B. Secrecy & Isolation C. Bias & Speed D. None Answer: A
🧩 Section 8: Cloud & Big Data Integration
Which cloud model is most commonly used for Big Data processing? A. Hybrid Cloud B. Public Cloud C. Private Cloud D. All of the above Answer: D
Which cloud service provides on-demand data storage for analytics? A. AWS S3 B. Google Sheets C. OneDrive Basic D. Excel Answer: A
Which Google service handles Big Data queries? A. BigQuery B. Gmail C. Google Docs D. GDrive Answer: A
Which Microsoft service is used for Big Data analytics? A. Azure Synapse Analytics B. Outlook C. Excel only D. MS Paint Answer: A
Which AWS service provides distributed data warehousing? A. Amazon Redshift B. CloudWatch C. Lambda D. SNS Answer: A – Amazon Redshift is a fully managed cloud data warehouse service by AWS used for fast analytics on large datasets.
📈 Section 9: Analytics & Business Intelligence
Descriptive analytics means: A. What happened B. Why it happened C. What will happen D. What should be done Answer: A
Diagnostic analytics focuses on: A. Why it happened B. What happened C. Predicting future events D. None Answer: A
Predictive analytics answers: A. What will happen next B. Why it happened C. None D. Who caused it Answer: A
Prescriptive analytics provides: A. Suggested actions to take based on data B. Only reporting C. Historical summaries D. None Answer: A
Real-time analytics means: A. Immediate analysis of live streaming data B. Manual report generation C. Batch processing only D. None Answer: A
🌍 Section 10: Emerging Trends and Government Initiatives
India’s National Data and Analytics Platform (NDAP) is launched by: A. NITI Aayog B. RBI C. SEBI D. SBI Answer: A
The National Digital Communications Policy supports: A. Data-driven innovation and Big Data analytics B. Only manual processes C. None D. Hardware assembly Answer: A
Which technology is converging with Big Data for faster insights? A. Artificial Intelligence (AI) B. Blockchain C. IoT D. All of the above Answer: D
IoT generates Big Data mainly from: A. Connected sensors and devices B. Only mobile apps C. Human typing D. Paper forms Answer: A
Which of the following combines AI + Big Data + IoT? A. Smart Cities B. Gaming C. Manual billing D. None Answer: A
Which initiative aims to build data centers and cloud infra in India? A. MeghRaj (GI Cloud) B. UIDAI C. DigiLocker D. MyGov Answer: A
Which term describes converting large unstructured datasets into meaningful patterns? A. Data Mining B. Data Cleaning C. Data Encryption D. Data Segmentation Answer: A
ETL is part of: A. Data Integration Process B. Data Destruction C. Data Visualization D. File Compression Answer: A
Which open-source platform is often used with Python for Big Data analytics? A. Jupyter Notebook B. PowerPoint C. Photoshop D. Excel Answer: A
Which term means discovering hidden patterns in large datasets? A. Data Mining B. Data Hiding C. Data Scrubbing D. Data Compression Answer: A
🏁 Section 11: Miscellaneous / Advanced
The main goal of Big Data Analytics is: A. Extract actionable insights for better decisions B. Store unused data C. Create random reports D. Delete old files Answer: A
Which technology layer processes Big Data in-memory for speed? A. Spark B. Pig C. Hive D. Sqoop Answer: A
Which company originally developed Hadoop? A. Yahoo B. Google C. IBM D. Microsoft Answer: A
Google File System (GFS) inspired: A. Hadoop Distributed File System (HDFS) B. Hive C. Spark D. Flume Answer: A
The “Map” function in MapReduce: A. Transforms input data into key-value pairs B. Aggregates data C. Deletes records D. Stores metadata Answer: A
The “Reduce” function: A. Summarizes intermediate key-value pairs B. Splits data C. Encrypts data D. Compresses output Answer: A
In Big Data, a ‘cluster’ means: A. Group of connected servers/nodes working together B. A database column C. A single server D. File compression tool Answer: A
Which term refers to analyzing data as it arrives? A. Stream processing B. Batch processing C. Archiving D. Logging Answer: A
Which of these is NOT a Big Data challenge? A. Data Quality B. Data Volume C. Scalability D. Manual labor shortage Answer: D
Which of the following describes “Big Data Analytics”? A. Process of examining large datasets to uncover hidden patterns and insights B. Deleting historical data C. Manual tallying D. Printing reports Answer: A
Big Data — MCQs (60 questions)
What is “Big Data”? A. Data stored in a single Excel sheet only B. Extremely large volumes of data (structured/unstructured) that traditional systems cannot process efficiently C. Only video files D. Data stored on paper Answer: B
Which of the following is not one of the “3 Vs” of Big Data (classic definition)? A. Volume B. Velocity C. Variety D. Validity Answer: D
Many experts now add 2 more Vs to Big Data, making it “5 Vs”. Which are they? A. Value and Veracity B. Visualization and Verification C. Vacuum and Variation D. Variability and Velocity (again) Answer: A
“Velocity” in Big Data refers to: A. The speed at which data is generated and processed B. The size of the data C. The type of data only D. The accuracy of data Answer: A
Which of the following is a major challenge of Big Data? A. Low data volume B. Lack of tools for real-time processing C. Only structured data exists D. No use case Answer: B
Which technology is commonly used for distributed processing of Big Data? A. Excel B. Hadoop MapReduce framework C. Basic desktop database D. Paper ledger Answer: B
What is the role of Hadoop’s HDFS (Hadoop Distributed File System)? A. Store small data only B. Store large amounts of data across many machines (distributed storage) C. Only process data D. None of the above Answer: B
“NoSQL” databases are often associated with Big Data because: A. They support rigid schema only B. They handle large volumes, variety, and horizontal scalability better than many RDBMS C. They cannot scale D. They only support text data Answer: B
Which of the following is a NoSQL database commonly used for Big Data? A. MySQL only B. MongoDB, Cassandra C. Excel only D. MS Access Answer: B
What is “data lake”? A. A small Excel file B. A large repository that stores raw data (structured/unstructured) in original form for later processing C. A temporary folder only D. Only for paper files Answer: B
Which of the following is not a Big Data analytics type? A. Descriptive analytics B. Predictive analytics C. Prescriptive analytics D. Manual ledger entries with no analytics Answer: D
In the context of Big Data analytics, what is “predictive analytics”? A. Predicting future trends using statistical and machine-learning models B. Just describing past data C. Only storing data D. Printing data Answer: A
Why is Big Data important for banks? A. For storing paper files only B. For improving risk scoring, fraud detection, customer insights and operations C. Only for cash counting D. None of the above Answer: B
Which Big Data use case is relevant in banking? A. Real-time transaction monitoring for fraud B. Chatbots only C. Only branch expansion D. Paper archival only Answer: A
What is “streaming data processing”? A. Processing batches overnight only B. Processing data continuously in real-time as it arrives (e.g., high-speed transactions, sensors) C. No processing D. Only manual reports monthly Answer: B
Which framework supports real-time stream processing in Big Data? A. Hadoop MapReduce only B. Apache Spark Streaming, Flink C. MS Word D. None Answer: B
What is “ETL” in data warehousing / Big Data context? A. Extract-Transform-Load: process of moving data from sources to analytics systems B. Editor-Type-Loop C. Electron-Transmission-Link D. None Answer: A
Which of the following is a benefit of Big Data for financial inclusion? A. Only for large enterprises B. Better credit scoring using alternate data from mobile, social networks C. No benefit D. Only for hardware costs Answer: B
What is “data veracity”? A. Speed of data only B. The trustworthiness, quality and accuracy of the data C. The size of data only D. The location of data only Answer: B
What does “horizontal scalability” in Big Data mean? A. Increasing size of a single machine B. Adding more machines (nodes) to handle more data/workload C. Reducing machine count only D. None Answer: B
“Partitioning” in Big Data storage means: A. Splitting large dataset into smaller pieces stored across machines for parallel access B. Merging small files only C. Deleting data D. None Answer: A
Which one is a characteristic of Big Data platforms in banks? A. Only relate to paper data B. Ability to handle high-volume, high-velocity data, and ensure compliance & security C. No security needed D. Only offline use Answer: B
What is “in-memory computing” in Big Data context? A. Storing and processing data in RAM for faster operations rather than disk B. Using pen and paper C. Only offline batch D. No processing Answer: A
Which of the following tools is used for Big Data visualization? A. Tableau, Power BI B. Notebooks only C. Paper charts only D. None Answer: A
What is “data governance” in Big Data? A. Only storing data with no rules B. Framework of policies, roles, processes to ensure data quality, privacy, compliance and usage C. Ignoring data rules D. No audits Answer: B
Which of the following is a security concern for Big Data in banks? A. No concern B. Data breaches, unauthorized access, ensuring encryption at rest/in transit, masking sensitive customer data C. Only hardware theft D. None Answer: B
What does “Hadoop YARN” do? A. Only stores data B. Manages resources/scheduling of tasks across the cluster C. Only processes images D. None Answer: B
What is “Spark” in Big Data? A. A fireworks brand B. An engine for fast, in-memory processing of large datasets (batch + stream) C. A database only D. None Answer: B
What is “MapReduce”? A. A procedure to map functions only in spreadsheets B. Programming model: map (process) and reduce (aggregate) functions for large data sets in parallel C. Only for graphics D. None Answer: B
What is “HBase”? A. A relational DB B. A NoSQL distributed database built on HDFS for Big Data C. Only a file system D. None Answer: B
When large data arrives very fast and must be processed within seconds, this is termed: A. Batch processing B. Real-time or near-real-time processing C. Offline archiving D. None Answer: B
What is “data lakehouse”? A. A traditional lake only B. Modern architecture combining data lake (raw storage) + data warehouse (governed structure) C. Only archive files D. None Answer: B
Which of the following is not a Big Data architecture component? A. Ingestion layer B. Storage layer C. Processing/analytics layer D. Typewriter layer Answer: D
What is “schema-on-read” compared to “schema-on-write”? A. Schema-on-write: structure defined when writing data; schema-on-read: structure defined when reading/querying data B. The same thing C. Only for manual data D. None Answer: A
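Schema-on-read can be sketched in Python (record shapes invented): raw JSON is stored as-is on the write side, as a data lake would accept it, and a schema is applied only when a query reads the data — records that do not fit are skipped at read time rather than rejected at write time.

```python
import json

raw_store = []  # "data lake": accepts anything at write time

def write(record_text):
    raw_store.append(record_text)  # no schema enforced on write

def read_with_schema(fields):
    # Apply structure at read time; keep only records with the needed fields
    out = []
    for text in raw_store:
        rec = json.loads(text)
        if all(f in rec for f in fields):
            out.append({f: rec[f] for f in fields})
    return out

write('{"account": "ACC1", "amount": 500}')
write('{"event": "login", "device": "mobile"}')  # different shape, still stored
txns = read_with_schema(["account", "amount"])
```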
In banking, “alternate data” used for credit-scoring in Big Data means: A. Only old paper records B. Non-traditional data like mobile behaviour, social network, utility payments which help assess creditworthiness C. No data at all D. Only cash payments Answer: B
What is “metadata” in Big Data? A. Data about data (e.g., source, format, time stamp) B. Only raw numbers C. Only images D. None Answer: A
What is a “data swamp”? A. Clean data stores only B. A poorly managed data lake full of ungoverned/raw/unused data that degrades value C. Small datasets D. None Answer: B
Which is a big data storage best practice? A. No backup B. Archival of cold data, tiered storage, appropriate indexing, encryption and governance C. Keep everything in one massive file only D. None Answer: B
What is “feature extraction” in Big Data analytics/machine learning? A. Manual filing B. Deriving meaningful features from raw data to feed into models C. Deleting data fields D. None Answer: B
What is “data anonymisation”? A. Revealing all customer identifiers B. Removing or masking personally identifying information so that privacy is protected C. Publishing names only D. None Answer: B
Which of the following best describes “data mart”? A. Small marketplace B. Sub-set of data warehouse focused on one area/department C. Entire bank data store D. None Answer: B
What is “Hadoop Hive”? A. Spreadsheet software B. Data-warehouse tool on Hadoop allowing SQL-like queries on big data C. A web browser only D. None Answer: B
Which of the following is NOT a Big Data value driver for banks? A. Improved customer insights B. Faster risk decisioning C. Real-time fraud detection D. Decreased data diversity only Answer: D
What is “distributed file system”? A. File system on a single machine only B. File system where data is stored across multiple machines in a cluster C. Pen drive only D. None Answer: B
What does “Terabyte” denote? A. 1024 bytes B. 1024 gigabytes C. 1024 megabytes D. None Answer: B
What is “petabyte”? A. 1024 terabytes B. 1024 gigabytes C. 1024 kilobytes D. None Answer: A
What is the purpose of “data ingestion” in big data pipelines? A. Consuming, collecting and importing data from various sources into storage/processing systems B. Writing reports only C. Deleting data only D. None Answer: A
Which data format is widely used for big data interchange? A. CSV, JSON, Parquet, Avro B. Only DOCX C. Only PPT D. None Answer: A
In Big Data context, what is “lambda architecture”? A. A big data architectural pattern combining batch + real-time processing layers B. Only real-time processing C. Only batch processing D. None Answer: A
What is “kappa architecture”? A. Architecture with separate batch and stream layers B. Architecture that does only stream processing (unified real-time layer) C. Only batch processing D. None Answer: B
What is “graph processing” used for in big data? A. Only charts B. Analyzing relationships and networks (e.g., social networks, fraud detection networks) C. Only arithmetic D. None Answer: B
Which of the following Big Data tools is used for managing and scheduling workflow jobs in Hadoop ecosystem? A. Oozie B. Excel C. Handwritten document D. None Answer: A
What is “YARN” in Hadoop ecosystem? A. Yarn fibre B. Yet Another Resource Negotiator — resource manager/scheduler in Hadoop C. Only text format D. None Answer: B
What is the key objective of Big Data analytics in financial sector? A. Only archival of old records B. Risk mitigation, fraud detection, customer segmentation, regulatory compliance, operational efficiency C. Only printing paper statements D. None Answer: B
Which of the following best describes “cold data” vs “hot data”? A. Cold data = rarely used, may be archived; Hot data = frequently accessed, needs fast storage/processing B. Cold data = only offline C. Hot data = paper records D. None Answer: A
What is “data blending” in big data context? A. Mixing data from multiple heterogeneous sources into one for integrated analytics B. Only one data source C. Only images D. None Answer: A
What is “data provenance”? A. History of where data came from and how it was processed B. Random data C. No tracking D. None Answer: A
In banks, Big Data for credit scoring may include: A. Only past credit history B. Alternate data like mobile phone usage, utility payments, social networks (to enhance scoring & inclusion) C. Paper only D. None Answer: B
Which compliance issue becomes significant in big data analytics? A. Only hardware cost B. Data privacy laws, data localisation, audit trails, consent management C. No compliance at all D. None Answer: B
What is the “edge computing” in context of Big Data? A. Computing near the data source (e.g., IoT devices) to reduce latency and bandwidth usage B. Only cloud data centres far away C. Manual data entry only D. None Answer: A