Comprehensive DBMS and SQL questions for technical and system design rounds. Covers 10–15% of typical interview weightage.
A DBMS is software that manages databases — it provides an interface between users/applications and the physical data stored. It handles data storage, retrieval, and modification while ensuring data integrity, security, and consistency. Advantages over file systems: reduced data redundancy, concurrent access control, data independence, backup/recovery, enforced constraints. Examples: MySQL, PostgreSQL, Oracle, SQL Server, MongoDB.
SQL (Relational) databases use structured tables with predefined schemas, support ACID transactions, and use SQL for queries. Best for structured data with complex relationships (e.g., MySQL, PostgreSQL). NoSQL databases are schema-flexible, designed for horizontal scaling, and come in types: Document (MongoDB), Key-Value (Redis), Column (Cassandra), Graph (Neo4j). Best for large-scale, unstructured/semi-structured data. Choose based on: consistency needs, scalability, and data structure.
Normalization organizes database tables to reduce redundancy and improve integrity. 1NF: Each column has atomic values; no repeating groups. 2NF: 1NF + No partial dependency (non-key attributes depend on entire primary key — applicable when composite key exists). 3NF: 2NF + No transitive dependency (non-key attributes depend only on primary key, not on other non-key attributes). BCNF (Boyce-Codd): Stricter 3NF — for every dependency X→Y, X must be a superkey.
ACID ensures reliable database transactions: Atomicity — transaction is all-or-nothing; if any part fails, the entire transaction rolls back. Consistency — database remains in a valid state before and after transaction. Isolation — concurrent transactions appear to execute serially; intermediate states are invisible to others. Durability — once committed, changes persist even after system failure (ensured via write-ahead logging). ACID is critical for financial systems, booking platforms, and any data-critical applications.
Primary Key: uniquely identifies each row; cannot be NULL; only one per table; automatically creates a clustered index. Foreign Key: references the Primary Key of another table; enforces referential integrity; can have NULL values; ensures no orphan records. Unique Key: ensures all values in a column are distinct; can contain one NULL; a table can have multiple unique keys; used for alternate candidate keys like email or phone number.
An Entity-Relationship (ER) Diagram is a visual representation of the database structure. Components: Entity (rectangle) — a real-world object (e.g., Student, Course). Attribute (ellipse) — property of an entity (name, roll_no). Relationship (diamond) — association between entities. Cardinality: One-to-One, One-to-Many, Many-to-Many. Weak Entity (double rectangle) — cannot exist without a strong entity. Derived attribute (dashed ellipse) — computed from other attributes (age from date_of_birth).
INNER JOIN: returns rows matching in both tables. LEFT JOIN: all rows from left table + matching from right (NULLs where no match). RIGHT JOIN: all rows from right + matching from left. FULL OUTER JOIN: all rows from both tables (NULLs where no match on either side). CROSS JOIN: Cartesian product — every combination of rows (n×m rows). SELF JOIN: join a table with itself (for hierarchical data like employee-manager). LEFT ANTI JOIN: rows in left not in right (using `WHERE right.id IS NULL`).
Indexing creates a data structure that speeds up data retrieval at the cost of additional storage and slower writes. Without index: O(n) scan. With index: O(log n) search. Types: Primary Index — on primary key (auto-created). Secondary/Non-clustered Index — on non-key columns. Clustered Index — data rows stored in index order; one per table. Composite Index — on multiple columns. When to avoid indexing: small tables, frequently updated columns, columns with low selectivity (e.g., boolean). Use EXPLAIN to analyze query plans.
A Transaction is a unit of work that must be completed entirely or not at all. COMMIT: permanently saves all changes made in the transaction. ROLLBACK: undoes all changes since the last COMMIT or SAVEPOINT. SAVEPOINT: creates a point within a transaction to partially rollback to. Example: in a bank transfer, debit and credit must both succeed; if credit fails, ROLLBACK undoes the debit. Transaction states: Active → Partially Committed → Committed (success) or Failed → Aborted (failure).
SQL commands are categorized as: DDL (Data Definition Language) — CREATE, ALTER, DROP, TRUNCATE (auto-commit). DML (Data Manipulation Language) — SELECT, INSERT, UPDATE, DELETE (can be rolled back). DCL (Data Control Language) — GRANT, REVOKE (access control). TCL (Transaction Control Language) — COMMIT, ROLLBACK, SAVEPOINT. DQL (Data Query Language) — SELECT. Key difference: TRUNCATE vs DELETE — TRUNCATE removes all rows faster, cannot be rolled back (DDL), and resets auto-increment; DELETE can be filtered with WHERE and rolled back.
A View is a virtual table based on a SQL query — it doesn't store data physically but provides a saved query interface. Advantages: Simplifies complex queries, provides security by restricting column access, offers data abstraction (changes to underlying tables don't affect application code using views), and enables code reuse. Materialized View: stores query results physically and refreshes periodically — better performance but uses storage. Views can be updatable (simple views on single tables without aggregations).
Stored Procedure: precompiled SQL block stored in the database, executed with CALL/EXEC. Can: return multiple result sets, use DML statements, have IN/OUT parameters, and commit/rollback transactions. Function: returns a single value (scalar) or table (table-valued), must return a value, can be used in SELECT statements, and cannot commit/rollback. Key difference: procedures are for actions/side effects; functions are for computations. Both improve performance through precompilation and reduce network traffic.
Concurrency problems:
1) Dirty Read — reading uncommitted data from another transaction.
2) Non-Repeatable Read — same query returns different results because another transaction modified data.
3) Phantom Read — new rows appear in a re-executed query due to another transaction's inserts.
4) Lost Update — two transactions update the same data, one overwrites the other. SQL Isolation Levels (ascending strictness): Read Uncommitted → Read Committed → Repeatable Read → Serializable. Each level prevents specific anomalies at the cost of performance.
Denormalization intentionally introduces redundancy into a normalized database to improve read performance. Achieved by merging tables, adding redundant columns, or pre-computing aggregates. When to use: read-heavy systems (data warehouses, analytics), when JOIN operations are too slow, reporting systems requiring frequent aggregations. Trade-offs: faster reads but slower writes, more storage, risk of data inconsistency. Example: storing order_total in Orders table instead of computing it from OrderItems every time.
Aggregate Functions perform calculations on multiple rows: COUNT() — number of rows. SUM() — total of a column. AVG() — average value. MAX()/MIN() — highest/lowest value. GROUP BY — groups rows for aggregate calculations. HAVING — filters groups (like WHERE but for aggregates). Example: `SELECT department, COUNT(*), AVG(salary) FROM employees GROUP BY department HAVING AVG(salary) > 50000`. ORDER: WHERE → GROUP BY → HAVING → SELECT → ORDER BY.
DELETE: DML command, removes rows one by one with optional WHERE clause, can be rolled back, triggers fire, slower, does not reset auto-increment. TRUNCATE: DDL command, removes all rows at once, typically cannot be rolled back (database-dependent), no triggers, resets auto-increment, much faster. DROP: DDL command, completely removes the table structure and all data permanently, cannot be rolled back, removes all associated indexes, constraints, and triggers. Use DELETE for selective removal, TRUNCATE to clear a table, DROP to eliminate the table entirely.
A Subquery (nested query) is a query inside another query. Types: Scalar subquery (returns one value), Row subquery, Column subquery, Table subquery. Correlated subquery: references outer query, executed once per outer row — slower. Non-correlated: independent of outer query, executed once. JOINs vs Subqueries: JOINs are generally faster as the optimizer can choose execution strategy; subqueries are more readable for certain cases. Use EXISTS instead of IN for correlated subqueries with large datasets — EXISTS short-circuits on first match.
CAP Theorem (Brewer's Theorem) states that a distributed system can guarantee at most 2 of 3 properties: Consistency (C) — all nodes see the same data simultaneously. Availability (A) — every request receives a response (not necessarily the most recent). Partition Tolerance (P) — system continues operating despite network partitions. Since network partitions are inevitable in distributed systems, the practical choice is CP (consistent + partition tolerant, e.g., HBase, Zookeeper) or AP (available + partition tolerant, e.g., Cassandra, DynamoDB, CouchDB).
Sharding (Horizontal Partitioning): splits a large database into smaller, more manageable pieces called shards, each stored on a separate server. Each shard holds a subset of rows. Strategies: Range-based, Hash-based, Directory-based. Benefits: improved performance, scalability. Challenges: cross-shard queries, rebalancing. Replication: copies database to multiple servers. Master-Slave: writes go to master, reads distributed among slaves. Master-Master: all nodes accept writes. Benefits: high availability, fault tolerance, read scalability. Challenges: replication lag, conflict resolution.
Window Functions perform calculations across a set of rows related to the current row without collapsing results (unlike GROUP BY). Syntax: `function() OVER (PARTITION BY col ORDER BY col ROWS BETWEEN ...)`. Common functions: ROW_NUMBER() — unique sequential numbers. RANK() — same rank for ties, gaps after. DENSE_RANK() — same rank for ties, no gaps. LAG()/LEAD() — access previous/next row. SUM()/AVG() OVER — running totals/moving averages. Example: `SELECT name, salary, RANK() OVER (PARTITION BY dept ORDER BY salary DESC) as rank FROM employees`.