Course Outline
- NoSQL & Definition
- Scaling & Horizontal Methods
- CAP Theorem & Properties
- Data Models & Types
- Key-Value & API
- Document & JSON
- Column-Based & Families
- Graph & Interconnectivity
- Tradeoffs & NoSQL Use
- Next-Gen & NewSQL
1. NoSQL & Definition
Key Concepts & Definitions
- NoSQL (Not Only SQL): A class of non-relational databases designed to handle large-scale, distributed data with flexible schemas, offering high scalability and performance.
- Schema-less Data: Data stored without a fixed schema, allowing dynamic and flexible data models such as JSON, XML, or key-value pairs.
- Horizontal Scalability: The ability to increase capacity by adding more servers or nodes, essential for handling big data and high traffic.
- CAP Theorem: A principle stating that a distributed system cannot simultaneously guarantee Consistency, Availability, and Partition Tolerance; because network partitions cannot be ruled out in practice, the real choice during a partition is between consistency and availability.
- Eventual Consistency: A consistency model where updates propagate asynchronously, and all nodes will eventually become consistent without strict real-time guarantees.
- Types of NoSQL DBMS: Categorized into key-value, document-based, column-based, and graph-based systems, each optimized for specific data models and use cases.
Essential Points
- NoSQL databases emerged as alternatives to traditional relational databases due to scaling challenges, especially in web and cloud applications.
- They prioritize horizontal scaling, fault tolerance, and performance over strict relational features like joins and referential integrity.
- Major NoSQL systems are inspired by foundational papers such as Google's Bigtable and Amazon's Dynamo (which later inspired DynamoDB).
- The CAP theorem influences design choices: systems often sacrifice strong consistency for availability and partition tolerance, leading to models like eventual consistency.
- Different NoSQL types serve different needs:
  - Key-Value: Fast, scalable, simple data access (e.g., DynamoDB, Redis).
  - Document-based: Flexible, semi-structured data (e.g., MongoDB, CouchDB).
  - Column-based: Handle large, sparse datasets efficiently (e.g., Cassandra, HBase).
  - Graph-based: Model complex relationships and interconnectivity (e.g., Neo4j).
- Hybrid solutions and next-generation systems like NewSQL aim to combine scalability with ACID guarantees.
💡 Key Takeaway
NoSQL databases provide scalable, flexible alternatives to traditional RDBMS, optimized for large-scale, distributed environments, but often sacrifice some relational features and strong consistency guarantees. They are essential for modern web and cloud applications requiring high performance and elasticity.
2. Scaling & Horizontal Methods
Key Concepts & Definitions
- Vertical Scaling (Scaling Up): Increasing the capacity of a single machine (more CPU, RAM, storage). Limited by hardware constraints; becomes impractical for very large datasets.
- Horizontal Scaling (Scaling Out): Adding more machines or nodes to distribute the workload and data, enhancing capacity and fault tolerance.
- Master/Slave Replication: A scaling approach where all writes go to the master node, and reads are distributed among replicated slave nodes. Risks include data inconsistency and propagation delays.
- Sharding (Partitioning): Dividing data into smaller, manageable pieces (shards) stored across multiple servers. Improves scalability for both reads and writes but complicates relationships and joins.
- Multi-Master Replication: Multiple nodes accept writes independently, often used with de-normalized data to improve write performance, but can introduce conflicts.
- In-memory Databases: Store data entirely in RAM for ultra-fast access, suitable for real-time applications but limited by memory capacity.
- De-normalization: Duplicating data to reduce the need for joins across shards, trading storage efficiency for query speed.
- Partition Tolerance: The system's ability to continue operating despite network partitions or failures, a core aspect of distributed systems.
Essential Points
- Vertical scaling has limits; horizontal scaling is preferred for large datasets.
- Master/slave replication simplifies read scaling, but replication lag means slaves can serve stale data until updates propagate.
- Sharding enhances scalability but sacrifices referential integrity and complicates cross-shard queries.
- Multi-master replication improves write performance but requires careful conflict resolution; in-memory databases are very fast but bounded by available RAM.
- De-normalization is common in NoSQL systems to optimize for distributed environments.
- The CAP theorem states that in the presence of network partitions, a system must choose between consistency and availability.
- Horizontal scaling techniques are fundamental to NoSQL databases, enabling handling of massive, distributed data loads efficiently.
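As a rough illustration of the sharding idea above, the sketch below routes each key to one of several shards by hashing it. The names `shard_for` and `ShardedStore` are hypothetical, and each shard is just an in-process dictionary standing in for a separate server; real systems add replication, rebalancing, and failure handling.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash (assumed routing rule)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy sharded store: each shard is a dict standing in for one server."""

    def __init__(self, num_shards: int) -> None:
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key, value):
        # Only the shard that owns the key is touched by a write.
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        # A read goes to the same shard the write went to.
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(num_shards=4)
for user_id in ("alice", "bob", "carol", "dave"):
    store.put(user_id, {"name": user_id.title()})
```

Because every lookup touches exactly one shard, reads and writes scale with the number of shards, while a query spanning many keys has to visit several shards. With plain modulo hashing, changing the shard count remaps most keys, which is why production systems often use consistent hashing instead.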
💡 Key Takeaway
Horizontal scaling methods, such as sharding, replication, and in-memory architectures, are essential for managing large-scale, distributed data systems, enabling high availability and performance at the expense of some relational features and strict consistency.
3. CAP Theorem & Properties
Key Concepts & Definitions
- Consistency: All nodes see the same data at the same time; ensures data uniformity across the system.
- Availability: The system guarantees that every request receives a non-error response, without guarantee that it contains the most recent data.
- Partition Tolerance: The system continues to operate correctly despite network partitions or failures that prevent communication between nodes.
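- CAP Theorem (Brewer's Theorem): In a distributed system, it is impossible to simultaneously guarantee all three properties: Consistency, Availability, and Partition Tolerance. A system can reliably provide only two of these at any time; because network partitions cannot be ruled out in practice, the real choice during a partition is between consistency and availability.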
- Strong Consistency: Guarantees that once a write completes, all subsequent reads will reflect that write across the entire system.
- Eventual Consistency: Guarantees that, in the absence of new updates, all replicas will eventually become consistent, allowing temporary divergence.
Essential Points
- Distributed systems face a fundamental tradeoff: they cannot simultaneously ensure all three CAP properties during network partitions.
- During a partition, systems must choose between maintaining consistency (C) or availability (A).
- Traditional RDBMS prioritize Consistency: a single-node deployment avoids partitions altogether, and a replicated one typically gives up Availability during network failures rather than serve inconsistent data.
- NoSQL systems often favor Availability and Partition Tolerance, accepting eventual consistency.
- Eventual consistency is common in cloud and NoSQL systems, allowing for high availability at the cost of temporary data divergence.
- The choice among the three properties depends on application requirements; for example, financial systems prioritize consistency, while social media platforms may prioritize availability.
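The behaviour of eventual consistency can be sketched with a small simulation. This is a toy model under stated assumptions: the `Replica` class and `anti_entropy` function are hypothetical names, timestamps are supplied by the caller, and conflicts are resolved with last-write-wins, which is only one of several possible strategies.

```python
class Replica:
    """One replica holding key -> (timestamp, value)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value, timestamp):
        # Last-write-wins (assumed policy): keep the entry with the highest timestamp.
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def anti_entropy(replicas):
    """One synchronization round: offer every known entry to every replica."""
    for source in replicas:
        for key, (timestamp, value) in list(source.data.items()):
            for target in replicas:
                target.write(key, value, timestamp)


a, b = Replica("a"), Replica("b")
a.write("cart", ["book"], timestamp=1)  # accepted by replica a only
stale = b.read("cart")                  # replica b has not seen the write yet
anti_entropy([a, b])                    # asynchronous propagation, simulated
converged = b.read("cart")              # both replicas now agree
```

Between the write and the synchronization round, a read from replica `b` returns outdated data; once updates stop and propagation completes, both replicas hold the same value. That temporary divergence is the price paid for keeping every replica available for reads and writes.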
💡 Key Takeaway
The CAP Theorem states that in a distributed system, it is impossible to simultaneously guarantee Consistency, Availability, and Partition Tolerance; system designers must choose which two to prioritize based on application needs.
4. Data Models & Types
Key Concepts & Definitions
- Data Model: A conceptual framework that defines how data is structured, stored, and manipulated within a database system. It specifies the relationships, constraints, and operations applicable to the data.
- Relational Model: A data model based on tables (relations), where data is organized into rows and columns, supporting operations like joins and referential integrity.
- NoSQL Data Model: A non-relational approach that offers flexible, schema-less data storage, often optimized for horizontal scaling, fault tolerance, and high performance. Types include key-value, document, column-family, and graph models.
- CAP Theorem: A principle stating that a distributed system cannot simultaneously guarantee Consistency, Availability, and Partition Tolerance; because network partitions cannot be ruled out in practice, the real choice during a partition is between consistency and availability.
- Consistency Types:
  - Strong Consistency: All nodes see the same data at the same time, supporting ACID transactions.
  - Eventual Consistency: Data updates propagate asynchronously, and all nodes will eventually become consistent.
- Types of NoSQL DBMS:
  - Key-Value: Stores data as pairs of keys and values.
  - Document-Based: Stores semi-structured documents like JSON or XML.
  - Column-Based: Organizes data into column families for efficient retrieval.
  - Graph-Based: Models data as nodes and edges to represent relationships.
Essential Points
- Traditional relational databases excel at complex queries and ACID guarantees but face scalability issues with large datasets, especially in web and cloud environments.
- NoSQL databases emerged to address scalability and flexibility challenges, sacrificing some relational features like joins and referential integrity.
- Horizontal scaling (scaling out) via sharding, replication, and multi-master setups is fundamental for handling big data in NoSQL systems.
- The CAP theorem guides the design choices in distributed systems, forcing a tradeoff between consistency and availability during network partitions.
- NoSQL models are categorized into key-value, document, column-family, and graph types, each suited for specific data and application needs.
- Eventual consistency is common in NoSQL systems, providing high availability and partition tolerance at the cost of immediate data consistency.
- Hybrid approaches and NewSQL systems aim to combine the scalability of NoSQL with the transactional guarantees of traditional RDBMS.
💡 Key Takeaway
NoSQL data models provide flexible, scalable alternatives to relational databases, enabling efficient handling of large, distributed, and semi-structured data, but require careful consideration of consistency and integrity tradeoffs dictated by the CAP theorem.
5. Key-Value & API
Key Concepts & Definitions
- Key-Value Store: A type of NoSQL database that stores data as a collection of key-value pairs, where each key is unique and directly maps to a value, enabling rapid access.
- API (Application Programming Interface): A set of protocols and tools that allow different software components to communicate; in key-value stores, APIs typically include operations like get, put, delete, and execute.
- Get Operation: Retrieves the value associated with a specific key from the database.
- Put Operation: Creates or updates the value linked to a specific key.
- Delete Operation: Removes a key and its associated value from the database.
- Fault Tolerance: The ability of a system to continue functioning correctly even when some components fail, often achieved through data replication across nodes.
Essential Points
- Design Focus: Key-value stores are optimized for scalability, speed, and simplicity, handling massive loads with minimal latency.
- API Simplicity: Operations are straightforward (get, put, delete), making them easy to implement and use.
- Data Model Limitations: Cannot easily model complex relationships or nested data structures; primarily suitable for simple lookups.
- Examples: DynamoDB (Amazon), Redis, Voldemort, Scalaris.
- Advantages:
  - Very fast read/write performance.
  - Highly scalable through horizontal distribution.
  - Fault-tolerant via data replication.
  - Simple data model reduces complexity.
- Disadvantages:
  - Limited to simple key-value data; not suitable for complex queries.
  - No inherent support for relationships, joins, or multi-operation transactions.
  - Usually eventual consistency, which may not suit all applications.
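The get/put/delete API described above can be sketched in a few lines. This is a minimal single-process illustration, not how any particular product is implemented; the class name `KeyValueStore` and the key `session:42` are hypothetical.

```python
class KeyValueStore:
    """Minimal in-memory key-value store exposing get, put, and delete."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Creates the key if it is new, otherwise overwrites the old value.
        self._data[key] = value

    def get(self, key):
        # Returns None when the key is absent.
        return self._data.get(key)

    def delete(self, key):
        # Returns True if the key existed and was removed, False otherwise.
        if key in self._data:
            del self._data[key]
            return True
        return False


cache = KeyValueStore()
cache.put("session:42", {"user": "alice"})
```

The store treats the value as opaque: it can return or replace the whole value for a key, but it cannot answer a question such as "which sessions belong to alice" without scanning every key, which is the data-model limitation noted above.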
💡 Key Takeaway
Key-value databases provide a highly scalable and efficient storage solution for simple data retrieval tasks, but their limited data modeling capabilities make them less suitable for applications requiring complex relationships or transactional integrity.
6. Document & JSON
Key Concepts & Definitions
- NoSQL: A class of non-relational databases designed for scalability, flexibility, and high performance, often used for large-scale, distributed data storage.
- JSON (JavaScript Object Notation): A lightweight, text-based data interchange format that represents data as key-value pairs, supporting nested objects, arrays, and various data types.
- Document-based Database: A NoSQL database that stores data as semi-structured documents (e.g., JSON, XML), allowing complex objects with nested structures.
- Schema-less: A characteristic of NoSQL databases where data does not require a fixed schema, enabling flexible and dynamic data models.
- Object-structured Documents: Data stored as JSON objects with properties and nested sub-objects, facilitating complex data modeling.
- Primary Key: A unique identifier for each document within a collection, used for efficient retrieval.
Essential Points
- Document Model: Stores data as self-contained JSON-like documents, supporting complex, nested structures, arrays, and various data types, enabling flexible data representation.
- Querying: Operations typically include finding documents by key or property, updating, deleting, and manipulating nested data structures.
- Advantages: Supports complex objects, flexible schema, easy to evolve data models, and natural fit for web applications.
- Popular Examples: MongoDB (which uses BSON, a binary JSON format) and CouchDB; both store data as JSON-like documents with indexing and querying capabilities.
- JSON Format: Key-value pairs with support for nested objects, arrays, strings, numbers, booleans, and nulls, making it suitable for semi-structured data.
- Use Cases: Content management, user profiles, product catalogs, and any application requiring flexible, hierarchical data storage.
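To make the document model concrete, the sketch below stores a few JSON-style documents with different shapes in one collection and queries them by property. The collection, field names, and the helpers `find` and `get_path` are hypothetical; real document databases provide their own query languages and indexes.

```python
import json

# Documents in one collection need not share the same fields (schema-less).
products = [
    {"_id": "p1", "category": "laptop", "name": "Laptop A",
     "specs": {"ram_gb": 16, "ports": ["usb-c", "hdmi"]}},
    {"_id": "p2", "category": "book", "name": "Book B", "author": "Author C"},
    {"_id": "p3", "category": "laptop", "name": "Laptop D",
     "specs": {"ram_gb": 8}},
]


def find(collection, **criteria):
    """Return documents whose top-level fields match all given criteria."""
    return [doc for doc in collection
            if all(doc.get(field) == value for field, value in criteria.items())]


def get_path(doc, path):
    """Follow a dotted path such as 'specs.ram_gb' into nested objects."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


laptops = find(products, category="laptop")
ram_of_first = get_path(products[0], "specs.ram_gb")
as_json = json.dumps(products[0])  # each document serializes directly to JSON
```

The book document has no `specs` field and the laptops have no `author`, yet all three live in the same collection. Adding a new field to future documents requires no schema migration, which is what makes the model easy to evolve.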
💡 Key Takeaway
Document and JSON-based NoSQL databases provide a flexible, schema-less approach to storing complex, nested data structures, making them ideal for web-scale applications that demand agility and scalability.
7. Column-Based & Families
Key Concepts & Definitions
- Column-Based (Wide-Column) Database: A type of NoSQL database that groups each row's columns into column families and stores each family together, optimized for read/write operations on specific columns.
- Column Family: A collection of columns stored together, similar to a table in RDBMS but with flexible schema; each column family contains related data grouped for efficient access.
- Key-Value Pair: The fundamental data unit where a unique key maps to a value, used in column-based systems to facilitate rapid retrieval.
- Partitioning: The process of dividing data across multiple nodes to enable horizontal scalability; in column-based DBs, partitioning can be done by row key, column key, or timestamp.
- Data Locality: The principle that related data (e.g., columns within a family) are stored together to optimize query performance.
Essential Points
- Column-based databases handle semi-structured data efficiently, storing related columns together in column families.
- They support flexible schemas, allowing variable numbers of columns per row, which enhances scalability and performance.
- Data is indexed by row key, column key, and timestamp, enabling fast retrieval of specific data subsets.
- Unlike relational databases, they do not require fixed schemas and are optimized for large-scale workloads with high write and read throughput.
- Queries typically read columns from a single column family at a time, which improves data locality and reduces disk I/O.
- Examples include Cassandra, HBase, and BigTable.
- They are particularly suited for applications requiring high throughput and scalability, such as analytics and real-time data processing.
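The indexing scheme described above (row key, then column family, then column, then timestamp) can be sketched as nested dictionaries. This is a simplified in-memory model; the class `ColumnFamilyTable`, the family names, and the row keys are hypothetical and do not mirror the API of Cassandra, HBase, or Bigtable.

```python
from collections import defaultdict


class ColumnFamilyTable:
    """Toy wide-column table: row_key -> family -> column -> {timestamp: value}."""

    def __init__(self, families):
        self.families = set(families)
        self.rows = defaultdict(lambda: {f: {} for f in self.families})

    def put(self, row_key, family, column, value, timestamp):
        # Families are declared up front; columns inside them are free-form.
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][family].setdefault(column, {})[timestamp] = value

    def get(self, row_key, family, column, timestamp=None):
        versions = self.rows.get(row_key, {}).get(family, {}).get(column, {})
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)  # default to the newest version
        return versions.get(timestamp)

    def get_family(self, row_key, family):
        """Latest value of every column in one family for one row."""
        columns = self.rows.get(row_key, {}).get(family, {})
        return {name: versions[max(versions)] for name, versions in columns.items()}


users = ColumnFamilyTable(families=["profile", "activity"])
users.put("user:1", "profile", "name", "Alice", timestamp=1)
users.put("user:1", "profile", "email", "alice@example.com", timestamp=1)
users.put("user:1", "profile", "email", "alice@example.org", timestamp=2)
users.put("user:2", "activity", "last_login", "2024-01-01", timestamp=1)
```

Row `user:1` has two profile columns and no activity, while `user:2` has only an activity column: rows are sparse and carry only the columns they need. Older versions stay addressable by timestamp, so `users.get("user:1", "profile", "email", timestamp=1)` still returns the first address.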
💡 Key Takeaway
Column-based families organize semi-structured data into flexible, efficient units that enable high scalability and performance, making them ideal for large-scale, data-intensive applications where schema flexibility and quick access are critical.
8. Graph & Interconnectivity
Key Concepts & Definitions
- Graph Data Model: A structure consisting of nodes (entities) and edges (relationships) that connect nodes, often with properties attached to both nodes and edges. Used to represent complex interconnections and relationships.
- Nodes: The entities or objects within a graph, which can have properties (attributes) such as IDs or labels.
- Edges: The connections between nodes, representing relationships or interactions. Edges can have labels or roles and may also carry properties.
- Property Graph: A type of graph model where both nodes and edges can have key-value properties, enabling detailed and flexible data representation.
- Graph Query Languages: Specialized languages designed to traverse and query graph structures, such as Cypher (Neo4j), enabling pattern matching and path exploration.
- Interconnectivity: The degree and nature of relationships among data points, crucial for modeling social networks, recommendation systems, and complex networks.
Essential Points
- Graph databases excel at modeling highly interconnected data, capturing complex relationships naturally.
- Nodes and edges can have properties, allowing detailed descriptions and attributes within the graph.
- Graph models are scalable to the complexity of data, supporting recursive and multi-step relationship queries.
- Common graph systems include the databases Neo4j and FlockDB, and Pregel, Google's framework for large-scale graph processing; each is optimized for different types of graph workloads.
- Querying in graph databases often involves pattern matching, path finding, and traversal operations, which are more efficient than join-heavy relational queries for interconnected data.
- Graph interconnectivity enables applications like social networks, recommendation engines, fraud detection, and network topology analysis.
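A multi-step relationship query of the kind described above can be sketched with a small property graph and a breadth-first traversal. The `PropertyGraph` class, the `FOLLOWS` label, and the sample people are hypothetical; a graph database would express the same query in its own language, such as Cypher.

```python
from collections import deque


class PropertyGraph:
    """Toy property graph: nodes and edges both carry key-value properties."""

    def __init__(self):
        self.nodes = {}  # node_id -> properties
        self.edges = {}  # node_id -> list of (label, target_id, properties)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props
        self.edges.setdefault(node_id, [])

    def add_edge(self, source, label, target, **props):
        self.edges[source].append((label, target, props))

    def neighbors(self, node_id, label):
        return [target for (edge_label, target, _) in self.edges.get(node_id, [])
                if edge_label == label]

    def within_hops(self, start, label, max_hops):
        """Breadth-first traversal: nodes reachable in 1..max_hops steps."""
        seen = {start}
        frontier = deque([(start, 0)])
        found = []
        while frontier:
            node, depth = frontier.popleft()
            if depth == max_hops:
                continue  # do not expand beyond the hop limit
            for nxt in self.neighbors(node, label):
                if nxt not in seen:
                    seen.add(nxt)
                    found.append(nxt)
                    frontier.append((nxt, depth + 1))
        return found


g = PropertyGraph()
for person in ("alice", "bob", "carol", "dave"):
    g.add_node(person, name=person.title())
g.add_edge("alice", "FOLLOWS", "bob", since=2020)
g.add_edge("bob", "FOLLOWS", "carol", since=2021)
g.add_edge("carol", "FOLLOWS", "dave", since=2022)

two_hops = g.within_hops("alice", "FOLLOWS", max_hops=2)
```

The traversal follows edges directly from node to node, so its cost depends on the size of the neighbourhood explored rather than on the total number of rows, whereas the relational equivalent would need one self-join per hop.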
💡 Key Takeaway
Graph and interconnectivity models provide a powerful framework for representing and querying complex, highly linked data structures, enabling insights and operations that are cumbersome with traditional relational databases.
9. Tradeoffs & NoSQL Use
Key Concepts & Definitions
- NoSQL (Not Only SQL): A class of non-relational databases designed for horizontal scalability, flexible data models, and high performance, often sacrificing some traditional relational features.
- CAP Theorem: States that a distributed system cannot simultaneously guarantee Consistency, Availability, and Partition Tolerance; because network partitions cannot be ruled out in practice, the real choice during a partition is between consistency and availability.
- Consistency: Ensuring all nodes see the same data at the same time; can be strong (ACID) or weak (eventual).
- Availability: The system's ability to always accept read/write requests, even during failures.
- Partition Tolerance: The system's ability to continue operating despite network partitions or failures.
- Types of NoSQL DBMS: Key-value, Document-based, Column-based, Graph-based, each optimized for specific data models and use cases.
Essential Points
- Scaling Challenges: Traditional RDBMS face limitations with vertical scaling; horizontal scaling (sharding, master/slave replication) is essential for handling large datasets.
- NoSQL Advantages: Non-relational, schema-less, fault-tolerant, and designed for massive write/read throughput, making them suitable for web-scale applications.
- Tradeoffs: NoSQL systems often relax ACID guarantees (favoring eventual consistency) to achieve high availability and partition tolerance, as per CAP theorem.
- Use Cases: NoSQL databases excel in scenarios with large-scale, distributed data, such as social media, real-time analytics, and cloud applications.
- Limitations: Lack of support for complex relational features like joins and referential integrity; less suitable for applications requiring strict transactional consistency.
- Hybrid & Next-Gen Solutions: Combining RDBMS and NoSQL, or adopting NewSQL databases that aim to provide both scalability and ACID guarantees.
💡 Key Takeaway
NoSQL databases offer scalable, flexible solutions for large-scale, distributed data applications by trading off some traditional relational features, guided by the principles of the CAP theorem; understanding these tradeoffs is essential for selecting the appropriate database system for specific needs.
10. Next-Gen & NewSQL
Key Concepts & Definitions
- NoSQL (Not Only SQL): A class of non-relational databases designed for horizontal scalability, fault tolerance, and flexible data models, often sacrificing some traditional relational features like joins and strict ACID compliance.
- CAP Theorem: States that a distributed system cannot simultaneously guarantee Consistency, Availability, and Partition Tolerance; because network partitions cannot be ruled out in practice, the real choice during a partition is between consistency and availability.
- Eventual Consistency: A consistency model where, in the absence of new updates, all replicas of data will eventually become consistent, favoring availability over immediate consistency.
- Next-Gen Databases (NewSQL): Modern relational databases that aim to combine the scalability of NoSQL systems with the ACID guarantees of traditional RDBMS, supporting SQL and high performance on distributed architectures.
- Types of NoSQL DBMS: Categorized into key-value, document-based, column-based, and graph-based systems, each optimized for specific data models and query patterns.
- Sharding (Partitioning): Distributing data across multiple machines to improve scalability. Sharding usually refers to horizontal partitioning (splitting by rows); data can also be partitioned vertically (splitting by columns).
Essential Points
- Traditional RDBMS face scaling challenges, leading to the development of NoSQL systems that excel in handling large, distributed datasets with high write/read throughput.
- NoSQL databases prioritize horizontal scalability, fault tolerance, and flexible schemas, often at the expense of full relational features and strict ACID compliance.
- The CAP theorem influences NoSQL design choices, resulting in systems that often favor availability and partition tolerance, implementing eventual consistency models.
- Major NoSQL types include key-value stores (e.g., DynamoDB), document stores (e.g., MongoDB), column-family stores (e.g., Cassandra), and graph databases (e.g., Neo4j).
- Next-Gen or NewSQL databases aim to provide the best of both worlds: scalable distributed architecture with full ACID transaction support, primarily using SQL.
- Hybrid approaches and system designs are common, combining RDBMS and NoSQL features to meet enterprise and cloud application requirements.
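The ACID guarantee that NewSQL systems aim to keep can be illustrated with an atomic transfer. The sketch uses Python's built-in `sqlite3` module purely as a stand-in: SQLite is a single-node embedded database, not a NewSQL system, and the `accounts` table and `transfer` function are hypothetical. It shows only the transactional behaviour, not the distributed scaling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, source, target, amount):
    """Move funds atomically: both updates commit, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, target))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, source))
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit was undone too.
        return False


def balances(conn):
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))


ok = transfer(conn, "alice", "bob", 30)         # succeeds
rejected = transfer(conn, "alice", "bob", 500)  # would overdraw alice
```

In the rejected transfer the credit to `bob` has already been applied when the debit fails, yet the rollback removes it, so no money is created. Providing this all-or-nothing behaviour across many machines, rather than on one node as here, is the hard problem NewSQL systems set out to solve.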
💡 Key Takeaway
Next-Gen and NewSQL databases are evolving solutions that strive to deliver scalable, distributed data management with the transactional integrity and query richness of traditional relational systems, addressing the limitations of earlier NoSQL architectures.
Synthesis Tables
| Aspect | Relational Databases (RDBMS) | NoSQL Databases |
|---|---|---|
| Data Model | Tables with fixed schemas, relations | Flexible models: key-value, document, column, graph |
| Scalability | Vertical (scale-up) | Horizontal (scale-out) |
| Schema | Fixed, predefined | Schema-less, dynamic |
| Consistency | Strong (ACID compliance) | Eventual or tunable consistency |
| Typical Use Cases | Complex queries, transactions | Large-scale, distributed, flexible data storage |
| Data Relationships | Supports joins, referential integrity | Limited or no joins, denormalized data |

| Aspect | Key-Value/Document/Column/Graph Systems | Tradeoffs & Use Cases |
|---|---|---|
| Data Access | Simple key-based, document queries, graph traversal | High performance, scalability, flexible schemas |
| Consistency | Often eventual, tunable | Sacrifice some consistency for availability |
| Scalability | Designed for horizontal scaling | Suitable for big data, real-time apps |
| Typical Systems | Redis, MongoDB, Cassandra, Neo4j | Use when high throughput, flexible schema needed |
⚠️ Common Pitfalls & Confusions
- Confusing CAP properties: assuming all systems guarantee all three simultaneously.
- Believing relational databases are always better for all applications; neglecting scalability needs.
- Misunderstanding eventual consistency as data inconsistency; it guarantees convergence over time.
- Overlooking the importance of sharding and its impact on data integrity and query complexity.
- Assuming NoSQL systems lack consistency; many offer tunable consistency models.
- Confusing vertical scaling with horizontal scaling; the latter is essential for big data.
- Ignoring tradeoffs between performance, consistency, and availability when choosing a database type.
✅ Exam Checklist
- Define NoSQL and explain its key features compared to RDBMS.
- Describe the importance of horizontal scaling and methods like sharding and replication.
- State the CAP theorem and discuss how it influences distributed database design.
- Differentiate between data models: relational, key-value, document, column-family, graph.
- Explain key-value and document databases, including JSON data handling.
- Describe column-family databases and the concept of column families.
- Summarize graph databases and their use in modeling interconnectivity.
- Discuss the tradeoffs involved in NoSQL systems and their typical use cases.
- Understand the concept of next-generation databases and NewSQL solutions.
- Explain the differences between vertical and horizontal scaling.
- Describe the role of de-normalization in NoSQL systems.
- Identify scenarios where eventual consistency is acceptable versus strict consistency needs.
- List common NoSQL databases and their data models.
- Recognize the impact of the CAP theorem on database choice.
- Understand the importance of data models in selecting the appropriate database system.
- Be aware of the limitations of relational databases in distributed environments.
- Describe how graph databases model complex relationships.
- Summarize the advantages of in-memory databases for real-time applications.
- Recognize the concept of NewSQL as an attempt to combine SQL features with NoSQL scalability.
- Know the main types of NoSQL databases and their typical use cases.