Revision Sheet: Understanding NoSQL: Scalability and Data Models

Course Outline

NOSQL & Definition
Scaling & Horizontal Methods
CAP Theorem & Properties
Data Models & Types
Key-Value & API
Document & JSON
Column-Based & Families
Graph & Interconnectivity
Tradeoffs & NoSQL Use
Next-Gen & NewSQL

1. NOSQL & Definition

Key Concepts & Definitions

NOSQL (Not Only SQL): A class of non-relational databases designed to handle large-scale, distributed data with flexible schemas, offering high scalability and performance.
Schema-less Data: Data stored without a fixed schema, allowing dynamic and flexible data models such as JSON, XML, or key-value pairs.
Horizontal Scalability: The ability to increase capacity by adding more servers or nodes, essential for handling big data and high traffic.
CAP Theorem: A principle stating that a distributed system cannot simultaneously guarantee Consistency, Availability, and Partition Tolerance. Because network partitions cannot be ruled out in practice, the real choice is between consistency and availability while a partition lasts.
Eventual Consistency: A consistency model where updates propagate asynchronously, and all nodes will eventually become consistent without strict real-time guarantees.
Types of NOSQL DBMS: Categorized into key-value, document-based, column-based, and graph-based systems, each optimized for specific data models and use cases.

Essential Points

NOSQL databases emerged as alternatives to traditional relational databases due to scaling challenges, especially in web and cloud applications.
They prioritize horizontal scaling, fault tolerance, and performance over strict relational features like joins and referential integrity.
Major NOSQL systems are inspired by foundational papers such as Google’s Bigtable and Amazon’s Dynamo (the design that later influenced the DynamoDB service).
The CAP theorem influences design choices: systems often sacrifice strong consistency for availability and partition tolerance, leading to models like eventual consistency.
Different NOSQL types serve different needs:
- Key-Value: Fast, scalable, simple data access (e.g., DynamoDB, Redis).
- Document-based: Flexible, semi-structured data (e.g., MongoDB, CouchDB).
- Column-based: Handle large, sparse datasets efficiently (e.g., Cassandra, HBase).
- Graph-based: Model complex relationships and interconnectivity (e.g., Neo4j).
Hybrid solutions and next-generation systems like NewSQL aim to combine scalability with ACID guarantees.

Key Takeaway

NOSQL databases provide scalable, flexible alternatives to traditional RDBMS, optimized for large-scale, distributed environments, but often sacrifice some relational features and strong consistency guarantees. They are essential for modern web and cloud applications requiring high performance and elasticity.

2. Scaling & Horizontal Methods

Key Concepts & Definitions

Vertical Scaling (Scaling Up): Increasing the capacity of a single machine (more CPU, RAM, storage). Limited by hardware constraints; becomes impractical for very large datasets.
Horizontal Scaling (Scaling Out): Adding more machines or nodes to distribute the workload and data, enhancing capacity and fault tolerance.
Master/Slave Replication: A scaling approach where all writes go to the master node, and reads are distributed among replicated slave nodes. Risks include data inconsistency and propagation delays.
Sharding (Partitioning): Dividing data into smaller, manageable pieces (shards) stored across multiple servers. Improves scalability for both reads and writes but complicates relationships and joins.
Multi-Master Replication: Multiple nodes accept writes independently, often used with de-normalized data to improve write performance, but can introduce conflicts.
In-memory Databases: Store data entirely in RAM for ultra-fast access, suitable for real-time applications but limited by memory capacity.
De-normalization: Duplicating data to reduce the need for joins across shards, trading storage efficiency for query speed.
Partition Tolerance: The system's ability to continue operating despite network partitions or failures, a core aspect of distributed systems.
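
The sharding definition above can be illustrated with a minimal sketch. It assumes a simple hash-modulo placement scheme; the names `shard_for` and `ShardedStore` are hypothetical and do not correspond to any particular database's API.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash (hypothetical scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy store that spreads keys over several in-memory dicts, one per shard."""

    def __init__(self, num_shards: int) -> None:
        self.shards = [{} for _ in range(num_shards)]

    def put(self, key: str, value) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str):
        return self.shards[shard_for(key, len(self.shards))].get(key)


# Each key lands on exactly one shard; a lookup only touches that shard.
store = ShardedStore(num_shards=4)
for user_id in ("alice", "bob", "carol", "dave"):
    store.put(user_id, {"name": user_id.title()})
```

Note that with plain modulo placement, changing the number of shards moves most keys; real systems often use consistent hashing or range partitioning to limit that reshuffling.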

Essential Points

Vertical scaling has limits; horizontal scaling is preferred for large datasets.
Master/slave replication simplifies read scaling, but slaves can serve stale data until replication from the master catches up.
Sharding enhances scalability but sacrifices referential integrity and complicates cross-shard queries.
Multi-master replication improves write performance but requires careful conflict resolution; in-memory storage gives very fast access but is bounded by RAM capacity.
De-normalization is common in NoSQL systems to optimize for distributed environments.
The CAP theorem states that in the presence of network partitions, a system must choose between consistency and availability.
Horizontal scaling techniques are fundamental to NoSQL databases, enabling handling of massive, distributed data loads efficiently.
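
The master/slave point above can be made concrete with a toy simulation. This is a sketch under simplifying assumptions: replication is triggered by an explicit `replicate()` call rather than a background process, and all class and method names are hypothetical.

```python
class Node:
    """A single server holding a copy of the data."""

    def __init__(self) -> None:
        self.data = {}


class MasterSlaveCluster:
    """Writes go to the master; reads are spread over the slaves."""

    def __init__(self, num_slaves: int) -> None:
        self.master = Node()
        self.slaves = [Node() for _ in range(num_slaves)]
        self.pending = []  # writes not yet copied to the slaves
        self._next = 0     # round-robin counter for reads

    def write(self, key, value) -> None:
        self.master.data[key] = value
        self.pending.append((key, value))

    def read(self, key):
        slave = self.slaves[self._next % len(self.slaves)]
        self._next += 1
        return slave.data.get(key)

    def replicate(self) -> None:
        """Propagate pending writes to every slave (assumed to run later)."""
        for key, value in self.pending:
            for slave in self.slaves:
                slave.data[key] = value
        self.pending.clear()


cluster = MasterSlaveCluster(num_slaves=2)
cluster.write("cart:42", ["book"])
stale = cluster.read("cart:42")   # slave has not seen the write yet
cluster.replicate()
fresh = cluster.read("cart:42")   # slave now returns the written value
```

The gap between `stale` and `fresh` is the propagation delay the definitions refer to.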

Key Takeaway

Horizontal scaling methods—such as sharding, replication, and in-memory architectures—are essential for managing large-scale, distributed data systems, enabling high availability and performance at the expense of some relational features and strict consistency.

3. CAP Theorem & Properties

Key Concepts & Definitions

Consistency: All nodes see the same data at the same time; ensures data uniformity across the system.
Availability: The system guarantees that every request receives a non-error response, without guarantee that it contains the most recent data.
Partition Tolerance: The system continues to operate correctly despite network partitions or failures that prevent communication between nodes.
CAP Theorem (Brewer’s Theorem): In a distributed system, it is impossible to simultaneously guarantee all three properties—Consistency, Availability, and Partition Tolerance. A system can only reliably provide two of these at any time.
Strong Consistency: Guarantees that once a write completes, all subsequent reads will reflect that write across the entire system.
Eventual Consistency: Guarantees that, in the absence of new updates, all replicas will eventually become consistent, allowing temporary divergence.
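
To show what "temporary divergence" followed by convergence can look like, here is a minimal sketch that uses last-write-wins merging. Last-write-wins is only one possible reconciliation strategy, and the integer timestamps and the `Replica` class are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    # key -> (timestamp, value); timestamps are assumed logical clock values
    data: dict = field(default_factory=dict)

    def write(self, key, value, timestamp) -> None:
        current = self.data.get(key)
        # Keep the incoming value only if it is newer than what we hold.
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def merge_from(self, other: "Replica") -> None:
        """Anti-entropy step: pull the other replica's entries, newest wins."""
        for key, (timestamp, value) in other.data.items():
            self.write(key, value, timestamp)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


a, b = Replica(), Replica()
a.write("status", "online", timestamp=1)
b.write("status", "away", timestamp=2)
diverged = (a.read("status"), b.read("status"))   # replicas disagree

a.merge_from(b)
b.merge_from(a)
converged = (a.read("status"), b.read("status"))  # replicas agree again
```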

Essential Points

Distributed systems face a fundamental tradeoff: they cannot simultaneously ensure all three CAP properties during network partitions.
During a partition, systems must choose between maintaining consistency (C) or availability (A).
Traditional RDBMS prioritize consistency; when distributed across nodes, they typically preserve it by sacrificing availability during network failures.
NoSQL systems often favor Availability and Partition Tolerance, accepting eventual consistency.
Eventual consistency is common in cloud and NoSQL systems, allowing for high availability at the cost of temporary data divergence.
The choice among the three properties depends on application requirements; for example, financial systems prioritize consistency, while social media platforms may prioritize availability.
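
The consistency-versus-availability choice during a partition can be sketched with two toy replicas. This is an illustrative model only: the `"CP"` and `"AP"` mode strings, the `ReplicaNode` class, and the `run` helper are hypothetical, and real systems behave far less simply.

```python
class Unavailable(Exception):
    """Raised when a node refuses a request in order to stay consistent."""


class ReplicaNode:
    def __init__(self, mode: str) -> None:
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.value = None
        self.partitioned = False  # True when cut off from its peer

    def write(self, value, peer: "ReplicaNode") -> None:
        if self.partitioned:
            if self.mode == "CP":
                # Consistency first: reject rather than let replicas diverge.
                raise Unavailable("cannot reach peer; refusing write")
            # Availability first: accept locally and let replicas diverge.
            self.value = value
            return
        self.value = value
        peer.value = value


def run(mode: str):
    n1, n2 = ReplicaNode(mode), ReplicaNode(mode)
    n1.write("v1", n2)                       # healthy network: both updated
    n1.partitioned = n2.partitioned = True   # network partition begins
    try:
        n1.write("v2", n2)
        outcome = "accepted"
    except Unavailable:
        outcome = "rejected"
    return outcome, n1.value, n2.value


cp_result = run("CP")
ap_result = run("AP")
```

In the CP run the second write is refused and both replicas still agree; in the AP run the write succeeds but the replicas hold different values until they can reconcile.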

Key Takeaway

The CAP Theorem states that in a distributed system, it is impossible to simultaneously guarantee Consistency, Availability, and Partition Tolerance; system designers must choose which two to prioritize based on application needs.

4. Data Models & Types

Key Concepts & Definitions

Data Model: A conceptual framework that defines how data is structured, stored, and manipulated within a database system. It specifies the relationships, constraints, and operations applicable to the data.
Relational Model: A data model based on tables (relations), where data is organized into rows and columns, supporting operations like joins and referential integrity.
NoSQL Data Model: A non-relational approach that offers flexible, schema-less data storage, often optimized for horizontal scaling, fault tolerance, and high performance. Types include key-value, document, column-family, and graph models.
CAP Theorem: A principle stating that in a distributed system, it is impossible to simultaneously guarantee Consistency, Availability, and Partition Tolerance; systems must choose two.
Consistency Types:
- Strong Consistency: All nodes see the same data at the same time, supporting ACID transactions.
- Eventual Consistency: Data updates propagate asynchronously, and all nodes will eventually become consistent.
Types of NoSQL DBMS:
- Key-Value: Stores data as pairs of keys and values.
- Document-Based: Stores semi-structured documents like JSON or XML.
- Column-Based: Organizes data into column families for efficient retrieval.
- Graph-Based: Models data as nodes and edges to represent relationships.

Essential Points

Traditional relational databases excel at complex queries and ACID guarantees but face scalability issues with large datasets, especially in web and cloud environments.
NoSQL databases emerged to address scalability and flexibility challenges, sacrificing some relational features like joins and referential integrity.
Horizontal scaling (scaling out) via sharding, replication, and multi-master setups is fundamental for handling big data in NoSQL systems.
The CAP theorem guides the design choices in distributed systems, forcing a tradeoff between consistency and availability during network partitions.
NoSQL models are categorized into key-value, document, column-family, and graph types, each suited for specific data and application needs.
Eventual consistency is common in NoSQL systems, providing high availability and partition tolerance at the cost of immediate data consistency.
Hybrid approaches and NewSQL systems aim to combine the scalability of NoSQL with the transactional guarantees of traditional RDBMS.

Key Takeaway

NoSQL data models provide flexible, scalable alternatives to relational databases, enabling efficient handling of large, distributed, and semi-structured data, but require careful consideration of consistency and integrity tradeoffs dictated by the CAP theorem.

5. Key-Value & API

Key Concepts & Definitions

Key-Value Store: A type of NoSQL database that stores data as a collection of key-value pairs, where each key is unique and directly maps to a value, enabling rapid access.
API (Application Programming Interface): A set of protocols and tools that allow different software components to communicate; in key-value stores, APIs typically include operations like get, put, delete, and execute.
Get Operation: Retrieves the value associated with a specific key from the database.
Put Operation: Creates or updates the value linked to a specific key.
Delete Operation: Removes a key and its associated value from the database.
Fault Tolerance: The ability of a system to continue functioning correctly even when some components fail, often achieved through data replication across nodes.
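
The get, put, and delete operations defined above can be sketched as a tiny in-memory store. The `KeyValueStore` class is a hypothetical illustration, not the API of DynamoDB, Redis, or any other product.

```python
from typing import Any, Optional


class KeyValueStore:
    """Minimal in-memory key-value store exposing get/put/delete."""

    def __init__(self) -> None:
        self._data = {}

    def put(self, key: str, value: Any) -> None:
        """Create the key or overwrite its current value."""
        self._data[key] = value

    def get(self, key: str) -> Optional[Any]:
        """Return the value for the key, or None if the key is absent."""
        return self._data.get(key)

    def delete(self, key: str) -> bool:
        """Remove the key; return True if it existed."""
        if key in self._data:
            del self._data[key]
            return True
        return False


kv = KeyValueStore()
kv.put("session:1", {"user": "alice"})
value = kv.get("session:1")
removed = kv.delete("session:1")
```

The store treats values as opaque: it can look them up by key but cannot query inside them, which is the modeling limitation noted in this section.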

Essential Points

Design Focus: Key-value stores are optimized for scalability, speed, and simplicity, handling massive loads with minimal latency.
API Simplicity: Operations are straightforward—get, put, delete—making them easy to implement and use.
Data Model Limitations: Cannot easily model complex relationships or nested data structures; primarily suitable for simple lookups.
Examples: DynamoDB (Amazon), Redis, Voldemort, Scalaris.
Advantages:
- Very fast read/write performance.
- Highly scalable through horizontal distribution.
- Fault-tolerant via data replication.
- Simple data model reduces complexity.
Disadvantages:
- Limited to simple key-value data; not suitable for complex queries.
- No inherent support for relationships, joins, or multi-operation transactions.
- Usually eventual consistency, which may not suit all applications.

Key Takeaway

Key-value databases provide a highly scalable and efficient storage solution for simple data retrieval tasks, but their limited data modeling capabilities make them less suitable for applications requiring complex relationships or transactional integrity.

6. Document & JSON

Key Concepts & Definitions

NoSQL: A class of non-relational databases designed for scalability, flexibility, and high performance, often used for large-scale, distributed data storage.
JSON (JavaScript Object Notation): A lightweight, text-based data interchange format that represents data as key-value pairs, supporting nested objects, arrays, and various data types.
Document-based Database: A NoSQL database that stores data as semi-structured documents (e.g., JSON, XML), allowing complex objects with nested structures.
Schema-less: A characteristic of NoSQL databases where data does not require a fixed schema, enabling flexible and dynamic data models.
Object-structured Documents: Data stored as JSON objects with properties and nested sub-objects, facilitating complex data modeling.
Primary Key: A unique identifier for each document within a collection, used for efficient retrieval.

Essential Points

Document Model: Stores data as self-contained JSON-like documents, supporting complex, nested structures, arrays, and various data types, enabling flexible data representation.
Querying: Operations typically include finding documents by key or property, updating, deleting, and manipulating nested data structures.
Advantages: Supports complex objects, flexible schema, easy to evolve data models, and natural fit for web applications.
Popular Examples: MongoDB (which stores documents as BSON, a binary JSON format) and CouchDB; both store JSON-style documents and provide indexing and querying capabilities.
JSON Format: Key-value pairs with support for nested objects, arrays, strings, numbers, booleans, and nulls, making it suitable for semi-structured data.
Use Cases: Content management, user profiles, product catalogs, and any application requiring flexible, hierarchical data storage.
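
A short sketch can show how nested JSON documents are stored by primary key and queried by property. It uses only Python's standard `json` module; the `collection` dict, the `_id` field name, the sample data, and the `find` helper are assumptions for illustration rather than any database's real API.

```python
import json

doc_text = """
{
  "_id": "user-1",
  "name": "Alice",
  "tags": ["admin", "beta"],
  "address": {"city": "Turin", "zip": "10100"}
}
"""
profile = json.loads(doc_text)

# A "collection" keyed by primary key; documents need not share a schema.
collection = {profile["_id"]: profile}
collection["user-2"] = {"_id": "user-2", "name": "Bob", "address": {"city": "Rome"}}

_MISSING = object()  # sentinel for "field not present"


def find(docs, path, expected):
    """Return documents whose nested field at dotted `path` equals `expected`."""
    results = []
    for doc in docs.values():
        node = doc
        for part in path.split("."):
            if isinstance(node, dict) and part in node:
                node = node[part]
            else:
                node = _MISSING
                break
        if node is not _MISSING and node == expected:
            results.append(doc)
    return results


in_rome = find(collection, "address.city", "Rome")
```

The second document has no `tags` or `zip` field, which is acceptable in a schema-less collection.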

Key Takeaway

Document and JSON-based NoSQL databases provide a flexible, schema-less approach to storing complex, nested data structures, making them ideal for web-scale applications that demand agility and scalability.

7. Column-Based & Families

Key Concepts & Definitions

Column-Based Database: A type of NoSQL database that stores data in columns rather than rows, optimized for read/write operations on specific columns.
Column Family: A collection of columns stored together, similar to a table in RDBMS but with flexible schema; each column family contains related data grouped for efficient access.
Key-Value Pair: The fundamental data unit where a unique key maps to a value, used in column-based systems to facilitate rapid retrieval.
Partitioning: The process of dividing data across multiple nodes to enable horizontal scalability; in column-based DBs, partitioning can be done by row key, column key, or timestamp.
Data Locality: The principle that related data (e.g., columns within a family) are stored together to optimize query performance.

Essential Points

Column-based databases handle semi-structured data efficiently, storing related columns together in column families.
They support flexible schemas, allowing variable numbers of columns per row, which enhances scalability and performance.
Data is indexed by row key, column key, and timestamp, enabling fast retrieval of specific data subsets.
Unlike relational databases, they do not require fixed schemas and are optimized for large-scale, read-heavy workloads.
Queries can read only the column families they need; because a family's columns are stored together, this improves data locality and reduces disk I/O.
Examples include Cassandra, HBase, and BigTable.
They are particularly suited for applications requiring high throughput and scalability, such as analytics and real-time data processing.
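
The indexing by row key, column key, and timestamp described above can be sketched with nested dictionaries. This is a simplified illustration: `ColumnFamilyTable` and its methods are hypothetical names, and real systems such as Cassandra or HBase differ in many details.

```python
class ColumnFamilyTable:
    """Toy table: row key -> column family -> column -> {timestamp: value}."""

    def __init__(self) -> None:
        self.rows = {}

    def put(self, row_key, family, column, value, timestamp) -> None:
        versions = (
            self.rows.setdefault(row_key, {})
            .setdefault(family, {})
            .setdefault(column, {})
        )
        versions[timestamp] = value

    def get(self, row_key, family, column):
        """Return the most recent version of one cell, or None if absent."""
        versions = self.rows.get(row_key, {}).get(family, {}).get(column, {})
        if not versions:
            return None
        return versions[max(versions)]

    def get_family(self, row_key, family):
        """Return the latest value of every column in one family of a row."""
        columns = self.rows.get(row_key, {}).get(family, {})
        return {name: versions[max(versions)] for name, versions in columns.items()}


table = ColumnFamilyTable()
table.put("user:1", "profile", "name", "Alice", timestamp=1)
table.put("user:1", "profile", "email", "alice@example.com", timestamp=1)
table.put("user:1", "profile", "email", "alice@example.org", timestamp=2)
table.put("user:2", "profile", "name", "Bob", timestamp=1)  # no email column

latest_email = table.get("user:1", "profile", "email")
profile_row = table.get_family("user:1", "profile")
```

Rows may carry different columns within the same family, which is the flexible-schema property noted in this section.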

Key Takeaway

Column-based families organize semi-structured data into flexible, efficient units that enable high scalability and performance, making them ideal for large-scale, data-intensive applications where schema flexibility and quick access are critical.

8. Graph & Interconnectivity

Key Concepts & Definitions

Graph Data Model: A structure consisting of nodes (entities) and edges (relationships) that connect nodes, often with properties attached to both nodes and edges. Used to represent complex interconnections and relationships.
Nodes: The entities or objects within a graph, which can have properties (attributes) such as IDs or labels.
Edges: The connections between nodes, representing relationships or interactions. Edges can have labels or roles and may also carry properties.
Property Graph: A type of graph model where both nodes and edges can have key-value properties, enabling detailed and flexible data representation.
Graph Query Languages: Specialized languages designed to traverse and query graph structures, such as Cypher (Neo4j), enabling pattern matching and path exploration.
Interconnectivity: The degree and nature of relationships among data points, crucial for modeling social networks, recommendation systems, and complex networks.

Essential Points

Graph databases excel at modeling highly interconnected data, capturing complex relationships naturally.
Nodes and edges can have properties, allowing detailed descriptions and attributes within the graph.
Graph models cope well with growing relationship complexity, supporting recursive and multi-step relationship queries.
Common graph systems include the databases Neo4j and FlockDB, as well as Pregel, Google's framework for large-scale graph processing; each is optimized for different types of graph workloads.
Querying in graph databases often involves pattern matching, path finding, and traversal operations, which are more efficient than join-heavy relational queries for interconnected data.
Graph interconnectivity enables applications like social networks, recommendation engines, fraud detection, and network topology analysis.
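
A property graph and a multi-step traversal can be sketched in a few lines. This is an illustrative in-memory model, not Neo4j or Cypher; the `PropertyGraph` class, the `FOLLOWS` label, and the sample data are assumptions.

```python
from collections import deque


class PropertyGraph:
    """Toy property graph: nodes and edges both carry key-value properties."""

    def __init__(self) -> None:
        self.nodes = {}  # node id -> properties
        self.edges = {}  # node id -> list of (label, target id, properties)

    def add_node(self, node_id, **props) -> None:
        self.nodes[node_id] = props
        self.edges.setdefault(node_id, [])

    def add_edge(self, source, label, target, **props) -> None:
        self.edges.setdefault(source, []).append((label, target, props))

    def reachable(self, start, label, max_hops):
        """Breadth-first traversal over `label` edges.

        Returns {node id: hop count} for nodes within max_hops of start.
        """
        seen = {start: 0}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            if seen[current] == max_hops:
                continue  # do not expand beyond the hop limit
            for edge_label, target, _props in self.edges.get(current, []):
                if edge_label == label and target not in seen:
                    seen[target] = seen[current] + 1
                    queue.append(target)
        del seen[start]
        return seen


g = PropertyGraph()
for name in ("alice", "bob", "carol", "dave"):
    g.add_node(name, kind="person")
g.add_edge("alice", "FOLLOWS", "bob", since=2020)
g.add_edge("bob", "FOLLOWS", "carol", since=2021)
g.add_edge("carol", "FOLLOWS", "dave", since=2022)

two_hops = g.reachable("alice", "FOLLOWS", max_hops=2)
```

Each hop follows stored edges directly, whereas the relational equivalent would need one self-join per hop.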

Key Takeaway

Graph and interconnectivity models provide a powerful framework for representing and querying complex, highly linked data structures, enabling insights and operations that are cumbersome with traditional relational databases.

9. Tradeoffs & NoSQL Use

Key Concepts & Definitions

NoSQL (Not Only SQL): A class of non-relational databases designed for horizontal scalability, flexible data models, and high performance, often sacrificing some traditional relational features.
CAP Theorem: States that in a distributed system, it is impossible to simultaneously guarantee Consistency, Availability, and Partition Tolerance; systems must choose two.
Consistency: Ensuring all nodes see the same data at the same time; can be strong (ACID) or weak (eventual).
Availability: The system's ability to always accept read/write requests, even during failures.
Partition Tolerance: The system's ability to continue operating despite network partitions or failures.
Types of NoSQL DBMS: Key-value, Document-based, Column-based, Graph-based, each optimized for specific data models and use cases.

Essential Points

Scaling Challenges: Traditional RDBMS face limitations with vertical scaling; horizontal scaling (sharding, master/slave replication) is essential for handling large datasets.
NoSQL Advantages: Non-relational, schema-less, fault-tolerant, and designed for massive write/read throughput, making them suitable for web-scale applications.
Tradeoffs: NoSQL systems often relax ACID guarantees (favoring eventual consistency) to achieve high availability and partition tolerance, as per CAP theorem.
Use Cases: NoSQL databases excel in scenarios with large-scale, distributed data, such as social media, real-time analytics, and cloud applications.
Limitations: Lack of support for complex relational features like joins and referential integrity; less suitable for applications requiring strict transactional consistency.
Hybrid & Next-Gen Solutions: Combining RDBMS and NoSQL, or adopting NewSQL databases that aim to provide both scalability and ACID guarantees.

Key Takeaway

NoSQL databases offer scalable, flexible solutions for large-scale, distributed data applications by trading off some traditional relational features, guided by the principles of the CAP theorem; understanding these tradeoffs is essential for selecting the appropriate database system for specific needs.

10. Next-Gen & NewSQL

Key Concepts & Definitions

NoSQL (Not Only SQL): A class of non-relational databases designed for horizontal scalability, fault tolerance, and flexible data models, often sacrificing some traditional relational features like joins and strict ACID compliance.
CAP Theorem: States that in a distributed system, it is impossible to simultaneously guarantee Consistency, Availability, and Partition Tolerance; systems must choose two out of three.
Eventual Consistency: A consistency model where, in the absence of new updates, all replicas of data will eventually become consistent, favoring availability over immediate consistency.
Next-Gen Databases (NewSQL): Modern relational databases that aim to combine the scalability of NoSQL systems with the ACID guarantees of traditional RDBMS, supporting SQL and high performance on distributed architectures.
Types of NoSQL DBMS: Categorized into key-value, document-based, column-based, and graph-based systems, each optimized for specific data models and query patterns.
Sharding (Partitioning): Distributing data across multiple machines to improve scalability; can be horizontal (by rows) or vertical (by columns).
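
The difference between horizontal (by rows) and vertical (by columns) partitioning can be sketched on a small list of records. The sample data, the id-modulo placement rule, and both function names are assumptions for illustration.

```python
users = [
    {"id": 1, "name": "Alice", "email": "alice@example.com", "bio": "Likes graphs"},
    {"id": 2, "name": "Bob", "email": "bob@example.com", "bio": "Likes tables"},
    {"id": 3, "name": "Carol", "email": "carol@example.com", "bio": "Likes documents"},
    {"id": 4, "name": "Dave", "email": "dave@example.com", "bio": "Likes keys"},
]


def partition_horizontally(rows, num_shards):
    """Split by rows: each shard holds complete rows for a subset of ids."""
    shards = [[] for _ in range(num_shards)]
    for row in rows:
        shards[row["id"] % num_shards].append(row)
    return shards


def partition_vertically(rows, column_groups):
    """Split by columns: each partition holds some columns of every row.

    The id is repeated in every partition so rows can be reassembled.
    """
    return [
        [{"id": row["id"], **{c: row[c] for c in columns}} for row in rows]
        for columns in column_groups
    ]


by_rows = partition_horizontally(users, num_shards=2)
by_columns = partition_vertically(users, [["name", "email"], ["bio"]])
```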

Essential Points

Traditional RDBMS face scaling challenges, leading to the development of NoSQL systems that excel in handling large, distributed datasets with high write/read throughput.
NoSQL databases prioritize horizontal scalability, fault tolerance, and flexible schemas, often at the expense of full relational features and strict ACID compliance.
The CAP theorem influences NoSQL design choices, resulting in systems that often favor availability and partition tolerance, implementing eventual consistency models.
Major NoSQL types include key-value stores (e.g., DynamoDB), document stores (e.g., MongoDB), column-family stores (e.g., Cassandra), and graph databases (e.g., Neo4j).
Next-Gen or NewSQL databases aim to provide the best of both worlds: scalable distributed architecture with full ACID transaction support, primarily using SQL.
Hybrid approaches and system designs are common, combining RDBMS and NoSQL features to meet enterprise and cloud application requirements.

Key Takeaway

Next-Gen and NewSQL databases are evolving solutions that strive to deliver scalable, distributed data management with the transactional integrity and query richness of traditional relational systems, addressing the limitations of earlier NoSQL architectures.

Synthesis Tables

Aspect	Relational Databases (RDBMS)	NoSQL Databases
Data Model	Tables with fixed schemas, relations	Flexible models: key-value, document, column, graph
Scalability	Vertical (scale-up)	Horizontal (scale-out)
Schema	Fixed, predefined	Schema-less, dynamic
Consistency	Strong (ACID compliance)	Eventual or tunable consistency
Typical Use Cases	Complex queries, transactions	Large-scale, distributed, flexible data storage
Data Relationships	Supports joins, referential integrity	Limited or no joins, denormalized data

Aspect	Key-Value/Document/Column/Graph Systems	Tradeoffs & Use Cases
Data Access	Simple key-based, document queries, graph traversal	High performance, scalability, flexible schemas
Consistency	Often eventual, tunable	Sacrifice some consistency for availability
Scalability	Designed for horizontal scaling	Suitable for big data, real-time apps
Typical Systems	Redis, MongoDB, Cassandra, Neo4j	Use when high throughput, flexible schema needed

Common Pitfalls & Confusions

Confusing CAP properties: assuming all systems guarantee all three simultaneously.
Believing relational databases are always better for all applications; neglecting scalability needs.
Misunderstanding eventual consistency as data inconsistency; it guarantees convergence over time.
Overlooking the importance of sharding and its impact on data integrity and query complexity.
Assuming NoSQL systems lack consistency; many offer tunable consistency models.
Confusing vertical scaling with horizontal scaling; the latter is essential for big data.
Ignoring tradeoffs between performance, consistency, and availability when choosing a database type.

Exam Checklist

Define NOSQL and explain its key features compared to RDBMS.
Describe the importance of horizontal scaling and methods like sharding and replication.
State the CAP theorem and discuss how it influences distributed database design.
Differentiate between data models: relational, key-value, document, column-family, graph.
Explain key-value and document databases, including JSON data handling.
Describe column-family databases and the concept of column families.
Summarize graph databases and their use in modeling interconnectivity.
Discuss the tradeoffs involved in NoSQL systems and their typical use cases.
Understand the concept of next-generation databases and NewSQL solutions.
Explain the differences between vertical and horizontal scaling.
Describe the role of de-normalization in NoSQL systems.
Identify scenarios where eventual consistency is acceptable versus strict consistency needs.
List common NoSQL databases and their data models.
Recognize the impact of the CAP theorem on database choice.
Understand the importance of data models in selecting the appropriate database system.
Be aware of the limitations of relational databases in distributed environments.
Describe how graph databases model complex relationships.
Summarize the advantages of in-memory databases for real-time applications.
Recognize the concept of NewSQL as an attempt to combine SQL features with NoSQL scalability.
Know the main types of NoSQL databases and their typical use cases.
