Networking

Distributed System course

A distributed system is a collection of independent computers that work together to achieve a common goal by sharing resources, data, and communication across a network. In a distributed system, the computers or nodes are connected by a communication network, and they coordinate their actions by exchanging messages and sharing data.

Distributed systems can vary in size and complexity, from small-scale systems consisting of only a few computers to large-scale systems comprising thousands or even millions of nodes. They are used in many applications and domains, including cloud computing, peer-to-peer networks, social networks, online gaming, and scientific simulations.

The main advantage of a distributed system is that it can provide a high level of performance, scalability, and fault tolerance by allowing the workload to be spread across multiple nodes. In addition, a distributed system can enable the sharing of data and resources, leading to more efficient use of computing resources and reducing the need for data duplication.

However, designing and implementing a distributed system can be challenging due to the complexity of distributed algorithms and the need to handle issues such as concurrency, consistency, and fault tolerance. Moreover, distributed systems are often subject to network delays, hardware failures, and other issues that can impact performance and reliability.

Overall, distributed systems are an important tool in modern computing, enabling the creation of large-scale, highly available, and scalable applications and services.

Communication Mechanisms in Distributed Systems

Distributed systems rely on various communication patterns and design techniques:

  • Message Passing
  • Remote Procedure Calls (RPC)
  • Event Bus Pattern
  • Publish-Subscribe Models

Examples of distributed systems

  • Domain Name System (DNS)
    • Distributed lookup table of hostname to IP address
  • Content Delivery Network (CDN)
  • Facebook and Google
    • Massive scale
    • Fast
    • Very reliable
  • Email servers (SMTP)
  • TOR network
  • Phone networks
    • Land line and cellular
  • Cars network electronic components via CANbus
  • Traffic light controllers
  • Train control networks
  • Airplanes
    • Avionics use RS232
    • Air traffic control uses verbal communication

Social benifices of distributed system

  • It is more collaborative
  • More resilient
  • User integrity
  • Hardware sharing economy

Failures in Distributed Systems

  • Network faillures
    • Protocols like TCP/IP and SSH are usefull to secure and manage the network
    • Network not fast enough
    • The biggest risk is network partition
      • When two nodes write to same data item in different subgraphs
      • Shared state diverges
      • Loss of connectivity
  • Node faillures
Link to original

Byzantine Fault Tolerance

What is BFT

Byzantine Fault Tolerance(BFT) is the feature of a distributed network to reach consensus(agreement on the same value) even when some of the nodes in the network fail to respond or respond with incorrect information. The objective of a BFT mechanism is to safeguard against the system failures by employing collective decision making(both – correct and faulty nodes) which aims to reduce to influence of the faulty nodes.

Fault tolerance is the system’s ability to continue functioning correctly even when some of its components, including nodes and communication channels, fail or behave maliciously.

  • A BFT is a faillure that is not a Fail Stop
  • It mean there is bad actors in the network : Traitor nodes
    • Flaky node(s)
    • Malicious nodes(s)
  • Example of extreme fault tolerance
    • Bitcoin
    • Boeing 777 & 787
  • Assumptions we make
    • Can all nodes see all message?  Some?  None?
    • Do nodes fail?  How about the network?
    • Finite computation?
    • Static or dynamic adversary?
    • Bounded communication time?
    • Fully connected network?
    • Randomized algorithms?
    • Quantum or binary computers?

The Two Generals Problem

Two armies, each led by a different general, are preparing to attack a fortified city. The armies are encamped near the city, each in its own valley. A third valley separates the two hills, and the only way for the two generals to communicate is by sending messengers through the valley. Unfortunately, the valley is occupied by the city’s defenders and there’s a chance that any given messenger sent through the valley will be captured.

While the two generals have agreed that they will attack, they haven’t agreed upon a time for an attack. It is required that the two generals have their armies attack the city simultaneously to succeed, lest the lone attacker army die trying. They must thus communicate with each other to decide on a time to attack and to agree to attack at that time, and each general must know that the other general knows that they have agreed to the attack plan. Because acknowledgement of message receipt can be lost as easily as the original message, a potentially infinite series of messages is required to come to consensus.

The thought experiment involves considering how they might go about coming to a consensus. In its simplest form, one general is known to be the leader, decides on the time of the attack, and must communicate this time to the other general. The problem is to come up with algorithms that the generals can use, including sending messages and processing received messages, that can allow them to correctly conclude:

Yes, we will both attack at the agreed-upon time.

Allowing that it is quite simple for the generals to come to an agreement on the time to attack (i.e. one successful message with a successful acknowledgement), the subtlety of the Two Generals’ Problem is in the impossibility of designing algorithms for the generals to use to safely agree to the above statement.

The Byzantine General Problem

Imagine that several divisions of the Byzantine army are camped outside an enemy city, each division commanded by its own general. 

The generals can communicate with one another only by messenger. After observing the enemy, they must decide upon a common plan of action. 

However, some of the generals may be traitors, trying to prevent the loyal generals from reaching an agreement. 

The generals must decide on when to attack the city, but they need a strong majority of their army to attack at the same time. 

The generals must have an algorithm to guarantee that : 
- (a) all loyal generals decide upon the same plan of action, and 
- (b) a small number of traitors cannot cause the loyal generals to adopt a bad plan. 

The loyal generals will all do what the algorithm says they should, but the traitors may do anything they wish. 

The algorithm must guarantee condition (a) regardless of what the traitors do. The loyal generals should not only reach agreement, but should agree upon a reasonable plan.
  • Goal of the solution : Reaching Consensus
  • At least two generals must agree for there to be a consensus.
  • If 1/3 of the generals are traitors, it’s impossible to solve the problem
    • 3m+1 generals
    • At least 4 generals is required

How to solve the Byzantine General Problem

  • Assuming
    • Less than 1/3 of generals are traitors
    • Oral messages (p2p)
    • No crypto
Link to original