Routing Protocol

Thursday, February 26, 2009


 

 


A routing protocol is a protocol that specifies how routers communicate with each other to disseminate information that allows them to select routes between any two nodes on a network. Typically, each router has prior knowledge only of its immediate neighbors. A routing protocol shares this information so that routers gain knowledge of the network topology at large. For a discussion of the concepts behind routing protocols, see: Routing.

The term routing protocol may refer more specifically to a protocol operating at Layer 3 of the OSI model which similarly disseminates topology information between routers.

Many routing protocols used in the public Internet are defined in documents called RFCs.[1][2][3][4]

There are three major types of routing protocols, some with variants: link-state routing protocols, path vector protocols and distance vector routing protocols.

The specific characteristics of routing protocols include the manner in which they either prevent routing loops from forming or break routing loops if they do form, and the manner in which they determine preferred routes from a sequence of hop costs and other preference factors.


Routed versus routing protocols

Confusion often arises between routing protocols and routed protocols. While routing protocols help the router in the decision-making on which paths to send traffic, routed protocols are responsible for the actual transfer of traffic between L3 devices.[5] Specifically, a routed protocol is any network protocol that provides enough information in its network layer address to allow a packet to be forwarded from one host to another host based on the addressing scheme, without knowing the entire path from source to destination. Routed protocols define the format and use of the fields within a packet. Packets generally are conveyed from end system to end system. Almost all layer 3 protocols and those that are layered over them are routable, with IP being an example. Layer 2 protocols such as Ethernet are necessarily non-routable protocols, since they contain only a link-layer address, which is insufficient for routing: some higher-level protocols based directly on these without the addition of a network layer address, such as NetBIOS, are also non-routable.

In some cases, routing protocols can themselves run over routed protocols: for example, BGP runs over TCP. Care is taken in the implementation of such systems not to create a circular dependency between the routing and routed protocols. That a routing protocol runs over a particular transport mechanism of layer N does not make the routing protocol a layer N+1 protocol. Routing protocols, according to the OSI Routeing [sic] framework, are layer management protocols for the network layer, regardless of their transport mechanism:

  • IS-IS runs over the data link layer

  • OSPF, IGRP, and EIGRP run directly over IP; OSPF and EIGRP have their own reliable transmission mechanism while IGRP assumed an unreliable transport

  • RIP runs over UDP

  • BGP runs over TCP

Examples

Ad hoc network routing protocols

Ad hoc network routing protocols appear in networks with little or no infrastructure.

Interior routing protocols

Interior Gateway Protocols (IGPs) exchange routing information within a single routing domain. A given autonomous system [6] can contain multiple routing domains, or a set of routing domains can be coordinated without being an Internet-participating autonomous system. Common examples include:

  • IGRP (Interior Gateway Routing Protocol)

  • EIGRP (Enhanced Interior Gateway Routing Protocol)

  • OSPF (Open Shortest Path First)

  • RIP (Routing Information Protocol)

  • IS-IS (Intermediate System to Intermediate System)

Note that IGRP, a Cisco proprietary routing protocol, is no longer supported. EIGRP accepts IGRP configuration commands, but the internals of IGRP and EIGRP are completely different.

Exterior routing protocols

Exterior Gateway Protocols (EGPs) route between separate autonomous systems. Examples include:

  • EGP (the original Exterior Gateway Protocol used to connect to the former Internet backbone network; now obsolete)

  • BGP (Border Gateway Protocol: the current version, BGPv4, dates from around 1995)

  • CSPF (Constrained Shortest Path First)





Intermediate System to Intermediate System (IS-IS) is a protocol used by network devices (routers) to determine the best way to forward datagrams or packets through a packet-based network, a process called routing. The protocol was defined in ISO/IEC 10589 within the Open Systems Interconnection (OSI) reference model and republished for the Internet community as RFC 1142. IS-IS is not an Internet standard.



Description

IS-IS is an Interior Gateway Protocol (IGP) meaning that it is intended for use within an administrative domain or network. It is not intended for routing between Autonomous Systems (RFC 1930), a job which is the purpose of an Exterior Gateway Protocol, such as Border Gateway Protocol (BGP).

IS-IS is a link-state routing protocol, meaning that it operates by reliably flooding topology information throughout a network of routers. Each router then independently builds a picture of the network's topology. Packets or datagrams are forwarded based on the best topological path through the network to the destination.

IS-IS uses Dijkstra's algorithm for identifying the best path through the network.
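As a minimal sketch of what this computation looks like, here is Dijkstra's algorithm run over a link-state database in Python. The four-router topology and its link costs are invented for illustration; a real IS-IS implementation operates on its flooded LSDB rather than a hand-built dictionary:

```python
import heapq

def dijkstra(lsdb, source):
    """Shortest-path-first computation over a link-state database.

    `lsdb` maps each router to a dict of {neighbor: link_cost}.
    Returns (dist, prev): the least total cost to reach each router,
    and each router's predecessor on its shortest path.
    """
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > dist[node]:
            continue  # stale heap entry; a shorter path was already found
        for neighbor, link_cost in lsdb.get(node, {}).items():
            new_cost = cost + link_cost
            if new_cost < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_cost
                prev[neighbor] = node
                heapq.heappush(heap, (new_cost, neighbor))
    return dist, prev

# Hypothetical four-router topology with symmetric link costs
topology = {
    "A": {"B": 10, "C": 5},
    "B": {"A": 10, "D": 1},
    "C": {"A": 5, "D": 20},
    "D": {"B": 1, "C": 20},
}
dist, prev = dijkstra(topology, "A")
print(dist)   # {'A': 0, 'B': 10, 'C': 5, 'D': 11}
```

Note that router D is reached via B (total cost 10 + 1 = 11) rather than via the directly cheaper neighbor C (5 + 20 = 25), which is exactly the kind of decision the SPF computation makes.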

History

The IS-IS protocol was developed by Digital Equipment Corporation as part of DECnet Phase V. It was standardised by the ISO in 1992 as ISO 10589 for communication between network devices which are termed Intermediate Systems (as opposed to end systems or hosts) by the ISO. The purpose of IS-IS was to make possible the routing of datagrams using the ISO-developed OSI protocol stack called CLNS.

IS-IS was developed at roughly the same time that the Internet Engineering Task Force (IETF) was developing a similar protocol called OSPF. IS-IS was later extended to support the routing of datagrams (i.e. network-layer packets) using the Internet Protocol (IP), the basic routed protocol of the global (public) Internet. This version of the IS-IS routing protocol was then called Integrated IS-IS (RFC 1195).

OSPF had achieved predominance as an IGP (Interior Gateway Protocol) routing protocol, particularly in medium-to-large-sized enterprise networks. IS-IS, in contrast, remained largely unknown by most network engineers and was used predominantly in the networks of certain very-large service providers.

IS-IS has become more widely known in the last several years, and has become a viable alternative to OSPF in enterprise networks. Detailed analysis[citation needed], however, tends to show that OSPF has traffic tuning features that are especially suitable to enterprise networks, while IS-IS has stability features especially suitable to ISP infrastructure.

Comparison with OSPF

Both IS-IS and OSPF are link state protocols, and both use the same Dijkstra algorithm for computing the best path through the network. As a result, they are conceptually similar. Both support variable length subnet masks, can use multicast to discover neighboring routers using hello packets, and can support authentication of routing updates.

While OSPF is natively built to route IP and is itself a layer 3 protocol that runs on top of IP, IS-IS is natively an ISO network layer protocol (it is at the same layer as CLNS), a fact that may have allowed OSPF to be more widely used. IS-IS does not use IP to carry routing information messages.

IS-IS routers build a topological representation of the network. This map indicates the IP subnets which each IS-IS router can reach, and the lowest cost (shortest) path to an IP subnet is used to forward IP traffic.

IS-IS also differs from OSPF in the methods by which it reliably floods topology and topology change information through the network. However, the basic concepts are similar.

Since OSPF is more popular, this protocol has a richer set of extensions and added features. However IS-IS is less "chatty" and can scale to support larger networks. Given the same set of resources, IS-IS can support more routers in an area than OSPF. This makes IS-IS favoured in ISP environments. Additionally, IS-IS is neutral regarding the type of network addresses for which it can route. OSPF, on the other hand, was designed for IPv4. Thus IS-IS was easily adapted to support IPv6, while the OSPF protocol needed a major overhaul (OSPF v3).

The TCP/IP implementation, known as "Integrated IS-IS" or "Dual IS-IS", is described in RFC 1195.

IS-IS differs from OSPF in the way that "areas" are defined and routed between. IS-IS routers are designated as Level 1 (intra-area), Level 2 (inter-area), or Level 1-2 (both). Level 1 routers exchange routing information only with other Level 1 routers, and Level 2 routers only with other Level 2 routers; Level 1-2 routers exchange information with both levels and connect the inter-area routers with the intra-area routers. In OSPF, areas are delineated on the interface, so that an area border router (ABR) is actually in two or more areas at once, effectively creating the borders between areas inside the ABR; in IS-IS, area borders lie on the links between routers designated as Level 2 or Level 1-2. The result is that an IS-IS router is only ever part of a single area. IS-IS also does not require Area 0 (area zero) to be the backbone area through which all inter-area traffic must pass. The logical view is that OSPF creates something of a star topology of many areas all attached directly to area zero, whereas IS-IS creates a logical backbone of Level 2 routers with branches of Level 1-2 and Level 1 routers forming the individual areas.

Other related protocols

Fabric Shortest Path First (FSPF), the link-state routing protocol used in Fibre Channel fabrics, is a related protocol. The OSI routing framework within which IS-IS operates defines four levels of routing:

  • Level 0: between end systems (ESs) and intermediate systems (ISs) on the same subnet; OSI routing begins at this level (ES-IS)

  • Level 1: between ISs in the same area; also called intra-area routing

  • Level 2: between areas within a domain; called inter-area routing

  • Level 3: routing between separate domains; similar in role to BGP





Interior Gateway Routing Protocol (IGRP) is a distance-vector interior gateway protocol (IGP) invented by Cisco, used by routers to exchange routing data within an autonomous system.

IGRP is a proprietary protocol. IGRP was created in part to overcome the limitations of RIP (a maximum hop count of only 15, and a single routing metric) when used within large networks. IGRP supports multiple metrics for each route, including bandwidth, delay, load, MTU, and reliability; to compare two routes, these metrics are combined into a single composite metric using a formula that can be adjusted through pre-set constants. The maximum hop count of IGRP-routed packets is 255 (default 100).

IGRP is considered a classful routing protocol. Because the protocol has no field for a subnet mask, the router assumes that all interface addresses within the same Class A, Class B, or Class C network have the same subnet mask as the subnet mask configured for the interfaces in question. This contrasts with classless routing protocols that can use variable length subnet masks. Classful protocols have become less popular as they are wasteful of IP address space.
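The classful assumption above can be sketched in a few lines of Python: with no mask field in its updates, a classful protocol must infer the mask purely from the first octet of the address, using the classic pre-CIDR class boundaries. The function and sample addresses here are illustrative only:

```python
def classful_mask(ip):
    """Infer the classful network mask from the first octet of an
    IPv4 address, as a protocol with no subnet-mask field must."""
    first = int(ip.split(".")[0])
    if first < 128:
        return "255.0.0.0"       # Class A: 0.x.x.x - 127.x.x.x
    if first < 192:
        return "255.255.0.0"     # Class B: 128.x.x.x - 191.x.x.x
    if first < 224:
        return "255.255.255.0"   # Class C: 192.x.x.x - 223.x.x.x
    raise ValueError("not a unicast class A/B/C address")

print(classful_mask("10.1.2.3"))      # 255.0.0.0
print(classful_mask("192.168.0.1"))   # 255.255.255.0
```

The wastefulness follows directly: a site with 300 hosts on a Class B network is assumed to occupy all 65,534 host addresses of 255.255.0.0, regardless of how it is actually subnetted.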

Advancement

In order to address the issues of address space and other factors, Cisco created EIGRP (Enhanced Interior Gateway Routing Protocol). EIGRP adds support for VLSM (variable length subnet masks) and the Diffusing Update Algorithm (DUAL) in order to improve routing and provide a loop-free environment. EIGRP has completely replaced IGRP, making IGRP an obsolete routing protocol. In Cisco IOS versions 12.3 and greater, IGRP is completely unsupported. In the new Cisco CCNA curriculum (version 4), IGRP is mentioned only briefly, as an "obsolete protocol".






Enhanced Interior Gateway Routing Protocol (EIGRP) is a Cisco proprietary routing protocol loosely based on their original IGRP. EIGRP is an advanced distance-vector routing protocol, with optimizations to minimize both the routing instability incurred after topology changes and the use of bandwidth and processing power in the router. Routers that support EIGRP will automatically redistribute route information to IGRP neighbors by converting the 32-bit EIGRP metric to the 24-bit IGRP metric. Most of the routing optimizations are based on the Diffusing Update Algorithm (DUAL) work from SRI, which guarantees loop-free operation and provides a mechanism for fast convergence.



Basic operation

The data EIGRP collects is stored in three tables:

  • Neighbor Table: Stores data about the neighboring routers, i.e. those directly accessible through directly connected interfaces.

  • Topology Table: Confusingly named, this table does not store an overview of the complete network topology; rather, it effectively contains only the aggregation of the routing tables gathered from all directly connected neighbors. This table contains a list of destination networks in the EIGRP-routed network together with their respective metrics. For every destination, a successor and a feasible successor are identified and stored in the table if they exist. Every destination in the topology table can be marked either as "Passive", the state when the routing has stabilized and the router knows the route to the destination, or "Active", when the topology has changed and the router is in the process of (actively) updating its route to that destination.

  • Routing Table: Stores the actual routes to all destinations; it is populated from the topology table with every destination for which a successor has been identified.

Unlike most other distance vector protocols, EIGRP does not rely on periodic route dumps in order to maintain its topology table. Routing information is exchanged only upon the establishment of new neighbor adjacencies, after which only changes are sent.

Multiple metrics

EIGRP associates five different metrics with each route:

  • Total Delay (in tens of microseconds)

  • Minimum Bandwidth (in kilobits per second)

  • Reliability (number in range 1 to 255; 255 being most reliable)

  • Load (number in range 1 to 255; 255 being saturated)

  • Minimum path Maximum Transmission Unit (MTU) (though not actually used in the calculation)

For the purposes of comparing routes, these are combined together in a weighted formula to produce a single overall metric:

metric = [ K1 * Bandwidth + (K2 * Bandwidth) / (256 - Load) + K3 * Delay ] * [ K5 / (K4 + Reliability) ] * 256

where the constants K1 through K5 can be set by the user to produce varying behaviors. An important and non-obvious fact is that if K5 is set to zero, the term K5 / (K4 + Reliability) is not used (i.e. it is taken as 1).

The default is for K1 and K3 to be set to 1, and the rest to zero, effectively reducing the above formula to (Bandwidth + Delay) * 256.

These constants must be set to the same value on all routers in an EIGRP system, or permanent routing loops will likely result. Cisco routers running EIGRP will not form an EIGRP adjacency, and will complain about a K-value mismatch, until these values are identical on both routers.

EIGRP scales Bandwidth and Delay metrics with following calculations:

Bandwidth for EIGRP = 10^7 / Interface Bandwidth
Delay for EIGRP = Interface Delay / 10

On Cisco routers, the interface bandwidth is a configurable static parameter expressed in kilobits per second. Dividing a reference value of 10^7 kbit/s (i.e. 10 Gbit/s) by the interface bandwidth yields the value used in the weighted formula. Analogously, the interface delay is a configurable static parameter expressed in microseconds. Dividing this interface delay value by 10 yields a delay in units of tens of microseconds for use in the weighted formula.
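The scaling and the weighted formula above can be sketched in Python. This is a simplified illustration using integer arithmetic and the default K values; the T1-like bandwidth and delay figures in the example are hypothetical, not taken from any particular router:

```python
def eigrp_metric(bandwidth_kbps, delay_usec, load=1, reliability=255,
                 k1=1, k2=0, k3=1, k4=0, k5=0):
    """Composite EIGRP metric, per the weighted formula above.

    bandwidth_kbps: lowest interface bandwidth on the path, in kbit/s
    delay_usec:     total interface delay along the path, in microseconds
    """
    bw = 10**7 // bandwidth_kbps      # scaled bandwidth term
    dly = delay_usec // 10            # delay in tens of microseconds
    metric = k1 * bw + (k2 * bw) // (256 - load) + k3 * dly
    if k5 != 0:                       # K5 == 0: reliability term taken as 1
        metric = metric * k5 // (k4 + reliability)
    return metric * 256

# Hypothetical path: slowest link 1544 kbit/s, total delay 40100 microseconds
print(eigrp_metric(1544, 40100))   # (6476 + 4010) * 256 = 2684416
```

With the defaults (K1 = K3 = 1, the rest zero), the function reduces to (Bandwidth + Delay) * 256 exactly as described above.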

IGRP uses the same basic formula for computing the overall metric; the only difference is that in IGRP, the formula does not contain the scaling factor of 256. In fact, this scaling factor was introduced as a simple means of facilitating backward compatibility between EIGRP and IGRP: in IGRP the overall metric is a 24-bit value, while EIGRP uses a 32-bit value. Multiplying a 24-bit value by 256 (effectively bit-shifting it 8 bits to the left) extends it to 32 bits, and the reverse conversion divides by 256. This way, redistributing information between EIGRP and IGRP simply involves dividing or multiplying the metric value by a factor of 256, which is done automatically.

EIGRP also maintains a hop count for every route; however, the hop count is not used in the metric calculation. It is only verified against a predefined maximum on an EIGRP router (by default it is set to 100 and can be changed to any value between 1 and 255). Routes having a hop count higher than the maximum will be advertised as unreachable by an EIGRP router.

Successor

A successor for a particular destination is a next hop router that satisfies these two conditions:

  • it provides the least distance to that destination

  • it is guaranteed not to be a part of some routing loop

The first condition can be satisfied by comparing metrics from all neighboring routers that advertise that particular destination, increasing the metrics by the cost of the link to that respective neighbor, and selecting the neighbor that yields the least total distance. The second condition can be satisfied by testing a so-called Feasibility Condition for every neighbor advertising that destination. There can be multiple successors for a destination, depending on the actual topology.
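The selection procedure just described can be sketched as follows. This is a simplified model, not EIGRP's actual implementation: each neighbor advertises a distance (AD) for the destination, the link cost to that neighbor is added to obtain the total distance, feasibility (AD strictly below our feasible distance, see below) filters out potentially looping paths, and the minimum total wins. Router names, costs, and the FD value are invented:

```python
def select_successors(neighbors, fd):
    """Pick successor(s) for one destination.

    `neighbors` maps neighbor name -> (advertised_distance, link_cost).
    `fd` is this router's feasible distance to the destination.
    Returns (best_total_distance, [successor names]).
    """
    # Feasibility Condition: only neighbors with AD < FD are loop-free
    feasible = {n: ad + cost for n, (ad, cost) in neighbors.items() if ad < fd}
    if not feasible:
        return None, []
    best = min(feasible.values())
    return best, [n for n, total in feasible.items() if total == best]

# Hypothetical advertisements for one destination; our FD is 30
neighbors = {
    "R1": (25, 10),   # AD 25 < FD 30: feasible, total distance 35
    "R2": (20, 5),    # AD 20 < FD 30: feasible, total distance 25
    "R3": (40, 2),    # AD 40 >= FD:   fails the Feasibility Condition
}
print(select_successors(neighbors, fd=30))   # (25, ['R2'])
```

Note that R3 is excluded even though its total distance (42) might still be loop-free; the Feasibility Condition is sufficient, not necessary, as the section on it below explains.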


The successors for a destination are recorded in the topology table and afterwards they are used to populate the routing table as next-hops for that destination.

Feasible Successor

A feasible successor for a particular destination is a next hop router that satisfies this condition:

  • it is guaranteed not to be a part of some routing loop

This condition is verified by testing the Feasibility Condition: the neighbor's advertised distance to the destination must be strictly lower than this router's feasible distance.

Thus, every successor is also a feasible successor. However, in most references about EIGRP the term "feasible successor" is used to denote only those routers which provide a loop-free path but which are not successors (i.e. they do not provide the least distance). From this point of view, for a reachable destination there is always at least one successor, however, there might not be any feasible successors.

A feasible successor provides a working route to the same destination, although with a higher distance. At any time, a router can send a packet to a destination marked "Passive" through any of its successors or feasible successors, without first alerting them, and the packet will be delivered properly. Feasible successors are also recorded in the topology table.

The feasible successor effectively provides a backup route in the case that existing successors die. Also, when performing unequal-cost load-balancing (balancing the network traffic in inverse proportion to the cost of the routes), the feasible successors are used as next hops in the routing table for the load-balanced destination.

By default, the total count of successors and feasible successors for a destination stored in the routing table is limited to four. This limit can be changed in the range from 1 to 6. In more recent versions of Cisco IOS (e.g. 12.4), the range is 1 to 16.

Active and Passive State

A destination in the topology table can be marked either as Passive or Active. Passive is the state in which the router has identified the successor(s) for the destination. The destination changes to Active state when the current successor no longer satisfies the Feasibility Condition and there are no feasible successors identified for that destination (i.e. no backup routes are available). The destination changes back from Active to Passive when the router has received replies to all the queries it sent to its neighbors. Note that if a successor stops satisfying the Feasibility Condition but at least one feasible successor is available, the router will promote the feasible successor with the lowest total distance (the distance as reported by the feasible successor plus the cost of the link to that neighbor) to the new successor, and the destination remains in the Passive state.

Advertised Distance and Feasible Distance

Advertised Distance (AD) is the distance to a particular destination as reported by a router to its neighbors. This distance is sometimes also called a Reported Distance and is equal to the current lowest total distance through a successor.

A Feasible Distance (FD) is the lowest known distance from a router to a particular destination since the last time the route went from Active to Passive state. It can be expressed in other words as a historically lowest known distance to a particular destination. While a route remains in Passive state, the FD is updated only if the actual distance to the destination decreases, otherwise it stays at its present value. On the other hand, if a router needs to enter Active state for that destination, the FD will be updated with a new value after the router transitions back from Active to Passive state. This is the only case when the FD can be increased. The transition from Active to Passive state in effect marks the start of a new history for that route.

For example, if the route to a newly discovered destination X went from Active to Passive state with a total distance of 10, the router sets both the AD and FD to 10. Later this distance decreases from 10 to 8. The route remains in the Passive state (because a distance decrease never violates the Feasibility Condition) and the router updates the AD and FD to 8. Even later, the distance increases to 12, but in such a way that there is still a valid successor or feasible successor available. In this case, the AD gets updated to 12 while the FD remains at 8; therefore, the values of AD and FD can differ. Finally, the actual successor fails and no other feasible successor is currently identified, so the router has to transition to Active state and ask its neighbors for a new route to destination X. Assuming that the newly found path has a total distance of 100, the router transitions back to Passive state and updates both its AD and FD to the new shortest path length, in this case 100.
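The AD/FD update rules in the worked example above can be captured in a small sketch. This is a deliberately simplified model (a real EIGRP router tracks these per neighbor and per destination); the class name and method names are invented for illustration:

```python
class RouteState:
    """Track AD and FD for one destination, per the rules above."""

    def __init__(self, initial_distance):
        # Route first becomes Passive: AD = FD = initial distance
        self.ad = self.fd = initial_distance

    def passive_update(self, new_distance):
        # While Passive, AD tracks the current distance;
        # FD only ever decreases, never increases
        self.ad = new_distance
        self.fd = min(self.fd, new_distance)

    def active_to_passive(self, new_distance):
        # Returning from Active starts a new history: FD is reset
        self.ad = self.fd = new_distance

route = RouteState(10)        # discovered: AD = FD = 10
route.passive_update(8)       # distance drops: AD = FD = 8
route.passive_update(12)      # distance rises, still Passive: AD = 12, FD stays 8
route.active_to_passive(100)  # successor lost, new path found: AD = FD = 100
print(route.ad, route.fd)     # 100 100
```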

Feasibility Condition

The feasibility condition is a sufficient condition for loop freedom in an EIGRP-routed network. It is used to select the successors and feasible successors that are guaranteed to be on a loop-free route to a destination. Its formulation is strikingly simple:

If, for a destination, a neighbor router advertises a distance that is strictly lower than our feasible distance, then this neighbor lies on a loop-free route to this destination.

or in other words,

If, for a destination, a neighbor router tells us that it is closer to the destination than we have ever been, then this neighbor lies on a loop-free route to this destination.

In exact terms, every neighbor that satisfies the relation AD < FD for a particular destination is on a loop-free route to that destination.

This condition is also called the Source Node Condition and is one of several equivalent conditions that were proposed and proven by J. J. Garcia-Luna-Aceves at SRI, in the paper that introduced the Diffusing Update Algorithm (DUAL) itself.

It is important to realize that this condition is a sufficient, not a necessary, condition. This means that neighbors which satisfy it are guaranteed to be on a loop-free path to the destination; however, there may also be other neighbors on a loop-free path which do not satisfy the condition. Such neighbors, though, do not provide the shortest path to the destination, so not using them does not significantly impair the network's functionality. These neighbors will be re-evaluated for possible use if the router transitions to Active state for that destination.

EIGRP classification as a distance-vector

In the past, EIGRP was described in various Cisco marketing materials as a balanced hybrid routing protocol, allegedly combining the best features of link-state and distance-vector protocols. This description is not correct in principle. By definition:

  • Distance-vector routing protocols are based on a distributed form of the Bellman-Ford algorithm to find shortest paths. They work by exchanging a vector of distances to all destinations known to each node. No further topological information is ever exchanged. Thus, each node knows about all destinations present in the network and it knows the resulting distance to each destination via each of the node's neighbors. However, the node does not have any idea of the actual network topology, nor does the node need it.

  • Link-state routing protocols are based on algorithms to find shortest paths in a graph (the most often used algorithm is Dijkstra's algorithm). They work by exchanging a description of each node and its exact connections to its neighbors (in essence, each node describes its adjacencies to neighboring nodes and this information is flooded throughout the network). Therefore, each node knows the exact network topology, i.e. it has a graph representation of the network. Using this graph, each node computes the shortest paths from itself to each available destination.

The EIGRP routers exchange messages that contain information about bandwidth, delay, load, reliability and MTU of the path to each destination as known by the advertising router. Each router uses these parameters to compute the resulting distance to a destination. No further topological information is present in the messages. This principle fully corresponds to the operation of distance-vector protocols. Therefore, EIGRP is in essence a distance-vector protocol.
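The distance-vector mode of operation described above can be sketched as a single step of distributed Bellman-Ford: a router folds a neighbor's advertised vector into its own table, adding the link cost and keeping whatever is shorter. The router names, networks, and costs below are invented:

```python
def merge_vector(my_table, neighbor, neighbor_vector, link_cost):
    """One distance-vector step (distributed Bellman-Ford).

    Tables map destination -> (distance, next_hop). Only distances
    cross the link; no topology information is ever exchanged.
    Returns True if our table changed (so we would re-advertise).
    """
    changed = False
    for dest, dist in neighbor_vector.items():
        candidate = dist + link_cost
        if candidate < my_table.get(dest, (float("inf"), None))[0]:
            my_table[dest] = (candidate, neighbor)
            changed = True
    return changed

table = {"NetA": (0, None)}   # we are directly attached to NetA
merge_vector(table, "R2", {"NetA": 5, "NetB": 3}, link_cost=2)
print(table)   # {'NetA': (0, None), 'NetB': (5, 'R2')}
```

Note what the receiving router ends up knowing: distances and next hops only, never a map of the network. That is the property that makes EIGRP, despite its refinements, a distance-vector protocol.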

It is true that EIGRP uses a number of techniques not present in naïve distance-vector protocols, notably

  • the use of explicit hello packets to discover and maintain adjacencies between routers;

  • the use of a reliable protocol to transport routing updates;

  • the use of a feasibility condition to select a loop-free path;

  • the use of diffusing computations to involve the affected part of network into computing a new shortest path

None of these techniques, however, makes any difference to the basic principles of EIGRP, which exchanges a vector of distances to each known destination network without full knowledge of the network topology, and, as a matter of fact, similar techniques have been used in other distance-vector protocols (notably DSDV and AODV). While EIGRP is indeed an advanced distance-vector routing protocol, it is not a hybrid protocol.

An example of a true hybrid routing protocol would be the multi-area Open Shortest Path First (OSPF) protocol. The intra-area routing in OSPF is done using the link-state approach, as each area knows its precise internal topology. Inter-area routing in OSPF is done using the distance-vector approach—the networks outside an area are known only by their distance, not by their exact topology.

Other details

EIGRP is able to deal with Classless Inter-Domain Routing (CIDR), allowing the use of variable-length subnet masks—one of the protocol's main advantages over its predecessor. Its main disadvantage is that it runs only on Cisco equipment, which may lead to an organization being locked in to this vendor. Also, EIGRP is not usable in applications where routers need to know the exact network topology (for example, traffic engineering in MPLS).

EIGRP can run separate routing processes for IP, IPv6, IPX and AppleTalk through the use of protocol-dependent modules (PDMs). However, this does not facilitate translation between protocols.

Example of setting up EIGRP on a Cisco IOS router using classful IP addressing:

Router> enable
Router# config terminal
Router(config)# router eigrp ?
  <1-65535>  Autonomous system number
Router(config)# router eigrp 1
Router(config-router)# network 192.168.0.0
Router(config-router)# end

Example of setting up EIGRP on a Cisco IOS router using classless IP addressing. The 0.0.15.255 wildcard in this example indicates a subnetwork with a maximum of 4094 hosts—it is the bitwise complement of the subnet mask 255.255.240.0. The no auto-summary command prevents automatic route summarization on classful boundaries, which would otherwise result in routing loops in discontiguous networks.

Router> enable
Router# config terminal
Router(config)# router eigrp 1
Router(config-router)# network 10.201.96.0 ?
  A.B.C.D  EIGRP wild card bits
Router(config-router)# network 10.201.96.0 0.0.15.255
Router(config-router)# no auto-summary
Router(config-router)# end
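The relationship between the subnet mask and the wildcard used in the `network` statement above is a simple octet-wise bitwise complement, which can be checked in a couple of lines of Python:

```python
def wildcard(mask):
    """Bitwise complement of a dotted-decimal subnet mask, as used
    in IOS `network` statements and access lists."""
    return ".".join(str(255 - int(octet)) for octet in mask.split("."))

print(wildcard("255.255.240.0"))   # 0.0.15.255
print(wildcard("255.255.255.0"))   # 0.0.0.255
```

The mask 255.255.240.0 leaves 12 host bits, i.e. 2^12 - 2 = 4094 usable host addresses, which matches the figure quoted above.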





Open Shortest Path First (OSPF) is a dynamic routing protocol for use in Internet Protocol (IP) networks. Specifically, it is a link-state routing protocol and falls into the group of interior gateway protocols, operating within an autonomous system (AS). It is defined as OSPF Version 2 in RFC 2328 (1998) for IPv4[1]. The updates for IPv6 are specified as OSPF Version 3 in RFC 5340 (2008)[2].

OSPF is perhaps the most widely-used interior gateway protocol (IGP) in large enterprise networks; IS-IS, another link-state routing protocol, is more common in large service provider networks. The most widely-used exterior gateway protocol (EGP) is BGP.

OSPF routes packets based solely on the destination IP address found in IP packets. It was designed to support variable-length subnet masking (VLSM, CIDR). OSPF detects changes in the topology, such as link failures, very quickly and converges on a new loop-free routing structure within seconds. For this, each OSPF router collects link-state information to construct the entire network topology of so-called "areas", from which it computes the shortest path tree for each route using a method based on Dijkstra's algorithm. The link-state information is maintained on each router as a link-state database (LSDB), which is a tree-image of the network topology. Identical copies of the LSDB are periodically updated through flooding on all routers in each OSPF-aware area (region of the network included in an OSPF area type - see "Area types" below).

By convention, area 0 represents the core or "backbone" region of an OSPF-enabled network, and other OSPF area numbers may be designated to serve other regions of an enterprise (large, business) network; however, every additional OSPF area must have a direct or virtual connection to the backbone OSPF area. The backbone area has the identifier 0.0.0.0. Inter-area routing goes via the backbone.

Routers in the same broadcast domain or at each end of a point-to-point telecommunications link form adjacencies when they have detected each other. This detection occurs when a router "sees" itself in a hello packet from a neighbor; this is called two-way state and is the most basic relationship. On broadcast media such as Ethernet, and on Frame Relay, routers elect a designated router (DR) and a backup designated router (BDR), which act as a hub to reduce traffic between routers. OSPF uses both unicast and multicast to send hello packets and link-state updates. The multicast addresses 224.0.0.5 (all SPF/link-state routers, also known as AllSPFRouters) and 224.0.0.6 (all designated routers, AllDRouters) are reserved for OSPF (RFC 2328). In contrast to the Routing Information Protocol (RIP) or the Border Gateway Protocol (BGP), OSPF does not use TCP or UDP but runs directly over IP, as IP protocol 89. OSPF handles its own error detection and correction, negating the need for TCP or UDP functions.

The OSPF Protocol can operate securely between routers, optionally using a clear-text password or using MD5 to authenticate peers before forming adjacencies and before accepting link-state advertisements (LSA). A natural successor to the Routing Information Protocol (RIP), it was classless, or able to use Classless Inter-Domain Routing, from its inception. Multicast extensions to OSPF, the Multicast Open Shortest Path First (MOSPF) protocols, have been defined but these are not widely used at present.

Neighbour relationships

As a link-state routing protocol, OSPF establishes and maintains neighbour relationships in order to exchange routing updates with other routers. The neighbour relationship table is called an adjacency database in OSPF. Provided that OSPF is configured correctly, a router forms neighbour relationships only with the routers to which it is directly connected, and only when they are in the same area as the interface it uses to form the relationship. An interface can only belong to a single area.

Area types

An OSPF network is divided into areas, which have 32-bit area identifiers commonly, but not always, written in the dotted decimal format of an IP address. Area identifiers are not IP addresses and may duplicate, without conflict, any IP address. While most OSPF implementations will right-justify an area number written in other than dotted decimal format (e.g., area 1), it is wise always to use dotted decimal formats. Most implementations would expand area 1 to the area identifier 0.0.0.1, but some have been known to expand it as 1.0.0.0.
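
The right-justified expansion mentioned above can be sketched in Python (a hypothetical helper, not part of any OSPF implementation):

```python
import socket
import struct

def area_id_to_dotted(area):
    """Right-justified expansion of an integer area number into the
    32-bit dotted-decimal area identifier (so area 1 -> 0.0.0.1)."""
    return socket.inet_ntoa(struct.pack("!I", area))

print(area_id_to_dotted(1))  # 0.0.0.1
print(area_id_to_dotted(0))  # 0.0.0.0
```

The rarer left-justified behaviour described above would instead expand area 1 to 1.0.0.0, which is why writing area identifiers in full dotted decimal avoids ambiguity.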

These are logical groupings of routers whose information may be summarized towards the rest of the network. Several "special" area types are defined:

Backbone area

The backbone area (also known as area zero or area 0.0.0.0) forms the core of an OSPF network. All other areas are connected to it, and inter-area routing happens via routers connected to the backbone area and to their own non-backbone areas. It is the logical and physical structure for the 'OSPF domain' and is attached to all nonzero areas in the OSPF domain. Note that in OSPF the term Autonomous System Border Router (ASBR) is historic, in the sense that many OSPF domains can coexist in the same Internet-visible autonomous system, RFC1996 (ASGuidelines 1996, p. 25) [3].

The backbone area is responsible for distributing routing information between nonbackbone areas. The backbone must be contiguous, but it does not need to be physically contiguous; backbone connectivity can be established and maintained through the configuration of virtual links.

All OSPF areas must connect to the backbone area. This connection, however, can be through a virtual link. For example, assume area 0.0.0.1 has a physical connection to area 0.0.0.0. Further assume that area 0.0.0.2 has no direct connection to the backbone, but this area does have a connection to area 0.0.0.1. Area 0.0.0.2 can use a virtual link through the transit area 0.0.0.1 to reach the backbone. To be a transit area, an area has to have the transit attribute, so it cannot be stubby in any way.

Stub area

A stub area is an area which does not receive external routes except the default route, but does receive inter-area routes. This kind of area is useful when, for example, all Internet access goes through autonomous system border routers (ASBRs) in Area 0.0.0.0, but there are multiple paths to other nonzero areas in the OSPF domain.

All routers in the area need to agree they are stub, so that they do not generate types of LSA not appropriate to a stub area. The Type 3 LSA for the default route is the only external that should enter the area, and none of its routers may generate externals.

Stub areas do receive inter-area (IA) routes, advertised with Type 3 and Type 4 LSAs. If the stub area has more than one area border router (ABR), the information on other non-backbone areas allows the routers in the stub area to pick the best route to another area.

Stub areas do not have the transit attribute and thus cannot be traversed by a virtual link.

Stub areas receive default routes as type 3 network summary LSAs.

Totally stubby area

A totally stubby area (TSA), a nonstandard but useful extension by Cisco [4], is similar to a stub area but additionally blocks summary routes: inter-area (IA) routes are not advertised into totally stubby areas. The only way for traffic to be routed outside the area is via a default route, which is the only Type 3 LSA advertised into the area. With only one route out of the area, the route processor makes fewer routing decisions, which lowers system resource utilization.

Occasionally, it is said that a TSA can have only one ABR. This is not true. If there are multiple ABRs, as might be required for high availability, routers interior to the TSA will send non-intra-area traffic to the ABR with the lowest intra-area metric (the "closest" ABR).

An area can simultaneously be not-so-stubby and totally stubby. This is done when the practical place to put an ASBR, as, for example, with a newly acquired subsidiary, is on the edge of a totally stubby area. In such a case, the ASBR does send externals into the totally stubby area, and they are available to OSPF speakers within that area. In Cisco's implementation, the external routes can be summarized before injecting them into the totally stubby area. In general, the ASBR should not advertise default into the TSA-NSSA, although this can work with extremely careful design and operation, for the limited special cases in which such an advertisement makes sense.

By declaring the totally stubby area as NSSA, no external routes from the backbone, except the default route, enter the area being discussed. The externals do reach area 0.0.0.0 via the TSA-NSSA, but no routes other than the default route enter the TSA-NSSA. Routers in the TSA-NSSA send all traffic to the ABR, except to routes advertised by the ASBR.

Not-so-stubby area

A not-so-stubby area (NSSA) is a type of stub area that can import autonomous system (AS) external routes and send them to the backbone, but cannot receive AS external routes from the backbone or other areas. The NSSA is a non-proprietary extension of the existing stub area feature that allows the injection of external routes in a limited fashion into the stub area.

Cisco also implements a proprietary version of a NSSA called a NSSA totally stubby area. It takes on the attributes of a TSA, meaning that type 3 and type 4 summary routes are not flooded into this type of area. It is also possible to declare an area both totally stubby and not-so-stubby, which means that the area will receive only the default route from area 0.0.0.0, but can also contain an autonomous system border router (ASBR) that accepts external routing information and injects it into the local area, and from the local area into area 0.0.0.0.

Redistribution into an NSSA creates a special type of LSA, known as Type 7, which can exist only in an NSSA. An NSSA ASBR generates this LSA, and an NSSA ABR translates it into a Type 5 LSA, which then gets propagated into the rest of the OSPF domain.

Path preference

OSPF uses path cost as its basic routing metric. The standard deliberately does not equate cost to any particular quantity such as link speed, so the network designer can pick a metric meaningful to the design. In practice, cost is usually derived from the bandwidth of the interface facing the given route, although this tends to need network-specific scaling factors now that links faster than 100 Mbit/s are common. Cisco, for example, uses a default reference bandwidth of 10^8 bit/s, giving a cost of 10^8/bandwidth: a 100 Mbit/s link has a cost of 1, a 10 Mbit/s link a cost of 10, and so on. For links faster than 100 Mbit/s the raw formula would yield a cost below 1, so the reference bandwidth must be raised to keep faster links distinguishable.
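
As a sketch, the Cisco-style cost computation, assuming the default 10^8 bit/s reference bandwidth and flooring the result at 1 (OSPF costs are positive integers):

```python
def ospf_cost(bandwidth_bps, reference_bw_bps=10**8):
    """Cisco-style OSPF interface cost: reference bandwidth divided by
    interface bandwidth, floored at 1."""
    return max(1, reference_bw_bps // bandwidth_bps)

print(ospf_cost(10_000_000))    # 10 Mbit/s -> cost 10
print(ospf_cost(100_000_000))   # 100 Mbit/s -> cost 1
print(ospf_cost(1_000_000_000)) # 1 Gbit/s also cost 1 at the default reference
print(ospf_cost(1_000_000_000, reference_bw_bps=10**10))  # raised reference -> 10
```

Raising the reference bandwidth, as in the last call, restores the distinction between fast links.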

Metrics, however, are only directly comparable when of the same type. There are four types of metrics; the most preferred type is listed first below. An intra-area route is always preferred to an inter-area route regardless of metric, and so on down the list.

  1. Intra-area

  2. Inter-area

  3. External Type 1, which includes both the external path cost and the sum of internal path costs to the ASBR that advertises the route,

  4. External Type 2, the value of which is solely that of the external path cost
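
The ordering above can be illustrated with a small comparison key: routes compare first by type rank, then by metric within the same type (an illustrative sketch; the names are assumptions, not OSPF terminology):

```python
# Type rank: intra-area < inter-area < External Type 1 < External Type 2.
TYPE_RANK = {"intra": 0, "inter": 1, "E1": 2, "E2": 3}

def best_route(routes):
    """routes: iterable of (type, metric) tuples; returns the preferred one.
    Type always dominates; metric only breaks ties within a type."""
    return min(routes, key=lambda r: (TYPE_RANK[r[0]], r[1]))

# An intra-area route wins over a cheaper inter-area route:
print(best_route([("inter", 5), ("intra", 50)]))  # ('intra', 50)
```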

Traffic engineering

OSPF-TE is an extension to OSPF extending the idea of route preference to include traffic engineering (RFC 3630, [5]). The Traffic Engineering extensions to OSPF add dynamic properties to the route calculation algorithm. The properties are:

  • Maximum Reservable bandwidth

  • Unreserved bandwidth

  • Available bandwidth

These fields are distributed between network nodes via the TLV fields of an opaque LSA.

OSPF-TE is commonly used within MPLS and GMPLS networks, as a means to determine the topology over which MPLS paths can be established. MPLS then uses its own path setup and forwarding protocols, once it has the full IP routing map.

Other extensions

RFC3717[6] documents work in optical routing for IP based on "constraint-based" extensions to OSPF and IS-IS.

OSPF router types

OSPF defines the following router types:

  • Area border router (ABR)

  • Autonomous system border router (ASBR)

  • Internal router (IR)

  • Backbone router (BR)

The router types are attributes of an OSPF process. A given physical router may have one or more OSPF processes. For example, a router that is connected to more than one area, and which receives routes from a BGP process connected to another AS, is both an ABR and an ASBR.

Each router has a router identifier, customarily written in the dotted decimal format (e.g.: 1.2.3.4) of an IP address. The way in which the router ID is determined is implementation-specific. The router ID, however, does not have to be a valid IP address or any IP address present in the routing domain, although it frequently will be advertised within the domain for troubleshooting purposes. Do not assume, until it is known how it is configured, that the router ID is anything more than a 32-bit number (e.g., 255.254.253.252 is legal as a router ID).

Do not confuse router types with designated router (DR), or backup designated router (BDR), which is an attribute of a router interface.

Area border router

An ABR is a router that connects one or more OSPF areas to the main backbone network. It is considered a member of all areas it is connected to. An ABR keeps multiple copies of the link-state database in memory, one for each area to which that router is connected.

Autonomous system boundary router

An ASBR is a router that is connected to more than one AS and that exchanges routing information with routers in other ASs. ASBRs typically also run a non-IGP routing protocol (e.g., BGP), or use static routes, or both. An ASBR is used to distribute routes received from other ASs throughout its own AS.

Internal router

An IR is a router that has only OSPF neighbor relationships with routers in the same area.

Backbone router

Backbone Routers: These are routers that are part of the OSPF backbone. By definition, this includes all area border routers, since those routers pass routing information between areas. However, a backbone router may also be a router that connects only to other backbone (or area border) routers, and is therefore not part of any area (other than Area 0).

Note that: an area border router is always a backbone router, but a backbone router is not necessarily an area border router.

Designated router

A designated router (DR) is the router interface elected among all routers on a particular multiaccess network segment, generally assumed to be broadcast multiaccess. Special techniques, often vendor-dependent, may be needed to support the DR function on nonbroadcast multiaccess (NBMA) media. It is usually wise to configure the individual virtual circuits of a NBMA subnet as individual point-to-point lines; the techniques used are implementation-dependent.

Do not confuse the DR with an OSPF router type. A given physical router can have some interfaces that are designated (DR), others that are backup designated (BDR), and others that are non-designated. If no router is DR or BDR on a given subnet, the BDR is elected first, and a second election is then held among the remaining candidates; the winner of that election becomes DR, or, if there is no other candidate, the BDR designates itself DR. The DR is elected based on the following default criteria:

  • If the priority setting on an OSPF router is set to 0, it can never become a DR or BDR (backup designated router).

  • When a DR fails and the BDR takes over, there is another election to see who becomes the replacement BDR.

  • The router sending the Hello packets with the highest priority wins the election.

  • If two or more routers tie with the highest priority setting, the router sending the Hello with the highest RID (router ID) wins. NOTE: a RID is the highest logical (loopback) IP address configured on a router; if no logical/loopback IP address is set, the router uses the highest IP address configured on its active interfaces (e.g., 192.168.0.1 would be higher than 10.1.1.2).

  • Usually the router with the second highest priority number becomes the BDR.

  • Priority values range from 0 to 254; a higher value increases a router's chances of becoming DR or BDR.

  • If a higher-priority OSPF router comes online after the election has taken place, it will not become DR or BDR until (at least) the DR and BDR fail.

  • If the current DR 'goes down' the current BDR becomes the new DR and a new election takes place to find another BDR. If the new DR then 'goes down' and the original DR is now available, it then becomes DR again, but no change is made to the current BDR.
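
The default criteria above can be sketched as follows; this is a simplified model of the initial election only, ignoring the no-preemption rule for routers joining later (names and data shapes are assumptions):

```python
def elect_dr_bdr(routers):
    """routers: list of (priority, router_id) where router_id is a tuple
    like (192, 168, 0, 1) so IDs compare numerically octet by octet.
    Priority-0 routers are ineligible; highest priority wins, and the
    highest router ID breaks ties. Returns (dr, bdr)."""
    eligible = sorted((r for r in routers if r[0] > 0), reverse=True)
    dr = eligible[0] if eligible else None
    bdr = eligible[1] if len(eligible) > 1 else None
    return dr, bdr

routers = [(1, (10, 1, 1, 2)), (1, (192, 168, 0, 1)), (0, (172, 16, 0, 1))]
dr, bdr = elect_dr_bdr(routers)
print(dr)   # (1, (192, 168, 0, 1)) -- the higher RID wins the priority tie
print(bdr)  # (1, (10, 1, 1, 2))
```

The priority-0 router is correctly excluded from both roles.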

DRs exist to reduce network traffic by providing a single source for routing updates. The DR maintains a complete topology table of the network and sends updates to the other routers via multicast. All routers in an area form a master/slave relationship with the DR, and form full adjacencies with the DR and BDR only. Every time a router sends an update, it sends it to the DR and BDR on the multicast address 224.0.0.6; the DR then sends the update out to all other routers in the area on the multicast address 224.0.0.5. This way, the routers do not have to constantly update each other, but instead receive all their updates from a single source; the use of multicasting further reduces the network load. DRs and BDRs are always elected on broadcast networks (e.g., Ethernet). DRs can also be elected on NBMA (non-broadcast multi-access) networks such as Frame Relay or ATM. DRs and BDRs are not elected on point-to-point links (such as a point-to-point WAN connection) because the two routers on either side of the link must become fully adjacent, and the bandwidth between them cannot be further optimized.

Backup designated router

A backup designated router (BDR) is a router that becomes the designated router if the current designated router has a problem or fails. The BDR is the OSPF router with second highest priority at the time of the last election.

OSPF Hello Packet

The OSPFv2 Hello packet layout, shown as 32-bit words:

  Bit offset   0–7              8–15         16–23        24–31
        0      Version          Type         Packet Length
       32      Router ID
       64      Area ID
       96      Checksum                      Authentication Type
      128      Authentication
      160      Authentication
      192      Network Mask
      224      Hello Interval                Options      Router Priority
      256      Router Dead Interval
      288      Designated Router
      320      Backup Designated Router
      352      Neighbor ID
      384      ...
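
The layout above can be packed directly with Python's struct module. Field sizes and offsets follow RFC 2328; as a simplification, the checksum is left at zero here, whereas a real implementation computes the standard Internet checksum over the whole packet:

```python
import socket
import struct

def build_ospf_hello(router_id, area_id, network_mask, hello_int=10,
                     options=0x02, priority=1, dead_int=40,
                     dr="0.0.0.0", bdr="0.0.0.0", neighbors=()):
    """Pack an OSPFv2 Hello (type 1) with a null authentication field.
    Addresses are given as dotted-decimal strings."""
    ip = socket.inet_aton
    # Hello body: mask, hello interval, options, priority, dead interval,
    # DR, BDR, then one 4-byte router ID per known neighbor.
    body = ip(network_mask) + struct.pack("!HBB", hello_int, options, priority)
    body += struct.pack("!I", dead_int) + ip(dr) + ip(bdr)
    for n in neighbors:
        body += ip(n)
    length = 24 + len(body)                    # 24-byte common OSPF header
    header = struct.pack("!BBH", 2, 1, length)  # version 2, type 1 = Hello
    header += ip(router_id) + ip(area_id)
    header += struct.pack("!HH8x", 0, 0)        # checksum, autype, 8 bytes auth
    return header + body

pkt = build_ospf_hello("1.2.3.4", "0.0.0.0", "255.255.255.0",
                       neighbors=["5.6.7.8"])
print(len(pkt))  # 48: 24-byte header + 20-byte fixed Hello + 4 per neighbor
```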

OSPF in broadcast multiple access topologies

Neighbor adjacency is formed dynamically using multicast hello packets to 224.0.0.5. A DR and BDR are elected normally, and function normally.

OSPF in NBMA topologies

RFC 2328 defines the following two official modes for OSPF in NBMA topologies:

  • nonbroadcast

  • point-to-multipoint

Cisco has defined the following three additional modes for OSPF in NBMA topologies:

  • point-to-multipoint nonbroadcast

  • broadcast

  • point-to-point

Miscellany

Applications

OSPF was the first widely deployed routing protocol that could converge a network in the low seconds and guarantee loop-free paths. It has many features for imposing policies on route propagation: keeping routes local where appropriate, load sharing, and selective route importing, more so than IS-IS. IS-IS, in contrast, can be tuned for lower overhead in a stable network, the sort more common in ISP than enterprise networks. There are some historical accidents that made IS-IS the preferred IGP for ISPs, but ISPs today may well choose to use the features of the now-efficient implementations of OSPF[7], after first considering the pros and cons of IS-IS in service provider environments[8].

As mentioned, OSPF can provide better load-sharing on external links than other IGPs. When the default route to an ISP is injected into OSPF from multiple ASBRs as a Type 1 external route with the same external cost specified, other routers will go to the ASBR with the least path cost from their location. This can be tuned further by adjusting the external cost.

In contrast, if the default routes from different ISPs are injected with different external costs, as Type 2 external routes, the lower-cost default becomes the primary exit and the higher-cost becomes the backup only.

Implementations

  • 6WINDGate, commercial embedded open-source routing modules from 6WIND including OSPFv2 and OSPFv3

  • Vyatta, a commercial open-source router / firewall.

  • GNU Zebra, a GPL routing suite for Unix-like systems supporting OSPF

  • Quagga, a fork of GNU Zebra for Unix-like systems

  • OpenBGPD, includes an OSPF implementation

  • XORP, a routing suite implementing RFC2328 (OSPFv2) and RFC2740 (OSPFv3) for both IPv4 and IPv6

  • BIRD (http://bird.network.cz) implements RFC2328 OSPF

  • GateD project included an RFC1583 OSPF implementation (UMD OSPF by University of Maryland).

RFC history

  • 1989, October - First put forward as a proposed standard as RFC 1131.

  • 1994, The OSPF NSSA Option, RFC 1587.

  • 1994, March - Multicast extensions to OSPF proposed as RFC 1584.

  • 1997, July - OSPF version 2, as proposed in RFC 2178

  • 1998, April - OSPF version 2, updated in RFC 2328, standard 54.

  • 1999, December - OSPFv3 - OSPF for IPv6, RFC 2740.

  • 2003, January - The OSPF NSSA Option updated, RFC 3101

  • 2005, October - Prioritized Treatment of Specific OSPF Version 2 Packets and Congestion Avoidance, RFC 4222

  • 2006, December - OSPF Version 2 Management Information Base, RFC 4750

  • 2007, May - OSPF Version 3 Management Information Base, draft state

  • 2008, July - OSPF for IPv6, RFC 5340 (obsoletes RFC 2740)




The Routing Information Protocol (RIP) is a dynamic routing protocol used in local area networks. As such it is classified as an interior gateway protocol (IGP) using the distance-vector routing algorithm. It was first defined in RFC 1058 (1988). The protocol has since been extended several times, resulting in RIP Version 2 (RFC 2453). The original version is now known as RIP Version 1. Both versions are still in use today; however, they are considered technically obsoleted by more advanced techniques, such as Open Shortest Path First (OSPF) and the OSI protocol IS-IS. RIP has also been adapted for use with IPv6, the next generation of the Internet Protocol, as RIPng (RFC 2080, 1997).

History

The routing algorithm used in RIP, the Bellman-Ford algorithm, was first deployed in a computer network in 1968, as the initial routing algorithm of the ARPANET.

The earliest version of the specific protocol that became RIP was the Gateway Information Protocol, part of Xerox PARC's PARC Universal Packet internetworking protocol suite. A later version, named the Routing Information Protocol, was part of Xerox Network Services.

A version of RIP which supported the Internet Protocol (IP) was later included in the Berkeley Software Distribution (BSD) of the Unix operating system as the routed daemon, and various other vendors would implement their own implementations of the routing protocol. Eventually RFC 1058 was issued to unify the various implementations under a single standard.

Technical details

RIP is a distance-vector routing protocol, which employs the hop count as a routing metric. The maximum number of hops allowed with RIP is 15, and the hold-down time is 180 seconds. Originally, each RIP router transmitted full updates every 30 seconds by default; routing tables were then small enough that the traffic was not significant.

As networks grew in size, however, it became evident there could be a massive traffic burst every 30 seconds, even if the routers had been initialized at random times. It had been thought that, as a result of random initialization, the routing updates would spread out in time, but this was not true in practice. Sally Floyd and Van Jacobson published research in 1994 [1] showing that having all routers use a fixed 30-second timer was a very bad idea: without slight randomization of the update timer, the timers synchronized over time, so that all routers sent their updates at the same moment. Modern RIP implementations introduce deliberate variation into the update timer of each router.

RIP prevents routing loops from continuing indefinitely by implementing a limit on the number of hops allowed in a path from the source to a destination. This hop limit, however, limits the size of networks that RIP can support.

RIP implements the split horizon and holddown mechanisms to prevent incorrect routing information from being propagated. These are some of the stability features of RIP.
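
A minimal distance-vector sketch of the mechanisms above, with the 15-hop limit (metric 16 treated as infinity) and split horizon; the function names and table shape are assumptions for illustration, and timers and holddown are omitted:

```python
RIP_INFINITY = 16  # metric 16 means unreachable, capping paths at 15 hops

def process_rip_update(table, neighbor, advertised):
    """Merge one neighbor's advertised routes into our table.
    table and advertised map prefix -> metric; table values are
    (metric, next_hop) so routes learned via `neighbor` can be replaced."""
    for prefix, metric in advertised.items():
        new_metric = min(metric + 1, RIP_INFINITY)  # one hop to reach neighbor
        current = table.get(prefix)
        # Accept if the route is new, strictly better, or comes from the
        # neighbor we already use as next hop (so worsening propagates too).
        if current is None or new_metric < current[0] or current[1] == neighbor:
            table[prefix] = (new_metric, neighbor)

def split_horizon(table, neighbor):
    """Advertise everything except routes learned from this neighbor."""
    return {p: m for p, (m, nh) in table.items() if nh != neighbor}

table = {}
process_rip_update(table, "A", {"10.0.0.0/8": 1})
print(table)                     # {'10.0.0.0/8': (2, 'A')}
print(split_horizon(table, "A")) # {} -- not echoed back to A
```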

In many current networking environments RIP would not be the first choice for routing, as its convergence time and scalability are poor compared to EIGRP, OSPF, or IS-IS (the latter two being link-state routing protocols), and its hop limit severely restricts the size of network it can be used in. On the other hand, it is easy to configure: with minimal settings, RIP requires no parameters on a router, whereas all the other protocols require one or more.

RIP is a UDP-based protocol (cf. User Datagram Protocol), using UDP port 520 and running on top of the transport layer. As such it really is a protocol used by routing applications (such as routed) to exchange routing-table information with other nodes, and should be considered to operate in the application layer of the TCP/IP model. However, it is also often placed in the network layer because it supports the network, a designation preferred by many authors using the OSI reference model, even though this breaks the often-discussed encapsulation hierarchy of OSI.

Versions

There are three IP versions of RIP, RIPv1, RIPv2, and RIPng.

RIPv1

RIPv1, defined in RFC 1058, uses classful routing. The periodic routing updates do not carry subnet information, lacking support for variable length subnet masks (VLSM). This limitation makes it impossible to have different-sized subnets inside of the same network class. In other words, all subnets in a network class must be the same size. There is also no support for router authentication, making RIPv1 slightly vulnerable to various attacks.

RIPv2

Due to the above deficiencies of RIPv1, RIPv2 was developed in 1994 and included the ability to carry subnet information, thus supporting Classless Inter-Domain Routing (CIDR). However to maintain backwards compatibility the 15 hop count limit remained. Rudimentary plain text authentication was added to secure routing updates; later, MD5 authentication was defined in RFC 2082. Also, in an effort to avoid waking up hosts that do not participate in the routing protocol, RIPv2 multicasts routing updates to 224.0.0.9, as opposed to RIPv1 which uses broadcast.

RIPv2 is specified in RFC 2453 or STD 56.

RIPng

RIPng, defined in RFC 2080, is an extension of the RIPv2 protocol to support IPv6. The main differences between RIPv2 and RIPng are:

  • RIPv2 supports RIP updates authentication, RIPng does not (IPv6 routers were, at the time, supposed to use IPsec for authentication);

  • RIPv2 allows attaching arbitrary tags to routes, RIPng does not;

  • RIPv2 encodes the next hop into each route entry; RIPng requires specific encoding of the next hop for a set of route entries.





The Exterior Gateway Protocol (EGP) is a now obsolete routing protocol for the Internet originally specified in 1982 by Eric C. Rosen of Bolt, Beranek and Newman, and David L. Mills. It was first described in RFC 827 and formally specified in RFC 904 (1984). EGP is a simple reachability protocol, and, unlike modern distance-vector and path-vector protocols, it is limited to tree-like topologies.

During the early days of the Internet, an exterior gateway protocol, EGP version 3, was used to interconnect autonomous systems. EGP3 should not be confused with EGPs in general. Currently, Border Gateway Protocol (BGP) is the accepted standard for Internet routing and has essentially replaced the more limited EGP3.






The Border Gateway Protocol (BGP) is the core routing protocol of the Internet. It works by maintaining a table of IP networks or 'prefixes' which designate network reachability among autonomous systems (AS). It is described as a path vector protocol. BGP does not use traditional IGP metrics, but makes routing decisions based on path, network policies and/or rulesets.

BGP was created to replace the EGP routing protocol to allow fully decentralized routing in order to allow the removal of the NSFNet Internet backbone network. This allowed the Internet to become a truly decentralized system. Since 1994, version four of the protocol has been in use on the Internet. All previous versions are now obsolete. The major enhancement in version 4 was support of Classless Inter-Domain Routing and use of route aggregation to decrease the size of routing tables. From January 2006, version 4 is codified in RFC 4271, which went through well over 20 drafts from the earlier RFC 1771 version 4. The RFC 4271 version corrected a number of errors, clarified ambiguities, and also brought the RFC much closer to industry practices.

Most Internet users do not use BGP directly. However, since most Internet service providers must use BGP to establish routing between one another (especially if they are multihomed), it is one of the most important protocols of the Internet. Compare this with Signalling System 7 (SS7), which is the inter-provider core call setup protocol on the PSTN. Very large private IP networks can make use of BGP, however. An example would be the joining of a number of large Open Shortest Path First (OSPF) networks where OSPF by itself would not scale to size. Another reason to use BGP would be multihoming a network for better redundancy either to a multiple access points of a single ISP (RFC 1998) or to multiple ISPs.

BGP operation

BGP neighbors, or peers, are established by manual configuration between routers to create a TCP session on port 179. A BGP speaker will periodically send 19-byte keep-alive messages to maintain the connection (every 60 seconds by default). Among routing protocols, BGP is unique in using TCP as its transport protocol.
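
The 19-byte keepalive is simply the fixed BGP message header with no body: a 16-byte all-ones marker, a two-byte length of 19, and the KEEPALIVE type code 4. A sketch:

```python
import struct

def bgp_keepalive():
    """Build a BGP KEEPALIVE message: the 19-byte header alone
    (16-byte all-ones marker, 2-byte length, 1-byte type = 4)."""
    return b"\xff" * 16 + struct.pack("!HB", 19, 4)

msg = bgp_keepalive()
print(len(msg))  # 19
```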

When BGP is running inside an autonomous system (AS), it is referred to as Internal BGP (iBGP, Interior Border Gateway Protocol). When BGP runs between ASes, it is called External BGP (eBGP, Exterior Border Gateway Protocol). Routers that sit on the boundary of one AS and exchange information with another AS are called border or edge routers. In the Cisco operating system, iBGP routes have an administrative distance of 200, making them less preferred than either external BGP routes or routes from any interior routing protocol. Other router implementations also prefer eBGP to IGPs, and IGPs to iBGP.

Optional Extensions negotiated at Connection Setup

During the OPEN handshake, BGP speakers can negotiate[1] optional capabilities of the session, including multiprotocol extensions and various recovery modes. If the multiprotocol extensions to BGP [2] are negotiated at the time of creation, the BGP speaker can prefix the Network Layer Reachability Information (NLRI) it advertises with an address family prefix. These families include the default IPv4, but also IPv6, IPv4 and IPv6 Virtual Private Networks, and multicast BGP. Increasingly, BGP is used as a generalized signaling protocol to carry information about routes that may not be part of the global Internet, such as VPNs [3].

Finite state machine

BGP state machine

BGP state machine

In order to make decisions in its operations with other BGP peers, a BGP peer uses a simple finite state machine (FSM) that consists of six states: Idle, Connect, Active, OpenSent, OpenConfirm, and Established. For each peer-to-peer session, a BGP implementation maintains a state variable that tracks which of these six states the session is in. The BGP protocol defines the messages that each peer should exchange in order to change the session from one state to another. The first state is Idle: here BGP initializes all resources, refuses all inbound BGP connection attempts, and initiates a TCP connection to the peer. The second state is Connect: the router waits for the TCP connection to complete, transitioning to the OpenSent state if successful; if not, it resets the ConnectRetry timer and transitions to the Active state upon its expiration. In the Active state, the router resets the ConnectRetry timer to zero and returns to the Connect state. In OpenSent, the router sends an Open message and waits for one in return. Keepalive messages are exchanged next and, upon successful receipt, the session moves to the Established state, in which the router can send and receive Keepalive, Update, and Notification messages to and from its peer.

Idle State:

  • Initializes resources for the BGP process.

  • Tries to establish a TCP connection with its configured BGP peer.

  • Listens for a TCP connection from its peer. If an error occurs at any state of the FSM process, the BGP session is terminated immediately and returned to the Idle state. Some of the reasons why a router does not progress from the Idle state are:

    • TCP port 179 is not open.

    • A random TCP port over 1023 is not open.

    • Peer address configured incorrectly on either router.

    • AS number configured incorrectly on either router.


Connect State

  • Wait for successful TCP negotiation with peer.

  • BGP does not spend much time in this state if the TCP session has been successfully established.

  • Sends OPEN message to peer.

  • If an error occurs, BGP moves to the ACTIVE state. Some reasons for the error are:

    • TCP port 179 is not open.

    • A random TCP port over 1023 is not open.

    • Peer address configured incorrectly on either router.

    • AS number configured incorrectly on either router.

Active State

  • If the router was unable to establish a successful TCP session, then it ends up in the ACTIVE state.

  • The router will try to restart another TCP session with the peer and if successful, then it will send an OPEN message to the peer.

  • If it is unsuccessful again, the FSM is reset to the IDLE state.

  • If you see a router cycling between the IDLE and the ACTIVE state, here are some of the reasons:

    • TCP port 179 is not open.

    • A random TCP port over 1023 is not open.

    • BGP configuration error.

    • Network congestion.

    • Flapping network interface.

OpenSent State

  • The router listens for an OPEN message from its peer.

  • Once the message has been received, the router checks the validity of the OPEN message.

  • If there is an error, it is because one of the fields in the OPEN message does not match between the peers, e.g. BGP version mismatch, MD5 password mismatch, or the peering router expecting a different My AS value. The router then sends a NOTIFICATION message to the peer indicating why the error occurred.

  • If there is no error, a KEEPALIVE message is sent.

OpenConfirm State

  • The router listens for a KEEPALIVE message from its peer.

  • If a message is received, then BGP transitions to the next state.

  • If no KEEPALIVE message is received, the router transitions back to the IDLE state.

Established State

  • In this state, the peers send UPDATE messages to exchange information about each route being advertised to the BGP peer.

  • If there is any error in the UPDATE message, then a NOTIFICATION message is sent to the peer, and BGP transitions back to the IDLE state.
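The six states and the principal transitions described above can be sketched as a small table-driven FSM. State names follow RFC 4271; the event names and dispatch logic here are illustrative simplifications (a real implementation also handles timers, error subcodes, and connection collision detection).

```python
# Illustrative sketch of the BGP session FSM described above.
# Any unhandled event or error resets the session to Idle.

TRANSITIONS = {
    ("Idle", "Start"): "Connect",                  # initialize resources, open TCP
    ("Connect", "TcpEstablished"): "OpenSent",     # TCP up: send OPEN
    ("Connect", "TcpFailed"): "Active",            # retry via the Active state
    ("Active", "ConnectRetryExpired"): "Connect",
    ("Active", "TcpEstablished"): "OpenSent",
    ("OpenSent", "OpenReceived"): "OpenConfirm",   # valid OPEN: send KEEPALIVE
    ("OpenConfirm", "KeepaliveReceived"): "Established",
}

def step(state: str, event: str) -> str:
    """Return the next state; anything unexpected returns to Idle."""
    return TRANSITIONS.get((state, event), "Idle")

# A session setup that fails TCP once, then succeeds, walks through all states:
state = "Idle"
for event in ("Start", "TcpFailed", "ConnectRetryExpired",
              "TcpEstablished", "OpenReceived", "KeepaliveReceived"):
    state = step(state, event)
print(state)  # Established
```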

Basic BGP UPDATES

Once a BGP session is running, the BGP speakers exchange UPDATE messages about destinations to which the speaker offers connectivity. In the protocol, the basic CIDR route description is called Network Layer Reachability Information (NLRI). The NLRI comprises the destination prefix and its length; the path of autonomous systems to the destination, the next hop, and other information are carried in path attributes, which can convey a wide range of additional information that affects the acceptance policy of the receiving router. BGP speakers incrementally announce new NLRI to which they offer reachability, and also announce withdrawals of prefixes to which the speaker no longer offers connectivity.

BGP Router Connectivity and Learning Routes

In the simplest arrangement all routers within a single AS and participating in BGP routing must be configured in a full mesh: each router must be configured as peer to every other router. This causes scaling problems, since the number of required connections grows quadratically with the number of routers involved. To get around this, two solutions are built into BGP: route reflectors (RFC 4456) and confederations (RFC 5065). For the following discussion of basic UPDATE processing, assume a full iBGP mesh.

Basic UPDATE Processing

A given BGP router may accept NLRI in UPDATEs from multiple neighbors and advertise NLRI to the same, or a different set, of neighbors. Conceptually, BGP maintains its own "master" routing table, called the Loc-RIB (Local Routing Information Base), separate from the main routing table of the router. For each neighbor, the BGP process maintains a conceptual Adj-RIB-In (Adjacent Routing Information Base, Incoming) containing the NLRI received from the neighbor, and a conceptual Adj-RIB-Out (Outgoing) for NLRI to be sent to the neighbor.

"Conceptual", in the preceding paragraph, means that the physical storage and structure of these various tables are decided by the implementer of the BGP code. Their structure is not visible to other BGP routers, although they usually can be interrogated with management commands on the local router. It is quite common, for example, to store both Adj-RIBs and the Loc-RIB in the same data structure, with additional information attached to the RIB entries. The additional information tells the BGP process such things as whether individual entries belong in the Adj-RIBs for specific neighbors, whether the per-neighbor route selection process made received policies eligible for the Loc-RIB, and whether Loc-RIB entries are eligible to be submitted to the local router's routing table management process.

By "eligible to be submitted", BGP will submit the routes that it considers best to the main routing table process. Depending on the implementation of that process, the BGP route is not necessarily selected. For example, a directly connected prefix, learned from the router's own hardware, is usually most preferred. As long as that directly connected route's interface is active, the BGP route to the destination will not be put into the routing table. Once the interface goes down, and there are no more preferred routes, the Loc-RIB route would be installed in the main routing table. Until recently, it was a common mistake to say "BGP carries policies". BGP really carried the information with which rules inside BGP-speaking routers could make policy decisions. Some of the information carried that is explicitly intended to be used in policy decisions are communities and multi-exit discriminators (MED).

Route Selection

The BGP standard specifies a number of decision factors, more than are used by any other common routing process, for selecting NLRI to go into the Loc-RIB. The first decision point for evaluating NLRI is that its next-hop attribute must be reachable (or resolvable). Another way of saying the next-hop must be reachable is that there must be an active route, already in the main routing table of the router, to the prefix in which the next-hop address is located.

Next, for each neighbor, the BGP process applies various standard and implementation-dependent criteria to decide which routes conceptually should go into the Adj-RIB-In. The neighbor could send several possible routes to a destination, but the first level of preference is at the neighbor level. Only one route to each destination will be installed in the conceptual Adj-RIB-In. This process will also delete, from the Adj-RIB-In, any routes that are withdrawn by the neighbor.

Whenever a conceptual Adj-RIB-In changes, the main BGP process decides if any of the neighbor's new routes are preferred to routes already in the Loc-RIB. If so, it replaces them. If a given route is withdrawn by a neighbor, and there is no other route to that destination, the route is removed from the Loc-RIB, and no longer sent, by BGP, to the main routing table manager. If the router does not have a route to that destination from any non-BGP source, the withdrawn route will be removed from the main routing table.

Per-Neighbour Decisions

After verifying that the next hop is reachable, if the route comes from an internal (i.e., iBGP) peer, the first rule the standard applies is to examine the LOCAL_PREF attribute. If there are several iBGP routes from the neighbour, the one with the highest LOCAL_PREF is selected, unless several routes share the same LOCAL_PREF, in which case the route selection process moves to the next tie-breaker. While LOCAL_PREF is the first rule in the standard once reachability of the NEXT_HOP is verified, Cisco and several other vendors first consider a decision factor called WEIGHT, which is local to the router (i.e., not transmitted by BGP). The route with the highest WEIGHT is preferred.

LOCAL_PREF, WEIGHT, and other criteria can be manipulated by local configuration and software capabilities. Such manipulation, although commonly used, is outside the scope of the standard. For example, the COMMUNITY attribute (see below) is not directly used by the BGP selection process, but the BGP neighbour process can have a manually programmed rule to set LOCAL_PREF or another factor if the COMMUNITY value matches some pattern-matching criterion. If the route was learned from an external peer, the per-neighbour BGP process computes a LOCAL_PREF value from local policy rules and then compares the LOCAL_PREF of all routes from the neighbour.

At the per-neighbour level (ignoring implementation-specific policy modifiers), the order of tie-breaking rules is:

  1. Prefer the route with the shortest AS_PATH. An AS_PATH is the sequence of AS numbers that must be traversed to reach the advertised destination; AS1-AS2-AS3 is shorter than AS4-AS5-AS6-AS7.

  2. Prefer routes with the lowest value of their ORIGIN attribute.

  3. Prefer routes with the lowest MULTI_EXIT_DISC (multi-exit discriminator or MED) value.

Before the most recent edition of the BGP standard, if an UPDATE had no MULTI_EXIT_DISC value, several implementations created a MED with the lowest possible value. The current standard, however, specifies that missing MEDs are to be treated as the highest possible value. Since the current rule may cause different behaviour than the vendor interpretations, BGP implementations that used the nonstandard default value offer a configuration feature that allows the old or the standard rule to be selected.
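The missing-MED ambiguity described above can be sketched in a few lines. The `missing_as_worst` flag stands in for the vendor configuration knob mentioned in the text; its name is hypothetical:

```python
# Lower MED wins, so how a missing MULTI_EXIT_DISC is defaulted changes the
# outcome: the current standard treats it as the highest possible value,
# while older implementations treated it as the lowest (most preferred).
MED_MAX = 2**32 - 1  # MED is a four-octet value

def effective_med(med, missing_as_worst=True):
    if med is not None:
        return med
    return MED_MAX if missing_as_worst else 0

# Standard rule: a route with no MED loses to a route with MED 100.
print(effective_med(None) < effective_med(100))                          # False
# Legacy rule: a route with no MED beats a route with MED 100.
print(effective_med(None, missing_as_worst=False) < effective_med(100))  # True
```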

Decision Factors at the Loc-RIB Level

Once candidate routes are received from neighbors, the Loc-RIB software applies additional tie-breakers to routes to the same destination.

  1. If at least one route was learned from an external neighbor (i.e., the route was learned from eBGP), drop all routes learned from iBGP.

  2. Prefer the route with the lowest interior cost to the NEXT_HOP, according to the main Routing Table. If two neighbors advertised the same route, but one neighbor is reachable via a low-bandwidth link and the other by a high-bandwidth link, and the interior routing protocol calculates lowest cost based on highest bandwidth, the route through the high-bandwidth link would be preferred and other routes dropped.

If there is more than one route still tied at this point, several BGP implementations offer a configurable option to load-share among the routes, accepting all (or all up to some number).

  3. Prefer the route learned from the BGP speaker with the numerically lowest BGP identifier.

  4. Prefer the route learned from the BGP speaker with the lowest peer IP address.
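The tie-breaking rules above can be illustrated as a composite sort key. This is a simplified sketch: the route fields are hypothetical dictionary keys, WEIGHT and LOCAL_PREF are omitted, and real implementations apply many more criteria and vendor-specific steps.

```python
# Composite key implementing, in order: shorter AS_PATH, lower ORIGIN,
# lower MED, eBGP over iBGP, lower interior (IGP) cost to the NEXT_HOP,
# lower BGP identifier, lower peer IP address.
def bgp_sort_key(route):
    return (
        len(route["as_path"]),
        route["origin"],            # IGP=0 < EGP=1 < INCOMPLETE=2
        route["med"],
        0 if route["ebgp"] else 1,  # prefer routes learned via eBGP
        route["igp_cost"],
        route["router_id"],
        route["peer_ip"],
    )

routes = [
    {"as_path": [64500, 64501], "origin": 0, "med": 0, "ebgp": False,
     "igp_cost": 10, "router_id": "10.0.0.1", "peer_ip": "10.0.0.1"},
    {"as_path": [64502], "origin": 0, "med": 0, "ebgp": True,
     "igp_cost": 20, "router_id": "10.0.0.2", "peer_ip": "10.0.0.2"},
]
best = min(routes, key=bgp_sort_key)
print(best["as_path"])  # [64502] -- the shorter AS_PATH wins first
```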

Communities

BGP communities are attribute tags that can be applied to incoming or outgoing prefixes to achieve some common goal (RFC 1997). While it is common to say that BGP allows an administrator to set policies on how prefixes are handled by ISPs, this is generally not possible, strictly speaking. For instance, BGP natively has no concept to allow one AS to tell another AS to restrict advertisement of a prefix to only North American peering customers. Instead, an ISP generally publishes a list of well-known or proprietary communities with a description for each one, which essentially becomes an agreement of how prefixes are to be treated. Examples of common communities include local preference adjustments, geographic or peer type restrictions, DoS avoidance (black holing), and AS prepending options. An ISP might state that any routes received from customers with community XXX:500 will be advertised to all peers (default) while community XXX:501 will restrict advertisement to North America. The customer simply adjusts their configuration to include the correct community(ies) for each route, and the ISP is responsible for controlling who the prefix is advertised to. It should be noted that the end user has no technical ability to enforce correct actions being taken by the ISP, though problems in this area are generally rare and accidental.

It is a common tactic for end customers to use BGP communities (usually ASN:70,80,90,100) to control the local preference the ISP assigns to advertised routes, instead of using MED (the effect is similar). Note that the community attribute is transitive, but communities applied by the customer very rarely become propagated outside the next-hop AS.
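The convention described above, in which the ISP maps customer-set community values to a local preference at import time, can be sketched as a lookup. The community values and preference numbers below are hypothetical examples of an ISP's published list, not a standard:

```python
# Hypothetical ISP policy: community ASN:70..100 selects the LOCAL_PREF
# the ISP assigns when importing the customer's route.
LOCAL_PREF_BY_COMMUNITY = {
    "64500:70": 70,
    "64500:80": 80,
    "64500:90": 90,
    "64500:100": 100,
}
DEFAULT_LOCAL_PREF = 100

def import_local_pref(communities):
    """Return the LOCAL_PREF implied by the first matching community."""
    for c in communities:
        if c in LOCAL_PREF_BY_COMMUNITY:
            return LOCAL_PREF_BY_COMMUNITY[c]
    return DEFAULT_LOCAL_PREF

print(import_local_pref(["64500:70"]))  # 70 -- customer lowered its own preference
```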

Extended Communities

The BGP Extended Community Attribute was added in 2006 in order to extend the range of such attributes and to provide structure for community attributes by means of a type field. The extended format consists of one or two octets for the type field, followed by seven or six octets for the respective community attribute content. The definition of this Extended Community Attribute is documented in RFC 4360. The IANA administers the registry for BGP Extended Community Types [4]. The Extended Communities Attribute itself is a transitive optional BGP attribute; however, a bit in the type field within the attribute determines whether the encoded extended community is transitive or non-transitive, and the IANA registry therefore provides different number ranges for the attribute types. Due to the extended attribute range, it can be used in many ways. RFC 4360 defines, as examples, the "Two-Octet AS Specific Extended Community", the "IPv4 Address Specific Extended Community", the "Opaque Extended Community", the "Route Target Community" and the "Route Origin Community". A number of BGP QoS drafts [5] also use this Extended Community Attribute structure for inter-domain QoS signalling.
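Decoding an eight-octet extended community as laid out in RFC 4360 can be sketched as follows; the high-order octet carries the type, and its second-highest bit is the transitive bit (0 meaning transitive across ASes):

```python
# Sketch of decoding an 8-octet BGP extended community (RFC 4360 layout).
import struct

def parse_extended_community(data: bytes):
    assert len(data) == 8
    type_high = data[0]
    return {
        "type": type_high,
        "transitive": (type_high & 0x40) == 0,  # T-bit: 0 = transitive
        "value": data[1:],
    }

# Two-Octet AS Specific Extended Community (type 0x00, subtype 0x02 = Route
# Target): 2-octet AS 64500, 4-octet local administrator value 100.
ec = struct.pack("!BBHI", 0x00, 0x02, 64500, 100)
parsed = parse_extended_community(ec)
print(parsed["transitive"])  # True
```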

Uses of Multi-Exit Discriminators

MEDs, defined in the main BGP standard, were originally intended to show to a neighboring AS the advertising AS's preference as to which of several links between the two ASes should be used by the accepting AS for transmitting traffic. Another application of MEDs is for multiple ASes with a presence at an IXP to advertise the cost, typically based on delay, that they impose for sending traffic to some destination.

BGP problems and mitigation

iBGP scalability

An autonomous system running iBGP (internal BGP) must have all of its iBGP peers connected to each other in a full mesh (where everyone speaks to everyone directly). This full-mesh configuration requires that each router maintain a session to every other router. In large networks, this number of sessions may degrade the performance of routers, due either to a lack of memory or to excessive CPU requirements.

Route reflectors and confederations both reduce the number of iBGP peers to each router and thus reduce processing overhead. Route reflectors are a pure performance-enhancing technique, while confederations also can be used to implement more fine-grained policy.

Route reflectors[6] reduce the number of connections required in an AS. A single router (or two for redundancy) can be made a route reflector: other routers in the AS need only be configured as peers to them.

Confederations are sets of autonomous systems. In common practice,[7] only one of the confederation AS numbers is seen by the Internet as a whole. Confederations are used in very large networks, where a large AS can be configured to encompass smaller, more manageable internal ASes.

Confederations can be used in conjunction with route reflectors. Confederations allow more fine-grained policy while route reflectors are a pure scaling technique, but either or both may be relevant to a particular situation.

Both confederations and route reflectors can be subject to persistent oscillation, unless specific design rules, affecting both BGP and the interior routing protocol, are followed [8].

However, these alternatives can introduce problems of their own, including the following:

  • route oscillation,

  • sub-optimal routing,

  • increase of BGP convergence time [9]

Additionally, route reflectors and BGP confederations were not designed to ease BGP router configuration. Nevertheless, these are common tools for experienced BGP network architects, and they may be combined, for example, as a hierarchy of route reflectors.

Instability

The routing tables managed by a BGP implementation are adjusted continually to reflect actual changes in the network, such as links breaking and being restored or routers going down and coming back up. In the network as a whole it is normal for these changes to happen almost continuously, but for any particular router or link changes are supposed to be relatively infrequent. If a router is misconfigured or mismanaged then it may get into a rapid cycle between down and up states. This pattern of repeated withdrawal and reannouncement, known as route flapping, can cause excessive activity in all the other routers that know about the broken link, as the same route is continuously injected and withdrawn from the routing tables.

A feature known as route flap damping (RFC 2439) is built into many BGP implementations in an attempt to mitigate the effects of route flapping. Without damping, the excessive activity can cause a heavy processing load on routers, which may in turn delay updates on other routes and so affect overall routing stability. With damping, a route's flapping is exponentially decayed. The first time a route becomes unavailable and quickly reappears, damping does not take effect, so as to maintain the normal fail-over times of BGP. On the second occurrence, BGP shuns that prefix for a certain length of time; subsequent occurrences are timed out exponentially. After the abnormalities have ceased and a suitable length of time has passed for the offending route, the prefix can be reinstated and its slate wiped clean. Damping can also mitigate denial-of-service attacks; damping timings are highly customizable.
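The penalty-and-decay scheme behind route flap damping can be sketched as follows. The constants are hypothetical illustrations; real implementations expose the penalty, suppress/reuse limits, and half-life as configuration:

```python
# Each flap adds a fixed penalty; the penalty decays exponentially with a
# configured half-life, and the prefix is suppressed while the penalty
# exceeds the suppress limit, then reused once it falls below the reuse limit.
import math

PENALTY_PER_FLAP = 1000
SUPPRESS_LIMIT = 2000
REUSE_LIMIT = 750
HALF_LIFE = 900.0  # seconds

def decayed(penalty, elapsed):
    """Penalty remaining after `elapsed` seconds of exponential decay."""
    return penalty * math.exp(-math.log(2) * elapsed / HALF_LIFE)

penalty = 0.0
for _ in range(3):                    # three flaps in quick succession
    penalty += PENALTY_PER_FLAP

print(penalty > SUPPRESS_LIMIT)       # True: the prefix is now suppressed
# After three half-lives the penalty has fallen to roughly 375, which is
# below REUSE_LIMIT, so the prefix would be reinstated.
print(decayed(penalty, 3 * HALF_LIFE) < REUSE_LIMIT)  # True
```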

However, subsequent research has shown that flap damping can actually lengthen convergence times in some cases, and can cause interruptions in connectivity even when links are not flapping.[10][11] Moreover, as backbone links and router processors have become faster, some network architects have suggested that flap damping may not be as important as it used to be, since changes to the routing table can be absorbed much faster by routers.[citation needed] This has led the RIPE Route Working Group to write that "with the current implementations of BGP flap damping, the application of flap damping in ISP networks is NOT recommended. ... If flap damping is implemented, the ISP operating that network will cause side-effects to their customers and the Internet users of their customers' content and services ... . These side-effects would quite likely be worse than the impact caused by simply not running flap damping at all." [1] Improving stability without the problems of flap damping is the subject of current research.[2]

Routing table growth

One of the largest problems faced by BGP, and indeed the Internet infrastructure as a whole, comes from the growth of the Internet routing table. If the global routing table grows to the point where some older, less capable, routers cannot cope with the memory requirements or the CPU load of maintaining the table, these routers will cease to be effective gateways between the parts of the Internet they connect. In addition, and perhaps even more importantly, larger routing tables take longer to stabilize (see above) after a major connectivity change, leaving network service unreliable, or even unavailable, in the interim.

Until late 2001, the global routing table was growing exponentially, threatening an eventual widespread breakdown of connectivity. In an attempt to prevent this from happening, there was a cooperative effort by ISPs to keep the global routing table as small as possible, by using CIDR and route aggregation. While this slowed the growth of the routing table to a linear process for several years, with the expanded demand for multihoming by end user networks the growth was once again exponential by the middle of 2004. The global routing table hit 200,000 entries on or about October 13, 2006.

A network black hole is often used to improve aggregation of the BGP global routing table.[citation needed] Consider an AS that has been allocated the address space 172.16.0.0/16, from which it has assigned the prefixes 172.16.0.0/18, 172.16.64.0/18, and 172.16.192.0/18. The AS can advertise the whole block, 172.16.0.0/16. This AS will still receive traffic sent to the "hole", 172.16.128.0/18, but will silently discard it.
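The aggregation example above can be checked with Python's `ipaddress` module: the whole /16 is advertised even though one /18 is unassigned, so traffic to the "hole" still reaches the AS and is discarded.

```python
# The AS advertises 172.16.0.0/16 but has only assigned three of its four /18s;
# 172.16.128.0/18 is the unassigned "hole".
import ipaddress

advertised = ipaddress.ip_network("172.16.0.0/16")
assigned = [ipaddress.ip_network(p) for p in
            ("172.16.0.0/18", "172.16.64.0/18", "172.16.192.0/18")]

def covered(addr):
    """Is this address inside one of the assigned sub-blocks?"""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in assigned)

print(ipaddress.ip_address("172.16.130.1") in advertised)  # True: traffic arrives
print(covered("172.16.130.1"))                             # False: falls in the hole
```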

BGP Hijacking and Transit-AS Problems

By default, eBGP peers will attempt to add all routes received from another peer into the device's routing table and will then attempt to advertise nearly all of these routes to other eBGP peers. This can be a problem, as multi-homed organizations can inadvertently advertise prefixes learned from one AS to another, causing the end customer to become the new best path to the prefixes in question. For example, a customer with a Cisco router peering with, say, AT&T and Verizon and using no filtering will automatically attempt to link the two major carriers, which could cause the providers to prefer sending some or all traffic through the customer (on perhaps a T1), instead of using high-speed dedicated links. This problem can further affect others that peer with these two providers and also cause those ASes to prefer the misconfigured link. In reality, this problem hardly ever occurs with large ISPs, as these ISPs tend to restrict what an end customer can advertise. However, any ISP not filtering customer advertisements can allow errant information to be advertised into the global routing table, where it can affect even the large Tier-1 providers.

The concept of BGP hijacking revolves around locating an ISP that is not filtering advertisements (intentionally or otherwise) or locating an ISP whose internal or ISP-to-ISP BGP session is susceptible to a man-in-the-middle attack. Once located, an attacker can potentially advertise any prefix they want, causing some or all traffic to be diverted from the real source towards the attacker. This can be done either to overload the ISP the attacker has infiltrated, or to perform a DoS or impersonation attack on the entity whose prefix is being advertised. It is not uncommon for an attacker to cause serious outages, up to and including a complete loss of connectivity. In early 2008, at least 8 US universities had their traffic diverted to Indonesia for about 90 minutes one morning, in an attack kept mostly quiet by those involved. Also, in February 2008, a large portion of YouTube's address space was redirected to Pakistan when the PTA decided to block access[3] to the site from inside the country, but accidentally blackholed the route in the global BGP table.

While filtering and MD5/TTL protection are already available for most BGP implementations (thus preventing the source of most attacks), the problem stems from the fact that ISPs rarely filter advertisements from other ISPs, as there is no common or efficient way to determine the list of permissible prefixes each AS can originate. The penalty for allowing errant information to be advertised can range from simple filtering by other/larger ISPs to a complete shutdown of the BGP session by the neighboring ISP (causing the two ISPs to cease peering), and repeated problems often end in permanent termination of all peering agreements. It is also noteworthy that even when a major provider blocks or shuts down a smaller, problematic provider, the global BGP table will often reconfigure and reroute the traffic through other available routes until all peers take action, or until the errant ISP fixes the problem at the source.

One useful offshoot of this concept is called BGP anycasting and is frequently used by root DNS servers to allow multiple servers to use the same IP address, providing redundancy and a layer of protection against DoS attacks without publishing hundreds of server IP addresses. The difference in this situation is that each point advertising a prefix actually has access to the real data (DNS in this case) and responds correctly to end user requests.

Requirements of a router for use of BGP for Internet and backbone-of-backbones purposes

Routers, especially small ones intended for Small Office/Home Office (SOHO) use, may not include BGP software. Some SOHO routers simply are not capable of running BGP using BGP routing tables of any size. Other commercial routers may need a specific software executable image that contains BGP, or a license that enables it. Open source packages that run BGP include GateD, GNU Zebra, Quagga, OpenBGPD, and Vyatta. Devices marketed as Layer 3 switches are less likely to support BGP than devices marketed as routers, but high-end Layer 3 Switches usually can run BGP.

Products marketed as switches may or may not have a size limitation on BGP tables, such as 20,000 routes, far smaller than a full Internet table plus internal routes. These devices, however, may be perfectly reasonable and useful when used for BGP routing of some smaller part of the network, such as a confederation-AS representing one of several smaller enterprises that are linked, by a BGP backbone of backbones, or a small enterprise that announces routes to an ISP but only accepts a default route and perhaps a small number of aggregated routes.

A BGP router used only for a network with a single point of entry to the internet may have a much smaller routing table size (and hence RAM and CPU requirement) than a multihomed network. Even simple multihoming can have modest routing table size. See RFC 4098 for vendor-independent performance parameters for single BGP router convergence in the control plane.

It is not a given that a router running BGP needs a large memory. The memory requirement depends on the amount of BGP information exchanged with other BGP speakers, and the way in which the particular router stores BGP information. Do be aware that the router may have to keep more than one copy of a route, so it can manage different policies for route advertising and acceptance to a specific neighboring AS. The term view is often used for these different policy relationships on a running router.

If one router implementation takes more memory per route than another implementation, this may be a legitimate design choice, trading processing speed against memory. A full BGP table from an external peer will have in excess of 245,000 routes as of late March 2008. Large ISPs may add another 50% for internal and customer routes. Again depending on implementation, separate tables may be kept for each view of a different peer AS.

Open Source Implementations of BGP

  • 6WINDGate, commercial embedded open-source routing modules from 6WIND, including support for multi-core and network processors.

  • Vyatta, a commercial open-source router / firewall.

  • Quagga, a fork of GNU Zebra for Unix-like systems.

  • GNU Zebra, a GPL routing suite supporting BGP4.

  • OpenBGPD, a BSD licensed implementation by the OpenBSD team.

  • XORP, the eXtensible Open Router Platform, a BSD licensed suite.

  • BIRD, a GPL routing package for Unix-like systems.






Constrained Shortest Path First (CSPF) is an extension of shortest-path algorithms. The path computed using CSPF is a shortest path fulfilling a set of constraints: it simply runs a shortest-path algorithm after pruning those links that violate a given set of constraints. A constraint could be the minimum bandwidth required per link (also known as a bandwidth-guaranteed constraint), end-to-end delay, the maximum number of links traversed, or nodes to include/exclude. CSPF is widely used in MPLS Traffic Engineering[citation needed]. Routing using CSPF is known as Constraint-Based Routing (CBR).

The path computed using CSPF could be exactly the same as the path computed by OSPF and IS-IS, or it could be completely different, depending on the set of constraints to be met.

An Example With Bandwidth Constraint

For example consider the following network.

An Example network


Say a route has to be computed from router 1 to router 3 satisfying a bandwidth constraint of x units, and the link cost for each link is based on hop count (i.e., 1).

If x = 50 units then CSPF will give path 1 → 2 → 3.

If x = 55 units then CSPF will give path 1 → 4 → 5 → 3.

If x = 90 units then CSPF will give path 1 → 4 → 5 → 6 → 3.

Note that in all of the above cases OSPF and IS-IS will always give path 1 → 2 → 3.

If, however, the link costs in this topology are different, CSPF will accordingly pick a different path. Suppose we still use hop count as the link cost between all nodes, except for links 1 → 2 and 2 → 3, whose cost is 4 each. This time, for the first case (x = 50), CSPF picks a different path:

If x = 50 units then CSPF will give path 1 → 4 → 5 → 3.
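The first scenario above (hop-count link costs) can be sketched directly: prune the links that violate the bandwidth constraint, then run a shortest-path search. Since the original figure is not reproduced here, the topology and per-link bandwidths below are assumptions chosen to be consistent with the paths given in the text.

```python
# CSPF sketch: prune links below the required bandwidth, then BFS (hop count).
from collections import deque

# Assumed topology: (node, node) -> available bandwidth, consistent with the
# example paths in the text.
LINKS = {(1, 2): 50, (2, 3): 50, (1, 4): 100, (4, 5): 100,
         (5, 3): 60, (5, 6): 100, (6, 3): 100}

def cspf(src, dst, min_bw):
    adj = {}
    for (a, b), bw in LINKS.items():
        if bw >= min_bw:               # prune links violating the constraint
            adj.setdefault(a, []).append(b)
            adj.setdefault(b, []).append(a)
    queue, seen = deque([[src]]), {src}
    while queue:                       # BFS = shortest path by hop count
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in adj.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(cspf(1, 3, 50))  # [1, 2, 3]
print(cspf(1, 3, 55))  # [1, 4, 5, 3]
print(cspf(1, 3, 90))  # [1, 4, 5, 6, 3]
```

Handling the second scenario (non-uniform link costs) would only require replacing the BFS with a weighted shortest-path search such as Dijkstra's algorithm over the pruned graph.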








Multiprotocol Label Switching (MPLS) is a data-carrying mechanism that belongs to the family of packet-switched networks. MPLS operates at an OSI Model layer that is generally considered to lie between traditional definitions of Layer 2 (Data Link Layer) and Layer 3 (Network Layer), and thus is often referred to as a "Layer 2.5" protocol. It was designed to provide a unified data-carrying service for both circuit-based clients and packet-switching clients which provide a datagram service model. It can be used to carry many different kinds of traffic, including IP packets, as well as native ATM, SONET, and Ethernet frames.

A number of different technologies were previously deployed with essentially identical goals, such as Frame Relay and ATM. MPLS technologies have evolved with the strengths and weaknesses of ATM in mind. Many network engineers agree that ATM should be replaced with a protocol that requires less overhead while providing connection-oriented services for variable-length frames. MPLS is currently replacing some of these technologies in the marketplace, and it is quite possible that MPLS will completely replace them in the future, aligning these technologies with current and future needs.[1]

In particular, MPLS dispenses with the cell-switching and signaling-protocol baggage of ATM. MPLS recognizes that small ATM cells are not needed in the core of modern networks, since modern optical networks (as of 2008) are so fast (at 40 Gbit/s and beyond) that even full-length 1500 byte packets do not incur significant real-time queuing delays (the need to reduce such delays — e.g., to support voice traffic — was the motivation for the cell nature of ATM).

At the same time, MPLS attempts to preserve the traffic engineering and out-of-band control that made frame relay and ATM attractive for deploying large-scale networks.

MPLS was originally proposed by a group of engineers from Ipsilon Networks, but their "IP Switching" technology, which was defined only to work over ATM, did not achieve market dominance. Cisco Systems, Inc. introduced a related proposal, not restricted to ATM transmission, called "Tag Switching" when it was a Cisco proprietary proposal, and was renamed "Label Switching" when it was handed over to the IETF for open standardization. The IETF work involved proposals from other vendors, and development of a consensus protocol that combined features from several vendors' work.

One original motivation was to allow the creation of simple high-speed switches, since for a significant length of time it was impossible to forward IP packets entirely in hardware. However, advances in VLSI have made such devices possible. Therefore the advantages of MPLS primarily revolve around the ability to support multiple service models and perform traffic management. MPLS also offers a robust recovery framework[2] that goes beyond the simple protection rings of synchronous optical networking (SONET/SDH).

While the traffic management benefits of migrating to MPLS are quite valuable (better reliability, increased performance), there is a significant loss of visibility and access into the MPLS cloud for IT departments.[3]

Contents


How MPLS works

MPLS works by prefixing packets with an MPLS header, containing one or more 'labels'. This is called a label stack.

Each label stack entry contains four fields:

  • a 20-bit label value.

  • a 3-bit field for QoS (Quality of Service) priority (experimental).

  • a 1-bit bottom-of-stack flag. If this is set, it signifies that the current label is the last in the stack.

  • an 8-bit TTL (time to live) field.
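The four fields above pack into a single 32-bit label stack entry, which can be sketched with plain bit operations:

```python
# Encode/decode one 32-bit MPLS label stack entry:
# | label (20 bits) | exp (3 bits) | bottom-of-stack (1 bit) | TTL (8 bits) |
def pack_entry(label, exp, bos, ttl):
    assert 0 <= label < 2**20 and 0 <= exp < 8 and bos in (0, 1) and 0 <= ttl < 256
    return (label << 12) | (exp << 9) | (bos << 8) | ttl

def unpack_entry(entry):
    return {"label": entry >> 12, "exp": (entry >> 9) & 0x7,
            "bos": (entry >> 8) & 0x1, "ttl": entry & 0xFF}

entry = pack_entry(label=100, exp=0, bos=1, ttl=64)
print(hex(entry))           # 0x64140
print(unpack_entry(entry))  # {'label': 100, 'exp': 0, 'bos': 1, 'ttl': 64}
```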

These MPLS-labeled packets are switched after a Label Lookup/Switch instead of a lookup into the IP table. As mentioned above, when MPLS was conceived, Label Lookup and Label Switching were faster than a RIB lookup because they could take place directly within the switching fabric rather than the CPU.

The entry and exit points of an MPLS network are called Label Edge Routers (LER), which, respectively, push an MPLS label onto the incoming packet and pop it off the outgoing packet. Routers that perform routing based only on the label are called Label Switch Routers (LSR). In some applications, the packet presented to the LER already may have a label, so that the new LSR pushes a second label onto the packet. For more information see Penultimate Hop Popping.

Labels are distributed between LERs and LSRs using the "Label Distribution Protocol" (LDP)[4]. Label Switch Routers in an MPLS network regularly exchange label and reachability information with each other using standardized procedures in order to build a complete picture of the network they can then use to forward packets. Label Switch Paths (LSPs) are established by the network operator for a variety of purposes, such as to create network-based IP Virtual Private Networks or to route traffic along specified paths through the network. In many respects, LSPs are no different than PVCs in ATM or Frame Relay networks, except that they are not dependent on a particular Layer 2 technology.[5]

In the specific context of an MPLS-based Virtual Private Network (VPN), LSRs that function as ingress and/or egress routers to the VPN are often called PE (Provider Edge) routers. Devices that function only as transit routers are similarly called P (Provider) routers. See RFC 2547.[6] The job of a P router is significantly easier than that of a PE router, so they can be less complex and may be more dependable because of this.

When an unlabeled packet enters the ingress router and needs to be passed on to an MPLS tunnel, the router first determines the forwarding equivalence class (FEC) the packet should be in, and then inserts one or more labels in the packet's newly-created MPLS header. The packet is then passed on to the next hop router for this tunnel.

When a labeled packet is received by an MPLS router, the topmost label is examined. Based on the contents of the label a swap, push (impose) or pop (dispose) operation can be performed on the packet's label stack. Routers can have prebuilt lookup tables that tell them which kind of operation to do based on the topmost label of the incoming packet so they can process the packet very quickly.

In a swap operation the label is swapped with a new label, and the packet is forwarded along the path associated with the new label.

In a push operation a new label is pushed on top of the existing label, effectively "encapsulating" the packet in another layer of MPLS. This allows hierarchical routing of MPLS packets. Notably, this is used by MPLS VPNs.

In a pop operation the label is removed from the packet, which may reveal an inner label below. This process is called "decapsulation". If the popped label was the last on the label stack, the packet "leaves" the MPLS tunnel. This is usually done by the egress router, but see PHP below.
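The three operations can be sketched on a simple list model of the label stack (names are illustrative, not from any MPLS implementation; the last element is the topmost label):

```python
def swap(stack: list, new_label: int) -> list:
    stack[-1] = new_label   # replace the topmost label in place
    return stack

def push(stack: list, new_label: int) -> list:
    stack.append(new_label) # encapsulate the packet in another MPLS layer
    return stack

def pop(stack: list) -> list:
    stack.pop()             # decapsulate; may expose an inner label
    return stack            # an empty stack means the packet leaves the tunnel
```

A transit LSR picks one of these operations from its lookup table based only on the topmost label, without examining the payload beneath.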

During these operations, the contents of the packet below the MPLS Label stack are not examined. Indeed transit routers typically need only to examine the topmost label on the stack. The forwarding of the packet is done based on the contents of the labels, which allows "protocol-independent packet forwarding" that does not need to look at a protocol-dependent routing table and avoids the expensive IP longest prefix match at each hop.

At the egress router, when the last label has been popped, only the payload remains. This can be an IP packet, or any of a number of other kinds of payload packet. The egress router must therefore have routing information for the packet's payload, since it must forward it without the help of label lookup tables. An MPLS transit router has no such requirement.

In some special cases, the last label can also be popped off at the penultimate hop (the hop before the egress router). This is called Penultimate Hop Popping (PHP). This may be interesting in cases where the egress router has lots of packets leaving MPLS tunnels, and thus spends inordinate amounts of CPU time on this. By using PHP, transit routers connected directly to this egress router effectively offload it, by popping the last label themselves.

MPLS can make use of existing ATM network infrastructure, as its labeled flows can be mapped to ATM virtual circuit identifiers, and vice versa.

Installing and removing MPLS paths

There are two standardized protocols for managing MPLS paths: CR-LDP (Constraint-based Routing Label Distribution Protocol) and RSVP-TE, an extension of the RSVP protocol for traffic engineering defined in RFC 3209. As of February 2003, as documented in RFC 3468,[7] the IETF MPLS working group deprecated CR-LDP in favor of RSVP-TE.

Extensions of the BGP protocol, starting with RFC 2547, can also be used to manage MPLS paths, as described in RFC 3107 and RFC 4781.[8][9]

An MPLS header does not identify the type of data carried inside the MPLS path. If one wants to carry two different types of traffic between the same two routers, with different treatment from the core routers for each type, one has to establish a separate MPLS path for each type of traffic.

Comparison of MPLS versus IP

MPLS cannot be compared to IP as a separate entity because it works in conjunction with IP and IP's IGP routing protocols. MPLS gives IP networks simple traffic engineering, the ability to transport Layer 3 (IP) VPNs with overlapping address spaces, and support for Layer 2 pseudowires (with Any Transport Over MPLS, or AToM - see Martini draft). Routers with programmable CPUs and without TCAM/CAM or another method for fast lookups may also see a limited increase in performance.

MPLS relies on IGP routing protocols to construct its label forwarding table, and the scope of any IGP is usually restricted to a single carrier for stability and policy reasons. As there is still no standard for carrier-to-carrier MPLS, it is not possible to have the same MPLS service (Layer 2 or Layer 3 VPN) span more than one operator.

MPLS Traffic Engineering

MPLS Traffic Engineering provides benefits over a pure-IP network by allowing greater control over the spread of traffic in the network. The path of an LSP can either be (a) explicitly configured hop by hop, (b) dynamically routed by the Constrained Shortest Path First CSPF algorithm, or (c) configured as a loose route that avoids a particular IP or that is partly explicit and partly dynamic. In a pure IP network, the shortest path to a destination is chosen even when it becomes more congested. Meanwhile, in an IP network with MPLS Traffic Engineering CSPF routing, constraints such as the RSVP bandwidth of the traversed links can also be considered, such that the shortest path with available bandwidth will be chosen. MPLS Traffic Engineering relies upon the use of TE extensions to OSPF or IS-IS and RSVP. Besides the constraint of RSVP bandwidth, users can also define their own constraints by specifying link attributes and special requirements for tunnels to route (or to not route) over links with certain attributes. [10]
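The core idea of CSPF can be sketched as an ordinary shortest-path search on a pruned graph: links that fail the constraint (here, insufficient reservable bandwidth) are skipped. This is an illustrative sketch only; the graph format and names are assumptions, not any vendor's implementation:

```python
import heapq

def cspf(graph: dict, src, dst, needed_bw):
    """graph: {node: [(neighbor, cost, available_bw), ...]}.
    Returns the cheapest path satisfying the bandwidth constraint, or None."""
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue                      # stale heap entry
        for v, cost, bw in graph.get(u, []):
            if bw < needed_bw:            # constraint: prune links without capacity
                continue
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None                       # no path satisfies the constraint
    path, n = [dst], dst
    while n != src:
        n = prev[n]
        path.append(n)
    return list(reversed(path))
```

With this pruning, a longer but uncongested path can be selected where plain shortest-path IP routing would keep sending traffic over the congested shortest path.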

MPLS local protection (Fast Reroute)

Main article: MPLS local protection

In the event of a network element failure when recovery mechanisms are employed at the IP layer, restoration may take several seconds, which is unacceptable for real-time applications (such as VoIP).[11][12][13] In contrast, MPLS local protection meets the requirements of real-time applications with recovery times comparable to those of SONET rings (up to 50 ms).[11][13][14]

Comparison of MPLS versus Frame Relay

Frame relay aimed to make more efficient use of existing physical resources, allowing the underprovisioning of data services by telecommunications companies (telcos) to their customers, as clients were unlikely to be utilizing a data service 100 percent of the time. In more recent years, frame relay has acquired a bad reputation in some markets because of excessive bandwidth overbooking by these telcos.

Telcos often sell frame relay to businesses looking for a cheaper alternative to dedicated lines; its use in different geographic areas depended greatly on governmental and telecommunication companies' policies. Some of the early companies to make frame relay products included StrataCom (later acquired by Cisco Systems) and Cascade Communications (later acquired by Ascend Communications and then by Lucent Technologies).

AT&T is currently (as of June 2007) the largest frame relay service provider in the United States, with local networks in 22 states, plus national and international networks. This number is expected to change between 2007 and 2009 when most of these frame relay contracts expire. Many customers are likely to migrate from frame relay to MPLS over IP or Ethernet within the next two years, which in many cases will reduce costs and improve manageability and performance of their wide area networks.[15] [16]

Comparison of MPLS versus ATM

While the underlying protocols and technologies are different, both MPLS and ATM provide a connection-oriented service for transporting data across computer networks. In both technologies, connections are signaled between endpoints, connection state is maintained at each node in the path, and encapsulation techniques are used to carry data across the connection. Excluding differences in the signaling protocols (RSVP/LDP for MPLS and PNNI:Private Network-to-Network Interface for ATM) there still remain significant differences in the behavior of the technologies.

The most significant difference is in the transport and encapsulation methods. MPLS is able to work with variable length packets while ATM transports fixed-length (53 byte) cells. Packets must be segmented, transported and re-assembled over an ATM network using an adaption layer, which adds significant complexity and overhead to the data stream. MPLS, on the other hand, simply adds a label to the head of each packet and transmits it on the network.

Differences exist, as well, in the nature of the connections. An MPLS connection (LSP) is uni-directional, allowing data to flow in only one direction between two endpoints. Establishing two-way communications between endpoints requires a pair of LSPs to be established. Because two LSPs are required for connectivity, data flowing in the forward direction may use a different path from data flowing in the reverse direction. ATM point-to-point connections (virtual circuits), on the other hand, are bi-directional, allowing data to flow in both directions over the same path (only SVC ATM connections are bi-directional; PVC ATM connections are uni-directional).

Both ATM and MPLS support tunnelling of connections inside connections. MPLS uses label stacking to accomplish this while ATM uses Virtual Paths. MPLS can stack multiple labels to form tunnels within tunnels. The ATM Virtual Path Indicator (VPI) and Virtual Circuit Indicator (VCI) are both carried together in the cell header, limiting ATM to a single level of tunnelling.

The biggest single advantage that MPLS has over ATM is that it was designed from the start to be complementary to IP. Modern routers are able to support both MPLS and IP natively across a common interface allowing network operators great flexibility in network design and operation. ATM's incompatibilities with IP require complex adaptation making it largely unsuitable in today's predominantly IP networks.

Comparison of MPLS vs Ethernet VPN vs IP WAN

MPLS helps to improve productivity through management of a single network. Companies that want to retain their own IP routing may choose an Ethernet VPN solution over MPLS. However, there are often hard-to-reach locations not fully served by terrestrial fibre; integrating satellite IP, BGAN and MVSAT with an MPLS backbone enables coverage into such locations, on both land and sea, using IP technology. This is also known as an IP WAN.

MPLS deployment

MPLS is currently in use in large "IP Only" networks, and is standardized by IETF in RFC 3031.

In practice, MPLS is mainly used to forward IP datagrams and Ethernet traffic. Major applications of MPLS are Telecommunications traffic engineering and MPLS VPN.

Competitors to MPLS

MPLS can exist in both an IPv4 environment (IPv4 routing protocols) and an IPv6 environment (IPv6 routing protocols). The major goal of MPLS development, the increase of routing speed, is no longer relevant because of the use of ASIC-, TCAM- and CAM-based switching. Therefore, the major usage of MPLS is to implement limited traffic engineering and Layer 3/Layer 2 "service provider type" VPNs over existing IPv4 networks. The only competitors to MPLS are technologies like L2TPv3 that also provide services such as service provider Layer 2 and Layer 3 VPNs.

IEEE 1355 is a completely unrelated technology that does something similar in hardware.

IPv6 references: Grossetete, Patrick, IPv6 over MPLS, Cisco Systems 2001; Juniper Networks IPv6 and Infranets White Paper; Juniper Networks DoD's Research and Engineering Community White Paper.

Access to MPLS networks

MPLS supports a range of access technologies, including T1, ATM and frame relay. In April 2008, New Edge Networks announced traffic prioritization on its MPLS network available via less expensive DSL access. Previously, traffic prioritization was not possible across DSL connections.

Benefits of MPLS

MPLS provides networks with a more efficient way to manage applications and move information between locations. With the convergence of voice, video and data applications, business networks face increasing traffic demands. MPLS enables class of service (CoS) tagging and prioritization of network traffic, so administrators may specify which applications should move across the network ahead of others. This function makes an MPLS network especially important to firms that need to ensure the performance of low-latency applications such as VoIP and their other business-critical functions. MPLS carriers differ on the number of classes of service they offer and in how these CoS tiers are priced. [17]






Synchronous optical networking (SONET) and Synchronous Digital Hierarchy (SDH) are two closely related multiplexing protocols for transferring multiple digital bit streams using lasers or light-emitting diodes (LEDs) over the same optical fiber. The method was developed to replace the Plesiochronous Digital Hierarchy (PDH) system for transporting larger amounts of telephone calls and data traffic over the same fiber without synchronization problems.

SONET and SDH are based on circuit mode communication, meaning that each connection achieves a constant bit rate and delay. For example, SDH or SONET may be utilized to allow several Internet Service Providers to share the same optical fiber, without being affected by each other's traffic load, and without being able to temporarily borrow free capacity from each other. Only certain integer multiples of 64 kbit/s are possible bit rates.

Since SONET and SDH are characterized as pure time division multiplexing (TDM) protocols (not to be confused with Time Division Multiple Access, TDMA), offering permanent connections, and do not involve packet mode communication, they are considered as physical layer protocols.

Both SDH and SONET are widely used today: SONET in the U.S. and Canada, and SDH in the rest of the world. Although the SONET standards were developed before SDH, SDH's greater worldwide market penetration means that SONET is now considered the variation.

The two protocols are standardized by ANSI and Telcordia (SONET) and by the ITU-T (SDH).

Difference from PDH

Synchronous networking differs from PDH in that the exact rates that are used to transport the data are tightly synchronized across the entire network, made possible by atomic clocks. This synchronization system allows entire inter-country networks to operate synchronously, greatly reducing the amount of buffering required between elements in the network.

Both SONET and SDH can be used to encapsulate earlier digital transmission standards, such as the PDH standard, or used directly to support either Asynchronous Transfer Mode (ATM) or so-called Packet over SONET/SDH (POS) networking. As such, it is inaccurate to think of SDH or SONET as communications protocols in and of themselves, but rather as generic and all-purpose transport containers for moving both voice and data. The basic format of an SDH signal allows it to carry many different services in its Virtual Container (VC) because it is bandwidth-flexible.

Structure of SONET/SDH signals

SONET and SDH often use different terms to describe identical features or functions, sometimes leading to confusion that exaggerates their differences. With a few exceptions, SDH can be thought of as a superset of SONET. The two main differences between them are:

  • SONET can use either of two basic units for framing, while SDH has one.

  • SDH has additional mapping options which are not available in SONET.

Protocol overview

The protocol is an extremely heavily multiplexed structure, with the header interleaved between the data in a complex way. This is intended to permit the encapsulated data to have its own frame rate and to be able to float around relative to the SDH/SONET frame structure and rate. This interleaving permits a very low latency for the encapsulated data: data passing through equipment can be delayed by at most 32 microseconds, compared to a frame rate of 125 microseconds; many competing protocols buffer the data for at least one frame or packet before sending it on. Extra padding is allowed for the multiplexed data to move within the overall framing, because the data runs on a different clock from the frame rate. Allowing this at most levels of the multiplexing structure makes the protocol complex, but gives high all-round performance.

The basic unit of transmission

The basic unit of framing in SDH is an STM-1 (Synchronous Transport Module level - 1), which operates at 155.52 Mbit/s. SONET refers to this basic unit as an STS-3c (Synchronous Transport Signal - 3, concatenated), but its high-level functionality, frame size, and bit-rate are the same as STM-1.

SONET offers an additional basic unit of transmission, the STS-1 (Synchronous Transport Signal - 1), operating at 51.84 Mbit/s - exactly one third of an STM-1/STS-3c. Some manufacturers also support the SDH equivalent STM-0, but this is not part of the standard.

Framing

In packet oriented data transmission such as Ethernet, a packet frame usually consists of a header and a payload, with the header of the frame being transmitted first, followed by the payload (and possibly a trailer, such as a CRC). In synchronous optical networking, this is modified slightly. The header is termed the overhead and the payload still exists, but instead of the overhead being transmitted before the payload, it is interleaved, with part of the overhead being transmitted, then part of the payload, then the next part of the overhead, then the next part of the payload, until the entire frame has been transmitted. In the case of an STS-1, the frame is 810 octets in size while the STM-1/STS-3c frame is 2430 octets in size. For STS-1, the frame is transmitted as 3 octets of overhead, followed by 87 octets of payload. This is repeated nine times over until 810 octets have been transmitted, taking 125 microseconds. In the case of an STS-3c/STM-1 which operates three times faster than STS-1, 9 octets of overhead are transmitted, followed by 261 octets of payload. This is also repeated nine times over until 2,430 octets have been transmitted, also taking 125 microseconds. For both SONET and SDH, this is normally represented by the frame being displayed graphically as a block: of 90 columns and 9 rows for STS-1; and 270 columns and 9 rows for SDH/STS-3c. This representation aligns all the overhead columns, so the overhead appears as a contiguous block, as does the payload.

The internal structure of the overhead and payload within the frame differs slightly between SONET and SDH, and different terms are used in the standards to describe these structures. However, the standards are extremely similar in implementation, such that it is easy to interoperate between SDH and SONET at particular bandwidths.

It is worth noting that the choice of a 125 microsecond interval is not an arbitrary one. What it means is that the same octet position in each frame comes past every 125 microseconds. If one octet is extracted from the bitstream every 125 microseconds, this gives a data rate of 8 bits per 125 microseconds - or 64 kbit/s, the basic DS0 telecommunications rate. This relation allows an extremely useful behaviour of synchronous optical networking, which is that low data rate channels or streams of data can be extracted from high data rate streams by simply extracting octets at regular time intervals - there is no need to understand or decode the entire frame. This is not possible in PDH networking. Furthermore, it shows that a relatively simple device is all that is needed to extract a datastream from an SDH framed connection and insert it into a SONET framed connection and vice versa.
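The arithmetic behind this property can be checked directly. The sketch below (function names are illustrative) shows that one octet per 125 microsecond frame is exactly a DS0, and that extracting a channel amounts to slicing the same octet position out of each successive frame:

```python
FRAMES_PER_SEC = 8000           # one frame every 125 microseconds
BITS_PER_OCTET = 8

# One octet position per frame carries 8 bits / 125 us = 64 kbit/s.
ds0_rate = BITS_PER_OCTET * FRAMES_PER_SEC
assert ds0_rate == 64_000       # the basic DS0 telecommunications rate

def extract_channel(frames, position):
    """Pull a low-rate channel out of a stream of frames (as byte strings)
    by taking the same octet position from each frame in turn."""
    return bytes(frame[position] for frame in frames)
```

No decoding of the whole frame is required; the extraction device only needs to count octets and time.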

In practice, the terms STS-1 and OC-1 are sometimes used interchangeably, though the OC-N format refers to the signal in its optical form. It is therefore incorrect to say that an OC-3 contains 3 OC-1s: an OC-3 can be said to contain 3 STS-1s.

SDH Frame

A STM-1 Frame. The first 9 columns contain the overhead and the pointers. For the sake of simplicity, the frame is shown as a rectangular structure of 270 columns and 9 rows, but the protocol does not transmit the bytes in this order in practice

For the sake of simplicity, the frame is shown as a rectangular structure of 270 columns and 9 rows. The first 3 rows and 9 columns contain Regenerator Section Overhead (RSOH) and the last 5 rows and 9 columns contain Multiplex Section Overhead (MSOH). The 4th row from the top contains pointers

The STM-1 (Synchronous Transport Module level - 1) frame is the basic transmission format for SDH, the first level of the synchronous digital hierarchy. The STM-1 frame is transmitted in exactly 125 microseconds; therefore, there are 8000 frames per second. The STM-1 frame consists of overhead plus a virtual container capacity. The first 9 columns of each frame make up the Section Overhead, and the last 261 columns make up the Virtual Container (VC) capacity. The VC plus the pointers (H1, H2, H3 bytes) is called the AU (Administrative Unit).

Carried within the VC capacity, which has its own frame structure of 9 rows and 261 columns, is the Path Overhead and the Container. The first column is for Path Overhead; it's followed by the payload container, which can itself carry other containers. Virtual Containers can have any phase alignment within the Administrative Unit, and this alignment is indicated by the Pointer in row four.

The Section overhead of an STM-1 signal (SOH) is divided into two parts: the Regenerator Section Overhead (RSOH) and the Multiplex Section Overhead (MSOH). The overheads contain information from the system itself, which is used for a wide range of management functions, such as monitoring transmission quality, detecting failures, managing alarms, data communication channels, service channels, etc.

The STM frame is continuous and is transmitted in a serial fashion, byte-by-byte, row-by-row.

STM–1 frame contains:

  • Total content: 9 rows x 270 bytes = 2430 bytes

  • Overhead: 9 rows x 9 bytes

  • Payload: 9 rows x 261 bytes

  • Period: 125 μs

  • Bit rate: 155.520 Mbit/s (2430 bytes x 8 bits x 8000 frames/s)

  • Payload capacity: 150.336 Mbit/s (2349 bytes x 8 bits x 8000 frames/s)

The transmission of the frame is done row by row, from the top left corner.
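The STM-1 figures above follow from the frame geometry; this short sketch verifies them:

```python
# Arithmetic check of the STM-1 figures listed above (a sketch, not a codec).
ROWS, COLUMNS = 9, 270
OVERHEAD_COLUMNS = 9
FRAMES_PER_SEC = 8000   # one 125 us frame, 8000 times per second

total_octets   = ROWS * COLUMNS                        # 2430 bytes per frame
payload_octets = ROWS * (COLUMNS - OVERHEAD_COLUMNS)   # 2349 bytes per frame

bit_rate     = total_octets * 8 * FRAMES_PER_SEC       # gross line rate
payload_rate = payload_octets * 8 * FRAMES_PER_SEC     # VC capacity

assert bit_rate == 155_520_000      # 155.520 Mbit/s
assert payload_rate == 150_336_000  # 150.336 Mbit/s
```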

Framing Structure

The frame consists of two parts, the transport overhead and the path virtual envelope.

Transport overhead

The transport overhead is used for signaling and measuring transmission error rates, and is composed as follows:

  • Section overhead - called RSOH (Regenerator Section Overhead) in SDH terminology: 27 octets containing information about the frame structure required by the terminal equipment.

  • Line overhead - called MSOH (Multiplex Section Overhead) in SDH: 45 octets containing information about alarms, maintenance and error correction as may be required within the network.

  • Pointer – It points to the location of the J1 byte in the payload.

Path virtual envelope

Data transmitted from end to end is referred to as path data. It is composed of two components:

  • Path overhead (POH): 9 bytes used for end-to-end signaling and error measurement.

  • Payload: user data (774 bytes for STS-1, or 2349 bytes for STM-1/STS-3c)

For STS-1, the payload is referred to as the synchronous payload envelope (SPE), which in turn has 18 stuffing bytes, leading to the STS-1 payload capacity of 756 bytes.[1]

The STS-1 payload is designed to carry a full PDH DS3 frame. When the DS3 enters a SONET network, path overhead is added, and that SONET network element (NE) is said to be a path generator and terminator. The SONET NE is said to be line terminating if it processes the line overhead. Note that wherever the line or path is terminated, the section is terminated also. SONET Regenerators terminate the section but not the paths or line.

An STS-1 payload can also be subdivided into 7 VTGs, or Virtual Tributary Groups. Each VTG can then be subdivided into 4 VT1.5 signals, each of which can carry a PDH DS1 signal. A VTG may instead be subdivided into 3 VT2 signals, each of which can carry a PDH E1 signal. The SDH equivalent of a VTG is a TUG2; VT1.5 is equivalent to VC11, and VT2 is equivalent to VC12.

Three STS-1 signals may be multiplexed by time-division multiplexing to form the next level of the SONET hierarchy, the OC-3 (STS-3), running at 155.52 Mbit/s. The multiplexing is performed by interleaving the bytes of the three STS-1 frames to form the STS-3 frame, containing 2,430 bytes and transmitted in 125 microseconds.

Higher speed circuits are formed by successively aggregating multiples of slower circuits, their speed always being immediately apparent from their designation. For example, four STS-3 or AU4 signals can be aggregated to form a 622.08 Mbit/s signal designated as OC-12 or STM-4.

The highest rate that is commonly deployed is the OC-192 or STM-64 circuit, which operates at a rate of just under 10 Gbit/s. Speeds beyond 10 Gbit/s are technically viable and under evaluation; a few vendors now offer OC-768/STM-256 rates, with speeds of nearly 40 Gbit/s. Where fiber exhaust is a concern, multiple SONET signals can be transported over multiple wavelengths on a single fiber pair by means of wavelength-division multiplexing, including Dense Wave Division Multiplexing (DWDM) and Coarse Wave Division Multiplexing (CWDM). DWDM circuits are the basis for all modern transatlantic cable systems and other long-haul circuits.

SONET/SDH and relationship to 10 Gigabit Ethernet

Another circuit type amongst data networking equipment is 10 Gigabit Ethernet (10GbE). This is similar to the line rate of OC-192/STM-64 (9.953 Gbit/s). The Gigabit Ethernet Alliance created two 10 Gigabit Ethernet variants: a local area variant (LAN PHY), with a line rate of exactly 10,000,000 kbit/s and a wide area variant (WAN PHY), with the same line rate as OC-192/STM-64 (9,953,280 kbit/s). The Ethernet wide area variant encapsulates its data using a light-weight SDH/SONET frame so as to be compatible at low level with equipment designed to carry those signals.

However, 10 Gigabit Ethernet does not explicitly provide any interoperability at the bitstream level with other SDH/SONET systems. This differs from WDM System Transponders, including both Coarse- and Dense-WDM systems (CWDM, DWDM) that currently support OC-192 SONET Signals, which can normally support thin-SONET framed 10 Gigabit Ethernet.

SONET/SDH data rates

SONET/SDH Designations and bandwidths

SONET Optical Carrier Level | SONET Frame Format | SDH level and Frame Format | Payload bandwidth (kbit/s) | Line Rate (kbit/s)
----------------------------|--------------------|----------------------------|----------------------------|-------------------
OC-1                        | STS-1              | STM-0                      | 48,960                     | 51,840
OC-3                        | STS-3              | STM-1                      | 150,336                    | 155,520
OC-12                       | STS-12             | STM-4                      | 601,344                    | 622,080
OC-24                       | STS-24             | STM-8                      | 1,202,688                  | 1,244,160
OC-48                       | STS-48             | STM-16                     | 2,405,376                  | 2,488,320
OC-96                       | STS-96             | STM-32                     | 4,810,752                  | 4,976,640
OC-192                      | STS-192            | STM-64                     | 9,621,504                  | 9,953,280
OC-768                      | STS-768            | STM-256                    | 38,486,016                 | 39,813,120
OC-1536                     | STS-1536           | STM-512                    | 76,972,032                 | 79,626,240
OC-3072                     | STS-3072           | STM-1024                   | 153,944,064                | 159,252,480

In the above table, Payload bandwidth is the line rate less the bandwidth of the line and section overheads. User throughput must also deduct path overhead from this, but path overhead bandwidth is variable based on the types of cross-connects built across the optical system.

Note that the typical data rate progression starts at OC-3 and increases by multiples of 4. As such, while OC-24 and OC-1536, along with other rates such as OC-9, OC-18, OC-36, and OC-96, may be defined in some standards documents, they are not available on a wide range of equipment.
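The line rates in the table scale linearly with the optical carrier level: OC-n runs at n x 51.84 Mbit/s. This sketch checks a few of the tabulated values:

```python
BASE_KBITS = 51_840   # OC-1 / STS-1 line rate in kbit/s

def oc_line_rate(n: int) -> int:
    """Line rate of an OC-n signal in kbit/s."""
    return n * BASE_KBITS

assert oc_line_rate(3) == 155_520          # OC-3 / STM-1
assert oc_line_rate(192) == 9_953_280      # OC-192 / STM-64
assert oc_line_rate(768) == 39_813_120     # OC-768 / STM-256
```

The payload bandwidths scale the same way from OC-3 upward; user throughput is lower still, since path overhead must also be deducted.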

As of 2007, OC-3072 is still a work in progress.

Physical layer

The physical layer actually comprises a large number of layers within it, only one of which is the optical/transmission layer (which includes bitrates, jitter specifications, optical signal specifications and so on). The SONET and SDH standards come with a host of features for isolating and identifying signal defects and their origins.

SONET/SDH Network Management Protocols

SONET equipment is often managed with the TL1 protocol. TL1 is a traditional telecom language for managing and reconfiguring SONET network elements. TL1 (or whatever command language a SONET Network Element utilizes) must be carried by other management protocols, including SNMP, CORBA and XML.

There are some features that are fairly universal in SONET Network Management. First of all, most SONET NEs have a limited number of management interfaces defined. These are:

  • Electrical Interface. The electrical interface (often 50 Ω) sends SONET TL1 commands from a local management network physically housed in the Central Office where the SONET NE is located. This is for "local management" of that NE and, possibly, remote management of other SONET NEs.

  • Craft Interface. Local "craftspersons" can access a SONET NE on a "craft port" and issue commands through a dumb terminal or terminal emulation program running on a laptop. This interface can also be hooked-up to a console server, allowing for remote out-of-band management and logging.

  • Data Communication Channels. SONET and SDH have dedicated Data Communication Channels (DCCs) within the section and line overhead for management traffic. Generally, section overhead (regenerator section in SDH) is used. According to ITU-T G.7712, there are three modes used for management:

  • IP-only stack, using PPP as data-link
  • OSI-only stack, using LAP-D as data-link
  • Dual (IP+OSI) stack using PPP or LAP-D with tunneling functions to communicate between stacks.

An interesting fact about modern NEs is that, to handle all of the possible management channels and signals, most NEs actually contain a router for routing the network commands and underlying (data) protocols.

The main functions of Network Management include:

  • Network and NE Provisioning. In order to allocate bandwidth throughout a network, each NE must be configured. Although this can be done locally, through a craft interface, it is normally done through a Network Management System (sitting at a higher layer) that in turn operates through the SONET/SDH Network Management Network.

  • Software Upgrade. NE Software Upgrade is in modern NEs done mostly through the SONET/SDH Management network.

  • Performance Management. NEs have a very large set of standards for Performance Management. The PM criteria allow for monitoring not only the health of individual NEs, but for the isolation and identification of most network defects or outages. Higher-layer Network monitoring and management software allows for the proper filtering and troubleshooting of network-wide PM so that defects and outages can be quickly identified and responded to.

Equipment

With recent advances in SONET and SDH chipsets, the traditional categories of NEs are breaking down. Nevertheless, as Network architectures have remained relatively constant, even newer equipment (including "Multiservice Provisioning Platforms") can be examined in light of the architectures they will support. Thus, there is value in viewing new (as well as traditional) equipment in terms of the older categories.

Regenerator

Traditional regenerators terminate the section overhead, but not the line or path. Regens extend long-haul routes by converting an optical signal that has already traveled a long distance into electrical format and then retransmitting a regenerated high-power signal.

Since the late 1990s, regenerators have been largely replaced by Optical Amplifiers. Also, some of the functionality of regens has been absorbed by the Transponders of Wavelength Division Multiplexing systems.

Add-Drop Multiplexer (ADM)

ADMs are the most common type of NEs. Traditional ADMs were designed to support one of the Network Architectures, though new generation systems can often support several architectures, sometimes simultaneously. ADMs traditionally have a "high speed side" (where the full line rate signal is supported), and a "low speed side", which can consist of electrical as well as optical interfaces. The low speed side takes in low speed signals which are multiplexed by the NE and sent out from the high speed side, or vice versa.

Digital Cross Connect system

Recent Digital Cross Connect systems (DCSs or DXCs) support numerous high-speed signals, and allow for cross connection of DS1s, DS3s and even STS-3s/12c and so on, from any input to any output. Advanced DCSs can support numerous subtending rings simultaneously.

Network Architectures

Currently, SONET (and SDH) have a limited number of architectures defined. These architectures allow for efficient bandwidth usage as well as protection (i.e. the ability to transmit traffic even when part of the network has failed), and are key in understanding the almost worldwide usage of SONET and SDH for moving digital traffic. The three main architectures are:

  • Linear APS (Automatic Protection Switching), also known as 1+1: This involves four fibers: two working fibers (one in each direction) and two protection fibers. Switching is based on the line state, and may be unidirectional, with each direction switching independently, or bidirectional, where the NEs at each end negotiate so that both directions are generally carried on the same pair of fibers.

  • UPSR (Unidirectional Path Switched Ring): In a UPSR, two redundant (path-level) copies of protected traffic are sent in either direction around a ring. A selector at the egress node chooses the higher-quality copy, so traffic survives when one copy deteriorates due to a broken fiber or other failure. UPSRs tend to sit nearer the edge of a network and, as such, are sometimes called "collector rings". Because the same data is sent around the ring in both directions, the total capacity of a UPSR equals the line rate N of the OC-N ring. For example, if an OC-3 ring used 3 STS-1s to transport 3 DS-3s from ingress node A to egress node D, then 100% of the ring bandwidth (N = 3) would be consumed by nodes A and D; any other nodes on the ring, say B and C, could only act as pass-through nodes. The SDH analog of UPSR is Subnetwork Connection Protection (SNCP); however, SNCP does not impose a ring topology, but may also be used in mesh topologies.

  • BLSR (Bidirectional Line Switched Ring): BLSR comes in two varieties, 2-fiber BLSR and 4-fiber BLSR. BLSRs switch at the line layer. Unlike UPSR, BLSR does not send redundant copies from ingress to egress; rather, the ring nodes adjacent to a failure reroute the traffic "the long way" around the ring. BLSRs trade cost and complexity for bandwidth efficiency, as well as the ability to support "extra traffic" that can be pre-empted when a protection switching event occurs. BLSRs can operate within a metropolitan region or, often, will move traffic between municipalities. Because a BLSR does not send redundant copies from ingress to egress, the total bandwidth that a BLSR can support is not limited to the line rate N of the OC-N ring and can actually be larger than N, depending upon the traffic pattern on the ring. The best case is when all traffic is between adjacent nodes; the worst case is when all traffic on the ring egresses from a single node, i.e. the BLSR is serving as a collector ring, in which case the bandwidth the ring can support equals the line rate N of the OC-N ring. This is why BLSRs are seldom, if ever, deployed in collector rings, but are often deployed in inter-office rings. The SDH equivalent of BLSR is called Multiplex Section-Shared Protection Ring (MS-SPRING).
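The capacity contrast between UPSR and BLSR can be made concrete with a toy model: a UPSR caps the sum of all protected demands at the line rate N, while a BLSR caps only the load on its busiest span. The sketch below uses illustrative assumptions (shortest-arc routing, an invented demand set):

```python
# Toy capacity model for an OC-12 ring (N = 12 STS-1s of line rate).
# A UPSR consumes ring bandwidth for every protected demand; a BLSR
# only loads the spans a demand's working path actually crosses.
def upsr_load(demands):
    """Total ring bandwidth consumed: every demand circles the whole ring."""
    return sum(sts for _, _, sts in demands)

def blsr_max_span_load(nodes, demands):
    """Busiest-span load when each demand takes the shorter arc."""
    spans = [0] * nodes                      # span i joins node i and node i+1
    for src, dst, sts in demands:
        cw = (dst - src) % nodes             # clockwise hop count
        if cw <= nodes - cw:                 # route along the clockwise arc
            loaded = [(src + k) % nodes for k in range(cw)]
        else:                                # shorter arc is counter-clockwise
            loaded = [(dst + k) % nodes for k in range(nodes - cw)]
        for s in loaded:
            spans[s] += sts
    return max(spans)

# Each of 4 nodes sends 3 STS-1s to its clockwise neighbour.
adjacent = [(i, (i + 1) % 4, 3) for i in range(4)]
print(upsr_load(adjacent))               # 12 -> a UPSR OC-12 is already full
print(blsr_max_span_load(4, adjacent))   # 3  -> a BLSR OC-12 has room to spare
```

With this adjacent-node pattern the BLSR's busiest span carries only 3 STS-1s, illustrating why BLSR bandwidth can exceed N for favourable traffic patterns.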

Synchronization

Clock sources used for synchronization in telecommunications networks are rated by quality, commonly called a 'stratum' level. Typically, a network element uses the highest-quality stratum available to it, which it can determine by monitoring the Synchronization Status Messages (SSM) of selected clock sources.

As for Synchronization sources available to an NE, these are:

  • Local External Timing. This is generated by a caesium atomic clock or a satellite-derived clock, by a device located in the same central office as the NE. The interface is often a DS1, with Sync Status Messages supplied by the clock and placed into the DS1 overhead.

  • Line-derived timing. An NE can choose (or be configured) to derive its timing from the line-level, by monitoring the S1 sync status bytes to ensure quality.

  • Holdover. As a last resort, in the absence of higher quality timing, an NE can go into "holdover" until higher quality external timing becomes available again. In this mode, an NE uses its own timing circuits as a reference.
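The selection among these sources can be sketched as follows, assuming lower stratum numbers indicate higher quality and that holdover is the fallback of last resort (the source names and SSM values are hypothetical):

```python
# Pick the highest-quality clock source by stratum level (lower = better).
# Source names and advertised SSM stratum values here are invented.
def select_clock(sources):
    """sources: name -> stratum advertised via SSM, or None for 'do not use'."""
    usable = {name: st for name, st in sources.items() if st is not None}
    if not usable:
        return "holdover"                  # last resort: internal timing circuits
    return min(usable, key=usable.get)     # best (lowest) stratum wins

ssm = {"external-BITS": 1, "line-east": 2, "line-west": None}
print(select_clock(ssm))                   # external-BITS
print(select_clock({"line-east": None}))   # holdover
```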

Timing loops

A timing loop occurs when NEs in a network are each deriving their timing from other NEs, without any of them being a "master" timing source. This network loop will eventually see its own timing "float away" from any external networks, causing mysterious bit errors and ultimately, in the worst cases, massive loss of traffic. The source of these kinds of errors can be hard to diagnose. In general, a network that has been properly configured should never find itself in a timing loop, but some classes of silent failures could nevertheless cause this issue.
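Since each NE derives timing from a single reference, the timing relationships form a functional graph, and a timing loop is a cycle in that graph that never reaches a master source. A minimal detection sketch (the node names are invented):

```python
# Detect timing loops: each NE derives timing from exactly one parent NE,
# or from an external master clock (parent None). Following the chain from
# an NE and revisiting a node before reaching a master indicates a loop.
def find_timing_loop(parents):
    """parents: dict NE -> its timing source NE, or None for a master clock."""
    for start in parents:
        seen = set()
        node = start
        while node is not None:
            if node in seen:
                return sorted(seen)   # nodes on (or feeding into) the loop
            seen.add(node)
            node = parents[node]
    return None                       # every chain terminates at a master

# "A" is externally timed; B, C and D chase each other -> timing loop.
net = {"A": None, "B": "C", "C": "D", "D": "B"}
print(find_timing_loop(net))  # ['B', 'C', 'D']
```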

Next-generation SONET/SDH

SONET/SDH development was originally driven by the need to transport multiple PDH signals like DS1, E1, DS3 and E3 along with other groups of multiplexed 64 kbit/s pulse-code modulated voice traffic. The ability to transport ATM traffic was another early application. In order to support large ATM bandwidths, the technique of concatenation was developed, whereby smaller multiplexing containers (e.g., STS-1) are inversely multiplexed to build up a larger container (e.g., STS-3c) to support large data-oriented pipes.

One problem with traditional concatenation, however, is inflexibility. Depending on the data and voice traffic mix that must be carried, there can be a large amount of unused bandwidth left over, due to the fixed sizes of concatenated containers. For example, fitting a 100 Mbit/s Fast Ethernet connection inside a 155 Mbit/s STS-3c container leads to considerable waste.
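The waste is easy to quantify. Taking the commonly quoted STS-3c SPE payload of roughly 149.76 Mbit/s (an approximate figure, used here only for illustration):

```python
# Rough utilization of a fixed concatenated container. The payload rate
# is the commonly quoted approximate STS-3c SPE figure, not an exact value.
def utilization(client_mbps, container_mbps):
    return client_mbps / container_mbps

u = utilization(100.0, 149.76)          # Fast Ethernet inside STS-3c
print(f"{u:.1%} used, {1 - u:.1%} wasted")  # 66.8% used, 33.2% wasted
```

Roughly a third of the container is stranded, which is the motivation for the virtual concatenation scheme described next.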

Virtual Concatenation (VCAT) allows for a more arbitrary assembly of lower order multiplexing containers, building larger containers of fairly arbitrary size (e.g. 100 Mbit/s) without the need for intermediate NEs to support this particular form of concatenation. Virtual Concatenation increasingly leverages X.86 or Generic Framing Procedure (GFP) protocols in order to map payloads of arbitrary bandwidth into the virtually concatenated container.
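The sizing arithmetic behind VCAT can be sketched simply: pick a member container and concatenate just enough members to cover the client rate. The member payload rates below are commonly quoted approximate figures, used here for illustration only:

```python
import math

# Members needed to carry a client over a virtually concatenated group.
# Payload rates (Mbit/s) are approximate, commonly quoted figures.
MEMBER_PAYLOAD = {"VT1.5": 1.6, "STS-1": 48.384, "STS-3c": 149.76}

def vcat_members(client_mbps, member):
    rate = MEMBER_PAYLOAD[member]
    n = math.ceil(client_mbps / rate)
    return n, client_mbps / (n * rate)      # (member count, efficiency)

n, eff = vcat_members(100.0, "VT1.5")       # Fast Ethernet over a VT1.5 group
print(n, f"{eff:.1%}")                      # 63 members at ~99% efficiency
```

Compared with the ~67% utilization of a fixed STS-3c, a virtually concatenated group sized to the client wastes almost nothing.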

Link Capacity Adjustment Scheme (LCAS) allows the bandwidth of a virtually concatenated group to be changed dynamically, adding or removing member containers based on the short-term bandwidth needs in the network.

The set of next generation SONET/SDH protocols to enable Ethernet transport is referred to as Ethernet over SONET/SDH (EoS).





Plesiochronous Digital Hierarchy (PDH) is a technology used in telecommunications networks to transport large quantities of data over digital transport equipment such as fibre optic and microwave radio systems. The term plesiochronous is derived from Greek plesio, meaning near, and chronos, time, and refers to the fact that PDH networks run in a state where different parts of the network are nearly, but not quite perfectly, synchronised.

PDH is now being replaced by Synchronous Digital Hierarchy (SDH) equipment in most telecommunications networks.

PDH allows transmission of data streams that are nominally running at the same rate, while allowing some variation in speed around the nominal rate. By analogy, any two watches are nominally running at the same rate, clocking up 60 seconds every minute. However, there is no link between watches to guarantee they run at exactly the same rate, and it is highly likely that one is running slightly faster than the other.

Implementation

The European and American versions of the PDH system differ slightly in the detail of their working, but the principles are the same. The European E-carrier system is described below.

The basic data transfer rate is a stream of 2048 kbit/s. For speech transmission, this is broken down into thirty 64 kbit/s channels plus two 64 kbit/s channels used for signalling and synchronisation. Alternatively, the whole 2 Mbit/s may be used for non-speech purposes, for example data transmission.

The exact data rate of the 2 Mbit/s data stream is controlled by a clock in the equipment generating the data. The rate is allowed to vary by ±50 ppm either side of exactly 2.048 Mbit/s, which means that different 2 Mbit/s data streams can be (and probably are) running at slightly different rates to one another.
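In absolute terms, ±50 ppm on the nominal 2.048 Mbit/s rate is a window of only about ±102 bit/s:

```python
# ±50 ppm tolerance on the nominal E1 rate of 2 048 000 bit/s.
NOMINAL = 2_048_000          # bit/s
PPM = 50

delta = NOMINAL * PPM / 1_000_000          # 102.4 bit/s
print(f"allowed range: {NOMINAL - delta:.1f} .. {NOMINAL + delta:.1f} bit/s")
# allowed range: 2047897.6 .. 2048102.4 bit/s
```

Small as this window is, it is enough to guarantee that independently clocked streams drift relative to one another, which is what the justification mechanism below compensates for.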

In order to move multiple 2 Mbit/s data streams from one place to another, they are combined together, or "multiplexed" in groups of four. This is done by taking 1 bit from stream #1, followed by 1 bit from stream #2, then #3, then #4. The transmitting multiplexer also adds additional bits in order to allow the far end receiving multiplexer to decode which bits belong to which 2-Meg data stream, and so correctly reconstitute the original data streams. These additional bits are called "justification" or "stuffing" bits.

Because each of the four 2 Mbit/s data streams is not necessarily running at the same rate, some compensation has to be made. The transmitting multiplexer combines the four data streams assuming that they are running at their maximum allowed rate. This means that occasionally, (unless the 2 Mbit/s really is running at the maximum rate) the multiplexer will look for the next bit but it will not have arrived. In this case, the multiplexer signals to the receiving multiplexer that a bit is "missing". This allows the receiving multiplexer to correctly reconstruct the original data for each of the four 2 Mbit/s data streams, and at the correct, different, plesiochronous rates.

The resulting data stream from the above process runs at 8,448 kbit/s (about 8 Mbit/s). Similar techniques are used to combine four × 8 Mbit/s streams into 34 Mbit/s, four × 34 Mbit/s into 140 Mbit/s, and four × 140 Mbit/s into 565 Mbit/s.
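The standard E-carrier rates make the overhead at each stage explicit: each level runs slightly faster than four times the level below, the excess being the justification and framing bits added at that multiplexing stage.

```python
# Standard European PDH rates in kbit/s (E1 through E5). Each level is
# slightly more than 4x the one below; the excess is the justification
# and framing overhead added at that multiplexing stage.
RATES = [2048, 8448, 34368, 139264, 564992]

for lower, upper in zip(RATES, RATES[1:]):
    overhead = upper - 4 * lower
    print(f"4 x {lower} -> {upper} kbit/s (overhead {overhead} kbit/s)")
```

For example, four 2048 kbit/s tributaries account for 8192 kbit/s of the 8448 kbit/s E2 frame, leaving 256 kbit/s of overhead.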

565 Mbit/s is the rate typically used to transmit data over a fibre optic system for long distance transport. Recently, telecommunications companies have been replacing their PDH equipment with SDH equipment capable of much higher transmission rates.






 
