[j-nsp] Network design problem in a bridged setup with 2x Juniper MX and some Brocade SuperX

Discussion:

Jeff Meyers

12 years ago

Hello list,

I'm currently a little stuck and might need some help in order to decide
how to improve the current setup. We are running a network where all
customer vlans are bridged because the same Vlan is usually required in
different areas in the network. This is the setup:

Room A:          +--------+
                 | SX1600 |--------> [ 2nd SuperX not installed yet ]
                 +--------+
                 |        |
                 |        |
        +--------+        +--------+
        | MX480  |--------|  MX80  |
        +--------+        +--------+
             |                 |
             |                 |
Room B: +--------+        +--------+
        | SX400  |--------| SX400  |
        +--------+        +--------+

Both MX routers have a 10G link between each other with RSTP active, as
do the two SuperXes in Room B. These are the priorities:

MX480: 0 (root bridge)
MX80: 4k (backup root)
SX400: both 16k
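
As a rough illustration of what these priorities mean for root election, here is a minimal Python sketch. It assumes the standard rule that the numerically lowest bridge ID (priority first, then MAC address) wins. The MAC addresses and the function names are made up for the example and do not come from the real devices.

```python
# Bridge IDs compare as (priority, MAC); the lowest tuple wins the election.
# The MAC addresses are invented placeholders, not the real device addresses.
bridges = {
    "MX480": (0, "00:00:00:00:00:01"),
    "MX80": (4096, "00:00:00:00:00:02"),
    "SX400-1": (16384, "00:00:00:00:00:03"),
    "SX400-2": (16384, "00:00:00:00:00:04"),
}


def elect_root(candidates):
    """Return the name of the bridge with the lowest (priority, MAC) tuple."""
    return min(candidates, key=lambda name: candidates[name])


def elect_root_without(candidates, failed):
    """Return the bridge that takes over as root once `failed` is gone."""
    remaining = {n: b for n, b in candidates.items() if n != failed}
    return elect_root(remaining)


root = elect_root(bridges)
backup = elect_root_without(bridges, root)
print(root, backup)
```

With equal priorities, as on the two SX400s, the MAC address is the tie-breaker, which is why it is part of the tuple.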

Because topology changes caused some minor packet loss in Room B, I
installed the SX1600 with MSTP instead of RSTP to see if that performs
better. During some tests before connecting customers to the SX1600,
results looked fine. We proceeded with the setup and replaced the old
Cisco 6509/sup32 with the SX1600 and turned all routed Vlans active on
the Cisco into bridged Vlans.

I'm running just one instance of MSTP (CIST) on the SX1600 with the
following configuration:

mstp scope all
mstp instance 0 vlan 1
mstp instance 0 vlan 19
...
mstp instance 0 priority 16384
mstp edge-port-auto-detect
mstp start

On this SX1600, most uplinks go to switches on their own, usually HP
ProCurve 2600 or 2800 series. Although we manage those switches,
customers can install cables on their own. And here is where the problem
actually starts: a rack with two ProCurve switches installed receives
two uplinks from the same SX1600 and those switches are connected with
each other, causing a loop. No matter what I did, the loop continued to
cause trouble for the whole network because the MX routers saw topology
changes all the time (every few seconds up to every 200 seconds or so)
and flushed the whole ARP cache each time. With about 90,000 active ARP
entries, this of course had a fairly heavy impact on the servers behind
them. Although STP was active on both HP switches, the problem didn't
vanish, yet the topology change itself did not seem to be visible on the SX1600. In
order to solve the issue, we had to remove the cable causing the loop
but of course this can't be the solution since customers may install a
new loop anytime and what's the point in running STP if you need to care
about that?

The question is now how to proceed and how to improve the setup
generally? Does it make sense to change RSTP to MSTP on the MX routers
in the first place? Is there any configuration I should perform on any
of those devices involved?
Since many of you are most likely from the Cisco world, here is a list
of the available commands on the SuperX running in MST mode:

SSH at A.cs0 (config)#mst
  admin-edge-port          Define this port to be an edge port
  admin-pt2pt-mac          Define this port to be a point-to-point link
  disable                  Disable MSTP on this interface
  edge-port-auto-detect    Enable/Disable auto-detect edge port
  force-migration-check    Trigger port's migration state machine check
  force-version            Configure MSTP force version
  forward-delay            Configure bridge parameter forward-delay
  hello-time               Configure bridge parameter hello-time
  instance                 Configure MSTP instance VLAN membership
  max-age                  Configure bridge parameter max-age
  max-hops                 Configure MSTP max-hops
  name                     Configure MSTP configuration name
  revision                 Configure MSTP revision level
  scope                    Configure MSTP scope
  start                    Start/stop MSTP operation

Inside the interface configuration, there is no way to configure e.g. a
bpdu-protect on the port but root-protect is configured on every port
towards customer switches.

I would be grateful for any hints, and I am aware that some of you
might declare the setup to be broken but on the other hand, for
colocation services where the same vlan might be required campus-wide,
it's hard to improve that without installing tons of cables.
Furthermore, we want to eliminate the dependency of just one big
core-switch. Both rooms are equally important and in the past, we had a
big core in Room A with downlinks going to smaller core-switches in Room
B but with the big core having a problem, everything was going down.

Thanks so far for reading this and hopefully some great ideas will
follow. Any help will be rewarded with a cold beer in Frankfurt, Germany
anytime! ;-)

Best regards,
Jeff

Huan Pham

12 years ago

Hi

Without going into your details, I think the fundamental issue here is that you rely on the customer to do the right thing with RSTP or MSTP. In this case, you need to allow the customer to take part in your spanning tree domain. If they misconfigure it, then you've got a loop!

Changing from RSTP to MSTP does not solve the problem. Both are spanning tree protocols; the difference is one instance vs. multiple instances.

The solution here is to not trust customer for the layer 2 loop prevention.

The simplest solution is redundant trunk group (RTG). Pls check if your switches facing customer support it.

On Cisco, I think you can use Flex Links (the "switchport backup interface" command) to do the same.

The down side with these solutions is customers have to connect to the same physical switch or virtual chassis.

Pls let me know if this works for you.

...

Tobias Heister

12 years ago

Hi,

Post by Huan Pham
The simplest solution is redundant trunk group (RTG). Pls check if your switches facing customer support it.
On Cisco, I think you can use Flex Links (the "switchport backup interface" command) to do the same.
The down side with these solutions is customers have to connect to the same physical switch or virtual chassis.

Why should they need to connect to the same upstream switch?

RTG or flexlink runs only locally on the TOR or access switch; the upstream switch does not need to know about it. You can connect the access switch running RTG/flexlink to two different core switches without any problems. We do this in basically all of our datacenters.

You only need to make sure that the core-to-core switch link does not fail, or at least is redundant enough; otherwise you may end up with a split situation, as RTG/flexlink does not know about link errors upstream.

regards
Tobias

Huan Pham

12 years ago

Hi Tobias,

Upstream or downstream depends on your perspective. RTG does not run between devices, so it does not care whether the redundant paths are connected upstream or downstream! All it cares about is this: if the primary link is up, the switch blocks all incoming traffic on the redundant paths and does not send any traffic via them, which basically breaks any potential loop. If the primary link is down, the switch starts forwarding traffic via the next preferred link.
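
The behaviour described here can be sketched in a few lines of Python. This is only a model of the selection logic, assuming links are listed in order of preference with the primary first; the function name and the link names are hypothetical, and real RTG/flexlink implementations add timers and preemption settings that are not modelled.

```python
def forwarding_link(links):
    """Pick the single link that is allowed to forward.

    `links` is a list of (name, is_up) tuples ordered by preference, primary
    first. All other links stay blocked, so no loop can form through them.
    Returns None if every link is down.
    """
    for name, is_up in links:
        if is_up:
            return name
    return None


# Primary up: only the primary forwards, the backup is blocked.
print(forwarding_link([("uplink-a", True), ("uplink-b", True)]))
# Primary down: the next preferred link takes over.
print(forwarding_link([("uplink-a", False), ("uplink-b", True)]))
```

Because the decision uses only the local link state, it makes no difference what the devices at the far end of the links are running.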

If you have a diagram and you turn it upside down, then your upstream switches become downstream! Think of it that way.

You are right that in the Juniper config example it is configured on an access switch, and that may be the best practice if you have control of both core and access switches, e.g. in your data centre.

But if you want to control the loop when you connect to a customer, you want to run RTG on your device! If you trust your customer to do the right thing, then you can have the customer's switch connect to more than one of your devices and run RTG on his switch. If he does not configure it properly, then you are open to the loop again.

Huan

---

Huan Pham

...

Tobias Heister

12 years ago

Hi,

...

Yeah, definitely a perspective thing.

Post by Huan Pham
But if you want to control the loop when you connect to a customer, you want to run RTG on your device! If you trust your customer to do the right thing, then you can have the customer's switch connect to more than one of your devices and run RTG on his switch. If he does not configure it properly, then you are open to the loop again.

But I remember Jeff stating that the switches in the cabinets themselves are managed by him and that the customers are "only" able to connect/disconnect cables on said switch(es). In that case I would configure RTG/flexlink on the cabinet/access switch.

In either case, if a customer loops (on) his own switch or switches, there could be a pps storm coming from this switch, and you would have to mitigate that with storm-control limits or other kinds of policers. This should be no different whether you run RTG on the core or the access layer.

If you have to connect "hostile" switches (configured by a customer), then I totally agree with you.

regards
Tobias

Ben Dale

12 years ago

Hi Jeff,

Post by Jeff Meyers
The question is now how to proceed and how to improve the setup generally?

From what you've described, it sounds like there is a misconfiguration or bug *somewhere* amongst your 3 vendors. As painful as it will probably be to locate, that is probably the best place to start.

- Since you're only using a CIST ensure that *every* VLAN is configured on every switch.
- Make sure they are all configured as members of the CIST region too, otherwise your MSTP hash won't match and you'll end up with weird results not unlike what you are seeing.
- Also make sure the MSTP revision level and configuration name are identical on every switch, otherwise the hash won't match again.
- Check all up/downlinks to make sure that there are no boundary ports - this will indicate a problem with one of the above items
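
To make the checklist above concrete, here is a small Python sketch of the region comparison. It assumes the usual MSTP rule that two bridges are in the same region only if configuration name, revision level and the VLAN-to-instance mapping all match, and that any VLAN not mapped to an MSTI falls into instance 0 (the CIST). The plain MD5 hash is a stand-in for the real configuration digest, and the names, revisions and function names are hypothetical.

```python
import hashlib


def region_id(name, revision, vlan_to_msti=None):
    """Build a simplified MST configuration identifier."""
    # Every VLAN not explicitly mapped to an MSTI belongs to instance 0.
    table = [0] * 4096
    for vlan, instance in (vlan_to_msti or {}).items():
        table[vlan] = instance
    # Stand-in digest only: the standard defines its own digest calculation.
    digest = hashlib.md5(",".join(map(str, table)).encode()).hexdigest()
    return (name, revision, digest)


def is_boundary_port(local, neighbour):
    """A port is a region boundary when the two identifiers differ."""
    return local != neighbour


sx1600 = region_id("", 0)      # hypothetical: name and revision left unset
mx480 = region_id("core", 1)   # hypothetical name and revision
print(is_boundary_port(sx1600, mx480))
```

In this model a CIST-only setup always produces the same digest, so a mismatch would come from the name or the revision level rather than from the VLAN list.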

All that said, almost every vendor's implementation has its peculiarities. In EXOS (Extreme Networks) for example, if you don't configure edge-safeguard on your edge ports, then if the *edge* port ever changes state (up or down), a TCN will trigger. Great when everyone shuts down their PCs in the evening at close intervals.

For your customer-facing ports, you want BPDU Protect/Edge Guard, or whatever HP call it, configured. If they loop a port, you shut it down and leave it down. I've seen ESX vSwitches do this on plenty of occasions during reboots, even though they shouldn't (e.g. a loop is briefly formed inside the customer's hypervisor across a supposedly bonded link).

If you're downlinking to customer switches the only real option you have is Root-Protection/Root-Guard. This will block any port that receives a BPDU advertising a superior root bridge. A lot of people make the mistake of either disabling STP on links to "untrusted" switches, or filtering BPDUs altogether so that the customer can run their own xSTP domain beside yours. Bad move. When someone down the track loops that port, you'll remember why. You only want one root bridge in any L2 domain.
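
The difference between the two port protections can be sketched as a small decision function in Python. This is a simplified model, assuming bridge IDs compare as (priority, MAC) tuples where lower is superior; the mode strings, return values and MAC addresses are made up for the example and are not any vendor's actual keywords.

```python
def customer_port_action(mode, current_root, received_root=None):
    """Decide what a customer-facing port does with an incoming BPDU.

    mode is 'bpdu-protect' or 'root-protect'. Bridge IDs are (priority, mac)
    tuples; received_root is None when no BPDU has arrived on the port.
    """
    if received_root is None:
        return "forward"
    if mode == "bpdu-protect":
        # Any BPDU at all on a host-facing port takes the port down.
        return "shutdown"
    if mode == "root-protect" and received_root < current_root:
        # Only a BPDU claiming a superior root blocks the port.
        return "block"
    return "forward"


root = (0, "00:00:00:00:00:01")            # placeholder for the real root
customer = (32768, "00:00:00:00:00:99")    # placeholder customer switch
print(customer_port_action("root-protect", root, customer))
```

A customer switch that advertises a worse bridge ID keeps forwarding under root protection, so it still takes part in the same spanning tree rather than running a separate one.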

On all your switches, enable Storm-Control (or equivalent) with aggressive limits on broadcast traffic. Even with all of the above in place, there is nothing to stop one of your customer's downstream switches that is not running spanning-tree from having its own loop and sending the resulting broadcast storm back to you, and there is very little you can do about it.
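
As a rough idea of what such a limit does, here is a minimal Python sketch of a per-port broadcast limiter. It assumes a simple fixed one-second window with a packets-per-second threshold; real storm-control implementations differ per vendor (percentage of line rate, hardware meters, shutdown actions), and the class and parameter names are hypothetical.

```python
class StormControl:
    """Allow at most `max_pps` broadcast packets per one-second window."""

    def __init__(self, max_pps):
        self.max_pps = max_pps
        self.window = None  # integer second of the current window
        self.count = 0      # packets seen in the current window

    def allow(self, timestamp):
        """Return True if a packet arriving at `timestamp` may pass."""
        window = int(timestamp)
        if window != self.window:
            # A new second has started: reset the counter.
            self.window = window
            self.count = 0
        self.count += 1
        return self.count <= self.max_pps


limiter = StormControl(max_pps=2)
decisions = [limiter.allow(t) for t in (0.1, 0.2, 0.3, 1.0)]
print(decisions)
```

The third packet in the first second is dropped, and forwarding resumes once the next window starts.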

Post by Jeff Meyers
Does it make sense to change RSTP to MSTP on the MX routers in the first place?

Since you've only configured the CIST, your RSTP and MSTP operation is basically equivalent, i.e. you have a single spanning-tree instance across all your VLANs, and convergence time and operation will be pretty much identical.

MSTP is a bit more work to configure and troubleshoot (especially if you're running multiple regions), but gives you that flexibility to lay out different trees across VLAN groups if required.

Post by Jeff Meyers
Is there any configuration I should perform on any of those devices involved?

Hard-set your STP point-to-point mode on all your uplinks; you may find it improves convergence time slightly on some vendors.

Without knowing anything else about your set-up (L3 termination or the capabilities of the SX and HP boxen) you could configure Q-in-Q and use layer2-protocol-tunneling for all your customer's traffic (BPDUs included). Let them manage their own VLANs (give each customer a dedicated S-VLAN to go nuts with) and provide a level of separation between your STP and theirs. The MX can do plenty of packet-fu to Q-in-Q tagged frames in order to terminate to any L3 interfaces.
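
To show what Q-in-Q does to a frame on the wire, here is a short Python sketch that pushes an outer S-VLAN tag in front of the customer's existing C-VLAN tag. It assumes the 802.1ad TPID 0x88A8 for the outer tag (some equipment uses other values, so it is a parameter) and leaves the priority bits at zero; the function name and the VLAN numbers are just examples.

```python
import struct


def push_s_tag(frame: bytes, s_vlan: int, tpid: int = 0x88A8) -> bytes:
    """Insert an outer VLAN tag after the destination and source MACs."""
    if not 1 <= s_vlan <= 4094:
        raise ValueError("S-VLAN ID must be between 1 and 4094")
    # The tag is a 2-byte TPID followed by a 2-byte TCI (PCP/DEI left at 0).
    tag = struct.pack("!HH", tpid, s_vlan)
    # Bytes 0-11 are the two MAC addresses; the tag goes right after them.
    return frame[:12] + tag + frame[12:]


# Customer frame: zeroed MACs, C-VLAN 19 (0x8100 tag), then EtherType IPv4.
customer_frame = bytes(12) + struct.pack("!HH", 0x8100, 19) + b"\x08\x00"
tagged = push_s_tag(customer_frame, s_vlan=100)
print(tagged[12:20].hex())
```

The customer's own tag is carried unchanged behind the outer one, which is what lets each customer reuse any VLAN ID inside their dedicated S-VLAN.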

Good luck!

Ben