[j-nsp] MX960 Redundant RE problem

Discussion:

Mohammad

2012-02-15 07:40:51 UTC

Hi everyone

We have an MX960 with two routing engines (RE0: backup, RE1: master).

When we try to switch over to the backup RE, we see the following message:

XXX# run request chassis routing-engine master switch

error: Standby Routing Engine is not ready for graceful switchover
(replication_err soft_mask_err)

Toggle mastership between routing engines ? [yes,no] (no)
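
For anyone scripting around this check, here is a minimal Python sketch that pulls the reason flags out of the error text shown above. The function name is hypothetical, and the sketch assumes the flags always appear space-separated inside the parentheses that follow the message; it does not attempt to interpret what each flag means.

```python
import re


def switchover_error_flags(cli_output: str) -> list[str]:
    """Return the flags listed in parentheses after the GRES readiness error.

    Assumption: the flags are space-separated and directly follow the
    "not ready for graceful switchover" text, possibly on the next line.
    """
    match = re.search(
        r"not ready for graceful switchover\s*\(([^)]*)\)", cli_output
    )
    return match.group(1).split() if match else []


sample = """error: Standby Routing Engine is not ready for graceful switchover
(replication_err soft_mask_err)
Toggle mastership between routing engines ? [yes,no] (no)"""

print(switchover_error_flags(sample))
```

An empty list means the readiness error was not found in the text at all.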

Note that we were able to switch over between the two REs a day before with
no issues.

Also, when we log in to re0 (backup) and check IS-IS, RSVP, etc., we
see the following:

XXX> request routing-engine login other-routing-engine

...

--- JUNOS 10.2R3.10 built 2010-10-16 19:24:06 UTC

{backup}

XXX> show isis adjacency

{backup}

XXX> show rsvp session

Ingress RSVP: 0 sessions

Total 0 displayed, Up 0, Down 0

Egress RSVP: 0 sessions

Total 0 displayed, Up 0, Down 0

Transit RSVP: 0 sessions

Total 0 displayed, Up 0, Down 0

{backup}

XXX>

However, we can see the BGP routes and L3VPN routes.

We have tried replacing the backup RE with another one, but got the same
results.

Any ideas? This issue is really confusing us, and this is a very critical
router in our network.

Thank you in advance

Mohammad Salbad

Morgan McLean

2012-02-15 07:56:13 UTC

Correct me if I'm wrong, but backup routing engines never have adjacencies
or peering relationships etc because they are not active, correct? When
they become master they have to reestablish those sessions. That's how it
seems to be for our SRX routing engines, at least, but routes are shared
between the two so that during the time it takes for those things to
reestablish, the routes are still moving traffic.

I might be wrong, but that was my impression.

Morgan

_______________________________________________
juniper-nsp mailing list juniper-nsp at puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp

Stefan Fouant

2012-02-15 17:24:50 UTC

Morgan,

You are correct if you are running GRES only. However, if you enable NSR, the Backup RE also actively runs rpd and maintains protocol state, adjacencies, etc., so in the event of a Primary RE failure you will not need to reestablish them.

The cool thing is the Backup RE is actually listening to all the control plane messages coming on fxp1 destined for the Master RE and formulating its own decisions, running its own Dijkstra, BGP Path Selection, etc. This approach is preferred over simply mirroring routing state from the Primary to the Backup because it eliminates fate sharing: if there is a bug on the Primary RE, we don't want to create a carbon copy of that on the Backup.

Stefan Fouant
JNCIE-SEC, JNCIE-SP, JNCIE-ER, JNCI
Technical Trainer, Juniper Networks

Follow us on Twitter @JuniperEducate

Sent from my iPad

Daniel Roesen

2012-02-15 18:56:57 UTC

Post by Stefan Fouant
The cool thing is the Backup RE is actually listening to all the
control plane messages coming on fxp1 destined for the Master RE
and formulating its own decisions, running its own Dijkstra,
BGP Path Selection, etc. This approach is preferred over simply
mirroring routing state from the Primary to the Backup because it
eliminates fate sharing: if there is a bug on the Primary RE, we
don't want to create a carbon copy of that on the Backup.

I don't really buy that argument. Running the same code with the same
algorithm against the same data usually leads to the same results.
You'll get full bug redundancy - I'd expect both REs to crash simultaneously.
Did NSR protect from any of the recent BGP bugs?

The advantages I see are less impactful failovers in case of a) hardware
failures of active RE, or b) data structure corruption happening on both
REs [same code => same bugs], but eventually leading to a crash of the
active RE sooner than on the backup RE, or c) race conditions being
triggered sufficiently differently timing-wise so only active RE
crashes.

Am I missing something?

Best regards,
Daniel

--
CLUE-RIPE -- Jabber: dr at cluenet.de -- dr at IRCnet -- PGP: 0xA85C8AA0

Joel jaeggli

2012-02-15 19:07:48 UTC

When ISSU actually works, it's a godsend.

Stefan Fouant

2012-02-15 21:08:01 UTC

I was referring more to a bug in hardware... Bad memory, etc.

Stefan Fouant
JNCIE-SEC, JNCIE-SP, JNCIE-ER, JNCI
Technical Trainer, Juniper Networks

Follow us on Twitter @JuniperEducate

Sent from my iPad

Mohammad

2012-02-18 20:47:04 UTC

Hi All

Thank you for your support. Most likely, what we are going to do is:
- try turning GRES/NSR off and on again
- upgrade to 10.4R8.5 or 10.4R9
We are currently waiting for a response from JTAC.
I'll let you know once it is solved.

Thank you again
Mohammad Salbad

Per Granath

2012-02-15 08:00:56 UTC

Post by Mohammad
We have an MX960 with two routing engines, Re0: Backup, Re1: Master
XXX# run request chassis routing-engine master switch
error: Standby Routing Engine is not ready for graceful switchover
(replication_err soft_mask_err)

Disable graceful-switchover (and nonstop-routing) and then commit (assuming there is commit synchronize).
Then enable it again, commit, and wait for the REs to sync.

Possibly the kernel database is not healthy.

...or try JTAC :)
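
The disable, commit, re-enable sequence suggested above can be sketched as an ordered list of configuration-mode commands. This is only an illustration: it assumes GRES and NSR are configured at the usual `chassis redundancy graceful-switchover` and `routing-options nonstop-routing` hierarchies, uses `deactivate`/`activate` as one way to "disable" and "enable" them, and issues `commit synchronize` explicitly. The helper name `resync_commands` is hypothetical.

```python
# Configuration paths assumed for this sketch; adjust to the actual config.
GRES = "chassis redundancy graceful-switchover"
NSR = "routing-options nonstop-routing"


def resync_commands(nsr_configured: bool = True) -> list[str]:
    """Build the command order: disable GRES (and NSR), commit, re-enable, commit."""
    paths = [GRES] + ([NSR] if nsr_configured else [])
    commands = [f"deactivate {path}" for path in paths]
    commands.append("commit synchronize")
    commands += [f"activate {path}" for path in paths]
    commands.append("commit synchronize")
    return commands


for command in resync_commands():
    print(command)
```

After the second commit, the REs still need time to resynchronize before another switchover attempt.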

Diogo Montagner

2012-02-15 08:03:39 UTC

You have GRES enabled and the backup RE was not ready to take over. See
the message in the first lines.

Thanks

--
Sent from my mobile device

./diogo -montagner

Mohammad

2012-02-15 10:44:42 UTC

Please find the following output; I hope it is helpful.
xxxxx> show task replication
        Stateful Replication: Enabled
        RE mode: Master

    Protocol                Synchronization Status
    OSPF                    Complete
    BGP                     Complete
    IS-IS                   Complete
    MPLS                    Complete
    RSVP                    Complete

{master}
xxxx>
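
If this output is being collected by a script, a minimal Python sketch like the one below could flag any protocol that has not finished synchronizing. It assumes the table layout shown above (a header row starting with "Protocol", one protocol per row, a blank line after the table) and treats anything other than "Complete" as not synchronized. Both function names are hypothetical.

```python
def parse_task_replication(output: str) -> dict[str, str]:
    """Map each protocol in the table to its synchronization status."""
    statuses: dict[str, str] = {}
    in_table = False
    for line in output.splitlines():
        fields = line.split()
        if not in_table:
            # The table starts right after the "Protocol ..." header row.
            in_table = fields[:1] == ["Protocol"]
            continue
        if not fields:
            break  # a blank line ends the table
        statuses[fields[0]] = " ".join(fields[1:])
    return statuses


def unsynced(output: str) -> list[str]:
    """Protocols whose status is anything other than "Complete"."""
    return [p for p, s in parse_task_replication(output).items() if s != "Complete"]


sample = """xxxxx> show task replication
        Stateful Replication: Enabled
        RE mode: Master

    Protocol                Synchronization Status
    OSPF                    Complete
    BGP                     Complete
    IS-IS                   Complete
    MPLS                    Complete
    RSVP                    Complete

{master}
"""

print(unsynced(sample) or "all protocols report Complete")
```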

Serge Vautour

2012-02-15 15:47:18 UTC

You can also run the following command on the backup RE to check its state:

me at BLAH-re1> show system switchover
Graceful switchover: On
Configuration database: Ready
Kernel database: Ready
Peer state: Steady State

If this command and "show task replication" on the master RE don't show the expected output, I agree with the recommendation to turn GRES/NSR off and on again. If that doesn't work, reboot the REs.
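
As a rough illustration of that check, the Python sketch below compares "show system switchover" output against the values in the sample above. It assumes those four values are what a ready backup RE reports; the names `EXPECTED` and `not_ready_fields` are hypothetical.

```python
# Values taken from the sample output above; assumed to indicate a ready backup RE.
EXPECTED = {
    "Graceful switchover": "On",
    "Configuration database": "Ready",
    "Kernel database": "Ready",
    "Peer state": "Steady State",
}


def not_ready_fields(output: str) -> dict[str, str]:
    """Return the fields that differ from EXPECTED (or "missing" if absent)."""
    seen: dict[str, str] = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in EXPECTED:
            seen[key.strip()] = value.strip()
    return {
        key: seen.get(key, "missing")
        for key, wanted in EXPECTED.items()
        if seen.get(key) != wanted
    }


sample = """me at BLAH-re1> show system switchover
Graceful switchover: On
Configuration database: Ready
Kernel database: Ready
Peer state: Steady State
"""

print(not_ready_fields(sample) or "backup RE looks ready")
```

An empty result means every field matched; any entry in the returned dictionary is a field worth looking at before attempting the switchover.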

Serge
