Discussion:
[j-nsp] auto b/w mpls best practice -- cpu spikes
tim tiriche
2018-09-12 05:04:03 UTC
Hi,

Attached is my MPLS auto-bandwidth configuration. I see frequent path changes
and CPU spikes. I have a small network and wanted to know if there are any
optimizations/best practices I could follow to reduce the churn.

protocols {
    mpls {
        statistics {
            file mpls.statistics size 1m files 10;
            interval 300;
            auto-bandwidth;
        }
        log-updown {
            syslog;
            trap;
            trap-path-down;
            trap-path-up;
        }
        traffic-engineering mpls-forwarding;
        rsvp-error-hold-time 25;
        smart-optimize-timer 180;
        ipv6-tunneling;
        optimize-timer 3600;
        label-switched-path <*> {
            retry-timer 600;
            random;
            node-link-protection;
            adaptive;
            auto-bandwidth {
                adjust-interval 7200;
                adjust-threshold 20;
                minimum-bandwidth 1m;
                maximum-bandwidth 9g;
                adjust-threshold-overflow-limit 2;
                adjust-threshold-underflow-limit 4;
            }
            primary <*> {
                priority 5 5;
            }
        }
    }
}
_______________________________________________
juniper-nsp mailing list juniper-***@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp
Saku Ytti
2018-09-12 09:11:26 UTC
Hey Tim,

I'd optimise for customer experience, not CPU utilisation. Do you have
issues with convergence time or suboptimal paths?

Which JunOS are you running? There are quite good reasons to jump to a
recent JunOS for RSVP: you can get RSVP its own core, and you can get
make-before-break LSP reoptimisation that is actually event-driven
rather than timer-based (which is what you have, and which causes LSP
blackholing if LSP convergence takes longer than the timers).
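
Before touching timers, it may be worth confirming what the CPU is actually
being spent on. A minimal sketch from operational mode (the hostname is a
placeholder and the trailing annotations are mine, not CLI syntax):

user@router> show system processes extensive | match rpd    # rpd CPU over time
user@router> set task accounting on                         # enable rpd per-task accounting
user@router> show task accounting                           # which tasks (RSVP, BGP, ...) burn the cycles
user@router> set task accounting off                        # turn it back off when done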
--
++ytti
tim tiriche
2018-09-13 09:13:02 UTC
No issues with convergence or suboptimal paths. The NOC is constantly
seeing high CPU alerts, though, and that was concerning. Is this normal in
other networks?

Running 14.1R7.4 on a mix of MX480/MX240.
I usually follow the code listed here:
https://kb.juniper.net/InfoCenter/index?page=content&id=KB21476

Which code version did these optimizations land in?
Saku Ytti
2018-09-13 11:39:10 UTC
I think 16.1 was first.

ps Haux|grep rpd should show multiple rpd lines.

Also
***@r41.labxtx01.us.bb> show task io | match {
KRT IO task                0       0       0       0       0  {krtio-th}
krtio-th                   0       0       0       0       0  {krtio-th}
krt ioth solic client      0       0     869       0       0  {krtio-th}
KRT IO                     0       0       0       0       0  {krtio-th}
bgpio-0-th                 0       0       0       0       0  {bgpio-0-th}
rsvp-io                    0       0       0       0       0  {rsvp-io}
jtrace_jthr_task           0       0       0       0       0  {TraceThread}

I'd just go latest and greatest.
--
++ytti
Tom Beecher
2018-09-13 15:12:41 UTC
There's no one magic knob that fixes CPU spikes in an MPLS environment.
They're all different. What I change to optimize mine might knock your
network over in 5 minutes. You need to determine what is triggering the
churn before you can reasonably optimize it. Take a look at the logs, see
which path changes are causing the CPU spikes, and work from there (a few
starting points are sketched below).
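
For example (the hostname is a placeholder and the annotations are mine),
the usual places to look on the ingress routers are:

user@router> show log messages | match RPD_MPLS    # LSP up/down/change syslog events
user@router> show mpls lsp extensive               # per-LSP history: reroutes, auto-bw adjustments, errors
user@router> show mpls lsp autobandwidth           # measured vs. signalled bandwidth, adjust timers
user@router> show rsvp session detail              # what is actually reserved along each path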

Having pre-signaled secondary paths is generally a good idea, although
with those, try to use the sync-active-path-bandwidth command too, to
prevent stale secondary RSVP reservations (roughly along the lines of the
sketch below). Make-before-break is almost universally a good idea too.
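
A minimal sketch of that idea, assuming an LSP called PE1-TO-PE2 with paths
VIA-P1 and VIA-P2 (all names and the egress address are made up, and
sync-active-path-bandwidth placement/availability depends on the release):

label-switched-path PE1-TO-PE2 {
    to 192.0.2.2;
    adaptive;                          # make-before-break on reroute and resize
    sync-active-path-bandwidth;        # keep the standby reservation in step with the active path
    auto-bandwidth {
        adjust-interval 7200;
    }
    primary VIA-P1;
    secondary VIA-P2 {
        standby;                       # pre-signal the secondary
    }
}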

On code, personally I wouldn't ever go latest and greatest. That usually
means you just find the latest and greatest bugs. :) I go with the newest
stable version that doesn't have bugs that bite me, and I upgrade only for
a feature I need, an optimization I want, or security reasons.

Find your churn causes, work from there.
Saku Ytti
2018-09-13 17:11:07 UTC
RPD was a single-threaded application until very recently, so your RSVP
would compete for a single core with every other task. That is perhaps
not a huge deal if you run a BGP-free MPLS core, but if you don't, you
are going to see a massive improvement from running a later JunOS with
multithreaded RPD. There are only a few threads, and one of them happens
to be RSVP; it was included in the initial release of multithreaded RPD
because that is where practical deployments benefit most from multicore.

You can talk to your account team about what code improvements have
come after RSVP was multithreaded; there are orders-of-magnitude
convergence benefits in real customer networks from changing nothing but
the JunOS release.

The OP's release does not support LSP self-ping or adaptive teardown,
both of which are needed for make-before-break to actually work, rather
than just hoping it works.

Juniper has also made fundamental changes to how they develop and
release software and has gone back to a single-branch model, which you
can start capitalising on after 17.3, IIRC. You can ask your account
team to substantiate the quality improvements with data, such as how
many bugs are found in releases over time at a specific point of the
release cycle.

Generally, the strategy when you need new software should be:

- pick the latest long-term-supported release to test
- if it fails your test, go back to step 1 with latest-1
- if it passes the test, move to a newer rebuild if you hit bugs; if
you need new features, restart the process

The adage that new software is bad and old software is good is not data
driven. We also need to understand what the vendor is doing: how they
develop, how they test, how they release, and, when they change
something, in which release the changes will appear and what the data
says about the success of their efforts.

In my mind all major vendors have significantly improved their story in
the past few years. That won't say anything meaningful about success in
any specific deployment, but I buy the vendors' story, and I believe
that on average new releases are more successful today than they were,
say, just three years ago.
--
++ytti
Mark Tinka
2018-09-15 16:56:06 UTC
Fully agree with Saku's point above about new vs. old software.

Mark.
a***@netconsultings.com
2018-09-16 09:29:48 UTC
My advice in short: Integrated Services QoS sucks; use Differentiated
Services QoS instead (a rough sketch follows below) and use RSVP-TE solely
for TE purposes. It will make your life so much easier.
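
To make the DiffServ suggestion concrete, a very rough sketch of per-class
queueing on a core link (class names, percentages and the interface are
made up, and a real deployment also needs classifiers and rewrite rules):

class-of-service {
    forwarding-classes {
        class BEST-EFFORT queue-num 0;
        class REAL-TIME queue-num 1;
    }
    schedulers {
        SCH-BE {
            transmit-rate percent 80;
            priority low;
        }
        SCH-RT {
            transmit-rate percent 20;
            priority high;
        }
    }
    scheduler-maps {
        CORE-QOS {
            forwarding-class BEST-EFFORT scheduler SCH-BE;
            forwarding-class REAL-TIME scheduler SCH-RT;
        }
    }
    interfaces {
        xe-0/0/0 {
            scheduler-map CORE-QOS;
        }
    }
}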

The problem with IntServ QoS is twofold:
1) TE tunnels need to have their BW adjusted.
2) Nodes in the network need to know the available BW (per class) on each
link.
Since changes in 1) induce changes in 2), it's just not meant to scale.

Yes, you can make the setup scale, but only by making it very stiff (see
the sketch below this list):
1) Reduce the sampling frequency and/or make the adjust interval longer,
and disable the underflow/overflow thresholds or make them large.
2) Populate the link BW thresholds more sparsely.
But then the side effect is not reacting to changes in time, and you'll
likely run into the BW trailing effect.
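
A sketch of both points against the OP's config (interface and values are
arbitrary illustrations, not recommendations):

protocols {
    rsvp {
        interface xe-0/0/0.0 {
            update-threshold 10;        # point 2: flood IGP-TE bandwidth changes less eagerly
        }
    }
    mpls {
        statistics {
            interval 600;               # point 1: sample half as often as the OP
        }
        label-switched-path <*> {
            auto-bandwidth {
                adjust-interval 86400;  # re-signal at most once a day
                adjust-threshold 30;    # and only on a sizeable change
            }
        }
    }
}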

I guess I could come up with some use cases for IntServ, but those are
rather corner cases. And thinking about it, I believe even those could
still be solved just by adding more static tunnels to increase the
granularity of load distribution (something like the sketch below).
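
In Junos terms that would be something like several parallel
fixed-bandwidth LSPs toward the same egress, which traffic to that egress
can then be hashed across (names, addresses and bandwidths are made up):

label-switched-path PE1-PE2-A {
    to 192.0.2.2;
    bandwidth 2g;
}
label-switched-path PE1-PE2-B {
    to 192.0.2.2;
    bandwidth 2g;
}
label-switched-path PE1-PE2-C {
    to 192.0.2.2;
    bandwidth 2g;
}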

But I'd be very interested to hear about cases where IntServ is the only
remedy.

adam

netconsultings.com
::carrier-class solutions for the telecommunications industry::

