Outage #483
Information
Begins at: | 2018-11-30 04:00:00 CET |
Duration: | 120 minutes |
Type: | outage |
State: | resolved |
Impact: | network |
Affected systems: | Nodes: vpsadmin1.prg, node2.prg, node3.prg, node4.prg, node5.prg, node6.prg, node7.prg, node8.prg, node9.prg, node10.prg, node11.prg, node12.prg, node13.prg, node14.prg, node15.prg, node16.prg, node17.prg, node18.prg, backuper.prg, nasbox.prg, node1.pgnd, node2.pgnd |
Summary: | |
English: Network downtime: L2 issues | |
Czech (translated): Networking unavailable: L2 problem | |
Description: | |
English: We experienced issues with L2 networking, most likely caused by some unexpected Cisco "magic". We are currently running with half of the gigabit switches down to prevent further LLDP mess, and we will investigate further tonight. |
|
Czech (translated): In Prague, the interaction between 4 switches broke down; it looks like something to do with LLDP. As a drastic interim solution we had to take down half of the backbone, so that traffic reliably takes one path instead of choosing between two (where neither was selected and neither worked). I will look into it more closely tonight. It is quite possible that we will also move the non-vpsAdminOS production to BGP/10GE, since this cannot be left as it is if it keeps misbehaving. |
|
Handled by: | Pavel Šnajdr, Richard Marko |
Updates
Date | Summary | Reported by |
---|---|---|
2018-11-30 06:03:14 CET | | Pavel Šnajdr |
State: announced | ||
2018-12-01 04:40:54 CET | English: Workaround in effect; Czech (translated): Temporary near-solution | Pavel Šnajdr |
Finished at: 2018-12-01 05:00 CET
State: resolved
English: Effectively, we have no backup now, since enabling the full network configuration seems to trigger a parade of bugs in Mikrotik against the Cisco SG500. In case of a network device problem, we will have to switch power between the two branches of switches. We will speed up deployment of 10GE networking so that it is done by the end of this year. After that we will redo the gigabit networking from scratch, simplified, and with no production traffic on it (management only).
Czech (translated): We are currently running on half of the switches. In case of a hardware failure of a network device, we will have to switch over to the other half by manually reconnecting the power. It looks like we are hitting a buggy interaction in Mikrotik<->Cisco bonding, which is very hard to debug on a live system. We will therefore substantially speed up the deployment of 10GE/BGP networking, which we now need to roll out for the existing production as well, ideally by the end of the year. There are still 6 servers without a 10GE network card installed; otherwise we could already start connecting them. The cards are on order. Once they arrive and are cabled, we plan to convert from OSPF to BGP at night, serially, one server after another. Christmas in the datacenter \o/ But at least we will be ready sooner to join NIX :) |