Outage #483

Information

Begins at: 2018-11-30 04:00:00 CET
Duration: 120 minutes
Planned:
State: closed
Type: network
Affected systems: Node vpsadmin.prg
Node node2.prg
Node node3.prg
Node node4.prg
Node node5.prg
Node node6.prg
Node node7.prg
Node node8.prg
Node node9.prg
Node node10.prg
Node node11.prg
Node node12.prg
Node node13.prg
Node node14.prg
Node node15.prg
Node node16.prg
Node node17.prg
Node node18.prg
Node backuper.prg
Node nasbox.prg
Node node1.pgnd
Node node2.pgnd
Summary:
English: Network downtime: L2 issues
Czech (translated): Networking unavailable: problem on L2
Description:
English: We experienced issues with L2 networking, most likely caused by some unexpected Cisco "magic".

We are currently running with half of the gigabit switches down to prevent further LLDP trouble; we will investigate further tonight.
Czech (translated): In Prague, the interaction among 4 switches broke down; it looks like something to do with LLDP. As a drastic interim fix, we had to take down half of the backbone so that traffic goes over one path for certain, instead of two optional ones (of which neither was selected and neither worked).

I will look into it more closely tonight. It is quite possible that we will move the non-vpsAdminOS production to BGP/10GE as well; this cannot be left as it is if it keeps misbehaving...
Handled by: Pavel Šnajdr, Richard Marko

Updates

Date: 2018-11-30 06:03:14 CET
Reported by: Pavel Šnajdr
State: announced

Date: 2018-12-01 04:40:54 CET
Summary:
English: Workaround in effect
Czech (translated): Temporary near-fix
Reported by: Pavel Šnajdr
Finished at: 2018-12-01 05:00 CET
State: closed

English:
Effectively, we have no network backup right now, since enabling the full network configuration seems to trigger a parade of bugs in Mikrotik when talking to the Cisco SG500.

If a network device fails, we will have to switch power between the two branches of switches.

We will speed up deployment of 10GE networking so that it is done by the end of this year. After that, we will redo the gigabit networking from scratch in a simplified form, with no production traffic on it (management only).

Czech (translated): We are currently running on half of the switches. If a network device suffers a hardware failure, we will need to switch over to the other half by manually re-plugging the power.

It looks like we are hitting a buggy interaction in Mikrotik<->Cisco bonding, which is very hard to debug on a live system. We will therefore substantially speed up the deployment of 10GE/BGP networking, which we now need to roll out for the existing production as well, ideally by the end of the year.

There are still 6 servers without a 10GE network card installed; otherwise we could already start connecting. The cards are ordered, and once they arrive and are cabled, we plan to convert from OSPF to BGP at night, serially, one server after another. Christmas in the datacenter \o/

But at least we will be ready to join NIX sooner :)
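
The serial, one-server-at-a-time conversion from OSPF to BGP mentioned above can be illustrated with a minimal sketch. This is not the tooling actually used; the function name `convert_serially`, the `convert` and `is_healthy` callbacks, and the server names are all hypothetical placeholders. The point is only that each server is verified before the next one is touched, so at most one server is ever in an unknown state.

```python
from typing import Callable, Iterable, List


def convert_serially(
    servers: Iterable[str],
    convert: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Convert servers one at a time and stop at the first unhealthy one.

    Returns the servers that were converted and passed the health check.
    """
    done: List[str] = []
    for server in servers:
        convert(server)             # hypothetical step: move this server from OSPF to BGP
        if not is_healthy(server):  # hypothetical check: e.g. reachability over the new uplink
            break                   # halt so the remaining servers stay untouched
        done.append(server)
    return done


if __name__ == "__main__":
    # Placeholder callbacks for illustration only; nothing here touches a real system.
    pending = ["server-a", "server-b", "server-c"]
    converted = convert_serially(
        pending,
        convert=lambda s: print(f"converting {s}"),
        is_healthy=lambda s: s != "server-b",  # simulate a failed check on server-b
    )
    print(converted)
```

In this simulated run the loop stops after the failed check on `server-b`, so `server-c` is never converted and only `server-a` is reported as done.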

Help

Where to report bugs?

Support vpsFree.cz

Support mail: podpora @ vpsfree.cz

Links

IRC
chat.freenode.net #vpsfree

Mailing lists
https://lists.vpsfree.cz/

Knowledge base
Česky: https://kb.vpsfree.cz/
English: https://kb.vpsfree.org/

Sysadmins contacts

Jakub Skokan
IRC: aither at #vpsfree
Phone: +420 775 386 453

Pavel Šnajdr (main admin)
IRC: snajpa at #vpsfree
Phone: +420 720 107 791