
Office 365 Exchange/OWA down for 4+ hours

office office365 exchange owa outlook cloud

31 replies to this topic

#16 Harrison H.


    Neowinian

  • 591 posts
  • Joined: 21-August 04
  • Location: Florida
  • OS: Windows 8.1
  • Phone: Nokia Lumia 1520

Posted 24 June 2014 - 19:57

You have control over on-premises and you know what happened. On-premises servers also have less downtime (in a sense) because you are responsible and know what you are doing. Who knows if this downtime was because of a silent push that wasn't supposed to cripple the systems but did?

 

Hardware failures usually don't take this long. They are probably the easiest thing to fix, especially in a data center, since you build it to be replaceable. All storage is on a NAS, so if a server goes down you slap a new one in, boot it up, and it gets provisioned. Whereas if a software push goes bad, then you're rolling back, and that takes forever.

 

So what you are saying is that cloud providers have no idea what they are doing? Downtime is downtime, in the cloud or on-premises. In either situation, there is a team that is responsible for getting services back up and running. Microsoft will write up a report in a few days to let us know what happened.

Just because you have control over something (on-premises) doesn't mean fixing it is any faster than a cloud service provider fixing it.




#17 MorganX


    MegaZilla™

  • 4,030 posts
  • Joined: 16-June 04
  • Location: Midwest USA
  • OS: Digital Storm Bolt, Windows 8.1 x64 Pro w/Media Center Pack/Core i7 4790K/16GB DDR3 1600/Samsung 850 Pro 256GB x 2 - Raid 0/Transcend M.2 256GB/ASRock Z97E-ITXac/GTX 970
  • Phone: Samsung Galaxy S5 Active

Posted 24 June 2014 - 20:07

So what you are saying is that cloud providers have no idea what they are doing? Downtime is downtime, in the cloud or on-premises. In either situation, there is a team that is responsible for getting services back up and running. Microsoft will write up a report in a few days to let us know what happened.

Just because you have control over something (on-premises) doesn't mean fixing it is any faster than a cloud service provider fixing it.


I think 4 hours of downtime is rare in this day and age short of a power failure, and we have generators, though I know that is rare. On premises you have much more control over getting things back up in a hurry if it does happen, whereas a cloud service has to restore its whole service. There is no throwing up some VM heads to restore service ASAP, or rolling back to a snapshot. With Failover Clustering, VM replication, etc., if 4 hours of downtime happens it needs to be an act of God or someone is in big trouble. With cloud services, there is no fear on the provider's part about taking down the overwhelming majority of customers. If they leave, they leave.

PS: Reimbursing 4 hours of hosting cost probably can't compare to the actual cost of 4 hours of lost productivity.

#18 The_Observer


    Apples, Bananas, Rhinoceros!

  • 4,001 posts
  • Joined: 12-April 05
  • Location: New Zealand
  • OS: OS X 10.9
  • Phone: iPhone5s

Posted 24 June 2014 - 20:53

Why isn't there a second server (or connection) to take over when one goes down? Or am I missing something?



#19 Roger H.


    Neowinian Senior

  • 13,044 posts
  • Joined: 18-August 01
  • Location: Germany
  • OS: Windows 8.1
  • Phone: Nexus 5

Posted 24 June 2014 - 21:26

I started getting calls this morning, and more came in from other people during the day.

Basically everyone is saying they can't send emails but they can receive. I've sent some out and they seemed fine, but in general everyone seems to be having that issue. I'm mostly worried whether the 30-50 emails sent today actually went out (when they didn't get the usual quick reply is when they figured something was wrong).

Crazy day, as I just did a migration and thought I had screwed up the system. Still, let's hope it's all good.

#20 Sikh


    Neowin Addict!

  • 3,876 posts
  • Joined: 11-March 07
  • Location: localhost
  • OS: Windows 7 / 10.8 / Ubuntu Server
  • Phone: Nexus 5 PA 4.4.2 / iPhone 5

Posted 24 June 2014 - 21:42

I think 4 hours of downtime is rare in this day and age short of a power failure, and we have generators, though I know that is rare. On premises you have much more control over getting things back up in a hurry if it does happen, whereas a cloud service has to restore its whole service. There is no throwing up some VM heads to restore service ASAP, or rolling back to a snapshot. With Failover Clustering, VM replication, etc., if 4 hours of downtime happens it needs to be an act of God or someone is in big trouble. With cloud services, there is no fear on the provider's part about taking down the overwhelming majority of customers. If they leave, they leave.

PS: Reimbursing 4 hours of hosting cost probably can't compare to the actual cost of 4 hours of lost productivity.


Thank you for getting my point.

@Harrison: apparently Microsoft doesn't know what it's doing. They should've had failover or backup systems working. Especially since they sell Azure as a replacement for on-premises, and their data centers cost a significant amount of money, you'd think they would have a failover system in place for hardware or software failures.

I know where I work it's rare for us to have spare stuff on the shelves, but once management suffered their first money-losing failure they bought me backup hardware. If any of our servers fail I have a backup, and if our ESXi host fails I have temp machines that can be booted up to host the VMs. I work for an SMB, so now that I've gotten it through management's head I'll be getting another 1-2 hosts next quarter for failover and load balancing.

It's not impossible to have 99% uptime, especially if you're as big as Microsoft.

#21 -Razorfold


    Neowinian Senior

  • 9,888 posts
  • Joined: 16-March 06
  • OS: Windows 8
  • Phone: Nokia Lumia 900 / Oneplus One

Posted 24 June 2014 - 21:57

@Harrison: apparently Microsoft doesn't know what it's doing. They should've had failover or backup systems working. Especially since they sell Azure as a replacement for on-premises, and their data centers cost a significant amount of money, you'd think they would have a failover system in place for hardware or software failures.


If you read the article that one of the posters linked:
 

"Current Status: Engineers have mitigated impact by rerouting traffic away from the degraded capacity. Mail flow is now improving and customers will begin to see service recovery as messages are being delivered and email queues drain.


They have backups, obviously, and they're using them.
 
 

It's not impossible to have 99% uptime, especially if you're as big as Microsoft.


So 4 hours lost in an entire year, which has 8,760 hours, doesn't count as 99%?

Overall MS has had pretty damn good uptime across all its web services. Is it 100%? No, of course not, but they consistently rate as having one of the highest uptimes in the industry.
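
For reference, here's a quick back-of-the-envelope check of that arithmetic. This is only a sketch, and it assumes a non-leap 365-day year, which is where the 8,760-hour figure comes from (Python):

# Availability implied by 4 hours of downtime in one year.
# Assumes a non-leap year: 365 days * 24 h = 8,760 hours.
hours_per_year = 365 * 24
outage_hours = 4

uptime_percent = (1 - outage_hours / hours_per_year) * 100
print(f"{outage_hours} h down out of {hours_per_year} h = {uptime_percent:.3f}% uptime")
# prints: 4 h down out of 8760 h = 99.954% uptime

So this outage alone, spread over a full year, still leaves the service comfortably above 99% uptime.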

#22 MorganX


    MegaZilla™

  • 4,030 posts
  • Joined: 16-June 04
  • Location: Midwest USA
  • OS: Digital Storm Bolt, Windows 8.1 x64 Pro w/Media Center Pack/Core i7 4790K/16GB DDR3 1600/Samsung 850 Pro 256GB x 2 - Raid 0/Transcend M.2 256GB/ASRock Z97E-ITXac/GTX 970
  • Phone: Samsung Galaxy S5 Active

Posted 24 June 2014 - 22:43

So 4 hours lost in an entire year, which has 8,760 hours, doesn't count as 99%?

Overall MS has had pretty damn good uptime across all its web services. Is it 100%? No, of course not, but they consistently rate as having one of the highest uptimes in the industry.


Even I agree with that. Though 4 hours for a single outage is simply not acceptable for our organization, hence we have generators, etc.

But in general, much thought has to be given to putting an organization's business functions completely in the cloud, especially critical functions, and the risk must be weighed. There is risk wherever the datacenter is, but you must also weigh the loss of control and helplessness when there is an outage. Hosting reimbursement is just a trivial gesture. It's a godsend for SOHOs and anyone that can't afford on-premises infrastructure or staff. If you can afford it, it's not such a clear-cut solution.

#23 Praetor


    ASCii / ANSi Designer

  • 3,514 posts
  • Joined: 05-June 02
  • Location: Lisbon
  • OS: Windows Eight dot One dot One 1!one

Posted 24 June 2014 - 23:08

I have several clients using Office 365 and none of them was affected today. Having said that, Microsoft guarantees 99.9% uptime, which works out to 8.76581277 hours per year, or 8 hours, 45 minutes and 56 seconds. Per year. They don't say O365 can be down for something like 8 hours straight, because that's a very low probability, but it can happen (and still be within the 99.9% uptime).
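
As a sketch of where that 8.76581277-hour figure comes from: it assumes the 0.1% downtime budget is taken over a mean year of about 365.2422 days. This is just a worked check, not something from the post:

# Downtime budget allowed by a 99.9% uptime SLA.
# Assumes the budget is measured over a mean year of ~365.2422 days,
# which reproduces the ~8.7658-hour figure quoted above.
sla_uptime = 0.999
hours_per_year = 365.2422 * 24          # ~8,765.81 hours

budget_h = (1 - sla_uptime) * hours_per_year
hours = int(budget_h)
minutes = int((budget_h - hours) * 60)
seconds = int(((budget_h - hours) * 60 - minutes) * 60)

print(f"99.9% uptime allows ~{budget_h:.2f} h of downtime per year "
      f"({hours} h {minutes} min {seconds} s)")
# prints: 99.9% uptime allows ~8.77 h of downtime per year (8 h 45 min 56 s)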



#24 Praetor


    ASCii / ANSi Designer

  • 3,514 posts
  • Joined: 05-June 02
  • Location: Lisbon
  • OS: Windows Eight dot One dot One 1!one

Posted 24 June 2014 - 23:30

Even I agree with that. Though 4 hours for a single outage is simply not acceptable for our organization, hence we have generators, etc.

But in general, much thought has to be given to putting an organization's business functions completely in the cloud, especially critical functions, and the risk must be weighed. There is risk wherever the datacenter is, but you must also weigh the loss of control and helplessness when there is an outage. Hosting reimbursement is just a trivial gesture. It's a godsend for SOHOs and anyone that can't afford on-premises infrastructure or staff. If you can afford it, it's not such a clear-cut solution.

 

Actually you can implement a hybrid solution with Office 365 and Exchange 2010/13.



#25 Sikh


    Neowin Addict!

  • 3,876 posts
  • Joined: 11-March 07
  • Location: localhost
  • OS: Windows 7 / 10.8 / Ubuntu Server
  • Phone: Nexus 5 PA 4.4.2 / iPhone 5

Posted 25 June 2014 - 00:15

...snip...


I definitely read the article. Even after reading what you quoted, I don't get why engineers had to do anything and why it took 4-5 hours or whatever time. Load balancing, failover, all these services were created to be automatic. Surely Microsoft has the resources and money. So I'm curious wth really happened!

#26 +warwagon


    Only you can prevent forest fires.

  • 27,206 posts
  • Joined: 30-November 01
  • Location: Iowa

Posted 25 June 2014 - 00:20

IME, it's primarily due to Internet Connectivity and not the provider as much. To the enterprise it doesn't matter.

Most on-premises operations running 24/7 don't have 4-hour outages for email or phone systems (Lync). The threat of losing one's job prevents it. 4 hours in a year isn't that much, but a single 4-hour outage is more than enough to be somewhat catastrophic depending on the organization.

The fact that the entity is not in "control" and can't simply make a phone call and say "get it back up in an hour or you'll be looking for a job tomorrow" is just not enough control for most whose livelihood relies on uptime. Being down 1 hour every quarter may be better than a single 4-hour outage in a year, depending on the effect it has on your business.

The cloud definitely has value, but it is not a blanket solution. Much thought has to go into what is viable for "your" organization to move to the cloud.

 

This is why I have ZERO desire to work in a large IT department.



#27 Praetor


    ASCii / ANSi Designer

  • 3,514 posts
  • Joined: 05-June 02
  • Location: Lisbon
  • OS: Windows Eight dot One dot One 1!one

Posted 25 June 2014 - 00:22

I definitely read the article. Even after reading what you quoted, I don't get why engineers had to do anything and why it took 4-5 hours or whatever time. Load balancing, failover, all these services were created to be automatic. Surely Microsoft has the resources and money. So I'm curious wth really happened!

 

Sometimes things go bad; I know of cases where things that shouldn't happen still do happen.



#28 Sikh


    Neowin Addict!

  • 3,876 posts
  • Joined: 11-March 07
  • Location: localhost
  • OS: Windows 7 / 10.8 / Ubuntu Server
  • Phone: Nexus 5 PA 4.4.2 / iPhone 5

Posted 25 June 2014 - 01:36

Sometimes things go bad; I know of cases where things that shouldn't happen still do happen.


I agree, but Tier 3 data centers hold up pretty well.

#29 MorganX


    MegaZilla™

  • 4,030 posts
  • Joined: 16-June 04
  • Location: Midwest USA
  • OS: Digital Storm Bolt, Windows 8.1 x64 Pro w/Media Center Pack/Core i7 4790K/16GB DDR3 1600/Samsung 850 Pro 256GB x 2 - Raid 0/Transcend M.2 256GB/ASRock Z97E-ITXac/GTX 970
  • Phone: Samsung Galaxy S5 Active

Posted 25 June 2014 - 02:09

This is why I have ZERO desire to work in a large IT department.


I'm not so sure it's the actual size of the "IT" department as it is the size of the organization and the criticality of its services.

#30 tsupersonic


    Neowinian Senior

  • 6,836 posts
  • Joined: 30-September 06
  • Location: USA
  • OS: Win. 8.1 Pro. x64/Mac OS X
  • Phone: iPhone 5S/Nexus 5

Posted 25 June 2014 - 02:16

I definitely read the article. Even after reading what you quoted, I don't get why engineers had to do anything and why it took 4-5 hours or whatever time. Load balancing, failover, all these services were created to be automatic. Surely Microsoft has the resources and money. So I'm curious wth really happened!

Sounds familiar. I work in healthcare, and we have redundant servers in two data centers (in close proximity though), and while the technical people are trained and ready to fail servers over to the other data center, it's the managers who lag. They hate making decisions like that. If the problem can be solved in under 2 hours, we wait it out, which just kills me knowing that redundant hardware could be used. Maybe something like that happened, though at a big company like Microsoft that would be surprising. In my eyes, ###### does happen, but for a company like Microsoft, with the resources and money to have redundant data centers and servers around the US/worldwide, four hours of downtime is simply unacceptable.