Office 365 Exchange/OWA down for 4+ hours



I manage a few companies that use Microsoft Office 365 Exchange Online for business grade email services.

 

Today, the majority of my clients have been down in the middle of the work day for more than 4 hours. The only update we've received from Microsoft is that they're working on it.

 

Anyone else with Exchange issues?

 

ZDNet has an article on it.

 

There should be a front-page article on Neowin to bring this to the public's attention and push MS to issue a refund, as they have before when they couldn't meet their 99.9% uptime guarantee.


Well, technically 4 hrs out of a year isn't bad :p

 

However, I'm not down, so it must just be the tenant you are using... there are also some issues with the system, though...

 

post-698-0-87299200-1403635185.png


Yes, it appears you're on a functioning server. The incident affecting Exchange services is EX7211. It's been ongoing for almost 5.5 hours, actually. Right in the middle of the work day!

 

"Current Status: Microsoft has identified an issue in which a portion of capacity responsible for facilitating connectivity to the Exchange Online service has entered into a degraded state. Engineers are actively working on a solution to remediate impact.

 

Current Status: Engineers are continuing to investigate the underlying root cause of degradation to the portion of capacity responsible for facilitating connection requests to the Exchange Online Service."
 

There is no ETA.

 


This has been affecting most of my users, including myself. News outlets are starting to talk about this; I think Neowin should post a front-page article too.

News articles talking about this:

https://news.google.com/news?ncl=duXBZOQhGZmOO7Mt6i0zT9Dxs_6aM&q=office+365+exchange&lr=English&hl=en&sa=X&ei=BMepU5PYNpSzsATJjoCYDg&ved=0CBwQqgIwAA


This will always be a risk of the cloud and of relying on a third party, regardless of who that is. We will be enhancing our on-premises resources with Office 365, not using it as our primary, and this is a big reason why.


Did you really need two posts about this?

I created a second post in a different thread because I wanted this to be in the news section, instead of technical support.


You should be able to contact MS about refunds if the uptime isn't met.  That's always been the case AFAIK.

In the past, when the downtime has been severe, with many media outlets talking about it, Microsoft has offered credits automatically. This is because they financially back their 99.9% uptime guarantee. My intention was to spread the news to get more people talking about this, in hopes of something similar. Once the service is restored and if no credit is issued automatically, I will speak to them.


You should be able to contact MS about refunds if the uptime isn't met.  That's always been the case AFAIK.

Or whoever replaces you can request the refund if you're the person who moved critical business communications to the cloud. lol. j/k j/k.


Or whoever replaces you can request the refund if you're the person who moved critical business communications to the cloud. lol. j/k j/k.

 

I really don't understand the stigma that cloud services have, especially larger ones such as Azure/Office 365, and Amazon. It's not like on premise servers never have downtime either. ###### happens.

 

Edit: I should say that my company is affected by this downtime. Life will go on though.


I really don't understand the stigma that cloud services have, especially larger ones such as Azure/Office 365, and Amazon. It's not like on premise servers never have downtime either. ###### happens.

 

You have control over on premise and you know what happened. On premise servers also have less downtime (in a sense) because you are responsible and know what you are doing. Who knows if this downtime was because of a silent push that wasn't supposed to cripple the systems but did?

 

Hardware failures usually don't take this long. Hardware failures are probably the easiest to fix, especially in a data center, since you build it to be replaceable. All storage is NAS, so if a server goes down you slap a new one in, boot it up, and it gets provisioned. Whereas if a software push goes bad, you're rolling back, and that takes forever.


I really don't understand the stigma that cloud services have, especially larger ones such as Azure/Office 365, and Amazon. It's not like on premise servers never have downtime either. ###### happens.

IME, it's primarily due to Internet Connectivity and not the provider as much. To the enterprise it doesn't matter.

Most on-premises in 24/7 operations don't have 4 hour outages for Email or phone systems (Lync). The threat of losing one's job prevents it. 4 hours in a year isn't that much, but a 4 hour outage is more than enough to be somewhat catastrophic depending on the organization.

The fact that the entity is not in "control" and can't simply make a phone call and say "get it back up in an hour or you'll be looking for a job tomorrow" is a deal-breaker for most whose livelihood relies on uptime. Being down 1 hour every quarter may be better than a single 4 hour outage in a year, depending on the effect it has on your business.

The cloud definitely has value, but it is not a blanket solution. Much thought has to go into what is viable for "your" organization to move to the cloud.


You have control over on premise and you know what happened. On premise servers also have less downtime (in a sense) because you are responsible and know what you are doing. Who knows if this downtime was because of a silent push that wasn't supposed to cripple the systems but did?

 

Hardware failures usually don't take this long. Hardware failures are probably the easiest to fix, especially in a data center, since you build it to be replaceable. All storage is NAS, so if a server goes down you slap a new one in, boot it up, and it gets provisioned. Whereas if a software push goes bad, you're rolling back, and that takes forever.

 

So what you are saying is that cloud providers have no idea what they are doing? Downtime is downtime, in the cloud or on premise. In either situation, there is a team that is responsible for getting services back up and running. Microsoft will write up a report in a few days to let us know what happened.

 

Just because you have control over something (on premise), doesn't mean fixing it is any faster than a cloud service provider fixing something.


So what you are saying is that cloud providers have no idea what they are doing? Downtime is downtime, in the cloud or on premise. In either situation, there is a team that is responsible for getting services back up and running. Microsoft will write up a report in a few days to let us know what happened.

 

Just because you have control over something (on premise), doesn't mean fixing it is any faster than a cloud service provider fixing something.

I think 4 hours of downtime is rare in this day and age short of a power failure, and we have generators, though I know that is rare. On premises you have much more control over getting things back up in a hurry if it does happen, whereas a cloud service has to restore their whole service. There is no throwing up some VM heads and restoring services ASAP, or rolling back to a snapshot. With Failover Clustering, VM replication, etc., if 4 hrs. of downtime happens it needs to be an act of God or someone is in big trouble. With cloud services, there is no fear on the part of the provider with regard to taking down the overwhelming majority of customers. If they leave, they leave.

PS: Reimbursing 4 hours of hosting cost probably can't compare to the actual cost of 4 hours of lost productivity.


Why isn't there a second server (or servers) or connection to take over when one goes down? Or am I missing something?


I started getting calls this morning, and more from other people during the day.

Basically everyone is saying they can't send emails but they can receive. I've sent some out and they seemed fine, but in general everyone seems to be having that issue. I'm mostly worried about whether the 30-50 emails sent today actually went out (it was only when they didn't get the usual quick reply that they figured something was wrong).

Crazy day, as I just did a migration and thought I screwed up the system. Still, though, let's hope it's all good.


I think 4 hours of downtime is rare in this day and age short of a power failure, and we have generators, though I know that is rare. On premises you have much more control over getting things back up in a hurry if it does happen, whereas a cloud service has to restore their whole service. There is no throwing up some VM heads and restoring services ASAP, or rolling back to a snapshot. With Failover Clustering, VM replication, etc., if 4 hrs. of downtime happens it needs to be an act of God or someone is in big trouble. With cloud services, there is no fear on the part of the provider with regard to taking down the overwhelming majority of customers. If they leave, they leave.

PS: Reimbursing 4 hours of hosting cost probably can't compare to the actual cost of 4 hours of lost productivity.

Thank you for getting my point.

@Harrison: apparently Microsoft doesn't know what it's doing. They should've had failover working, or backup systems. Especially since they sell Azure as a replacement for on premise and their data centers cost a significant amount of money, you'd think they would have a failover system in case of hardware or software failures.

I know where I work it's rare for us to have stuff on the shelves, but once management suffered their first money-losing failure they bought me backup hardware. If any of our servers fail I have a backup, and if our ESXi host fails I have temp machines that can be booted up to host the VMs. I work for an SMB, so now that I've gotten it through management's head I'll be getting another 1-2 hosts next quarter for failover and load balancing.

It's not impossible to have 99% uptime, especially if you're as big as Microsoft.


@Harrison: apparently Microsoft doesn't know what it's doing. They should've had failover working, or backup systems. Especially since they sell Azure as a replacement for on premise and their data centers cost a significant amount of money, you'd think they would have a failover system in case of hardware or software failures.

If you read the article that one of the posters linked:

 

"Current Status: Engineers have mitigated impact by rerouting traffic away from the degraded capacity. Mail flow is now improving and customers will begin to see service recovery as messages are being delivered and email queues drain."

They have backups, obviously, and they're using them.

 

 

It's not impossible to have 99% uptime, especially if you're as big as Microsoft.

So 4 hours lost in an entire year which has 8760 hours doesn't mean 99%?

Overall, MS has had pretty damn good uptime across all its web services. Is it 100%? No, of course not, but they consistently rate as having one of the highest uptimes in the industry.
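For anyone who wants to check the numbers being thrown around, here is a minimal sketch of the arithmetic. It assumes a non-leap year of 8760 hours and treats this single outage as the only downtime in the year; the function name is made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, the figure used above (non-leap year)


def availability_percent(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Uptime percentage over a period, given total hours of downtime."""
    return 100.0 * (1.0 - downtime_hours / period_hours)


# 4 hours is the figure in the thread title; 5.5 hours is the duration
# reported for the incident earlier in the thread.
for outage_hours in (4.0, 5.5):
    pct = availability_percent(outage_hours)
    print(f"{outage_hours} h of downtime in a year -> {pct:.3f}% uptime")
```

On those assumptions, either outage length on its own still leaves the yearly figure above 99.9%, let alone 99%.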


So 4 hours lost in an entire year which has 8760 hours doesn't mean 99%?

Overall, MS has had pretty damn good uptime across all its web services. Is it 100%? No, of course not, but they consistently rate as having one of the highest uptimes in the industry.

Even I agree with that. Though, 4 hours for a single outage is simply not acceptable for our organization, hence we have generators, etc.

But in general, much thought has to be given to putting an organization's business functions completely in the cloud, especially critical functions, and the risk must be weighed. There is risk wherever the datacenter is, but you must also weigh the loss of control and the helplessness when there is an outage. Hosting reimbursement is just a trivial gesture. The cloud is a godsend for SOHOs and anyone who can't afford on-premises infrastructure or staff. If you can afford it, it's not such a clear-cut solution.


I have several clients using Office 365 and none of them was affected today. Having said that, Microsoft guarantees 99.9% uptime, which allows 8.76581277 hours of downtime per year, or 8 hours, 45 minutes and 56 seconds. Per year. They don't say O365 can be down for something like 8 hours straight, because that's a very low probability, but it can happen (and still be within the 99.9% uptime guarantee).
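A rough sketch of how that downtime budget falls out of the percentage. The 8765.81277-hour year is simply what the 8.76581277 figure implies; the 8760-hour year and the 30-day (720-hour) window are assumptions added for comparison, since the thread doesn't say which period the guarantee is actually measured over. The function name is hypothetical.

```python
def downtime_budget(uptime_percent, period_hours):
    """Allowed downtime as (hours, minutes, seconds), truncated to whole seconds."""
    allowed_seconds = (100 - uptime_percent) / 100 * period_hours * 3600
    # Round away floating-point noise first, then truncate to whole seconds.
    total_seconds = int(round(allowed_seconds, 6))
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return hours, minutes, seconds


periods = {
    "year implied by the post (8765.81277 h)": 8765.81277,
    "365-day year (8760 h), assumed": 8760,
    "30-day window (720 h), assumed": 720,
}

for label, period_hours in periods.items():
    h, m, s = downtime_budget(99.9, period_hours)
    print(f"{label}: {h} h {m} min {s} s of downtime allowed at 99.9%")
```

The point of the last row is that the budget shrinks a lot if the guarantee is measured over a shorter window than a year, so a single multi-hour outage could matter more than the yearly figure suggests.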


Even I agree with that. Though, 4 hours for a single outage is simply not acceptable for our organization, hence we have generators, etc.

But in general, much thought has to be given to putting an organization's business functions completely in the cloud, especially critical functions, and the risk must be weighed. There is risk wherever the datacenter is, but you must also weigh the loss of control and the helplessness when there is an outage. Hosting reimbursement is just a trivial gesture. The cloud is a godsend for SOHOs and anyone who can't afford on-premises infrastructure or staff. If you can afford it, it's not such a clear-cut solution.

 

Actually, you can implement a hybrid solution with Office 365 and Exchange 2010/13.


...snip...

I definitely read the article. Even after reading what you quoted, I don't get why engineers had to do anything or why it took 4-5 hours, or however long it was. Load balancing, failover, all these services were created to be automatic. Surely Microsoft had the resources and money. So I'm curious wth really happened!


This topic is now closed to further replies.