The NEW Build Your Own Arcade Controls
Main => Forum/Website Discussion => Topic started by: saint on June 07, 2016, 11:31:27 am
-
Short version:
We have a hard drive in the RAID on the server that was failing. It's always the simplest things in the end, but at first it really looked like something had gone haywire with journaling on the drives, or MySQL going bonkers, or something similar.
We've since taken the bad drive out of the RAID, and spent the last 24 hours or so backing up several hundred gigs of data off-server as a precaution. I've had the forum offline during this time.
We still have more to look at, and need to get the bad drive replaced, so there will be additional downtime in the near future.
Also, while we suspect the RAID was the alpha and omega of the problems, only time will tell if other issues need attention. The server seems responsive enough right now, but I'm still seeing occasional spikes of high CPU usage with corresponding server lag, so I'm not holding my breath just yet.
More updates when I have them. Thanks!
--- saint
-
Trivia:
The forum alone takes up 36 gigs of drive space, not including the SQL database back end.
There are 236,849 attachments stored in the forum as of this writing.
-
A failing hard drive is just about as embarrassing as "problem went away after hours of troubleshooting when I rebooted."
;)
-
Good work Saint. We appreciate you man.
-
Seems snappy now!
-
Thanks for getting things back up and running. I don't post much here these days, but I do visit frequently.
More than willing to pop a few bucks for a new drive if you need it. LMK where to "donate" money :laugh2:
-
Fingers crossed you have the problem identified. Thanks for all you do Saint. :applaud:
-
Thanks guys :) The credit is all sirwoogie's for figuring out what was going on and doing the heavy lifting.
Appreciate the offer of a donation :) BYOAC is my way of paying it forward and the costs of keeping it running are within my means. Back in the day we took donations, but these days if you're so inclined, please find a charity that appeals to you and donate to them instead. Thanks for the offer :)
-
Thanks Saint and sirwoogie! :cheers:
-
They did *not* get the hard drive replaced today. Sigh... So, some time Wednesday we will be going down again to hopefully get the drive replaced.
-
This forum ranks as my favorite and most visited, so, saint, I wanted to take the time to thank you, sirwoogie, and any others involved for all your hard work. I hope it hasn't been too stressful.
-
This forum ranks as my favorite and most visited, so, saint, I wanted to take the time to thank you, sirwoogie, and any others involved for all your hard work. I hope it hasn't been too stressful.
I second that. Indeed, it's the only forum I visit...
-
Thanks, nice to hear :)
Status: The failing drive has been replaced. Next step is to re-establish the mirror and wait for the RAID to rebuild.
The bad news is that the first drive is apparently also showing signs of failure, though it's not nearly as bad as the second was (144 errors vs. 16,000 errors).
So - once we have the mirror rebuilt and the RAID re-established, we're going to do this all again and replace the first drive.
We'll keep the server up until then though :)
Thanks for your patience everyone!
-
Man, when your redundancies are failing it's pretty bad. We must be hammering this poor forum.
-
This particular incarnation of the box is pushing 4 years old I think. We were rather startled when we looked at it a few weeks ago to see that its uptime was over 3 years without a reboot.
-
Glad to hear about progress! :applaud:
DeL
-
Man, when your redundancies are failing it's pretty bad. We must be hammering this poor forum.
It's not uncommon for drives from the same manufacturing run to fail at the same time or within a short period of each other, particularly those within a RAID configuration where access patterns are similar. I've had this happen more than once in my [previous] career managing server farms.
-
Yeah, anytime I have a RAID failure on a single drive I start eyeballing the other drives.
We are rebuilding the RAID now. So far the forum still seems responsive, but if you notice slowdowns over the next several to many hours today, that's likely the cause.
-
:cheers:
I had pretty much given up. Glad things are looking better for BYOAC.
-
Thanks to you guys for all your efforts and hosting.
-
Internal Server Error last night... Related to the other drive? maintenance?
-
Internal Server Error last night... Related to the other drive? maintenance?
Nothing new last night.
There were the usual two daily down-times.
1.) 1 A.M. Eastern -- Down for about 1-2 minutes
2.) Shortly after 4 A.M. Eastern -- Down for about 30 minutes
The site followed that pattern before the drive went bad and is back to following it again.
Scott
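For a rough sense of scale, the two daily outages listed above can be turned into an availability figure. This is only a back-of-the-envelope sketch in Python: the durations are the approximate ones quoted in this post, and the function name `daily_availability` is made up for illustration.

```python
MINUTES_PER_DAY = 24 * 60

def daily_availability(outage_minutes):
    """Return the fraction of a day the site is up, given outage lengths in minutes."""
    return 1 - sum(outage_minutes) / MINUTES_PER_DAY

# Approximate durations from the post: a 1-2 minute outage plus a ~30 minute outage.
best = daily_availability([1, 30])
worst = daily_availability([2, 30])
print(f"Daily availability: {worst:.1%} to {best:.1%}")
```

Either way it works out to a bit under 98% uptime per day, almost all of the loss coming from the 4 A.M. outage.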
-
Today I'm doing database maintenance, in case you see slowdowns. I have various jobs running backups overnight; once I get done with this bit of maintenance I'll look at them to see if there's anything I can do about the nightly outage.
-
Yeah, because I'm apparently a vampire or something I'm rarely asleep when it goes down. It had been doing that for quite some time before the server issues, but I thought it was nightly maintenance so I never mentioned it.
-
Don't forget your overseas members - overnight isn't the same everywhere!
-
Have you considered moving this site over to AWS? Way cheaper and easier than trying to maintain hardware someplace.
You could even do a hybrid solution. Host the website and forum on your own server (or even a VPS somewhere) and then host the database in AWS RDS and all the attachments in S3. Saint if you were ever interested in exploring solutions like this, let me know. I could help you set some things up.
-
I haven't, but I'll take a look. Other than this drive failure (which, on a server in constant action for 4 years, isn't surprising), service and speeds have been good with this hosting solution I think. I'll compare costs.