Throughout my career I've gone back and forth between jobs where I wrote code behind the safety net of a QA department and jobs where I was the only gateway between my code and the wild world of users. At the end of last year I quit my job and started working on RescueTime full-time, once again switching from the world of QA-ed code to what amounts to the seat of my pants. While the details of this jump are fresh in my head, I figured I'd share some of the things I've learned that should help when you are the only one you can rely on to protect your users from bugs, crashes, and downtime.
- When you think you're ready to release a significant batch of new features, wait a couple of days before you deploy to production. The temptation for me is always to release immediately (OMG our users are going to love this! They need it now!), but I've found that by waiting a day or two I always come up with some scenarios, use cases, boundary cases, and bugs that I hadn't thought of. It isn't until you start to get bored with the new features you've written that you really understand what their limits are. It doesn't cost much to take the extra time to let the dust settle in your mind before pushing something out that's not quite ready. Otherwise, your users will almost always find what's wrong immediately.
- Deploy to a staging environment first. This seems obvious, but when it's just you, it's very tempting to feel like your code is golden and just push it right from your laptop to production. Make sure your staging environment runs on its own database (preferably a reasonably recent copy of production) and test out any database migrations on the staging database. I often make last-minute changes to the code on my machine and think they look good enough that I don't have to test them again. Staging is your sanity check. Use it to your advantage, because nothing drives you more insane than struggling to fix a botched deploy on the live site.
- Write tests. It's difficult to justify a huge number of tests in the early phases of a new project (are people even going to use this?), but for features that have been especially tricky to nail down, or that are simply essential to your product working correctly, it really helps to have tests that you can use as a baseline when you need to refactor. RescueTime just went through a significant architecture change and I could've used a few more tests to make sure we covered our bases.
- Have someone else verify that everything works. Get your co-founders to use it for a day or, if you have to, contact one of your power users and invite them to use the staging server. You need to have someone other than you use it before launching. Something will inevitably break because someone did something you just hadn't thought of doing.
- Branch, tag, then deploy. Don't deploy from trunk. Srsly. You're going to have issues if you do. Take this scenario: You spend a week coding up features for a new release. The day before the deploy, a bug turns up in the existing code that needs to be fixed immediately. In a rush, you submit a fix to a source file from your trunk enlistment on your machine and deploy it to production. Whoops! Now you have pushed code from features that haven't been released yet. Not only that, but you've only been using trunk, so you have to go back and back-merge all of your changes. I branch every release that will be deployed to production, and only deploy tags from that release branch. This makes it easy to roll back, and forces you to have separate code enlistments for each release of the code you're working on. It also forces you to be disciplined with your checkins, since each production deploy requires a new tag. Here's a link to a discussion on how to add this functionality to your Capistrano deploy configuration.
- Don't push new bits at peak usage. Find a time, usually late at night (7-9PM PST is usually a good bet) when there aren't a huge number of users on your site. That way, if you need an outage (or worse - you have a big problem), it won't affect too many users.
- If you do have a problem - don't panic! Users are grinding away on a gnarly bug on your production server? Oh well, you tried as hard as you could (presuming you followed the previous advice) to release smoothly. Your code is still "beta" after all. Put up an outage page and calmly debug and fix the problem. If you can't in a reasonable amount of time, roll back and fix it tomorrow. The worst thing to do is to panic and turn a small issue into a larger one (like accidentally dropping your DB for example).
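A few of the rules above (deploy only tags, get staging verified by someone else, avoid peak hours) are mechanical enough that a deploy script could check them for you. Here is a minimal Python sketch of that idea. It is illustrative only and not what RescueTime actually runs: the `release-X.Y.Z` tag naming scheme, the function names, and the assumption that the server clock matches your users' timezone are all made up for the example.

```python
import re
from datetime import datetime, time
from typing import List

# Assumed naming convention for release tags, e.g. "release-1.4.2".
RELEASE_TAG = re.compile(r"^release-\d+\.\d+\.\d+$")

# Off-peak window from the post (7-9PM); the server clock is assumed
# to be set to the timezone most of your users are in.
WINDOW_START = time(19, 0)
WINDOW_END = time(21, 0)


def is_release_tag(ref: str) -> bool:
    """True if ref looks like a release tag rather than trunk or a branch."""
    return RELEASE_TAG.match(ref) is not None


def in_deploy_window(now: datetime) -> bool:
    """True if now falls inside the off-peak deploy window."""
    return WINDOW_START <= now.time() < WINDOW_END


def deploy_blockers(ref: str, now: datetime, staging_verified: bool) -> List[str]:
    """Return the reasons not to deploy; an empty list means go ahead."""
    blockers = []
    if not is_release_tag(ref):
        blockers.append(f"'{ref}' is not a release tag; don't deploy from trunk")
    if not staging_verified:
        blockers.append("nobody else has verified this build on staging")
    if not in_deploy_window(now):
        blockers.append("outside the off-peak deploy window")
    return blockers
```

Calling `deploy_blockers("trunk", datetime.now(), staging_verified=False)` returns a non-empty list of reasons, which a deploy script could print before refusing to continue. For a genuinely urgent fix you would still want a way to override the time window, but not the tag check.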
That's all the advice I have for now. This week we'll be launching a new release of RescueTime with some nifty (and complex) new features. I'll let you know how it goes, and will amend this list accordingly. :)
Location: Mountain View


Comments
http://www.joelonsoftware.com/artic
Also, this should be copied onto a RescueTime corporate blog somewhere. It's interesting, and you spent time on it, and so it should be used to get some readers to learn about your company.
When you get big enough, you'll find that 3-5 a.m. Eastern is about the only truly off-peak period left for a US-centered company, and there is _no_ good window available for a worldwide company. :( Somebody will solve this problem well one day and they will make a mint. No, rolling bounces don't count, because they can't handle schema changes in a properly tested manner.