How We Migrated Checkly From Heroku to AWS

Several years ago (in 2016, to be exact) we decided to host the Checkly platform on Heroku, since it was the hosted cloud offering that best aligned with our needs at the time. In particular, we felt that Heroku had the best managed PostgreSQL offering available.

But just as individuals grow and change over their lifetimes, so did Checkly: around 2022 we reached the point where we needed another hosting solution, and we decided to move from Heroku to AWS.

We thought it would be interesting to share some of our learnings from this process, so we decided to write this blog post about our experience.

Hosting on Heroku

As mentioned previously, we had been hosting Checkly on Heroku since 2016. We started with a 300GB database instance with four cores, plus a high-availability standby in case our primary instance went down. In terms of pricing, we started on Heroku’s “Standard 0” plan and eventually expanded to their “Premium 4” plan. (For more details on pricing, see the Heroku pricing page.)

While we were largely happy with our experience with Heroku, there were some Heroku-specific issues that ultimately pushed us to switch to another hosting platform.

In general, one of the biggest reasons behind the switch was that Heroku started to become a burden for our tech stack. More specifically, Heroku’s platform made it difficult for us to upgrade to the latest version of PostgreSQL. We were running PostgreSQL 10 and wanted to upgrade to PostgreSQL 13, but Heroku told us they were going to discontinue support for PostgreSQL 10 by the end of November 2022. Given that PostgreSQL is a vital component of our tech stack, that obviously wasn’t going to work for us!

In addition, Postgres would sometimes go down briefly and then restart, and Heroku relied on forced maintenance windows, which made it difficult for us to get support when we needed it. More specifically, here are some of the challenges we faced with our Heroku environment:

  • Heroku PSQL supported a smaller set of PSQL extensions (you can check what a host offers with the query shown after this list)
  • There was no easy way to upgrade to a newer version of PSQL
  • Heroku PSQL plans had fixed-size disk space offerings, and moving to a bigger plan required a complicated migration
  • Many essential tasks required an experienced PSQL engineer to tackle them
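
As a quick illustration of the first point: PostgreSQL ships a catalog view listing the extensions a given host makes available, which is a handy way to compare providers before committing to one:

  -- Lists which extensions this host allows, and which are already installed
  SELECT name, default_version, installed_version
  FROM pg_available_extensions
  ORDER BY name;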

Growing Pains: Moving to AWS

For all the reasons outlined above, we decided to move from Heroku to AWS. Moving to AWS promised to solve some of our key pain points: the ability to seamlessly upgrade to newer versions of PostgreSQL as needed, and more flexibility around maintenance windows, allowing us to set our own maintenance schedule based on our specific needs. After doing some research, we also realized that moving to AWS would reduce the need for frequent involvement from senior engineers, as much of the required work could be handled by more junior engineers.

As we started to put together our plans for the transition, we sketched out a four-week timeline for the project. The entire process ended up taking about five weeks to complete, in part because our planning process was so extensive. Here’s an outline of the steps we mapped out:

  1. Capacity planning and RDS offering choice
  2. Choose the data replication method
  3. Dry run migration of production
  4. Collect data, improve, document, and go back to step 3 until things are within expectations
  5. Migrate the dev and staging environments
  6. Migrate production environment

From Heroku to AWS: Moving the Data

Our first, and largest, technical challenge was getting our data out of Heroku and into AWS. Handling large-scale data migration is always scary! Even with backups, there’s always a concern about moving data from one location to another. We were also stepping into a new domain: migrating data from Heroku to AWS was a task that none of us had ever done before.

Heroku had some default data exporting options, but after reviewing them we decided that none of the standard options were viable for our specific use case, mainly because they would cause unacceptable amounts of downtime, and we had 300GB of data to migrate! We ended up doing some research on non-standard export options from Heroku, and found an excellent article on Stack Overflow that outlined how you can work with Heroku to get additional, non-standard data export options that would make our task easier.

In our specific case, we had some issues with converting data types from our PostgreSQL instance to the Amazon Relational Database Service (RDS). Heroku support suggested that we create an intermediate EC2 instance running a PostgreSQL daemon, then move the data from there to RDS.
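
The intermediate instance was essentially a standby that replayed Heroku’s WAL archive via wal-e (you’ll see this again as step 4 of the playbook below). As a rough sketch, assuming PostgreSQL 10 with wal-e installed and its AWS credentials already configured, the recovery setup on the EC2 instance looks something like this:

  # recovery.conf on the intermediate EC2 instance (illustrative sketch;
  # the base backup is restored beforehand with wal-e backup-fetch)
  standby_mode = 'on'
  restore_command = 'wal-e wal-fetch "%f" "%p"'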

We then explored three different approaches for getting our data from Heroku to RDS.

  1. The first approach we investigated was the AWS Database Migration Service. We tried this approach, but the process wasn’t seamless, and there were some errors we couldn’t overcome.
  2. Our second method was to use a PostgreSQL extension called pglogical. We encountered some admin complexity here, and also had some issues with schema changes, so we decided to try a third option, which was…
  3. …PostgreSQL’s built-in logical replication. Up to that point we had been using physical Write-Ahead Log (WAL) replication, but RDS doesn’t accept inbound WAL replication from an external primary. Logical replication does work with RDS, although it doesn’t replicate database schema changes or table sequences, which we had to handle separately (see the sketch below).
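
With logical replication, existing rows are copied first and changes then stream continuously, which is what makes a short cutover window possible. As a minimal sketch of what this looks like (all names, hosts, and credentials below are placeholders, not our actual setup), you create a publication on the source database and a subscription on the target:

  -- On the source (the intermediate EC2 instance, in our case):
  CREATE PUBLICATION migration_pub FOR ALL TABLES;

  -- On the target (RDS); the connection string is illustrative:
  CREATE SUBSCRIPTION migration_sub
    CONNECTION 'host=ec2-host dbname=checkly user=replicator password=...'
    PUBLICATION migration_pub;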

Discussions about downtime

We knew that downtime was inevitable, so we focused on how to minimize the downtime as much as possible. We looked at approaching our service downtime in two different ways: One with less downtime and more risk, and the other with more downtime and less risk.

We weighed both options and, after plenty of deliberation, opted for the approach with greater downtime but enhanced safety. We didn’t want any surprises with the migration, so we allocated a 30-minute period of downtime, although we only ended up using about 10 minutes of the window.

Our planning and deliberation around the amount of downtime was extensive. For almost any other type of company, the obvious answer would be to schedule the migration over the weekend. But as a company focused on monitoring services for uptime and reliability, did we really want to take the service offline on the weekend?

Here’s the rationale: you do the migration on a Saturday, the platform is down for 40 minutes, and somebody doesn’t get an essential alert, with nobody in the office to realize anything went wrong. Being in the monitoring domain forced some interesting decisions on us.

After that discussion, we decided the best course of action would be to treat it like an extended maintenance window, so the migration was scheduled for September 12th, 2022 at 7:00 AM UTC.

Announcing the Migration

We wanted to inform our customers of the upcoming extended maintenance window, so we sent an email that concisely explained what was happening.

Internally, we practiced for the migration: we tested the exercise three times against production to make sure that our data and timing were correct. We created and followed a playbook spelling out every step we would take during the migration. Thanks to some design decisions we took in our infrastructure early on, we were also able to pre-migrate some of the data, which helped us accomplish the switchover in less than the scheduled time.

Migration Day!

Like any big migration, part of the work was to define the playbook for the team. Ours looked like this (omitting the migration scripts and AWS config files):

  1. Announce schema change freeze on Friday
  2. pg_dump Heroku for schemas and push it to RDS on Friday
  3. Disable preboot on Heroku services
  4. Start EC2 PSQL with wal-e recovery
  5. Wait until EC2 PSQL catches up with the latest WAL
  6. Stop Heroku Schedulers
  7. MAINTENANCE WINDOW STARTS
  8. Scale down all daemons to 0
  9. Move all connections to read-only: ALTER DATABASE <dbname> SET default_transaction_read_only = on
  10. Promote EC2 PSQL pg_ctl promote -D /database/
  11. Give EC2 PSQL replication rights ALTER ROLE <username> WITH REPLICATION
  12. Set statement_timeout to 0 on EC2 and RDS
  13. Start replication for the tables from EC2 to RDS
  14. Wait until RDS catches up
  15. Sync table sequences with api/script/sync-sequences.js (a SQL sketch of this step follows the list)
  16. Drop the replication subscriptions
  17. Update the database URL in the services
  18. Scale up all services again
  19. Enable preboot for checkly-api and checkly-cron-daemon again
  20. Monitor the RDS database and validate new data coming in
  21. Monitor the services and make sure the services are healthy
  22. Restart Heroku Schedulers
  23. MAINTENANCE WINDOW ENDS
  24. After everything is over, kill EC2 and enjoy your day.
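
Step 15 matters because logical replication doesn’t carry sequence state, so freshly replicated tables would otherwise hand out already-used ids. Our sync-sequences.js script handles this internally; a minimal SQL sketch of the same idea (the table and column names here are hypothetical) looks like this:

  -- Bump each serial id sequence to the current maximum on the new primary
  SELECT setval(
    pg_get_serial_sequence('checks', 'id'),
    COALESCE((SELECT MAX(id) FROM checks), 1)
  );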

While it’s lovely to envision an ‘infrastructure as code’ world where even migrations are just scripts and config that require no human checklists, in the real world migrating from one cloud to another will always require a certain amount of human planning and human intervention. This is especially the case when moving up in cloud scale. Sadly, no one at AWS is spending all their time writing ‘lift and shift’ guides for migrating complex architecture from Heroku, though smaller cloud providers have certainly done that work to help new customers leave AWS.

Mission Accomplished

During the migration, everything went according to plan. We had no hiccups and no glitches; everything went smoothly, and we didn’t have to perform any last-minute operations or fire drills.

What we learned from our Heroku to AWS migration

After successfully completing our migration project, we have some learnings that are hopefully useful to others attempting the same kind of project. These include:

  • Use partitioned tables so you can pre-migrate some of the data. If your immutable data is partitioned, you can migrate it before your maintenance window even starts (see the sketch after this list).
  • Make sure that your timeouts are set correctly
  • It’s better to overspend on the size of your RDS instance and downgrade later if you don’t need that much capacity. It’s much less painful to have more than you need than not enough!
  • Do fire drills - make sure you have an ordered list playbook
  • Test, test, and retest to make sure everything works
  • Downtime is inevitable! Just minimize it as much as you can.
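
To make the first point concrete, here is a minimal sketch of time-based partitioning, with hypothetical table and column names. Once a partition stops receiving writes it is effectively immutable, so it can be dumped and restored to the new database days before the cutover:

  -- Hypothetical example: partition check results by month
  CREATE TABLE check_results (
    id         bigint      NOT NULL,
    check_id   uuid        NOT NULL,
    created_at timestamptz NOT NULL
  ) PARTITION BY RANGE (created_at);

  -- Once September starts, the August partition no longer changes
  -- and can be migrated ahead of the maintenance window
  CREATE TABLE check_results_2022_08 PARTITION OF check_results
    FOR VALUES FROM ('2022-08-01') TO ('2022-09-01');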

Conclusions

Migrating from one cloud provider to another is a significant undertaking, and for CTOs and operations teams considering such a move, there are several key takeaways from our migration from Heroku to AWS:

  1. Evaluate the “why”: Ensure the decision to migrate is backed by clear, well-understood needs, such as scalability, performance improvements, or cost efficiency. In our case, the inability to upgrade PostgreSQL easily and the lack of flexibility in Heroku's infrastructure were major factors. Come back to this “why” at least once in your process before your decision is final, and don’t let a drive to ‘get it done’ stop you from evaluating if the plan still provides the promised benefit.
  2. Plan Meticulously: A successful migration hinges on detailed planning. From understanding data replication methods to scheduling downtime and creating robust playbooks, every step needs careful consideration to avoid surprises.
  3. Communicate Clearly: Inform stakeholders—including customers—well in advance about potential disruptions. Transparency builds trust, and a well-prepared customer base is more understanding during maintenance windows.
  4. Test Extensively: Conduct multiple dry runs to iron out kinks and simulate the migration as closely as possible to identify potential pitfalls. This approach minimizes risks on the actual migration day.
  5. Balance Risk and Downtime: Consider the trade-offs between shorter downtime with higher risk and longer downtime with lower risk. For us, prioritizing safety with a slightly longer downtime was the right decision.
  6. Leverage Learnings: Post-migration, take stock of what went well and what could have been better. In our case, the value of partitioned tables, well-sized RDS instances, and clearly defined playbooks stood out.

While migrations can seem daunting, with the right preparation and execution, they can unlock the scalability and flexibility needed to support your organization's growth. By sharing our journey, we hope to provide others with a roadmap to approach their own cloud migrations confidently. Remember, migrations are not just technical exercises—they’re strategic moves that require alignment across the technical and business sides of your organization.
