my first postmortem: an amateur with google apps


I am a Software Engineering Student at Holberton School, San Francisco.

This story is about the first major incident in one of my web applications, chicagoresourcehub.com: an outage long enough to affect the user experience. It happened around 1.5 years ago, when I was just taking my first big step into the tech world. I had been learning to code on small personal projects before this, but the application in this story was for a paying client. To be honest, it was my first and only paying client, which shows how much of a beginner I was, and also how emotionally intense any technical error or service lapse felt.

The incident occurred in the middle of a transition between services. I was switching from Caspio's web application (https://www.caspio.com/), which provided a database management tool for integrating data into user-friendly, searchable presentations, to Google Apps and Google's APIs (https://developers.google.com/). The major change was that with Google (the primary tool I used was Google Fusion Tables), I would have a wonderful database like Caspio's, but I would have to code the entire process of presenting and working with the data myself. That integration was my first major attempt at building a software application with real code, beyond HTML, CSS, and WordPress PHP hacks. I wrote it in JavaScript, building on an open source JavaScript library for Fusion Tables.

The Postmortem

  • Issue Summary:

    • Duration of the outage: ~8 hours, 6:00am CST - 2:00pm CST
    • Impact: 100% of users were affected. The entire search application and integration on chicagoresourcehub.com was down for the whole period, so users could not search, which is what the application was built for.
    • Cause: I had disabled an option, which locked the Google API and every other user out of selecting (i.e. querying) the data and working with it, including presenting it to the user who made the request.
  • Timeline:

    • At approximately 1:00pm CST, a customer noticed the service was down and reported it through the contact form on the website.
    • When I received the notice, I began helping the user troubleshoot to see whether the problem was on the user's side. In the past, Caspio's web application would go down for brief periods during the day; Caspio's service was far more limited than Google's and would go down from overuse. So I did not immediately assume that the problem was something I had created and could fix.
    • This assumption misled the initial investigation and was part of my own contribution to the problem. I had responded as if the customer were to blame for not understanding the application, or as if a connected service were down. My mistaken solution, therefore, was to teach the user how to use the application. With more experience, I now realize that even if the customer had not understood the application, that would still be my problem, because it would mean I had not built an application that users can easily understand. I have also realized how many mistakes I can make as a developer, and I have learned to be much more open to finding the problems that I have created or not handled properly in my development.
    • The incident did not escalate beyond me and the customers involved.
    • I eventually resolved the issue when I realized that I had disabled an option in the Google Fusion Tables settings that simply needed to be re-enabled.
  • Root cause and resolution:

    • Fortunately, although this issue had a devastating impact by shutting my application down, it was extremely easy to resolve. Specifically, the issue was in the Google Fusion Table itself, which had become my new database management tool. In that service, I had disabled the checkbox option "Reuse access: ■ Allow downloads" so that nobody would improperly use the thousands of records that I had been maintaining. The effect was that no party other than me could download the data. I did not recognize this as a problem because I, and anyone else on the internet, could still view the data in the Google Fusion Table itself, since it was an open database with no viewing restrictions; visitors just could not download the data through Google's services. It turned out that the Google API I was using, and the server that issued the query for the data, were also denied access to the database, because they reach it through downloads. At the time, I did not realize that when the web server hosting my application queries the data through the API, that query needs permission to download the data. I guess that in my amateur mind, I thought that a computer would interact with the database the same way a human would, and that since there were no viewing restrictions, a query could be made without restrictions.
    • When the issue was brought to my attention and I finally realized that I had created it myself, I simply re-enabled the download option. Since I was so intimately familiar with the application's many components, the search for the error was not very extensive. I saw that the web application was still serving the website but the Google Fusion integration was not loading, and I also saw that Google Fusion was functioning properly on Google's servers. So when I inspected the JavaScript console errors in Chrome Developer Tools, I could see that the error was in the connection between the two servers, and I began checking for others who had experienced the same issue. Searching forums and reviewing my settings in the Google Fusion Table quickly led me to the fix: enabling downloads in my Google Fusion Table.
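
To make the root cause a little more concrete, here is a minimal sketch of the kind of request a browser-side integration sends when it queries a table, and of how the HTTP status can point toward a permissions problem. The function names are hypothetical and this is not the code of the library I used. The endpoint is assumed to follow the Fusion Tables v2 REST API, which has since been retired, so treat it as an illustration only. The post does not record the exact error I saw, so mapping 401/403 to "permissions" is also an assumption based on general HTTP semantics.

```javascript
// Assumed endpoint of the (now retired) Fusion Tables v2 query API.
const QUERY_ENDPOINT = "https://www.googleapis.com/fusiontables/v2/query";

// Build the URL for a SQL-style query against a table.
// tableId, whereClause and apiKey are placeholder inputs for illustration.
function buildQueryUrl(tableId, whereClause, apiKey) {
  const sql =
    "SELECT * FROM " + tableId + (whereClause ? " WHERE " + whereClause : "");
  return (
    QUERY_ENDPOINT +
    "?sql=" + encodeURIComponent(sql) +
    "&key=" + encodeURIComponent(apiKey)
  );
}

// Turn an HTTP status code into a hint about where to look first.
function diagnose(status) {
  if (status >= 200 && status < 300) return "ok";
  if (status === 401 || status === 403) {
    return "check table sharing and download permissions";
  }
  if (status >= 500) return "upstream service problem";
  return "check the request";
}
```

The point of the sketch is that the query is just another HTTP request for the data. A person viewing the table in a browser and a script requesting its rows can be governed by different permissions, which is exactly the distinction I missed.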
  • Corrective and preventative measures:

    • Since my switch to Google Fusion, this has been the only time my data has been down, apart from a few 15-minute blocks of scheduled maintenance. For the most part, Google Fusion has turned out to be a much better solution than Caspio's web application. Because this error was so minor, I have done nothing beyond enabling downloads to prevent it from recurring.
    • This experience, along with all the other steps I've taken to manage this web application, has taught me a lot about how to prevent downtime and improve user experiences. I am now much more careful in how I help others troubleshoot, knowing that any problem a user has is also my problem. Whatever prevents a user from understanding or accessing my software, I have a role in improving that experience. Beyond this particular issue, managing chicagoresourcehub.com has also taught me that my software should provide solutions that are scalable, easy to manage, easy to update, easy to integrate across services and platforms, and easy for customers to work with.
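
Since the outage ran for hours before a customer reported it, one inexpensive safeguard would be a small scheduled probe of the data source. This was not part of my setup; it is only a sketch under stated assumptions. `probeDataSource` is a hypothetical name, `fetchFn` is assumed to behave like the standard `fetch` function, and the response is assumed to be JSON with a `rows` array.

```javascript
// Hypothetical probe: fetchFn is any function shaped like the standard fetch
// (it resolves to an object with `ok`, `status`, and an async `json()`).
async function probeDataSource(fetchFn, url) {
  try {
    const response = await fetchFn(url);
    if (!response.ok) {
      return { healthy: false, reason: "HTTP " + response.status };
    }
    const body = await response.json();
    // Assumed response shape: an object with a `rows` array.
    const rowCount = Array.isArray(body.rows) ? body.rows.length : 0;
    return rowCount > 0
      ? { healthy: true, reason: rowCount + " rows returned" }
      : { healthy: false, reason: "no rows returned" };
  } catch (err) {
    // Network failures and malformed responses both end up here.
    return { healthy: false, reason: "request failed: " + err.message };
  }
}
```

Run on a timer against a query that should always return at least one row, a probe like this could send an email or other alert whenever `healthy` is false, so that I would hear about a locked-out table before my users did.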
Posted in code, google api, google apps, open data, open source.