Maximize Uptime with Our API and Infrastructure Monitoring

Explore how our advanced API and infrastructure monitoring solution maximizes uptime and performance, ensuring seamless customer experiences in a cloud-driven world.

Oct 21, 2024 | 2 min read

Initially, we used open-source tools such as Prometheus, Grafana, and the Blackbox Exporter for proactive API monitoring. While helpful, this approach had downsides: it polled APIs at set intervals, which added unnecessary load and increased the cost of external API calls, and it did not reflect the actual volume of API traffic.

To address these limitations, we moved to real-time API monitoring. This eliminates the need for constant polling, reduces the cost of monitoring external APIs, and accurately tracks the number of hits per API. The application now logs key metrics, and all of this data is stored in an index in an AWS OpenSearch domain.
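As a minimal sketch of what such a per-request log entry could look like, the snippet below builds one JSON document per API call. The field names (`@timestamp`, `api_name`, `status_code`, `response_ms`) and the `build_api_log` helper are assumptions made for illustration; the actual schema and the shipping of logs to OpenSearch are not shown here.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_api_log(api_name: str, status_code: int, response_ms: float,
                  when: Optional[datetime] = None) -> str:
    """Return one JSON log line describing a single API call."""
    # Default to the current UTC time when no timestamp is supplied.
    when = when or datetime.now(timezone.utc)
    record = {
        "@timestamp": when.isoformat(),   # hypothetical time field name
        "api_name": api_name,             # hypothetical field
        "status_code": status_code,       # hypothetical field
        "response_ms": response_ms,       # hypothetical field
    }
    return json.dumps(record)


if __name__ == "__main__":
    print(build_api_log("orders", 200, 42.5))
```

Because every request produces exactly one document, counting documents per API gives the real hit count without any polling.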

Our team deployed and configured a suite of open-source monitoring tools, including Grafana, Prometheus, Thanos, and the CloudWatch exporter, on AWS EKS. Prometheus was configured to scrape data from all virtual machines running our agent, as well as metrics from the CloudWatch exporter. To control cost, the CloudWatch exporter was configured to gather data only from specifically tagged resources, keeping infrastructure monitoring both efficient and cost-effective.

We set up real-time API monitoring in Grafana, which reads from our log data. At first we hit a snag: no data was showing up. After looking into it, we found the cause: Grafana was configured to query a time field whose name didn't match the one in our logs. Once we corrected that setting, the data started flowing as expected.
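A quick way to spot this kind of mismatch is to check whether sample log documents actually contain the time field the dashboard queries on. The sketch below is illustrative only: the field names `@timestamp` and `timestamp` and the `missing_time_field` helper are assumptions, not the names used in our setup.

```python
from typing import Iterable, Mapping


def missing_time_field(docs: Iterable[Mapping], time_field: str) -> int:
    """Count documents that lack the time field the dashboard queries on."""
    return sum(1 for doc in docs if time_field not in doc)


# Hypothetical sample documents whose time field is named "timestamp".
sample_logs = [
    {"timestamp": "2024-10-21T12:00:00Z", "status_code": 200},
    {"timestamp": "2024-10-21T12:00:05Z", "status_code": 500},
]

print(missing_time_field(sample_logs, "@timestamp"))  # 2: no document matches
print(missing_time_field(sample_logs, "timestamp"))   # 0: every document matches
```

If every document is missing the configured field, a time-range query returns nothing, which matches the empty dashboard we saw.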

We configured real-time dashboards to show key details about API performance, such as the number of requests, response times, and errors. Using data filters in Grafana, we made it easier to focus on the most relevant information, helping leaders make quick, informed decisions.

[Figure: Database Monitoring]

[Figure: API Status Code]

A major challenge we faced was that Grafana alerts typically rely on numeric data, while we were using JSON log data. To address this, we created a Lucene query that matched our alert conditions and organized the data by metric and timestamp. This transformed the log data into a time series format, enabling precise alerts. After setup, the alert system worked effectively, allowing us to quickly identify and resolve issues.
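The idea of turning log entries into a numeric series can be sketched in plain Python. This is not the Grafana or OpenSearch implementation, just an illustration of the same transformation: filter the entries that match the alert condition, then count them per time bucket. The query string, the field names, and the one-minute bucket size are assumptions.

```python
from collections import Counter
from datetime import datetime

# Hypothetical Lucene query for the alert condition (field name assumed).
ALERT_QUERY = "NOT status_code:200"


def to_time_series(logs, bucket_seconds=60):
    """Count non-200 log entries per time bucket, sorted by bucket start."""
    counts = Counter()
    for entry in logs:
        if entry["status_code"] == 200:
            continue  # mirrors the filter expressed by ALERT_QUERY
        ts = datetime.fromisoformat(entry["@timestamp"])
        epoch = int(ts.timestamp())
        # Round down to the start of the bucket this entry falls into.
        counts[epoch - epoch % bucket_seconds] += 1
    return sorted(counts.items())


logs = [
    {"@timestamp": "2024-10-21T12:00:05+00:00", "status_code": 200},
    {"@timestamp": "2024-10-21T12:00:10+00:00", "status_code": 500},
    {"@timestamp": "2024-10-21T12:00:50+00:00", "status_code": 404},
    {"@timestamp": "2024-10-21T12:01:20+00:00", "status_code": 503},
]

series = to_time_series(logs)
print([count for _, count in series])  # [2, 1]
```

Each `(bucket_start, count)` pair is a numeric data point, so an alert rule can compare the count against a threshold.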

[Figure: Alert: Status not 200]

We implemented a custom Python service to manage infrastructure and API alerts. This service efficiently creates support tickets in our ITSM tool and sends real-time email and call notifications to the right stakeholders.
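The overall shape of such a service might look like the sketch below. It is a simplified illustration, not our production code: the `Alert` fields, the handler functions, and the rule that only critical alerts trigger a call are all assumptions, and the handlers return strings where a real service would call the ITSM tool, mail server, or telephony provider.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Alert:
    name: str
    severity: str  # e.g. "critical" or "warning" (hypothetical levels)
    summary: str


Handler = Callable[[Alert], str]


def create_ticket(alert: Alert) -> str:
    # Placeholder: a real service would call the ITSM tool's API here.
    return f"ticket:{alert.name}"


def send_email(alert: Alert) -> str:
    # Placeholder for sending an email notification.
    return f"email:{alert.name}"


def place_call(alert: Alert) -> str:
    # Placeholder for triggering a phone call notification.
    return f"call:{alert.name}"


def dispatch(alert: Alert) -> List[str]:
    """Run every handler that applies to the alert and collect the results."""
    handlers: List[Handler] = [create_ticket, send_email]
    if alert.severity == "critical":  # assumed routing rule
        handlers.append(place_call)
    return [handler(alert) for handler in handlers]


if __name__ == "__main__":
    print(dispatch(Alert("status-not-200", "critical", "API returned non-200")))
```

Keeping each channel behind its own handler function makes it straightforward to add or swap a notification channel without touching the routing logic.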

In conclusion, this integrated monitoring and alerting solution gives us real-time insights into API and infrastructure performance. It helps us detect and resolve issues quickly, improving our efficiency and decision-making. Ultimately, this drives business continuity and growth.

Written by

Shradha Sarade

Head - Cloud

https://www.linkedin.com/in/shraddha-sarade/
