Having inspectable and observable governance within a Sitecore environment is like having a lighthouse to guide a ship through a stormy sea. Just as a lighthouse provides clear visibility of the hazards and safe paths in the ocean, inspectable and observable governance provides clear visibility and control of the applications and infrastructure in the Sitecore environment. It helps digital teams navigate through the complexities of the platform, avoid potential hazards, and safely reach their destination of building valuable digital experiences.
Sitecore platform owners need to be seen as business enablers. This means ensuring flawless availability across an end-to-end optimized full stack, and operating a secure, trusted platform for growth.
So, as a platform owner, how can you:
For a truly trusted platform with a full-stack security baseline and the ability to execute zero-downtime deployments and rollbacks, the final piece of the puzzle is inspectable, observable governance.
Inspectable and observable governance is important for Sitecore platform owners and their digital teams because it provides visibility and transparency into the entire platform's health and performance, including security, compliance, and operations.
In a Sitecore website tech stack, implementing an effective observability framework involves several key practices. Begin by integrating Sitecore's built-in logging features and using the Sitecore log files for insights into application behavior. Use Application Insights or similar tools to capture telemetry data, including custom events and exceptions.
Instrument your codebase with Sitecore's logging APIs and incorporate structured logging for enhanced readability. Employ Sitecore's Experience Database (xDB) and Analytics to gather user behavior data, allowing for comprehensive monitoring of user interactions.
Implement centralized logging using tools like ELK Stack or Splunk to aggregate and analyze logs from various Sitecore components. Utilize Application Performance Monitoring (APM) tools to capture performance metrics, ensuring timely identification of bottlenecks and optimizations. Leverage Sitecore's diagnostic tools, such as the Developer Center, to troubleshoot issues and monitor the health of the platform.
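As a rough illustration of the centralized logging step, the sketch below turns raw log lines into JSON records that a shipper could forward to ELK Stack or Splunk. It assumes a log line layout of thread, time, level, then message; check this against your own log4net conversion pattern before relying on it. The `to_structured` function and the `role` tag are hypothetical names used only for this example.

```python
import json
import re
from typing import Optional

# Assumed line layout: "<thread> <HH:MM:SS> <LEVEL> <message>".
# This is an assumption about the log4net pattern in use; adjust as needed.
LOG_LINE = re.compile(
    r"^(?P<thread>.+?)\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>DEBUG|INFO|WARN|ERROR|FATAL)\s+(?P<message>.*)$"
)


def to_structured(line: str, role: str) -> Optional[str]:
    """Convert one raw log line to a JSON record, or None if it does not match."""
    match = LOG_LINE.match(line.rstrip("\n"))
    if match is None:
        # Lines such as stack-trace continuations do not start a new record.
        return None
    record = match.groupdict()
    record["role"] = role  # hypothetical tag for the emitting instance, e.g. "CM" or "CD"
    return json.dumps(record, sort_keys=True)
```

Emitting one JSON object per line keeps the records easy to index and filter by level or role once they reach the central store.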
You can create custom dashboards in tools like Grafana or Power BI to visualize Sitecore-specific metrics, including content delivery, indexing, and database performance. Set up alerting mechanisms for critical thresholds, providing proactive notifications for potential issues.
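To make the idea of alerting on critical thresholds concrete, here is a minimal sketch of a threshold evaluator. The metric names and limits are invented for illustration; in practice this logic would normally live in your monitoring tool's alert rules rather than in custom code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    metric: str      # hypothetical metric name
    warn: float      # value at or above which a warning is raised
    critical: float  # value at or above which a critical alert is raised


def evaluate(thresholds: List[Threshold], sample: Dict[str, float]) -> Dict[str, str]:
    """Return the alert level for each metric in the sample that breaches a threshold."""
    alerts: Dict[str, str] = {}
    for t in thresholds:
        value = sample.get(t.metric)
        if value is None:
            continue  # metric not reported in this sample
        if value >= t.critical:
            alerts[t.metric] = "critical"
        elif value >= t.warn:
            alerts[t.metric] = "warning"
    return alerts


# Illustrative thresholds only; tune these to your own platform.
RULES = [
    Threshold("index_queue_length", warn=500, critical=2000),
    Threshold("sql_cpu_percent", warn=70, critical=90),
]
```

For example, `evaluate(RULES, {"index_queue_length": 750, "sql_cpu_percent": 95})` would flag the first metric as a warning and the second as critical.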
Observability is of paramount importance in the context of headless websites due to the decentralized and modular nature of their architecture. In a headless setup, the front-end and back-end are decoupled, relying on microservices and APIs to deliver content and functionality. This decoupling introduces complexities in monitoring and debugging, making observability essential for maintaining optimal performance and resolving issues promptly.
With traditional monolithic architectures, tracking down problems could be relatively straightforward, but headless websites involve numerous independent services interacting to deliver a seamless user experience.
It’s critical that you and your partners regularly review and update the observability strategy to align with Sitecore updates and evolving business requirements. Document configurations, instrumentation practices, and troubleshooting procedures for efficient collaboration within the Sitecore development team.
With years of experience implementing this type of change alongside in-house digital teams and partner agencies, we know that frustrations emerge when technical teams do not have visibility of, or access to, the right metrics.
Without a robust observability framework in a Sitecore website, various risks and challenges may arise, particularly in a headless context. Inadequate visibility into system behavior can lead to delayed issue detection and resolution, affecting user experience and overall performance.
In a headless scenario, where content is delivered via APIs to JSS applications, SPAs, or other front-end channels, the lack of observability may make it difficult to pinpoint issues in the decoupled architecture. Failure to monitor API response times, content retrieval, or external service dependencies could lead to latency and content delivery issues across different channels.
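One simple way to keep an eye on API response times is to track a high percentile rather than the average, since a handful of slow requests can hide behind a healthy mean. The sketch below uses the nearest-rank method; the function names and the latency budget are assumptions made for this example.

```python
import math
from typing import Sequence


def percentile(samples_ms: Sequence[float], p: int) -> float:
    """Nearest-rank percentile of latency samples, for p between 1 and 100."""
    if not samples_ms:
        raise ValueError("no samples")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples_ms)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def breaches_budget(samples_ms: Sequence[float], budget_ms: float, p: int = 95) -> bool:
    """True when the chosen percentile exceeds the latency budget."""
    return percentile(samples_ms, p) > budget_ms
```

With ten samples where nine sit near 100 ms and one takes 900 ms, the median still looks healthy while the 95th percentile exposes the slow request.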
The absence of centralized logging may hinder the identification of errors or anomalies in the headless architecture. For instance, if a content update fails to propagate across channels, the lack of comprehensive logs makes it challenging to trace the issue back to its source.
Inefficient resource utilization, unoptimized queries, or bottlenecks in the headless API layer might go unnoticed without proper metrics and monitoring. This can impact scalability and degrade the overall performance of the Sitecore system.
A deficient observability framework increases the risk of security vulnerabilities, as potential threats and unauthorized access may remain undetected. Without adequate monitoring, it becomes challenging to identify and respond promptly to security incidents, exposing the Sitecore website to potential breaches.
But if platform owners can implement this change, they replace these frustrations with a smooth, efficient process, with the tools to monitor the platform easily and to identify and resolve issues quickly. They will also have clear communication and collaboration between all stakeholders involved in the platform, ensuring everyone is on the same page and working towards the same goals.
An outage, by general definition, is when the website fails to return a 200 response and generate the HTML that you expect. Dataweavers tracking systems are intelligent and test from multiple locations and networks around the globe. They are also designed to tolerate transient issues such as bad DNS resolution or propagation, general internet congestion, weak mobile connections and even the device-level issues that sometimes occur.
This reduces false positives and helps avoid alert fatigue.
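The outage definition above can be sketched in a few lines. This is not how the Dataweavers tracking systems are implemented; it is a simplified illustration that assumes a probe passes when it sees a 200 status and an expected marker in the HTML, and that an outage is declared only when more than a chosen share of locations fail. The names `ProbeResult`, `probe_ok`, `is_outage` and the `quorum` parameter are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ProbeResult:
    location: str  # where the check ran from
    status: int    # HTTP status code returned
    body: str      # HTML returned by the site


def probe_ok(result: ProbeResult, expected_marker: str) -> bool:
    """A probe passes only with a 200 status and the expected HTML present."""
    return result.status == 200 and expected_marker in result.body


def is_outage(results: Sequence[ProbeResult], expected_marker: str,
              quorum: float = 0.5) -> bool:
    """Declare an outage only when more than `quorum` of locations fail (assumed rule)."""
    if not results:
        return False  # no data is not treated as evidence of an outage
    failures = sum(1 for r in results if not probe_ok(r, expected_marker))
    return failures / len(results) > quorum
```

Requiring agreement between locations means a single failed probe, perhaps caused by local congestion or a DNS hiccup, does not page anyone on its own.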
However, we do sometimes see micro-outages in the following scenarios:
These scenarios typically affect only a very small number of connections, and sometimes the tracking system is one of them. From a customer perspective, unless the outage lasts longer than expected, the platform will fail over to the secondary endpoint (instance or region) during these scenarios. This means there is typically no impact on the majority of end users.
The tracking platform Dataweavers uses records these events because it tracks a primary connection to the primary server, and the website is checked once every minute. This means that if the first connection to the primary server fails, it will be another minute before the next connection is attempted.
This does not mean that the website is inaccessible during that time; it simply means that our health checks run at an interval that allows us to meet a defined SLA.
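To see why a brief blip can show up as a full minute in the records, consider the sketch below. It assumes checks run at a fixed interval and that each run of consecutive failed checks is recorded as one event lasting the number of failed checks multiplied by the interval. The `recorded_outages` function is a hypothetical illustration, not the tracking platform's actual reporting logic.

```python
from typing import List, Sequence, Tuple


def recorded_outages(checks: Sequence[bool], interval_s: int = 60) -> List[Tuple[int, int]]:
    """Group consecutive failed checks into (start_index, recorded_seconds) events."""
    events: List[Tuple[int, int]] = []
    start = None
    for i, ok in enumerate(checks):
        if not ok and start is None:
            start = i  # first failed check of a new event
        elif ok and start is not None:
            events.append((start, (i - start) * interval_s))
            start = None
    if start is not None:
        # The series ended while checks were still failing.
        events.append((start, (len(checks) - start) * interval_s))
    return events
```

A single failed check between two successful ones is recorded as 60 seconds, even if the underlying disruption lasted only a moment and end users were served by the secondary endpoint throughout. The figure reflects the granularity of the checks, not a measured duration.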