Monitoring

The Semantic Search has built-in monitoring that lets you track the execution of queries and administrative tasks. Health checks for the constituent services of the Search are also available. In addition, the service has a good-to-go endpoint that offers a quick view of the overall health status of the system. All requests are associated with one or more logging messages, making it easier to keep track of the state of the system.

Health Checks

The health checks can be obtained from the __health endpoint. The health check service has a cache that refreshes once a set number of seconds has passed since it was last requested (30 by default). The default can be changed by setting the health.cache.invalidation.period configuration parameter. Cache usage can also be controlled at runtime with the boolean URL parameter cache. By default, requests without additional parameters use the cache.
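As a minimal sketch of how a client might call this endpoint, the Python snippet below builds the request URL with the optional cache parameter and decodes the JSON response. The helper names health_url and fetch_health, the base URL used in the note after the code, and the assumption that the parameter accepts the literal true (only cache=false appears in this documentation) are illustrative and not part of the service.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen


def health_url(base_url: str, use_cache: Optional[bool] = None) -> str:
    """Build the __health URL.

    Leaving use_cache as None omits the parameter, so the service
    falls back to its default behavior (use the cache).
    """
    url = base_url.rstrip("/") + "/__health"
    if use_cache is not None:
        # Assumption: the boolean is passed as the literal "true" or "false".
        url += "?" + urlencode({"cache": "true" if use_cache else "false"})
    return url


def fetch_health(base_url: str, use_cache: Optional[bool] = None,
                 timeout: float = 5.0) -> dict:
    """Request the health report and decode the JSON body."""
    with urlopen(health_url(base_url, use_cache), timeout=timeout) as response:
        return json.load(response)
```

For example, fetch_health("http://localhost:8080", use_cache=False) would request a fresh report from a hypothetical local instance. Note that urlopen raises an exception for HTTP error statuses, so callers may want to catch urllib.error.URLError.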

There are two distinct health checks associated with the Semantic Search:

  • Search health check: Verifies that each dependent component required for the proper execution of the search request is available and in operational state. It checks whether there is a bound SOML schema and whether there is a connection to the Elasticsearch.
  • Elasticsearch indexes health check: Validates that all of the indexes are available and operating normally. It also performs a connection test to Elasticsearch and uses the cluster health request from Elasticsearch to calculate the overall state of the indexes. Note one special case: the check returns an OK status when Elasticsearch runs on a single node (replicas), although Elasticsearch does not recommend such usage.

Each of the described checks has a detailed response. The responses contain the following items:

  • id: The ID is obtained from a set of standard Ontotext IDs which are unique and persistent across the service. All checks are prefixed with 2 to indicate Semantic Search-related problems.
    • Search OK - 2000: There is no issue with the service that handles search requests.
    • Search unavailable - 2001: The Semantic Search is unavailable and cannot process any search requests.
    • SOML not bound - 2002: There is no SOML schema bound to the service.
    • SOML unavailable - 2003: The bound SOML schema could not be loaded for the store. Either the Search could not establish connection to the store or the model was removed from the store.
    • Elastic unavailable - 2004: The Search does not have connection to Elasticsearch.
    • Indexes OK - 2100: There is no issue with the required Elasticsearch indexes, and all of them are available.
    • Remote Elastic unavailable - 2101: The remote Elasticsearch instance is not available and the status of the indexes could not be retrieved.
    • Indexes unavailable - 2102: There was an internal error during the health check procedure. It shows that the service is not available and there are issues with it.
    • Index SOML not bound - 2103: There is no SOML schema bound to the service, thus there are no indexes to check for.
    • Indexes SOML unavailable - 2104: The bound SOML schema could not be loaded, therefore the required indexes could not be retrieved for a correct health check.
    • Indexes errors - 2105: There is an issue with one or more indexes and their individual status is not OK.
    • Missing indexes - 2106: There is at least one required index that is missing in Elasticsearch. This may occur when the index was not created or was removed from Elasticsearch for some reason.
  • status: Marks the status of the particular component. Can be ERROR or OK. This parameter should be analyzed together with the severity of the given health check.
  • severity: Marks the impact of the errors in a given component on the entire system. Can be LOW, MEDIUM, or HIGH. LOW severity is returned when there are issues that should not seriously affect the overall behavior of the Search. MEDIUM is returned when the error will lead to issues with other services but not to an unrecoverable state. HIGH severity errors mean that the Search is unusable until they are resolved. The severity is only returned if a dependent component is not OK.
  • name: A human-friendly name for the check. It can be inferred from the check ID as well.
  • type: A human-friendly identifier for the check. It can be either search or elasticIndexes.
  • impact: A human-friendly short description of the error, providing a quick reference for how the problem will impact the service.
  • description: A description of the check itself and what it covers.
  • troubleshooting: Contains a link to the troubleshooting documentation that offers specific steps to help users fix the problem. If there is no problem, it points to the general __trouble page.
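For client code that needs to interpret check IDs, the listing above can be captured in a small lookup table. This is a sketch only: the names HEALTH_CHECK_IDS, describe_check_id, and check_type_for_id are hypothetical, and the mapping of the 20xx and 21xx ranges to the search and elasticIndexes types is inferred from the listed IDs rather than stated by the service.

```python
from typing import Optional

# Check IDs and names as listed in this section.
HEALTH_CHECK_IDS = {
    "2000": "Search OK",
    "2001": "Search unavailable",
    "2002": "SOML not bound",
    "2003": "SOML unavailable",
    "2004": "Elastic unavailable",
    "2100": "Indexes OK",
    "2101": "Remote Elastic unavailable",
    "2102": "Indexes unavailable",
    "2103": "Index SOML not bound",
    "2104": "Indexes SOML unavailable",
    "2105": "Indexes errors",
    "2106": "Missing indexes",
}


def describe_check_id(check_id: str) -> str:
    """Return the documented name for a check ID, or a placeholder if unknown."""
    return HEALTH_CHECK_IDS.get(str(check_id), "Unknown check ID")


def check_type_for_id(check_id: str) -> Optional[str]:
    """Guess the check type from the ID range (an inference, not a documented rule)."""
    check_id = str(check_id)
    if check_id.startswith("20"):
        return "search"
    if check_id.startswith("21"):
        return "elasticIndexes"
    return None
```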

The health checks update dynamically with the state of the overall system. When a given component recovers, its health check will also return to OK.

Besides the described health checks, each request to the __health endpoint returns an overall status field, detailing the state of the system. This is OK if no errors are present, WARNING if errors are present but their severity is not HIGH, and ERROR if errors are present and their severity is HIGH.
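The rule for the overall status can be restated as a short function. The sketch below is a client-side illustration of the rule as described here, not the service's own implementation; the function name overall_status is hypothetical, and it assumes each check is a dictionary with a status field and, for failing checks, a severity field.

```python
def overall_status(health_checks: list) -> str:
    """Derive the overall status from individual health check results."""
    # Any check whose status is not OK counts as an error.
    errors = [check for check in health_checks if check.get("status") != "OK"]
    if not errors:
        return "OK"
    # A single HIGH severity error makes the whole system ERROR.
    if any(check.get("severity") == "HIGH" for check in errors):
        return "ERROR"
    # Errors are present, but none of them is HIGH.
    return "WARNING"
```

For instance, a report in which the only failing check has MEDIUM severity would yield WARNING under this rule.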

This is an example of a healthy Search instance:

{
  "status":"OK",
  "healthChecks":[
    {
      "status":"OK",
      "id":"2000",
      "name":"Search service health",
      "type":"search",
      "impact":"Search service operating normally.",
      "troubleshooting":"http://otp-search.com/__trouble",
      "description":"Search service checks.",
      "message":"Search service operating normally."
    },
    {
      "status":"OK",
      "id":"2100",
      "name":"Elastic indexes health",
      "type":"elasticIndexes",
      "impact":"All indexes are available",
      "troubleshooting":"http://otp-search.com/__trouble",
      "description":"",
      "message":"All indexes are available"
    }
  ]
}
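A monitoring script will usually care only about the checks that are not OK. The following sketch, which assumes the response shape shown in the example above, extracts those checks together with their troubleshooting links. The function name failing_checks is hypothetical.

```python
def failing_checks(report: dict) -> list:
    """Return a summary of every health check whose status is not OK."""
    return [
        {
            "id": check.get("id"),
            "name": check.get("name"),
            # severity is only present when a component is not OK
            "severity": check.get("severity"),
            "troubleshooting": check.get("troubleshooting"),
        }
        for check in report.get("healthChecks", [])
        if check.get("status") != "OK"
    ]


# Abbreviated version of the healthy response shown above.
healthy_report = {
    "status": "OK",
    "healthChecks": [
        {"status": "OK", "id": "2000", "name": "Search service health"},
        {"status": "OK", "id": "2100", "name": "Elastic indexes health"},
    ],
}

# A healthy instance has no failing checks.
assert failing_checks(healthy_report) == []
```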

Good to Go

The good-to-go endpoint is available at __gtg. The endpoint also has a cache that refreshes if 30 seconds have passed since it was last requested. Cache usage is controlled by the boolean URL parameter cache, which also determines whether the endpoint performs a full health check or uses the health check cache.

The good-to-go endpoint returns OK if the Search is operational and can be used, i.e., the status of the health checks is either OK, or it is WARNING and can be recovered to OK without restarting the Search instance. The endpoint returns ERROR when the status of the health checks is ERROR.

Good-to-go and health checks can be used in tandem to let an orchestration tool manage the Semantic Search. Below is a sample Kubernetes configuration for the Search that shows how to use the good-to-go and health check endpoints to monitor the status of your application:

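spec:
  containers:
  # Container names must be valid DNS labels (lowercase alphanumerics and hyphens).
  - name: otp-search
    image: ontotext/search
    # Readiness: bypass the cache so the probe reflects the current state.
    readinessProbe:
      httpGet:
        path: /__gtg?cache=false
        port: 8080
      initialDelaySeconds: 3
      periodSeconds: 10
    # Liveness: rely on the cached health check report.
    livenessProbe:
      httpGet:
        path: /__health
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 30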

Tip

We recommend a health check period of at least 10 seconds if not using the cache.

Another good practice is not to set cache=false when a health check period is longer than the cache invalidation period. The assumption made here is that the cache will be invalidated anyway, or, if it is not, that another tool using the health checks has refreshed it in the meantime.
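The two recommendations above can be folded into a small helper that decides which probe path to use. This is only one way to encode the guidance: the function name probe_path and its parameters are hypothetical, and the default of 30 seconds mirrors the default cache invalidation period mentioned earlier.

```python
MIN_UNCACHED_PERIOD_SECONDS = 10  # recommended minimum period when bypassing the cache


def probe_path(endpoint: str, period_seconds: int, want_fresh: bool,
               invalidation_seconds: int = 30) -> str:
    """Pick the probe path for an endpoint such as "/__gtg" or "/__health"."""
    if not want_fresh:
        return endpoint
    if period_seconds > invalidation_seconds:
        # The cache is expected to have expired between probes anyway,
        # so there is no need to bypass it explicitly.
        return endpoint
    if period_seconds < MIN_UNCACHED_PERIOD_SECONDS:
        raise ValueError(
            "uncached checks should run at most once every "
            f"{MIN_UNCACHED_PERIOD_SECONDS} seconds"
        )
    return endpoint + "?cache=false"
```

With the sample configuration above, probe_path("/__gtg", 10, want_fresh=True) yields the readiness path /__gtg?cache=false, while probe_path("/__health", 30, want_fresh=False) yields /__health.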

This is an example of a Search instance that is good to go:

{
  "gtg": "OK"
}
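Outside of an orchestration tool, a deployment script could poll the good-to-go endpoint until the instance is ready. The sketch below assumes the response shape shown above. The names is_good_to_go and wait_until_ready, the retry settings, and the injectable fetch and sleep parameters (included to make the function easy to test) are illustrative choices.

```python
import json
import time
from urllib.request import urlopen


def is_good_to_go(payload: dict) -> bool:
    """Interpret a __gtg response body."""
    return payload.get("gtg") == "OK"


def _fetch_gtg(base_url: str) -> dict:
    """Call the __gtg endpoint and decode the JSON body."""
    with urlopen(base_url.rstrip("/") + "/__gtg", timeout=5.0) as response:
        return json.load(response)


def wait_until_ready(base_url: str, attempts: int = 10,
                     delay_seconds: float = 10.0,
                     fetch=_fetch_gtg, sleep=time.sleep) -> bool:
    """Poll __gtg until it reports OK or the attempts run out."""
    for attempt in range(attempts):
        try:
            if is_good_to_go(fetch(base_url)):
                return True
        except (OSError, ValueError):
            # Unreachable service or malformed body: treat as not ready yet.
            pass
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

The default delay of 10 seconds follows the recommended minimum period for uncached checks; since this sketch does not pass cache=false, a shorter delay would mostly return cached results.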

Troubleshooting

The __trouble endpoint helps troubleshoot and analyze issues with the Semantic Search, outlining common error modes and their resolution. The troubleshooting documentation contains the following components:

  • Important endpoints: An overview of the endpoints supported by the service.
  • Example query requests: Provides a streamlined example of using the Semantic Search.
  • Prerequisites: Lists the skill set that a successful maintainer should have.
  • Resolving known issues: Provides a list of known symptoms together with potential causes and suggested resolution methods.

The troubleshooting endpoint is a starting point for analyzing any issues with the Semantic Search and may often be sufficient for resolving them on its own. If you cannot resolve the issues with the help of this endpoint, please contact our support.

About

The __about endpoint lists the Search version, its build date, a short description of what the Semantic Search is, and a link to this documentation.

Semantic Services Monitoring

The Semantic Search monitoring can also be included in the global monitoring performed for all Ontotext Semantic Services. It uses Grafana to monitor the performance and health of the deployed services. See how to use it here.