Semantic Objects Search¶
Overview¶
The Semantic Objects Search provides a way to index the data from the Semantic Objects Service in Elasticsearch and run queries against it. It consists of two services: the Semantic Objects Search Service and the Semantic Objects Service.
The Semantic Objects Service is responsible for indexing the data. On a SOML bind action, the Semantic Objects Service creates Elasticsearch Connector instances in GraphDB. These Connectors ensure that the data from GraphDB is always indexed and up to date in Elasticsearch.
The Semantic Objects Search Service, on the other hand, provides a GraphQL endpoint over the data in Elasticsearch, making the data easy to consume. To make a SOML schema searchable via the Search Service, first upload and bind the SOML schema in the Semantic Objects Service. Then bind the schema to the Search Service, which reads it from the Semantic Objects Service SOML storage. The Search Service assumes that all data is already present in Elasticsearch.
See the SOML Search documentation for information on how to configure a SOML schema for the Search Service.
Data Indexing¶
During data indexing, the following happens:
Semantic Objects Service removes any old GraphDB Elasticsearch Connectors (if any are left from previous indexing tasks).
Semantic Objects Service creates new GraphDB Elasticsearch Connectors.
GraphDB performs the indexing to Elasticsearch.
The data indexing process is triggered when:
A new SOML schema is being bound. Calling a bind action on an already bound SOML schema does not trigger data indexing.
The already bound SOML schema is being updated.
Note
Each update on the bound schema will trigger a data indexing task, which, depending on the data, may be quite time-consuming. Proceed with caution when updating a SOML schema.
The Semantic Objects Service may also trigger deletion of the Elasticsearch GraphDB Connectors, resulting in deletion of the Elasticsearch indices. This is performed when:
A SOML schema is being unbound (you have called a bind action on another schema).
A SOML schema is being deleted.
Both of these actions will remove only the indices associated with the given SOML schema. All other GraphDB Connector instances and Elasticsearch indices will not be affected.
Note that setting elasticsearch.indexingEnabled to false does not trigger deletion of the indices. If you have big indices and plan many SOML updates, you can avoid rebuilding the Elasticsearch indices on each update by disabling Elasticsearch indexing in the Semantic Objects Service. GraphDB will continue to update the existing Elasticsearch indices, so the data stays up to date.
However, changes in the SOML model will not be applied. Long-term, this is not advisable, as the Semantic Search Service may start using a data model that does not correspond to the indexed data, which may cause unexpected problems.
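The trigger rules in this section can be summarized as a small decision function. This is only an illustrative sketch: the `Effect` enum and the `indexing_effect` function are hypothetical names, not part of any Platform API, and the handling of an update to a schema that is not bound is an assumption inferred from the text.

```python
from enum import Enum


class Effect(Enum):
    NONE = "no indexing work is triggered"
    REINDEX = "old connectors removed, new connectors created, GraphDB reindexes"
    DROP = "connectors and indices of this schema are deleted"


def indexing_effect(action: str, schema_already_bound: bool = False) -> Effect:
    """Map a SOML action to the indexing effect described in this section."""
    if action == "bind":
        # Re-binding an already bound schema does not trigger indexing.
        return Effect.NONE if schema_already_bound else Effect.REINDEX
    if action == "update":
        # Only updates of the bound schema trigger indexing
        # (the unbound case is an assumption).
        return Effect.REINDEX if schema_already_bound else Effect.NONE
    if action in ("unbind", "delete"):
        # Only the indices associated with the given schema are removed.
        return Effect.DROP
    raise ValueError(f"unknown action: {action!r}")
```

For example, `indexing_effect("update", schema_already_bound=True)` yields `Effect.REINDEX`, which is why frequent updates of a bound schema can be expensive.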
Quick Start¶
To deploy the Platform with the Semantic Search Service, download the following docker-compose.yaml example. It starts the Semantic Search Service along with the Semantic Objects Services (Semantic Objects, Workbench, GraphDB, and MongoDB), Elasticsearch, and Kibana.
Once you have downloaded the compose file (manifest) from this page, follow the Quick Start guide using this file instead of the one defined in the guide (skip the download operation in the Docker Compose section of the guide).
After deploying the Platform, the Search Service will be available at http://localhost:9980.
To configure the Semantic Search Service, use the declarative Platform schema and its configuration options.
GraphQL API¶
The primary API of the Search Service is the /graphql REST endpoint. It exposes a GraphQL schema based on the searchable shapes and properties of a bound SOML schema.
The GraphQL schema is tailored to be as close as possible to the Elasticsearch DSL, including queries, sorting, and aggregations.
The /graphql endpoint is available for both GET and POST requests.
GET request example for the query query all_humans { human_search { hits { human { id } } } }:
curl --location -X GET \
-H 'Content-Type: application/graphql' \
'http://localhost:9980/graphql?query=query%20all_humans%20%7B%20human_search%20%7B%20hits%20%7B%20human%20%7B%20id%20%7D%20%7D%20%7D%20%7D'
POST request example for the raw query query all_humans { human_search { hits { human { id } } } }:
curl --location -X POST \
-H 'Content-Type: application/graphql' \
--data 'query all_humans { human_search { hits { human { id } } } }' \
'http://localhost:9980/graphql'
POST request example for the query query all_humans { human_search { hits { human { id } } } } as a JSON payload:
curl --location -X POST \
-H 'Content-Type: application/json' \
--data '{"operationName": "all_humans", "query": "query all_humans { human_search { hits { human { id } } } }"}' \
'http://localhost:9980/graphql'
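The same GET and JSON POST requests can be prepared from Python using only the standard library. This is a minimal sketch, assuming the Quick Start address http://localhost:9980; the helper names `graphql_get_url` and `graphql_post_request` are illustrative and not part of the service.

```python
import json
from urllib.parse import quote
from urllib.request import Request

BASE_URL = "http://localhost:9980"  # Quick Start address; adjust for your deployment
QUERY = "query all_humans { human_search { hits { human { id } } } }"


def graphql_get_url(query: str, base_url: str = BASE_URL) -> str:
    """Build the GET URL with the query percent-encoded, as in the cURL example."""
    return f"{base_url}/graphql?query={quote(query, safe='')}"


def graphql_post_request(query: str, operation_name: str,
                         base_url: str = BASE_URL) -> Request:
    """Build (but do not send) the JSON POST request from the cURL example."""
    payload = json.dumps({"operationName": operation_name, "query": query})
    return Request(
        f"{base_url}/graphql",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send a prepared request against a running instance, pass it to `urllib.request.urlopen` and decode the response body with `json.load`.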
Tutorials¶
Queries¶
Paging¶
Sorting¶
Aggregations¶
Monitoring¶
The Search Service has built-in monitoring that allows you to track the execution of queries and administrative tasks. Health checks for the constituent services of the Search are also available. The service also has a good-to-go endpoint that offers a quick view of the overall health status of the system. All requests are associated with one or more logging messages, making it easier to keep track of the state of the service.
Health Checks¶
The health checks can be obtained from the __health endpoint. The health check service also has a cache that refreshes when a certain number of seconds have passed since the last time it was requested (default is 30). The default can be changed by setting the health.cache.invalidation.period configuration parameter. The use of the cache can also be controlled at runtime with the boolean URL parameter cache. By default, requests without additional parameters use the cache.
There are two distinct health checks associated with the Search Service:
Search health check: Verifies that each dependent component required for the proper execution of a search request is available and operational. It checks whether there is a bound SOML schema and whether there is a connection to Elasticsearch.
Elasticsearch indexes health check: Validates that all of the indexes are available and operating normally. It also performs a connection test to Elasticsearch and uses the Elasticsearch cluster health request to calculate the overall state of the indexes. As a special case, this check returns an OK status if Elasticsearch is used with a single node (replicas), although Elasticsearch does not recommend such usage.
Each of the described checks has a detailed response. The responses contain the following items:
id: The ID is obtained from a set of standard Ontotext IDs which are unique and persistent across the service. All checks are prefixed with 2 to indicate Search Service related problems.
Search OK - 2000: There is no issue with the service that handles search requests.
Search unavailable - 2001: The Search service is unavailable and cannot process any search requests.
SOML not bound - 2002: There is no SOML schema bound to the service.
SOML unavailable - 2003: The bound SOML schema could not be loaded for the store. Either the Search could not establish connection to the store or the model was removed from the store.
Elastic unavailable - 2004: The Search does not have connection to Elasticsearch.
Indexes OK - 2100: There is no issue with the required Elasticsearch indexes, and all of them are available.
Remote Elastic unavailable - 2101: The remote Elasticsearch instance is not available and the status of the indexes could not be retrieved.
Indexes unavailable - 2102: There was an internal error during the health check procedure. It shows that the service is not available and there are issues with it.
Index SOML not bound - 2103: There is no SOML schema bound to the service, thus there are no indexes to check for.
Indexes SOML unavailable - 2104: The bound SOML schema could not be loaded, therefore the required indexes could not be retrieved for a correct health check.
Indexes errors - 2105: There is an issue with one or more indexes and their individual status is not OK.
Missing indexes - 2106: There is at least one required index that is missing in Elasticsearch. This may occur when the index was not created or was removed from Elasticsearch for some reason.
status: Marks the status of the particular component. Can be ERROR or OK. This parameter should be analyzed together with the impact status for the given health check.
severity: Marks the impact of the errors in a given component on the entire system. Can be LOW, MEDIUM, or HIGH.
LOW severity is returned when there are issues that should not seriously affect the overall behavior of the Search.
MEDIUM is returned when the error will lead to issues with other services but not to an unrecoverable state.
HIGH severity errors mean that the Search is unusable until they are resolved. It is only returned if a dependent component is not OK.
.name: A human-friendly name for the check. It can be inferred from the check ID as well.
type: A human-friendly identifier for the check. It can be either search or elasticIndexes.
impact: A human-friendly short description of the error, providing a quick reference for how the problem will impact the service.
description: A description of the check itself and what it covers.
troubleshooting: Contains a link to the troubleshooting documentation that offers specific steps to help users fix the problem. If there is no problem, it points to the general __trouble page.
The health checks update dynamically with the state of the overall system. When a given component recovers, its health check will also return to OK.
Besides the described health checks, each request to the __health endpoint returns an overall status field, detailing the state of the system. This is OK if no errors are present, WARNING if errors are present but their impact is not HIGH, and ERROR if errors are present and their impact is HIGH.
This is an example of a healthy Search instance:
{
  "status": "OK",
  "healthChecks": [
    {
      "status": "OK",
      "id": "2000",
      "name": "Search service health",
      "type": "search",
      "impact": "Search service operating normally.",
      "troubleshooting": "http://otp-search.com/__trouble",
      "description": "Search service checks.",
      "message": "Search service operating normally."
    },
    {
      "status": "OK",
      "id": "2100",
      "name": "Elastic indexes health",
      "type": "elasticIndexes",
      "impact": "All indexes are available",
      "troubleshooting": "http://otp-search.com/__trouble",
      "description": "",
      "message": "All indexes are available"
    }
  ]
}
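The rule for the overall status field can be sketched as a small function over the healthChecks array. This is an illustration of the rule as described in this section, not the service's actual implementation; it assumes that each failing check carries the severity field listed above, and the name `overall_status` is hypothetical.

```python
def overall_status(health_checks: list[dict]) -> str:
    """Derive the overall status from individual health check entries."""
    # Any check whose status is not OK counts as an error.
    errors = [check for check in health_checks if check.get("status") != "OK"]
    if not errors:
        return "OK"
    # Assumption: failing checks expose a "severity" of LOW, MEDIUM, or HIGH.
    if any(check.get("severity") == "HIGH" for check in errors):
        return "ERROR"
    return "WARNING"
```

Applied to the healthy response above, where both checks report OK, the function returns OK.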
Good to Go¶
The good-to-go endpoint is available at __gtg. The endpoint service also has a cache that refreshes if 30 seconds have passed since the last time it was requested. This is controlled by the boolean URL parameter cache. This parameter also controls whether to perform a full health check or to use the health check cache.
The good-to-go endpoint returns OK if the Search is operational and can be used, i.e., the status of the health checks is either OK, or it is WARNING and can be recovered to OK without a Search instance restart. The endpoint returns ERROR when the status of the health checks is ERROR.
Good-to-go and health checks can be used in tandem in order to enable an orchestration tool for managing the Search Service. Below is a sample Kubernetes configuration for the Search that showcases how to utilize good-to-go and health check to monitor the status of your application:
spec:
  containers:
    # Kubernetes container names must be valid DNS labels (lowercase, no spaces)
    - name: otp-search
      image: ontotext/search
      readinessProbe:
        httpGet:
          path: /__gtg?cache=false
          port: 8080
        initialDelaySeconds: 3
        periodSeconds: 10
      livenessProbe:
        httpGet:
          path: /__health
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 30
Tip
We recommend a health check period of at least 10 seconds if not using the cache. Another good practice is not to set cache=false if a health check has a period greater than the cache invalidation period. The assumption made here is that the cache will be invalidated anyway, or, if it is not, that another tool using the health checks has refreshed it in the meantime.
This is an example of a Search instance that is good to go:
{
"gtg": "OK"
}
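The caching advice from the tip above can be turned into a helper that picks a probe path for a given probe period. This is a sketch under stated assumptions: the function name `probe_path` is hypothetical, the 30-second default mirrors the documented cache invalidation period, and a period equal to the cache period is treated like the sample configuration (no cache=false).

```python
DEFAULT_CACHE_PERIOD = 30  # seconds; documented default of the health check cache
MIN_UNCACHED_PERIOD = 10   # seconds; recommended minimum period when bypassing the cache


def probe_path(endpoint: str, period_seconds: int,
               cache_period: int = DEFAULT_CACHE_PERIOD) -> str:
    """Return the probe path, adding cache=false only where the tip allows it."""
    # Bypass the cache only for probes that run at least every 10 seconds
    # but more often than the cache would refresh on its own.
    bypass_cache = MIN_UNCACHED_PERIOD <= period_seconds < cache_period
    return f"/{endpoint}?cache=false" if bypass_cache else f"/{endpoint}"
```

With the periods from the sample Kubernetes configuration, `probe_path("__gtg", 10)` gives `/__gtg?cache=false` and `probe_path("__health", 30)` gives `/__health`.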
Troubleshooting¶
The __trouble endpoint helps troubleshoot and analyze issues with the Search Service, outlining common error modes and their resolution. The troubleshooting documentation contains the following components:
Important endpoints: An overview of the endpoints supported by the service.
Example query requests: Provides a streamlined example of using the Search Service.
Prerequisites: Lists the skill set that a successful maintainer should have.
Resolving known issues: Provides a list of known symptoms together with potential causes and suggested resolution methods.
The troubleshooting endpoint is a starting point for analyzing any issues with the Search and may often be sufficient for resolving them on its own. If you cannot resolve the issues with the help of this endpoint, please contact our support.
About¶
The __about endpoint lists the Search version, its build date, a short description of what the Search Service is, and a link to this documentation.
Administration¶
Schema Management API¶
Binding a schema
The PUT /soml/{schema-id}/search endpoint is used to bind a SOML schema.
Example binding for the swapi SOML schema using a cURL request:
curl --location -X PUT 'http://localhost:9980/soml/swapi/search'
Unbinding a schema
The DELETE /soml/{schema-id}/search endpoint is used to unbind a SOML schema.
Example unbinding for the swapi SOML schema using a cURL request:
curl --location -X DELETE 'http://localhost:9980/soml/swapi/search'
Validating a schema
The POST /soml/validate endpoint is used to validate a SOML schema provided in the request body. The response is returned in JSON-LD format. If there were errors during validation, they will be returned with the response along with the original schema.
Example validation for a SOML schema using a cURL request:
curl "http://localhost:9980/soml/validate" -X POST -H "Content-Type: text/yaml" -T "/path/to/schema.yaml"
Index information
The GET /soml/info endpoint is used to return information about existing indices in Elasticsearch. This endpoint works only if there is a bound schema.
Example cURL request:
curl --location -X GET 'http://localhost:9980/soml/info'
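The four schema management calls above can also be prepared from Python with the standard library. This is a minimal sketch that only builds the requests, assuming the Quick Start address http://localhost:9980; the helper names are illustrative and not part of the service.

```python
from urllib.request import Request

BASE_URL = "http://localhost:9980"  # Quick Start address; adjust for your deployment


def bind_request(schema_id: str, base_url: str = BASE_URL) -> Request:
    """PUT /soml/{schema-id}/search binds the schema."""
    return Request(f"{base_url}/soml/{schema_id}/search", method="PUT")


def unbind_request(schema_id: str, base_url: str = BASE_URL) -> Request:
    """DELETE /soml/{schema-id}/search unbinds the schema."""
    return Request(f"{base_url}/soml/{schema_id}/search", method="DELETE")


def validate_request(schema_yaml: str, base_url: str = BASE_URL) -> Request:
    """POST /soml/validate sends the schema text as a text/yaml body."""
    return Request(
        f"{base_url}/soml/validate",
        data=schema_yaml.encode("utf-8"),
        headers={"Content-Type": "text/yaml"},
        method="POST",
    )


def info_request(base_url: str = BASE_URL) -> Request:
    """GET /soml/info returns information about the existing indices."""
    return Request(f"{base_url}/soml/info", method="GET")
```

Each returned object can be passed to `urllib.request.urlopen` to execute the call against a running Search Service.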
Service Configurations¶
search.storage.location
- Description: Specifies the location where the service will store data related to the active schema. Usually, this is a configuration properties file.
- Default value: data
spring.elasticsearch.rest.uris
- Description: Specifies the addresses of the Elasticsearch instances to connect to, as a comma-separated list.
- Default value: http://localhost:9200
search.soml.storage.mongodb.endpoint
- Description: Specifies the address of the MongoDB storage where the SOML documents are stored.
- Default value: mongodb://localhost:27017
search.soml.storage.mongodb.database
- Description: Specifies the database name that should be used to store the SOML documents.
- Default value: soaas
search.soml.storage.mongodb.collection
- Description: Specifies the collection name that should be used to store the SOML documents. MongoDB collections are analogous to tables in relational databases.
- Default value: soml
search.soml.storage.mongodb.connectionTimeout
- Description: The time in milliseconds to attempt a connection before timing out.
- Default value: 5000
search.soml.storage.mongodb.readTimeout
- Description: The time in milliseconds to attempt to read for a connection before timing out.
- Default value: 5000
search.soml.storage.mongodb.readConcern
- Description: The Mongo client read concern configuration. For more information, see the Mongo documentation for Read Isolation (Read Concern).
- Default value: majority
- Possible values: default (Mongo default), local, majority (Search Service default), linearizable, snapshot, available
search.soml.storage.mongodb.writeConcern
- Description: The Mongo client write concern configuration. For more information, see the Mongo documentation for Write Acknowledgement (Write Concern).
- Default value: majority
- Possible values: acknowledged (Mongo default), w1, w2, w3, unacknowledged, journaled, majority (Search Service default), tag-name, or in the form w=tag-name/server-number, [wtimeout=timeout]. Example: w=2, wtimeout=1000
search.soml.storage.mongodb.applicationName
- Description: Assigns an application name to be displayed in the Mongo logs.
- Default value: search
soml.storage.mongodb.serverSelectionTimeout
- Description: Specifies how much time (in milliseconds) to block for server selection before throwing an exception.
- Default value: 5000
logging.level.com.ontotext.platform.search
- Description: Specifies the console log level for the Platform Search Service.
- Default value: INFO
graphdql.federation.enabled
- Description: Specifies if the Search Service will be used in federation mode.
- Default value: false
Semantic Objects Service Configuration¶
elasticsearch.indexingEnabled
- Description: Enables Elasticsearch indexing.
- Default value: false
elasticsearch.host
- Description: Specifies the address of the Elasticsearch instance for the Semantic Objects Service and GraphDB to connect to.
- Default value: n/a
elasticsearch.externalHost
- Description: Specifies the address of the Elasticsearch instance for the Semantic Objects Service to connect to. If not specified, the value of elasticsearch.host will be used. Useful only if the Semantic Objects Service and GraphDB are in different networks.
- Default value: elasticsearch.host
elasticsearch.indexCreateSettings
- Description: Index settings to be used directly when creating the Elasticsearch indices.
- Default value: n/a
elasticsearch.connectorCreateSettings
- Description: GraphDB Elasticsearch Connector Creation Parameters to be used for the Connector instances.
- Default value: n/a
search.maxNestingLevel
- Description: Specifies the maximum allowed value defined in search.type.nestingLevel configurations in SOML objects and property definitions.
- Default value: 5
- Possible values: Positive integer values
With a complex SOML schema and a large amount of data, it is easy to hit the Elasticsearch default limits. You may need to set the following properties to larger values:
elasticsearch.indexCreateSettings.index.mapping.nested_objects.limit: 10000
elasticsearch.indexCreateSettings.index.mapping.nested_fields.limit: 50
elasticsearch.indexCreateSettings.index.mapping.total_fields.limit: 1000
Note
If your SOML schema creates indices that are too big, increasing the Elasticsearch limits is not always a solution, as this will affect the performance. Reducing the index scope to only the mandatory data is always advisable.
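If you keep the index limits as a nested structure (for example, loaded from a YAML or JSON file), a small helper can produce the flattened property names shown above. This is only an illustrative sketch: `flatten_settings` is a hypothetical helper, and how the resulting properties are passed to the Semantic Objects Service depends on your deployment.

```python
def flatten_settings(settings: dict,
                     prefix: str = "elasticsearch.indexCreateSettings") -> dict:
    """Flatten nested index settings into dotted property names."""
    flat = {}
    for key, value in settings.items():
        path = f"{prefix}.{key}"
        if isinstance(value, dict):
            # Recurse into nested sections such as index -> mapping.
            flat.update(flatten_settings(value, path))
        else:
            flat[path] = value
    return flat


# The limits from the example above, expressed as a nested structure.
limits = {
    "index": {
        "mapping": {
            "nested_objects": {"limit": 10000},
            "nested_fields": {"limit": 50},
            "total_fields": {"limit": 1000},
        }
    }
}

for name, value in flatten_settings(limits).items():
    print(f"{name}: {value}")
```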