Semantic Object Modeling

Introduction

The Semantic Search extends the Semantic Objects with full-text indexing capabilities over RDF data that are easy to configure and use. This is done by simplifying the GraphDB Connector management for Elasticsearch, allowing you to index data and perform Elasticsearch queries over the GraphQL protocol. Both the Semantic Objects and the Semantic Search use the Semantic Object Modeling Language (SOML) as a common configuration language.

In this section, we will see how to configure the SOML for Elasticsearch indexing.

See more about SOML here.

Search Configuration

To simplify and expose the majority of the functionalities, configurations, and options provided by Elasticsearch, SOML defines its own search configuration (or model). This model can be placed in the global SOML configuration, on object level, or on property level. This allows you to define the structure of the searchable data, giving you the freedom to configure it in ways that best fit your needs.

Here is an example of the search configuration with all configurable fields:

search:
  index: true
  analysis:
    analyzer: standard
    lang:
      bg:
        analyzer: standard
      en:
        analyzer: standard
      ....
  • index: the main field that shows whether the object or the property is searchable or not. The allowed values for the field are true or false (yes or no are also acceptable). If this field is set to false, the object or the property is considered not searchable and all other fields in the search configuration are ignored.
  • analysis: provides a way to define the way in which unstructured text should be handled in Elasticsearch. The description of this configuration field is parsed and translated when the type mapping between SOML and Elasticsearch is created. For more information on how it should be configured, what values are allowed, etc., check the analyzer-related sections.

All objects and properties are considered not searchable by default, until they are marked as searchable. Only user-defined objects can be searchable. All internal and system objects are ignored during execution of the search-related processes.

SOML Global Search Configuration

SOML allows the search configuration to be defined on schema level. Its purpose is to provide default values for fields that are not specified in the configuration of the objects and/or properties. The definition on schema level can be used as a global search configuration for all defined objects and properties. When the Semantic Objects process the SOML and parse the object and property definitions, they attempt to retrieve any missing or undefined search configuration field from the global definition, if one is defined. The inheritance and merging processes are described in the sections below.

A simple example of the search configuration on schema level is as follows:

id: /soml/example
creator: http://ontotext.com
created: 2020-12-01
config:
  search:
    analysis:
      analyzer: keyword

The configuration defines the analysis section, which will be used as the default for all text properties that are marked as searchable but do not define their own analysis. It acts as a common fallback: all such properties will behave in the same way and will have the same analyzer applied to them.

A global-level configuration is useful because, if no analyzer is defined in the type mapping for a specific text property, Elasticsearch will generate one for it, but the client will have no information about it until the actual data is inserted. Furthermore, most users do not need in-depth knowledge of Elasticsearch, as their focus will mainly be on the SOML that they are constructing.

Note

If the analysis section is not defined on schema level and you have text fields without an analyzer set, then the standard analyzer will be used by default.

Defining an Object as Searchable

To make an object searchable, its definition should be simply extended with search configuration as shown in the example below:

Human:
  prefix: "human/"
  descr: "A Homo Sapiens"
  inherits: Character
  # this is search configuration on object level. It marks the object and its properties as searchable
  search:
    index: true
  props:
    mass: {descr: "Mass in kilograms", range: decimal}

Following the example above, Human is marked as a searchable object, which means that all of its properties will also be searchable. In this case, the specific property mass and all inherited properties from Character will become searchable.

When SOML is bound to the Semantic Objects and contains searchable objects, an Elasticsearch index is created for each object. The index name is formed from the name of the object. It is transformed to lower case characters due to specific name conventions in Elasticsearch, which do not allow the use of the name as it is. Additionally, the service will add a prefix otp-. The main reason for adding it is easy recognition and processing of automatically generated indexes. Using the object from the example above, the generated index for it will be named otp-human. If the object was called, for example, FilmRelease, its index will be otp-filmrelease, and so on.
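The naming rule described above can be sketched in a few lines of Python. This is only an illustration of the lowercase-plus-prefix convention; the function name index_name is hypothetical and is not part of the service.

```python
def index_name(object_name: str, prefix: str = "otp-") -> str:
    """Return the Elasticsearch index name for a searchable SOML object.

    Follows the convention described in the text: the object name is
    lowercased and the service prefix is prepended.
    """
    return prefix + object_name.lower()


if __name__ == "__main__":
    for name in ("Human", "FilmRelease"):
        print(name, "->", index_name(name))
```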

The configuration of the object properties will be used to generate type mapping for that index. If a property is explicitly set as not searchable (index field value is no), it will be excluded from the mapping and from data synchronization processes.

Note

Properties can be set as not searchable only if they are not inherited from a parent object that has been marked as searchable.

Filtering

Filters in the object-level search configuration control which objects are indexed. For example:

Human:
  inherits: Character
  search:
    index: true
    filter: '{homeworld: {name: {RE: "Tatooine"}}}'

This configuration will index only Human-s from Tatooine. All other Human-s will be ignored.

The filters follow the regular filter syntax used for queries. However, there are still some minor differences between the two. This is because the Search filters are not evaluated in SPARQL but as GraphDB connector entityFilter-s, which are less expressive.

  • Multiple checks on a nested object

    In a GraphQL query, there is a difference between the following queries:

    Query 1:

    query {
      planet (where: {AND: [{resident: {name: {EQ: "BB8"}}}, {resident: {height: {EQ: "66.0"}}}]}) {
        id
        resident {
          name
          height
        }
      }
    }
    

    and

    Query 2:

    query {
      planet (where: {resident: {AND: [{name: {EQ: "BB8"}}, {height: {EQ: "66.0"}}]}}) {
        id
        resident {
          name
          height
        }
      }
    }
    

    The first one will match even if the conditions for name and height are met by 2 different residents. The second query will look for a single resident that matches both conditions (and if such does not exist, it will not return any results).

    If the same filters are used in the Search configuration, both will be evaluated like the filter in Query 1 and will return results.
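The difference between the two queries can be modeled in plain Python. This is a sketch of the matching semantics only, not of how GraphQL, SPARQL, or the GraphDB connector evaluate filters; the sample planet and its residents are hypothetical data chosen so that the two conditions are met by different residents.

```python
# Hypothetical planet: one resident matches the name, another the height.
planet = {
    "id": "planet/example",
    "resident": [
        {"name": "BB8", "height": "96.0"},
        {"name": "other-droid", "height": "66.0"},
    ],
}


def matches_query_1(planet: dict) -> bool:
    """Two independent checks: each may be satisfied by a different resident."""
    residents = planet["resident"]
    return (any(r["name"] == "BB8" for r in residents)
            and any(r["height"] == "66.0" for r in residents))


def matches_query_2(planet: dict) -> bool:
    """Both conditions must hold for one and the same resident."""
    return any(r["name"] == "BB8" and r["height"] == "66.0"
               for r in planet["resident"])


if __name__ == "__main__":
    print("Query 1 matches:", matches_query_1(planet))
    print("Query 2 matches:", matches_query_2(planet))
```

With this data, the first function matches and the second does not. In a Search configuration filter, both forms would behave like matches_query_1.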

Defining a Property as Searchable

The search configuration can also be applied on property level to explicitly configure a specific behavior. The configuration has the same structure as that for objects.

It allows you to fine-tune how the specific property is handled in Elasticsearch, enabling you to set different analyzers for a property present in different objects, if the property definition comes from a parent object or a schema properties section.

When a property has no search configuration defined, but the object that it belongs to has one, the configuration from the object is used as configuration for the property as well. This is standard behavior for the objects and their properties so that objects can be marked as searchable with all of their properties.

To better understand how the search configuration is handled on different levels and how it is transferred or merged from different objects and properties, let’s have a look at the following SOML snippet:

id: /soml/example
creator: http://ontotext.com
created: 2019-06-15
updated: 2019-06-16

config:
  search:
    analysis:
      analyzer: standard

prefixes:
  so: "http://www.ontotext.com/semantic-object/"
  dct: "http://purl.org/dc/terms/"
  gn: "http://www.geonames.org/ontology#"
  owl: "http://www.w3.org/2002/07/owl#"
  rdf: "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  rdfs: "http://www.w3.org/2000/01/rdf-schema#"
  skos: "http://www.w3.org/2004/02/skos/core#"
  void: "http://rdfs.org/ns/void#"
  wgs84: "http://www.w3.org/2003/01/geo/wgs84_pos#"
  xsd: "http://www.w3.org/2001/XMLSchema#"

specialPrefixes:
  base_iri: https://starwars.org/resource/
  vocab_iri: https://starwars.org/vocabulary/
  vocab_prefix: voc

properties:
  height: {descr: "Height in metres", range: decimal, search: {index: false}}
  mass: {descr: "Mass in kilograms", range: decimal}
  desc: {label: "Description", range: langString, search: {analysis: {lang: {en: {analyzer: english}}}}}
  additionalInfo: {descr: "Additional information", range: string, search: {index: true}}

objects:
  Character:
    kind: abstract
    descr: "A character in a film"
    name: voc:name
    typeProp: "rdf:type"
    props:
      voc:name: {min: 1}
      desc: {}
      friend: {descr: "Character's friend", max: inf, range: Character}
      mass: {search: {index: true}}

  Droid:
    prefix: "droid/"
    descr: "A droid/robot with Artificial Intelligence"
    inherits: Character
    search:
      index: true
    props:
      primaryFunction: {label: "primary function", descr: "e.g translator, cargo", min: 1, nonNullable: true}
      height: {descr: "Droid height in metres", search: {index: true}}

  Human:
    prefix: "human/"
    descr: "A Homo Sapiens"
    inherits: Character
    search:
      index: true
    props:
      height: {}
      mass: {descr: "Mass in kilograms", range: decimal}
      additionalInfo: {}

Starting from the top, we have a global search configuration for the analyzer of text properties. This means that every text property that does not have an analyzer defined, either in its own configuration or inherited from the object configuration, will have its index analyzer set to standard. In our example, the object Human uses the common property additionalInfo, which does not override the analysis configuration, so the analyzer information will be retrieved from the global configuration.

Now let’s have a closer look at Character, which is an abstract type and is inherited by the rest of the objects in the example. There is no search configuration on object level for Character, which means that it is not a searchable object and the system will not generate an index for it. Although the property mass is defined as searchable, it will be ignored because there is no index in which to store the data for it.

Looking at the Droid object, we can see that it is a searchable object, but it does not define any additional information about the analyzers of the text fields. It also inherits the Character object, which means that all properties that are defined within the parent will be applicable for the Droid as well. Additionally, the object uses height, which comes from the common properties section, but its search configuration needs to be overridden in order to make the property searchable. Another property that demonstrates the inheritance behavior is desc. It is defined in the common properties section and used in Character, from which Droid inherits it. It defines its own search configuration with an analyzer definition for the en language, but its index field is not specified. In this case, the index field is merged from the object configuration, which is index: true, and this makes the property searchable for Droid objects.

To see another interesting point in the properties inheritance and the search configuration merging mechanism, let’s check the Human definition. It is a searchable object, but not all of its properties are searchable. In this case, the height property will not be indexed because it is explicitly defined as not searchable in the common properties section. Therefore, it will not be overridden by the Human search configuration. The additionalInfo property of the Human object is searchable, but it has no analysis definition, so it will use the one from the global definition. All of the inputted values for it will be analyzed with the standard analyzer.
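The merging of the index field walked through above can be summarized with a small Python sketch. It is a simplified model built only from the cases in this example: the function property_is_indexed is hypothetical, it ignores the nested setting and the analysis section, and it does not perform the validation described in the warning below.

```python
from typing import Optional, Sequence


def property_is_indexed(object_index: Optional[bool],
                        property_levels: Sequence[Optional[bool]]) -> bool:
    """Decide whether a property ends up in the object's index.

    object_index    -- the object's own search.index value (None if unset)
    property_levels -- explicit search.index values for the property, ordered
                       from least specific (common properties section) to most
                       specific (the object's own props); None means "not set"
    """
    if not object_index:
        # A non-searchable object has no index to store the property in.
        return False
    for value in reversed(property_levels):
        if value is not None:
            return value  # the most specific explicit setting wins
    return True  # nothing explicit: inherit index: true from the object


if __name__ == "__main__":
    print("Human.height:", property_is_indexed(True, [False, None]))
    print("Droid.height:", property_is_indexed(True, [False, True]))
    print("Droid.desc:", property_is_indexed(True, [None, None]))
    print("Character.mass:", property_is_indexed(None, [None, True]))
```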

Warning

Due to limitations in the GraphQL schema generation, overriding of properties that are explicitly set as searchable (index: true) in the parent object is not allowed. Such cases will be detected and reported as an error when the SOML schema is validated.

Defining an Object as Nested in Parent Index

By default, the index of an abstract object will contain only the properties of that object. For example:

objects:
  Character:
    kind: abstract
    search:
      index: true
    props:
      eyeColor:
  Human:
    inherits: Character
    props:
      mass: { search: {index: true }}

The Character index will contain only the properties eyeColor and type (coming from Object). This means that although all Humans will be indexed in otp-character, their mass property will not be part of the index.

We can control this behavior using the nested setting of the search configuration. Its value is false by default, and it is not inherited.

By defining an object as nested, we include its searchable properties in its parents’ indexes. Example:

objects:
  Character:
    kind: abstract
    search:
      index: true
    props:
      eyeColor:
  Human:
    inherits: Character
    search:
      nested: true
    props:
      mass: { search: {index: true }}

Notice that we have added nested: true to the Human object. This means that all Human objects in the otp-character index will have their mass indexed as well. However, this does not mean that Human will have its own index.

In Elasticsearch, the mappings for the nested properties are prefixed with the concrete object name. So for the example above, the mapping in Elasticsearch will be Human_mass.

As already mentioned, only the searchable properties of an object are included in the parent’s index. By default, all properties of searchable objects are searchable, and all properties of objects that are non-searchable are non-searchable. In the example above, Human object is nested: true but effectively index: false, so all its properties are non-searchable by default. In order to nest the mass property in the parent index, we need to explicitly set it as mass: { search: {index: true }}.

In the next example, Human is not only nested but also searchable, so all its properties are searchable by default. With the following configuration, only eyeColor and Human_mass will be included in otp-character, as cybernetics is explicitly set to index: false:

Character:
  kind: abstract
  search:
    index: true
  props:
    eyeColor:

Human:
  inherits: Character
  search:
    index: true
    nested: true
  props:
    mass:
    cybernetics: {search: {index: false}}
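The rules for which fields end up in the parent index can be sketched as follows. This is a simplified model under stated assumptions: the function parent_index_fields and the dictionary layout are hypothetical, only a single parent/child level is handled, and the type property coming from Object is left out.

```python
def parent_index_fields(parent_props, children):
    """List the field names that the parent's index would contain.

    parent_props -- names of the parent's own searchable properties
    children     -- dicts with 'name', optional 'nested' and 'index' flags,
                    and 'props' mapping a property name to its explicit
                    search.index value (None when not set)
    """
    fields = list(parent_props)
    for child in children:
        if not child.get("nested", False):
            continue  # only nested objects contribute to the parent index
        default = child.get("index", False)  # properties follow the object
        for prop, explicit in child["props"].items():
            searchable = default if explicit is None else explicit
            if searchable:
                # Nested mappings are prefixed with the concrete object name.
                fields.append(f"{child['name']}_{prop}")
    return fields


if __name__ == "__main__":
    human = {"name": "Human", "index": True, "nested": True,
             "props": {"mass": None, "cybernetics": False}}
    print(parent_index_fields(["eyeColor"], [human]))
```

For the configuration above, the sketch yields eyeColor and Human_mass, in line with the description.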

Also note that, for an object’s properties to be included in a parent’s index, all objects in the hierarchy between the parent and the child have to be marked as nested: true. For example, in the configuration below, Human_mass will not be included in the HasWikidataLink index, because Character is not set to nested: true and Character is the link between HasWikidataLink and Human:

HasWikidataLink:
  kind: abstract
  search:
    index: true

Character:
  kind: abstract
  inherits: HasWikidataLink
  search:
    index: true
  props:
    eyeColor:

Human:
  inherits: Character
  search:
    nested: true
  props:
    mass: { search: {index: true }}

Configuring Elasticsearch Analyzer

To make text structured and searchable, Elasticsearch performs a process called text analysis. One of the most important parts of this process is configuring a text analyzer.

The Semantic Objects allow each indexed text property to use a different analyzer. Both the Elasticsearch built-in analyzers and custom ones can be used.

Default Analyzers

The default analyzer for all string properties is standard, while the default analyzer for all langString and stringOrLangString properties depends on their language.

By default, each lang determined by the lang.validate configuration for a given property will use the correct Language Analyzer, if one exists.

The valid languages are determined by the lang configuration of the property. For example:

Human:
  props:
    desc: {range: langString, lang: {validate: "en,fr"}}

This means that Human.desc has two valid languages: en and fr. Therefore, two separate properties will be created in Elasticsearch for the two languages.

Note that if the lang is not recognized or there is no language analyzer for it, it will default to the standard analyzer.

Example:

Human:
  props:
    desc: {range: langString, lang: {validate: "an"}}

As there is no built-in language analyzer for Aragonese in Elasticsearch, the default analyzer for the property will still be standard.

Setting a Non-default Analyzer

Setting a non-default analyzer can be performed on four levels:

  • SOML level
  • Object level
  • Property level
  • Property lang level (for langString and stringOrLangString properties)

SOML Level Configuration

A non-default analyzer for all properties can be configured on SOML level. This is done in the config section of the SOML:

config:
  search:
    analysis:
      analyzer: stop

This will set the stop analyzer for all text properties.

Human:
  props:
    desc: {range: langString, lang: {validate: "en,fr"}}
    lastname: {range: string}

In the example above, both lastname and desc will use the stop analyzer for indexing.

Object Level Configuration

A non-default analyzer can be set on object level as well. It will take precedence over the SOML configured analyzer. For example:

Human:
  search:
    index: true
    analysis:
      analyzer: keyword
  props:
    desc: {range: langString, lang: {validate: "en,fr"}}
    lastname: {range: string}

Similarly to the SOML level configuration, this will make both desc and lastname use the keyword analyzer.

Property Level Configuration

An analyzer can also be set directly on property level. Example:

Human:
  search:
    index: true
  props:
    desc: {range: langString, lang: {validate: "en,fr"}, search: {analysis: {analyzer: keyword}}}
    lastname: {range: string, search: {analysis: {analyzer: keyword}}}

Again, both desc and lastname will use the keyword analyzer when indexing.

Property Lang Level Configuration

The last and most specific level of configuration is that on lang level. It is applicable only for the langString and stringOrLangString properties. If you try to make such a configuration on a string property, this will result in an error.

Example configuration:

Human:
  search:
    index: true
  props:
    desc: {range: langString, search: {analysis: {lang: {fr: {analyzer: keyword}}}}}

This will index all values with lang fr with the keyword analyzer instead of the default french one.

Analyzer Inheritance

The analyzers are applied in the following order, with each level overriding the ones before it:

  • Default value (either standard or an existing language analyzer for lang properties)
  • SOML level configuration
  • Object level configuration
  • Property level configuration
  • Property lang level configuration

Note that the lang properties will use their default values (the built-in language analyzers) only when no analyzer is explicitly configured on any of the other levels.

The analyzers are also inherited from parent classes. So in the following case:

objects:
  Character:
    kind: abstract
    search:
      analysis:
        analyzer: keyword
  Human:
    inherits: Character

Human will use keyword as its analyzer.
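The precedence order can be expressed as a short Python sketch. The function resolve_analyzer and its parameter names are hypothetical; inheritance from parent classes is assumed to have been resolved already into the object-level value.

```python
from typing import Optional


def resolve_analyzer(default: str,
                     soml: Optional[str] = None,
                     obj: Optional[str] = None,
                     prop: Optional[str] = None,
                     prop_lang: Optional[str] = None) -> str:
    """Pick the analyzer, letting the most specific configured level win.

    default -- "standard" or the built-in language analyzer for lang properties
    """
    for candidate in (prop_lang, prop, obj, soml):
        if candidate is not None:
            return candidate
    return default


if __name__ == "__main__":
    # Only the SOML level is configured.
    print(resolve_analyzer("standard", soml="stop"))
    # The object level overrides the SOML level.
    print(resolve_analyzer("french", soml="stop", obj="keyword"))
    # Nothing configured: the built-in language analyzer is used.
    print(resolve_analyzer("french"))
```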

LangString Analyzers

LangString analyzers have some specifics that should be noted.

In Elasticsearch, a separate property is created for each valid language that uses a specific analyzer. The valid languages are a combination of:

  • The explicitly set languages in the search.analysis.lang configuration section
  • The positive exact match languages set in the lang.validate section of the property

So for example for the following property:

desc: {range: langString, lang: {validate: "en"}, search: {analysis: {lang: {fr: {analyzer: keyword}}}}}

We will create two separate properties in Elasticsearch: for en and fr languages. The en values will be indexed with the default english analyzer, while the fr values will get indexed with the explicitly set keyword analyzer. An example of the exact mapping that would be generated in Elasticsearch can be found in the Literals mapping section.

Only exact match language codes use a language analyzer by default, as Elasticsearch does not support wildcard language codes (e.g., en~, ~US, etc).

So if we have a property like this:

desc: {range: langString, lang: {validate: "en~,fr"}}

we will set a default french analyzer only for the fr language. All values that would normally match the en~ wildcard (en-US, en-GB, etc.) will default to the standard analyzer.
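The way the per-language properties and their analyzers are derived can be sketched in Python. This is an illustration under assumptions: the function lang_analyzers is hypothetical, the BUILT_IN table lists only a few of the Elasticsearch language analyzers, and only plain language codes and the ~ wildcard shown above are handled.

```python
# Partial table of Elasticsearch built-in language analyzers (assumed subset).
BUILT_IN = {"en": "english", "fr": "french", "bg": "bulgarian", "de": "german"}


def lang_analyzers(validate: str, explicit: dict) -> dict:
    """Return a mapping of language code to analyzer for a langString property.

    validate -- the lang.validate string, e.g. "en~,fr"
    explicit -- analyzers set under search.analysis.lang, e.g. {"fr": "keyword"}
    """
    result = {}
    for code in (part.strip() for part in validate.split(",")):
        if not code or "~" in code:
            continue  # wildcards get no language-specific property
        # Exact codes use the built-in language analyzer when one exists.
        result[code] = BUILT_IN.get(code, "standard")
    result.update(explicit)  # explicit per-language settings take precedence
    return result


if __name__ == "__main__":
    print(lang_analyzers("en", {"fr": "keyword"}))
    print(lang_analyzers("en~,fr", {}))
```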

Object Properties

We have already covered how to configure a property as searchable in order to include it in the Elasticsearch index. However, with object properties things are a bit more complex. Let’s take a look at the following SWAPI hierarchy:

properties:
  pilot:
    kind: object
    max: inf
    range: Character
  starship:
    kind: object
    max: inf
    range: Starship
  vehicle:
    kind: object
    max: inf
    range: Vehicle

objects:
  Character:
    search: {index: true}
    name: rdfs:label
    props:
      birthYear: {}
      starship: {}
      vehicle: {}
  Starship:
    search: {index: true}
    name: rdfs:label
    props:
      cargoCapacity: {}
      pilot: {}
  Vehicle:
    search: {index: true}
    name: rdfs:label
    props:
      crew: {}
      pilot: {}

We can see that Character, Starship, and Vehicle are all searchable. We need to decide to what depth we want to index the data. For example, are we going to index the Character -> starship -> cargoCapacity or even the Character -> starship -> pilot -> birthYear property chain?

For a property to be indexed, it needs to meet one of the following criteria:

  • Be one of the mandatory properties - id, __typename, and name for the Nameable Objects.
  • Have search: {index: true} and belong to an object whose nesting level is above 0.

Nesting Level

The nesting level indicates the depth to which a certain Object property will get indexed. This setting is not relevant for Scalar and Literal properties.

All properties have default nestingLevel of 0. A non-default value can be set in the search.type section of a Property / Object / SOML. The value must always be in the range of [0, 5].

Example of configuring nestingLevel:

id: /soml/swapi
config:
  search:
    type:
      nestingLevel: 1
objects:
  Character:
    search: {index: true, type: {nestingLevel: 0}}
    name: rdfs:label
    props:
      starship: {search: {type: {nestingLevel: 2}}}
      vehicle: {}
      desc: {}
  Starship:
    search: {index: true}
    name: rdfs:label
    props:
      pilot: {search: {type: {nestingLevel: 0}}}
  Vehicle:
    search: {index: true}
    name: rdfs:label
    props:
      crew: {}
      pilot: {search: {index: false}}

Inheriting the nestingLevels works as you would expect.

  • Character.vehicle will have nestingLevel: 0 inherited from the Character object.
  • Character.starship has explicit nestingLevel: 2, which overwrites the Character value.
  • Vehicle.pilot will inherit nestingLevel: 1 from the SOML config section, although it will not be indexed because of the index: false setting.

To understand how the indexing works, we need to have a look at the notion of effective nestingLevel. At the root level properties, the effective nestingLevel is the same as the nestingLevel. However, when traversing deeper in the object, the effective nestingLevel decrements by 1 on each new step.

So for the Character index:

  • Character.starship has effective nestingLevel: 2, Character.vehicle has effective nestingLevel: 0. These are the root level properties for the Character index, so their effective nestingLevel is the same as the nestingLevel.
  • Character.starship.pilot will have an effective nestingLevel: 1 (Character.starship - 1), Character.vehicle.pilot has effective nestingLevel: 0 (Character.vehicle - 1).
  • Character.starship.pilot.starship and Character.starship.pilot.vehicle will have an effective nestingLevel: 0.

All properties deeper in the chain (e.g., Character.vehicle.pilot.starship or Character.starship.pilot.vehicle.pilot) have negative effective nestingLevel and are therefore excluded from indexing.

Note that during indexing, only the nestingLevel configuration of the root object’s properties is taken into account. This is why Character.starship.pilot has effective nestingLevel: 1 in the Character index, while in the Starship index Starship.pilot has nestingLevel: 0.
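The decrement rule can be illustrated with a minimal Python sketch that follows a single property chain from a root-level property. The function effective_levels is hypothetical and models only the arithmetic described above, using the Character.starship chain with its configured nestingLevel of 2.

```python
def effective_levels(root_level: int, chain: list) -> list:
    """Return (path, effective nestingLevel) pairs along a property chain.

    root_level -- the nestingLevel configured for the root-level property
    chain      -- the root object name followed by the property names
    """
    result = []
    path = chain[0]
    for depth, prop in enumerate(chain[1:]):
        path += "." + prop
        # Each step deeper into the object decrements the level by 1.
        result.append((path, root_level - depth))
    return result


if __name__ == "__main__":
    chain = ["Character", "starship", "pilot", "starship", "pilot"]
    for path, level in effective_levels(2, chain):
        status = "indexed" if level >= 0 else "excluded"
        print(f"{path}: effective nestingLevel {level} ({status})")
```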

Determine Which Properties Will Get Indexed

As already mentioned, for a property to be indexed, it should either:

  • Be one of the mandatory properties - id, __typename, and name for the Nameable objects.
  • Have search: {index: true} and belong to an object whose nesting level is above 0.

So let’s have a look at the above example again.

  • As Character is the root level object, it will have all its searchable properties indexed - id, __typename, name, vehicle, starship, desc.
  • Character.vehicle has effective nestingLevel: 0, meaning that only its mandatory properties will get indexed - id, __typename, name.
  • Character.starship has effective nestingLevel: 2, meaning that all its mandatory or searchable fields will get indexed - id, __typename, name, pilot.
  • Character.starship.pilot has effective nestingLevel: 1, so all mandatory fields and starship, vehicle and desc will get indexed.
  • Character.starship.pilot.vehicle will have only the mandatory fields and crew indexed (Vehicle.pilot is not searchable)
  • Character.starship.pilot.starship has effective nestingLevel: 0, so only its mandatory fields will be indexed.

You can see that setting nestingLevel: 2 on Character.starship will cause the indexing to go as deep as Character.starship.pilot.starship.id. If there are one-to-many properties in the chain, this can dramatically increase the size of the documents and indices, with a significant performance impact. For example, if each Character has 10 Starships and each Starship has 10 Pilots (each of them a Character with 10 Starships of their own), each document in the Character index will contain:

  • 1 Character object
  • 10 Character.starship objects
  • 100 Character.starship.pilot objects
  • 1000 Character.starship.pilot.starship objects

So it is always advisable to be careful when using the nestingLevel setting, especially when configured on SOML/Object level.
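The growth in document size is simple multiplication along the chain, as the following Python sketch shows. The function objects_per_document is hypothetical, and the fan-out of 10 at every step is the assumption used in the example above.

```python
def objects_per_document(fan_outs: list) -> list:
    """Return the number of objects at each depth of a single root document.

    fan_outs -- number of related objects per object at each step of the
                chain, e.g. [10, 10, 10] for starship -> pilot -> starship
    """
    counts = [1]  # the root object itself
    for fan_out in fan_outs:
        counts.append(counts[-1] * fan_out)
    return counts


if __name__ == "__main__":
    counts = objects_per_document([10, 10, 10])
    print("objects per level:", counts)
    print("total objects in one document:", sum(counts))
```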

Validation

Several validations are done over the search configurations. Initially, when the SOML schema is uploaded/created, the configurations are checked for structure correctness and for whether the assigned values of the different fields are allowed. When misconfiguration on some level is detected, the schema will be rejected and a corresponding error message will be returned.

The second validation is performed during SOML schema binding in the Semantic Search. It consists of dry generation of a GraphQL schema. If a problem is detected during this process, the schema will not be bound to the Semantic Search and a corresponding error message will be returned.

Like the Semantic Objects, the Semantic Search exposes a validation endpoint where you can validate the schema before even creating and binding it to the Semantic Objects. Read more about the validation operation in the Administration section.

Additionally, the validation process will return warnings for any misconfiguration of the search for objects and properties. For example, if a decimal property has been defined as searchable but there is no scale factor set, the validators will detect this, fall back to the default scale factor, and return a warning notifying you of it.

SOML To Elasticsearch Type Mapping

The Semantic Objects handle the generation of the type mapping between SOML datatypes and Elastic datatypes. Elasticsearch requires such mapping in order to know how to handle different data loaded for the indexes. See more in the Mapping section of the Elasticsearch documentation.

The mapping between the SOML and Elasticsearch types defined in the Semantic Search is as follows:

SOML type            Elastic type
iri                  keyword
boolean              boolean
string               text
langString           Literals mapping
int                  integer
integer              keyword
double               double
decimal              scaled_float where the scaling_factor is passed via scale_factor property characteristic
long                 long
unsignedLong         keyword
unsignedInt          long
unsignedShort        integer
unsignedByte         short
short                short
byte                 byte
positiveFloat        float
nonPositiveFloat     float
negativeFloat        float
nonNegativeFloat     float
positiveInteger      keyword
nonPositiveInteger   keyword
negativeInteger      keyword
nonNegativeInteger   keyword
dateTime             date with yyyy-MM-dd'T'HH:mm:ss format
date                 date with yyyy-MM-dd format
time                 date with HH:mm:ss format
year                 date with yyyy format
yearMonth            date with yyyy-MM format

SOML also defines several union types that are represented by nested mapping in Elasticsearch. The mapping for them is as follows:

SOML union type      Elasticsearch type
stringOrLangString   Literals mapping
dateOrYearOrMonth    nested described as exclusive combinations of date, year, and yearMonth

Elasticsearch does not support unlimited (arbitrary length) digits and unlimited precision numbers, so they will be encoded in text form. This allows the data to be returned in the format in which it was inputted, and thus enables you to decide how to transform or process it.

The unsigned datatypes are represented with the next biggest Elasticsearch type, because they are not natively supported by Elastic either. unsignedLong is an exception here, as it is represented by the keyword type.
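To show how a few rows of the table translate into Elasticsearch field mappings, here is a simplified Python sketch. The function es_mapping is hypothetical, it covers only a subset of the types, and the real generated mappings contain further settings (such as store and analyzer). Since the default scale factor is not stated here, the sketch requires one to be passed explicitly.

```python
# Subset of the table above; the remaining types follow the same pattern.
SIMPLE_TYPES = {
    "iri": "keyword", "boolean": "boolean", "string": "text",
    "int": "integer", "integer": "keyword", "double": "double",
    "long": "long", "unsignedLong": "keyword", "unsignedInt": "long",
    "unsignedShort": "integer", "unsignedByte": "short",
}
DATE_FORMATS = {
    "dateTime": "yyyy-MM-dd'T'HH:mm:ss", "date": "yyyy-MM-dd",
    "time": "HH:mm:ss", "year": "yyyy", "yearMonth": "yyyy-MM",
}


def es_mapping(soml_type: str, scale_factor=None) -> dict:
    """Return a minimal Elasticsearch field mapping for a SOML scalar type."""
    if soml_type == "decimal":
        if scale_factor is None:
            raise ValueError("decimal needs a scale factor in this sketch")
        return {"type": "scaled_float", "scaling_factor": scale_factor}
    if soml_type in DATE_FORMATS:
        return {"type": "date", "format": DATE_FORMATS[soml_type]}
    return {"type": SIMPLE_TYPES[soml_type]}


if __name__ == "__main__":
    print(es_mapping("decimal", scale_factor=100))
    print(es_mapping("yearMonth"))
    print(es_mapping("unsignedInt"))
```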

The mapping generation process is done for each index that will be created by the Semantic Search. The process is triggered on successful SOML binding in the service. When the SOML is updated, the mappings for the indexes are updated as well. If the schema update introduces a new object, the service will generate a new index and type mapping for it.

Note

If there is a SOML schema bound to the Semantic Search and it is restarted, the service will attempt to retrieve the SOML and update the index type mappings in order to avoid desynchronization between the data with which the Semantic Objects and the Semantic Search work.

Literals mapping

Literal properties (langString or stringOrLangString) consist of value and lang and are mapped as type: nested objects in Elasticsearch.

An example mapping:

{
  "rdfs_label": {
    "type": "nested",
    "properties": {
      "lang": {
        "type": "keyword",
        "store": true
      },
      "value": {
        "type": "text",
        "store": true,
        "analyzer": "standard"
      }
    }
  }
}

The lang property is always of type keyword, while the value property is of type text. Note that the analyzer is set in the mapping as well.

As mentioned in the LangString Analyzers section, different analyzers can be assigned per language. So if for example the rdfs:label property has english analyzer configured for the en language, the mapping would look like this:

{
  "rdfs_label": {
    "type": "nested",
    "properties": {
      "lang": {
        "type": "keyword",
        "store": true
      },
      "value": {
        "type": "text",
        "store": true,
        "analyzer": "standard"
      },
      "value_en": {
        "type": "text",
        "store": true,
        "analyzer": "english"
      }
    }
  }
}

Note that all en values will be present both in value and value_en properties. The only difference is that one will be analyzed using the property default analyzer, while the other will use the english one. This way, if you perform a query over the rdfs_label.value property, the english values will not be excluded.
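To make the structure of these mappings easier to see, here is a minimal Python sketch that rebuilds them from a default analyzer and a per-language analyzer table. The function name literal_mapping is hypothetical and this is not how the service itself generates mappings; it only reproduces the shape shown above.

```python
def literal_mapping(default_analyzer, lang_analyzers=None):
    """Build a nested mapping for a langString-like property.

    lang_analyzers maps a language tag (e.g. "en") to an analyzer name.
    Each entry adds a value_<lang> field next to the default value field.
    """
    def text_field(analyzer):
        return {"type": "text", "store": True, "analyzer": analyzer}

    properties = {
        "lang": {"type": "keyword", "store": True},
        "value": text_field(default_analyzer),
    }
    for lang, analyzer in (lang_analyzers or {}).items():
        properties[f"value_{lang}"] = text_field(analyzer)
    return {"type": "nested", "properties": properties}


# Same shape as the second mapping above: default plus an English analyzer.
mapping = {"rdfs_label": literal_mapping("standard", {"en": "english"})}
```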

This is a part of a document in Elasticsearch indexed with the mapping above:

{
  "rdfs_label": [
    {
      "value": "Bib Fortuna"
    },
    {
      "value_en": "Bib Fortuna",
      "lang": "en",
      "value": "Bib Fortuna"
    },
    {
      "lang": "de",
      "value": "Bib Fortuna"
    }
  ]
}

The first object has no lang, as the value in the database is a simple string. Although in the second object value_en and value look exactly the same, they are analyzed with different analyzers and may behave differently when being queried.
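The following sketch shows, under stated assumptions, how one literal could end up as an entry of the shape above, and how a language-specific field could be targeted with a standard Elasticsearch nested query. The helper names literal_entry and nested_match are hypothetical; only the nested and match query clauses are part of the Elasticsearch query DSL.

```python
def literal_entry(value, lang=None, langs_with_analyzer=()):
    """Return one array entry for a literal, mirroring the document above."""
    entry = {"value": value}          # always present, default analyzer
    if lang is not None:
        entry["lang"] = lang
        if lang in langs_with_analyzer:
            # Copy the value so it is also analyzed per language.
            entry[f"value_{lang}"] = value
    return entry


def nested_match(path, field, text):
    """Build an Elasticsearch nested query matching text in path.field."""
    return {
        "query": {
            "nested": {
                "path": path,
                "query": {"match": {f"{path}.{field}": text}},
            }
        }
    }


# Query only the English-analyzed copy of the labels.
body = nested_match("rdfs_label", "value_en", "fortuna")
```

Querying rdfs_label.value instead would match entries in every language, since each entry carries the value field.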

Object Properties Mapping

Similar to the Literals, the object properties are mapped as type: nested in Elasticsearch.

This is an example of the mapping for Character.film property:

{
  "film": {
    "type": "nested",
    "properties": {
      "__typename": {
        "type": "keyword"
      },
      "id": {
        "type": "keyword",
        "store": true
      },
      "name": {
        "type": "text",
        "store": true,
        "analyzer": "standard"
      },
      "type": {
        "type": "keyword",
        "store": true
      }
    }
  }
}

The id and __typename properties are always indexed and are of type keyword.

Explicit Elasticsearch Type (Type Overriding)

The flexibility of SOML provides a way to override the standard type mappings for Elasticsearch when the user defines their model. This allows a given property to be treated in different ways in Elasticsearch.

The explicit type setting is done via the search.type configuration field. The field is represented by a dictionary with a specified field called name. The name value is the actual Elasticsearch type that will be used in the index mapping generation.

For convenience, there is a short form of the type configuration shown in the following example:

id: /soml/example

properties:
    eyeColor:
      range: string
      search:
        type: keyword     # short form

    mass:
      range: decimal
      scaleFactor: 30
      search:
        type:
          name: keyword   # standard form

    information:
      range: langString
      max: 50

objects:
  Character:
    kind: abstract
    search:
      index: true
    props:
      eyeColor:
      bestFriend:
        range: Character
        max: 1
      eyesight:
        range: double
        max: 1
        search:
          type:
            name: decimal
            scale_factor: 100
      awards:
        max: inf
        range: string
        search:
          type: keyword

  Human:
    inherits: Character
    search:
      index: true
    props:
      mass:
        search:
          type: default
      awards:
        search:
          type: default
      information:

Here, we can see several features that come with the search.type configuration. There are also some constraints that should be kept in mind when defining the model. To understand this better, let’s inspect the example in detail.

The type field can be used on properties defined in the properties section on schema level. In this example, three properties are defined: eyeColor, mass, and information. As you have already noticed, the eyeColor and mass have their type explicitly set to keyword. This means that when their Elasticsearch mapping is generated, they will be mapped to keyword type and not to the types based on the SOML type, in this case string and decimal. The definition of the type in the eyeColor uses the short form type: keyword. This is convenient when the specified type does not require any more configurations.

In contrast, Character.eyesight uses the standard form, because the decimal type needs an additional configuration in Elasticsearch: the scale factor.

In this sense, the type field is very flexible as all additional configurations required for the described type can be added as key-value pairs without any constraints on the number of additional configurations. This is very useful when you have your own custom types/analyzers in Elasticsearch and want to use them for specific properties.
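The relationship between the short and the standard form can be summarized with a small normalization step. This is only a sketch of the idea: the function name normalize_search_type is hypothetical, and the service may handle the two forms differently internally.

```python
def normalize_search_type(type_config):
    """Expand the short form of search.type into the standard form.

    "keyword"                                 -> {"name": "keyword"}
    {"name": "decimal", "scale_factor": 100}  -> returned as a copy
    """
    if isinstance(type_config, str):
        return {"name": type_config}
    if isinstance(type_config, dict) and "name" in type_config:
        # Any additional key-value pairs are kept untouched.
        return dict(type_config)
    raise ValueError("search.type must be a string or a dictionary with 'name'")
```

With this view, type: keyword on eyeColor and the dictionary with name: keyword on mass describe the same override.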

The inheritance rules of the search.type field are no different from those of the other property characteristics. However, there are some constraints to keep in mind when defining your SOML.

To better explain their behavior, let’s look at the expanded Human object from this example:

Human:
  inherits: Character
  search:
    index: true
  props:
    eyeColor:
      range: string
      search:
        type: keyword
    mass:
      range: decimal
      scaleFactor: 30
      search:
        type: default
    bestFriend:
      range: Character
      max: 1
    eyesight:
      range: double
      max: 1
      search:
        type:
          name: decimal
          scale_factor: 100
    awards:
      max: inf
      range: string
      search:
        type: default
    information:
      range: langString
      max: 50

The properties eyeColor, bestFriend, eyesight, and information are inherited with all of their characteristics, as they are defined either in the parent object Character or in the properties section on the schema level.

The differences are in the properties mass and awards, where the search.type is set to a special default value. This special value is provided in order to reuse all other configurations of the property and keep the standard mapping to the Elasticsearch type. This means that it overrides the override and notifies the system that the property should be treated as a normal property without any type overrides or changes. In this case, the system will use the standard mapping defined for the SOML type of the property to calculate the actual Elasticsearch type when generating the mappings.

This is useful when a given property has many characteristics or is defined on the schema level, but for some objects the standard type should be kept.
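The effect of the default value can be sketched as a resolution over the levels at which a property is declared. The function below is a hypothetical illustration, not the service's algorithm: it takes the search.type values from the least to the most specific level (schema-level properties, parent object, child object), with None where a level declares nothing, and returns the effective override, or None when the standard mapping applies.

```python
def effective_type_override(declarations):
    """Resolve search.type across inheritance levels.

    declarations: values ordered from least to most specific level;
    None means "not declared here", "default" clears any inherited override.
    Returns the override name, or None for the standard SOML-based mapping.
    """
    effective = None
    for declared in declarations:
        if declared is None:
            continue  # nothing declared at this level: inherit as-is
        effective = None if declared == "default" else declared
    return effective


# Levels: [schema-level properties, Character, Human]
human_mass = effective_type_override(["keyword", None, "default"])   # None
human_awards = effective_type_override([None, "keyword", "default"]) # None
human_eye_color = effective_type_override(["keyword", None, None])   # "keyword"
```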

Note

If the Human object has child objects, they will reuse the characteristics of its properties.

Constraints

The following constraints are currently in place for the search.type field configuration:

  • It can be used on properties with scalar types or language type properties.
  • Object properties have their own search.type values that are strictly designed for relational data. The allowed values are join or nested.

Note

Currently, only type: keyword is supported as type configuration. As the Semantic Search evolves, additional support for other types will be introduced.

Administration

This section provides information about the available REST endpoints provided by the Semantic Search, as well as short descriptions of the functionality, parameters, responses, and examples. For consistency with the rest of the documentation, the addresses from the examples are bound to localhost.

SOML Management

The Semantic Search schema management consists of several operations over a SOML that is already created in the Semantic Objects. The service only reads the provided SOML in order to generate index data for GraphDB Connectors and Elasticsearch index information, and to provide a search GraphQL endpoint. The operations that are currently available in the Semantic Search over SOML are:

Bind

The bind operation is used to activate a specific SOML schema for the service. Currently, the Semantic Search operates with only one schema at a time. Upon binding a schema, the service will start to create all required indexes, or update the existing ones if changes in their configuration or structure are detected.

To bind a SOML to the service, execute:

curl 'http://localhost:9980/soml/{id}/search' -X PUT

If the binding is successful, this request will produce the following response:

{
    "@context": {
        "@vocab": "http://ontotext.com/ontology/status/",
        "@base": "http://data.ontotext.com/",
        "xsd": "http://www.w3.org/2001/XMLSchema#",
        "hydra": "http://www.w3.org/ns/hydra/core#"
    },
    "@type": "SOML",
    "@id": "/soml/sample-with-search-configs",
    "input": "id:          /soml/sample-with-search-configs\nlabel:       sample1\n\nprefixes:\n  xsd:       http://www.w3.org/2001/XMLSchema#\nspecialPrefixes:\n  base_iri:  http://example.com/data/\n  vocab_iri: http://example.com/ontology#\n  vocab_prefix: ont\n\nobjects:\n\n  Human:\n    search: { index: true }\n    props:\n      mass: {range: int}\n\nrbac:\n  roles:\n    Admin:\n       description: \"Administrator role, can read, write and delete objects and schema\"\n       actions: [\n        \"*/*/*\"\n       ]",
    "bound": true
}

If there is an error or the schema is invalid, the request will return a corresponding status code and details about the error that occurred.

Unbind

The unbind operation is used to deactivate the SOML schema from the service and remove all indexes created by the bind operation. Normally, it should be followed up by a bind operation, otherwise the service will remain without a SOML, which will make it unusable. The unbind operation is invoked with the following request:

curl 'http://localhost:9980/soml/{id}/search' -X DELETE

Currently, the two possible returned responses are either the unbound SOML ID, or, in the event of a problem during execution, an error:

{
    "@context": {
        "@vocab": "http://ontotext.com/ontology/status/",
        "@base": "http://data.ontotext.com/",
        "xsd": "http://www.w3.org/2001/XMLSchema#",
        "hydra": "http://www.w3.org/ns/hydra/core#"
    },
    "@type": "SOML",
    "@id": "/soml/sample-with-search-configs",
    "bound": false
}

If you try to unbind a SOML that is currently not bound to the service, it will respond with an error informing you that the requested schema is not currently bound.
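Bind and unbind target the same URL and differ only in the HTTP method, so a script can share one request builder. The following is a minimal sketch using only the Python standard library. The helper names are hypothetical, and it assumes that the value replacing {id} is passed in as soml_id; error responses surface as urllib.error.HTTPError raised by urlopen.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:9980"  # address used in the examples above


def soml_search_request(soml_id: str, method: str) -> urllib.request.Request:
    """Build the bind (PUT) or unbind (DELETE) request for one schema."""
    # Assumption: soml_id is the value that replaces {id} in the URL.
    path = urllib.parse.quote(soml_id, safe="")
    return urllib.request.Request(f"{BASE_URL}/soml/{path}/search", method=method)


def call(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON body of a successful response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def rebind(old_id: str, new_id: str) -> dict:
    """Unbind one schema and bind another right away."""
    call(soml_search_request(old_id, "DELETE"))
    return call(soml_search_request(new_id, "PUT"))
```

Following the unbind with a bind in the same helper keeps the service from being left without a SOML.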

Validate

The service exposes a schema validation functionality that can be used to check the validity of a specific SOML before binding it. The endpoint is also useful when you create a SOML schema in the Semantic Objects and want to use it for the search operations as well. The search validation can be included in the unified validation step, or run separately when a SOML is created or updated.

Use the following validation request:

curl --location -X POST 'http://localhost:9980/soml/validate' \
    -H 'Content-Type: text/yaml' \
    -H 'X-Request-ID: some-uuid-correlation-id' \
    -H 'Accept: application/json' \
    -T soml-schema.yaml

The validation process consists of two phases:

  1. SOML structure check: will return an error in case of incorrect SOML structure, i.e., when the YAML file contains errors.
  2. SOML-GraphQL generation: checks whether the service can generate a correct GraphQL schema from the given SOML. If an error is detected, it will be returned in the response.

If the validation passes successfully, the submitted SOML will be returned as the response.
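The same validation call can be prepared from a script. This sketch mirrors the curl command above with the Python standard library; the function name validation_request is hypothetical, and the correlation ID is simply a random UUID chosen for illustration.

```python
import urllib.request
import uuid


def validation_request(soml_yaml: bytes,
                       base_url: str = "http://localhost:9980") -> urllib.request.Request:
    """Build a POST request that submits a SOML document for validation."""
    return urllib.request.Request(
        f"{base_url}/soml/validate",
        data=soml_yaml,                         # raw YAML as the request body
        method="POST",
        headers={
            "Content-Type": "text/yaml",
            "X-Request-ID": str(uuid.uuid4()),  # illustrative correlation ID
            "Accept": "application/json",
        },
    )


# Usage, assuming soml-schema.yaml exists in the working directory:
#   with open("soml-schema.yaml", "rb") as f:
#       request = validation_request(f.read())
#   with urllib.request.urlopen(request) as response:
#       print(response.read().decode("utf-8"))
```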

Info

The info operation checks what and how many indexes have been created for the currently bound SOML schema. The request is as follows:

curl 'http://localhost:9980/soml/info' -X GET

The response will return the ID of the bound SOML and the IDs of all indexes created from that SOML. For example, if we have bound a SOML with two objects, each of which is marked as searchable, the response will look like this:

{
  "@context":{
    "@vocab":"http://ontotext.com/ontology/status/",
    "@base":"http://data.ontotext.com/",
    "xsd":"http://www.w3.org/2001/XMLSchema#",
    "hydra":"http://www.w3.org/ns/hydra/core#"
  },
  "@type":"info",
  "soml" : "/soml/example",
  "indexes" : [
    "otp-object1",
    "otp-object2"
  ]
}

If there is no bound schema, the request will return an error that a bound SOML was not found.
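A client that consumes this response only needs the soml and indexes keys. The short sketch below parses a response body and returns both; the function name describe_info is hypothetical.

```python
import json


def describe_info(body: str):
    """Return (bound SOML ID, list of index IDs) from an info response body."""
    info = json.loads(body)
    # "indexes" is treated as optional here in case no object is searchable.
    return info["soml"], list(info.get("indexes", []))
```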

Status

The status operation can be used to determine whether indexing is currently running. If the indexing has been completed, an empty response will be returned. The request is as follows:

curl 'http://localhost:9980/soml/status/all' -X GET

The response will return the ID of the currently performed operation that could be one of the following:

  • CHECK_REQUIREMENTS - verifies that GraphDB and Elasticsearch are accessible and properly configured.
  • PREPARE_ELASTIC - deploys a __typename resolution pipeline to Elasticsearch.
  • CREATE_CONNECTOR - creates a GraphDB Connector and the corresponding Elasticsearch index with the name found in the id value.
  • DROP_CONNECTOR - drops the GraphDB Connector and the corresponding Elasticsearch index with the name found in the id value.
  • CREATE_ALIASES - registers the opt-root alias to all previously created indexes.
Here is an example response:

{
  "@context": {
    "@vocab": "http://ontotext.com/ontology/status/",
    "@base": "http://data.ontotext.com/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "hydra": "http://www.w3.org/ns/hydra/core#"
  },
  "@type": "SOMLState",
  "data": [
    {
      "id": "/soml/example",
      "operation": "MANAGE_INDEXES",
      "duration": "188109",
      "subOperation": [
        {
          "id": "otp-example-type",
          "operation": "CREATE_CONNECTOR",
          "thread": "processing-pool-108-thread-1",
          "duration": "188034",
          "count": "1"
        }
      ]
    }
  ]
}

The information returned by the service is displayed in the Workbench interface in the active status area in the upper right.
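For monitoring scripts, the same response can be flattened into one row per running sub-operation. This is a sketch based only on the example response above; the function name running_operations is hypothetical, and an empty or missing payload is interpreted as "no indexing in progress".

```python
def running_operations(status):
    """List (soml_id, operation, target_id, sub_operation) for a status payload.

    status is the decoded JSON response, or None/{} when the response is empty.
    """
    if not status:
        return []  # empty response: indexing has completed
    rows = []
    for entry in status.get("data", []):
        for sub in entry.get("subOperation", []):
            rows.append((entry["id"], entry["operation"], sub["id"], sub["operation"]))
    return rows
```

Polling the status endpoint until this list is empty is one simple way to wait for indexing to finish.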

Federation

The Semantic Search, just like the Semantic Objects, supports GraphQL federation. Federation is a mechanism to combine multiple GraphQL endpoints and schemas into a single aggregate endpoint and composite schema.

Under Semantic Search federation, the Semantic Search indexes the same data as it does normally, so you can use it for query parameters, sorting, and aggregations. However, its schema is modified so it can be used in federated queries with the Semantic Objects and external schemas.

When using Semantic Search federation, all Elasticsearch-specific fields and the ID are fetched from the Semantic Search. Then the other fields of the given object will be fetched from the other federated schemas.

Consider the following query:

query federatedSearch {
  book_search(query: {
    bool: {
      must: [
        {
          match: {
            title: {
              query: "Sith",
              boost: 2.0
            }
          }
        }
      ]
    }
  })
  {
    max_score                   # Obtained from the Semantic Search
    hits {
      score                     # Obtained from the Semantic Search
      book {
        id                      # Obtained from the Semantic Search
        title                   # External
        pagesCount              # External
        charactersIncluded {
          id                    # External
        }
      }
    }
  }
}

The schema for Book would be:

type Book implements WorkOfArt @extends @key(fields : "id") {
  id: ID @external
}

This query would return all books where the title contains the word “Sith”. Since the field id is always considered the key for a given query, it would be returned for the book object. All other fields that are part of the Semantic Objects and/or external services would be sourced from there.
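As a rough sketch of how a client could send such a query, the snippet below wraps it in the usual GraphQL-over-HTTP JSON payload and reads the books out of a response. The helper names are hypothetical, the gateway URL is not specified here and is left to the caller, and the response layout simply mirrors the selection set of the query, as GraphQL responses do.

```python
import json

FEDERATED_QUERY = """
query federatedSearch {
  book_search(query: {bool: {must: [{match: {title: {query: "Sith", boost: 2.0}}}]}}) {
    max_score
    hits {
      score
      book { id title pagesCount charactersIncluded { id } }
    }
  }
}
"""


def graphql_payload(query: str, operation_name: str) -> bytes:
    """Encode a GraphQL query as the JSON body of an HTTP POST request."""
    return json.dumps({"query": query, "operationName": operation_name}).encode("utf-8")


def books_from(response: dict) -> list:
    """Collect the book objects from a decoded federatedSearch response."""
    return [hit["book"] for hit in response["data"]["book_search"]["hits"]]


payload = graphql_payload(FEDERATED_QUERY, "federatedSearch")
```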

The id field is always the key for any standard interface or object.

The Root type is not part of the Semantic Objects schema and, as such, its fields are not marked as external, and the type is not marked as extends.

Warning

Using a type or interface named Root in one of your own schemas would break the federation.

Backwards Compatibility

When federation is enabled, it is backwards compatible. This means that you would have the same expressive power of the Semantic Search if you query it directly rather than through the federated endpoint.

There are some known issues to consider, however:

  • inheritance - the Apollo gateway does not support deep inheritance of objects. Interfaces do not implement other interfaces.
  • inheritance on Root - the Root object is not external, as it is declared only in the Semantic Search. At the same time, it does not implement all field arguments of Object as defined within the Semantic Objects. This is why in federation mode it does not implement Object.
  • external directive - in order to allow duplicate fields, such as Character.name, Character must be marked as extends and the field as external. While this is important for the Apollo gateway to resolve the schema, the external keyword does not factor in when Semantic Search processing is concerned.
  • _empty field - in the rare case that an object does not use the id field, the _empty field would be used, since an object must have at least one field.