Backend

The Metadata Studio Backend is a Java Spring Boot application deployed in Tomcat and running in a Docker container. It is easily extensible, as it is built on the Ontotext Platform Semantic Objects, which provide a GraphQL endpoint based on a Metadata Studio schema.

An initial base SOML schema is loaded in Metadata Studio by default and can be overridden on demand. Additionally, the schema can be changed through GraphQL mutations.

Interface

The interface of the Metadata Studio Backend matches that of the Semantic Objects. The SOML endpoint is used to manage the schemas, and the GraphQL endpoint is used to access or modify the Metadata Studio objects.

SOML endpoint

The /soml endpoint gives access to the Metadata Studio schema. It is described in detail in the Schema Management section of the Semantic Objects.

It is important that the schema you send as the request body of POST and PUT requests satisfies certain restrictions. Check out the Metadata Studio schema section for the specifics of the schema.

See a configuration example of a schema request body customized for our Knowledge Net use case, which defines a specific simple document class and a Person inline annotation class that links to Wikidata people concepts.
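
A minimal sketch of uploading a schema file to the /soml endpoint is shown below. The schema file name is a placeholder and the Content-Type header is an assumption; consult the Schema Management section of the Semantic Objects for the exact headers and payload format.

curl -X POST --header 'Content-Type: application/x-yaml' --data-binary @metadata-studio-schema.yaml 'http://{host}:{port}/{context}/soml'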

GraphQL endpoint

The Metadata Studio Backend exposes a /graphql endpoint that handles requests based on a GraphQL schema, which is dynamically adapted as mutations are applied over the base SOML schema.

The /graphql endpoint satisfies the standard GraphQL specification.

See sample GraphQL queries for Document annotations and Inline annotations that fetch Metadata Studio objects.
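
Queries are sent as standard GraphQL-over-HTTP requests. The sketch below is illustrative only: the document field and its sub-fields depend on the classes defined in your Metadata Studio schema, so replace them with names from your own schema.

curl -X POST --header 'Content-Type: application/json' --data '{"query": "{ document { id } }"}' 'http://{host}:{port}/{context}/graphql'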

Troubleshooting endpoints

The following endpoints, grouped into a Troubleshooting Controller, provide system and health information:

Health

The /__health endpoint returns an aggregation of all registered health checks, with a detailed description of each health check’s purpose, impact on the service, severity, and a link to troubleshooting instructions. It returns JSON data summarizing the current health status of the application. The status code is 200 unless the /__health endpoint itself is in error.

Verb    URL template    Mime Type           Supported Status Codes
GET     /__health       application/json    200
cURL request
curl -X GET --header 'Accept: application/json' 'http://{host}:{port}/{context}/__health'
Healthy response

The response JSON includes a status for each registered health check as well as an aggregated status field:

{
    "status": "OK",
    "healthChecks": [
        {
            "status": "OK",
            "id": "1200",
            "name": "SPARQL checks",
            "type": "sparql",
            "impact": "SPARQL Endpoint operating normally, writable and populated with data.",
            "troubleshooting": "http://omds-service:8080/__trouble",
            "description": "SPARQL Endpoint checks.",
            "message": "SPARQL Endpoint operating normally, writable and populated with data."
        },
        {
            "status": "OK",
            "id": "1300",
            "name": "SOML checks",
            "type": "soml",
            "impact": "SOML bound, service operating normally.",
            "troubleshooting": "http://omds-service:8080/__trouble",
            "description": "SOML checks.",
            "message": "SOML bound, service operating normally."
        },
        {
            "status": "OK",
            "id": "1350",
            "name": "SOML RBAC checks",
            "type": "soml-rbac",
            "impact": "SOML RBAC schema is created, service operating normally.",
            "troubleshooting": "http://omds-service:8080/__trouble",
            "description": "SOML RBAC checks.",
            "message": "SOML RBAC schema is created, service operating normally."
        },
        {
            "status": "OK",
            "id": "1400",
            "name": "Query service",
            "type": "queryService",
            "impact": "Query service operating normally.",
            "troubleshooting": "http://omds-service:8080/__trouble",
            "description": "Query service checks.",
            "message": "Query service operating normally."
        },
        {
            "status": "OK",
            "id": "1500",
            "name": "Mutations service",
            "type": "mutationService",
            "impact": "Mutation service operating normally.",
            "troubleshooting": "http://omds-service:8080/__trouble",
            "description": "Mutation service checks.",
            "message": "Mutation service operating normally."
        }
    ]
}

The status can have two values:

  • OK - all health checks are passing
  • ERROR - at least one health check is failing (see the example below for isolating the failing checks)
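
When the aggregated status is ERROR, the individual failing checks can be extracted from the same response. The command below is a convenience sketch that assumes the jq command-line tool is available; the placeholders are the same as in the other cURL examples.

curl -s --header 'Accept: application/json' 'http://{host}:{port}/{context}/__health' | jq '.healthChecks[] | select(.status != "OK")'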

Good-to-go

The /__gtg endpoint is based on the health check status results. One or more errors imply that the service is not good to go. The endpoint emits a 200 OK response if the application is considered healthy, and 503 Service Unavailable if it is unhealthy.

This endpoint is intended for making routing decisions and for providing DevOps engineers with simple access to service availability.

Verb    URL template    Mime Type           Supported Status Codes
GET     /__gtg          application/json    200, 503
cURL request
curl -X GET --header 'Accept: application/json' 'http://{host}:{port}/{context}/__gtg'
Response

The response reflects the aggregated status of all of the service’s dependencies. If all the checks have passed successfully, you will get the following response with status code 200:

{
  "gtg": "OK"
}

If any of the checks returned by the /__health endpoint is in error, the following response with status code 503 is returned:

{
  "gtg": "UNAVAILABLE"
}
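
Because /__gtg collapses the overall health state into a single status code, it can be used directly as a lightweight availability probe, for example in a container health check or a load balancer configuration. In the sketch below, curl exits with a non-zero code on the 503 response, signaling that the service should be taken out of rotation; the placeholders are the same as in the other cURL examples.

curl --fail --silent 'http://{host}:{port}/{context}/__gtg' > /dev/null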

About

The /__about endpoint returns JSON data describing the application, providing links to all relevant supporting operational documentation resources. It includes the running version and the build date.

cURL request
curl -X GET --header 'Accept: application/json' 'http://{host}:{port}/{context}/__about'
Response
{
    "buildDate": "2022-08-03T10:13:25.446Z",
    "description": "The Ontotext Platform is a GraphQL interface for interacting with RDF data. It is used to expose your information in a more human-readable fashion.",
    "version": "3.0.0-SNAPSHOT",
    "documentation": "http://platform.ontotext.com/"
}

Troubleshooting documentation

The /__trouble endpoint renders the trouble.md document of the service.

Verb    URL template    Mime Type           Supported Status Codes
GET     /__trouble      application/json    200
cURL request
curl -X GET 'http://{host}:{port}/{context}/__trouble'

Schema

The model of the objects with which Metadata Studio works is defined in a schema. The Metadata Studio schema is divided into two parts:

  • Base definition of the Metadata Studio classes: This is a fixed description of the Metadata Studio objects that the UI strongly depends on. In addition, it defines base RBAC user roles, which can be extended on demand. The up-to-date schema is defined in the source code and it is highly recommended to leave this part of the schema as is.

    Currently, the base schema contains the definitions of the following classes:

      • Users and roles
      • Projects
      • Corpora
      • Documents
      • Annotations
      • Concepts
      • SavedReports
      • Annotation Services

  • Project-specific extension of the SOML classes: Using Ontotext Platform inheritance, it defines the domain-specific non-abstract Document, Annotation, and Concept classes with which the user would like to work in Metadata Studio. This is supported for the following abstract classes:

    • Documents
    • Annotations (both Inline and Document annotations)
    • Concepts

In addition, custom extensions of the schema are applied during the instantiation of Metadata Studio to handle Corpus annotation with text mining API services, Quality evaluation reports, and Corpus Labels and Concepts reports. The restrictions defined in the schema are also translated into SHACL constraints to guarantee the integrity of the data.

Annotate corpus with text mining API services

The Gold Standard creation is a process that facilitates the development of text mining API services for solving specific NLP tasks. Metadata Studio provides the capability to connect to a third-party text analytics (TA) service over HTTP/HTTPS, to process the documents in a corpus through that service, and to inspect the produced annotations. It enables you to evaluate the quality of the produced annotations or to use them as a bootstrap for the annotation process.

The connection with the TA service is handled by GraphDB’s Text Mining plugin. The annotation service is registered by a Metadata Studio administrator user through a GraphQL mutation sent to the Metadata Studio /graphql endpoint.

The mutation registers the annotation service with a registration query and an annotation query, described in the sections below.

The AnnotationService object has a label and a serviceId. The label determines how the annotation service is displayed in the UI under the Annotation services drop-down.

All registered annotation services are listed under the Annotate corpus drop-down.
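
For illustration only, a registration request might look like the sketch below. The mutation name, argument shape, and return fields follow the generic create convention of the Semantic Objects and are assumptions, as are the label and serviceId values; the fields that hold the registration and annotation queries are omitted. Verify the exact payload against your deployed schema before use.

curl -X POST --header 'Content-Type: application/json' \
     --data '{"query": "mutation { create_AnnotationService(objects: [{label: \"Example TA service\", serviceId: \"example-ta\"}]) { annotationService { id label serviceId } } }"}' \
     'http://{host}:{port}/{context}/graphql'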

Registration query

The registration query instantiates the GraphDB Text Mining plugin. It specifies the URL of the text mining service, the headers that must be sent during annotation requests, and any specific transformations that should be applied over the annotation response.

For more information, see this example registration query.

Annotation query

Upon selection of a specific annotation service for a particular corpus, the Metadata Studio backend splits the documents from the corpus into batches of ten documents. It then sends each batch to the text mining API service to generate annotations for these documents.

The annotation query defines how the documents should be sent for annotation and how the response should be stored in GraphDB. It is entirely configurable by the user, which makes this process compatible with any third-party service that is accessible over HTTP and produces annotations with text position offsets.

For more information, see this example annotation query.

Evaluation reports

The development of an information extraction algorithm is a heuristic task based on trial and error. The engineer starts with an assumption and refines it over many small iterations, keeping only the changes that improve the results. The Evaluation Reports are a core Metadata Studio functionality measuring an algorithm’s performance (correctness) by comparing the annotations created by a manual annotator (ground truth annotation set) to those produced by the text mining API algorithm (evaluation annotation set). Ultimately, the report should present all True Positive (correct), False Positive (spurious), False Negative (missed), and True Negative (correctly skipped) matches.

To support the widest range of use cases, the evaluation is organized as a sequence of three independent functions:

Report = Aggregate(Match(Select(Annotation)))

where each of the three functions is described below.

Select

The select function determines which documents and annotations should be included in the report. It also selects any annotation metadata features that are required to compare the annotations from the base and the target annotation sets.

Document filters

By default, reports are executed against all documents in the corpus. In addition, Metadata Studio supports the following document filters:

  • Document type: if a corpus contains different types of documents, any subset of these document types can be included in the report.
  • Document field: only documents that have a specific value for a specific document field can be included. This allows creating a report for a specific document or for documents from a specific source.

Annotation filters

The annotation filters determine which annotations will be evaluated in the report. The annotations are first filtered based on their source (creator). The user chooses a “ground truth” source, which is considered the base source, and an “evaluation” source. The annotations are then grouped by their annotation type. Multiple annotation types can be included in a report, and the report produces separate evaluation measures per annotation type.

Equation

The equation (match) function determines when two annotations are considered the same. For each annotation type, an optional set of fields can be chosen to be taken into account during this comparison. The application supports two modes of comparison:

  • Strict: requires that the annotation type, features, and offsets all match exactly.
  • Weak: requires that the annotation type and features match exactly, but the offsets of the base annotation and the target annotation only need to overlap. For example, a ground truth Person annotation spanning “John Smith” and a predicted Person annotation spanning only “Smith” match in weak mode but not in strict mode.

Aggregation

The aggregation relies on the following definitions.

Let \(D\) be the set of all documents in a given corpus. Then for a given document \(t ∈ D\), we define the following:

  • Correct/True Positive (\(TP_t\)): the number of matching annotations between the ground truth and the predictions for the document \(t\).
  • Spurious/False Positive (\(FP_t\)): the number of annotations predicted by an annotator that do not match an annotation in the ground truth for the document \(t\).
  • Missing/False Negative (\(FN_t\)): the number of annotations in the ground truth that do not match an annotation predicted by an annotator for the document \(t\).

The Aggregation function combines the TP, FP, and FN counts into a single comparable number. The most popular measures for the performance of an information extraction algorithm are:

\(Precision_t =\dfrac{TP_t}{TP_t + FP_t}\)

\(Recall_t= \dfrac{TP_t}{TP_t + FN_t}\)

\(F1.0_t = \dfrac{2 \times Precision_t \times Recall_t}{Precision_t + Recall_t}\)

Further, there are two different ways of aggregating these measures for the whole corpus:

  • Micro (default): all annotations from all documents are pooled together and the measures are calculated over the pooled counts.
  • Macro: the precision and recall of each document are calculated independently and the final number is their average.

\(Macro\_Precision = \dfrac{\sum_{t \in D} Precision_t}{|D|}\)

\(Macro\_Recall = \dfrac{\sum_{t \in D} Recall_t}{|D|}\)

\(Macro\_F1.0 = \dfrac{2 \times Macro\_Precision \times Macro\_Recall}{Macro\_Precision + Macro\_Recall}\)

\(Micro\_Precision = \dfrac{\sum_{t \in D} TP_t}{\sum_{t \in D}(TP_t + FP_t)}\)

\(Micro\_Recall=\dfrac{\sum_{t \in D}TP_t}{\sum_{t \in D}(TP_t + FN_t )}\)

\(Micro\_F1.0 = \dfrac{2 \times Micro\_Precision \times Micro\_Recall}{Micro\_Precision + Micro\_Recall}\)
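
As a small worked illustration with hypothetical counts, assume a corpus of two documents where document 1 has \(TP_1 = 8\), \(FP_1 = 2\), \(FN_1 = 0\) and document 2 has \(TP_2 = 1\), \(FP_2 = 1\), \(FN_2 = 3\). Then:

\(Precision_1 = 0.8\), \(Recall_1 = 1.0\), \(Precision_2 = 0.5\), \(Recall_2 = 0.25\)

\(Macro\_Precision = \dfrac{0.8 + 0.5}{2} = 0.65\), \(Macro\_Recall = \dfrac{1.0 + 0.25}{2} = 0.625\)

\(Micro\_Precision = \dfrac{8 + 1}{10 + 2} = 0.75\), \(Micro\_Recall = \dfrac{8 + 1}{8 + 4} = 0.75\)

The micro measures weight every annotation equally, so the larger document dominates the result, while the macro measures weight every document equally regardless of how many annotations it contains.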

Labels and Concepts report

It is important to ensure that the evaluation measures are representative of the quality of the text mining API algorithm in general.

The Labels and Concepts report enables you to assess whether the corpus is a good, representative sample of the variety of content on which the text mining API will be applied, as well as of the coverage of the different annotation types and concepts from the reference taxonomy. It provides a deep-dive view into the content of the corpus and the generated annotations, and reveals the most frequent labels and concepts linked in the corpus, grouped by type and source.

In addition, the report gives an overview of the most frequently co-occurring labels and concepts, which helps you to get to know the content better and to potentially discover trends and links between the reference data concepts.

Moreover, it is useful for detecting deviations and gaps in the corpus coverage and inconsistencies in the annotations.