Architecture & Components

Goals

Metadata Studio is a tool that facilitates the process of creating gold standard corpora for text mining tasks. Its main goal is to lower the cost of developing reference standards and of evaluating the performance of text mining algorithms against them with well-established measures (precision, recall, F-score, and Jaccard index).
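As a rough illustration of these measures, the sketch below compares a set of predicted annotations with a gold standard set. It is not Metadata Studio's implementation: the `Span` tuple, the `evaluate` function, and the exact-match criterion (same offsets and same label) are assumptions made for the example.

```python
from typing import NamedTuple

# Assumption for this sketch: an annotation is identified by
# (start offset, end offset, label) and must match exactly to count as correct.
Span = tuple[int, int, str]


class Scores(NamedTuple):
    precision: float
    recall: float
    f1: float
    jaccard: float


def evaluate(gold: set[Span], predicted: set[Span]) -> Scores:
    """Score predicted annotations against a gold standard set."""
    true_positives = len(gold & predicted)
    union = len(gold | predicted)
    # Share of predicted annotations that are in the gold standard.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Share of gold standard annotations that were predicted.
    recall = true_positives / len(gold) if gold else 0.0
    # F1 is the harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    # Jaccard index: size of the intersection over size of the union.
    jaccard = true_positives / union if union else 0.0
    return Scores(precision, recall, f1, jaccard)


gold = {(0, 5, "Person"), (16, 24, "Organization"), (30, 36, "Person")}
predicted = {
    (0, 5, "Person"),
    (16, 24, "Organization"),
    (40, 45, "Person"),
    (50, 55, "Location"),
}
scores = evaluate(gold, predicted)
print(scores)
```

Here two of the four predictions match the three gold annotations, so precision is 0.5, recall is 2/3, and the Jaccard index is 2/5.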

The main design principles of Metadata Studio ordered by priority are as follows:

  1. Self-service product that supports multiple upgrade paths and deployment options, such as a standalone on-premise deployment (including on top of a pre-existing GraphDB instance) or access through a cloud-managed service.
  2. Reducing the cost of quality control for text analysis services.
  3. Enterprise-ready.

Check the impact mapping table to see all epics and how they map to the strategic goals.

System Context

The wider context of Metadata Studio and its ecosystem looks as follows:

https://lucid.app/publicSegments/view/47756037-f911-4069-873d-35d1b7acbc39/image.jpeg

Containers

Metadata Studio is composed of four main containers: the Metadata Studio UI Client, the Metadata Studio Backend, GraphDB, and a component that provides the OAuth2 implementation handling Metadata Studio security.

https://lucid.app/publicSegments/view/7bdcbd0c-402d-4b71-8522-305e7425ae6f/image.jpeg

The Metadata Studio Backend builds on top of the Ontotext Platform to provide a GraphQL interface to the Metadata Studio domain objects stored in the GraphDB Knowledge Base. The Metadata Studio UI uses the GraphQL endpoint to fetch the data to be visualized and to apply mutations to the existing data in response to user actions.
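To make this interaction more concrete, here is a minimal sketch of how a client could prepare a GraphQL request for the backend. Everything specific in it is hypothetical: the endpoint URL, the `corpus` and `documents` field names, and the use of an OAuth2 bearer token in the `Authorization` header are assumptions for illustration, not the actual Metadata Studio schema or API. The request is only built, not sent.

```python
import json
import urllib.request

# Hypothetical endpoint; the real URL depends on the deployment.
GRAPHQL_ENDPOINT = "http://localhost:8080/graphql"

# Hypothetical query; the field names are illustrative, not the real schema.
CORPUS_DOCUMENTS_QUERY = """
query CorpusDocuments($corpusId: ID!) {
  corpus(ID: $corpusId) {
    id
    documents {
      id
    }
  }
}
"""


def build_graphql_request(
    query: str, variables: dict, token: str
) -> urllib.request.Request:
    """Build (but do not send) an HTTP POST carrying a GraphQL operation."""
    body = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return urllib.request.Request(
        GRAPHQL_ENDPOINT,
        data=body,
        headers={
            "Content-Type": "application/json",
            # Assumption: the OAuth2 component issues a bearer access token.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_graphql_request(
    CORPUS_DOCUMENTS_QUERY,
    {"corpusId": "http://example.org/corpus/1"},
    token="demo-token",
)
print(request.get_method(), request.full_url)
```

A mutation would be sent the same way, with the `query` field of the payload holding a `mutation` operation instead.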

Data Model

The main objects in the Metadata Studio world are:

  • Users: To access Metadata Studio, you need to be authorized as a Metadata Studio user. Each user has a specific role that determines what actions they can execute in Metadata Studio.
  • Projects: A higher-level abstraction that comprises multiple corpora (for example “Training datasets” and “Evaluation datasets”). Currently, all projects use the same annotation set.
  • Corpora: A named group of documents meant to be maintained and used together (e.g., “2022 Training dataset” or “Organizations Evaluation dataset”).
  • Documents: The documents you would like to annotate.
  • Annotations: The annotations you would like to assign to your documents (e.g., Person annotations, Organization annotations, PersonCEOOrganization relation annotations, etc.). Annotations can be either Document annotations (which apply to the whole content of the document) or Inline annotations (which apply to a specific part of the document).
  • Concepts: In case you would like to perform entity linking to concepts from your knowledge base, these are the specific concept classes that your annotations point to (e.g., Person, Organization).
  • SavedReports: The statistics or evaluation reports that you can create and save for a corpus.
  • Annotation Services: A set of third-party text mining API services that can be used to produce annotations. These can then be measured against the Gold Standard annotations to evaluate the quality of the text mining API service, or can be used as a bootstrap for the annotation process.

The following diagrams depict the base RDF model of Metadata Studio:

  • TimeSensitive classes:

    https://lucid.app/publicSegments/view/05945dc5-54a6-4f29-936b-a4678043e4e0/image.jpeg
  • NamedEntity classes:

    https://lucid.app/publicSegments/view/e43a46f5-3029-43ca-8579-0ac658b9d6f7/image.jpeg

Document model

Currently, Metadata Studio supports plain text document content. The topic is discussed in depth in the Metadata Studio 3.0 document and annotation formats.

Using plain text document content allows inline annotations to be modeled with text position offsets. This enables out-of-the-box integration with the GraphDB Text Mining plugin model and with most of the well-established text mining API services, such as Ontotext’s CES, spaCy, Google NLP, Amazon Comprehend, IBM Watson, and others.
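The idea of offset-based inline annotations can be sketched as follows. The `InlineSpan` class and its field names are hypothetical, and the convention of a zero-based, inclusive start and exclusive end offset is an assumption for the example; the convention used by a given service or by Metadata Studio itself may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InlineSpan:
    """Hypothetical inline annotation located by character offsets."""

    start: int  # zero-based offset of the first character (inclusive)
    end: int    # offset just past the last character (exclusive)
    label: str

    def text(self, content: str) -> str:
        """Return the part of the document content that this span covers."""
        return content[self.start:self.end]


content = "Tim Cook is the CEO of Apple."
spans = [
    InlineSpan(0, 8, "Person"),
    InlineSpan(23, 28, "Organization"),
]

for span in spans:
    print(span.label, repr(span.text(content)))
```

Because the annotation stores only positions and a label, the document content stays unchanged and several services can annotate the same text independently.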

Metadata Studio supports creating custom document models with custom fields by extending the built-in Document class.

Annotations model

Metadata Studio supports two types of annotations over documents: Inline annotations, which are bound to a specific subset of the text defined through start and end positional offsets, and Document annotations, which are applied to the whole document.

Each annotation is one of the following:

  • a simple annotation with one or multiple literals or concepts
  • a composite annotation pointing to other annotations, for example a relation annotation of type EmployeeOf that points to the person and organization annotations of the employee and the employer. This flexible model provides compatibility with multiple use cases and the ability to define rich and complex models out of the box.
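The distinction between simple and composite annotations can be sketched with a small data structure. The `Annotation` class, its field names, and the example concept IRIs are hypothetical and only illustrate the shape of the model, not how Metadata Studio stores annotations in RDF.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Hypothetical in-memory view of an annotation."""

    annotation_type: str
    # Simple annotation: one or more literals or concept identifiers.
    values: list[str] = field(default_factory=list)
    # Composite annotation: references to other annotations.
    members: list["Annotation"] = field(default_factory=list)

    @property
    def is_composite(self) -> bool:
        return bool(self.members)


# Two simple annotations, each pointing to a (made-up) concept IRI.
person = Annotation("Person", values=["http://example.org/concept/tim-cook"])
organization = Annotation("Organization", values=["http://example.org/concept/apple"])

# A composite relation annotation that points to the two annotations above.
employee_of = Annotation("EmployeeOf", members=[person, organization])

print(employee_of.is_composite, [m.annotation_type for m in employee_of.members])
```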

By default, Metadata Studio starts without predefined annotation classes. The user must extend the Metadata Studio schema with their domain-specific annotation classes.

Each custom annotation class must extend one of the base annotation classes – either DocumentAnnotation (assigned as document tags) or InlineAnnotation (assigned to a specific substring of the document).

Besides the properties inherited from the base annotation class, each custom annotation can have custom properties. Each property must be defined with its property characteristics. Metadata Studio supports a subset of the property characteristics supported by the Ontotext Platform Semantic Objects, as well as a new property characteristic called meta. See more about them here.
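As a loose analogy in Python (the actual extension is done in the Metadata Studio schema, not in Python code), extending a base annotation class with custom properties resembles subclassing. The two base classes below are stand-ins named after the ones in the text; `OrganizationAnnotation`, `TopicAnnotation`, and all their properties are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class DocumentAnnotation:
    """Stand-in for the base class of annotations that tag a whole document."""


@dataclass
class InlineAnnotation:
    """Stand-in for the base class of annotations bound to a substring."""

    start: int
    end: int


@dataclass
class OrganizationAnnotation(InlineAnnotation):
    """Hypothetical custom inline annotation with two custom properties."""

    concept: str
    ticker: str = ""


@dataclass
class TopicAnnotation(DocumentAnnotation):
    """Hypothetical custom document annotation with one custom property."""

    topic: str


mention = OrganizationAnnotation(
    start=23, end=28, concept="http://example.org/concept/apple"
)
tag = TopicAnnotation(topic="Technology")

# Inherited properties come first, followed by the custom ones.
print([f.name for f in fields(OrganizationAnnotation)])
```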