Automatic Generation of Temporal Data Provenance From Biodiversity Information Systems
Although the significance of data provenance has been recognized in a variety of sectors, there is currently no standardized technique or approach for gathering data provenance. The present automated technique mostly employs workflow-based strategies. Unfortunately, the majority of current information systems do not embrace the strategy, particularly biodiversity information systems in which data is acquired by a variety of persons using a wide range of equipment, tools, and protocols.
This article presents an automated technique for producing temporal data provenance that is independent of biodiversity information systems. The approach is dependent on the changes in contextual information of data items. By mapping the modifications to a schema, a standardized representation of data provenance may be created. Consequently, temporal information may be automatically inferred.
The research methodology consists of three main activities: database event detection, event-schema mapping, and temporal information inference. First, a list of events will be detected from databases. After that, the detected events will be mapped to an ontology, so a common representation of data provenance will be obtained. Based on the derived data provenance, rule-based reasoning will be automatically used to infer temporal information. Consequently, a temporal provenance will be produced.
This paper provides a new method for generating data provenance automatically without interfering with the existing biodiversity information system. In addition to this, it does not mandate that any information system adheres to any particular form. Ontology and the rule-based system as the core components of the solution have been confirmed to be highly valuable in biodiversity science.
Detaching the solution from any biodiversity information system provides scalability in the implementation. Based on the evaluation of a typical biodiversity information system for species traits of plants, a high number of temporal information can be generated to the highest degree possible. Using rules to encode different types of knowledge provides high flexibility to generate temporal information, enabling different temporal-based analyses and reasoning.
The strategy is based on the contextual information of data items, yet most information systems simply save the most recent ones. As a result, in order for the solution to function properly, database snapshots must be stored on a frequent basis. Furthermore, a more practical technique for recording changes in contextual information would be preferable.
The capability to uniformly represent events using a schema has paved the way for automatic inference of temporal information. Therefore, a richer representation of temporal information should be investigated further. Also, this work demonstrates that rule-based inference provides flexibility to encode different types of knowledge from experts. Consequently, a variety of temporal-based data analyses and reasoning can be performed. Therefore, it will be better to investigate multiple domain-oriented knowledge using the solution.
Using a typical information system to store and manage biodiversity data has not prohibited us from generating data provenance. Since there is no restriction on the type of information system, our solution has a high potential to be widely adopted.
The data analysis of this work was limited to species traits data. However, there are other types of biodiversity data, including genetic composition, species population, and community composition. In the future, this work will be expanded to cover all those types of biodiversity data. The ultimate goal is to have a standard methodology or strategy for collecting provenance from any biodiversity data regardless of how the data was stored or managed.