Semantic Harmonisation of Numeric Data from Open Government Data
Open tabular data published as part of open government initiatives typically contains a spatial dimension, a temporal dimension, and the actual numeric data capturing information such as health indicators, pollution readings, and sanitation status. "Semantic harmonisation" of numeric data entails linking numeric data columns with web-accessible semantic entities from an ontology, that is, a machine-readable knowledge representation. These semantic entities are embedded in a knowledge graph, allowing information from disparate sources to be integrated under common semantic definitions across spatial and temporal dimensions. Multiple research efforts have contributed to recovering the semantics of numeric columns in tables; however, they are either restricted to a single domain or rely on the numeric data already existing as linked data tuples in known ontologies. We present a novel yet simple approach that uses a supervised machine learning classifier (Random Forests) and semantic web techniques to generate semantics for numeric columns in tabular data. The approach has been tested, with encouraging results, on over 100 tabular datasets from data.gov.in (the Indian Open Government Data Portal), drawn from multiple domains such as "Health and Family Welfare", "Agriculture", and "Environment". We also present a use case for this work, which is being implemented in collaboration with the ministries of the Government of Karnataka for knowledge aggregation and dissemination of sustainable development data.
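To make the classification step more concrete, the sketch below shows one way a numeric column could be summarised as a fixed-length feature vector before being passed to a supervised classifier. The abstract does not state which features are used, so the feature set here (range, mean, spread, share of integer values, share of values in 0 to 100) and the names `column_features`, `feature_vector` and `FEATURE_NAMES` are illustrative assumptions, not the authors' method.

```python
import statistics

# Hypothetical feature set; the source does not specify the features used.
FEATURE_NAMES = ["min", "max", "mean", "stdev", "frac_integer", "frac_in_0_100"]


def column_features(values):
    """Summarise one numeric column as a dict of simple statistics."""
    nums = [float(v) for v in values]
    if not nums:
        raise ValueError("column has no numeric values")
    return {
        "min": min(nums),
        "max": max(nums),
        "mean": statistics.fmean(nums),
        # Population standard deviation, defined even for a single value.
        "stdev": statistics.pstdev(nums),
        # Share of whole numbers: counts tend to be integers, rates do not.
        "frac_integer": sum(n.is_integer() for n in nums) / len(nums),
        # Share of values in [0, 100]: a rough hint that a column holds percentages.
        "frac_in_0_100": sum(0.0 <= n <= 100.0 for n in nums) / len(nums),
    }


def feature_vector(values):
    """Return the features in a fixed order, as a classifier would expect."""
    feats = column_features(values)
    return [feats[name] for name in FEATURE_NAMES]
```

Under these assumptions, each labelled training column would contribute one such vector paired with the ontology entity it denotes, and a Random Forest implementation (for example scikit-learn's `RandomForestClassifier`) could then be trained on those pairs to suggest an entity for an unseen column.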