DATA INTEGRATION WITH PRIVACY, CONSISTENCY AND QUALITY CHECKS FOR BETTER INSIGHT
by Kristina Linke
D4DAIRY DATA INTEGRATES FARM-RELATED DATA GATHERED BY MULTIPLE PARTIES WHILE RESPECTING PRIVACY AND DATA QUALITY CONSTRAINTS
Detailed and valuable data on daily farm business, efficiency and animal health reside in disconnected data silos, which makes it difficult to design accurate predictive models and to integrate them into high-quality decision tools. The solution developed within D4Dairy connects and integrates all project-related subsystems and data sources into a single database called D4Dairy Data. It now includes data streams from RDV, high-resolution sensor data from industrial partners, automatic milking systems, and housing climate. D4Dairy Data runs in a data center for further use beyond the duration of the D4Dairy project. The system implements data anonymization, data fusion and statistical data quality checks. Although the core system is built from standard components, it integrates novel algorithms that make it unique and tailored to dairy-specific data.
Location privacy. Data sharing is crucial for compiling high-quality data sets to improve predictive models. However, there are also legitimate privacy concerns. Public and shared data are usually pseudonymized, meaning that all unique identifiers, such as the identities of farms and their locations, names and identification numbers, are removed. Past research, albeit in a different context, provides evidence that removing these identifiers alone is not enough: the data itself contains information about the data provider. A linkage attack combines two data sets and links similar or identical records with each other. Location privacy is particularly important when multiple parties or farms share their context or behavioral data. Even if no location information is explicitly shared, the location can still be inferred from the shared data with high certainty, for example by combining detailed sensor data with publicly available local weather data. We showed that data from cow activity sensors can be used to localize a cow within a country using a linkage attack. We also implemented a data protection mechanism that prevents such linkage attacks on shared sensor data by using machine learning to weaken the weather dependency in the data.
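The linkage idea described above can be illustrated with a minimal sketch. It is not the attack implemented in D4Dairy: it simply assumes that a pseudonymized activity series depends on local temperature and links it to the candidate weather station whose public temperature series correlates most strongly with it. The station names, the toy values and the helper `link_to_location` are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no linkable signal
    return cov / sqrt(var_x * var_y)


def link_to_location(activity, weather_by_station):
    """Link a sensor series to the station with the strongest correlation.

    The absolute value is used because the direction of the weather
    dependency (more or less activity in warm weather) is not assumed.
    """
    scores = {
        station: abs(pearson(activity, temperatures))
        for station, temperatures in weather_by_station.items()
    }
    best_station = max(scores, key=scores.get)
    return best_station, scores


# Hypothetical public weather data: daily mean temperature per station.
weather_by_station = {
    "station_a": [2.0, 5.0, 9.0, 14.0, 11.0, 6.0, 3.0],
    "station_b": [8.0, 7.0, 9.0, 8.0, 10.0, 9.0, 8.0],
    "station_c": [4.0, 4.5, 12.0, 6.0, 5.0, 13.0, 7.0],
}

# Hypothetical pseudonymized activity series with no location attached.
activity = [410.0, 396.0, 374.0, 351.0, 365.0, 389.0, 406.0]

best_station, scores = link_to_location(activity, weather_by_station)
print(best_station, scores)
```

A protection mechanism in the spirit of the one described above would aim to make these scores indistinguishable across stations while keeping the data useful for modeling.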
Data quality. For data validation and data quality assurance, we designed a statistical sensor data processing framework that leverages the co-dependency between data quality and model robustness to detect performance issues of data-driven predictive models in the field. We showed that distribution shifts in the input data degrade model quality, and presented an indicator capable of detecting such shifts in the wild. The framework improves the quality of cow lameness predictions on the D4Dairy field data by up to 62%.
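As an illustration of what a shift indicator can look like, the sketch below compares a feature's training distribution with its field distribution using the two-sample Kolmogorov-Smirnov statistic. This is a stand-in technique chosen for brevity; the text does not specify the indicator used in D4Dairy. The feature values and the `threshold` of 0.5 are assumptions made for the example.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest gap between the empirical CDFs of both samples:
    0.0 means identical distributions, 1.0 means no overlap at all.
    """
    ref = sorted(reference)
    cur = sorted(current)
    max_gap = 0.0
    # The gap between two step functions can only change at observed values.
    for value in ref + cur:
        cdf_ref = bisect_right(ref, value) / len(ref)
        cdf_cur = bisect_right(cur, value) / len(cur)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


def shift_detected(reference, current, threshold=0.5):
    """Flag a shift when the KS statistic exceeds an assumed threshold."""
    return ks_statistic(reference, current) > threshold


# Hypothetical values of one sensor feature.
training = [0.9, 1.0, 1.1, 1.2, 1.0, 0.8, 1.1, 0.95]
field_similar = [1.0, 0.85, 1.15, 1.05, 0.9, 1.1, 1.0, 0.95]
field_shifted = [1.6, 1.7, 1.5, 1.8, 1.65, 1.55, 1.75, 1.6]

print(ks_statistic(training, field_similar), shift_detected(training, field_similar))
print(ks_statistic(training, field_shifted), shift_detected(training, field_shifted))
```

In practice the threshold would be calibrated on held-out data, and a flagged shift would be a reason to treat the model's predictions with caution or to retrain it.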
Data consistency and harmonization. D4Dairy Data is used to check the consistency of, harmonize, and resample streams of raw data coming from industrial partners. Within the project, the system was used to provide researchers with a consistent view of the data generated by disconnected systems, at a preferred temporal resolution and with matched timestamps.
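Resampling and timestamp matching can be sketched as follows. The example assumes two irregularly sampled streams given as `(timestamp, value)` pairs, averages each into fixed-width buckets and keeps only the buckets present in both. The stream contents, the one-hour resolution and the function names are hypothetical and do not reflect the actual D4Dairy Data implementation.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Fixed reference point so that bucket boundaries are the same for all streams.
ORIGIN = datetime(1970, 1, 1)


def bucket_start(timestamp, interval):
    """Round a timestamp down to the start of its interval-wide bucket."""
    return ORIGIN + interval * ((timestamp - ORIGIN) // interval)


def resample_mean(readings, interval):
    """Average (timestamp, value) readings per bucket."""
    buckets = defaultdict(list)
    for timestamp, value in readings:
        buckets[bucket_start(timestamp, interval)].append(value)
    return {start: sum(values) / len(values) for start, values in buckets.items()}


def align(stream_a, stream_b, interval):
    """Resample both streams and keep only buckets that exist in both."""
    a = resample_mean(stream_a, interval)
    b = resample_mean(stream_b, interval)
    return [(start, a[start], b[start]) for start in sorted(a.keys() & b.keys())]


# Hypothetical raw streams with unmatched timestamps.
activity_stream = [
    (datetime(2021, 3, 1, 8, 3), 10.0),
    (datetime(2021, 3, 1, 8, 41), 14.0),
    (datetime(2021, 3, 1, 9, 10), 20.0),
]
climate_stream = [
    (datetime(2021, 3, 1, 8, 30), 18.0),
    (datetime(2021, 3, 1, 9, 5), 19.0),
    (datetime(2021, 3, 1, 9, 55), 21.0),
    (datetime(2021, 3, 1, 10, 20), 22.0),
]

aligned = align(activity_stream, climate_stream, timedelta(hours=1))
for row in aligned:
    print(row)
```

Dropping buckets that are missing in one stream is only one possible policy; interpolating or marking gaps explicitly may be preferable depending on the analysis.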
Impact and effects. Although built from standard software components, D4Dairy Data includes several unique algorithms, such as data protection and data quality assurance mechanisms, that facilitate data usage and create valuable insights for the benefit of farmers and the dairy industry. This goes beyond pure data exchange solutions and sets D4Dairy Data apart from other data integration systems, which leave the data quality challenge to the stakeholders or the system’s end users.
Contact: Prof. Olga Saukh, Institut für Technische Informatik, TU Graz, firstname.lastname@example.org