This work was done as part of the end-term project for the course CS 848: Data Cleaning using Machine Learning.
The project is still under development at the Data Systems Group Lab. My contribution to the project was limited to system optimization, dataset curation, and experimentation scripts.
Since the project is not yet public, my description is limited to the idea rather than actual implementation details.
Data cleaning is especially important when data is extracted from unreliable sources or combined from multiple sources. There has been considerable work on automated, efficient, and reliable ways to clean data in relational databases using functional dependencies, conditional functional dependencies, and denial constraints. This project brings the idea of denial constraints to RDF data, which is composed of interconnected triples rather than fixed-dimension tuples. Because of this structure, and because RDF data is schemaless, discovering denial constraints is non-trivial.
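To make the idea concrete: a denial constraint forbids any combination of values that satisfies a violating condition. Since the project's actual rule language is not public, the following is only a toy sketch of checking one such constraint over RDF triples using rdflib and SPARQL; the namespace, properties, and rule are all invented for illustration.

```python
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/")

# Build a tiny RDF graph of interconnected triples (invented data).
g = Graph()
g.add((EX.alice, RDF.type, EX.Person))
g.add((EX.alice, EX.age, Literal(34)))
g.add((EX.bob, RDF.type, EX.Person))
g.add((EX.bob, EX.age, Literal(-5)))  # a dirty value

# Denial constraint: "no Person may have a negative age".
# Its violations are exactly the bindings returned by this query.
violations = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?node ?age WHERE {
        ?node a ex:Person ;
              ex:age ?age .
        FILTER (?age < 0)
    }
""")
for node, age in violations:
    print(f"violation: {node} has age {age}")  # prints bob's URI and -5
```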
At a high level, the project begins by defining concrete semantics for denial constraint rules over RDF datasets using a subset of SHACL (Shapes Constraint Language). It then traverses the graph node by node, enumerating the properties present on each node along with their values. Based on the datatypes of these properties and values, it derives features for each property, such as whether it is categorical or numerical. Given all the discovered features, a space of all possible rules can be defined. Finally, a search algorithm explores this space of denial constraint rules across all nodes and types, and lists the handful of rules that have significant support in the input data.
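Again purely as an illustration (the real pipeline, data structures, and search algorithm are private), this sketch shows the shape of that discovery loop on invented toy data: infer a feature type per property, enumerate a tiny hand-rolled space of candidate denial constraints, and keep the candidates whose support clears a threshold.

```python
from collections import defaultdict

# Invented toy data: a node -> {property: value} view of an RDF graph.
nodes = {
    "ex:alice": {"ex:age": 34, "ex:country": "CA"},
    "ex:bob":   {"ex:age": 29, "ex:country": "CA"},
    "ex:carol": {"ex:age": -3, "ex:country": "US"},
}

def property_features(nodes):
    """Classify each observed property as numerical or categorical."""
    observed = defaultdict(list)
    for props in nodes.values():
        for prop, value in props.items():
            observed[prop].append(value)
    return {
        prop: "numerical" if all(isinstance(v, (int, float)) for v in vals)
              else "categorical"
        for prop, vals in observed.items()
    }

def candidate_rules(features):
    """Enumerate a (tiny, hand-rolled) space of candidate rules.

    Each candidate is (description, violation_predicate): the rule
    states that no node may satisfy the predicate.
    """
    rules = []
    for prop, kind in features.items():
        if kind == "numerical":
            rules.append((f"deny {prop} < 0",
                          lambda p, q=prop: q in p and p[q] < 0))
    return rules

def support(violation, nodes):
    """Fraction of nodes consistent with the rule (non-violating)."""
    return sum(1 for p in nodes.values() if not violation(p)) / len(nodes)

features = property_features(nodes)        # {'ex:age': 'numerical', ...}
for desc, violation in candidate_rules(features):
    s = support(violation, nodes)
    if s >= 0.6:                           # keep well-supported rules
        print(f"{desc}: support={s:.2f}")  # deny ex:age < 0: support=0.67
```

In the actual project the candidate space is generated from the discovered features rather than hand-picked, and the rule grammar and support threshold here are placeholders.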
As mentioned above, my contributions to the project were manifold: system optimization, dataset curation, and experimentation scripts.
I would like to thank Prof. Ihab Ilyas and Mina Farid for giving me the opportunity to work on their project.