EMaaS - Entity Matching-as-a-Service

Entity Matching-as-a-Service (EMaaS) addresses the problem of identifying records that refer to the same real-world entity. The task is challenging because it is inherently pairwise: in the naive case every record must be compared with every other record, so the cost grows quadratically with dataset size and becomes prohibitive at Big Data volumes. Entity Matching (EM) is critical for data cleaning and integration, e.g., finding duplicate points of interest across different databases, so there is growing interest in how EM can benefit from modern parallel computing programming models such as Apache Spark.
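
The pairwise nature of the task can be illustrated with a minimal sketch. It is not the EMaaS implementation: the similarity measure (Python's standard `difflib.SequenceMatcher`), the threshold of 0.85, the function names and the sample records are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Return a similarity ratio in [0, 1] between two names, ignoring case."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_pairs(records, threshold=0.85):
    """Compare every pair of (id, name) records.

    For n records this performs n * (n - 1) / 2 comparisons, which is
    why naive entity matching scales poorly on large datasets.
    """
    matches = []
    for (id_a, name_a), (id_b, name_b) in combinations(records, 2):
        if similarity(name_a, name_b) >= threshold:
            matches.append((id_a, id_b))
    return matches


# Hypothetical points of interest drawn from two databases (illustrative only).
records = [
    ("a1", "Sagrada Familia"),
    ("b7", "Sagrada Família"),
    ("a2", "Park Guell"),
    ("b9", "Camp Nou"),
]

duplicates = match_pairs(records)
```

With four records the loop already performs six comparisons; a parallel framework would typically distribute these comparisons, or reduce them by grouping likely matches first, rather than run them in a single loop.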

Contact Person: 

User Categories: 
Infra managers
Network managers
Chief Technology Officer (CTO)
Chief Financial Officer (CFO)

Community / market sectors: 
EMaaS targets any scientific or industrial sector that needs a service for large-scale geospatial data matching.

Specific needs / Value Proposition: 
- Automatic parallelization of sequential code without the need to adopt any specific Application Programming Interface (API). Support for Java, C/C++ and Python. The same code can be executed regardless of the underlying infrastructure.
- Automatic scaling and elasticity features, so the number of available resources can be adapted to the actual execution needs.
- Interoperability with different cloud providers, to run computational loads on multi-cloud environments without code adaptation.
- Availability of tools that ease: the implementation of COMPSs applications by means of an Integrated Development Environment (IDE); application deployment on distributed infrastructures by means of the Programming Model Enactment Service (PMES); and the monitoring of executions by means of the Monitoring and Tracing tools.

Specific benefits / Value Proposition: 
The EMaaS service will handle requests from applications and systems that want to submit Entity Matching tasks to the cluster environment. To this end, the service connects to the Hadoop ecosystem to perform the necessary operations, such as uploading artifacts (e.g., datasets) to HDFS or starting Spark jobs.
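
As a rough sketch of the two operations mentioned above, the following builds the command lines that a service of this kind might issue with the standard `hdfs dfs -put` and `spark-submit` tools. How EMaaS actually talks to the cluster is not described here, so the helper names, file paths and the choice of YARN as the Spark master are assumptions; the commands are only constructed, not executed.

```python
from typing import List


def hdfs_put_command(local_path: str, hdfs_dir: str) -> List[str]:
    """Build the argument list that copies a local dataset into HDFS.

    The -f flag overwrites the destination if it already exists.
    """
    return ["hdfs", "dfs", "-put", "-f", local_path, hdfs_dir]


def spark_submit_command(app: str, app_args: List[str],
                         master: str = "yarn") -> List[str]:
    """Build the argument list that launches a Spark job in cluster mode."""
    return ["spark-submit", "--master", master,
            "--deploy-mode", "cluster", app, *app_args]


# Hypothetical paths and job name, for illustration only.
upload = hdfs_put_command("/data/pois_a.csv", "/emaas/input")
job = spark_submit_command("em_job.py", ["/emaas/input", "/emaas/output"])
```

Each list could then be passed to `subprocess.run(..., check=True)` on a machine where the Hadoop and Spark clients are installed and configured.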

Owner (incl. Joint ownership): 
Barcelona Supercomputing Center

Owner type: Academia/Research

Delivery Model: 
Open source

EUBrasilCloudFORUM is funded by the European Commission under the Cooperation Programme, Horizon 2020 grant agreement No 689495. This project results from the 3rd BR-EU Coordinated Call in Information and Communication Technologies (ICT), announced by the Ministry of Science, Technology and Innovation (MCTI), under grant agreement No 689495.