Built a platform to find and analyze content across traditional data silos to derive new value-driven insights.
Partner : Elsevier
Website : https://www.elsevier.com/
Technologies Used : Scala, Apache Spark, Quetzal, Spark ML, GridGain (Apache Ignite), PostgreSQL, Akka, Akka HTTP, Cassandra, Solr, Zeppelin
Domain : Semantic Web
Elsevier is a world-leading provider of information solutions that enhance the performance of science, health, and technology professionals, empowering them to make better decisions and deliver better care. Elsevier wants to make analysis easier for everyone, enabling professionals to manage their work more efficiently and spend more time making breakthroughs.
Elsevier provides products and services that help researchers, governments, universities, and healthcare professionals make new discoveries and evaluate and improve their research strategies, and that give physicians the insight to find the right clinical answers. Their goal is to expand the boundaries of knowledge for the benefit of humanity.
Elsevier publishes 430,000 peer-reviewed research articles annually.
Elsevier's major customer segment is drug companies across the world, and drug discovery is a complex process. The cost to develop one new drug is $2.6 billion, and the approval rate for drugs entering clinical development is less than 12%. The attrition rate for drug candidates, that is, the number of candidates you start with for each successful launch, can be on the order of 10,000:1.
Scientists rely on knowledge bases related to pharmacology, medicine, chemistry, and biology, as well as experimental data such as clinical trials, experimental publications, and tests performed on similar candidates. Some of these are purchased, while others are developed over time within the company. Scientists spend an incredible amount of their expensive time searching through these knowledge bases. Take, for example, a simple question: "What are the compounds that are similar in structure to benzene, have a boiling point above 40 degrees F, and have no side effects on people with lymphoma?" The question requires joining information from chemistry, medicine, and pharmacology. By "joining", we mean understanding the question as a human would, bringing together information from different domains, and combining it to provide a definitive answer.
The customer envisioned a platform that can join knowledge from different domains, make it searchable, and respond as a human would: understanding the question, parsing it into a machine-readable query, crawling through the databases, and returning results along with an estimate of how likely each result is to answer the customer's question. That platform is ELSSIE, which Knoldus built for Elsevier.
ELSSIE is a platform that connects information from multiple sources, stored in the form of a knowledge graph and maintained by Elsevier's Subject Matter Experts (SMEs). ELSSIE enables users to find and analyze content across traditional data silos to derive new value-driven insights.
The ultimate goal of ELSSIE is to put complex information at the fingertips of scientists so that they can carry on drug invention at a rapid pace.
To accomplish this, the solution needs to ingest multiple structured and unstructured content sources; store the content as queryable structured data; semantically understand it and generate relationships by recognizing entities and concepts; interpret stored data and offer graph query capabilities; provide an API to integrate with external applications; and, finally, make it easy for scientists to search for information.
ELSSIE as a final solution included the following components:
- The Ingestion Layer provides the ability to ingest structured sources like DBpedia and unstructured sources like scientific publications. The most difficult part of this layer is constructing structured knowledge from unstructured data using NLP. For example, a scientific journal article might refer to "Oxygen", which ELSSIE should recognize as a chemical element and tag appropriately. This layer is built by integrating Apache Spark with the Stanford NLP libraries.
- The Data Lake Layer stores the structured knowledge generated by the ingestion pipelines in a central repository built on Apache Cassandra. ELSSIE's knowledge consists of a large number of "triples" that together make up a large graph. These triples are staged in the data lake and loaded into an in-memory database (GridGain) so that performance stays within stringent limits. Entitlements is the sublayer within the data lake that controls which parts of the knowledge are accessible to whom. This access metadata is itself stored as triples, so that the query engines can interpret it when providing information.
- The Query Layer provides a way to ask graph questions (SPARQL queries) and retrieve results. A lot of innovation and research went into parsing a SPARQL query and retrieving results from a key-value store. Knoldus built the parser that converts graph queries into equivalent key-value retrievals from the in-memory database. This layer builds on a paper published by IBM and extends its concept. Knoldus validated performance using LUBM (Lehigh University Benchmark) queries.
- The Search Layer provides the API for searching the data lake. It connects Google-like free-form search with the definitive query capability provided by the Query Layer, enriching and enhancing the use of the product. Search is fed with "clusters" or "topics" of knowledge generated by the ML pipelines, which make the search facets much more meaningful. This allows scientists to search for "Sugar" and see results in the context of "Diabetes", "Cell energy", or "Recreational Drinks".
- The Machine Learning (ML) Layer provides a way to curate content, verify output generated by humans, measure the algorithms' accuracy, experiment with new models, and test and fix issues. ML serves two primary purposes. The first is ingesting content from various sources; the sources for ELSSIE are diverse, ranging from nicely structured content like DBpedia all the way to scanned PDF documents. The second is making search more intelligent. Extensive NLP is deployed to understand the incoming and ever-growing content. The pipelines implement several clustering (Latent Dirichlet Allocation) and classification (multi-class classification) algorithms. The ML and Ingestion layers are closely tied together.
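To make the idea of answering graph queries from key-value lookups more concrete, here is a minimal Python sketch. It is not the parser Knoldus built, nor the schema used by Quetzal or GridGain; the class `TripleKV`, the function `solve`, the predicate names (`type`, `inDomain`, `canRead`), and the sample triples are all hypothetical and chosen only for illustration. It shows triples indexed under two key layouts, and a small matcher that resolves a SPARQL-style basic graph pattern one lookup at a time. Because entitlements are stored as triples too, the access check is just one more pattern in the query.

```python
from collections import defaultdict


class TripleKV:
    """Toy key-value layout for triples (hypothetical, for illustration only)."""

    def __init__(self):
        self.spo = defaultdict(set)  # key (subject, predicate) -> objects
        self.pos = defaultdict(set)  # key (predicate, object) -> subjects
        self.all = set()             # every triple, used for fallback scans

    def add(self, s, p, o):
        self.spo[(s, p)].add(o)
        self.pos[(p, o)].add(s)
        self.all.add((s, p, o))

    def match(self, s, p, o):
        """Yield triples matching a pattern; None marks an unbound position."""
        if s is not None and p is not None:
            candidates = ((s, p, x) for x in self.spo.get((s, p), ()))
        elif p is not None and o is not None:
            candidates = ((x, p, o) for x in self.pos.get((p, o), ()))
        else:
            candidates = iter(self.all)  # no suitable index: scan everything
        for triple in candidates:
            if all(w is None or w == g for w, g in zip((s, p, o), triple)):
                yield triple


def is_var(term):
    """Variables are written SPARQL-style, with a leading question mark."""
    return isinstance(term, str) and term.startswith("?")


def solve(store, patterns, binding=None):
    """Resolve a list of triple patterns, yielding one dict per solution."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    # Replace bound variables with their values; unbound ones become None.
    lookup = tuple(binding.get(t) if is_var(t) else t for t in first)
    for triple in store.match(*lookup):
        new_binding = dict(binding)
        consistent = True
        for term, value in zip(first, triple):
            if is_var(term) and new_binding.setdefault(term, value) != value:
                consistent = False
                break
        if consistent:
            yield from solve(store, rest, new_binding)


# Hypothetical sample data: knowledge triples plus one entitlement triple.
store = TripleKV()
store.add("Oxygen", "type", "ChemicalElement")
store.add("Oxygen", "inDomain", "domain:chemistry")
store.add("Aspirin", "type", "Drug")
store.add("Aspirin", "inDomain", "domain:pharmacology")
store.add("user:alice", "canRead", "domain:chemistry")

# "Which chemical elements is alice entitled to read?"
results = list(solve(store, [
    ("?e", "type", "ChemicalElement"),
    ("?e", "inDomain", "?d"),
    ("user:alice", "canRead", "?d"),
]))
print(results)
```

Each pattern in the query turns into a single keyed lookup once earlier patterns have bound its variables, which is the general reason an in-memory key-value store can serve graph queries quickly. A production system would also need query planning, more index layouts, and the rest of the SPARQL language, none of which this sketch attempts.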
In summary, the ELSSIE project used Apache Spark, Apache Hadoop, Apache Cassandra, Apache Kafka, Apache Solr, and GridGain (Apache Ignite), all built on AWS. It achieved several innovations, including dynamically scaled Apache Spark and Hadoop clusters, extending Quetzal using ANTLR parsers, and using LDA along with NLP to find entities in text and their contextual meaning rather than their strict literal meaning.
The technology stack and architecture met the SLAs required for the platform. ELSSIE saves a lot of manual effort and time when fetching relevant data from research papers.
Get In Touch
If you are looking to build a similar platform or product on a modern data architecture with a data lake, Knoldus is here to help. We are a proven, experienced, certified Databricks, DataStax, and Lightbend partner, available to partner with you to make your product a reality. Get in touch with us or just send us an email at email@example.com