Enabling Huawei to implement new functionality and integration support for Presto and Hive in CarbonData
Partner : Huawei Technologies.
Website : https://carbondata.apache.org/
Technologies Used : Scala, Java, Apache Spark, Spark Streaming, Presto, Hive, Hadoop, AWS S3
Domain : Data Storage and processing on Big Data
Apache CarbonData is an indexed columnar data format for fast analytics on big data platforms such as Apache Hadoop and Apache Spark. Knoldus worked in collaboration with Huawei to implement new functionality and integration support for other technologies, including Presto and Hive, in CarbonData. The diagram below illustrates the CarbonData file structure:
As part of this project, Huawei had a range of challenging requirements that spanned several domains, including backend, frontend and continuous integration, while ensuring backward compatibility with older versions as newer versions were rolled out very frequently. Knoldus worked alongside the Huawei team to help CarbonData move from an incubating project to a full Apache project.
Knoldus worked closely with the Huawei team and helped build crucial functionality, some of which is listed below:
- Development of a dictionary generation tool for CarbonData.
- Pre-aggregate functionality to improve performance of aggregation queries.
- CarbonData integration with Presto, Hive, Flink and S3 technologies.
- Setting up continuous integration via Jenkins.
- Creation of a performance testing tool for benchmarking.
- Achieving zero bugs with automated testing.
- Development and maintenance of the Apache CarbonData website.
- Automation of documentation from Git to website.
- Development and enhancement of the core packages of CarbonData.
- Benchmarking CarbonData against file formats such as Parquet and ORC, against frameworks such as Spark, Presto and Impala, and against storage systems such as Hadoop, S3 and Kudu.
Knoldus helped deliver a file format that is faster and more efficient for processing and querying big data. CarbonData clients are now able to speed up their systems by using CarbonData's features.
One CarbonData client is Hulu, a North American internet video company and one of the first users of CarbonData. With the CarbonData platform, Hulu is able to filter out 2% to 5% of the data for aggregation; the filters are mostly based on 5 to 10 columns, and the result set has 100+ columns. The fine-grained index and columnar groups provided in CarbonData render speedier results than Parquet/ORC in this use case.
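To show why an index helps with a highly selective filter like the one in this use case, here is a minimal Python sketch of min/max block pruning. It is an illustration only, not CarbonData's actual implementation: the `Block` class, the `build_blocks` and `filtered_sum` functions, the block size and the sample data are all hypothetical and chosen for clarity.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """Hypothetical storage block: rows plus the min/max of the filter column."""
    rows: List[Tuple[int, float]]  # (filter_key, measure)
    min_key: int
    max_key: int


def build_blocks(rows, block_size):
    """Sort rows by key and cut them into blocks, recording min/max per block."""
    ordered = sorted(rows, key=lambda r: r[0])
    blocks = []
    for start in range(0, len(ordered), block_size):
        chunk = ordered[start:start + block_size]
        blocks.append(Block(chunk, chunk[0][0], chunk[-1][0]))
    return blocks


def filtered_sum(blocks, lo, hi):
    """Sum the measure for lo <= key <= hi, skipping blocks the index rules out."""
    total = 0.0
    scanned = 0
    for block in blocks:
        # The min/max entry proves this block cannot contain a matching row.
        if block.max_key < lo or block.min_key > hi:
            continue
        scanned += 1
        for key, measure in block.rows:
            if lo <= key <= hi:
                total += measure
    return total, scanned


# Assumed sample data: 1000 rows, each with a measure of 1.0.
rows = [(k, 1.0) for k in range(1000)]
blocks = build_blocks(rows, block_size=50)
# A filter that selects 3% of the rows, similar in spirit to the 2% to 5% case.
total, scanned = filtered_sum(blocks, lo=100, hi=129)
```

In this sketch the filter touches only the blocks whose key range overlaps the predicate, so most of the data is never read. A real columnar format applies the same idea per column and at several levels of granularity.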