Zero-Cost, Arrow-Enabled Data Interface for Apache Spark

2021 IEEE International Conference on Big Data (Big Data), 2021

Abstract
Distributed data processing ecosystems are widespread and their components are highly specialized, making efficient interoperability an urgent need. Recently, the community has adopted Apache Arrow as a format mediator that provides an efficient in-memory data representation. Arrow enables efficient data movement between data processing and storage engines, significantly improving interoperability and overall performance. In this work, we design a new zero-cost data interoperability layer between Apache Spark and Arrow-based data sources through the Arrow Dataset API. Our novel data interface helps separate the computation (Spark) and data (Arrow) layers. This enables practitioners to seamlessly use Spark to access data from all Arrow Dataset API-enabled data sources and frameworks. To benefit our community, we open-source our work and show that consuming data through Apache Arrow is zero-cost: our novel data interface is either on par with or more performant than native Spark.
Keywords
Arrow-enabled data interface, Apache Spark, distributed data processing ecosystems, efficient interoperability, Apache Arrow, in-memory data representation, efficient data movement, storage engines, zero-cost data interoperability layer, Arrow-based data sources, Arrow Dataset API, novel data interface, computation, access data, native Spark