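[Python] PyArrow (Apache Arrow) for Large-Scale Data Processing and Analysis

What is PyArrow?

PyArrow is the Python library for Apache Arrow, a development platform for in-memory analytics. Apache Arrow bundles a set of technologies that let big data systems process and move data quickly. At its core is a standardized, language-independent columnar memory format for flat and hierarchical data, laid out for efficient analytic operations on modern hardware.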

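Compatibility

At the time of writing, PyArrow is compatible with Python 3.7, 3.8, 3.9, and 3.10. It is supported on Windows, macOS, and various Linux distributions (including Ubuntu 16.04 and Ubuntu 18.04), and a 64-bit system is recommended.

Installation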

pip install pyarrow

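If you run into problems importing the pip wheels on Windows, you may need to install the Visual C++ Redistributable for Visual Studio 2015.

Reading Files

You can read a file with pyarrow.parquet.read_table(), either by letting PyArrow infer the filesystem from a URI or by passing the filesystem explicitly.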

from pyarrow import fs
import pyarrow.parquet as pq

# Reading a file.
# Using a URI: the filesystem is inferred from the "s3://" scheme.
pq.read_table("s3://my-bucket/data.parquet")
# Using a path and an explicit filesystem (S3FileSystem arguments omitted here).
s3 = fs.S3FileSystem(..)
pq.read_table("my-bucket/data.parquet", filesystem=s3)

Writing Files

With the FileSystem interface, you can open files for reading or writing and then work with the returned objects as file-like objects in your code.

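from pyarrow import fs
import pyarrow as pa

# A small illustrative table; replace with your own data.
table = pa.table({"a": [1, 2, 3]})

local = fs.LocalFileSystem()

# Open an output stream and write the table in the Arrow IPC file format.
with local.open_output_stream("test.arrow") as file:
    with pa.RecordBatchFileWriter(file, table.schema) as writer:
        writer.write_table(table)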

NumPy Interoperability

# NumPy to Arrow
import numpy as np
import pyarrow as pa
data = np.arange(10, dtype='int16')
arr1 = pa.array(data)
arr1

# Arrow to NumPy
arr2 = pa.array([4, 5, 6], type=pa.int32())
view = arr2.to_numpy()
view

<Related links>

PyArrow official documentation: https://arrow.apache.org/docs/python/getstarted.html

Apache Arrow GitHub: https://github.com/apache/arrow
