[PySpark3] UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not wor


지구정복


noohhee 2025. 5. 26. 18:41


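PyArrow is strict about how it handles timestamps internally, including their time zone.

Spark and pandas, on the other hand, usually work with naive timestamps that carry no time zone information.

A naive timestamp here means a value such as "2025-05-26 14:00:00".

A timestamp as PyArrow handles it is a value that includes the time zone, for example: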

2025-05-26 09:37:07.223083+00:00
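The difference between the two kinds of value can be reproduced with nothing but the Python standard library. The sketch below is only an illustration of naive versus time-zone-aware timestamps (it does not use PyArrow or Spark), and the variable names are made up for this example.

```python
from datetime import datetime, timezone

# A naive timestamp: no time zone attached.
naive = datetime(2025, 5, 26, 14, 0, 0)

# A time-zone-aware timestamp: carries a UTC offset.
aware = datetime(2025, 5, 26, 9, 37, 7, 223083, tzinfo=timezone.utc)

print(naive, "| tzinfo:", naive.tzinfo)
print(aware, "| tzinfo:", aware.tzinfo)

# Python itself refuses to compare the two kinds directly,
# which is the same class of mismatch described above.
try:
    naive < aware
    mixing_fails = False
except TypeError:
    mixing_fails = True

print("comparing naive with aware raises TypeError:", mixing_fails)
```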

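So when pyarrow and pandas-on-Spark are used together, errors are likely to occur while timestamps are being processed.

To make PyArrow follow the timestamp convention that pandas-on-Spark uses, the environment variable PYARROW_IGNORE_TIMEZONE has to be set to 1 on the driver and on every executor in advance.

Note that this is an environment variable, not a Spark SQL option. Define it with export on the server side, or set it through os.environ after Python starts and before the Spark context is launched.

So run the Python code below.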

import os
# Run this before the SparkSession (or pyspark.pandas) is created.
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
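For the executor side, Spark's configuration documentation also describes the generic spark.executorEnv.[Name] property, and spark.yarn.appMasterEnv.[Name] for the YARN application master. The sketch below assumes those two properties are available in your Spark version and deployment. The helper name timezone_env_conf is made up for this example; it only sets the variable in the current (driver) process and prints the matching --conf flags, without starting Spark.

```python
import os

ENV_NAME = "PYARROW_IGNORE_TIMEZONE"


def timezone_env_conf(value="1"):
    """Set the variable for this process and return Spark conf entries
    that forward the same value to executors (hypothetical helper)."""
    os.environ[ENV_NAME] = value
    return {
        # Documented generic form: spark.executorEnv.[EnvironmentVariableName]
        f"spark.executorEnv.{ENV_NAME}": value,
        # Only relevant on YARN, for the application master.
        f"spark.yarn.appMasterEnv.{ENV_NAME}": value,
    }


conf = timezone_env_conf()
for key, val in conf.items():
    print(f"--conf {key}={val}")
```

The returned dictionary can be handed to SparkSession.builder.config(key, value) entry by entry, or the printed flags can be appended to spark-submit. Whether that is enough on its own depends on the cluster, so the export approach above remains the baseline.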

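If you are also using Livy, add the environment variable to livy-env.sh as well (export PYARROW_IGNORE_TIMEZONE=1).

In addition, if the job converts between pandas and Spark DataFrames, add the setting below too, either as a configuration property or as a --conf flag.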

spark.sql.execution.arrow.pyspark.enabled=true

--conf spark.sql.execution.arrow.pyspark.enabled=true

Operation           Without Arrow   With Arrow
toPandas()          20 s            3 s
createDataFrame()   15 s            2 s
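To get numbers like these for your own data, the conversion can be timed with the setting switched off and then on. The following is a minimal sketch: compare_to_pandas and timed are hypothetical helpers, and spark and sdf are assumed to be an existing SparkSession and Spark DataFrame. It relies only on spark.conf.set() and DataFrame.toPandas().

```python
import time

ARROW_KEY = "spark.sql.execution.arrow.pyspark.enabled"


def timed(fn):
    """Call fn() and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def compare_to_pandas(spark, sdf):
    """Time sdf.toPandas() with Arrow disabled, then enabled.

    Returns a dict such as {"false": seconds, "true": seconds}.
    """
    timings = {}
    for enabled in ("false", "true"):
        spark.conf.set(ARROW_KEY, enabled)
        _, timings[enabled] = timed(sdf.toPandas)
    return timings
```

For example, compare_to_pandas(spark, spark.range(10_000_000)) would return the two durations in seconds. The actual ratio depends on data size, column types, and the cluster, so the figures in the table should be read as an illustration.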

Finally, if you are running a Jupyter kernel directly, add the variable to the "env" section of the kernel file, kernel.json, as shown below.

# cat kernel.json
{
  "argv": [
    "/mypython/py38-16/bin/python3",
    "-m",
    "ipykernel_launcher",
    "-f",
    "{connection_file}"
  ],
  "display_name": "Public_Spark3(Python3.8)",
  "language": "python",
  "env": {
    "PYSPARK_PYTHON": "/mypython/py38-16/bin/python3",
    "SPARK_HOME": "/usr/spark3-client",
    "HADOOP_CONF_DIR": "/etc/hadoop/conf",
    "PYTHONPATH": "/usr/spark3-client/python/lib/py4j-0.10.9.7-src.zip:/usr/spark3-client/python/:",
    "PYTHONSTARTUP": "/usr/spark3-client/python/pyspark/shell.py",
    "PYARROW_IGNORE_TIMEZONE": "1",
    "PYSPARK_SUBMIT_ARGS": "--master yarn --driver-memory 4g pyspark-shell"
  }
}
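If there are several kernels to update, the same edit can be scripted instead of done by hand. This is a small sketch using only the standard library; add_kernel_env is a hypothetical helper and the path passed to it is whatever kernel.json you want to change.

```python
import json
from pathlib import Path


def add_kernel_env(kernel_json, name, value):
    """Add or overwrite one entry in the "env" object of a kernel.json file.

    Returns the updated kernel spec as a dict.
    """
    path = Path(kernel_json)
    spec = json.loads(path.read_text(encoding="utf-8"))
    # Create the "env" object if the kernel spec does not have one yet.
    spec.setdefault("env", {})[name] = value
    path.write_text(json.dumps(spec, indent=2) + "\n", encoding="utf-8")
    return spec
```

Calling add_kernel_env("kernel.json", "PYARROW_IGNORE_TIMEZONE", "1") leaves the other keys as they are. The kernel has to be restarted afterwards so that the new environment is picked up.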


License: Attribution, ShareAlike
