[Spark & Hive] Error when writing to a Hive managed table from Spark | org.apache.hadoop.hive.ql.metadata.HiveException: Load Data failed for {temporary file path} as the file is not owned by hive and load data is also not ran as hive

noohhee 2025. 4. 22. 13:46

Environment: Spark 3.4.1 and Hive 3.1.3.

Writing to a Hive managed table from Spark (INSERT and similar) fails with an error.

The queries I ran are as follows.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .enableHiveSupport() \
    .getOrCreate()


# Create a Hive managed table from PySpark
q = """
CREATE TABLE test.user_info (
    id INT,
    name STRING,
    age INT
)
STORED AS PARQUET
"""
 
spark.sql(q).show()


# Run the INSERT query against the table created above
q = """
INSERT INTO test.user_info (id, name, age)
VALUES
    (1, 'Alice', 25),
    (2, 'Bob', 30),
    (3, 'Charlie', 28)
"""
 
spark.sql(q).show()

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/bigdata/current/spark3-client/python/pyspark/sql/session.py", line 1440, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery, litArgs), self)
  File "/bigdata/current/spark3-client/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/bigdata/current/spark3-client/python/pyspark/errors/exceptions/captured.py", line 175, in deco
    raise converted from None
pyspark.errors.exceptions.captured.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: Load Data failed for hdfs://nameservice1/user/hive/warehouse/test.db/user_info/.hive-staging_hive_2025-04-21_16-13-48_416_9103076023287701259-1/-ext-10000/part-00000-a48115d0-ae5b-49bb-b2bb-263bd05f107c-c000 as the file is not owned by hive and load data is also not ran as hive

The error says that the temporary (staging) path used while writing to this Hive managed table is not owned by the 'hive' account, and that the load data step was not run as the 'hive' account either.
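When this shows up in job logs, it can help to pull the staging path and the expected owner out of the message before checking ownership on HDFS. The following is a minimal sketch that assumes the message keeps the exact wording shown in the traceback above; `parse_load_data_error` is a hypothetical helper name, not part of Spark or Hive.

```python
import re

# Pattern based only on the wording of the error message quoted in this post.
_LOAD_DATA_ERROR = re.compile(
    r"Load Data failed for (?P<path>\S+) as the file is not owned by "
    r"(?P<owner>\S+) and load data is also not ran as (?P=owner)"
)


def parse_load_data_error(message):
    """Return (staging_path, expected_owner), or None if the message does not match."""
    match = _LOAD_DATA_ERROR.search(message)
    if match is None:
        return None
    return match.group("path"), match.group("owner")


if __name__ == "__main__":
    sample = (
        "org.apache.hadoop.hive.ql.metadata.HiveException: Load Data failed for "
        "hdfs://nameservice1/user/hive/warehouse/test.db/user_info/"
        ".hive-staging_hive_2025-04-21_16-13-48_416_9103076023287701259-1/"
        "-ext-10000/part-00000-a48115d0-ae5b-49bb-b2bb-263bd05f107c-c000 "
        "as the file is not owned by hive and load data is also not ran as hive"
    )
    print(parse_load_data_error(sample))
```

The extracted owner is the account Hive expected, which is the value to compare against the account that actually launched Spark.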

The cause is as follows.

1. hive-site.xml sets hive.server2.enable.doAs=true, so when Spark calls .enableHiveSupport() the Hive session runs as the account that launched Spark.

If hive.server2.enable.doAs=false, every Hive session would run as the hive account, so this error should not occur.

However, the big data cluster I operate has to keep hive.server2.enable.doAs=true.

(It is integrated with OpenLDAP, and per-user permissions are managed with Ranger.)

2. When hive.server2.enable.doAs=true, check the value of hive.load.data.owner in hive-site.xml.

It is usually set to hive.load.data.owner=hive.
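To check both settings without opening the file by hand, the values can be read straight from hive-site.xml. This is a rough sketch using only the Python standard library; `read_hive_properties` is a hypothetical helper, and the inline XML is made-up sample data standing in for your real hive-site.xml (its location depends on your distribution).

```python
import xml.etree.ElementTree as ET

# Made-up sample standing in for a real hive-site.xml.
SAMPLE_HIVE_SITE = """
<configuration>
  <property>
    <name>hive.server2.enable.doAs</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.load.data.owner</name>
    <value>hive</value>
  </property>
</configuration>
"""


def read_hive_properties(xml_text, names):
    """Return {name: value} for the requested property names found in the XML."""
    root = ET.fromstring(xml_text)
    found = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name in names:
            found[name] = (prop.findtext("value") or "").strip()
    return found


if __name__ == "__main__":
    wanted = {"hive.server2.enable.doAs", "hive.load.data.owner"}
    print(read_hive_properties(SAMPLE_HIVE_SITE, wanted))
```

For a real cluster, read the file contents with `open(path).read()` and pass them in place of the sample string. A property missing from the result simply is not set in that file.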

Therefore, as a quick stopgap, either temporarily change hive.load.data.owner to the account that creates the Hive session, or pass the following Spark config when starting the Spark session.

pyspark3 --conf spark.hive.load.data.owner={run_as_account}
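The same setting can also be applied in code when building the session instead of on the command line. The sketch below reuses the `spark.hive.load.data.owner` key from the command above. It assumes the OS login name returned by `getpass.getuser()` is the account that writes the staging files, which may not hold on a Kerberized cluster, so pass the account explicitly if in doubt. `hive_load_owner_conf` and `build_session` are hypothetical helper names.

```python
import getpass


def hive_load_owner_conf(user=None):
    """Return the Spark conf entry that sets hive.load.data.owner for this session."""
    # Assumption: fall back to the OS login name when no account is given.
    owner = user or getpass.getuser()
    return {"spark.hive.load.data.owner": owner}


def build_session(user=None):
    """Create a Hive-enabled SparkSession with the load-data owner overridden."""
    # Imported lazily so the helper above can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.enableHiveSupport()
    for key, value in hive_load_owner_conf(user).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


if __name__ == "__main__":
    print(hive_load_owner_conf("alice"))
```

Like the command-line flag, this only takes effect for a newly created session, so call `build_session()` before any other SparkSession exists in the process.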

The Hive JIRA issue that describes this same problem is:

https://issues.apache.org/jira/browse/HIVE-25381

The Hive JIRA issues below also seem to say this was resolved in 3.1.0, but that still needs verification.

https://issues.apache.org/jira/browse/HIVE-19928

https://issues.apache.org/jira/browse/HIVE-20066
