[Spark] org.apache.spark.SparkException: Cannot broadcast the table that is larger than 8.0 GiB: 11.3 GiB.
noohhee
2025. 7. 1. 11:11
Spark 3.4.1
When Spark internally plans a broadcast join and the table to be broadcast is too large, the following error occurs.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/my/current/spark3-client/python/pyspark/sql/session.py", line 1440, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery, litArgs), self)
  File "/usr/my/current/spark3-client/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1323, in __call__
  File "/usr/my/current/spark3-client/python/pyspark/errors/exceptions/captured.py", line 169, in deco
    return f(*a, **kw)
  File "/usr/my/current/spark3-client/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o84.sql.
: org.apache.spark.SparkException: Cannot broadcast the table that is larger than 8.0 GiB: 11.3 GiB.
    at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotBroadcastTableOverMaxTableBytesError(QueryExecutionErrors.scala:2366)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.$anonfun$relationFuture$1(BroadcastExchangeExec.scala:163)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:217)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
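The best fix is to tune the join query itself. If that is not feasible, note that the 8 GiB cap is hard-coded in this Spark version and cannot be raised, so the table cannot be broadcast as it is. Instead, set the value below to -1, which disables automatic broadcast joins and lets Spark fall back to another join strategy (typically a sort-merge join). If the query carries an explicit broadcast hint, remove it as well, because hints bypass this threshold.
Most spark.sql.* settings, including this one, can be changed at runtime.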
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
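If the setting should apply to just one problematic query and not the whole session, it can be wrapped so that the previous value is restored afterwards. The following is only a minimal sketch: the helper name auto_broadcast_disabled is hypothetical (it is not part of PySpark), and it assumes spark is an existing SparkSession such as the one provided by the pyspark shell.

```python
from contextlib import contextmanager

# Config key taken from the post.
KEY = "spark.sql.autoBroadcastJoinThreshold"


@contextmanager
def auto_broadcast_disabled(spark):
    """Temporarily disable automatic broadcast joins on an existing session.

    `spark` is assumed to be a SparkSession; the helper name is hypothetical.
    """
    previous = spark.conf.get(KEY)  # current threshold, returned as a string
    spark.conf.set(KEY, "-1")       # -1 turns automatic broadcast joins off
    try:
        yield spark
    finally:
        spark.conf.set(KEY, previous)  # restore the earlier threshold
```

Usage would look like "with auto_broadcast_disabled(spark): spark.sql(query).count()". Because DataFrames are evaluated lazily, the action that actually runs the join (count, write, collect, and so on) should be called inside the with block; otherwise the query may be planned after the threshold has already been restored.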