Posts in the 'Conquering Data Engineering/Spark' category

List: Conquering Data Engineering/Spark (8)

지구정복

[Spark] Spark Streaming (DStreams) Basic Concepts

This post is a translation of the official documentation, based on Spark 3.5.5. https://spark.apache.org/docs/latest/streaming-programming-guide.html 1. Note: Spark Streaming is now legacy and no longer in use. Spark Structured Streaming is used instead. For its programming guide, see: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html 2. Overview: Real-time data arriving from various sources (e.g. Kafka, Kinesis, TCP sockets, etc.) can be processed with complex algorithms or various functions (e.g. map, reduc..

Conquering Data Engineering/Spark 2025. 4. 23. 16:56

[Spark] Let's Collect Frequently Used PySpark Snippets!

pyspark3 --master yarn --deploy-mode client \
  --executor-memory 20g \
  --executor-cores 5 --num-executors 30 \
  --conf spark.pyspark.python=/usr/bin/python3.7 \
  --conf spark.pyspark.driver.python=/usr/bin/python3.7 \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python3.7 \
  --conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=/usr/bin/python3.7 \
  --conf spark.driver.maxResultSize=3g --dr..

Conquering Data Engineering/Spark 2025. 4. 22. 14:03

[Spark & Hive] Error When Writing to a Hive Managed Table from Spark | org.apache.hadoop.hive.ql.metadata.HiveException: Load Data failed for {temporary file path} as the file is not owned by hive and load data is also not ran as hive

Using Spark 3.4.1 and Hive 3.1.3. Write operations (INSERT, etc.) from Spark to a Hive managed table fail with an error. The query I ran is as follows.
spark = SparkSession.builder \
    .enableHiveSupport() \
    .getOrCreate()
# Create a Hive managed table from PySpark
q = """
CREATE TABLE test.user_info (
    id INT,
    name STRING,
    age INT
)
STORED AS PARQUET
"""
spark.sql(q).show()
# Run the INSERT query
q = """
INSERT INTO test.user_info2 (id, name, age)
VALUES (1..

Conquering Data Engineering/Spark 2025. 4. 22. 13:46

[Spark] Using Dynamic Allocation

Using Spark 3.4.1 with YARN as the resource manager. Dynamic allocation applies only to executors. The settings below must be added to spark-defaults.conf. If they are added to spark-thrift-sparkconf.conf instead, they apply only to Spark jobs run through the Thrift server. Official documentation (3.4.1): https://archive.apache.org/dist/spark/docs/3.4.1/configuration.html#dynamic-allocation https://archive.apache.org/dist/spark/docs/3.4.1/job-scheduling.html There are two ways to configure Dynamic Allocation. Th..

Conquering Data Engineering/Spark 2025. 4. 18. 16:43
