Spark Core

RDD, Lazy Execution, Distributed Data Processing, Transformation

RDD (Resilient Distributed Dataset)

An RDD is a distributed object whose data is partitioned across the worker nodes of a cluster. It is immutable, and because it remembers the sequence of operations that produced it, a lost partition can be rebuilt. This is what gives an RDD its fault-recovery capability.

An RDD is a fault-tolerant dataset that can keep intermediate results in memory, and its data is manipulated through the API the RDD provides. RDDs are typically used when you need the low-level transformation and action APIs, or when processing unstructured data. For structured data, use a Dataset instead.
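The two properties above (immutability and recovery from recorded operations) can be sketched with a toy model in plain Python. This is not the Spark API: `ToyRDD`, `compute_partition`, and the sample data are hypothetical names chosen for illustration, and a real RDD tracks far more (partitioners, dependencies, storage levels).

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ToyRDD:
    """Toy stand-in for an RDD: immutable, and it remembers how it was derived."""
    source: Optional[tuple] = None      # partitions of the base data (root only)
    parent: Optional["ToyRDD"] = None   # upstream dataset (derived only)
    fn: Optional[Callable[[list], list]] = None  # per-partition function

    def map(self, f):
        # Returns a NEW dataset; the receiver is never modified.
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def compute_partition(self, i):
        # Rebuild partition i by replaying the recorded operations from the root.
        if self.parent is None:
            return list(self.source[i])
        return self.fn(self.parent.compute_partition(i))


base = ToyRDD(source=((1, 2), (3, 4)))
plus_one = base.map(lambda x: x * 2).map(lambda x: x + 1)

# Pretend these are partitions held in worker memory.
cache = {i: plus_one.compute_partition(i) for i in range(2)}
del cache[1]                               # simulate losing partition 1
cache[1] = plus_one.compute_partition(1)   # rebuilt from lineage alone
print(cache)  # {0: [3, 5], 1: [7, 9]}
```

Only the lost partition is recomputed; partition 0 is left untouched.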

Lazy Execution

With lazy execution, Spark first records how each RDD is derived (its lineage) and only then runs the job. Because the whole chain of operations is known before anything executes, Spark can take available resources into account and run the job in an optimized form.

When you write driver code and launch a Spark application, the driver analyzes the code and builds the job's execution plan (the lineage) from the transformations. The job itself is executed only when an action is invoked.
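The split between "record the plan" and "run on an action" can be illustrated with a minimal plain-Python sketch. `LazyPipeline` and `traced_square` are hypothetical names, not Spark classes; the sketch only mimics the behavior of transformations such as `map` and `filter` followed by an action such as `collect`.

```python
class LazyPipeline:
    """Records transformations; executes them only when an action is called."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = steps  # recorded plan: tuple of (kind, function) pairs

    def map(self, f):
        # Transformation: extend the plan, do no work yet.
        return LazyPipeline(self._data, self._steps + (("map", f),))

    def filter(self, pred):
        # Transformation: extend the plan, do no work yet.
        return LazyPipeline(self._data, self._steps + (("filter", pred),))

    def collect(self):
        # Action: only now is the recorded plan executed.
        out = []
        for x in self._data:
            keep = True
            for kind, f in self._steps:
                if kind == "map":
                    x = f(x)
                elif not f(x):
                    keep = False
                    break
            if keep:
                out.append(x)
        return out


calls = []  # records every element the map function actually sees


def traced_square(x):
    calls.append(x)
    return x * x


plan = LazyPipeline([1, 2, 3, 4]).map(traced_square).filter(lambda x: x > 4)
calls_before_action = len(calls)  # 0: nothing has run yet
result = plan.collect()           # [9, 16]
```

Building `plan` invokes `traced_square` zero times; the function runs only inside `collect()`.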

Distributed Data Processing

When Spark reads from a distributed file system such as HDFS, the data is already split into multiple partitions stored across the cluster, so transformations can be applied to each partition independently. Tasks are scheduled per partition with data locality in mind, and data is sent to other machines only when a shuffle is required.
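As a rough illustration of per-partition work with data locality, the following plain-Python sketch runs one task per partition and attributes it to the node that already holds the data. The block placement, node names, and `run_map_stage` are all hypothetical; a real scheduler also handles busy nodes, locality levels, and retries.

```python
# Hypothetical block placement: partition id -> (node holding it, records)
blocks = {
    0: ("node-a", [1, 2, 3]),
    1: ("node-b", [4, 5]),
    2: ("node-a", [6]),
}


def run_map_stage(blocks, f):
    """Apply f to each partition, one task per partition, on its own node."""
    results = {}
    tasks_per_node = {}
    for pid, (node, records) in blocks.items():
        # The task touches only this partition's records; nothing is transferred.
        results[pid] = [f(x) for x in records]
        tasks_per_node[node] = tasks_per_node.get(node, 0) + 1
    return results, tasks_per_node


results, tasks_per_node = run_map_stage(blocks, lambda x: x * 10)
print(results)         # {0: [10, 20, 30], 1: [40, 50], 2: [60]}
print(tasks_per_node)  # {'node-a': 2, 'node-b': 1}
```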

Narrow vs Wide transformation

RDDs manipulate large, distributed data through the transformation API (map, filter, join, etc.).

A narrow transformation does not shuffle. It works only on the data already present on the worker node, so it is fast, and if a partition is lost it can be rebuilt on the worker node itself.

A wide transformation requires a shuffle, so rebuilding lost data incurs network I/O cost. When using wide transformations, checkpointing the result may be worthwhile.

To optimize performance, use wide transformations with care and rely on narrow transformations as much as possible.
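The difference between the two kinds of transformation can be sketched in plain Python. This is a toy model, not Spark code: `map_values` and `group_by_key` are hypothetical helpers named after the operations they imitate, and the `ord`-based partitioner is an assumption chosen only to keep the output deterministic.

```python
from collections import defaultdict

# Two input partitions of (key, value) records.
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
]


def map_values(partitions, f):
    # Narrow: output partition i depends only on input partition i.
    return [[(k, f(v)) for k, v in part] for part in partitions]


def group_by_key(partitions, num_partitions):
    # Wide: an output partition may need records from every input partition.
    out = [defaultdict(list) for _ in range(num_partitions)]
    moved = 0  # records that end up in a different partition (shuffled)
    for src, part in enumerate(partitions):
        for k, v in part:
            dst = ord(k[0]) % num_partitions  # toy deterministic partitioner
            if dst != src:
                moved += 1
            out[dst][k].append(v)
    return [dict(d) for d in out], moved


scaled = map_values(partitions, lambda v: v * 10)
grouped, moved = group_by_key(partitions, num_partitions=2)

print(scaled)   # [[('a', 10), ('b', 20)], [('a', 30), ('c', 40)]]
print(grouped)  # [{'b': [2]}, {'a': [1, 3], 'c': [4]}]
print(moved)    # 1
```

In the narrow case a lost output partition needs only its own input partition. In the wide case, rebuilding output partition 1 requires reading both input partitions, which is where the network cost comes from.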

Next, let's look at narrow and wide transformations and the action API (operations).

References

http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
