
HDFS (Hadoop Distributed File System): stores data across the cluster

MapReduce: processes the data on the cluster

 

RDD (Resilient Distributed Dataset)

- Core data structure in Spark

- Distributed, resilient, immutable (cannot be modified once created)

- Lazily evaluated: computed only when an evaluation command (an action) is issued

- Abstract data set

- Distribution is handled by the system.

- When a fault occurs, the system recovers it automatically.
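
A minimal sketch of these properties, assuming a running SparkContext named sc (sc is not defined in the original notes):

rdd = sc.parallelize([1, 2, 3, 4, 5])  # distribute a local list across the cluster
doubled = rdd.map(lambda x: x * 2)     # returns a new RDD; the original is immutable
print(doubled.collect())               # action triggers evaluation -> [2, 4, 6, 8, 10]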

 

Big Issues in Distributed Systems

Fault tolerance: a way to recover automatically when a machine in the distributed system fails

Hadoop: multiple copies (replication)

Spark: lineage (recompute lost partitions from the recorded transformation history)
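
Lineage can be inspected on any RDD; a small sketch, again assuming a SparkContext sc:

words = sc.parallelize(["a b", "b c"]).flatMap(lambda line: line.split())
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda x, y: x + y)
print(counts.toDebugString())  # the chain of transformations Spark replays if a partition is lost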

 

map(), filter(), reduce(): functions that transform a list; map() and filter() return a new list, while reduce() folds the list into a single value (see the reduce() sketch below)

map()

list [x, y, z] -> [f(x), f(y), f(z)] : each element transformed by f

filter()

list [x, y, z] -> [x, y] : keeps only the elements for which the condition is True
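
Since reduce() gets no example later in these notes, a small sketch (in Python 3 it lives in functools):

from functools import reduce

total = reduce(lambda x, y: x + y, [1, 2, 3, 4, 5])  # ((((1+2)+3)+4)+5)
print(total)  # 15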

 

groupBy(): groups elements by key.

lambda function: a small anonymous function

Purpose: used when you do not want a named function to stay around in memory

 

** map() Example

items = [1, 2, 3, 4, 5]
squared = list(map(lambda x: x**2, items))
print(squared)  # [1, 4, 9, 16, 25]

** map() Example - tuple / set

names = ['krunal', 'ankit', 'rushabh', 'dhaval', 'nehal']
convertedTuple = tuple(map(lambda s: str(s).upper(), names))
print(convertedTuple)  # ('KRUNAL', 'ANKIT', 'RUSHABH', 'DHAVAL', 'NEHAL')

strings = ['krunal', 'ankit', 'rushabh', 'dhaval', 'nehal']
convertedSet = set(map(lambda s: str(s).upper(), strings))
print(convertedSet)  # same five strings; set order is arbitrary

 

filter() function

Filter extracts each element in the sequence for which the function returns True

Syntax : filter(function, iterable)
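
A short example in the same style as the map() snippets above:

numbers = [1, 2, 3, 4, 5, 6]
evens = list(filter(lambda x: x % 2 == 0, numbers))
print(evens)  # [2, 4, 6]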

 

* range(): does not allocate memory for all its values; elements are generated on demand (lazy)
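
A quick illustration, using sys.getsizeof to show that the range object itself stays tiny:

import sys

r = range(10**9)          # a billion values, none of them materialized
print(sys.getsizeof(r))   # small constant size, e.g. 48 bytes
print(r[123456789])       # elements are computed on demand -> 123456789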

 

reduceByKey(), groupByKey() Example
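
The snippets below assume a pair RDD named wordPairsRDD; a minimal setup sketch (the contents are an assumption, sc is a running SparkContext):

wordPairsRDD = sc.parallelize([('a', 1), ('b', 1), ('a', 1), ('c', 1), ('b', 1)])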

wordsCountWithReduce = wordPairsRDD.reduceByKey(lambda x, y: x + y).collect()
print(wordsCountWithReduce)

wordsCountsWithGroup = wordPairsRDD.groupByKey().map(lambda x: (x[0], sum(x[1]))).collect()
print(wordsCountsWithGroup)

The two snippets above are two ways of producing the same result; reduceByKey is generally preferred because it combines values within each partition before shuffling.

 

Lazy Evaluation

The same holds for the other transformations - they are lazy:

they compute their results only when an action accesses them.
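
A sketch of this behavior, assuming sc:

rdd = sc.parallelize(range(10))
mapped = rdd.map(lambda x: x * 10)          # nothing runs yet
filtered = mapped.filter(lambda x: x > 30)  # still nothing
print(filtered.collect())                   # the action runs the whole pipeline -> [40, 50, 60, 70, 80, 90]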

 

cogroup()

Given two keyed RDDs, groups all values with the same key;

returns a triple (k, X-values, Y-values) for every key k, where X-values are all values found under k in X, and Y-values likewise for Y.
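
A minimal sketch, assuming sc (the result's iterables are wrapped in lists to make them printable):

X = sc.parallelize([('a', 1), ('a', 2), ('b', 3)])
Y = sc.parallelize([('a', 'x'), ('c', 'z')])
grouped = X.cogroup(Y).mapValues(lambda v: (list(v[0]), list(v[1])))
print(grouped.collect())  # [('a', ([1, 2], ['x'])), ('b', ([3], [])), ('c', ([], ['z']))] (order may vary)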

 

join()

Given two keyed RDDs, returns all matching items in the two datasets:

(k, (x, y)) for every (k, x) in X and (k, y) in Y sharing the key k

Variants: leftOuterJoin, rightOuterJoin, fullOuterJoin
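
A small sketch of the inner and left outer variants, assuming sc:

X = sc.parallelize([('a', 1), ('b', 2)])
Y = sc.parallelize([('a', 'x'), ('c', 'z')])
print(X.join(Y).collect())           # inner: [('a', (1, 'x'))]
print(X.leftOuterJoin(Y).collect())  # keeps every key of X: [('a', (1, 'x')), ('b', (2, None))]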

 

Drivers and Executors

The driver delegates tasks to executors to use cluster resources.

In local mode, the executors are co-located with the driver.

In cluster mode, the executors run on other machines.
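
A hedged sketch of how this mode is chosen at startup (the app name and manager URLs are placeholder assumptions):

from pyspark import SparkConf, SparkContext

# local mode: driver and executors share one machine; 'local[*]' uses all cores
conf = SparkConf().setAppName('demo').setMaster('local[*]')
sc = SparkContext(conf=conf)

# for cluster mode, setMaster would instead point at a cluster manager,
# e.g. 'spark://host:7077' (standalone) or 'yarn'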

 
