HDFS (Hadoop Distributed File System): stores data across the cluster
MapReduce: processes the data stored in the cluster
RDD (Resilient Distributed Dataset)
- Core data structure in Spark
- Distributed, resilient, immutable (cannot be modified after creation)
- Lazily evaluated: nothing is computed until an evaluation command (an action such as collect()) is issued
- Abstract data set
- Distribution is handled by the system.
- When a fault occurs, the system recovers from it.
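Big Issues in Distributed Systems
Fault tolerance: how to recover automatically when one of the distributed machines fails
Hadoop: multiple copies (data blocks are replicated)
Spark: lineage (lost partitions are recomputed from the recorded chain of transformations)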
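To make the lineage idea concrete, here is a minimal plain-Python sketch. `ToyRDD` is a hypothetical class invented for illustration and is not part of Spark; it only assumes that an RDD-like object can remember its source data plus the list of transformations applied to it, and replay them on demand.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: immutable, lazy, rebuilt from lineage."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)     # original input, never modified
        self._lineage = tuple(lineage)   # recorded transformations, in order

    def map(self, f):
        # Returns a new ToyRDD; only the recipe grows, nothing is computed.
        return ToyRDD(self._source, self._lineage + (("map", f),))

    def filter(self, pred):
        return ToyRDD(self._source, self._lineage + (("filter", pred),))

    def collect(self):
        # The action: replay the whole lineage over the source data.
        data = list(self._source)
        for kind, fn in self._lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data


base = ToyRDD([1, 2, 3, 4, 5])
squares_of_odds = base.filter(lambda x: x % 2 == 1).map(lambda x: x * x)

first = squares_of_odds.collect()
# Simulated recovery: no result was stored, so a lost result is simply replayed.
recovered = squares_of_odds.collect()
print(first, recovered)  # [1, 9, 25] [1, 9, 25]
```

Because the recipe is stored rather than the result, no second copy of the data is needed to rebuild a lost result, which is the contrast with Hadoop's replication noted above.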
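map(), filter(), reduce(): functions that operate on a list. map() and filter() turn a list into a new list; reduce() folds a list into a single value.
map: list [x, y, z] -> [f(x), f(y), f(z)] (modified list)
filter: list [x, y, z] -> [x, y] (only the elements for which the condition is true)
groupBy(): groups elements by key.
lambda function: a small anonymous function
Purpose: for a short one-off function that does not need a name kept around in memory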
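** map() Example
items = [1, 2, 3, 4, 5]
# map() is lazy in Python 3, so list() is needed to materialize the result
squared = list(map(lambda x: x**2, items))  # [1, 4, 9, 16, 25]
** map() Example - tuple / set
names = ['krunal', 'ankit', 'rushabh', 'dhaval', 'nehal']
# collect the mapped values into a tuple (order preserved)
convertedTuple = tuple(map(lambda s: s.upper(), names))
strings = ['krunal', 'ankit', 'rushabh', 'dhaval', 'nehal']
# collect the mapped values into a set (unordered, duplicates removed)
convertedSet = set(map(lambda s: s.upper(), strings))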
filter() function
filter() extracts each element of the sequence for which the function returns True.
Syntax: filter(function, iterable)
* range(): does not allocate memory for the whole sequence; values are produced on demand
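A short sketch of filter() over a range(), followed by reduce() from the standard functools module. The variable names are chosen only for illustration.

```python
from functools import reduce

numbers = range(1, 11)  # lazy: no list of ten elements is built here

# filter() keeps the elements for which the lambda returns True.
evens = list(filter(lambda x: x % 2 == 0, numbers))

# reduce() folds the list into one value by applying the lambda pairwise.
total = reduce(lambda x, y: x + y, evens)

print(evens)  # [2, 4, 6, 8, 10]
print(total)  # 30
```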
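reduceByKey(), groupByKey() Example
# wordPairsRDD is assumed to be an RDD of (word, 1) pairs
wordCountsWithReduce = wordPairsRDD.reduceByKey(lambda x, y: x + y).collect()
wordCountsWithGroup = wordPairsRDD.groupByKey().map(lambda x: (x[0], sum(x[1]))).collect()
The two lines above are two ways of producing the same result. reduceByKey() combines values within each partition before the shuffle, so it usually moves less data than groupByKey().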
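The equivalence of the two approaches can be checked without a cluster. The sketch below imitates both operations on an ordinary Python list; `reduce_by_key` and `group_by_key` are hypothetical helpers written for this note, not Spark functions, and the sample pairs are made up.

```python
from collections import defaultdict

word_pairs = [("one", 1), ("two", 1), ("two", 1),
              ("three", 1), ("three", 1), ("three", 1)]

def reduce_by_key(pairs, f):
    # Merge values per key as they arrive; only one running value per key is kept.
    acc = {}
    for k, v in pairs:
        acc[k] = f(acc[k], v) if k in acc else v
    return sorted(acc.items())

def group_by_key(pairs):
    # Collect every value per key first; aggregation happens afterwards.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return sorted(groups.items())

counts_with_reduce = reduce_by_key(word_pairs, lambda x, y: x + y)
counts_with_group = [(k, sum(vs)) for k, vs in group_by_key(word_pairs)]

print(counts_with_reduce)  # [('one', 1), ('three', 3), ('two', 2)]
print(counts_with_reduce == counts_with_group)  # True
```

The grouping version holds every value in memory before summing, while the reducing version keeps a single running value per key.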
Lazy Evaluation
The same holds for the other transformations: they are lazy and compute their result only when it is accessed.
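Python's built-in map() behaves the same way, which makes the idea easy to observe locally. In this sketch `traced_square` is a hypothetical helper that records each call so we can see when the work actually happens.

```python
calls = []

def traced_square(x):
    calls.append(x)  # record each real invocation
    return x * x

lazy = map(traced_square, [1, 2, 3])  # builds an iterator, runs nothing yet
calls_before = len(calls)

result = list(lazy)                   # consuming the iterator triggers the work
calls_after = len(calls)

print(calls_before, result, calls_after)  # 0 [1, 4, 9] 3
```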
cogroup()
Given two keyed RDDs X and Y, groups all values that share the same key.
Returns a triple (k, X-values, Y-values) for every key k, where X-values are all values found under k in X and Y-values are all values found under k in Y.
join()
Given two keyed RDDs X and Y, returns all matching items in the two datasets:
a triple (k, x, y) for every (k, x) in X and (k, y) in Y.
Variants: leftOuterJoin, rightOuterJoin, fullOuterJoin
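The two operations described above correspond to cogroup and join in Spark. The sketch below imitates them on plain Python lists, using the flat triple form from these notes; PySpark itself nests the results as (k, (x_values, y_values)) and (k, (x, y)). All function names and sample data here are hypothetical.

```python
from collections import defaultdict

def cogroup(xs, ys):
    # Group the values of both datasets under each key.
    gx, gy = defaultdict(list), defaultdict(list)
    for k, v in xs:
        gx[k].append(v)
    for k, v in ys:
        gy[k].append(v)
    keys = sorted(set(gx) | set(gy))
    return [(k, gx.get(k, []), gy.get(k, [])) for k in keys]

def join(xs, ys):
    # Inner join: one triple per matching (x, y) combination under a key.
    return [(k, x, y)
            for k, x_vals, y_vals in cogroup(xs, ys)
            for x in x_vals for y in y_vals]

def left_outer_join(xs, ys):
    # Keep every x; use None when the key has no match in ys.
    return [(k, x, y)
            for k, x_vals, y_vals in cogroup(xs, ys)
            for x in x_vals for y in (y_vals or [None])]

X = [("a", 1), ("b", 2), ("a", 3)]
Y = [("a", "p"), ("c", "q")]

print(cogroup(X, Y))          # [('a', [1, 3], ['p']), ('b', [2], []), ('c', [], ['q'])]
print(join(X, Y))             # [('a', 1, 'p'), ('a', 3, 'p')]
print(left_outer_join(X, Y))  # [('a', 1, 'p'), ('a', 3, 'p'), ('b', 2, None)]
```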
Drivers and Executors
The driver delegates tasks to executors so that the work uses the cluster's resources.
In local mode, executors are collocated with the driver.
In cluster mode, executors are located on other machines.
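As a rough local analogy only (this is not how Spark schedules work internally), the delegation pattern can be sketched with a thread pool from Python's standard library: the main program plays the driver and the pool's workers play collocated executors. `run_task` and the partitioning scheme are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def run_task(partition):
    # Work done on an "executor": sum of squares of one partition.
    return sum(x * x for x in partition)

data = list(range(1, 9))
# Split the data into 4 partitions of 2 elements each.
partitions = [data[i:i + 2] for i in range(0, len(data), 2)]

# The "driver" hands one task per partition to the pool, then combines the results.
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_sums = list(pool.map(run_task, partitions))

total = sum(partial_sums)
print(partial_sums, total)  # [5, 25, 61, 113] 204
```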