Drunken DWCraft: Druid Ingestion & Querying

[ Druid Data set ]

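Druid stores all data in time-partitioned units called segments. Each segment is made up of three required elements: a timestamp, dimensions, and measures.

- Timestamp: every query is executed against time
- Dimension: the string attributes of an event
- Measure: the columns that are actually aggregated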

[ Data Ingestion ]

1. Land the data in an HDFS directory (browsable at http://bisnapshotm01.ssgbi.com:50070/explorer.html#/) via a CREATE EXTERNAL TABLE statement.

2. Submit the data load query, written in JSON, to the overlord node (port 8090).

> disp_ctg_load.json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "/apps/hive/warehouse/snapshot.db/disp_ctg_item_all_column/"
      }
    },
    "dataSchema": {
      "dataSource": "DISP_CTG_ITEM__ORG_ITEM",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "day",
        "queryGranularity": "day",
        "intervals": [ "2017-01-01/2018-01-01" ]
      },
      "parser": {
        "type": "hadoopyString",
        "parseSpec": {
          "format": "tsv",
          "columns": [
            "기준일자",
            "전시카테고리ID",
            "표준카테고리ID",
            "상품ID",
            .........
          ],
          "delimiter": "\t",
          "dimensionsSpec": {
            "dimensions": [
              "기준일자",
              "전시카테고리ID",
              "표준카테고리ID",
              ....
            ]
          },
          "timestampSpec": {
            "format": "auto",
            "column": "기준일자"
          }
        }
      },
      "metricsSpec": [
        {
          "name": "상품ID",
          "type": "hyperUnique",
          "fieldName": "상품ID"
        }
      ]
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "type": "hashed",
        "targetPartitionSize": 5000000
      },
      "jobProperties": { }
    }
  }
}

> http curl command
curl -X 'POST' -H 'Content-Type:application/json' -d @disp_ctg_load.json bisnapshotd01.ssgbi.com:8090/druid/indexer/v1/task
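
For scripted loads, the same submission can be made from Python. This is a minimal sketch using only the standard library: it mirrors the curl call above, and the helper name `build_task_request` is made up for illustration. The spec file has to be complete, valid JSON (the elided column lists above would need to be filled in) before `json.load` can read it.

```python
import json
import urllib.request

# Overlord endpoint taken from the curl command above.
OVERLORD_URL = "http://bisnapshotd01.ssgbi.com:8090/druid/indexer/v1/task"


def build_task_request(spec: dict, url: str = OVERLORD_URL) -> urllib.request.Request:
    """Build (but do not send) the POST request that submits an ingestion task."""
    body = json.dumps(spec).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage sketch (needs network access to the overlord):
# with open("disp_ctg_load.json", encoding="utf-8") as f:
#     spec = json.load(f)
# with urllib.request.urlopen(build_task_request(spec)) as resp:
#     print(resp.read().decode("utf-8"))
```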

Note that queryGranularity must be equal to or finer than segmentGranularity (in terms of duration, segmentGranularity >= queryGranularity).
(granularity: all, none, second, minute, fifteen_minute, thirty_minute, hour, day, week, month, quarter, and year.)

After the task is submitted to the overlord, two MapReduce jobs run.
1) Determine partitions
: Decides the number of shards per segment based on the configuration (granularity, tuningConfig, and so on).

2) Index Generation
: Builds the index segments (segment creation).
Data is sharded by time unit, and each shard of data is called a segment.
Once this step finishes, the data source can be queried through the broker node.

* During ingestion the map and reduce tasks failed with memory errors, so I raised each to 16 GB.
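
The post does not say where the memory was raised. One possibility is the `jobProperties` block of the ingestion spec, which passes Hadoop properties through to the MapReduce jobs. The sketch below generates such a block; the 0.8 heap fraction and the helper name are assumptions for illustration, not values from the original setup.

```python
import json


def memory_job_properties(container_mb: int, heap_fraction: float = 0.8) -> dict:
    """Return Hadoop memory settings for both map and reduce tasks.

    heap_fraction is an assumed rule of thumb: the JVM heap is kept below
    the container size to leave room for non-heap memory.
    """
    heap_mb = int(container_mb * heap_fraction)
    return {
        "mapreduce.map.memory.mb": str(container_mb),
        "mapreduce.reduce.memory.mb": str(container_mb),
        "mapreduce.map.java.opts": f"-Xmx{heap_mb}m",
        "mapreduce.reduce.java.opts": f"-Xmx{heap_mb}m",
    }


# 16 GB per task, as mentioned in the note above.
props = memory_job_properties(16 * 1024)
print(json.dumps({"jobProperties": props}, indent=2))
```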

3. Delete data
: This takes three steps.

1) disable datasource
curl -X 'DELETE' "bisnapshotd01.ssgbi.com:8081/druid/coordinator/v1/datasources/DISP_CTG_ITEM__ORG_ITEM"

2) delete data - the interval to delete must be specified. (A kill task shows up on the coordinator.)
curl -X 'DELETE' bisnapshotd01.ssgbi.com:8081/druid/coordinator/v1/datasources/DISP_CTG_ITEM__ORG_ITEM/intervals/2017-01-01T00:00:00Z_2018-01-01T00:00:00Z

3) enable datasource
curl -X 'POST' "bisnapshotd01.ssgbi.com:8081/druid/coordinator/v1/datasources/DISP_CTG_ITEM__ORG_ITEM"
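
The three calls differ only in method and path, so they can be generated from one helper. A small sketch (the function name `delete_steps` is made up for illustration); note how the `/` in a Druid interval becomes `_` in the URL path, as in the curl command for step 2.

```python
def delete_steps(coordinator: str, datasource: str, interval: str):
    """Return the (HTTP method, URL) pairs for the disable/delete/enable sequence."""
    base = f"http://{coordinator}/druid/coordinator/v1/datasources/{datasource}"
    # "start/end" cannot appear as-is in a URL path segment, so use "start_end".
    interval_path = interval.replace("/", "_")
    return [
        ("DELETE", base),                                # 1) disable datasource
        ("DELETE", f"{base}/intervals/{interval_path}"),  # 2) delete data
        ("POST", base),                                  # 3) enable datasource
    ]


for method, url in delete_steps(
    "bisnapshotd01.ssgbi.com:8081",
    "DISP_CTG_ITEM__ORG_ITEM",
    "2017-01-01T00:00:00Z/2018-01-01T00:00:00Z",
):
    print(method, url)
```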

* Granularity

> The following three are the granularity types available at query time
1) simple granularity
: all, none, second, minute, fifteen_minute, thirty_minute, hour, day, week, month, quarter and year

2) duration granularity
"granularity" : {"type" : "duration", "duration": 3600000, "origin": "2012-01-01T00:30:00Z"}
-> Aggregates in 1-hour buckets, counted from "2012-01-01T00:30:00Z"

3) period granularity
"granularity" : {"type" : "period", "period": "P3M", "origin": "2012-01-01T00:00:00Z"}
-> Aggregates in 3-month buckets, counted from "2012-01-01T00:00:00Z"
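
To make the role of `origin` concrete, here is a rough sketch of how a duration granularity assigns a timestamp to a bucket: buckets are fixed-length windows counted from the origin. This illustrates the arithmetic only and is not Druid's implementation; the function name is invented.

```python
from datetime import datetime, timedelta, timezone


def duration_bucket_start(ts: datetime, duration_ms: int, origin: datetime) -> datetime:
    """Return the start of the fixed-length bucket (counted from origin) containing ts."""
    duration = timedelta(milliseconds=duration_ms)
    # Floor division of two timedeltas gives the whole number of buckets elapsed.
    buckets_elapsed = (ts - origin) // duration
    return origin + buckets_elapsed * duration


origin = datetime(2012, 1, 1, 0, 30, tzinfo=timezone.utc)
event = datetime(2012, 1, 1, 2, 10, tzinfo=timezone.utc)
# With 1-hour buckets starting at 00:30, an event at 02:10 falls in the 01:30 bucket.
print(duration_bucket_start(event, 3600000, origin).isoformat())
```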

> The following are the segmentGranularity values available at data ingestion time

Enum constants:
`ALL`
`DAY`
`FIFTEEN_MINUTE`
`FIVE_MINUTE`
`HOUR`
`MINUTE`
`MONTH`
`NONE`
`QUARTER`
`SECOND`
`SIX_HOUR`
`TEN_MINUTE`
`THIRTY_MINUTE`
`WEEK`
`YEAR`

=> "granularitySpec": {
     "type": "period",
     "segmentGranularity": {"type":"period", "period":"P3D"}, -- ISO 8601 format
     "queryGranularity": "day",
     "intervals": [ "2017-01-01/2017-01-08" ]
   },

~~For some reason, ingesting ten days of data eats up a lot of map/reduce memory. Is it because of the period?~~
~~(Loading one week of data on the dev server took 1:46:33.)~~

> I loaded 2017-01-01/2017-01-08 with P3D, but the segments actually came out split at 2016-12-31 / 2017-01-03 / 2017-01-06.
> To split into 3-day segments starting from 2017-01-01, set "origin": "2017-01-01" in segmentGranularity.
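
The split at 2016-12-31 is what you get if 3-day buckets are counted from the Unix epoch (1970-01-01) when no origin is given. That default is an assumption here, but it reproduces the observed boundaries exactly. A small sketch with an invented helper name:

```python
from datetime import date, timedelta


def period_bucket_starts(start: date, end: date, period_days: int,
                         origin: date = date(1970, 1, 1)) -> list:
    """List the start dates of the N-day buckets (counted from origin) overlapping [start, end)."""
    # Step back from `start` to the nearest bucket boundary at or before it.
    offset = (start - origin).days % period_days
    current = start - timedelta(days=offset)
    starts = []
    while current < end:
        starts.append(current)
        current += timedelta(days=period_days)
    return starts


# Assumed default origin (epoch): boundaries start one day before the data.
print(period_bucket_starts(date(2017, 1, 1), date(2017, 1, 8), 3))
# Explicit origin: boundaries start on 2017-01-01.
print(period_bucket_starts(date(2017, 1, 1), date(2017, 1, 8), 3, origin=date(2017, 1, 1)))
```

With the epoch origin the starts are 2016-12-31, 2017-01-03, and 2017-01-06; with origin 2017-01-01 they become 2017-01-01, 2017-01-04, and 2017-01-07.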

groupBy query results:

/*
When loaded with 3-day segments:
*/

1)
SELECT 기준일자, COUNT(DISTINCT 상품ID) FROM HDFS_DISP_CTG_ITEM_ALL_COLUMN_20180403 WHERE 기준일자 BETWEEN '20170101' AND '20170107' GROUP BY 기준일자 ORDER BY 기준일자

{
  "queryType": "groupBy",
  "dataSource": "DISP_CTG_ITEM__ORG_ITEM_0403",
  "dimensions": [],
  "granularity": "day",
  "aggregations": [
    {
      "type": "distinctCount",
      "name": "상품수",
      "fieldName": "상품ID"
    }
  ],
  "intervals": [
    "2017-01-01T00:00:00/2017-01-08T00:00:00"
  ]
}

-- Error rate: 0%

2)
SELECT COUNT(DISTINCT 상품ID) FROM HDFS_DISP_CTG_ITEM_ALL_COLUMN_20180403 WHERE 기준일자 BETWEEN '20161231' AND '20170102'

{
  "queryType": "groupBy",
  "dataSource": "DISP_CTG_ITEM__ORG_ITEM_0403",
  "dimensions": [],
  "granularity": {"type":"period", "period":"P3D"},
  "aggregations": [
    {
      "type": "distinctCount",
      "name": "상품수",
      "fieldName": "상품ID"
    }
  ],
  "intervals": [
    "2017-01-01T00:00:00/2017-01-08T00:00:00"
  ]
}

-- Error rate: 0%

3) Querying with a GROUP BY granularity smaller than segmentGranularity (P3D) that divides it evenly
Error rate 0% -> for example, with segmentGranularity=P6D, grouping by 1, 2, 3, or 6 days gives 100% consistency

"granularity": {"type":"period", "period":"P1D"},

SELECT COUNT(DISTINCT 상품ID) FROM HDFS_DISP_CTG_ITEM_ALL_COLUMN_20180403 WHERE 기준일자 ='20170101'

-- Error rate: 0%
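
The rule observed in 3) amounts to listing the divisors of the segment period. The sketch below only restates that observation from these tests; it is not a guarantee documented by Druid, and the function name is made up.

```python
from typing import List


def exact_query_periods(segment_days: int) -> List[int]:
    """Return the day counts that divide the segment period evenly."""
    return [d for d in range(1, segment_days + 1) if segment_days % d == 0]


print(exact_query_periods(6))  # the P6D example from the text
print(exact_query_periods(3))  # the P3D segments used in these tests
```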

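* ISO 8601 format
Duration: starts with the period designator P, which marks the beginning of the duration expression. T is the time designator that precedes the time components.
= P<date>T<time>
-> Written as P[n]Y[n]M[n]DT[n]H[n]M[n]S or P[n]W.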
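
As a quick way to check a period string before putting it in a spec, the pattern above can be expressed as a regular expression. This is a simplified sketch that accepts integer components only (ISO 8601 also allows fractions); the function name is invented.

```python
import re

# Either P[n]W, or P[n]Y[n]M[n]D optionally followed by T[n]H[n]M[n]S.
_DURATION_RE = re.compile(
    r"^P(?:"
    r"(?P<weeks>\d+)W"
    r"|"
    r"(?:(?P<years>\d+)Y)?(?:(?P<months>\d+)M)?(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?"
    r")$"
)


def parse_iso8601_duration(text: str) -> dict:
    """Split an ISO 8601 duration such as 'P3D' or 'PT1H' into its components."""
    match = _DURATION_RE.match(text)
    # A bare "P" or a dangling "T" has no components and is not a valid duration.
    if match is None or text == "P" or text.endswith("T"):
        raise ValueError(f"not a valid ISO 8601 duration: {text!r}")
    return {name: int(value) for name, value in match.groupdict().items() if value is not None}


print(parse_iso8601_duration("P3D"))   # the segmentGranularity used above
print(parse_iso8601_duration("P3M"))   # M before T means months
print(parse_iso8601_duration("PT30M")) # M after T means minutes
```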

[ Querying ]
: Queries are sent to the Broker node (port 8082) over HTTP REST. (Queries are written in JSON.)

curl -X POST 'bisnapshotm01.ssgbi.com:8082/druid/v2/?pretty' -H 'Content-Type:application/json' -d @query.json -w %{time_total} > out.json
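
The query file can also be generated in code. The sketch below builds the same kind of groupBy query used in the tests above and wraps it in a POST request for the broker; the helper names are made up, and the aggregator is copied from the earlier query as-is.

```python
import json
import urllib.request

# Broker endpoint taken from the curl command above.
BROKER_URL = "http://bisnapshotm01.ssgbi.com:8082/druid/v2/?pretty"


def group_by_query(datasource: str, interval: str, granularity="day") -> dict:
    """Build a groupBy query that counts distinct 상품ID values per time bucket."""
    return {
        "queryType": "groupBy",
        "dataSource": datasource,
        "dimensions": [],
        "granularity": granularity,
        "aggregations": [
            {"type": "distinctCount", "name": "상품수", "fieldName": "상품ID"}
        ],
        "intervals": [interval],
    }


def build_query_request(query: dict, url: str = BROKER_URL) -> urllib.request.Request:
    """Build (but do not send) the POST request carrying the JSON query."""
    return urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


query = group_by_query(
    "DISP_CTG_ITEM__ORG_ITEM_0403",
    "2017-01-01T00:00:00/2017-01-08T00:00:00",
    granularity={"type": "period", "period": "P3D"},
)
request = build_query_request(query)
# Sending it needs network access to the broker:
# with urllib.request.urlopen(request) as resp:
#     print(resp.read().decode("utf-8"))
```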

The JSON query format is covered in the post 'Druid Distinct count 성능테스트' (Druid distinct count performance test).

* References
http://druid.io/docs/0.12.0/design/index.html
https://www.slideshare.net/freepsw/olap-for-big-data-druid-vs-apache-kylin-vs-apache-lens
http://www.popit.kr/time-series-olap-druid-%EC%9E%85%EB%AC%B8/

Wednesday, March 28, 2018
