Fine Dust Data Analysis (2)

멈추지 않는 달팽이 2024. 8. 30. 01:58

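Continuing from the previous post, this post builds a time-series forecasting model using data from overseas compact cities.

First, split the data into training and test sets at an 8:2 ratio. The rows are split in order rather than shuffled, so, assuming cleaned_data is sorted by date, the model is tested on the most recent 20% of the observations.

# train:test = 8:2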

train_size <- floor(0.8 * nrow(cleaned_data))
train_df <- cleaned_data[1:train_size, ]
test_df <- cleaned_data[(train_size + 1):nrow(cleaned_data), ]
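
As a small illustration of the split above, the following base R sketch applies the same indexing to a made-up ten-row data frame. The name toy_data and its values are hypothetical and are not the post's cleaned_data. It checks that every training date precedes every test date, which is what keeps future observations out of training.

```r
# Hypothetical toy data standing in for cleaned_data (already sorted by date)
toy_data <- data.frame(
  date = as.Date("2024-01-01") + 0:9,
  pm25 = c(12, 15, 11, 18, 22, 19, 25, 21, 17, 14)
)

# Same indexing as above: first 80% of rows for training, the rest for testing
toy_train_size <- floor(0.8 * nrow(toy_data))
toy_train <- toy_data[1:toy_train_size, ]
toy_test <- toy_data[(toy_train_size + 1):nrow(toy_data), ]

# Every training date should be earlier than every test date
max(toy_train$date) < min(toy_test$date)
```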

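Next, to use xgboost, separate the label (pm25) from the features and convert the features to matrix form.

library(dplyr)  # provides %>% and select()

# Extract the train labels, then drop the label and date columns from the features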

train_label <- train_df$pm25
train_df <- train_df %>% select(-c(pm25, date))

# Extract the test labels, then drop the label and date columns from the features

test_label <- test_df$pm25
test_df <- test_df %>% select(-c(pm25, date))

library(xgboost)

# Convert the features to a matrix, then wrap it in xgb.DMatrix, the input format xgboost expects

train_xgb <- xgb.DMatrix(as.matrix(train_df), label = train_label)
test_xgb <- xgb.DMatrix(as.matrix(test_df))
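
One thing worth checking before this conversion: as.matrix() returns a character matrix if any column is non-numeric, and xgboost cannot train on that. The base R sketch below shows a simple guard on a hypothetical toy data frame; the column names are made up for illustration.

```r
# Hypothetical feature table with one non-numeric column
toy_features <- data.frame(
  temp = c(21.5, 23.1, 19.8),
  humidity = c(40, 55, 61),
  station = c("A", "B", "A"),
  stringsAsFactors = FALSE
)

# Find columns that are not numeric
non_numeric <- names(toy_features)[!sapply(toy_features, is.numeric)]
non_numeric

# With the character column included, the whole matrix becomes character
is.character(as.matrix(toy_features))

# Keeping only numeric columns gives a numeric matrix
numeric_cols <- sapply(toy_features, is.numeric)
numeric_matrix <- as.matrix(toy_features[, numeric_cols, drop = FALSE])
is.numeric(numeric_matrix)
```

If non_numeric is not empty for your own data, encode or drop those columns before calling as.matrix().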

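To estimate good hyperparameter values, run a random search: ten randomly sampled parameter combinations are evaluated with repeated cross-validation.

This step takes a long time. If you need results quickly, lowering the number and repeats values in trainControl reduces the run time.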

random_params <- data.frame(
  nrounds = sample(50:500, 10),
  max_depth = sample(1:10, 10, replace = TRUE),
  eta = runif(10, min = 0.01, max = 0.3),
  gamma = runif(10, min = 0, max = 5),
  colsample_bytree = runif(10, min = 0.5, max = 1.0),
  min_child_weight = sample(1:10, 10, replace = TRUE),
  subsample = runif(10, min = 0.5, max = 1.0)
)

library(caret)

# Lower number and repeats in trainControl below to shorten the run

random_search <- train(
  x = as.matrix(train_df),
  y = train_label,
  method = "xgbTree",
  trControl = trainControl(method = "repeatedcv", number = 5, repeats = 7, verboseIter = TRUE),
  tuneGrid = random_params,
  verbose = TRUE
)
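
To see why lowering number and repeats helps, here is a rough count of how many models the search trains. It assumes each of the 10 sampled rows is fitted once per resample, plus one final refit on the full training data. caret may fit somewhat fewer models when rows differ only in nrounds, so treat this as an upper-bound sketch; count_fits is a hypothetical helper.

```r
# Rough count of model fits for repeated k-fold CV over a tuning grid
count_fits <- function(n_candidates, number, repeats) {
  # + 1 for the final refit on all training data
  n_candidates * number * repeats + 1
}

fits_current <- count_fits(n_candidates = 10, number = 5, repeats = 7)  # settings used above
fits_lighter <- count_fits(n_candidates = 10, number = 3, repeats = 2)  # a lighter setting

c(current = fits_current, lighter = fits_lighter)
```

Under these assumptions the settings above come to 351 fits, against 61 for the lighter setting.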

best_param <- random_search$bestTune

# nrounds is not a booster parameter, so it is left out of this list;
# it is passed directly to xgb.cv() and xgboost() below
param <- list(
  max_depth = best_param$max_depth,
  eta = best_param$eta,
  nthread = 4,
  objective = "reg:squarederror",
  eval_metric = "rmse",
  gamma = best_param$gamma,
  colsample_bytree = best_param$colsample_bytree,
  min_child_weight = best_param$min_child_weight,
  subsample = best_param$subsample,
  lambda = 1.3,
  alpha = 0.5
)

The lambda (L2) and alpha (L1) regularization values above were added to curb overfitting. When reusing the code, either remove them or set them to values suited to your data.
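
For reference, in the XGBoost objective these two values enter through the regularization term applied to each tree, where T is the number of leaves and w_j are the leaf weights. The gamma value taken from the search penalizes the number of leaves in the same term:

```latex
\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} |w_j|
```

Larger lambda or alpha values shrink the leaf weights toward zero, which makes each tree's contribution more conservative.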

The nrounds value matters a lot, so cross-validation with xgb.cv is used to search for the best value once more.

# Search for the best nrounds value

cv_model <- xgb.cv(params = param, 
                   data = train_xgb, 
                   nfold = 5, 
                   nrounds = 1000, 
                   early_stopping_rounds = 100, 
                   verbose = 1, 
                   metrics = "rmse")

best_nrounds <- cv_model$best_iteration
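
The early_stopping_rounds = 100 setting means the search stops once the CV error has not improved for 100 consecutive rounds, and the best round seen so far is reported. A minimal base R sketch of that rule follows. The function early_stop_round and the score vector are hypothetical, written only to illustrate the idea; this is not xgboost's internal code.

```r
# Return the round with the lowest score, stopping once `patience`
# rounds have passed without an improvement
early_stop_round <- function(scores, patience) {
  best_score <- Inf
  best_iter <- 0L
  for (i in seq_along(scores)) {
    if (scores[i] < best_score) {
      best_score <- scores[i]
      best_iter <- i
    } else if (i - best_iter >= patience) {
      break
    }
  }
  best_iter
}

# Hypothetical CV RMSE per boosting round
cv_rmse <- c(10, 8, 7, 6.5, 6.6, 6.7, 6.8, 6.4)

stop_short <- early_stop_round(cv_rmse, patience = 3)  # stops before reaching round 8
stop_long <- early_stop_round(cv_rmse, patience = 5)   # waits long enough to see round 8
c(patience_3 = stop_short, patience_5 = stop_long)
```

With a patience of 3 the sketch settles on round 4, while a patience of 5 lets it reach the later improvement at round 8, which is why a generous value such as 100 is a cautious choice.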

With the results of the steps above, train the final model.

model_fin <- xgboost(params = param, data = train_xgb, nrounds = best_nrounds, verbose = 1, booster = "dart")

train_pred_prob <- predict(model_fin, train_xgb)

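The dart booster is used here to curb overfitting, but you can omit that argument when reusing the code. Note that the cross-validation above ran with the default booster, so best_nrounds was not tuned specifically for dart.

Finally, compute the mean bias (the average residual) on the training data and add it to the test predictions, with the aim of improving accuracy.

# Compute the bias on the training data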

train_bias <- mean(train_label - train_pred_prob)

# Print the training bias

print(paste("Train Bias:", train_bias))

pred_prob <- predict(model_fin, test_xgb)

results <- data.frame(
  date = cleaned_data[(train_size + 1):nrow(cleaned_data), "date"],
  actual = test_label,
  predicted = pred_prob + train_bias
)
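
The bias correction above can be sketched on made-up numbers in base R. All values and names below are hypothetical and chosen so that the predictions run low by a similar amount on both sets.

```r
# Hypothetical training labels and predictions that run 1 unit too low on average
toy_train_label <- c(10, 12, 14, 16)
toy_train_pred <- c(9.5, 10.5, 13, 15)

# Mean residual on the training data (same formula as train_bias above)
toy_bias <- mean(toy_train_label - toy_train_pred)
toy_bias

# Hypothetical test labels and raw predictions
toy_test_label <- c(18, 20)
toy_test_pred <- c(17, 19.5)

rmse_of <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Compare the test error before and after adding the training bias
rmse_raw <- rmse_of(toy_test_label, toy_test_pred)
rmse_corrected <- rmse_of(toy_test_label, toy_test_pred + toy_bias)
c(raw = rmse_raw, corrected = rmse_corrected)
```

The correction helps only when the test residuals share the same offset as the training residuals, so it is worth comparing both errors on held-out data before keeping it.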

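Evaluate the model with several metrics: RMSE, MAE, R-squared, and adjusted R-squared.

# rmse() and mae() are assumed here to come from the Metrics package
library(Metrics)

rmse_value <- rmse(results$actual, results$predicted)
mae_value <- mae(results$actual, results$predicted)
# R-squared here is the squared Pearson correlation between actual and predicted values
r2_value <- cor(results$actual, results$predicted)^2
n <- nrow(results)   # number of test observations
p <- ncol(train_df)  # number of features
adj_r2_value <- (1 - (1 - r2_value) * ((n - 1) / (n - p - 1)))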


performance_metrics <- data.frame(
  RMSE = rmse_value,
  MAE = mae_value,
  R2 = r2_value,
  Adjusted_R2 = adj_r2_value
)

print(performance_metrics)
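
For readers who want to see what these metrics compute, the sketch below writes them out in base R on made-up numbers. It also contrasts the squared correlation used above with the 1 - SSE/SST definition of R-squared, since the two can differ when predictions are offset. All values here are hypothetical.

```r
# Hypothetical actual values and predictions that are all 5 units too high
toy_actual <- c(10, 20, 30, 40)
toy_predicted <- toy_actual + 5

# RMSE and MAE written out in base R
toy_rmse <- sqrt(mean((toy_actual - toy_predicted)^2))
toy_mae <- mean(abs(toy_actual - toy_predicted))

# Squared Pearson correlation (the R2 definition used above)
r2_cor <- cor(toy_actual, toy_predicted)^2

# Coefficient of determination: 1 - SSE / SST
sse <- sum((toy_actual - toy_predicted)^2)
sst <- sum((toy_actual - mean(toy_actual))^2)
r2_det <- 1 - sse / sst

c(RMSE = toy_rmse, MAE = toy_mae, R2_cor = r2_cor, R2_det = r2_det)
```

In this toy case the squared correlation is 1 even though every prediction is off by 5, while 1 - SSE/SST drops to 0.8. Reporting RMSE and MAE alongside R-squared, as done above, guards against reading too much into the correlation alone.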

Finally, plot the results to check them.

library(ggplot2)

# Visualize actual and predicted values with ggplot2

ggplot(results, aes(x = date)) +
  geom_line(aes(y = actual, color = "Actual")) +
  geom_line(aes(y = predicted, color = "Predicted")) +
  labs(title = "Actual vs Predicted PM2.5 Levels",
       x = "Date",
       y = "PM2.5") +
  scale_color_manual(values = c("Actual" = "blue", "Predicted" = "red")) +
  theme_minimal()
  
# Compute feature importance

importance_matrix <- xgb.importance(feature_names = colnames(train_df), model = model_fin)

# Plot feature importance

xgb.plot.importance(importance_matrix, main = "Feature Importance")

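In this post, we built a time-series forecast using data from overseas compact cities. The next post will cover data preprocessing for Korean compact cities and ordinary cities.

If you have questions or spot any problems, please leave a comment.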
