[NLP] IMDB Movie Review Sentiment Analysis



gyubinc 2023. 1. 24. 00:15

Kaggle competition: Bag of Words Meets Bags of Popcorn
A review is labeled **1** if it is positive and **0** if it is negative.
[Reference]
Kaggle competition: https://www.kaggle.com/c/word2vec-nlp-tutorial
Inflearn course: https://www.inflearn.com/course/nlp-imdb-%ED%8C%8C%EC%9D%B4%EC%8D%AC-%EC%9E%90%EC%97%B0%EC%96%B4-%EC%B2%98%EB%A6%AC/dashboard

The code was written in Colab.

# Mount Google Drive; a shared document can be reached by adding a shortcut to it in My Drive
from google.colab import drive
drive.mount('/gdrive', force_remount=True)

import pandas as pd

"""
header=0 indicates that the first line of the file contains the column names,
delimiter='\t' means that the fields are separated by tabs, and
quoting=3 tells the parser to ignore double quotes.
"""
# csv quoting constants: QUOTE_MINIMAL (0), QUOTE_ALL (1),
# QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).

# Training data, which includes the sentiment label
train = pd.read_csv('/gdrive/MyDrive/~~~/labeledTrainData.tsv', header=0, delimiter='\t', quoting=3)
# Test data, which has no label
test = pd.read_csv('/gdrive/MyDrive/~~~/testData.tsv', header=0, delimiter='\t', quoting=3)
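
To see what ignoring double quotes means in practice, here is a minimal sketch using Python's standard `csv` module instead of pandas. The two-line sample in `raw` is made up for illustration and is not taken from the competition files.

```python
import csv
import io

# Hypothetical two-line TSV sample; the review text contains stray double quotes.
raw = 'id\treview\n"1"\t"He said "wow" twice"\n'

# QUOTE_NONE (3): double quotes are treated as ordinary characters.
none_rows = list(csv.reader(io.StringIO(raw), delimiter='\t', quoting=csv.QUOTE_NONE))
# Default (QUOTE_MINIMAL): double quotes are interpreted as field quoting.
default_rows = list(csv.reader(io.StringIO(raw), delimiter='\t'))

print(csv.QUOTE_NONE)
print(none_rows[1])
print(default_rows[1])
```

With `QUOTE_NONE` the fields come back exactly as written, quotes included, so quotation marks inside a review cannot confuse the parser.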

print(train.shape)
print(test.shape)

print(f'train features : {train.columns.values}')
print(f'test features : {test.columns.values}')


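Data Preprocessing

Before preprocessing, first look at what the data actually looks like.

The shape of the data depends on how it was collected. If it was gathered from HTML pages, for example by web scraping, the tags may not have been removed.

Even if the tags at both ends have been stripped, check the actual data to see whether the tags inside the text have been removed as well.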
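
One quick way to run that check is to count how many reviews still match a tag-like pattern. This is only a rough sketch: the `reviews` list is a made-up sample (the real texts would come from `train['review']`), and the regular expression is a simple heuristic, not a full HTML parser.

```python
import re

# Hypothetical sample reviews; the real ones come from train['review'].
reviews = [
    "Great film.<br /><br />Would watch again.",
    "No markup in this one.",
    "An <i>odd</i> but fun movie.",
]

# Anything of the form <...> is treated as a leftover tag.
TAG_PATTERN = re.compile(r"<[^>]+>")

with_tags = [r for r in reviews if TAG_PATTERN.search(r)]
print(f"{len(with_tags)} of {len(reviews)} reviews still contain tags")
print(TAG_PATTERN.findall(reviews[0]))
```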

1. Removing HTML tags with BeautifulSoup

!pip install bs4
!pip3 show BeautifulSoup4
from bs4 import BeautifulSoup

example1 = BeautifulSoup(train['review'][0], 'html5lib')
print(train['review'][0][:700])
example1.get_text()[:700]

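2. Replacing non-alphabetic characters with spaces using a regular expression

# Use a regular expression to remove special characters
import re
# Replace everything that is not a lowercase or uppercase letter with a space.
# ^ inside [] means "not"
letters_only = re.sub('[^a-zA-Z]', ' ', example1.get_text())
letters_only[:700]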
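
A detail worth checking in this kind of pattern is the upper bound of the letter range. The range `A-z` runs over ASCII codes 65 to 122, which also covers the six symbols between `Z` and `a`, so only `A-Z` restricts the match to letters. The sample string below is made up for illustration.

```python
import re

sample = "Wow_[great]^ movie\\ `10/10`"

# 'A-z' also covers [ \ ] ^ _ ` (ASCII 91..96), so these survive.
loose = re.sub('[^a-zA-z]', ' ', sample)
# 'A-Z' stops at 'Z', so every non-letter becomes a space.
strict = re.sub('[^a-zA-Z]', ' ', sample)

print(repr(loose))
print(repr(strict))
```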

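3. Tokenization

To analyze the data, each piece of text has to be split into smaller units.

The unit can be a sentence, a word, and so on, depending on what is needed, and each unit is called a token.

# Convert everything to lowercase.
lower_case = letters_only.lower()
# Split the string on whitespace => tokenization
words = lower_case.split()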
print(len(words))
words[:10]

4. Removing stopwords using NLTK data

!pip install nltk
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stopwords.words('english')[:10]
# Tokens with the stopwords removed
words = [w for w in words if w not in stopwords.words('english')]
print(len(words))
words[:10]

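In the NLTK version used here, the English stopword list contains 179 words; the count can differ between NLTK data releases.

What counts as a stopword depends on the definition. A longer stopword list is not necessarily better, so stopwords should be removed only to an appropriate degree.

Also check whether a term you want to extract is in the stopword list.

(For example, if the analysis needs the word "again" but the stopword list would remove it, take 'again' out of the stopword list first.)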
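
Keeping a word such as "again" could look like the sketch below. To stay self-contained it uses a tiny made-up list, `base_stopwords`, in place of `stopwords.words('english')`; the names `keep_words` and `custom_stops` are also just for this example.

```python
# Hypothetical mini stopword list, standing in for stopwords.words('english').
base_stopwords = ["i", "the", "a", "is", "again", "it"]

# Words the analysis needs to keep (an assumption for this example).
keep_words = {"again"}

# Set difference removes the protected words from the stopword set.
custom_stops = set(base_stopwords) - keep_words

tokens = ["i", "watched", "it", "again", "and", "again"]
filtered = [w for w in tokens if w not in custom_stops]
print(filtered)  # ['watched', 'again', 'and', 'again']
```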

5. Stemming with the Snowball Stemmer

Source: [Stemming - Wikipedia (Korean)](https://ko.wikipedia.org/wiki/%EC%96%B4%EA%B0%84_%EC%B6%94%EC%B6%9C)

- Stemming is the process of removing affixes from an inflected word and isolating its stem.

- It lets words such as "message", "messages", and "messaging", which differ only in plural or progressive form, be treated as the same word.

- Stemmers: the stemmers provided by NLTK are used here. The Porter stemmer is conservative and the Lancaster stemmer is more aggressive. Because its rules are more aggressive, the Lancaster stemmer maps more unrelated words onto the same stem.

[Reference: 모두의 데이터 과학 with 파이썬 (Gilbut)](http://www.gilbut.co.kr/book/bookView.aspx?bookcode=BN001787)
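
To get a feel for what suffix stripping does, here is a deliberately tiny sketch. `toy_stem` and its four suffix rules are invented for this example; it is not the Porter, Lancaster, or Snowball algorithm, which use far more careful rule sets.

```python
def toy_stem(word: str) -> str:
    """Strip one common English suffix. A toy rule set, not the Porter algorithm."""
    for suffix in ("ing", "es", "s", "e"):
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["message", "messages", "messaging"]:
    print(w, "->", toy_stem(w))
```

All three forms collapse to the same stem, which is what lets a later counting step treat them as one word.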

# Porter stemmer example
stemmer = nltk.stem.PorterStemmer()
print(stemmer.stem('maximum'))
print("The stemmed form of running is: {}".format(stemmer.stem("running")))
print("The stemmed form of runs is: {}".format(stemmer.stem("runs")))
print("The stemmed form of run is: {}".format(stemmer.stem("run")))

# Lancaster stemmer example
from nltk.stem.lancaster import LancasterStemmer
lancaster_stemmer = LancasterStemmer()
print(lancaster_stemmer.stem('maximum'))
print("The stemmed form of running is: {}".format(lancaster_stemmer.stem("running")))
print("The stemmed form of runs is: {}".format(lancaster_stemmer.stem("runs")))
print("The stemmed form of run is: {}".format(lancaster_stemmer.stem("run")))

# Words before processing
words[:10]

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')
words = [stemmer.stem(w) for w in words]

# Words after processing
words[:10]

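Lemmatization

In linguistics, lemmatization is the process of grouping together the inflected forms of a word so that they can be analyzed as a single item, identified by the word's lemma, or dictionary form. The right lemma can depend on how a word is used. Homonyms, for example, have different meanings depending on context:

1) *배*가 맛있다. (The pear is delicious.)

2) *배*를 타는 것이 재미있다. (Riding a boat is fun.)

3) 평소보다 두 *배*로 많이 먹어서 *배*가 아프다. (I ate twice as much as usual, so my stomach hurts.)

The "배" in the three sentences above has a different meaning each time.

Lemmatization looks at the surrounding context to decide which word is meant. In English, "meeting" means a conference when it is used as a noun, while "meet" as a verb means to encounter someone; lemmatization extracts the appropriate form depending on whether the word is used as a noun or as a verb.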
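
The noun/verb distinction can be sketched with a lookup keyed on the word and its part of speech. `LEMMA_TABLE` and `toy_lemmatize` are hypothetical and hold only a few hand-written entries; a real lemmatizer relies on a full lexicon. NLTK's `WordNetLemmatizer.lemmatize` similarly accepts a `pos` argument and treats the word as a noun when none is given.

```python
# Hypothetical (word, part-of-speech) -> lemma table; real lemmatizers use a full lexicon.
LEMMA_TABLE = {
    ("meeting", "n"): "meeting",  # noun: a conference
    ("meeting", "v"): "meet",     # verb: a form of "meet"
    ("flies", "n"): "fly",
    ("flies", "v"): "fly",
}

def toy_lemmatize(word: str, pos: str) -> str:
    """Return the lemma for (word, pos), falling back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())

print(toy_lemmatize("meeting", "n"))  # meeting
print(toy_lemmatize("meeting", "v"))  # meet
```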

- References:

- [Stemming and lemmatization](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)

- [Lemmatisation - Wikipedia](https://en.wikipedia.org/wiki/Lemmatisation)

import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

print(wordnet_lemmatizer.lemmatize('fly'))
print(wordnet_lemmatizer.lemmatize('flies'))

words = [wordnet_lemmatizer.lemmatize(w) for w in words]

# Words after processing
words[:10]

+ Python's dictionary type (like set) uses a hash table, so a lookup takes O(1) time on average, whereas a lookup in a list takes O(N). If the data can be held in a set or a dictionary, it is better to write the code that way.
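
The difference is easy to measure with the standard `timeit` module. The sizes below (10,000 words, 1,000 lookups) are arbitrary choices for this sketch, and the exact timings will vary by machine.

```python
import timeit

words = [f"word{i}" for i in range(10_000)]
word_list = words
word_set = set(words)
target = "word9999"  # worst case for the list: the last element

# Each call repeats the membership test 1,000 times.
list_time = timeit.timeit(lambda: target in word_list, number=1_000)
set_time = timeit.timeit(lambda: target in word_set, number=1_000)
print(f"list: {list_time:.4f}s, set: {set_time:.4f}s")
```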

The preprocessing steps are bundled into a single function.

def review_to_words(raw_review):
    # 1. Remove HTML
    review_text = BeautifulSoup(raw_review, 'html.parser').get_text()
    # 2. Replace non-alphabetic characters with spaces
    letters_only = re.sub('[^a-zA-Z]', ' ', review_text)
    # 3. Convert to lowercase and split into words
    words = letters_only.lower().split()
    # 4. Convert the stopwords to a set
    # In Python, membership tests on a set are much faster than on a list.
    stops = set(stopwords.words('english'))
    # 5. Remove stopwords
    meaningful_words = [w for w in words if w not in stops]
    # 6. Stemming (uses the SnowballStemmer instance `stemmer` created above)
    stemming_words = [stemmer.stem(w) for w in meaningful_words]
    # 7. Join the words into a space-separated string and return it
    return ' '.join(stemming_words)

Vectorization - BoW

[Bag of words model - Wikipedia](https://en.wikipedia.org/wiki/Bag-of-words_model)

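Words are put into a bag, and for each document we simply count how many times each word in the bag appears, ignoring syntax.

Suppose we have the following two sentences.

(1) John likes to watch movies. Mary likes movies too.

(2) John also likes to watch football games.

Tokenizing the two sentences and putting the tokens into the bag gives the following.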

["John", "likes", "to", "watch", "movies", "Mary", "too", "also", "football", "games"]

Then, following the order of this array, count how many times each token appears in each sentence.

(1) [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]

(2) [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]

=> This converts the text into a form that machine learning algorithms can understand.
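
The two count vectors above can be reproduced in a few lines of plain Python. This is a minimal sketch of the idea, not how CountVectorizer is implemented; the `tokenize` helper and the first-appearance vocabulary order are choices made for this example.

```python
import re
from collections import Counter

docs = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games.",
]

def tokenize(text: str) -> list[str]:
    # Keep runs of letters only; punctuation is dropped.
    return re.findall(r"[A-Za-z]+", text)

# Vocabulary in order of first appearance.
vocab: list[str] = []
for doc in docs:
    for token in tokenize(doc):
        if token not in vocab:
            vocab.append(token)

# One count vector per document, aligned with the vocabulary order.
vectors = []
for doc in docs:
    counts = Counter(tokenize(doc))
    vectors.append([counts[token] for token in vocab])

print(vocab)
for vec in vectors:
    print(vec)
```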

N-gram

Because BoW only vectorizes word counts, it treats two sentences as identical even when the word order changes. N-grams are used to mitigate this.
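
A small sketch of this limitation, using two made-up sentences that contain the same words in a different order: the word counts are identical, while the bigram counts are not.

```python
from collections import Counter

a = "the dog bit the man".split()
b = "the man bit the dog".split()

def bigrams(tokens):
    # Pair each token with its successor.
    return list(zip(tokens, tokens[1:]))

print(Counter(a) == Counter(b))                    # True
print(Counter(bigrams(a)) == Counter(bigrams(b)))  # False
```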

For sentence (1), filling the bag with bigrams instead of single words gives the following.

["John likes", "likes to", "to watch", "watch movies", "Mary likes", "likes movies", "movies too"]

=> Here, CountVectorizer is used to do this work.
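
The bigram list for sentence (1) can be rebuilt by hand as in the sketch below. The `sentence_bigrams` helper is written for this example and splits on sentence-ending punctuation first, so that no bigram crosses a sentence boundary.

```python
import re

text = "John likes to watch movies. Mary likes movies too."

def sentence_bigrams(text: str) -> list[str]:
    """Build word bigrams that do not cross sentence boundaries."""
    result = []
    for sentence in re.split(r"[.!?]", text):
        tokens = re.findall(r"[A-Za-z]+", sentence)
        result.extend(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return result

print(sentence_bigrams(text))
```

Note that CountVectorizer works on the whole document without sentence splitting and lowercases by default, so its bigram vocabulary for the same text would not match this list exactly.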

Applying BoW

num_reviews = train['review'].size
clean_train_reviews = []
for i in range(0, num_reviews):
    # Print a progress message every 5000 reviews to confirm the loop is running
    if (i + 1) % 5000 == 0:
        print('Review {} of {}'.format(i+1, num_reviews))
    clean_train_reviews.append(review_to_words(train['review'][i]))

%time train['review_clean'] = train['review'].apply(review_to_words)

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Parameter values are changed from the tutorial
# Changing only the parameters makes a large difference in the Kaggle score
vectorizer = CountVectorizer(analyzer = 'word',
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = None,
                             min_df = 2, # minimum number of documents a token must appear in
                             ngram_range=(1, 3),
                             max_features = 5000)
vectorizer
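
As a rough illustration of what `min_df` and `max_features` do, the sketch below filters a toy vocabulary by document frequency and then keeps the most frequent terms. The three small documents and the constants are made up, and the tie-breaking rule is an assumption of this example rather than a description of scikit-learn's internals.

```python
from collections import Counter

docs = [
    ["good", "movi", "good"],
    ["bad", "movi"],
    ["good", "act", "bad", "plot"],
]
MIN_DF = 2        # keep terms that occur in at least 2 documents
MAX_FEATURES = 2  # then keep only the 2 most frequent terms overall

# Document frequency counts each term once per document.
doc_freq = Counter(term for doc in docs for term in set(doc))
# Term frequency counts every occurrence.
term_freq = Counter(term for doc in docs for term in doc)

kept = [t for t in term_freq if doc_freq[t] >= MIN_DF]
# Sort by total frequency (descending), breaking ties alphabetically.
kept.sort(key=lambda t: (-term_freq[t], t))
vocab = sorted(kept[:MAX_FEATURES])
print(vocab)
```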

train_data_features = vectorizer.fit_transform(clean_train_reviews)
# get_feature_names() was removed in recent scikit-learn releases; use get_feature_names_out()
vocab = vectorizer.get_feature_names_out()
print(len(vocab))
vocab[:10]

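# Inspect the vectorized features
import numpy as np
# Sum each column of the sparse matrix and flatten to a 1-D array of totals per feature
dist = np.asarray(train_data_features.sum(axis=0)).ravel()

for tag, count in zip(vocab, dist):
    print(count, tag)

# Sum of the counts over the whole dataset
pd.DataFrame([dist], columns=vocab)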

Random Forest

[Random forest - Wikipedia (Korean)](https://ko.wikipedia.org/wiki/%EB%9E%9C%EB%8D%A4_%ED%8F%AC%EB%A0%88%EC%8A%A4%ED%8A%B8)

The key characteristic of a random forest is that it consists of trees that differ slightly from one another because of randomness. This decorrelates the predictions of the individual trees and, as a result, improves generalization. Randomization also makes the forest robust to data that contains noise.
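
Why decorrelated predictions help can be seen in a small simulation of majority voting. This is not a random forest: it assumes 101 fully independent voters that are each right 60% of the time, an idealization, since real trees are only partially decorrelated. All constants are chosen for illustration.

```python
import random

random.seed(0)
P_CORRECT = 0.6   # assumed accuracy of each individual voter
N_VOTERS = 101    # odd, so a majority always exists
N_TRIALS = 2000

single_hits = 0
ensemble_hits = 0
for _ in range(N_TRIALS):
    # Each voter is right independently with probability P_CORRECT.
    votes = [random.random() < P_CORRECT for _ in range(N_VOTERS)]
    single_hits += votes[0]
    ensemble_hits += sum(votes) > N_VOTERS / 2

single_acc = single_hits / N_TRIALS
ensemble_acc = ensemble_hits / N_TRIALS
print(f"single voter: {single_acc:.3f}, majority vote: {ensemble_acc:.3f}")
```

Under these assumptions the majority vote is right far more often than any single voter, which is the intuition behind averaging many slightly different trees.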

from sklearn.ensemble import RandomForestClassifier

# Use a random forest classifier
forest = RandomForestClassifier(
    n_estimators = 100, n_jobs = -1, random_state=2018)
%time forest = forest.fit(train_data_features, train['sentiment'])

from sklearn.model_selection import cross_val_score
%time score = np.mean(cross_val_score(\
    forest, train_data_features, \
    train['sentiment'], cv=10, scoring='roc_auc'))
score

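# Loop over the test set using its own size, not the training set size
num_test_reviews = test['review'].size
clean_test_reviews = []
for i in range(0, num_test_reviews):
    # Print a progress message every 5000 reviews to confirm the loop is running
    if (i + 1) % 5000 == 0:
        print('Review {} of {}'.format(i+1, num_test_reviews))
    clean_test_reviews.append(review_to_words(test['review'][i]))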

# Reuse the vocabulary learned from the training data; do not refit on the test set
test_data_features = vectorizer.transform(clean_test_reviews)
test_data_features = test_data_features.toarray()
result = forest.predict(test_data_features)
output = pd.DataFrame(data={"id":test["id"], "sentiment" : result, "review":test["review"]})
output    # kaggle score : 0.85360
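
One thing worth checking at this step is that the test features use the same columns as the training features. The sketch below shows the idea in plain Python with made-up documents: `fit_vocab` and `transform` are hypothetical helpers, not scikit-learn functions. A vocabulary refitted on the test set has different columns, so a model trained on the original columns would read the wrong features.

```python
from collections import Counter

def fit_vocab(docs):
    """Learn a sorted vocabulary from tokenized training documents."""
    return sorted({token for doc in docs for token in doc})

def transform(docs, vocab):
    """Count tokens per document in the column order fixed by vocab."""
    rows = []
    for doc in docs:
        counts = Counter(doc)
        # Tokens that are not in vocab are simply ignored.
        rows.append([counts[token] for token in vocab])
    return rows

train_docs = [["good", "movi"], ["bad", "plot"]]
test_docs = [["good", "act"], ["bad", "bad", "movi"]]

vocab = fit_vocab(train_docs)            # learned from training data only
test_rows = transform(test_docs, vocab)  # columns match the training matrix
refit_vocab = fit_vocab(test_docs)       # what refitting on the test set would give

print(vocab, test_rows)
print(refit_vocab)
```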

output_sentiment = output['sentiment'].value_counts()
print(output_sentiment[0] - output_sentiment[1])
output_sentiment

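Word Cloud

- A visualization method that can be used when word frequency data is available.

- It only shows raw frequency; arranging words by correlation or similarity would be more meaningful, so it is hard to get much information from it.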

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
# %matplotlib inline must be set for plots to be displayed inside the notebook.
%matplotlib inline

def displayWordCloud(data = None, backgroundcolor = 'white', width=800, height=600 ):
    wordcloud = WordCloud(stopwords = STOPWORDS, 
                          background_color = backgroundcolor, 
                         width = width, height = height).generate(data)
    plt.figure(figsize = (15 , 10))
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()

# Draw a word cloud of all words in the train data.
%time displayWordCloud(' '.join(clean_train_reviews))

# Draw a word cloud of all words in the test data.
%time displayWordCloud(' '.join(clean_test_reviews))

# Number of words
#train['num_words'] = clean_train_reviews.apply(lambda x: len(str(x).split()))
train['num_words'] = list(map(lambda x: len(str(x).split()), clean_train_reviews))
# Number of unique words
train['num_uniq_words'] = list(map(lambda x: len(set(str(x).split())),clean_train_reviews))
# Number of words in the first review
x = clean_train_reviews[0]
x = str(x).split()
print(len(x))

import seaborn as sns

fig, axes = plt.subplots(ncols=2)
fig.set_size_inches(18, 6)
print('Mean words per review:', train['num_words'].mean())
print('Median words per review:', train['num_words'].median())
# sns.distplot is deprecated; histplot with kde=True draws a comparable plot
sns.histplot(train['num_words'], bins=100, kde=True, stat='density', ax=axes[0])
axes[0].axvline(train['num_words'].median(), linestyle='dashed')
axes[0].set_title('Distribution of words per review')

print('Mean unique words per review:', train['num_uniq_words'].mean())
print('Median unique words per review:', train['num_uniq_words'].median())
sns.histplot(train['num_uniq_words'], bins=100, kde=True, stat='density', color='g', ax=axes[1])
axes[1].axvline(train['num_uniq_words'].median(), linestyle='dashed')
axes[1].set_title('Distribution of unique words per review')

If Korean text is garbled in plots on Google Colab, a Korean font can be downloaded and installed separately.

☆ Text data preprocessing terms

(Source: https://github.com/twitter/twitter-korean-text/)

**Normalization (입니닼ㅋㅋ -> 입니다 ㅋㅋ, 샤릉해 -> 사랑해)**

- 한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ -> 한국어를 처리하는 예시입니다 ㅋㅋ

**Tokenization**

- 한국어를 처리하는 예시입니다 ㅋㅋ -> 한국어Noun, 를Josa, 처리Noun, 하는Verb, 예시Noun, 입Adjective, 니다Eomi ㅋㅋKoreanParticle

**Stemming (입니다 -> 이다)**

- 한국어를 처리하는 예시입니다 ㅋㅋ -> 한국어Noun, 를Josa, 처리Noun, 하다Verb, 예시Noun, 이다Adjective, ㅋㅋKoreanParticle

**Phrase extraction**

- 한국어를 처리하는 예시입니다 ㅋㅋ -> 한국어, 처리, 예시, 처리하는 예시