Analysis of President Trump's Tweets

Analysis of President Trump's Tweets - 2. String Functions

b4failrise ㅣ 2018. 2. 25. 01:48

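String functions

String functions are called as string.method(). It is easiest to think of them as member functions defined on the string class type.

※ String methods do not change the original string (Python strings are immutable). They return the changed string as a new value, so you must store the returned value in a variable if you want to keep it.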

1. startswith()

The return value is of type bool.
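As a quick check of the return type, here is a minimal sketch that calls startswith() on a sample word. The word and the variable names are illustrative values only and are not part of the exercise.

```python
# startswith() returns True or False, so it can be used directly as an if condition.
word = 'korea'

starts_with_k = word.startswith('k')   # the word begins with 'k'
starts_with_m = word.startswith('m')   # the word does not begin with 'm'

print(starts_with_k, type(starts_with_k))
print(starts_with_m)
```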

Exercise - Checking the first letter of a word

# A list of words from one of President Trump's tweets, split on whitespace. Do not modify.
trump_tweets = ['thank', 'you', 'to', 'president', 'moon', 'of', 'south',
                'korea', 'for', 'the', 'beautiful', 'welcoming', 'ceremony',
                'it', 'will', 'always', 'be', 'remembered']


def print_korea(text):
    '''
    Prints the strings that start with k from a list of strings.
    '''

    # Complete the print_korea() function below.
    for word in text:
        if word.startswith('k'):
            print(word)


print_korea(trump_tweets)

* This does the same job as the print_korea() function from part 1 (list traversal).

2. split()

Splits a string on whitespace, which is the default delimiter, and returns the pieces as a list.

intro = "My name is devgraphy"
print(intro.split())

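In addition, you can pass a delimiter as the argument to split(). When a delimiter is given, the string is cut at each occurrence of that delimiter and the pieces are returned as a list.

What happens if you pass ' ' (a single space) as the delimiter? The string is split at every single delimiter, so when two or more identical delimiters appear in a row, the list contains one empty string ('') fewer than the number of delimiters in that run. This raises the question: "Why doesn't it handle repeated delimiters in one go, instead of leaving one or more extra cleanup steps?" It does leave more work to do, but for careful data processing this can be the better result!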
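The difference described above can be sketched as follows. The sample text and variable names are made up for illustration, and the list comprehension at the end is just one possible way to do the extra cleanup step.

```python
# Two words separated by three spaces (sample text chosen for illustration).
text = 'merry   christmas'

default_split = text.split()      # a run of whitespace counts as one separator
space_split = text.split(' ')     # every single space is its own separator

print(default_split)   # ['merry', 'christmas']
print(space_split)     # ['merry', '', '', 'christmas']  (3 spaces -> 2 empty strings)

# If the empty strings are not wanted, they can be filtered out afterwards.
cleaned = [piece for piece in space_split if piece != '']
print(cleaned)         # ['merry', 'christmas']
```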

Let's look at some other common whitespace characters.

'\t' : tab, '\n' : newline (Enter)
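A small sketch of how these characters interact with split(), assuming a sample string invented for this illustration: with no argument, tabs and newlines are treated as whitespace just like spaces, while an explicit delimiter splits only on that character.

```python
# A sample string containing a tab, a newline, and an ordinary space.
text = 'thank\tyou\nto president'

on_whitespace = text.split()      # tab, newline and space all separate words
on_newline = text.split('\n')     # only the newline separates pieces

print(on_whitespace)   # ['thank', 'you', 'to', 'president']
print(on_newline)      # ['thank\tyou', 'to president']
```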

Exercise - Splitting a sentence into words

# A string containing one of President Trump's tweets. Do not modify.
trump_tweets = "thank you to president moon of south korea for the beautiful welcoming ceremony it will always be remembered"


def break_into_words(text):
    '''
    Splits the string on whitespace and returns the pieces as a list.

    >>> break_into_words('merry christmas')
    ['merry', 'christmas']
    '''

    # Modify the break_into_words() function below.

    words = text.split()

    return words


# After completing the function, uncomment the code below and check the result.
print(break_into_words(trump_tweets))

3. lower(), upper()

Python treats lowercase and uppercase versions of a string as different strings. To treat them as the same string during analysis, use the lower() or upper() function.
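To illustrate why this matters for analysis, here is a minimal sketch that counts a word before and after normalizing case. The word list and variable names are assumptions made up for this example.

```python
# The same word written with different capitalization (illustrative data).
words = ['FAKE', 'Fake', 'fake', 'news']

# Without normalization only the exact lowercase spelling matches.
exact_matches = words.count('fake')

# After lowercasing every word, all three spellings are counted together.
normalized = [word.lower() for word in words]
normalized_matches = normalized.count('fake')

print(exact_matches)        # 1
print(normalized_matches)   # 3
```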

intro = "My name is Elice!"
print(intro.upper())
print(intro.lower())

Now let's look more closely at how the lower() method behaves. Try to predict the result of the following code.

intro = "My name is Elice!"
intro.lower()
print(intro)

The lower() method only returns the converted string. It does not change the value of the intro variable itself, so the code above still prints the original string.

intro = "My name is Elice!"
intro = intro.lower()
print(intro)

Only now is the changed value actually stored in intro.

Exercise - Converting case

# A list of three of President Trump's tweets. Do not modify.
trump_tweets = [
    "FAKE NEWS - A TOTAL POLITICAL WITCH HUNT!",
    "Any negative polls are fake news, just like the CNN, ABC, NBC polls in the election.",
    "The Fake News media is officially out of control.",
]


def lowercase_all_characters(text):
    '''
    Converts every string stored in the list to lowercase.

    >>> lowercase_all_characters(['FAKE NEWS', 'Fake News'])
    ['fake news', 'fake news']
    '''

    processed_text = []

    # Complete the lowercase_all_characters() function below.
    for word in text:
        processed_text.append(word.lower())

    return processed_text


# After completing the function, uncomment the code below and check the result.
print('\n'.join(lowercase_all_characters(trump_tweets)))

4. replace()

# The Korean sentence below means "My name is devgraphy."
intro = "제 이름은 devgraphy입니다."
print(intro.replace('devgraphy', '데브그래피'))

The replace() method substitutes one substring for another, but it can also be used to delete a particular substring.

Next, let's turn a sentence that contains spaces into one continuous string with no spaces.

intro = "제 이름은 devgraphy입니다."
print(intro.replace(' ',''))

This method, too, only returns the changed string. It does not modify the string variable. Check the code and its result.

intro = "제 이름은 devgraphy입니다."
intro.replace(' ','')
print(intro)

Exercise - Removing special characters

# A list of three of President Trump's tweets. Do not modify.
trump_tweets = [
    "i hope everyone is having a great christmas, then tomorrow it’s back to work in order to make america great again.",
    "7 of 10 americans prefer 'merry christmas' over 'happy holidays'.",
    "merry christmas!!!",
]


def remove_special_characters(text):
    '''
    Removes commas, single quotes, and exclamation marks from the strings stored in the list.

    >>> remove_special_characters(["wow!", "wall,", "liberals'"])
    ['wow', 'wall', 'liberals']
    '''
    processed_text = []

    # Complete the remove_special_characters() function below.
    for sentence in text:
        sentence = sentence.replace("!", "").replace("'", "").replace(",", "")
        processed_text.append(sentence)

    return processed_text


# After completing the function, uncomment the code below and check the result.
print('\n'.join(remove_special_characters(trump_tweets)))
 
 
To replace several characters or strings in one go, chain the calls as .replace().replace()..., as in the sentence = sentence.replace(...) line of the solution above.
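When the list of characters to remove grows, the chain can also be written as a loop. The sketch below is one possible alternative, not the exercise's expected answer. The helper name remove_characters and the sample sentence are hypothetical and introduced only for illustration.

```python
def remove_characters(sentence, characters):
    """Hypothetical helper: remove every listed character from the sentence."""
    # Each replace() call returns a new string, so the result is reassigned
    # on every pass, exactly as in a chained .replace().replace() expression.
    for character in characters:
        sentence = sentence.replace(character, '')
    return sentence


cleaned = remove_characters("merry christmas!!!, 'wow'", ["!", "'", ","])
print(cleaned)   # merry christmas wow
```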
 
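In Python, the double quote (") and the single quote (') are different characters, but as string delimiters in code they are treated the same.
In C++, strings are written with "", but in Python either kind of quote works. To pass a double quote as an argument, write ('"'), and to pass a single quote, write ("'").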
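A short sketch of the quoting rules just described. The sample strings and variable names are made up for illustration.

```python
# Either quote style produces the same string value.
single = 'merry christmas'
double = "merry christmas"
same = single == double
print(same)        # True

# To pass a single quote as an argument, wrap it in double quotes.
quoted = "7 of 10 americans prefer 'merry christmas'"
no_single = quoted.replace("'", "")
print(no_single)   # 7 of 10 americans prefer merry christmas

# To pass a double quote as an argument, wrap it in single quotes.
said = 'he said "wow"'
no_double = said.replace('"', '')
print(no_double)   # he said wow
```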
 

cf. Remember append(), a method of lists. It adds the value passed as its argument after the last element of the list.

words = ['hello']
words.append('devgraphy')
print(words)

b4failrise@devgraphy
