Introduction to Natural Language Processing, Part 4

>>> text1.generate()
Building ngram index…
[ Moby Dick ; though indeed such a dyspepsia it were a shuttle
mechanically weaving and weaving away when I go as a pendulum will ,
at least out of sight . ” Poor rover ! will ye do look brave .” ” They
didn ‘ t always see me , while these ribbed , arched , hairy sides ,
Stubb benevolently towed away at his foe , blindly seeking with a
bunch of grapes was swayed up to the other , why not the ship towing
the two extremes of all misery in others that is true and faithful

generate writes out random text based on the source text.
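To get a feel for how this kind of random text can be produced, here is a minimal sketch of a bigram random walk. It is not NLTK's actual implementation (which differs between versions); the function names and the toy sentence are made up for illustration.

```python
import random
from collections import defaultdict

def build_bigram_index(tokens):
    """Map each token to the list of tokens that follow it in the text."""
    followers = defaultdict(list)
    for current, nxt in zip(tokens, tokens[1:]):
        followers[current].append(nxt)
    return followers

def generate(tokens, length=10, seed=0):
    """Random walk over the bigram index, starting from the first token."""
    rng = random.Random(seed)
    followers = build_bigram_index(tokens)
    word = tokens[0]
    output = [word]
    for _ in range(length - 1):
        candidates = followers.get(word)
        if not candidates:  # dead end: this word is never followed by anything
            break
        word = rng.choice(candidates)
        output.append(word)
    return output

# Hypothetical toy text, standing in for text1
toy_tokens = "the whale swam and the ship sailed and the whale dived".split()
print(" ".join(generate(toy_tokens, length=8)))
```

Every adjacent pair in the output occurs somewhere in the source, which is why the result sounds locally plausible but wanders as a whole.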

Use len to count the words (strictly speaking, tokens: words plus punctuation marks).

>>> len(text1)

260819

>>> len(text2)

141576

>>> len(text3)

44764
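Since len counts every token, repeated words and punctuation included, it helps to compare it with the number of distinct tokens. A small sketch on a made-up token list (the list is an assumption, not data from text1):

```python
# Hypothetical toy token list; a real Text object can be used the same way
toy_tokens = ["Call", "me", "Ishmael", ".", "Call", "me", "again", "."]

token_count = len(toy_tokens)      # every token counts, including "."
vocabulary = set(toy_tokens)       # distinct tokens only
lexical_diversity = len(vocabulary) / token_count

print(token_count, len(vocabulary), lexical_diversity)
```

The same two expressions, len(text1) and len(set(text1)), should work on the loaded texts too.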

I couldn't get matplotlib installed.

Introduction to Natural Language Processing, Part 3

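Put simply, a corpus is a large collection of text gathered for natural language processing. The word "corpus" itself does not refer to any particular document or academic field. In English education, for example, there are corpora built for a variety of teaching purposes.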

Use .similar to check how a word is used within a text: it lists other words that appear in similar contexts.

>>> text1.similar('monstrous')
abundant candid careful christian contemptible curious delightfully
determined doleful domineering exasperate fearless few gamesome

horrible impalpable imperial lamentable lazy loving

>>> text2.similar('monstrous')
Building word-context index…
very exceedingly heartily so a amazingly as extremely good great

remarkably sweet vast

——

>>> text1.similar('great')
good long sea vast whale whole dead large living other small last
mighty more much same sperm such before captain

>>> text2.similar('great')
a good long much some well any happy large quiet strong such the this

young anxious common considerable far her
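The idea behind similar can be sketched as follows: record the (previous word, next word) pair around every word, then rank other words by how many of those pairs they share with the target. This is a simplified illustration, not NLTK's exact scoring, and the toy sentence and function names are made up.

```python
from collections import Counter, defaultdict

def context_index(tokens):
    """Map each lowercased word to a Counter of its (previous, next) neighbours."""
    words = [t.lower() for t in tokens]
    index = defaultdict(Counter)
    for i in range(1, len(words) - 1):
        index[words[i]][(words[i - 1], words[i + 1])] += 1
    return index

def similar(tokens, word, top_n=5):
    """Rank other words by how many distinct contexts they share with `word`."""
    index = context_index(tokens)
    target = set(index.get(word.lower(), ()))
    scores = Counter()
    for other, contexts in index.items():
        if other == word.lower():
            continue
        shared = target & set(contexts)
        if shared:
            scores[other] = len(shared)
    return [w for w, _ in scores.most_common(top_n)]

# Hypothetical toy text
toy_tokens = "a great deal of rain and a good deal of sun and a fine deal of wind".split()
print(similar(toy_tokens, "great"))
```

Here "good" and "fine" come back because, like "great", they sit between "a" and "deal".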

common_contexts, in turn, shows the contexts that two words share.

>>> text2.common_contexts(["great", "good"])

a_deal as_an so_a so_as

>>> text1.common_contexts(["miracle", "impossible"])

No common contexts were found
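Output such as a_deal reads as "the word before was a, the word after was deal". A rough sketch of that computation, using hypothetical helper names and a made-up sentence rather than NLTK's internals:

```python
def contexts_of(tokens, word):
    """Return the set of (previous, next) pairs surrounding `word`."""
    words = [t.lower() for t in tokens]
    return {
        (words[i - 1], words[i + 1])
        for i in range(1, len(words) - 1)
        if words[i] == word.lower()
    }

def common_contexts(tokens, words):
    """Intersect the context sets of all the given words."""
    sets = [contexts_of(tokens, w) for w in words]
    shared = set.intersection(*sets) if sets else set()
    return sorted(f"{left}_{right}" for left, right in shared)

# Hypothetical toy text
toy_tokens = ("a great deal of rain , so great a storm , "
              "a good deal of sun , so good a day").split()
print(common_contexts(toy_tokens, ["great", "good"]))
```

When the intersection is empty, the sketch returns an empty list, which corresponds to the "No common contexts were found" case above.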

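Introduction to Natural Language Processing, Part 2: concordance

With concordance you can check things like whether "died" or "killed" shows up more often in the Wall Street Journal. Not much difference, as it turns out.

>>> text7.concordance("died")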
Building index…
Displaying 4 of 4 matches:
with the substance , 28 *ICH*-1 have died — more than three times the expecte
ut 58 % of Campbell ‘s stock when he died in April *T*-1 . In recent months ,
llion *U* in 1985 . But as the craze died , Coleco failed *-1 to come up with

he U.S. about the impending attack . Died : James A. Attwood , 62 , retired ch

>>> text7.concordance("killed")

Displaying 3 of 3 matches:
 untrained and , in one botched job killed a client . Her remorse was shallow
4 saying 0 they would n’t have been killed *-2 “ if they had n’t been cruisin
ry assault of June 3-4 that *T*-241 killed hundreds , and perhaps thousands ,
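If only the counts matter, a plain loop is enough. Note that the "died" results above include the capitalized "Died", so the sketch below matches case-insensitively. The token list and function name are made up for illustration.

```python
def count_matches(tokens, word):
    """Count tokens equal to `word`, ignoring case."""
    target = word.lower()
    return sum(1 for t in tokens if t.lower() == target)

# Hypothetical toy tokens, standing in for text7
toy_tokens = ["He", "died", "in", "April", ".", "Died", ":", "James",
              "killed", "time", "."]

for word in ("died", "killed"):
    print(word, count_matches(toy_tokens, word))
```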

By the way, how does humanity's eternal theme, LOVE, fare in each text?

text1: Moby Dick by Herman Melville 1851

>>> text1.concordance("love")
Displaying 24 of 24 matches:

Moby Dick: 24 matches

text2: Sense and Sensibility by Jane Austen 1811

>>> text2.concordance("love")
Building index…
Displaying 25 of 77 matches:

Sense and Sensibility: 77 matches
I tried hard to follow the story but gave up after 10 seconds.

text3: The Book of Genesis

>>> text3.concordance("love")
Building index…
Displaying 4 of 4 matches:

Genesis: only 4, surprisingly?

text4: Inaugural Address Corpus

>>> text4.concordance("love")
Building index…
Displaying 25 of 50 matches:

Inaugural Address Corpus: 50 matches

text5: Chat Corpus

>>> text5.concordance("love")
Building index…
Displaying 25 of 63 matches:

Chat Corpus: 63 matches

text6: Monty Python and the Holy Grail

>>> text6.concordance("love")
Building index…

A British comedy film.

text7: Wall Street Journal

>>> text7.concordance("love")
Building index…
No matches

Wall Street Journal: also 0 here.

text8: Personals Corpus

>>> text8.concordance("love")
Building index…
Displaying 10 of 10 matches:

I still don't know exactly where this corpus comes from.
10 matches

text9: The Man Who Was Thursday by G . K . Chesterton 1908

>>> text9.concordance("love")
Building index…
Displaying 8 of 8 matches:

A metaphysical thriller novel published in the early 20th century.
8 matches

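Concordance is a good way to see which words are used in what kind of context.
Even the single word "love" can be the noun, "would love to ...", or "I love you". It also varies by topic, which makes the results fun to read.

An English study method for geeks.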

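Introduction to Natural Language Processing, Part 1

I'm working from O'Reilly's introductory book on natural language processing.

Start Python from Terminal and run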

import nltk
nltk.download()

then download the Book Collection (needed for from nltk.book import *).

After that, you can see the items being loaded like this.

>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus

text9: The Man Who Was Thursday by G . K . Chesterton 1908

Quick check.

>>> text9
<Text: The Man Who Was Thursday by G . K . Chesterton 1908>

Pass a word to .concordance and it shows where that word is used in the text, with the surrounding context.
Incidentally, "concordance" means "agreement" or "harmony".

Example 1

>>> text1.concordance("anonymous")
Displaying 2 of 2 matches:
nsterns ; but I say that scores of anonymous Captains have sailed out of Nantuc
 a great traveller , he leaves his anonymous babies all over the world ; every

Example 2

>>> text5.concordance("awesome")
Building index…
Displaying 10 of 10 matches:
huskers soybeans lol PART U3 has an awesome pic hope all is doing great lets go
6 fine here ty JOIN JOIN U20 has an awesome pic too :) Doing great U43 — thank
too hot huh U45 ? lol U9 U34 has an awesome pick ! lol U25 lol JOIN hi U43 U48
ttend that Kansas TC get together ? awesome U46 , whatcha doin ? lol Hi all hey
.. funny :) i saw the break up …. awesome movie . ACTION sits on U28 . . You
ys some hellos . got da pics U12 .. awesome hey U27 sure U21 ……… aint no
MODE #14-19teens + o U54 JOIN thats awesome U66 , nothing to be ashamed off !!
. you can like , get laid ? that ‘s awesome . U122 , is back so ppl can ban . p
going to hawaii tommorrow PART JOIN awesome U40 JOIN hi U170 whats up guys PART
 and Edgewood MD ???? it really was awesome when I was growing up . Too crowded
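The aligned display above is a keyword-in-context listing. A minimal sketch of how such lines can be built, with the keyword at a fixed column, is below. It is an illustration only, not NLTK's implementation; the function name, the width parameter, and the toy sentence are assumptions.

```python
def concordance(tokens, word, width=30):
    """Return one line per match, with the keyword at a fixed column."""
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() != target:
            continue
        # Keep only `width` characters of context on each side
        left = " ".join(tokens[:i])[-width:]
        right = " ".join(tokens[i + 1:])[:width]
        lines.append(f"{left:>{width}} {token} {right}")
    return lines

# Hypothetical toy text
toy_tokens = ("scores of anonymous Captains have sailed ; he leaves his "
              "anonymous babies all over the world").split()
for line in concordance(toy_tokens, "anonymous"):
    print(line)
```

Cutting the context by characters rather than by words is what produces the chopped-off words at the edges of each line.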