Mungkin, ini ada hubungannya dengan cara regex dikompilasi sebelumnya menggunakan nltk.internals.compile_regexp_to_noncapturing()
yang dihapuskan di v3.1, lihat di sini)
>>> import nltk
>>> nltk.__version__
'3.0.5'
>>> pattern = r'''(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
... | \w+([-']\w+)* # words w/ optional internal hyphens/apostrophe
... | [+/\-@&*] # special characters with meanings
... '''
>>>
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']
Tapi itu tidak berfungsi di NLTK v3.1
:
>>> import nltk
>>> nltk.__version__
'3.1'
>>> pattern = r'''(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
... | \w+([-']\w+)* # words w/ optional internal hyphens/apostrophe
... | [+/\-@&*] # special characters with meanings
... '''
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
[('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]
Dengan sedikit modifikasi pada cara Anda menentukan grup regex, Anda bisa mendapatkan pola yang sama untuk berfungsi di NLTK v3.1, menggunakan regex ini:
pattern = r"""(?x) # set flag to allow verbose regexps
(?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
|\d+(?:\.\d+)?%? # numbers, incl. currency and percentages
|\w+(?:[-']\w+)* # words w/ optional internal hyphens/apostrophe
|(?:[+/\-@&*]) # special characters with meanings
"""
Dalam kode:
>>> import nltk
>>> nltk.__version__
'3.1'
>>> pattern = r"""
... (?x) # set flag to allow verbose regexps
... (?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
... |\d+(?:\.\d+)?%? # numbers, incl. currency and percentages
... |\w+(?:[-']\w+)* # words w/ optional internal hyphens/apostrophe
... |(?:[+/\-@&*]) # special characters with meanings
... """
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']
Tanpa NLTK, dengan menggunakan modul re
python, kita melihat bahwa pola regex lama tidak didukung secara asli:
>>> pattern1 = r"""(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... |\$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
... |\w+([-']\w+)* # words w/ optional internal hyphens/apostrophe
... |[+/\-@&*] # special characters with meanings
... |\S\w* # any sequence of word characters#
... """
>>> text="My weight is about 68 kg, +/- 10 grams."
>>> re.findall(pattern1, text)
[('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]
>>> pattern2 = r"""(?x) # set flag to allow verbose regexps
... (?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
... |\d+(?:\.\d+)?%? # numbers, incl. currency and percentages
... |\w+(?:[-']\w+)* # words w/ optional internal hyphens/apostrophe
... |(?:[+/\-@&*]) # special characters with meanings
... """
>>> text="My weight is about 68 kg, +/- 10 grams."
>>> re.findall(pattern2, text)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']
Catatan: Perubahan dalam cara RegexpTokenizer NLTK mengkompilasi regex akan membuat contoh di Tokenizer Ekspresi Reguler NLTK juga sudah usang.
person
alvas
schedule
31.01.2016
U.S.A.
menggunakan(U.S.A.)
sudah cukup. - person   schedule 31.01.2016