Polyglot: a natural language pipeline that supports massive multilingual applications

User submission, 2022-10-31


polyglot

Polyglot is a natural language pipeline that supports massive multilingual applications.

Free software: GPLv3 license
Documentation: http://polyglot.readthedocs.org.

Features

Tokenization (165 Languages)
Language detection (196 Languages)
Named Entity Recognition (40 Languages)
Part of Speech Tagging (16 Languages)
Sentiment Analysis (136 Languages)
Word Embeddings (137 Languages)
Morphological analysis (135 Languages)
Transliteration (69 Languages)

Developer

Rami Al-Rfou @ rmyeid gmail com

Quick Tutorial

import polyglot
from polyglot.text import Text, Word

Language Detection

text = Text("Bonjour, Mesdames.")
print("Language Detected: Code={}, Name={}\n".format(text.language.code, text.language.name))

Language Detected: Code=fr, Name=French

Tokenization

zen = Text("Beautiful is better than ugly. "
           "Explicit is better than implicit. "
           "Simple is better than complex.")
print(zen.words)

[u'Beautiful', u'is', u'better', u'than', u'ugly', u'.', u'Explicit', u'is', u'better', u'than', u'implicit', u'.', u'Simple', u'is', u'better', u'than', u'complex', u'.']

print(zen.sentences)

[Sentence("Beautiful is better than ugly."), Sentence("Explicit is better than implicit."), Sentence("Simple is better than complex.")]

Part of Speech Tagging

text = Text(u"O primeiro uso de desobediência civil em massa ocorreu em setembro de 1906.")

print("{:<16}{}".format("Word", "POS Tag")+"\n"+"-"*30)
for word, tag in text.pos_tags:
    print(u"{:<16}{:>2}".format(word, tag))

Word            POS Tag
------------------------------
O               DET
primeiro        ADJ
uso             NOUN
de              ADP
desobediência   NOUN
civil           ADJ
em              ADP
massa           NOUN
ocorreu         ADJ
em              ADP
setembro        NOUN
de              ADP
1906            NUM
.               PUNCT
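A tagged sentence is often summarized by counting how many times each tag occurs. The sketch below does this with the standard library only; the (word, tag) pairs are copied by hand from the output above rather than produced by polyglot, and the names pos_tags and tag_counts are illustrative assumptions.

```python
from collections import Counter

# (word, tag) pairs copied from the output above. With polyglot installed,
# text.pos_tags would be expected to supply the same kind of pairs.
pos_tags = [
    ("O", "DET"), ("primeiro", "ADJ"), ("uso", "NOUN"), ("de", "ADP"),
    ("desobediência", "NOUN"), ("civil", "ADJ"), ("em", "ADP"),
    ("massa", "NOUN"), ("ocorreu", "ADJ"), ("em", "ADP"),
    ("setembro", "NOUN"), ("de", "ADP"), ("1906", "NUM"), (".", "PUNCT"),
]

# Count occurrences of each tag, ignoring the words themselves.
tag_counts = Counter(tag for _, tag in pos_tags)

# Print the tags from most to least frequent.
for tag, count in tag_counts.most_common():
    print("{:<8}{:>2}".format(tag, count))
```

Because the counting step only needs an iterable of pairs, the hand-copied list could be swapped for the tagger's own output without other changes.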

Named Entity Recognition

text = Text(u"In Großbritannien war Gandhi mit dem westlichen Lebensstil vertraut geworden")
print(text.entities)

[I-LOC([u'Gro\xdfbritannien']), I-PER([u'Gandhi'])]
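Downstream code usually wants the entities grouped by type. The following is a minimal sketch of that grouping step; it uses plain (tag, tokens) tuples as stand-ins for the chunks printed above, and group_by_tag is a hypothetical helper, not part of polyglot.

```python
from collections import defaultdict

# Stand-ins for the chunks printed above: (tag, tokens) pairs.
# These tuples are an assumption made for illustration; they are not
# the objects polyglot returns.
entities = [("I-LOC", ["Großbritannien"]), ("I-PER", ["Gandhi"])]

def group_by_tag(chunks):
    """Map each entity tag to the list of entity strings carrying it."""
    grouped = defaultdict(list)
    for tag, tokens in chunks:
        # Multi-token entities are joined back into one string.
        grouped[tag].append(" ".join(tokens))
    return dict(grouped)

by_tag = group_by_tag(entities)
print(by_tag)
```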

Polarity

print("{:<16}{}".format("Word", "Polarity")+"\n"+"-"*30)
for w in zen.words[:6]:
    print("{:<16}{:>2}".format(w, w.polarity))

Word            Polarity
------------------------------
Beautiful        0
is               0
better           1
than             0
ugly            -1
.                0
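Word-level polarities can be combined into a rough score for a whole sentence. The sketch below averages the non-neutral words; this averaging rule and the helper mean_polarity are assumptions chosen for illustration, not a description of how polyglot scores text. The input values are copied from the output above.

```python
# (word, polarity) pairs copied from the output above.
word_polarities = [
    ("Beautiful", 0), ("is", 0), ("better", 1),
    ("than", 0), ("ugly", -1), (".", 0),
]

def mean_polarity(pairs):
    """Average the non-zero polarities; return 0.0 if every word is neutral."""
    scores = [polarity for _, polarity in pairs if polarity != 0]
    if not scores:
        return 0.0
    return sum(scores) / len(scores)

sentence_score = mean_polarity(word_polarities)
print(sentence_score)
```

Here "better" and "ugly" cancel each other, which shows one limitation of a plain average: it ignores how the words relate to each other in the sentence.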

Embeddings

word = Word("Obama", language="en")
print("Neighbors (Synonyms) of {}".format(word)+"\n"+"-"*30)
for w in word.neighbors:
    print("{:<16}".format(w))
print("\n\nThe first 10 dimensions out of the {} dimensions\n".format(word.vector.shape[0]))
print(word.vector[:10])

Neighbors (Synonyms) of Obama
------------------------------
Bush
Reagan
Clinton
Ahmadinejad
Nixon
Karzai
McCain
Biden
Huckabee
Lula


The first 10 dimensions out of the 256 dimensions

[-2.57382345  1.52175975  0.51070285  1.08678675 -0.74386948 -1.18616164
  2.92784619 -0.25694436 -1.40958667 -2.39675403]
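Neighbors in an embedding space are the words whose vectors lie closest to the query word's vector. The sketch below ranks words by cosine similarity over a tiny hand-made vocabulary. The vectors are invented for illustration, cosine similarity is one common choice and not necessarily the metric polyglot uses, and the helpers cosine_similarity and nearest are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, vectors, top_k=2):
    """Return the top_k words most similar to `query`, excluding itself."""
    scored = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in scored[:top_k]]

# Made-up 3-dimensional vectors; real embeddings have far more dimensions.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "prince": [0.8, 0.6, 0.2],
    "banana": [0.1, 0.0, 0.95],
}

print(nearest("king", toy_vectors))
```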

Morphology

word = Text("Preprocessing is an essential step.").words[0]
print(word.morphemes)

[u'Pre', u'process', u'ing']

Transliteration

from polyglot.transliteration import Transliterator
transliterator = Transliterator(source_lang="en", target_lang="ru")
print(transliterator.transliterate(u"preprocessing"))

препрокессинг
