VecMap: an open-source NLP framework for learning cross-lingual word embedding mappings

User submission, 2022-11-03

VecMap (cross-lingual word embedding mappings)

This is an open source implementation of our framework to learn cross-lingual word embedding mappings, described in the following papers:

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018. Generalizing and improving bilingual word embedding mappings with a multi-step framework of linear transformations. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), pages 5012-5019.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 451-462.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2289-2294.

The package includes a script to build cross-lingual word embeddings with or without parallel data as described in the papers, as well as evaluation tools in word translation induction, word similarity/relatedness and word analogy.

If you use this software for academic research, please cite the relevant paper(s).

Requirements

Python 3

NumPy

SciPy

CuPy (optional, only required for CUDA support)

Usage

In order to build your own cross-lingual word embeddings, you should first train monolingual word embeddings for each language using your favorite tool (e.g. word2vec or fasttext) and then map them to a common space with our software as described below. Having done that, you can evaluate the resulting cross-lingual embeddings using our included tools as discussed next.

Mapping

The mapping software offers 4 main modes with our recommended settings for different scenarios:

Supervised (recommended if you have a large training dictionary):

python3 map_embeddings.py --supervised TRAIN.DICT SRC.EMB TRG.EMB SRC_MAPPED.EMB TRG_MAPPED.EMB

Semi-supervised (recommended if you have a small seed dictionary):

python3 map_embeddings.py --semi_supervised TRAIN.DICT SRC.EMB TRG.EMB SRC_MAPPED.EMB TRG_MAPPED.EMB

Identical (recommended if you have no seed dictionary but can rely on identical words):

python3 map_embeddings.py --identical SRC.EMB TRG.EMB SRC_MAPPED.EMB TRG_MAPPED.EMB

Unsupervised (recommended if you have no seed dictionary and do not want to rely on identical words):

python3 map_embeddings.py --unsupervised SRC.EMB TRG.EMB SRC_MAPPED.EMB TRG_MAPPED.EMB

SRC.EMB and TRG.EMB refer to the input monolingual embeddings, which should be in the word2vec text format, whereas SRC_MAPPED.EMB and TRG_MAPPED.EMB refer to the output cross-lingual embeddings. The training dictionary TRAIN.DICT, if any, should be given as a text file with one entry per line (source word + whitespace + target word).
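For reference, the word2vec text format starts with a header line holding the vocabulary size and the dimensionality, followed by one line per word with its vector components separated by spaces. The sketch below writes a tiny embedding file and a dictionary file in the layouts described above and reads the embeddings back. The helper names, the toy words and the vectors are illustrative assumptions, not part of VecMap.

```python
from pathlib import Path
import tempfile


def write_word2vec_text(path, vectors):
    """Write {word: [floats]} in word2vec text format (header + one word per line)."""
    dim = len(next(iter(vectors.values())))
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"{len(vectors)} {dim}\n")  # header: vocabulary size, dimensionality
        for word, vec in vectors.items():
            f.write(word + " " + " ".join(f"{x:.6f}" for x in vec) + "\n")


def read_word2vec_text(path):
    """Read a word2vec text file back into a {word: [floats]} dict."""
    with open(path, encoding="utf-8") as f:
        count, dim = map(int, f.readline().split())
        vectors = {}
        for line in f:
            word, *values = line.rstrip("\n").split(" ")
            assert len(values) == dim, f"bad dimensionality for {word!r}"
            vectors[word] = [float(v) for v in values]
    assert len(vectors) == count, "header count does not match the file body"
    return vectors


def write_dictionary(path, pairs):
    """Write one 'source target' entry per line."""
    with open(path, "w", encoding="utf-8") as f:
        for src, trg in pairs:
            f.write(f"{src} {trg}\n")


# Toy data (hypothetical): two English words and their Spanish translations.
tmp = Path(tempfile.mkdtemp())
write_word2vec_text(tmp / "SRC.EMB", {"cat": [0.1, 0.2, 0.3], "dog": [0.4, 0.5, 0.6]})
write_dictionary(tmp / "TRAIN.DICT", [("cat", "gato"), ("dog", "perro")])
print((tmp / "SRC.EMB").read_text(encoding="utf-8"))
```

Reading the file back with read_word2vec_text is a cheap way to catch a wrong header or a ragged line before starting a long mapping run.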

If you have an NVIDIA GPU, append the --cuda flag to the above commands to make things faster.

For most users, the above settings should suffice. Choosing the right mode should be straightforward depending on the resources available: as a general rule, you should prefer the mode with the highest supervision for the resources you have, although it is advised to try different variants in case of doubt.

In addition to these recommended modes, the software also offers further options to adjust different aspects of the mapping method as described in the papers. While most users should not need to deal with those, you can learn more about them by running the tool with the --help flag. You can either use one of the recommended modes and modify a few options on top of it, or skip the recommended modes and set all options yourself. In fact, if you dig into the code, you will see that the above modes simply set recommended defaults for all the different options.
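When the mapping step is driven from a script, the four recommended invocations can be assembled programmatically. The following is a minimal sketch that only uses the flags and argument order shown above; the function name build_mapping_command and the file names in the example are hypothetical.

```python
import shlex

# Recommended modes listed above; the first two take a training dictionary.
MODES_WITH_DICT = {"supervised", "semi_supervised"}
MODES_WITHOUT_DICT = {"identical", "unsupervised"}


def build_mapping_command(mode, src_emb, trg_emb, src_out, trg_out,
                          train_dict=None, cuda=False):
    """Return the argument list for map_embeddings.py in a recommended mode."""
    if mode in MODES_WITH_DICT:
        if train_dict is None:
            raise ValueError(f"--{mode} needs a training dictionary")
        args = [f"--{mode}", train_dict]
    elif mode in MODES_WITHOUT_DICT:
        if train_dict is not None:
            raise ValueError(f"--{mode} does not take a training dictionary")
        args = [f"--{mode}"]
    else:
        raise ValueError(f"unknown mode: {mode}")
    cmd = ["python3", "map_embeddings.py", *args, src_emb, trg_emb, src_out, trg_out]
    if cuda:
        cmd.append("--cuda")  # only useful with an NVIDIA GPU and CuPy installed
    return cmd


cmd = build_mapping_command("semi_supervised", "en.emb", "es.emb",
                            "en.mapped.emb", "es.mapped.emb",
                            train_dict="en-es.dict", cuda=True)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the directory that contains map_embeddings.py.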

Evaluation

You can evaluate your mapped embeddings in bilingual lexicon extraction (aka dictionary induction or word translation) as follows:

python3 eval_translation.py SRC_MAPPED.EMB TRG_MAPPED.EMB -d TEST.DICT

The above command uses standard nearest neighbor retrieval by default. For best results, it is recommended that you use CSLS retrieval instead:

python3 eval_translation.py SRC_MAPPED.EMB TRG_MAPPED.EMB -d TEST.DICT --retrieval csls

While better, CSLS is also significantly slower than nearest neighbor, so do not forget to append the --cuda flag to the above command if you have an NVIDIA GPU.
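To illustrate the difference between the two retrieval methods: nearest neighbor picks the target word with the highest cosine similarity, whereas CSLS (cross-domain similarity local scaling) subtracts from that score the average similarity of each word to its k nearest neighbors in the other language, which penalizes "hub" words that are close to everything. The NumPy sketch below is a brute-force illustration on toy vectors, with function names and a value of k chosen for the example; it is not the code used by eval_translation.py.

```python
import numpy as np


def normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def nn_retrieval(src, trg):
    """Index of the most similar target row for each source row."""
    return (src @ trg.T).argmax(axis=1)


def csls_retrieval(src, trg, k=10):
    """CSLS score: 2*cos(x, y) - r(x) - r(y), where r is the mean cosine
    of a word to its k nearest neighbors in the other language."""
    sim = src @ trg.T
    k_trg = min(k, trg.shape[0])
    k_src = min(k, src.shape[0])
    # Mean similarity of each source word to its k nearest target words.
    r_src = np.sort(sim, axis=1)[:, -k_trg:].mean(axis=1)
    # Mean similarity of each target word to its k nearest source words.
    r_trg = np.sort(sim, axis=0)[-k_src:, :].mean(axis=0)
    return (2 * sim - r_src[:, None] - r_trg[None, :]).argmax(axis=1)


# Toy data: target 0 is a hub that sits between both source words.
src = normalize(np.array([[1.0, 0.0], [0.0, 1.0]]))
trg = normalize(np.array([[1.0, 1.0], [0.64, -0.77], [-0.77, 0.64]]))

print("nearest neighbor:", nn_retrieval(src, trg).tolist())   # [0, 0]
print("CSLS (k=2):", csls_retrieval(src, trg, k=2).tolist())  # [1, 2]
```

In this toy case nearest neighbor maps both source words to the hub, while CSLS assigns each source word its own target. The extra sorting over the full similarity matrix is also what makes CSLS the slower option.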

In addition to bilingual lexicon extraction, you can also evaluate your mapped embeddings in cross-lingual word similarity as follows:

python3 eval_similarity.py -l --backoff 0 SRC_MAPPED.EMB TRG_MAPPED.EMB -i TEST_SIMILARITY.TXT

Finally, we also offer an evaluation tool for monolingual word analogies, which mimics the one included with word2vec but should run significantly faster:

python3 eval_analogy.py -l SRC_MAPPED.EMB -i TEST_ANALOGIES.TXT -t 30000
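As background, the word2vec analogy tool answers "a is to b as c is to ?" by taking the vector offset b - a + c over length-normalized vectors and returning the closest word that is not one of the three question words. The sketch below shows that computation on hand-made toy vectors; the function name and the vocabulary are assumptions for illustration, and the code is not taken from eval_analogy.py.

```python
import numpy as np


def solve_analogy(words, matrix, a, b, c):
    """Answer 'a is to b as c is to ?' with the vector offset b - a + c."""
    index = {w: i for i, w in enumerate(words)}
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query = unit[index[b]] - unit[index[a]] + unit[index[c]]
    scores = unit @ query
    # The three question words are excluded from the candidates.
    scores[[index[a], index[b], index[c]]] = -np.inf
    return words[int(scores.argmax())]


# Toy vocabulary with axes (royalty, male, female); values are made up.
words = ["man", "woman", "king", "queen", "apple"]
matrix = np.array([
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [-1.0, 0.0, 0.0],
])

print(solve_analogy(words, matrix, "man", "woman", "king"))  # queen
```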

Dataset

You can use the following script to download the main dataset used in our papers, which is an extension of that of Dinu et al. (2014):

./get_data.sh

Reproducing results

While we always recommend using the above settings for best results when working with your own embeddings, we also offer additional modes to replicate the systems from our different papers as follows:

ACL 2018 (currently equivalent to the unsupervised mode):

python3 map_embeddings.py --acl2018 SRC.EMB TRG.EMB SRC_MAPPED.EMB TRG_MAPPED.EMB

AAAI 2018 (currently equivalent to the supervised mode, except for minor differences in re-weighting, normalization and dimensionality reduction):

python3 map_embeddings.py --aaai2018 TRAIN.DICT SRC.EMB TRG.EMB SRC_MAPPED.EMB TRG_MAPPED.EMB

ACL 2017 (superseded by our ACL 2018 system; offers 2 modes depending on the initialization):

python3 map_embeddings.py --acl2017 SRC.EMB TRG.EMB SRC_MAPPED.EMB TRG_MAPPED.EMB

python3 map_embeddings.py --acl2017_seed TRAIN.DICT SRC.EMB TRG.EMB SRC_MAPPED.EMB TRG_MAPPED.EMB

EMNLP 2016 (superseded by our AAAI 2018 system):

python3 map_embeddings.py --emnlp2016 TRAIN.DICT SRC.EMB TRG.EMB SRC_MAPPED.EMB TRG_MAPPED.EMB

FAQ

How long does training take?

The supervised mode (--supervised) should run in around 2 minutes on either CPU or GPU.

The remaining recommended modes (--semi_supervised, --identical or --unsupervised) should run in around 5 hours on CPU, or 10 minutes on GPU (Titan Xp or similar).

This is running much slower for me! What can I do?

If you have a GPU, do not forget the --cuda flag.

Make sure that your NumPy installation is properly linked to BLAS/LAPACK. This is particularly important if you are working on CPU, as it can have a huge impact on performance if not properly set up.

There are different settings that affect the execution time of the algorithm and can thus be adjusted to make things faster: the batch size (--batch_size), the vocabulary cutoff (--vocabulary_cutoff), the stochastic dictionary induction settings (--stochastic_initial, --stochastic_multiplier and --stochastic_interval) and the convergence threshold (--threshold), among others. However, most of these settings have a direct impact on the quality of the resulting embeddings, so you should not change them unless you really know what you are doing.
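A quick way to check the BLAS/LAPACK point is to print NumPy's build configuration and time a matrix product. The snippet below is a rough sanity check, not a benchmark; the helper name time_matmul and the matrix size are arbitrary choices for this example.

```python
import time

import numpy as np


def time_matmul(n=512, repeats=3):
    """Return the best wall-clock time (seconds) for an n x n float32 matrix product."""
    rng = np.random.default_rng(0)
    a = rng.standard_normal((n, n), dtype=np.float32)
    b = rng.standard_normal((n, n), dtype=np.float32)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        a @ b
        best = min(best, time.perf_counter() - start)
    return best


np.show_config()  # lists the BLAS/LAPACK libraries NumPy was built against
print(f"512x512 float32 matmul: {time_matmul():.4f} s")
```

If the configuration output lists no optimized BLAS library, or the timing is far worse than on a comparable machine, reinstalling NumPy from a distribution that bundles an optimized BLAS is worth trying before touching any of the algorithm settings.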

Prior versions of this software included nice scripts to reproduce the exact same results reported in your papers. Why are those missing now?

As the complexity of the software (and the number of publications/results to reproduce) increased, maintaining those scripts became very tedious. Moreover, with the inclusion of CUDA support and FP32 precision, minor numerical variations in the underlying computations, which self-learning magnifies, mean that the exact same command is likely to produce a slightly different output on CPU and GPU. While the effect on the final results is negligible (the observed variations are around 0.1-0.2 accuracy points), it made reproducing the exact same results across platforms infeasible.

Instead of that, we now provide an easy interface to run all the systems proposed in our different papers. We think that this might be even more useful than the previous approach: the most skeptical user should still be able to easily verify our results, while we also provide a simple interface to test our different systems in other datasets.
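The numerical variations mentioned above come from the fact that floating-point addition is not associative, so two platforms that sum the same numbers in a different order can get different answers. The toy example below shows the effect in FP32 with three hand-picked values; it is only an illustration of the underlying cause, not a computation taken from VecMap.

```python
import numpy as np

a, b, c = np.float32(1e8), np.float32(1.0), np.float32(-1e8)

left_to_right = (a + b) + c  # 1e8 + 1 rounds back to 1e8 in FP32, so the 1 is lost
reordered = (a + c) + b      # the large terms cancel first, so the 1 survives

print(left_to_right, reordered)  # 0.0 1.0
```

In an iterative procedure such as self-learning, small differences of this kind can change which dictionary entries are induced at one step and therefore grow over subsequent iterations.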

The ablation test in your ACL 2018 paper reports 0% accuracies for removing CSLS, but I am getting better results. Why is that?

After publishing the paper, we discovered a bug in the code that was causing those 0% accuracies. Now that the bug is fixed, the effect of removing CSLS is not that dramatic, although it still has a big negative impact. At the same time, the effect of removing the bidirectional dictionary induction in that same ablation test is slightly smaller.
