All posts (62)
A Language Major's NLP Log

Article source: https://openai.com/research/instruction-following#fn-5 — Aligning language models to follow instructions (openai.com)
We've trained language models that are much better at following user intentions than GPT-3 while also making them more truthful and less toxic, using techniques developed through our alignment research. These InstructGPT models, which are trained with …
Introduction: GPT-3, when text prompts are carefully …
Paper source: https://arxiv.org/abs/2304.00612 — Eight Things to Know about Large Language Models (arxiv.org)
The widespread public deployment of large language models (LLMs) in recent months has prompted a wave of new attention and engagement from advocates, policymakers, and scholars from many fields. This attention is a timely response to the many urgent questi…
Abstract: The LLMs released to the public over the past few months have drawn attention from advocates, legisl… across many fields.
Paper source: https://arxiv.org/abs/1804.10959 — Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates (arxiv.org)
Subword units are an effective way to alleviate the open vocabulary problems in neural machine translation (NMT). While sentences are usually converted into unique subword sequences, subword segmentation is potentially ambiguous and multiple segmentations …
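As a pointer to what "multiple subword candidates" looks like in practice, here is a minimal sketch using the SentencePiece library, which implements this paper's subword regularization: with sampling enabled, the same sentence can be segmented differently on each call. The model path "spm.model" is a placeholder for a unigram model assumed to be trained beforehand.

```python
import sentencepiece as spm

# Placeholder path: assumes a unigram SentencePiece model trained in advance.
sp = spm.SentencePieceProcessor(model_file="spm.model")

for _ in range(3):
    # enable_sampling draws a segmentation from the candidate lattice instead of
    # always returning the single best one; alpha smooths the sampling distribution
    # and nbest_size=-1 samples from all candidates.
    pieces = sp.encode("subword regularization", out_type=str,
                       enable_sampling=True, alpha=0.1, nbest_size=-1)
    print(pieces)  # a (possibly) different subword sequence on each iteration
```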
Paper source: https://koreascience.kr/article/JAKO202111037333482.page — Research on Subword Tokenization of Korean Neural Machine Translation and Proposal for Tokenization Method to Separate Jongsung (koreascience.kr)
Abstract: Since Neural Machine Translation (NMT) uses only a limited number of words, there is a possibility that words that are not registered in the dictionary will be entered as input. The proposed metho…
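The preview only names the method, so as a rough illustration of what "separating jongsung" could mean, here is a minimal sketch (my own, not the paper's algorithm) that uses the Unicode Hangul syllable arithmetic to split each precomposed syllable into the syllable without its final consonant plus the final consonant (jongsung) as a separate jamo.

```python
# 28 possible finals; index 0 means "no final consonant".
JONGSUNG = [
    "", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ", "ㄻ", "ㄼ", "ㄽ",
    "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ", "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ",
    "ㅌ", "ㅍ", "ㅎ",
]

def separate_jongsung(text: str) -> str:
    out = []
    for ch in text:
        code = ord(ch)
        if 0xAC00 <= code <= 0xD7A3:       # precomposed Hangul syllable block
            jong = (code - 0xAC00) % 28     # index of the final consonant
            out.append(chr(code - jong))    # syllable with the final removed
            if jong:
                out.append(JONGSUNG[jong])  # emit the final as a separate jamo
        else:
            out.append(ch)
    return "".join(out)

print(separate_jongsung("먹었다"))  # -> "머ㄱ어ㅆ다"
```

The intuition is that inflected forms which differ only in the final consonant then share more subwords with their stems.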
Paper source: https://arxiv.org/abs/2010.02534 — An Empirical Study of Tokenization Strategies for Various Korean NLP Tasks (arxiv.org)
Typically, tokenization is the very first step in most text processing works. As a token serves as an atomic unit that embeds the contextual information of text, how to define a token plays a decisive role in the performance of a model. Even though Byte Pai…
0. Abstract: Tokens …
Paper source: https://arxiv.org/abs/2105.14274
0. Abstract: Among character-level, morpheme-level, and BPE segmentation, the paper identifies the most effective method for Korean-English translation by training nine Transformer-based models for 50,000 epochs. The best result, BLEU 35.73, came from segmenting Korean with BPE and English with morphemes.
1. Introduction: Because character inventories, orthography, and so on differ across languages, it is important to choose a segmentation scheme suited to each language. Unlike English, Korean is written in syllable blocks formed by combining jamo. This paper applies and compares character-level, morpheme-level, and BPE segmentation.
2. Related Work: According to [1], Korean is segmented at the word, syllable, and phoneme level, while English uses B…
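For reference, a minimal sketch (my own illustration, not the paper's pipeline) of the kind of BPE segmentation compared above, trained and applied with the SentencePiece library; the corpus path "ko_corpus.txt" and the vocabulary size are assumptions.

```python
import sentencepiece as spm

# Train a small BPE model on a Korean corpus file (hypothetical path).
spm.SentencePieceTrainer.train(
    input="ko_corpus.txt",
    model_prefix="ko_bpe",
    vocab_size=8000,
    model_type="bpe",
)

# Segment a sentence with the trained model; the actual pieces depend on the corpus.
sp = spm.SentencePieceProcessor(model_file="ko_bpe.model")
print(sp.encode("자연어 처리는 재미있다", out_type=str))
```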
Paper source: https://arxiv.org/abs/2109.07446 — When Does Translation Require Context? A Data-driven, Multilingual Exploration (arxiv.org)
Although proper handling of discourse significantly contributes to the quality of machine translation (MT), these improvements are not adequately measured in common translation quality metrics. Recent works in context-aware MT attempt to target a small set …
0. Abstract …
Paper source: https://aclanthology.org/2023.acl-long.852/ — Knowledge Transfer in Incremental Learning for Multilingual Neural Machine Translation (aclanthology.org)
Kaiyu Huang, Peng Li, Jin Ma, Ting Yao, Yang Liu. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.
0. Abstract: A long-standing goal of MNMT is to learn new language pairs incrementally (incrementa…), without needing access to the earlier training data …