All posts (62)
A Language Major's NLP Log

Paper source: https://arxiv.org/abs/2310.00785
BooookScore: A systematic exploration of book-length summarization in the era of LLMs
"Summarizing book-length documents (>100K tokens) that exceed the context window size of large language models (LLMs) requires first breaking the input document into smaller chunks and then prompting an LLM to merge, update, and compress chunk-level summaries …"
Motivation: …
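The abstract above describes the incremental workflow BooookScore studies: split a long book into chunks, then fold each chunk into a running summary one step at a time. A minimal sketch of that loop, where `summarize_update` is a hypothetical stand-in for the actual LLM merge/update/compress prompt:

```python
# Sketch of chunk-then-update summarization. The real system prompts an LLM at
# each step; `summarize_update` here is a toy placeholder that just concatenates.

def chunk(tokens: list[str], size: int) -> list[list[str]]:
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def summarize_update(summary: str, chunk_text: str) -> str:
    """Hypothetical LLM step: merge the new chunk into the running summary."""
    return (summary + " | " + chunk_text).strip(" |")

def incremental_summary(tokens: list[str], size: int) -> str:
    """Fold every chunk into one running summary, left to right."""
    summary = ""
    for c in chunk(tokens, size):
        summary = summarize_update(summary, " ".join(c))
    return summary
```

The point is only the control flow: the document never has to fit in the context window, because each LLM call sees one chunk plus the current summary.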
Paper source: https://arxiv.org/abs/2305.17926
Large Language Models are not Fair Evaluators
"In this paper, we uncover a systematic bias in the evaluation paradigm of adopting large language models (LLMs), e.g., GPT-4, as a referee to score and compare the quality of responses generated by candidate models. We find that the quality ranking of cand…"
Motivation: The quality ranking produced with a GPT model merely reflects how the candidates are presented in the context …
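The positional bias this paper uncovers can be partly cancelled by judging each pair in both presentation orders and averaging, the idea behind the paper's calibration strategy. A self-contained sketch, where `toy_judge` is a hypothetical stand-in for an LLM judge and is deliberately biased toward whichever response is shown first:

```python
# Swap-and-average calibration for a position-biased pairwise judge.
# `toy_judge` is NOT a real LLM call: it scores by length and adds a +1.0
# bonus to the response in the first slot, simulating positional bias.

def toy_judge(first: str, second: str) -> tuple[float, float]:
    """Hypothetical judge: longer answer scores higher, first slot gets +1.0."""
    return float(len(first)) + 1.0, float(len(second))

def calibrated_scores(a: str, b: str) -> tuple[float, float]:
    """Judge (a, b) in both orders and average, so the slot bonus cancels."""
    a_first, b_second = toy_judge(a, b)
    b_first, a_second = toy_judge(b, a)
    return (a_first + a_second) / 2, (b_first + b_second) / 2
```

With `a="hi"` and `b="hey"`, a single biased call ties the two, while the calibrated scores correctly prefer the longer `b`.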

Paper source: https://arxiv.org/abs/2302.04166
GPTScore: Evaluate as You Desire
"Generative Artificial Intelligence (AI) has enabled the development of sophisticated models that are capable of producing high-caliber text, images, and other outputs through the utilization of large pre-trained models. Nevertheless, assessing the quality …"
Motivation: A metric that scores generated text by exploiting GPT's emergent abilities (zero-shot instruction following). From 80M to …
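GPTScore's core move is to cast evaluation as conditional generation: for each quality aspect you build an instruction-style prompt and score the hypothesis by the probability a language model assigns to it as the continuation. A sketch of the prompt-construction side; the templates are illustrative assumptions, not the paper's exact prompts:

```python
# Aspect-conditioned evaluation prompts ("evaluate as you desire"). A language
# model would then score log p(continuation | prompt); that scoring call is
# omitted here. Template wording is hypothetical.

ASPECT_TEMPLATES = {
    "fluency": "Rewrite the following text fluently: {src} In other words, ",
    "relevance": "Summarize the following text: {src} Tl;dr: ",
}

def build_eval_prompt(aspect: str, src: str, hyp: str) -> tuple[str, str]:
    """Return (prompt, continuation); the metric is the LM's average token
    log-probability of `continuation` given `prompt`."""
    prompt = ASPECT_TEMPLATES[aspect].format(src=src)
    return prompt, hyp
```

Changing the aspect only changes the instruction text, which is what lets one frozen model score many different quality dimensions zero-shot.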

Paper source: https://aclanthology.org/2020.emnlp-main.213/
COMET: A Neural Framework for MT Evaluation
Ricardo Rei, Craig Stewart, Ana C Farinha, Alon Lavie. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020.
Motivation: Existing metrics evaluated MT quality by measuring the similarity between the translation model's output (the hypothesis) and a human translation (the reference). Simple methods that count n-gram word matches, such as BLEU and METEOR, are lightweight and …
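The n-gram matching that COMET contrasts with learned metrics can be sketched in a few lines: count clipped n-gram overlaps between hypothesis and reference. This is modified n-gram precision only, not full BLEU (no brevity penalty, no geometric mean over orders):

```python
# Clipped n-gram precision, the building block of BLEU-style surface metrics.
from collections import Counter

def ngram_precision(hyp: list[str], ref: list[str], n: int) -> float:
    """Fraction of hypothesis n-grams that also appear in the reference,
    with each n-gram's credit clipped at its reference count."""
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    if not hyp_ngrams:
        return 0.0
    clipped = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
    return clipped / sum(hyp_ngrams.values())
```

Clipping is what stops a degenerate hypothesis like "the the the" from getting full credit for repeating one matching word; the limitation COMET targets is that no amount of counting like this captures meaning-preserving paraphrases.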
Paper source: https://arxiv.org/abs/2004.04696
BLEURT: Learning Robust Metrics for Text Generation
"Text generation has made significant advances in the last few years. Yet, evaluation metrics have lagged behind, as the most popular choices (e.g., BLEU and ROUGE) may correlate poorly with human judgments. We propose BLEURT, a learned evaluation metric ba…"
Motivation: The existing BLEU and ROUGE do not reflect human judgments well …
Paper source: https://arxiv.org/abs/2106.11520
BARTScore: Evaluating Generated Text as Text Generation
"A wide variety of NLP applications, such as machine translation, summarization, and dialog, involve text generation. One major challenge for these applications is how to evaluate whether such generated texts are actually fluent, accurate, or effective. In …"
Motivation: How should we evaluate the quality of generated text? Generation models …
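BARTScore's answer to the question above is to treat evaluation itself as generation: score a hypothesis by the average token log-likelihood the model assigns to it, i.e. the mean of log p(y_t | y_<t, x). The real metric uses BART; as a self-contained stand-in, this sketch scores tokens under a Laplace-smoothed unigram model built from a reference text — the same shape of computation with a toy probability model:

```python
# Average token log-likelihood as an evaluation score. A real BARTScore run
# would get per-token probabilities from BART conditioned on the source or
# reference; here a smoothed unigram model over `ref` plays that role.
import math
from collections import Counter

def avg_log_likelihood(hyp: list[str], ref: list[str]) -> float:
    """Mean log-probability of hypothesis tokens under a Laplace-smoothed
    unigram model estimated from the reference."""
    counts = Counter(ref)
    vocab = set(ref) | set(hyp)
    total = len(ref) + len(vocab)  # add-one smoothing denominator
    return sum(math.log((counts[t] + 1) / total) for t in hyp) / len(hyp)
```

A hypothesis made of tokens the reference actually uses scores higher (less negative) than one full of unseen tokens, which is the intuition the full seq2seq version sharpens with context-dependent probabilities.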

Paper source: https://aclanthology.org/2023.mtsummit-research.1/
Multiloop Incremental Bootstrapping for Low-Resource Machine Translation
Wuying Liu, Wei Li, Lin Wang. Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track. 2023.
1. Introduction: Machine translation has progressed from rule-based (RBMT) through statistical (SMT) to today's neural (NMT) systems. RBMT demands a great deal of time and labor and struggles to stay consistent. A strong SMT system needs roughly five million pairs of parallel data …

Post source: https://openai.com/research/language-models-can-explain-neurons-in-language-models
Language models can explain neurons in language models
"We use GPT-4 to automatically write explanations for the behavior of neurons in large language models and to score those explanations. We release a dataset of these (imperfect) explanations and scores for every neuron in GPT-2."
0. Introduction: The capabilities of LLMs …