ANALYSIS OF THE CONTEXT OF WORDS IN PORTUGUESE USING WORD2VEC

Authors

  • Alexandre D’Elia
  • Myrian C.A. Costa
  • Nelson F.F. Ebecken
  • Valéria M. Bastos

Abstract

The wide availability of textual documents on the Web has prompted intense study of how
machines should handle related words and their contexts. The usual method of representing
words as indices in a vocabulary has become inadequate for new applications. This demand led
to the development of a new way to study words and contexts efficiently. Word2Vec, presented
in [1], is a group of models used to produce word embeddings. These models are two-layer
neural networks trained to reconstruct the linguistic contexts of words. Word2Vec takes a large
text corpus as input and produces a vector space, typically of several hundred dimensions, in
which each unique word in the corpus is assigned a corresponding vector. The word vectors are
positioned so that words sharing common contexts in the corpus lie close to one another in the
space. This paper analyzes documents and word contexts in the Portuguese language. For a
general study of the language, a database of 37.5 million randomly selected Web pages was
used. This made it possible to observe empirically how words are used, based on the contexts
in which they occur. Finally, according to the tests, performance was higher than expected.
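
As a minimal illustration of two ideas mentioned in the abstract, the sketch below shows (a) how (target, context) word pairs can be extracted from a tokenized sentence with a sliding window, which is the kind of input a skip-gram style model is trained on, and (b) how closeness between word vectors can be measured with cosine similarity. This is not the authors' pipeline: the function names, the Portuguese example sentence, the window size, and the three-dimensional vectors are all hypothetical and chosen only for illustration.

```python
import math


def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for tokens at most `window` positions apart."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical example sentence ("the cat drinks milk").
sentence = "o gato bebe leite".split()
pairs = skipgram_pairs(sentence, window=1)

# Made-up 3-dimensional vectors; real Word2Vec vectors are learned from a
# corpus and typically have several hundred dimensions.
toy_vectors = {
    "gato": [0.9, 0.1, 0.0],
    "cachorro": [0.8, 0.2, 0.1],
    "leite": [0.0, 0.2, 0.9],
}
sim_close = cosine_similarity(toy_vectors["gato"], toy_vectors["cachorro"])
sim_far = cosine_similarity(toy_vectors["gato"], toy_vectors["leite"])
```

With these made-up vectors, `sim_close` is larger than `sim_far`, mirroring the claim that words sharing contexts end up near each other in the vector space.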

Published

2024-08-26

Section

Articles