Understanding Neural Machine Translation: Encoder-Decoder Architecture

Nicholas Asquith
Towards Data Science
3 min read · Nov 10, 2019


The state of the art in machine translation has been built on Recurrent Neural Networks (RNNs) arranged in an encoder-attention-decoder model. Here I will try to cover, at a high level, how it all works.

Language Translation: Components

We can break translation into two components: the individual units (the words, or tokens) and the grammar that ties them together.

To perform computations in a neural net, we need to encode sequences of words as vectors. And because word order carries meaning, a recurrent neural network is well suited to the task.
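Here is a minimal sketch of what that looks like in code, assuming a PyTorch setup (the article doesn't specify a framework): word IDs are embedded into vectors and fed through an LSTM, which produces one output vector per source word. The vocabulary and layer sizes are illustrative choices, not values from the article.

```python
# A minimal sketch: a PyTorch encoder that turns a sequence of word IDs
# into vectors and runs them through an LSTM.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # word IDs -> vectors
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, src_ids):                 # src_ids: (batch, src_len)
        embedded = self.embed(src_ids)          # (batch, src_len, embed_dim)
        outputs, (h, c) = self.rnn(embedded)    # outputs: one vector per source word
        return outputs, (h, c)

# Example: encode a batch containing one 5-word "sentence" of token IDs.
enc = Encoder()
outputs, state = enc(torch.tensor([[12, 7, 256, 3, 88]]))
print(outputs.shape)   # torch.Size([1, 5, 512])
```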

The problem

This encoder-decoder architecture, however, breaks down for sentences longer than roughly 20 words.

Why? Longer sentences expose the limitations of a uni-directional encoder-decoder architecture.

Because language consists of both tokens and grammar, the problem with this model is that it does not fully capture the complexity of the grammar.

Specifically, when translating the nth word of the source sentence, the RNN considers only the first n words, but grammatically the meaning of a word depends on the words both before and after it in the sentence.

A solution: The bi-directional LSTM model

A bi-directional model lets us feed in the context of both past and future words, producing a more accurate encoder output vector for each position.
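Continuing the illustrative PyTorch framing above, this is essentially a one-flag change to the encoder's LSTM; a quick sketch (sizes again made up for illustration):

```python
# Sketch of the bi-directional variant: the LSTM reads the sentence
# left-to-right and right-to-left, so each output vector carries context
# from the words before *and* after it.
import torch
import torch.nn as nn

embed_dim, hidden_dim = 256, 512
bi_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

embedded = torch.randn(1, 5, embed_dim)   # stand-in for 5 embedded source words
outputs, _ = bi_lstm(embedded)
print(outputs.shape)   # torch.Size([1, 5, 1024]) -- forward and backward halves concatenated
```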

But then the challenge becomes: which words in the sequence do we need to focus on?

In 2015, Bahdanau et al. published a paper (Neural Machine Translation by Jointly Learning to Align and Translate) showing that we can learn which words in the source language to focus on: the model stores the encoder LSTM's outputs, scores each one by its relevance to the word currently being generated, and combines them according to those scores.

The resulting architecture embeds this attention mechanism between the encoder and the decoder.
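To make that concrete, here is a hedged sketch of Bahdanau-style (additive) attention: every stored encoder output is scored against the decoder's current hidden state, the scores are turned into weights with a softmax, and the weighted sum becomes a context vector. The layer names and sizes are my own illustrative choices, not the article's.

```python
# Additive (Bahdanau-style) attention sketch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    def __init__(self, enc_dim=1024, dec_dim=512, attn_dim=256):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, attn_dim)
        self.W_dec = nn.Linear(dec_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, dec_hidden, enc_outputs):
        # dec_hidden: (batch, dec_dim); enc_outputs: (batch, src_len, enc_dim)
        scores = self.v(torch.tanh(
            self.W_enc(enc_outputs) + self.W_dec(dec_hidden).unsqueeze(1)
        )).squeeze(-1)                                   # (batch, src_len)
        weights = F.softmax(scores, dim=-1)              # relevance of each source word
        context = (weights.unsqueeze(-1) * enc_outputs).sum(dim=1)  # weighted sum
        return context, weights

attn = AdditiveAttention()
context, weights = attn(torch.randn(1, 512), torch.randn(1, 5, 1024))
print(weights)   # one weight per source word, summing to 1
```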

The result is a clear performance improvement over the plain encoder-decoder architecture.

This is it in a nutshell, and it is also how Google Translate’s NMT works, albeit scaled up with many more stacked LSTM layers in the encoder and decoder.

To recap:

  1. The encoder takes each word in the source sentence and encodes it into a vector.
  2. These vector representations are passed into an attention mechanism, which determines which source words to focus on while generating each word of the output.
  3. The decoder turns those vector representations into words in the target language.
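To tie the recap together, here is a sketch of a single decoder step that consumes the attention context from the sketch above; again, every name and size is an illustrative assumption rather than the article's (or Google's) exact setup, and a real system would add beam search and much larger vocabularies.

```python
# One decoder step: embed the previous target word, combine it with the
# attention context, advance the LSTM state, and score the target vocabulary.
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=256, enc_dim=1024, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn_cell = nn.LSTMCell(embed_dim + enc_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)    # scores over target vocabulary

    def forward(self, prev_token, context, state):
        # prev_token: (batch,) previous target word ID; context: (batch, enc_dim)
        rnn_input = torch.cat([self.embed(prev_token), context], dim=-1)
        h, c = self.rnn_cell(rnn_input, state)
        return self.out(h), (h, c)      # pick or sample the next target word from these scores

step = DecoderStep()
state = (torch.zeros(1, 512), torch.zeros(1, 512))
logits, state = step(torch.tensor([1]), torch.randn(1, 1024), state)
print(logits.shape)   # torch.Size([1, 10000])
```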

Illustrations from CS Dojo Community
