
Commit 8de4d20

differences for PR #46
1 parent 6d652e3 commit 8de4d20

2 files changed: 22 additions & 15 deletions

01-introduction.md

Lines changed: 20 additions & 13 deletions
@@ -22,13 +22,13 @@ exercises: 60

## What is NLP?

-Natural language processing (NLP) is an area of research and application that focuses on making human languages processable for computers, so that they can perform useful tasks. It is therefore not a single method, but a collection of techniques that help us deal with linguistic inputs. The range of techniques spans simple word counts, to Machine Learning (ML) methods, all the way up to complex Deep Learning (DL) architectures.
+Natural language processing (NLP) is an area of research and application that focuses on making human languages processable for computers, so that they can perform useful tasks. It is therefore not a single method, but a collection of techniques that help us deal with linguistic inputs. The range of techniques spans from simple word counts to Machine Learning (ML) methods, all the way up to complex Deep Learning (DL) architectures.

We use the term "natural language", as opposed to "artificial language" such as programming languages, which are by design constructed to be easily formalized into machine-readable instructions. In contrast to programming languages, natural languages are complex, ambiguous, and heavily context-dependent, making them challenging for computers to process. To complicate matters, there is not only a single *human language*. More than 7000 languages are spoken around the world, each with its own grammar, vocabulary, and cultural context.

-In this course we will mainly focus on written language, specifically written English, we leave out audio and speech, as they require a different kind of input processing. But consider that we use English only as a convenience so we can address the technical aspects of processing textual data. While ideally most of the concepts from NLP apply to most languages, one should always be aware that certain languages require different approaches to solve seemingly similar problems. We would like to encourage the usage of NLP in other less widely known languages, especially if it is a minority language. You can read more about this topic in this [blogpost](https://www.ruder.io/nlp-beyond-english/).
+In this course we will mainly focus on written language, specifically written English. We leave out audio and speech, as they require a different kind of input processing. But consider that we use English only as a convenience so we can address the technical aspects of processing textual data. While ideally most of the concepts from NLP apply to most languages, one should always be aware that certain languages require different approaches to solve seemingly similar problems. We would like to encourage the use of NLP in other, less widely known languages, especially minority languages. You can read more about this topic in this [blogpost](https://www.ruder.io/nlp-beyond-english/).

-We can already find differences between languages in the most basic step for processing text. Take the problem of segmenting text into meaningful units, most of the times these units are words, in NLP we call this task **tokenization**. A naive approach is to obtain individual words by splitting text by spaces, as it seems obvious that we always separate words with spaces. Just as human beings break up sentences into words, phrases and other units in order to learn about grammar and other structures of a language, NLP techniques achieve a similar goal through tokenization. Let's see how can we segment or **tokenize** a sentence in English:
+We can already find differences between languages in the most basic step for processing text. Take the problem of segmenting text into meaningful units. Most of the time these units are words. In NLP we call this task **tokenization**. A naive approach is to obtain individual words by splitting text by spaces, as it seems obvious that we always separate words with spaces. Just as human beings break up sentences into words, phrases and other units in order to learn about grammar and other structures of a language, NLP techniques achieve a similar goal through tokenization. Let's see how we can segment, or **tokenize**, a sentence in English:

``` python
english_sentence = "Tokenization isn't always trivial."
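# The diff elides the next few lines of the lesson here; presumably the
# space-based split described above, along these lines (an assumption,
# consistent with the printed count of 4 shown below):
english_words = english_sentence.split(" ")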
@@ -42,7 +42,14 @@ print(len(english_words))
4
```

-The words are mostly well separated, however we do not get fully formed words (we have punctuation with the period after "trivial" and also special cases such as the abbreviation of "is not" into "isn't"). But at least we get a rough count of the number of words present in the sentence. Let's now look at the same example in Chinese:
+The words are mostly well separated; however, we do not get fully formed words (we have punctuation with the period after "trivial" and also special cases such as the abbreviation of "is not" into "isn't"). But at least we get a rough count of the number of words present in the sentence.
+
+::: callout
+### A short history of word separation
+As any historian knows, word separation in written texts is a relatively new development. You can check this yourself next time you visit a city with ancient monuments. Word separation, as odd as it might sound today, is an example of technology.
+:::
+
+Let's now look at the same example in Chinese:

``` python
# Chinese Translation of "Tokenization is not always trivial"
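# The lesson's actual code is elided by this diff; a hypothetical stand-in
# (the translated string below is an assumption, not the lesson's own text):
chinese_sentence = "标记化并不总是那么简单"
# Chinese does not separate words with spaces, so the naive space-based
# split from above finds just one "word":
print(len(chinese_sentence.split(" ")))  # 1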
@@ -87,15 +94,15 @@ Natural Language Processing deals with the challenges of correctly processing an

## Why should we learn NLP Fundamentals?

-In the past decade, NLP has evolved significantly, especially in the field of deep learning, to the point that it has become embedded in our daily lives, one just needs to look at the term Large Language Models (LLMs), the latest generation of NLP models, which is now ubiquitous in news media and tech products we use on a daily basis.
+In the past decade, NLP has evolved significantly, especially in the field of deep learning, to the point that it has become embedded in our daily lives. One just needs to look at the term Large Language Models (LLMs), the latest generation of NLP models, which is now ubiquitous in news media and tech products we use on a daily basis.

The term LLM is now often (and wrongly) used as a synonym of Artificial Intelligence. We could therefore think that today we just need to learn how to manipulate LLMs in order to fulfill our research goals involving textual data. The truth is that Language Modeling has always been one of the core tasks of NLP; therefore, by learning NLP you will better understand where the main ideas behind LLMs come from.

![NLP is an interdisciplinary field, and LLMs are just a subset of it](fig/intro0_cs_nlp.png)

-LLM is a blanket term for an assembly of large neural networks that are trained on vast amounts of text data with the objective of optimizing for language modeling. Once they are trained, they are used to generate human-like text or fine-tunned to perform much more advanced tasks. Indeed, the surprising and fascinating properties that emerge from training models at this scale allows us to solve different complex tasks such as answer elaborate questions, translate languages, solve complex problems, generate narratives that emulate reasoning, and many more, all of this with a single tool.
+LLM is a blanket term for an assembly of large neural networks that are trained on vast amounts of text data with the objective of optimizing for language modeling. Once they are trained, they are used to generate human-like text or fine-tuned to perform much more advanced tasks. Indeed, the surprising and fascinating properties that emerge from training models at this scale allow us to solve different complex tasks such as answering elaborate questions, translating languages, solving complex problems, generating narratives that emulate reasoning, and many more, all with a single tool.

-It is important, however, to pay attention to what is happening behind the scenes in order to be able **trace sources of errors and biases** that get hidden in the complexity of these models. The purpose of this course is precisely to take a step back, and understand that:
+It is important, however, to pay attention to what is happening behind the scenes in order to be able to **trace sources of errors and biases** that get hidden in the complexity of these models. The purpose of this course is precisely to take a step back and understand that:

- There are a wide variety of tools available, beyond LLMs, that do not require so much computing power
- Sometimes a much simpler method than an LLM is available that can solve our problem at hand
@@ -116,16 +123,16 @@ We can also argue if the statement "Chinese is generally tokenized character by

## Language as Data

-From a more technical perspective, NLP focuses on applying advanced statistical techniques to linguistic data. This is a key factor, since we need a structured dataset with a well defined set of features in order to manipulate it numerically. Your first task as an NLP practitioner is to **understand what aspects of textual data are relevant for your application** and apply techniques to systematically extract meaningful features from unstructured data (if using statistics or Machine Learning) or choose an appropriate neural architecture (if using Deep Learning) that can help solve our problem at hand.
+From a more technical perspective, NLP focuses on applying advanced statistical techniques to linguistic data. This is a key factor, since we need a structured dataset with a well-defined set of features in order to manipulate it numerically. Your first task as an NLP practitioner is to **understand what aspects of textual data are relevant for your application**. Afterwards you can apply techniques to systematically extract meaningful features from unstructured data (if using statistics or Machine Learning) or choose an appropriate neural architecture (if using Deep Learning) that can help solve the problem at hand.

### What is a word?

-When dealing with language our basic data unit is usually a word. We deal with sequences of words and with how they relate to each other to generate meaning in text pieces. Thus, our first step will be to load a text file and provide it with structure by splitting it into valid words (tokenization)!
+When dealing with language, our basic data unit is usually a word. We deal with sequences of words and with how they relate to each other to generate meaning in text pieces. Thus, our first step will be to load a text file and provide it with structure by splitting it into valid words (this is known as tokenization)!

::: callout
### Token vs Word

-For simplicity, in the rest of the course we will use the terms "word" and "token" interchangeably, but as we just saw they do not always have the same granularity. Originally the concept of token comprised dictionary words, numeric symbols and punctuation. Nowadays, tokenization has also evolved and became an optimization task on its own (How can we segment text in a way that neural networks learn optimally from text?). Tokenizers allow one to reconstruct or revert back to the original pre-tokenized form of tokens or words, hence we can afford to use *token* and *word* as synonyms. If you are curious, you can visualize how different state-of-the-art tokenizers split text [in this WebApp](https://tiktokenizer.vercel.app/)
+For simplicity, in the rest of the course we will use the terms "word" and "token" interchangeably, but as we just saw they do not always have the same granularity. Originally the concept of a token comprised dictionary words, numeric symbols and punctuation. Nowadays, tokenization has also evolved and become an optimization task in its own right (_How can we segment text in a way that neural networks learn optimally from text?_). Tokenizers allow one to reconstruct, or revert back to, the original pre-tokenized form of tokens or words, hence we can afford to use *token* and *word* as synonyms. If you are curious, you can visualize how different state-of-the-art tokenizers split text [in this WebApp](https://tiktokenizer.vercel.app/).
:::
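
To make that round-trip concrete, here is a minimal sketch, assuming the Hugging Face `transformers` package is installed and using the `gpt2` tokenizer (both are illustrative choices, not prescribed by the lesson):

``` python
# pip install transformers
from transformers import AutoTokenizer

# Load a pretrained sub-word tokenizer ("gpt2" is just one possible choice).
tokenizer = AutoTokenizer.from_pretrained("gpt2")

ids = tokenizer.encode("Tokenization isn't always trivial.")
print(tokenizer.convert_ids_to_tokens(ids))  # sub-word tokens, not whole words
print(tokenizer.decode(ids))                 # reconstructs the original string
```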

Let's open a file, read it into a string and split it by spaces. We will print the original text and the list of "words" to see how they look:
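
A minimal sketch of this step (the filename below is an assumption for illustration; the lesson's own code is elided from this diff):

``` python
# Read the whole file into a single string.
with open("text.txt", encoding="utf-8") as f:
    text = f.read()

# Naive tokenization: split the text on spaces.
words = text.split(" ")

print(text[:100])   # a peek at the original text
print(words[:20])   # the first few "words"
```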
@@ -203,7 +210,7 @@ print(len(only_verbs))
10148
```

-SpaCy also predicts the sentences under the hood for us. It might seem trivial to you as a human reader to recognize where a sentence begins and ends but for a machine, just like finding words, finding sentences is a task on its own, for which sentence-segmentation models exist. In the case of Spacy, we can access the sentences like this:
+SpaCy also predicts the sentences under the hood for us. It might seem trivial to you as a human reader to recognize where a sentence begins and ends. But for a machine, just like finding words, finding sentences is a task in its own right, for which sentence-segmentation models exist. In the case of spaCy, we can access the sentences like this:

``` python
sentences = [sent.text for sent in doc.sents] # Sentences are also python objects
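# An illustrative continuation (assumed; the rest of the lesson's code is
# elided from this diff):
print(len(sentences))   # how many sentences spaCy segmented
print(sentences[0])     # the first detected sentence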
@@ -372,15 +379,15 @@ Natural language exhibits a set of properties that make it more challenging to p

### Compositionality

-The basic elements of written languages are characters, a sequence of characters form words, and words in turn denote objects, concepts, events, actions and ideas (Goldberg, 2016). Subsequently words form phrases and sentences which are used in communication and depend on the context in which they are used. We as humans derive the meaning of utterances from interpreting contextual information that is present at different levels at the same time:
+The basic elements of written languages are characters; a sequence of characters forms words, and words in turn denote objects, concepts, events, actions and ideas (Goldberg, 2016). Subsequently, words form phrases and sentences which are used in communication and depend on the context in which they are used. We as humans derive the meaning of utterances by interpreting contextual information that is present at different levels at the same time:

![Levels of Language](fig/intro2_levels_lang.svg){width="573"}

The first two levels refer to spoken language only, and the other four levels are present in both speech and text. Because in principle machines do not have access to the same levels of information that we do (they can only have independent audio, textual or visual inputs), we need to come up with clever methods to overcome this significant limitation. Knowing the levels of language is important so that we consider what kinds of problems we are facing when attempting to solve our NLP task at hand.

### Ambiguity

-The disambiguation of meaning is usually a by-product of the context in which utterances are expressed and also the historic accumulation of interactions which are transmitted across generations (think for instance to idioms -- these are usually meaningless phrases that acquire meaning only if situated within their historical and societal context). These characteristics make NLP a particularly challenging field to work in.
+The disambiguation of meaning is usually a by-product of the context in which utterances are expressed and also of the historic accumulation of interactions which are transmitted across generations (think for instance of idioms -- these are usually meaningless phrases that acquire meaning only if situated within their historical and societal context). These characteristics make NLP a particularly challenging field to work in.

We cannot expect a machine to process human language and simply understand it as it is. We need a systematic, scientific approach to deal with it. It is from this premise that the field of NLP was born, primarily interested in converting the building blocks of human/natural language into something that a machine can understand.

md5sum.txt

Lines changed: 2 additions & 2 deletions
@@ -5,11 +5,11 @@
"index.md" "8192ac75bd179a0ba01eb2e2258afed5" "site/built/index.md" "2025-09-16"
"links.md" "7215ee9c7d9dc229d2921a40e899ec5f" "site/built/links.md" "2025-09-16"
"workshops.md" "a2cadfeeb8e5f49e2441c65f3989e43c" "site/built/workshops.md" "2025-09-16"
-"episodes/01-introduction.md" "fb4ac50502d79df8c58eea95b0e977db" "site/built/01-introduction.md" "2025-10-07"
+"episodes/01-introduction.md" "72946b9e94a759b7df0a33ed3925f7da" "site/built/01-introduction.md" "2025-10-13"
"episodes/02-preprocessing.md" "2f19e1ae0007128124cb7dd3ce9629a3" "site/built/02-preprocessing.md" "2025-09-24"
"episodes/03-transformers.md" "c171fb204a3033c2f2687a036875c0aa" "site/built/03-transformers.md" "2025-09-24"
"episodes/04-LargeLanguageModels.md" "96a5780c2121d4750a50c0a1c9a4f7b9" "site/built/04-LargeLanguageModels.md" "2025-09-24"
"instructors/instructor-notes.md" "cae72b6712578d74a49fea7513099f8c" "site/built/instructor-notes.md" "2025-09-16"
"learners/setup.md" "a0c051956d36f4793a9293c2e71afd1c" "site/built/setup.md" "2025-10-08"
"profiles/learner-profiles.md" "7cf2c1bec32069388ea395a4914bad46" "site/built/learner-profiles.md" "2025-09-16"
-"renv/profiles/lesson-requirements/renv.lock" NA "site/built/renv.lock" "2025-10-08"
+"renv/profiles/lesson-requirements/renv.lock" NA "site/built/renv.lock" "2025-10-13"
