Entropy and Language Modeling (homework 1)
Solution to the first homework for the NPFL067 course.
source_files = [
"TEXTCZ1.txt",
"TEXTEN1.txt"
]
## load the text files
def load_text(filename) -> str:
with open(filename, "r", encoding="utf-8") as f:
return f.read()
texts = {file : load_text(file).split('\n') for file in source_files}
for text, words in texts.items():
print(f"Text: {text}, word-count: {len(words)}, unique-words: {len(set(words))}")
lexicons = {file : set(texts[file]) for file in texts}
alphabets = {file : set("".join(texts[file])) for file in texts}
Text: TEXTCZ1.txt, word-count: 222413, unique-words: 42827
Text: TEXTEN1.txt, word-count: 221099, unique-words: 9608
Entropy of a text
Task: For each text, determine the conditional entropy of the word distribution given the previous word. We need to count two things:
- P(i) : the probability of word i occurring in the text,
- P(i,j) : the joint probability of word j immediately following word i in the text.
From these two we can also compute the conditional probability P(j|i), since P(j|i) = P(i,j) / P(i).
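For reference, the conditional entropy of the word distribution given the previous word, and the corresponding perplexity, are the standard quantities

$$H(J \mid I) = -\sum_{i,j} P(i,j)\,\log_2 P(j \mid i), \qquad PX(J \mid I) = 2^{H(J \mid I)},$$

which is exactly what the code below computes from the estimated P(i,j) and P(j|i).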
from collections import defaultdict
from typing import List, Dict, Tuple
def get_single_word_prob(text : List[str]) -> Dict[str, float]:
counts = defaultdict(int)
total_words = 0
    for word in text:
        ## defaultdict(int) already initializes unseen words to 0
        counts[word] += 1
        total_words += 1
return {word : count / total_words for word, count in counts.items()}
def get_pair_counts(text : List[str]) -> Tuple[List[Tuple[Tuple[str, str], int]], int]:
counts = defaultdict(int)
total_pairs = 0
for i, j in zip(text, text[1:]):
counts[(i,j)] += 1
total_pairs += 1
return ([(pair, count) for pair, count in counts.items()], total_pairs)
def get_pair_joint_prob(text : List[str]) -> Dict[Tuple[str, str], float]:
counts = defaultdict(int)
total_pairs = 0
    for i, j in zip(text, text[1:]):
        ## defaultdict(int) already initializes unseen pairs to 0
        counts[(i,j)] += 1
        total_pairs += 1
return {pair : count / total_pairs for pair, count in counts.items()}
def get_conditional_prob(joint_prob : Dict[Tuple[str, str], float], word_prob : Dict[str, float]):
return {pair : prob / word_prob[pair[0]] for pair, prob in joint_prob.items()}
Now we have all the data necessary to compute the conditional entropy (and perplexity) for both texts...
from math import log2
def get_entropy(word_probs : Dict[str, float]) -> float:
return -sum([word_probs[word] * log2(word_probs[word]) for word in word_probs.keys()])
def get_perplexity(word_probs : Dict[str, float]) -> float:
return 2**get_entropy(word_probs)
def get_conditional_entropy(joint_probs : Dict[Tuple[str, str], float], cond_probs : Dict[Tuple[str, str], float]) -> float:
return -sum([joint_probs[pair] * log2(cond_probs[pair]) for pair in joint_probs.keys()])
def get_conditional_perplexity(joint_probs : Dict[Tuple[str, str], float], cond_probs : Dict[Tuple[str, str], float]) -> float:
return 2**get_conditional_entropy(joint_probs, cond_probs)
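As a quick sanity check on hypothetical toy distributions (a fair coin and a single deterministic word pair, not taken from the assignment data), the helpers above should behave as expected:
## toy distributions only, for checking the helper functions
## a fair coin: entropy 1 bit, perplexity 2
assert abs(get_entropy({"heads": 0.5, "tails": 0.5}) - 1.0) < 1e-9
assert abs(get_perplexity({"heads": 0.5, "tails": 0.5}) - 2.0) < 1e-9
## a deterministic continuation: conditional entropy 0, perplexity 1
assert get_conditional_entropy({("a", "b"): 1.0}, {("a", "b"): 1.0}) == 0.0
assert get_conditional_perplexity({("a", "b"): 1.0}, {("a", "b"): 1.0}) == 1.0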
conditional_word_probs = {file : get_single_word_prob(texts[file]) for file in texts}
joint_probs = {file : get_pair_joint_prob(texts[file]) for file in texts}
cond_probs = {file : get_conditional_prob(joint_probs[file], conditional_word_probs[file]) for file in texts}
## count word entropy for the whole text
word_entropy = {text : get_entropy(conditional_word_probs[text]) for text in texts}
## count the conditional entropy for the whole text
cond_entropy = {text : get_conditional_entropy(joint_probs[text], cond_probs[text]) for text in texts}
cond_perplexity = {text : get_conditional_perplexity(joint_probs[text], cond_probs[text]) for text in texts}
for text in texts:
print(f"Text {text} has word entropy {format(word_entropy[text], ".3f")} and perplexity {format(get_perplexity(conditional_word_probs[text]), ".3f")}")
print()
for text in texts:
print(f"Text {text} has conditional entropy {format(cond_entropy[text], ".3f")} and perplexity {format(cond_perplexity[text], ".3f")}")
Text TEXTCZ1.txt has word entropy 11.538 and perplexity 2974.496
Text TEXTEN1.txt has word entropy 9.037 and perplexity 525.469

Text TEXTCZ1.txt has conditional entropy 4.748 and perplexity 26.868
Text TEXTEN1.txt has conditional entropy 5.287 and perplexity 39.056
Interpretation
Since the Czech text contains significantly more unique words compared to the English text, we expected the word entropy to be higher for the Czech text than for the English text. This expectation was confirmed by our experiment.
On the other hand, the conditional entropy shows the opposite trend, with English having higher entropy than Czech. This can be attributed to the English text containing roughly four times fewer unique words at about the same length, which means more distinct word pairs share the same first word. When predicting the following word with the previous word fixed, English therefore offers more plausible continuations, i.e., greater uncertainty about the next word.
To support this claim, we grouped all word pairs by their first word and, for the top 100 words, examined how often each word appears as the first word of a pair. We expect that the text with more unique words (Czech) would have a lower count of pairs with a common first word, leading to a higher probability of correctly predicting the following word (lower entropy).
import matplotlib.pyplot as plt
from collections import defaultdict
def plot_top_word_pairs(texts, top_n : int, title : str):
fig, ax = plt.subplots(figsize=(10, 6))
for text in texts:
unique_pairs, pairs_count = get_pair_counts(texts[text])
first_word_pairs = defaultdict(int)
for pair, count in unique_pairs:
first_word_pairs[pair[0]] += count
top_counts = sorted(first_word_pairs.values(), reverse=True)[:top_n]
# Plot the rank (1st, 2nd, etc.) vs counts
ranks = range(1, len(top_counts) + 1)
ax.plot(ranks, top_counts, label=f"Text {text}")
# Customize the plot
ax.set_title(title)
ax.set_xlabel('Rank (1st, 2nd, ...)')
ax.set_ylabel('Counts')
ax.legend()
ax.grid(True)
# Show plot
plt.tight_layout()
plt.show()
# Example usage:
plot_top_word_pairs(texts, 100, 'Top 100 Word Pair Counts by Rank for Each Text')
Mess it up!
In this section, we will manipulate the text to observe how it affects entropy and perplexity. We will run each experiment REPETITIONS times and report the minimum, maximum, and average scores for each (text, probability) experiment pair.
import random
from typing import List
def mess_up_chars(text : List[str], alphabet : str, prob : float) -> List[str]:
new_text = []
for word in text:
new_word = []
for char in word:
if random.random() <= prob:
new_word.append(random.choice(alphabet))
else:
new_word.append(char)
new_text.append("".join(new_word))
return new_text
def mess_up_words(text : List[str], lexicon : List[str], prob : float) -> List[str]:
new_text = []
for word in text:
if random.random() <= prob:
new_text.append(random.choice(lexicon))
else:
new_text.append(word)
return new_text
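As a quick, hypothetical usage example (toy words and a toy alphabet/lexicon rather than the corpora), with a fixed seed so the result is reproducible:
## hypothetical toy input, just to illustrate the two messing functions
random.seed(0)
toy_words = ["the", "cat", "sat"]
print(mess_up_chars(toy_words, list("abcdefghijklmnopqrstuvwxyz"), 0.5))  ## roughly half the characters replaced
print(mess_up_words(toy_words, ["dog", "mat", "on"], 0.5))                ## roughly half the words replaced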
Hypothesis
We expect that both the entropy and the perplexity will increase as the probability of altering a character or word increases. Since entropy measures the uncertainty of a random variable, randomly altering words or characters should move the distribution away from that of the original text, introducing more uncertainty into the prediction of the next word.
import random
import pickle
REPETITIONS = 10
results = {}
probs = [.2, .1, .05, .01, .001, .0001, .00001]
for text in texts:
results_single_text = {}
for prob in probs:
results_single_prob = {}
for type in ["char", "word"]:
random.seed(42)
results_single_type = []
for i in range(REPETITIONS):
if (type == "char"):
new_text = mess_up_chars(texts[text], list(alphabets[text]), prob)
else:
new_text = mess_up_words(texts[text], list(lexicons[text]), prob)
                ## evaluate a single experiment
new_word_probs = get_single_word_prob(new_text)
new_joint_probs = get_pair_joint_prob(new_text)
new_cond_probs = get_conditional_prob(new_joint_probs, new_word_probs)
new_cond_entropy = get_conditional_entropy(new_joint_probs, new_cond_probs)
new_cond_perplexity = get_conditional_perplexity(new_joint_probs, new_cond_probs)
results_single_type.append({"entropy" : new_cond_entropy, "perplexity" : new_cond_perplexity})
            ## aggregate the results for one messing type
min_entropy = min([result["entropy"] for result in results_single_type])
max_entropy = max([result["entropy"] for result in results_single_type])
avg_entropy = sum([result["entropy"] for result in results_single_type]) / REPETITIONS
min_perplexity = min([result["perplexity"] for result in results_single_type])
max_perplexity = max([result["perplexity"] for result in results_single_type])
avg_perplexity = sum([result["perplexity"] for result in results_single_type]) / REPETITIONS
results_single_prob[type] = {"entropy" : {"min" : min_entropy, "max" : max_entropy, "avg" : avg_entropy}, "perplexity" : {"min" : min_perplexity, "max" : max_perplexity, "avg" : avg_perplexity}}
## aggregate one probability results
results_single_text[format(prob, ".5f")] = results_single_prob
## aggregate one text results
results[text] = results_single_text
with open("results.pkl", "wb") as f:
pickle.dump(results, f)
Now let's compare the final results:
import pickle
from utils import plot_results, create_tables
import pandas as pd
with open("results.pkl", "rb") as f:
results = pickle.load(f)
# Call the function to plot
plot_results(results)
for table_name, table in create_tables(results).items():
print(f"{table_name}")
display(table)
Source: TEXTCZ1.txt, Messing: word
| Prob. | Min entropy | Max entropy | Avg entropy | Min perplexity | Max perplexity | Avg perplexity |
|---|---|---|---|---|---|---|
| 0.20000 | 4.47 | 4.49 | 4.48 | 22.14 | 22.41 | 22.27 |
| 0.10000 | 4.63 | 4.64 | 4.64 | 24.81 | 25.01 | 24.91 |
| 0.05000 | 4.70 | 4.70 | 4.70 | 25.96 | 26.05 | 25.99 |
| 0.01000 | 4.74 | 4.74 | 4.74 | 26.69 | 26.73 | 26.70 |
| 0.00100 | 4.75 | 4.75 | 4.75 | 26.85 | 26.87 | 26.85 |
| 0.00010 | 4.75 | 4.75 | 4.75 | 26.86 | 26.87 | 26.87 |
| 0.00001 | 4.75 | 4.75 | 4.75 | 26.87 | 26.87 | 26.87 |
Source: TEXTCZ1.txt, Messing: char
| Prob. | Min entropy | Max entropy | Avg entropy | Min perplexity | Max perplexity | Avg perplexity |
|---|---|---|---|---|---|---|
| 0.20000 | 3.53 | 3.55 | 3.54 | 11.58 | 11.71 | 11.62 |
| 0.10000 | 4.00 | 4.01 | 4.01 | 16.02 | 16.12 | 16.08 |
| 0.05000 | 4.33 | 4.34 | 4.34 | 20.15 | 20.29 | 20.21 |
| 0.01000 | 4.66 | 4.66 | 4.66 | 25.20 | 25.32 | 25.25 |
| 0.00100 | 4.74 | 4.74 | 4.74 | 26.69 | 26.70 | 26.69 |
| 0.00010 | 4.75 | 4.75 | 4.75 | 26.85 | 26.86 | 26.85 |
| 0.00001 | 4.75 | 4.75 | 4.75 | 26.87 | 26.87 | 26.87 |
Source: TEXTEN1.txt, Messing: word
| Prob. | Min entropy | Max entropy | Avg entropy | Min perplexity | Max perplexity | Avg perplexity |
|---|---|---|---|---|---|---|
| 0.20000 | 5.56 | 5.57 | 5.56 | 47.20 | 47.51 | 47.34 |
| 0.10000 | 5.46 | 5.46 | 5.46 | 43.91 | 44.12 | 43.99 |
| 0.05000 | 5.38 | 5.39 | 5.38 | 41.56 | 41.86 | 41.68 |
| 0.01000 | 5.30 | 5.31 | 5.31 | 39.53 | 39.63 | 39.59 |
| 0.00100 | 5.29 | 5.29 | 5.29 | 39.10 | 39.12 | 39.11 |
| 0.00010 | 5.29 | 5.29 | 5.29 | 39.06 | 39.07 | 39.06 |
| 0.00001 | 5.29 | 5.29 | 5.29 | 39.06 | 39.06 | 39.06 |
Source: TEXTEN1.txt, Messing: char
| Prob. | Min entropy | Max entropy | Avg entropy | Min perplexity | Max perplexity | Avg perplexity |
|---|---|---|---|---|---|---|
| 0.20000 | 4.02 | 4.04 | 4.03 | 16.23 | 16.45 | 16.32 |
| 0.10000 | 4.72 | 4.74 | 4.73 | 26.44 | 26.69 | 26.56 |
| 0.05000 | 5.05 | 5.06 | 5.06 | 33.18 | 33.39 | 33.30 |
| 0.01000 | 5.25 | 5.25 | 5.25 | 37.96 | 38.14 | 38.05 |
| 0.00100 | 5.28 | 5.28 | 5.28 | 38.93 | 38.98 | 38.96 |
| 0.00010 | 5.29 | 5.29 | 5.29 | 39.04 | 39.05 | 39.05 |
| 0.00001 | 5.29 | 5.29 | 5.29 | 39.05 | 39.06 | 39.06 |
Interpretation
The hypothesis holds true only for the manipulation of words in English. For the other experiments, we provide the following explanation:
Why does messing up characters not work?
In our initial hypothesis, we expected that manipulating characters would also lead to more options when predicting the next word. However, we overlooked the fact that altering single characters mostly creates entirely new words that are unseen elsewhere in the dataset. This in turn produces a much higher number of unique pairs that occur only once in the text. For such a single-occurrence pair (x, y), the conditional probability P(y|x) (the probability that y follows x) is equal to 1, so the pair contributes 0 to the conditional entropy. We illustrate this below by comparing the number of new word pairs with the number of changed words in the manipulated text: the closer the number of new pairs is to the number of changed words, the more obvious the prediction becomes.
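To make the mechanism concrete, here is a minimal self-contained sketch using made-up tokens (not drawn from the corpora):
from collections import Counter
from math import log2
toy = "a qz7 b a c b a c".split()           ## hypothetical toy text; the made-up token "qz7" occurs exactly once
pair_counts = Counter(zip(toy, toy[1:]))
first_counts = Counter(w for w, _ in zip(toy, toy[1:]))
## "qz7" starts exactly one pair, so its only continuation is certain ...
p_cond = pair_counts[("qz7", "b")] / first_counts["qz7"]
print(p_cond)        ## 1.0
## ... and the pair's term in the conditional entropy, -P(qz7, b) * log2(P(b | qz7)), is 0 bits
print(log2(p_cond))  ## 0.0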
messed_texts = {
text:
{
"char" : mess_up_chars(texts[text], list(alphabets[text]), 0.2),
"word" : mess_up_words(texts[text], list(lexicons[text]), 0.2)
}
for text in texts
}
messed_texts_char = {text : messed_texts[text]["char"] for text in texts}
messed_texts_word = {text : messed_texts[text]["word"] for text in texts}
for text in texts:
print(f"Evaluating {text}...")
for messing_type in ["char", "word"]:
messed_text = messed_texts[text][messing_type]
num_of_unique_words = len(lexicons[text])
num_of_unique_words_messed = len(set(messed_text))
diff_words = num_of_unique_words_messed - num_of_unique_words
prct_words = num_of_unique_words_messed / num_of_unique_words * 100 if diff_words > 0 else (1 - num_of_unique_words_messed / num_of_unique_words) * 100
pair_counts = get_pair_counts(texts[text])
pair_counts_messed = get_pair_counts(messed_text)
## calculate the number of pairs that start with the same word
same_first_word_counter = defaultdict(int)
for pair, count in pair_counts[0]:
same_first_word_counter[pair[0]] += 1
same_first_word_counter_messed = defaultdict(int)
for pair, count in pair_counts_messed[0]:
same_first_word_counter_messed[pair[0]] += 1
single_pair_count = len([count for count in same_first_word_counter.values() if count == 1])
single_pair_count_messed = len([count for count in same_first_word_counter_messed.values() if count == 1])
diff_pairs = single_pair_count_messed - single_pair_count
prct_pairs = single_pair_count_messed / single_pair_count * 100 if (diff_pairs > 0) else (1 - single_pair_count_messed / single_pair_count) * 100
print(f" Messed with: {messing_type}:")
print(f" {"+" if diff_words > 0 else ""}{diff_words} words ({"+" if diff_words > 0 else "-"}{prct_words:.2f}%)")
print(f" {"+" if diff_pairs > 0 else ""}{diff_pairs} pairs ({"+" if diff_pairs > 0 else "-"}{prct_pairs:.2f}%)")
Evaluating TEXTCZ1.txt...
  Messed with: char:
    +76477 words (+278.57%)
    +84003 pairs (+398.72%)
  Messed with: word:
    -1939 words (-4.53%)
    -17350 pairs (-61.70%)
Evaluating TEXTEN1.txt...
  Messed with: char:
    +85030 words (+984.99%)
    +80396 pairs (+1959.73%)
  Messed with: word:
    -11 words (-0.11%)
    -4239 pairs (-98.06%)
Interpretation
We observe that for both languages, altering characters generates roughly as many new single-occurrence pairs as it does new words, and in both cases the number of single-occurrence pairs grows rapidly. Manipulating characters therefore drives the conditional entropy down, because all of these single-occurrence pairs contribute 0 to it.
In the case of word manipulation, we see that in English we eliminate nearly all single-occurrence pairs while keeping essentially the same lexicon. This must increase the entropy, since P(y|x) is no longer equal to 1 for many pairs. We cannot make the same claim for Czech, where a much smaller share of the single-occurrence pairs was eliminated.
Cross Entropy and Language Modeling
Now, we will focus on language modeling based on n-gram predictions. We will split the dataset into train, heldout, and test sets. We will count conditional probabilities of each n-gram and use the EM algorithm to obtain the best mix of these n-gram predictions for a language model capable of predicting the next word.
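Concretely, the smoothed model we build below is a linear interpolation of four estimators, where V is the training vocabulary and the weights sum to one:

$$P_\lambda(w_i \mid w_{i-2}, w_{i-1}) = \lambda_0 \frac{1}{|V|} + \lambda_1 P(w_i) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_3 P(w_i \mid w_{i-2}, w_{i-1}), \qquad \sum_{k=0}^{3} \lambda_k = 1.$$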
from typing import List, Dict, Tuple
from collections import defaultdict
from collections import Counter
datasets : Dict[str, Dict[str, List[str]]] = {}
for text in texts:
datasets[text] = {
"test" : texts[text][-20_000:], ## last 20000 words
"heldout" : texts[text][:-20_000][-40_000:], ## last 40000 words from the what is left without the test
"train" : texts[text][:-60_000] ## the rest
}
## extract word counts from the train dataset for each ngram size and text
ngram_sizes = [3, 2, 1, 0]
counts : Dict[str, Dict[int, Counter]] = {
text : {
1 : Counter(datasets[text]["train"][:-2]),
2 : Counter(zip(datasets[text]["train"][:-1], datasets[text]["train"][1:-1])),
3 : Counter(zip(datasets[text]["train"], datasets[text]["train"][1:], datasets[text]["train"][2:])),
}
for text in texts
}
## calculate the conditional probability of each word given the previous n-1 words
ngram_probs = {}
for text in counts.keys():
uniform_prob = 1 / len(set(datasets[text]["train"]))
unigram_prob = {word : count / len(datasets[text]["train"]) for word, count in counts[text][1].items()}
bigram_prob = {pair : count / counts[text][1][pair[0]] for pair, count in counts[text][2].items()}
trigram_prob = {triple : count / counts[text][2][triple[:2]] for triple, count in counts[text][3].items()}
ngram_probs[text] = {0 : uniform_prob, 1 : unigram_prob, 2 : bigram_prob, 3 : trigram_prob}
## compute four smoothing parameters from the heldout dataset using EM algorithm
def em_algo(text : List[str],
uniform_prob : float,
unigram_cond_probs : Dict[Tuple[str], float],
bigram_cond_probs : Dict[Tuple[str, str], float],
trigram_cond_probs : Dict[Tuple[str, str, str], float],
lambdas : List[float],
epsilon : float = 1e-6) -> List[float]:
## retransform n-grams to the form of Dict[history, Dict[word, prob]]
bigram_cond_probs_transformed : Dict['str', Dict['str', float]] = {}
for (history, word), prob in bigram_cond_probs.items():
if history not in bigram_cond_probs_transformed:
bigram_cond_probs_transformed[history] = {}
bigram_cond_probs_transformed[history][word] = prob
trigram_cond_probs_transformed : Dict[Tuple[str, str], Dict[str, float]] = {}
for (history_0, history_1, word), prob in trigram_cond_probs.items():
if (history_0, history_1) not in trigram_cond_probs_transformed:
trigram_cond_probs_transformed[(history_0, history_1)] = {}
trigram_cond_probs_transformed[(history_0, history_1)][word] = prob
prev_lambdas = lambdas[:]
## iterate until convergence
while True:
expected_counts = [0] * len(lambdas)
        for i in range(2, len(text)):
            ## predict w_i from its two preceding words; start at i = 2 so the history is always well defined
            wi_2, wi_1, wi = text[i-2], text[i-1], text[i]
## calculate the probability of the trigram
m0 = uniform_prob
m1 = unigram_cond_probs.get(wi, 0)
m2 = bigram_cond_probs_transformed[wi_1].get(wi, 0) if (wi_1 in bigram_cond_probs_transformed) else uniform_prob
m3 = trigram_cond_probs_transformed[(wi_2, wi_1)].get(wi, 0) if ((wi_2, wi_1) in trigram_cond_probs_transformed) else uniform_prob
current_prob = 0
current_prob += lambdas[0] * m0
current_prob += lambdas[1] * m1
current_prob += lambdas[2] * m2
current_prob += lambdas[3] * m3
## calculate the expected counts
expected_counts[0] += lambdas[0] * m0 / current_prob
expected_counts[1] += lambdas[1] * m1 / current_prob
expected_counts[2] += lambdas[2] * m2 / current_prob
expected_counts[3] += lambdas[3] * m3 / current_prob
## normalize the expected counts
total_counts = sum(expected_counts)
lambdas = [count / total_counts for count in expected_counts]
## check for convergence
if all([abs(old - new) < epsilon for new, old in zip(lambdas, prev_lambdas)]):
break
prev_lambdas = lambdas[:]
return lambdas
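Each iteration of em_algo performs the usual E-step and M-step on the heldout data: expected counts are accumulated for every mixture component and then renormalized to give the new weights,

$$c_k = \sum_i \frac{\lambda_k\, p_k(w_i \mid h_i)}{\sum_l \lambda_l\, p_l(w_i \mid h_i)}, \qquad \lambda_k^{new} = \frac{c_k}{\sum_l c_l},$$

where p_0 is the uniform distribution and p_1, p_2, p_3 are the unigram, bigram, and trigram estimates.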
from math import log2
def get_cross_entropy(
text : List[str],
uniform_prob : float,
unigram_cond_probs : Dict[Tuple[str], float],
bigram_cond_probs : Dict[Tuple[str, str], float],
trigram_cond_probs : Dict[Tuple[str, str, str], float],
lambdas : List[float]
) -> float:
cross_entropy = 0
    for i in range(2, len(text)):
        ## again start at i = 2 so the two-word history never falls outside the text
        wi_2, wi_1, wi = text[i-2], text[i-1], text[i]
current_prob = 0
current_prob += lambdas[0] * uniform_prob
current_prob += lambdas[1] * unigram_cond_probs.get(wi, 0)
current_prob += lambdas[2] * bigram_cond_probs.get((wi_1, wi), 0)
current_prob += lambdas[3] * trigram_cond_probs.get((wi_2, wi_1, wi), 0)
cross_entropy -= log2(current_prob)
return cross_entropy / (len(text) - 2)
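get_cross_entropy evaluates the interpolated model on a text by averaging the negative log-probability over its trigram positions:

$$H(T; P_\lambda) = -\frac{1}{|T| - 2} \sum_{i} \log_2 P_\lambda(w_i \mid w_{i-2}, w_{i-1}).$$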
optimal_lambdas = {
text : em_algo(datasets[text]["heldout"],
ngram_probs[text][0],
ngram_probs[text][1],
ngram_probs[text][2],
ngram_probs[text][3],
[0.25, 0.25, 0.25, 0.25])
for text in texts
}
false_training_lambdas = {
text : em_algo(datasets[text]["train"],
ngram_probs[text][0],
ngram_probs[text][1],
ngram_probs[text][2],
ngram_probs[text][3],
[0.25, 0.25, 0.25, 0.25])
for text in texts
}
for text in texts:
lambdas = optimal_lambdas[text]
print(f"Smoothing parameters for {text}: ")
for i, lambda_ in enumerate(lambdas):
print(f" lambda_{i}: {lambda_*100:.3f}%")
cross_entropy = get_cross_entropy(datasets[text]['test'], ngram_probs[text][0], ngram_probs[text][1], ngram_probs[text][2], ngram_probs[text][3], lambdas)
print(f"Cross-entropy for {text}: {cross_entropy:.3f}")
print()
Smoothing parameters for TEXTCZ1.txt:
  lambda_0: 14.024%
  lambda_1: 42.892%
  lambda_2: 24.460%
  lambda_3: 18.624%
Cross-entropy for TEXTCZ1.txt: 10.443

Smoothing parameters for TEXTEN1.txt:
  lambda_0: 6.967%
  lambda_1: 25.407%
  lambda_2: 49.207%
  lambda_3: 18.418%
Cross-entropy for TEXTEN1.txt: 7.572
Interpretation
We observe that for Czech the model places more emphasis on the unigram than on the bigram or trigram, as the text contains many more unique words, so learned longer sequences are less likely to reappear. For the same reason, the Czech model also keeps a comparatively large weight (about 14%) on the uniform distribution. In contrast, since English has fewer unique words, the chance of seeing a sequence (at least a word pair) from the training set is higher, allowing the model to rely more on the bigram. This is most visible in the uniform component (lambda_0), to which the English model assigns a very low weight.
Because English relies more on the bigram, which provides a meaningful context-dependent prediction (unlike the unigram, which merely reflects how often a word occurs at all), the English model exhibits lower cross-entropy.
for text in texts:
lambdas = false_training_lambdas[text]
print(f"Smoothing parameters for {text}: ")
for i, lambda_ in enumerate(lambdas):
print(f" lambda_{i}: {lambda_*100:.3f}%")
cross_entropy = get_cross_entropy(datasets[text]['test'], ngram_probs[text][0], ngram_probs[text][1], ngram_probs[text][2], ngram_probs[text][3], lambdas)
print(f"Cross-entropy for {text}: {cross_entropy:.3f}")
print()
Smoothing parameters for TEXTCZ1.txt:
  lambda_0: 0.000%
  lambda_1: 0.001%
  lambda_2: 0.001%
  lambda_3: 99.998%
Cross-entropy for TEXTCZ1.txt: 36.181

Smoothing parameters for TEXTEN1.txt:
  lambda_0: 0.001%
  lambda_1: 0.000%
  lambda_2: 0.001%
  lambda_3: 99.998%
Cross-entropy for TEXTEN1.txt: 18.150
For the sake of the experiment, we also ran the EM algorithm on the (same) training data. The model then settles on predicting almost exclusively with the most detailed (trigram) distribution, since the data on which EM was run are the same data from which the component distributions were estimated. The trigram MLE is the most specific of the four components and achieves the highest likelihood on the training data (the lower-order components can be viewed as special cases of a trigram-context model), so EM pushes nearly all of the weight onto it.
Playing around with lambdas
First, we will artificially boost the trigram component of the model (even though the EM algorithm advised against it).
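Concretely, for an offset o we move a fraction o of the remaining probability mass onto the trigram weight and rescale the other three weights proportionally, so the mixture still sums to one:

$$\lambda_3' = \lambda_3 + o\,(1 - \lambda_3), \qquad \lambda_k' = \lambda_k\,\frac{1 - \lambda_3'}{1 - \lambda_3} \quad (k \in \{0, 1, 2\}).$$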
offsets : List[float] = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
trigram_boost : Dict[str, Dict[float, float]] = {}
for text in texts:
trigram_boost[text] = {}
trigram_lambda = optimal_lambdas[text][3]
trigram_diff = 1 - trigram_lambda
for offset in offsets:
remaining_lambdas_mass = sum(optimal_lambdas[text][:3])
remaining_mass = 1 - (trigram_lambda+trigram_diff*offset)
lambdas = [lambda_ * remaining_mass / remaining_lambdas_mass for lambda_ in optimal_lambdas[text][:3]]+[trigram_lambda+trigram_diff*offset]
if (sum(lambdas) < 0.9999):
print(f"Error: lambdas sum to {sum(lambdas)}")
cross_entropy = get_cross_entropy(datasets[text]['test'], ngram_probs[text][0], ngram_probs[text][1], ngram_probs[text][2], ngram_probs[text][3], lambdas)
trigram_boost[text][offset] = cross_entropy
Secondly, we will suppress the trigram model and emphasize the bigram and unigram components.
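Here the offset is instead the fraction of the tuned trigram weight that we keep (offset 0 discards the trigram entirely), with the other weights rescaled in the same way:

$$\lambda_3' = o\,\lambda_3, \qquad \lambda_k' = \lambda_k\,\frac{1 - \lambda_3'}{1 - \lambda_3} \quad (k \in \{0, 1, 2\}).$$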
offsets : List[float] = [0.99, 0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0]
lowergram_boost : Dict[str, Dict[float, float]] = {}
for text in texts:
lowergram_boost[text] = {}
trigram_lambda = optimal_lambdas[text][3]
for offset in offsets:
new_trigram_lambda = trigram_lambda*offset
remaining_mass = 1 - new_trigram_lambda
lambdas = [lambda_ * remaining_mass / (1 - trigram_lambda) for lambda_ in optimal_lambdas[text][:3]]+[new_trigram_lambda]
if (sum(lambdas) < 0.9999):
print(f"Error: lambdas sum to {sum(lambdas)}")
cross_entropy = get_cross_entropy(datasets[text]['test'], ngram_probs[text][0], ngram_probs[text][1], ngram_probs[text][2], ngram_probs[text][3], lambdas)
lowergram_boost[text][offset] = cross_entropy
import matplotlib.pyplot as plt
# Plot settings
plt.figure(figsize=(12, 6))
# Loop over each text in the dictionaries
for text in texts:
offsets_trigram = list(trigram_boost[text].keys())
cross_entropies_trigram = list(trigram_boost[text].values())
offsets_lowergram = list(lowergram_boost[text].keys())
cross_entropies_lowergram = list(lowergram_boost[text].values())
# Plot trigram_boost cross-entropy
plt.plot(offsets_trigram, cross_entropies_trigram, marker='o', label=f'{text} Trigram Boost')
# Plot lowergram_boost cross-entropy
plt.plot(offsets_lowergram, cross_entropies_lowergram, marker='x', label=f'{text} Lowergram Boost')
# Adding labels and title
plt.xlabel('Offset')
plt.ylabel('Cross Entropy')
plt.title('Cross Entropy Across Offsets for Trigram and Lowergram Boost')
plt.legend()
plt.grid(True)
plt.show()
import pandas as pd
# Create DataFrame for trigram_boost values
trigram_boost_df = pd.DataFrame(trigram_boost).T # Transpose to have texts as rows, offsets as columns
trigram_boost_df.columns = [f'Offset {offset}' for offset in trigram_boost_df.columns]
trigram_boost_df.index.name = 'Text'
# Create DataFrame for lowergram_boost values
lowergram_boost_df = pd.DataFrame(lowergram_boost).T # Transpose to have texts as rows, offsets as columns
lowergram_boost_df.columns = [f'Offset {offset}' for offset in lowergram_boost_df.columns]
lowergram_boost_df.index.name = 'Text'
# Display the DataFrames
print("Trigram Boost Cross-Entropy Table:")
display(trigram_boost_df)
print("\nLowergram Boost Cross-Entropy Table:")
display(lowergram_boost_df)
Trigram Boost Cross-Entropy Table:
| Text | Offset 0.0 | Offset 0.1 | Offset 0.2 | Offset 0.3 | Offset 0.4 | Offset 0.5 | Offset 0.6 | Offset 0.7 | Offset 0.8 | Offset 0.9 | Offset 0.95 | Offset 0.99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TEXTCZ1.txt | 10.394591 | 10.406147 | 10.474226 | 10.579072 | 10.718923 | 10.899240 | 11.133112 | 11.447540 | 11.905063 | 12.707247 | 13.521420 | 15.428067 |
| TEXTEN1.txt | 7.560139 | 7.571013 | 7.613759 | 7.683418 | 7.780816 | 7.910988 | 8.084625 | 8.323388 | 8.677306 | 9.307598 | 9.953423 | 11.474359 |
Lowergram Boost Cross-Entropy Table:
| Text | Offset 0.99 | Offset 0.95 | Offset 0.9 | Offset 0.8 | Offset 0.7 | Offset 0.6 | Offset 0.5 | Offset 0.4 | Offset 0.3 | Offset 0.2 | Offset 0.1 | Offset 0.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TEXTCZ1.txt | 10.394853 | 10.395974 | 10.397545 | 10.401313 | 10.406043 | 10.411941 | 10.419303 | 10.428586 | 10.440555 | 10.456712 | 10.480960 | 10.551031 |
| TEXTEN1.txt | 7.560288 | 7.560996 | 7.562145 | 7.565397 | 7.570077 | 7.576439 | 7.584839 | 7.595821 | 7.610277 | 7.629889 | 7.658753 | 7.723425 |
Interpretation
From the results we see that, for both languages, the more we prioritize trigram predictions, the worse the model becomes. This makes sense: the longer the n-gram, the fewer distinct sequences are observed during training, so more test trigrams are unseen and the boosted trigram component predicts them poorly.
Before commenting further on the results, note that the two experiments use the offset differently, so the lower-gram boost curves effectively run in the opposite direction along the x-axis: there the offset is the fraction of the tuned trigram lambda that is kept, not the amount added to it.
Suppressing the trigram and boosting the unigram and bigram has a similarly mild effect in both languages. Since the trigram contributed at most ~19% to the final prediction, we do not observe a significant decline when we suppress it. In essence, strengthening the components that already carry most of the weight in the mixture is far less harmful than shifting weight onto a minority component.