Lab 7-1: Detecting English through Programming

Note: Part of this lab comes from Al Sweigart’s great book, Hacking Secret Ciphers with Python: A Beginner’s Guide to Cryptography and Computer Programming with Python, available online at Invent With Python along with his other works. Feel free to check them out if they interest you!

Lab Requirements and Specifications

As you saw in the previous chapter, the transposition cipher has far too many possible keys (especially for long messages) for us to sort through by hand. It’s quite a chore for us, but no sweat for our computers. Sounds like a job that a program can do!

Provided for you is a dictionary file with over 45,000 of the most common words in the English language. You can download or view that file via the following link: dictionary.txt.

Let’s start by making clear what it means for a sentence to be English. Complex programs exist that analyze character patterns and frequencies to determine which language a text is written in, but we’re going to take the simpler route. To us, a word is considered English if it is one of the words in our dictionary. If not, then it is “not English”. A sentence, paragraph, or essay is then considered English if at least a specific percentage of its words are English. We are going to call this percentage target_percentage: the minimum share of words that have to be English before we consider the entire text English. target_percentage is given to the function as a float between 0 and 1.
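To make the word-level rule concrete, here is a minimal sketch of a dictionary lookup. The tiny sample_words dictionary is a hypothetical stand-in for the contents of dictionary.txt, and looks_english is an illustrative helper name, not part of the lab template.

```python
# Hypothetical stand-in for the words loaded from dictionary.txt
# (the real file holds over 45,000 uppercase words).
sample_words = {"HELLO": None, "THESE": None, "ARE": None, "WORDS": None}

def looks_english(word):
    # A word counts as English only if its uppercase form is a key
    # in the dictionary.
    return word.upper() in sample_words

print(looks_english("hello"))  # True
print(looks_english("ajds"))   # False
```

Checking membership with `in` on a dictionary is fast even for tens of thousands of words, which is one reason to store the word list as dictionary keys.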

As a reminder, you can find the % of words that are English by counting the number of English words, counting the number of total words, and then performing the following calculation to find our actual percentage:

\[\textrm{actual percentage} = \frac{\textrm{english words}}{\textrm{total words}}\]

As an example, let’s say we have a text with 100 words, 57 of which appear in our dictionary file. Our target percentage is .5, indicating that at least 50% of the words have to be English for our program to consider the text English. Since our actual percentage is 57 / 100 = .57, or 57%, it is at or above our target and passes. Therefore, by our standards, it is English.
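The same arithmetic can be sketched in a few lines of Python. The variable names here are chosen for illustration only and are not required by the lab.

```python
english_words = 57       # words found in the dictionary
total_words = 100        # all words in the text
target_percentage = 0.5  # at least 50% must be English

actual_percentage = english_words / total_words
print(actual_percentage)                       # 0.57
print(actual_percentage >= target_percentage)  # True
```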

In this lab, you are creating a module for other programs to import and use. The focus of this module will be a function called is_english(), which takes in message and target_percentage and will return True or False depending on whether it passes as English or not.

Keep in mind that while 50% might seem like a pretty low target, the majority of the text being analyzed (like when brute-forcing the transposition and Caesar ciphers) will be complete gibberish.

You should name your file FILN_detectEnglish.py, where FILN is your first initial and last name, no space.

Code Template

Below is the code that you should use as a template for your program.

#Detect English module

#To use, type this code:
#   import detectEnglish
#   detectEnglish.is_english(someString) # returns True or False
#(There must be a dictionary.txt file in this directory with
# all english words in it, one per line.  You can download
# this file from http://invpy.com/dictionary.txt)

# a string to hold all uppercase and lowercase letters,
#   plus the space character
UPPERLETTERS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
LETTERS_AND_SPACE = UPPERLETTERS + UPPERLETTERS.lower() + ' '

def loadDictionary():
    dictionaryFile = open('dictionary.txt') # opens file for reading
    englishWords = {} #create empty dictionary
    for word in dictionaryFile.read().split('\n'): #read file and
                                                   #  split by line
        englishWords[word] = None # add the word as a key
                                  #   to the dictionary, no value
    dictionaryFile.close()
    return englishWords

# a dictionary to hold all english words in given dictionary.txt file
ENGLISH_WORDS = loadDictionary()

# function: get_english_count
# purpose: gets the number of words that are found in the given
#   dictionary file
# arguments:
#   message: the message to count
# returns:
#   the number of "English" words in the message
def get_english_count(message):
    #code here
    pass # placeholder so the module runs; remove once implemented

# function: remove_non_letters
# purpose: removes any character that is not a letter or a space
#   from a string
# arguments:
#   message: the message in which to remove non-letter/space chars
# returns:
#   the modified message that no longer contains special chars
def remove_non_letters(message):
    #code here
    pass # placeholder so the module runs; remove once implemented

# function: message_to_upperlist
# purpose: take in a message as a string, convert it all to
#   uppercase, and turn it into a list of words (this function
#   should call the remove_non_letters() function)
# arguments:
#   message: the message to convert
# returns:
#   a list of words in the original message, without special chars
def message_to_upperlist(message):
    #code here
    pass # placeholder so the module runs; remove once implemented

# function: is_english
# purpose: determines whether a message is english based on
#   the percentage of its words that are found in our
#   provided dictionary file
# arguments:
#   message: the message to be analyzed
#   target_percentage: the minimum percentage for message to be
#       considered English. Default value if not provided is 30%
# returns:
#   True if the actual percentage is greater than or equal to the
#       target_percentage
#   False otherwise
def is_english(message, target_percentage=.3):
    #code here
    pass # placeholder so the module runs; remove once implemented

Your job is to code the four incomplete functions to their specifications.

You may want to keep the following tips in mind:
  • the dictionary.txt file contains words that are all uppercase, so make sure you use the .upper() string method to convert words to uppercase to match.
  • use the .split(" ") string method to turn a string into a list where items are separated by space.
  • but be careful - "hello   there".split(" ") will return ['hello', '', '', 'there'] since there are three spaces. Remove empty list values, as those are not words!
  • when thinking of ways to remove non letters, try thinking of it in a different way: instead of removing non-letters, what if we were to keep only letters and spaces?
  • be careful of divide by zero errors!
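The tips above can be tried out in isolation with a short sketch like the following. It is not the lab solution: the names pieces, words, allowed, kept, and safe_ratio are made up for illustration, and treating an empty text as 0% English is an assumption you may choose to handle differently.

```python
text = "hello   there"

# Splitting on a single space leaves empty strings between repeated spaces.
pieces = text.split(" ")
print(pieces)  # ['hello', '', '', 'there']

# Keep only the non-empty pieces (an empty string is falsy).
words = [piece for piece in pieces if piece]
print(words)  # ['hello', 'there']

# Keep only letters and spaces rather than listing everything to remove.
allowed = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" + "abcdefghijklmnopqrstuvwxyz" + " "
raw = "are you? yes!"
kept = "".join(ch for ch in raw if ch in allowed)
print(kept)  # are you yes

# Guard the division so a text with no words does not raise an error.
def safe_ratio(matches, total):
    if total == 0:
        return 0.0  # assumption: an empty text counts as 0% English
    return matches / total

print(safe_ratio(0, 0))   # 0.0
print(safe_ratio(5, 10))  # 0.5
```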

Since it is a module for other programs to import, you want to leave the instructions at the top of the code intact, so that other developers will know how to use it. However, once you’ve written each function, feel free to delete the instructions for that function.

Testing Your Program

You can test your program by feeding it sample sentences and thresholds. Sentences can be 10 words long to make your own calculations simple. The following are a few tests:

string = "Hello these are ten words ajds ahasd haaaae jsldf nassl"
print(is_english(string,0.1)) # should be True
print(is_english(string,0.4)) # should be True
print(is_english(string,0.7)) # should be False

string = "now to include extras, with punctuation right? AAAAA abasdf YOLO"
print(is_english(string,0.4)) # should be True
print(is_english(string,0.7)) # should be True
print(is_english(string,0.9)) # should be False

And of course, mid-development, you can test your functions individually. For example:

string = "hello alksdf how sdf dsasasfd are you asdf"
print(get_english_count(string)) # should be 4

string = "hello, alksdf ho^$3w sdf dsasasfd are you? asdf!!!!!"
print(remove_non_letters(string))
# should print "hello alksdf how sdf dsasasfd are you asdf"
