I. Basic survival in Linux (or rather in Bash):
-----------------------------------------------

1) Name and describe at least two options for each of the following commands in bash: ls, sort, cut, iconv, grep (1 point).
    - ls -a -> also lists hidden files (names starting with a dot)
    - ls -l -> long listing with details (permissions, owner, size, modification time)
    - sort -u -> removes duplicate lines
    - sort -f -> ignores case when sorting (folds lowercase to uppercase)
    - cut -d -> sets the field delimiter (TAB by default)
    - cut -f -> selects the given fields (columns) from the input
    - iconv -f 'source_encoding' -> encoding of the input
    - iconv -t 'target_encoding' -> encoding of the output
    - grep -r -> searches recursively through subdirectories
    - grep -x -> prints only the lines that match the pattern as a whole (--line-regexp)

2) Give examples of what the .bashrc file can be used for (1 point).
    - defining aliases
    - changing the "theme" of the terminal (prompt, colors)
    - editing the PATH variable

3) Explain how command line pipelining works (1 point).
    - Unix programs typically read text from standard input and write text to standard output. The pipe "|" connects the standard output of one program to the standard input of the next one, so the output of the first program becomes the input of the second.
    - It is therefore very easy to build complex operations out of small single-purpose programs.

4) Create a bash script that counts the total number of words in all *txt files in all subdirectories of the current directory (2 points).
    - find . -type f -name "*.txt" -exec cat {} + | wc -w
    - find locates all regular files ending with .txt in the current directory and its subdirectories; -exec cat {} + runs cat on them, where {} is the placeholder for the file names and + ends the -exec section and passes many files to a single cat call; wc -w then counts the words of the concatenated output.

5) You created a new file called doit.sh and wrote some Bash commands into it. How do you run it now? (1 point)
    - To run it directly you need execute permission: grant it with chmod u+x doit.sh and then call ./doit.sh. Without the permission you can still run it as bash doit.sh.

6) What do you think the following command does? "ls -t | head -n 5 | cat -n"
    - It shows the 5 most recently modified files in the current directory as a numbered list.

II. Character encoding:
-----------------------

1) Explain the notions "character set" and "character encoding" (1 point).
    - A character set is the collection of characters that can be represented.
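      In ASCII, for example, it is the 128 characters the standard defines.
    - A character encoding is the mapping between the characters and their numeric (byte) representation; in the case of ASCII it is the ASCII table itself, a bijection between numbers and characters.

2) Explain the main properties of ASCII (1 point).
    - It uses 7 bits of information and leaves the 8th bit of a byte unused, so at most 128 characters are representable. It covers basic English only; for example '0' ~ 48 and 'A' ~ 65.

3) What 8-bit encoding do you know for Czech or other European languages (or your native language)? Name at least three. How do they differ from ASCII? (1 point)
    - ISO 8859-2 and Windows-1250, both for Central Europe, covering specific characters such as "č".
    - ISO 8859-15, covering Western Europe.
    - Unlike ASCII they use all 8 bits: codes 0-127 are the same as in ASCII, codes 128-255 hold the additional language-specific characters.

4) What is Unicode and what Unicode encodings do you know? (1 point)
    - Unicode is a universal character set that assigns a number (code point) to every character, so no region-specific variants are needed.
    - Its encodings are UTF-8, UTF-16 and UTF-32. UTF-8 is backward compatible with ASCII and uses up to 4 bytes per character.

5) Explain the relation between UTF-8 and ASCII. (1 point)
    - UTF-8 has a variable length of 1 to 4 bytes per character. All ASCII characters are encoded in a single byte with the same value as in ASCII, so every ASCII text is also valid UTF-8.

6) How can you detect the encoding of a file? (1 point)
    - In general the encoding must be stated explicitly; it cannot be determined with certainty from the bytes alone. It can only be guessed heuristically, e.g. from a BOM at the start of the file or with a tool such as file -i.

7) You have three files containing identical Czech text. One of them is encoded using the ISO charset, one of them uses UTF-8, and one uses UTF-16. How can you tell which is which? (1 point)
    - Compare the file sizes: the ISO file is the smallest (1 byte per character), the UTF-8 file is somewhat longer (each accented letter takes 2 bytes), and the UTF-16 file is roughly twice the size of the ISO one (2 bytes per character).
    - When viewed as an 8-bit encoding, the UTF-8 text is also visibly longer than it should be, because each accented letter shows up as two characters.

8) How would you proceed if you are supposed to read a file encoded in ISO-8859-1, add a line number to each line and store it in UTF8? (a source code snippet in your favourite programming language is expected here) (2 points)
    - {Python}:
        with open(path + "utf8", 'w', encoding="utf-8") as target_file:
            with open(path, 'r', encoding='iso-8859-1') as source_file:
                # number lines from 1; each line already ends with its own newline
                for index, line in enumerate(source_file, start=1):
                    target_file.write(f"{index} - {line}")

9) Name three Unicode encodings (1 point).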
    - UTF-8, UTF-16, UTF-32

10) Explain the size difference between a file containing a text in Czech (or in your native language) stored in an 8-bit encoding and the same file stored in UTF-8. (1 point)
    - UTF-8 is a variable-length encoding and is backward compatible with ASCII, so all ASCII characters take 1 byte, but everything outside ASCII takes 2 or more bytes (2 bytes for Czech accented letters). In an 8-bit encoding every character takes exactly 1 byte, so the UTF-8 file is larger by one byte for each accented letter.

11) How do you convert a file from one encoding to another, for instance from a non-UTF-8 encoding to UTF-8? (1 point)
    - Using iconv in bash with the parameters -f (from) and -t (to), e.g. iconv -f ISO-8859-2 -t UTF-8 input.txt > output.txt

12) Write a Python script that reads a text content from STDIN encoded in ISO-8859-2 and prints it to STDOUT in utf8. (2 points)
    - {Python}:
        import sys

        input_text = sys.stdin.buffer.read()        # raw bytes from STDIN
        decoded = input_text.decode('iso-8859-2')
        utf8_text = decoded.encode('utf8')
        sys.stdout.buffer.write(utf8_text)          # raw bytes to STDOUT

13) Explain what BOM is (in the context of file encodings). (1 point)
    - It is a Byte Order Mark: the character U+FEFF placed at the very beginning of a file. It indicates the endianness of the representation and is used primarily in UTF-16 and UTF-32, where each code unit spans multiple bytes. It is not needed in UTF-8, which is a sequence of single bytes and has no byte-order ambiguity; there it can only serve as an optional signature of the encoding.

14) What must be done if you have a CP1250-encoded HTML web page and you want to turn it into a UTF-8-encoded page? (1 point)
    - Save the HTML page, convert its encoding with iconv and then change the meta tag in the HTML head to charset="utf-8".

15) How are line ends encoded in plain text files? (1 point)
    - There are several line-ending conventions: \n (LF, line feed), \r\n (CR LF) and \r (CR, carriage return); \n and \r each have their own ASCII code (10 and 13). Which one is used depends on the system: Windows uses \r\n, Unix uses \n, and classic Mac OS used \r.

16) What would be the minimum and maximum expected size (in bytes) of a textual file that contains a 5-letter Czech word. Explain all reasons of this file size variability. (2 points)
    - The minimum size is 5 bytes, when all letters are in ASCII (or the file uses an 8-bit encoding) and each character therefore takes one byte.
    - The maximum size is 5 * 4 = 20 bytes in UTF-32. In UTF-8 the size is between 5 and 10 bytes, because each Czech accented letter takes 2 bytes and each ASCII letter 1 byte; in UTF-16 it is 10 bytes.
    - On top of the encoding, a BOM (in UTF-16 and UTF-32) and a line ending at the end of the file (1 or 2 characters) add further bytes.

17) How could you explain the situation in which you have a UTF8-encoded plain text file that contains two words which look exactly the same, but they don't fit string equality (and have different byte representations when being viewed using hexdump too)? (1 point)
    - This can happen, for example, with "café": the 'é' can be stored either as the single code point 'é' (U+00E9) or as 'e' followed by the combining acute accent (U+0301). Both forms render identically but are different byte sequences.

18) How can you distinguish a file containing the Latin letter "A" from a file containing the Cyrillic letter "A" or the Greek letter "A"? (1 point)
    - Each of the three letters has a different Unicode code point (Latin U+0041, Cyrillic U+0410, Greek U+0391), so in UTF-8 they have different byte representations. After locating the letter, e.g. with hexdump, we can tell the script by its code point.

19) Align screenshot pictures A-F with file encoding vs. view encoding situations I-IV. (2 points)
    - ...

III. Text-processing in Bash:
-----------------------------

1) Using the Bash command line, get all lines from a file that contain one or two digits, followed by a dot or a space. (1 point)
    - grep -E '[0-9]{1,2}[. ]' path.txt
    - -E turns on extended regular expressions; {1,2} means one or two repetitions; inside brackets the dot is literal and needs no backslash

2) Using the Bash command line, remove all punctuation from a given file. (1 point)
    - sed 's/[[:punct:]]//g' path.txt
    - the trailing g stands for "globally"; without it only the first occurrence on each line is replaced

3) Using the Bash command line, split text from a given file into words, so that there is one word on each line. (1 point)
    - sed 's/[[:space:]]\+/\n/g' path.txt
    - [[:space:]]\+ stands for one or more occurrences of whitespace
    - /\n/ replaces them with a newline

4) Using the Bash command line, download a webpage from a given URL and print the frequency list of opening HTML tags contained in the page.
   (2 points)
    - curl -s https://www.mff.cuni.cz/ | grep -oE '<[a-zA-Z][^>]*>' | sed 's/<\([^[:space:]/>]*\).*>/\1/' | sort | uniq -c | sort -nr
    - grep -oE '<[a-zA-Z][^>]*>' finds all opening tags (it skips closing tags, comments and the doctype)
    - sed 's/<\([^[:space:]/>]*\).*>/\1/' reduces each tag to its name
    - sort | uniq -c | sort -nr counts identical names and sorts them by frequency, highest first

5) Using the Bash command line, print out the first 5 lines of each file (in the current directory) whose name starts with "abc". (2 points)
    - for file in abc*; do head -n 5 "$file"; done

6) Using the Bash command line, find the most frequent word in a text file. (2 points)
    - sed 's/[[:space:]]\+/\n/g' path.txt | grep . | sort | uniq -c | sort -nr | head -n 1
    - the + must be written as \+ (or sed -E must be used), otherwise sed looks for a literal plus sign; grep . drops the empty lines

7) Assume you have some linguistically analyzed text in a tab-separated file (TSV). You are just interested in the word form, which is in the second column, and the part-of-speech tag, which is in the fourth column. How do you extract only this information from the file using the Bash command line? (2 points)
    - cut -f 2,4 your_file.tsv
    - TAB is the default delimiter of cut; -d '\t' would fail, because '\t' in single quotes is two characters

8) Create a Makefile with three targets. The "download" target downloads the webpage nic.nikde.eu into a file, the "show" target prints out the file, and the "clean" target deletes the file. (2 points)
    - {makefile}:
        URL = nic.nikde.eu
        SOURCE_FILE = temp.txt

        .PHONY: download show clean

        # in a real Makefile, recipe lines must start with a TAB character
        $(SOURCE_FILE):
            @wget $(URL) -O $(SOURCE_FILE) > /dev/null 2>&1

        download: $(SOURCE_FILE)

        show: $(SOURCE_FILE)
            @cat $(SOURCE_FILE)

        # no prerequisite here, so the file is not downloaded just to be deleted
        clean:
            @rm -f $(SOURCE_FILE)

9) Create a Makefile with two targets. When the first target is called, a web page is downloaded from a given URL. When the second target is called, the number of HTML paragraphs (<p> elements) contained in the file is printed. (2 points)
    - {makefile}:
        SOURCE_FILE = temp.txt
        URL = https://www.mff.cuni.cz

        .PHONY: download num_of_p

        # in a real Makefile, recipe lines must start with a TAB character
        $(SOURCE_FILE):
            @wget $(URL) -O $(SOURCE_FILE)

        download: $(SOURCE_FILE)

        # matches <p> and <p ...>, but not other tags starting with p such as <pre>
        num_of_p: $(SOURCE_FILE)
            @grep -oE '<p([[:space:]][^>]*)?>' $(SOURCE_FILE) | wc -l

10) Suppose there is a plain-text file containing an English text. Write a Bash pipeline of commands which prints the frequency list of 50 most frequent tokens contained in the text. (Simplification: it is sufficient to use only whitespace characters as token separators) (2 points).
    - sed 's/[[:space:]]\+/\n/g' path.txt | grep . | sort | uniq -c | sort -nr | head -n 50
    - grep . drops the empty lines, which would otherwise be counted as a token

11) Assume you have some linguistic data in a text file. However, some lines are comments (these lines start with a "#" sign) and some lines are empty, and you are not interested in those. How do you get only the non-empty non-comment lines using the Bash command line? (2 points)
    - grep -vE '^\s*(#|$)' path.txt > cleaned.txt
    - -v inverts the match, so only the lines that do not match are printed
    - ^ matches the beginning of the line
    - \s matches a whitespace character
    - (#|$) matches either # or the end of the line, i.e. an empty line
    - [#$], on the other hand, would match a literal hash or dollar sign (FYI)

12) Assume you have some linguistically analyzed text in a comma-separated file (CSV). The first column is the token index: for regular tokens, this is simply a natural number (e.g. 1 or 128), for multiword tokens this is a number range (e.g. 5-8), and for empty tokens it is a decimal number (e.g. 6.1). How do you get only the lines that contain a regular token? (2 points)
    - grep -E '^[0-9]+,' path.txt
    - the comma right after the digits excludes the ranges (5-8) and the decimal numbers (6.1)

13) Explain the following bash code: 'grep .
   table.txt | rev | cut -f2,3 | rev'
    - It takes all non-empty lines from table.txt, reverses the characters of each line, takes the second and third tab-separated field (column) and reverses the result again. The outcome is the third-to-last and second-to-last column of each line of the original file.

14) Create a bash script that reads an English text from STDIN and prints only interrogative sentences extracted from the text to STDOUT, one sentence per line (simplification: let's suppose that sentences can be ended only by fullstops and questionmarks). (2 points)
    - tr '\n' ' ' | sed 's/\([?.]\)/\1\n/g' | sed 's/^[[:space:]]*//' | grep '?$'
    - tr '\n' ' ' joins the input lines, so that a sentence may span several lines
    - sed 's/\([?.]\)/\1\n/g' adds a newline after each . or ?
    - sed 's/^[[:space:]]*//' removes the whitespace before the sentence
    - grep '?$' keeps only the lines ending with ?

15) Write a bash script that returns a word-bigram frequency "table" (in the tab-separated format) for its input (2 points).
    - sed 's/[[:space:][:punct:]]/\n/g' text.txt | grep -E '[^[:space:]]' > temp_bigrams.txt ; tail -n +2 temp_bigrams.txt > temp_shifted_bigrams.txt ; paste temp_bigrams.txt temp_shifted_bigrams.txt | sort | uniq -c | sort -nr | sed 's/^ *\([0-9]\+\) /\1\t/'
    - sed 's/[[:space:][:punct:]]/\n/g' splits the text at whitespace and punctuation, so that each word ends up on its own line
    - grep -E '[^[:space:]]' > temp_bigrams.txt drops the empty lines and stores the word list
    - tail -n +2 temp_bigrams.txt > temp_shifted_bigrams.txt shifts all words up by one line
    - paste temp_bigrams.txt temp_shifted_bigrams.txt | sort | uniq -c | sort -nr joins the two files side by side with \t, sorts the pairs, collapses identical lines into one with its count and sorts by the count, highest first
    - sed 's/^ *\([0-9]\+\) /\1\t/' puts a tab between the count and the bigram, so that the whole table is tab-separated

16) Write a Bash script that returns a letter-bigram frequency "table" (in the tab-separated format) for its input (2 points).
    - sed 's/[[:space:]]*\([[:alnum:][:punct:]]\)/\1\n/g' text.txt | grep -E '[^[:space:]]' > temp_bigrams.txt ; tail -n +2 temp_bigrams.txt > temp_shifted_bigrams.txt ; paste temp_bigrams.txt temp_shifted_bigrams.txt | sort | uniq -c | sort -nr | sed 's/^ *\([0-9]\+\) /\1\t/'
    - sed 's/[[:space:]]*\([[:alnum:][:punct:]]\)/\1\n/g' takes any letter, digit or punctuation mark, possibly preceded by whitespace, keeps only the character without the whitespace and appends a newline to it
    - sed 's/^ *\([0-9]\+\) /\1\t/' puts a tab between the count and the bigram, so that the whole table is tab-separated
    - everything else works the same way as in 15)

IV. Git:
--------

1) Name 4 Git commands and briefly explain what each of them does (a few words or a short sentence for each command) (1 point).
    - add: stages the changes in a given file or a whole directory for the upcoming commit
    - commit -m: records the staged changes as a new commit on the current branch; until push is executed, the commit exists only locally
    - push: uploads the local commits to the remote upstream of the branch, where they become available for others to pull
    - pull: fetches the new commits from the remote branch and merges them into the local branch, synchronizing it with the remote
    - If the remote branch contains commits you do not have yet, you must pull before you can push; while pulling you resolve any conflicts that Git was unable to merge automatically.

2) Assume you already are in a local clone of a remote Git repository. Create a new file called "a.txt" with the text "This is a file.", and do everything that is necessary so that the file gets into the remote repository (2 points).
    - echo "This is a file." > a.txt ; git add a.txt ; git commit -m "Added new file" ; git push (a pull might be required first if remote changes were made since the previous pull / clone).

3) Name two advantages of versioning your source codes (with Git) versus not versioning it (e.g. just having it in a directory on your laptop) (1 point).
    a) Collaboration: If multiple users work on the same code or project, their changes are much easier to maintain, and merging two concurrent changes is handled well without losing any data.
    b) Branching: If multiple people work on different parts of the project, each of them can create their own branch, work separately without being disrupted by the work of others and resolve conflicts in the code at the end, when their work is done.

4) You and your colleague are working together on a project versioned with Git. Line 27 of script.py is empty. You change that line to initialize a variable ("a = 10"), while your colleague changes it to modify another variable ("b += 20"). He is faster than you, so he commits and pushes first. What happens now? Can you push? Can you commit? What do you need to do now? (2 points)
    - You can (and actually must) commit your changes, but the subsequent push will be rejected, because the remote branch contains a commit you do not have yet; a pull is necessary before you can push. This pull will unfortunately lead to a merge conflict (on line 27), which you have to resolve. You must decide which line to keep and which to throw away (if any). Very likely you will keep "b += 20" on line 27 and add a new line with "a = 10". Then you commit the merge and finally push.

5) What's probably wrong with the following sequence of commands? What did the author probably want to do? How would you correct it? "echo aaa > a; git add a; git push; git commit -m'creating a'"
    - The user wanted to add a new file to the git repository, but pushed before committing, so nothing was pushed to the remote; the commit was created only afterwards and stayed local.
    - Instead, he should have executed: "echo aaa > a; git add a; git commit -m'creating a'; git push"

6) What's probably wrong with the following sequence of commands? What did the author probably want to do? How would you correct it? "echo aaa > a; git commit -m'creating a'; git push"
    - The user wanted to add a new file to the git repository, but forgot to stage it with git add. If nothing else was staged, git commit had nothing to commit and created no commit (the file stayed untracked), so git push had nothing new to push.
    - Instead, he should have executed: "echo aaa > a; git add a; git commit -m'creating a'; git push"

7) What's probably wrong with the following sequence of commands? What did the author probably want to do? How would you correct it? "echo aaa > a; git add a; git push"
    - The user wanted to add a new file to the git repository, but forgot to make a commit, so there was nothing to push when calling git push.
    - Instead, he should have executed: "echo aaa > a; git add a; git commit -m'creating a'; git push"

V. Python basics:
-----------------

1) What should the first line of a Python script look like? (1 point)
    - #!/usr/bin/env python3

2) How do you install a Python module? (1 point)
    - pip3 install module_name

3) How do you use a Python module in your Python script? (1 point)
    - import module_name (optionally "as alias") or from module_name import module_part

4) What Python data types do you know? What do they represent? (1 point)
    - int, representing an integer, 3
    - float, representing a floating point number, 3.1
    - str, representing text of any length, "this is string \n with new line"
    - bool, representing a boolean value, True or False
    - list, representing a mutable sequence of values (they do not need to be of the same type), ["animal", 2, 3.2]
    - tuple, representing an immutable, "fixed-length" sequence of values (they do not need to be of the same type), (1, "two")
    - dict, representing a collection of key-value pairs where keys are hashed and found in O(1) time on average, { "CZE": "Kč", "SLO": "€" }
    - set, representing a collection of unique keys where keys are hashed and found in O(1) time on average
    - NoneType, representing None, i.e. a missing or empty value

5) In Python, given a string called text, how do you get the following: first character, last character, first 3 characters, last 4 characters, 3rd to 5th character?
   (2 points)
    - first character: text[0]
    - last character: text[-1]
    - first 3 characters: text[:3]
    - last 4 characters: text[-4:]
    - 3rd to 5th character: text[2:5]

6) Write a minimal Python script that prints "Hello NAME", where NAME is given to it as the first commandline argument; include the correct shebang line in the script. (2 points)
    - {python}:
        #!/usr/bin/env python3
        import sys

        if len(sys.argv) >= 2:
            name = sys.argv[1]
            print(f"Hello {name}")

7) In Python, define a function that takes a string, splits it into tokens, and prints out the first N tokens (10 by default). (2 points)
    - {python}:
        def custom_split(text: str, num_of_tokens: int = 10):
            split_text = text.split()
            # slicing is safe even when there are fewer than num_of_tokens tokens
            print(split_text[:num_of_tokens])

8) In Python, given a text split into a list of tokens, print out the 10 most frequent tokens. (1 point)
    - {python}:
        def most_frequent(tokens: list):
            freq = {}
            for token in tokens:
                if token not in freq:
                    freq[token] = 0
                freq[token] += 1
            # (token, count) pairs sorted by count, highest first
            most_common = sorted(freq.items(), key=lambda x: x[1], reverse=True)[:10]
            print(most_common)

9) In Python, given a text split into a list of tokens, print out all tokens that have a frequency higher than 5. (1 point)
    - {python}:
        def frequent_tokens(tokens: list):
            freq = {}
            for token in tokens:
                if token not in freq:
                    freq[token] = 0
                freq[token] += 1
            filtered = [token for token, count in freq.items() if count > 5]
            print(filtered)

10) In Python, given a text split into a list of tokens, print out all tokens that have a frequency above the median.
   (2 points)
    - {python}:
        import statistics

        def above_median(tokens: list):
            freq = {}
            for token in tokens:
                if token not in freq:
                    freq[token] = 0
                freq[token] += 1
            # median of the frequencies of all distinct tokens
            median = statistics.median(freq.values())
            filtered = [token for token, count in freq.items() if count > median]
            print(filtered)

11) In Python, implement an improved version of wc: write a script that reads in the contents of a file, and prints out the number of characters, whitespace characters, words, lines and empty lines in the file. (2 points)
    - {python}:
        # the default encoding is an assumption; pass the actual encoding of the file
        def wc(path: str, encoding: str = "utf-8"):
            with open(path, 'r', encoding=encoding) as input_file:
                text = input_file.read()
            lines = text.splitlines()
            print(len(text))                                  # characters
            print(sum(1 for ch in text if ch.isspace()))      # whitespace characters
            print(len(text.split()))                          # words
            print(len(lines))                                 # lines
            print(sum(1 for line in lines if line == ""))     # empty lines

12) In Python, assume the variable genesis_text contains a text, with punctuation removed, i.e. there are just words separated by spaces. Print out the most frequent word. (2 points)
    - {python}:
        genesis_text = "..."
        genesis_split = genesis_text.split()
        freq = {}
        for word in genesis_split:
            if word not in freq:
                freq[word] = 0
            freq[word] += 1
        # (word, count) pair with the highest count
        most_frequent = sorted(freq.items(), key=lambda x: x[1], reverse=True)[0]
        print(most_frequent[0])

VI. Simple string processing in Python:
---------------------------------------