In this project we’ll practice parsing data from files in different formats.
Many larger projects require reading configuration data or other files, and it’s often convenient to write a small program to understand or process some data; being comfortable doing this quickly can be very helpful.
Timebox: 2 days
Objectives:
- Write Go code to parse data in non-standard formats.
- Write tests for Go code.
- Become comfortable leveraging libraries to parse standard formats in slightly non-standard ways.
Project
The examples sub-directory in this directory contains a number of data files, each of which contains the same data in a different format. Descriptions of each format can be found below. Each data file contains a data-set of player names, and their high scores, for some game.
We are going to write a program to analyse the data, and print out the names of the players with the highest and lowest scores.
Don’t forget to write tests for your program, too.
Data formats
JSON
json.txt contains an array of JSON objects, each with a “name” and “high_score” key.
Repeated JSON
repeated-json.txt contains lines of data stored in JSON format. Each line contains exactly one record, stored as an object. Lines starting with # are comments and should be ignored.
Comma Separated Value
data.csv is a standard Comma Separated Value (“CSV”) file. The format is well-documented online, and there are many libraries which support parsing it.
Custom Binary
There are two files in a custom binary serialisation format.
:memo: This section refers to a concept called “endianness”. You can learn more about endianness in this article.
:memo: This section refers to a character encoding called UTF-8. You can learn more about UTF-8 on wikipedia. For this exercise, it’s sufficient to know that UTF-8 is a way of encoding strings as bytes, and that if you read the bytes of a UTF-8 string in Go, you can use that value as a string without needing to change it.
:memo: This section refers to a “null terminating character”. When encoding a piece of data with variable length, we need to know how big it is. There are a few ways we can typically do that; the one we’re going to use is a terminator: you’ll know the string is over when you see a byte which is all zeros, and the byte before that one was the last byte of the string.
The format is as follows:
- First two bytes of the file indicate endianness of numbers. If the bytes are FE FF, numbers in the file are stored in big endian byte order. If the bytes are FF FE, numbers in the file are stored in little endian byte order.
- Each record contains exactly four bytes representing the score as a signed 32-bit integer (in the byte order described above), followed by the name of the player encoded in UTF-8 (which may not contain a null character), followed by a null terminating character.
The tool od (which you can learn more about here) can be useful for exploring binary data. For instance, we can run:
> od -t x1 projects/file-parsing/examples/custom-binary-le.bin
0000000 ff fe 0a 00 00 00 41 79 61 00 1e 00 00 00 50 72
0000020 69 73 68 61 00 ff ff ff ff 43 68 61 72 6c 69 65
0000040 00 19 00 00 00 4d 61 72 67 6f 74 00
0000054
This prints each byte of the file, one at a time, represented as hexadecimal digits.
We can see in this example that the first byte is ff and the second is fe - according to our file format specification, that tells us the numbers in this file are stored in little endian byte order.
We can see the next four bytes contain 0a then three 00s (the score 10, stored as a little endian 32-bit integer), then three non-null bytes (the name), then a null byte ending it.
Extra things to consider
Variable-length data encoding
In the description of the binary serialisation format, we mentioned that there were different ways of encoding variable-length data.
The example that we used for strings was to use a terminating character - we specify that there’s a character (or several characters in sequence) which isn’t allowed to appear in our data (e.g. a null byte, one where all the bits are 0), and that we’ll add that to the end of the data so you know it’s the end.
We also used another technique, for storing the scores - we specified that the score is stored in a fixed number of bytes (specifically 4). 4 bytes is probably more space than we actually need to store our game’s scores (4 bytes can store 4294967296 different values - really big scores!), and in fact most of our scores fit into 1 byte (256 different values), but specifying exactly 4 bytes is a simple rule, and gives us flexibility in case scores increase in the future.
Yet another technique is that we can write down the length of the variable length data in a fixed amount of memory before the data; i.e. to say “These 4 bytes say that the string after them will be 100 bytes long”.
Each of these three approaches has different trade-offs - benefits they bring, and drawbacks they add.
Consider the trade-offs of each approach. What makes it a good approach? What makes it a bad approach? What kind of data and use-cases is each well-suited for?
Some things to consider:
- Are any more or less efficient in terms of how much space they use? Do any waste space?
- What limits do they apply to the kind of data we can actually store?
- Do any of the approaches make it easier/harder or faster/slower to parse data stored in that format?
Avoiding writing code
While it’s useful to be comfortable putting together ad-hoc programs to parse some data (and you should practice this!), one of the advantages of using existing formats of data is that there are often tools which can help us to do some parsing or analysis without even needing to write a program at all.
Two examples of this are jq (which allows you to parse JSON using a custom query language), and fx (which allows you to write JavaScript snippets to manipulate JSON).
For example, you can use jq to answer the question “Who had the highest score?” without needing to write a whole program:
> jq -r '. | max_by(.high_score).name' file-parsing/examples/json.txt
Prisha
Or use fx to do the same, but using more familiar JavaScript as the query language:
> fx file-parsing/examples/json.txt '.sort((l, r) => r.high_score - l.high_score)[0].name'
Prisha
Similarly, a program called csvq can be used to query CSV files using a SQL-like query language:
> cat examples/data.csv | csvq 'SELECT * ORDER BY `high score` DESC LIMIT 1'
+--------+------------+
| name | high score |
+--------+------------+
| Prisha | 30 |
+--------+------------+
Spend some time experimenting with these tools:
- Write some interesting queries over the data.
- Try to work out the limits of these pre-existing tools, and when you’re more likely to want to write a custom program yourself.