Yet another article on YAML

How to write configuration files

By Martin Helm in Data Science ML Tools

October 28, 2021

If you have ever written a configuration file, it was probably in the YAML format. It is very commonly used for this task nowadays, but you can also use it to store actual data. The abbreviation itself reflects this: originally, YAML stood for “Yet Another Markup Language”, but it now stands for “YAML Ain’t Markup Language”. All the more reason to look at what we can do with it!

Usually, I like to start with an example showing all the possibilities of the format. But because YAML can do so much, this would be very long, so I put it at the end this time. I will first go over some general syntax and introduce the different structures. You can then find all the details in the example: basically a YAML file describing itself!

General syntax

Each document starts with ---. That way, you can store multiple documents in one file by starting each of them with ---. If you only have one document in a file, you do not strictly need the --- at the start; the file is then an implicit document. It is good practice to include it anyway, and you need it in case you have multiple documents in one file. If you want to be very explicit, you can also end each document with the end-of-document marker ..., but usually you don't have to.
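As a small sketch of these markers, the following file holds two documents. The keys and values are made up for illustration only.

```yaml
# First document
---
name: development
debug: true
# A second --- starts the next document in the same file
---
name: production
debug: false
# Optional end-of-document marker
...
```

A parser that supports multiple documents would typically return these as two separate objects rather than one merged map.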

Commenting can be done easily using #. A comment can either span an entire line or follow a key-value pair on the same line.
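A short sketch of both comment positions, with invented keys. It also shows one detail worth knowing: a # only starts a comment when it is preceded by white space, otherwise it is part of the value.

```yaml
---
# A comment on a line of its own
port: 8080  # A comment after a key-value pair
url: http://example.com/#section  # The first # has no space in front of it, so it belongs to the value
```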

Structures

YAML supports three basic structures, or nodes, as the specification calls them: scalars, sequences and maps. In a typical configuration file, every entry is a key-value pair separated by a colon. Keys can contain white spaces, but these can be difficult to deal with in the programming language that reads the file, so one usually avoids them. At least one space is required after the colon; any additional white space around the colon is ignored, so it should not be used to convey information.
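To illustrate how keys and the colon behave, here is a minimal sketch with made-up keys. Note that the colon must be followed by at least one space: written as key:value, the text would be read as one single string instead of a key-value pair.

```yaml
---
key: value          # the key "key" maps to the string "value"
key with spaces: 1  # valid, but awkward to access from most programming languages
padded   :   value  # the extra spaces around the colon belong to neither key nor value
```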

Scalars

Scalars are the simplest structure: a single value without any further nesting. There is a large number of different scalar types covering basically everything one would expect: numerics, strings, logicals and null values. Have a look at the example YAML file further down, which describes them all!

Sequences

Sequences are ordered collections of items. As with JSON, you can write them inline using [], or you can spread them over multiple lines using indentation and -. Keep in mind that the values do not need to be of the same type:

---
key: [value1, 2, .NaN]
another key: 
  - value1
  - 2
  - .NaN

This brings us to indentation, which can be any number of white spaces, but no tabs. Items on the same level must share the same indentation. There is no clear convention for the number of white spaces; typically 2 or 4 are used.
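As a quick sketch with invented keys, both of the following blocks are valid. The depth may differ from block to block, as long as the items within one block line up.

```yaml
---
two spaces:
  - a
  - b
four spaces:
    - a
    - b
```

Mixing depths like this works, but sticking to one width per file is easier to read.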

Sequences can also be nested:

---
World:
  - Europe:
    - UK
    - Germany
    - France
  - North America:
    - USA
    - Canada

Maps

Finally, maps are a collection of key-value pairs. You can construct them inline using {} or again spread over multiple lines with indentation. They behave basically the same way as sequences, just that every value has an additional key. And of course they can be nested with other maps or sequences:

---
Company: {name: EvilOrg, Employees: 254}
Team:
  Name: Superteam
  Members:
    - Mark:
        age: 23
        haircolor: blonde
    - Jack:
        age: 25
        haircolor: brown

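Every value inside a map needs a key, but the document as a whole does not have to be a map. As in JSON, the root can also be a sequence. In practice, most configuration files still use a map at the top level, because the values can then be looked up by name. Just keep in mind that the following is a valid YAML document even though it lacks a key, and that a tool expecting a map may reject it: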

---
- something
- in a
- sequence
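A minimal sketch of the same sequence placed under a key, which is the shape most configuration loaders expect. The key name items is made up for this example.

```yaml
---
items:
  - something
  - in a
  - sequence
```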

Full example

# Comments start with a hash sign and can be placed before the actual document
---
# --- denotes the start of a document
key: value
strings: strings can be outside quotes
strings2: "strings can also be in quotes"
strings3: 'or in single quotes. See below how they differ'
integers: 3
decimal: +3
octal: 0o12 # Octals start with 0o (YAML 1.1 used a plain leading zero, e.g. 012)
hex: 0xC # Start with 0x
boolean: true # Previous YAML versions also read on/off as booleans. The current version (1.2) does not, although some parsers still do.
float: 3.14
exponential: 0.0314e+2
infinity: .inf # Capitalization is flexible: .inf, .Inf and .INF are all valid.
negative infinity: -.inf
keys can have whitespaces: true
not available: .NaN
not defined: null # ~ or an empty value work as well
indentation: "matters. Use any number of white spaces but no tabs!"
sequences:
  - can contain different types
  - 1
  - true
inline sequences: [Sequences, defined, inline]
maps:
  types: maps can contain different types of data 
  a number: 5
  nesting:
    possible: true
    also with other types: 
      - another
      - sequence
inline maps: {key: value, one: two}
complex_strings: "a deer\n" # \n will be converted to newline
complex_strings1: 'a deer\n' # \n will be interpreted as part of the string!
complex_strings2 : a deer\n # \n will be interpreted as part of the string!
modifiers: "> and | modify how a multiline string gets interpreted"
folded style: >
  a multiline
  string interpreted
  as one line
literal style: |
  A multiline string
  interpreted as 
  a multiline string
chomp modifiers: 
  general: "+ and - modify how trailing blank lines and the final linefeed are preserved. Use with | or >"
  "+": Multiline strings preserve trailing blank lines and the final linefeed
  "-": Multiline strings are stripped of trailing blank lines and the final linefeed

# ... denotes end of document
...
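Since the multiline modifiers are the part that is easiest to get wrong, here is a small sketch of the resulting strings. The key names are made up, and the comments describe what a parser that follows the YAML specification should return.

```yaml
---
clip: |    # default: exactly one final linefeed is kept, result "text\n"
  text

strip: |-  # the final linefeed and trailing blank lines are removed, result "text"
  text

keep: |+   # the final linefeed and the trailing blank line are kept, result "text\n\n"
  text

folded: >  # line breaks inside the text become spaces, result "one two\n"
  one
  two
```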

Comparison to JSON

Since version 1.2, YAML is designed as a superset of JSON. That means you can do everything you can do with JSON and much more, and every valid JSON document should also be a valid YAML document. Let’s review the main differences between YAML and JSON:

JSON | YAML
--- | ---
Comments are not allowed | Comments are denoted with #
Objects and arrays are denoted with curly braces and brackets respectively | Hierarchy is denoted using indentation with space characters. Tab characters are not allowed
Strings must be in double quotes | String quotes are optional, and both double and single quotes are supported
Root node is usually an array or an object (older JSON specifications required this) | Root node can be any of the valid data types

Differences between YAML and JSON
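To make the relationship concrete, the following sketch stores the same data twice in one file: once as a JSON object, which YAML reads as an inline map, and once in YAML's indented style. The values are borrowed from the maps example above; the lower-case key names are made up.

```yaml
# A JSON object, read by YAML as an inline map
---
{"name": "EvilOrg", "employees": 254, "teams": ["Superteam"]}
# The same data written with indentation
---
name: EvilOrg
employees: 254
teams:
  - Superteam
```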

Summary

As we can see from the example, YAML is a truly flexible format in which you can incorporate basically any kind of data or information you can think of. But this flexibility also comes with complexity, which is especially tricky for strings.

Since YAML is usually used for config files, where one does not have complicated string manipulation, this is usually not an issue though. Most data is still transferred via JSON.

In case you still run into one of the peculiarities of YAML, check out the resources below to help you!

Resources

Photo by Ferenc Almasi on Unsplash