Writing Marks

Here we'll walk through a practical example of writing a mark file.

We'll create a mark for CSV (comma separated values) files, a simple format for storing tabular data in a text file.

CSV files separate fields with commas and rows with line breaks. They look something like:

foo,bar,baz
blah,blah,blah
1,2,3

There is a little complexity surrounding special characters in fields and line endings, but otherwise the only other rule is that all rows must have the same number of fields. You can refer to RFC4180 on the IETF website for more details.

We'll represent such a structure in Hoon as a (list (list @t)) like:

[['foo' 'bar' 'baz' ~] ['blah' 'blah' 'blah' ~] ['1' '2' '3' ~] ~]

We could perhaps create the type with a $| rune to include row-length validation in the mold itself, but a (list (list @t)) is simpler for demonstrative purposes.

A simple mark

Let's begin with the simplest mark file:

|_ csv=(list (list @t))
++ grab
|%
++ noun (list (list @t))
--
++ grow
|%
++ noun csv
--
++ grad %noun
--

The door takes a (list (list @t)) as its sample, and we've given it a face of csv so we can easily reference it. Note its face could be anything, it needn't be the name of our mark. When we're doing something with data that has a %csv mark like converting it to another mark or creating a diff, this is where our data will reside.

Next we have the +grab arm of our door, which contains a core with arms for converting to our mark from other marks. We've given it one arm for the %noun mark - the most generic mark which will take any noun. Our +noun arm will simply clam whatever it's given with the (list (list @t)) mold.

Next is the +grow arm which does the inverse of +grab, converting from our mark to another mark. We've also given it a +noun arm, this time it will simply return the door's sample named csv, which is of course already a noun.

Note that the +noun arm is mandatory in +grab. Clay cannot build a mark core without it. Conversion arms for any other marks apart from %noun are optional.

Finally we have the +grad arm. This arm specifies functions for revision control like creating diffs, patching files and so on. In our case, rather than writing all those functions, we've just delegated those tasks to the %noun mark. We can do this because we've specified conversion routines to and from the %noun mark in our +grow and +grab arms. When we modify a file with a %csv mark, Clay will convert our data to a %noun mark, execute the necessary +grad functions from the %noun mark file, and then convert it back to a %csv mark again.

So now we have a valid %csv mark file. If we save this as csv.hoon in the /mar directory we could store %csv data in Clay. This may be sufficient for some applications, but what if we want to import a CSV file from Unix or elsewhere? In the next section, we'll look at conversions to and from a %mime mark to address this.

%mime conversions

The $mime type represents raw data from Unix or elsewhere. For example, if a text file from Unix containing the word foo were converted to a $mime type in Urbit, it would look something like:

[/text/plain q=[p=3 q=7.303.014]]

/text/plain is its MIME type and p.q is the byte-length of q.q, which is the data itself as an atom.

The %mime mark is used by Clay to store and convert $mime data. It's an important mark for moving files from Unix to Urbit and vice versa. When you add a file to a desk you have mounted to Unix and |commit the change, Clay will first receive the file as a %mime mark, then convert from a %mime to whatever mark matches the file extension. For example, foo.txt will be converted from %mime to %txt. Additionally, data fetched by Iris over HTTP will come in as a $mime-data:http, which is an unvalidated form of $mime that you may wish to convert to a %mime mark and then to another mark. Likewise with Eyre, some of the lower-level interfaces receive HTTP requests with $mime-data:http in them.

So with the nature of the %mime mark hopefully now clear, the reason we want conversion methods to and from %mime in our %csv mark is so we can import CSV files from Unix and vice versa.

Since a CSV file on Unix will just be a long string with ASCII or UTF-8 encoding, we can treat q.q in the $mime as a cord, and thus write a parser to convert it to a (list (list @t)). For this purpose, here's a library: csv-utils.hoon, which you can view in full on the Examples page.

The library contains four functions:

  • +de-csv - Parse a CSV cord to a (list (list @t)).
  • +en-csv - Encode a (list (list @t)) as a CSV cord.
  • +validate - Check all rows of (list (list @t)) are the same length.
  • +csv-join - Ignore this for now, we'll use it later on.

The decoding and encoding arms use parsing functions from the Hoon standard library. It's not important to be familiar with parsing in Hoon for our purposes here, but you can have a look at the Parsing Guide in the Hoon documentation if you're interested. The important thing to note is that +de-csv takes a valid CSV-format @t and returns a (list (list @t)), and +en-csv does the reverse - it takes a (list (list @t)) and returns a CSV-format @t.

Let's try the library in the dojo. After we've added it to /lib and run |commit, we can build the file:

> =csv-utils -build-file %/lib/csv-utils/hoon

...try decode a CSV-format @t:

> (de-csv:csv-utils 'foo,bar,baz\0ablah,blah,blah\0a1,2,3')
~[<|foo bar baz|> <|blah blah blah|> <|1 2 3|>]

...and try encode a (list (list @t)) as a CSV-format @t:

> (en-csv:csv-utils [['foo' 'bar' 'baz' ~] ['blah' 'blah' 'blah' ~] ['1' '2' '3' ~] ~])
'foo,bar,baz\0ablah,blah,blah\0a1,2,3\0a'

With that working, we can add an import for our library to our %csv mark defintion and add a +mime arm to both our +grab and +grow arms:

/+ *csv-utils
|_ csv=(list (list @t))
++ grab
|%
++ mime |=((pair mite octs) (de-csv q.q))
++ noun
|= n=*
^- (list (list @t))
=/ result ((list (list @t)) n)
?> (validate result)
result
--
++ grow
|%
++ mime
?> (validate csv)
[/text/csv (as-octs:mimes:html (en-csv csv))]
++ noun
?> (validate csv)
csv
--
++ grad %noun
--

In +grab we've added a +mime arm to convert from a %mime mark to our %csv mark. It's a simple gate that takes a $mime (specified as (pair mite octs) to avoid conflict with the arm name), runs the data through the +de-csv function and returns a (list (list @t)) of the CSV data.

We've also added a +mime arm to +grow for converting from our %csv mark to a %mime mark. We encode our (list (list @t)) csv sample with our +en-csv function and then run that through as-octs:mimes:html to get a $octs (so it has the byte-length). We also add the /text/csv MIME type so it's a valid $mime.

Additionally, we've used the +validate function in a few places to make sure our CSV data has consistent row lengths.

If we save the above mark file as csv.hoon in /mar and |commit %base, we should now be able to import CSV files into Urbit. Let's give it a go. In the root of our %base desk, let's add a file named foo.csv with the following contents:

foo,bar,baz
blah,blah,blah
1,2,3

If we now |commit %base, we should see it's been successfully added:

> |commit %base
>=
+ /~zod/base/4/foo/csv

And if we try reading the file with the -read thread:

> -read [%x our %base da+now /foo/csv]
~[<|foo bar baz|> <|blah blah blah|> <|1 2 3|>]

We can see our %csv mark has successfully converted our foo.csv file to a (list (list @t)) when it was imported.

Let's try the other direction now. We can create a new bar.csv files in the root of %base from the dojo like so:

> */bar/csv ~[['abc' 'def' ~] ['ghi' 'jkl' ~]]
+ /~zod/base/5/bar/csv

And if we check it in the terminal on the Unix side we can see it's been correctly encoded:

> cat zod/base/bar.csv
abc,def
ghi,jkl

So now our %csv mark lets us move data in and out of Urbit. In the next section, we'll look at the +grad arm in more detail.

+grad

So far we've just delegated +grad functions to the %noun mark, but now we'll look at writing our own.

For demonstrative purposes, we can just poach the algorithms used in the +grad arm of the %txt mark and modify them to take our (list (list @t)) type instead of a wain. It's not the most efficient algorithm for a CSV file but it'll do the job.

Our diff format will be a (urge:clay (list @t)), and we'll use some differ functions from zuse.hoon like +loss, +lusk and +lurk to produce diffs and apply patches.

The csv-utils.hoon library we imported also contains a +csv-join function which we'll use in the +join arm, just to save space here.

Here's the new %csv mark defintion:

/+ *csv-utils
|_ csv=(list (list @t))
++ grab
|%
++ mime |=((pair mite octs) (de-csv q.q))
++ noun
|= n=*
^- (list (list @t))
=/ result ((list (list @t)) n)
?> (validate result)
result
--
++ grow
|%
++ mime
?> (validate csv)
[/text/csv (as-octs:mimes:html (en-csv csv))]
++ noun
?> (validate csv)
csv
--
++ grad
|%
++ form %csv-diff
++ diff
|= bob=(list (list @t))
^- (urge:clay (list @t))
?> (validate csv)
?> (validate bob)
(lusk:differ csv bob (loss:differ csv bob))
++ pact
|= dif=(urge:clay (list @t))
^- (list (list @t))
=/ result (lurk:differ csv dif)
?> (validate result)
result
++ join
|= $: ali=(urge:clay (list @t))
bob=(urge:clay (list @t))
==
^- (unit (urge:clay (list @t)))
(csv-join ali bob)
++ mash
|= $: [ship desk (urge:clay (list @t))]
[ship desk (urge:clay (list @t))]
==
^- (urge:clay (list @t))
~|(%csv-mash !!)
--
--

In our modified +grad arm, we've replaced the %noun delegation with a core containing five arms: +form, +diff, +pact, +join, and +mash. These arms are all required for a valid +grad if it's not delegated to another mark. We'll now look at each in detail.

+form

++ form %csv-diff

+form simply specifies the mark of the diff file that may be produced by other +grad functions. If your diff is the same type as your mark, it could just specify itself like %csv. In our case our diff is a (urge:clay (list @t)) rather than a (list (list @t)), so we need a separate mark file for the diff itself.

Here's another mark file which can be saved as csv-diff.hoon in /mar:

|_ dif=(urge:clay (list @t))
++ grab
|%
++ noun (urge:clay (list @t))
--
++ grow
|%
++ noun dif
--
++ grad %noun
--

It's very bare-bones, we just need it for our %csv mark to work. In our %csv mark, we've specified it as %csv-diff in +form.

+diff

++ diff
|= bob=(list (list @t))
^- (urge:clay (list @t))
?> (validate csv)
?> (validate bob)
(lusk:differ csv bob (loss:differ csv bob))

This arm produces the diff of two %csv files. The first %csv file will be given as the sample of the parent door, which if you'll recall we gave a face of csv. The second %csv file will be given as the sample of the gate in +diff, which we've named bob here. We then just produce the diff of these two files and return it as the type of the mark specified in +form, which in our case is (urge:clay (list @t)) for a %csv-diff. Clay will use +diff when a file is revised, so it doesn't have to store a whole new copy of the file each time it's modified.

+pact

++ pact
|= dif=(urge:clay (list @t))
^- (list (list @t))
=/ result (lurk:differ csv dif)
?> (validate result)
result

+pact patches a %csv file with the given diff. Its gate takes a diff and applies it to the %csv given as the sample of the parent door (which we gave a face of csv). If the patch succeeds, it will return a new %csv file - a valid (list (list @t)). When we read a file that's been modified in Clay, Clay will apply all the diffs it has with +pact and return the resulting file.

+join

++ join
|= $: ali=(urge:clay (list @t))
bob=(urge:clay (list @t))
==
^- (unit (urge:clay (list @t)))
(csv-join ali bob)

The +join arm merges two different diffs. It takes them both as the sample of its gate (which we've named ali and bob), and returns a new diff wrapped in a unit like (unit (urge:clay (list @t))). The unit will be ~ if the merge failed due to a conflict. This is used by Clay in some cases when desks are merged. If diff merges are not possible for your use case, you could just have it always return ~.

+mash

++ mash
|= $: [ship desk (urge:clay (list @t))]
[ship desk (urge:clay (list @t))]
==
^- (urge:clay (list @t))
~|(%csv-mash !!)

This is like +join except it forces a diff merge even if there's a conflict. Rather than returning a unit, it just returns the diff - a (urge:clay (list @t)) in our case. Also unlike +join, it takes the ship and desk each diff came from as well as the diff itself.

The +mash arm is not used by Clay in its file revision operations, so it's safe to just make it a dummy arm that crashes as we've done here. If you were to use it, it would likely just be used manually in an agent, thread or generator.

An example of its use would be the %txt mark, which includes a proper +mash function that produces a diff with any conflicts annotated, though how you have +mash handle conficts would depend on your use case. If there were no conflicts between the two diffs, it should produce the same diff as the +join arm.

Conclusion

So there you have it, a fully functional mark for CSV files. A mark file can be as complex or as simple as you'd like, they're very flexible depending on your use case. Additional conversion methods can always be added as they're needed. For example, with just a few lines of code we could add arms for converting CSV files to %json or %txt and vice versa.

In the next document, we'll look at building and using mark cores and mark conversion gates in our own code.