This module discusses how text is represented in Hoon, covers tools for producing and manipulating text, and introduces the %say generator, a new generator type. We don't deal with formatted text (tanks) or parsers here; both are covered in a later module.
Text in Hoon
We've incidentally used 'messages written as cords' and "as tapes", but aside from taking a brief look at how lists (and thus tapes) work with tree addressing, we haven't discussed why these differ or how text works more broadly.
There are four basic ways to represent text in Urbit:
- @t, a cord, which is an atom (single value)
- @ta, a knot used for URL-safe path elements, which is an atom (single value)
- @tas, a term used primarily for constants, which is an atom (single value)
- tape, which is a (list @t)
This is more ways than many languages support: most languages simply store text directly as a character array, or list of characters in memory. Colloquially, we would only call cords and tapes strings, however.
What are the applications of each?
@t cord
What is a written character? Essentially it is a representation of human semantic content (not sound strictly). (Note that we don't refer to alphabets, which prescribe a particular relationship of sound to symbol: there are ideographic and logographic scripts, syllabaries, and other representations. Thus, characters not letters.) Characters can be combined—particularly in ideographic languages like Mandarin Chinese.
One way to handle text is to assign a code value to each letter, then represent these as subsequent values in memory. (Think, for instance, of Morse code.) On all modern computers, the numeric values used for each letter are given by the ASCII standard, which defines 128 unique characters (2⁷ = 128).
    65  83  67  73  73
    A   S   C   I   I
A cord simply shunts these values together in one-byte-wide slots and represents them as an integer.
    > 'this is a cord'
    'this is a cord'
    > `@`'this is a cord'
    2.037.307.443.564.446.887.986.503.990.470.772
It's very helpful to use the @ux aura if you are trying to see the internal structure of a cord. Since the ASCII values align at the 8-bit wide characters, you can see each character delineated by a hexadecimal pair.
    > `@ux`'HELLO'
    0x4f.4c4c.4548
    > `@ub`'HELLO'
    0b100.1111.0100.1100.0100.1100.0100.0101.0100.1000
You can think of this in a couple of different ways. One way is to simply think of the characters as chained together, with the first letter in the rightmost position. Another is to think of them as values multiplied by a “place value”:
Letter | ASCII | Place | “Place Value” |
---|---|---|---|
H | 0x48 | 0 | 2⁰ = 1 → 0x48 |
E | 0x45 | 1 | 2⁸ = 256 = 0x100 → 0x4500 |
L | 0x4c | 2 | 2¹⁶ = 65.536 = 0x1.0000 → 0x4c.0000 |
L | 0x4c | 3 | 2²⁴ = 16.777.216 = 0x100.0000 → 0x4c00.0000 |
O | 0x4f | 4 | 2³² = 4.294.967.296 = 0x1.0000.0000 → 0x4f.0000.0000 |
This way, each value slots in after the preceding value.
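One way to check the place-value picture in the Dojo is with the standard-library arms ++rip and ++rap, which split an atom into blocks and reassemble it; a block size of 3 means 2³ = 8 bits. These expressions are a sketch for experimentation and are not part of the original lesson:

```hoon
::  0x48 ('H') in place 0 plus 0x45 ('E') in place 1
> `@ux`(add 0x48 (mul 0x45 0x100))
0x4548
> `@t`0x4548
'HE'
::  split a cord into bytes, lowest place first, then reassemble it
> (rip 3 'HELLO')
~[72 69 76 76 79]
> `@t`(rap 3 ~[72 69 76 76 79])
'HELLO'
```

The first byte of the list is the first letter of the text, which matches the table: the earliest character occupies the lowest place.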
Special characters (non-ASCII, beyond the standard keyboard, basically) are represented using a more complex numbering convention. Unicode defines a standard specification for code points or numbers assigned to characters, and a few specific bitwise encodings (such as the ubiquitous UTF-8). Urbit uses UTF-8 for @t values (thus both cord and tape).
(list @t) tape
There are some tools to work with atom cords of text, but most of the time it is more convenient to unpack the atom into a tape. A tape splits out the individual characters from a cord into a list of character values.
We've hinted a bit at the structure of lists before; for now the main thing you need to know is that they are cells which end in a ~ sig. So rather than have all of the text values stored sequentially in a single atom, they are stored sequentially in a rightwards-branching binary tree of cells.
A tape is a list of @tD atoms (i.e., characters). (The upper-case character at the end of the aura hints that the @t values are D→3 so 2³=8 bits wide.)
    > "this is a tape"
    "this is a tape"
    > `(list @)`"this is a tape"
    ~[116 104 105 115 32 105 115 32 97 32 116 97 112 101]
Since a tape is a (list @tD), all of the list tools we have seen before work on them.
@ta knot
If we restrict the character set to certain ASCII characters instead of UTF-8, we can use this restricted representation for system labels as well (such as URLs, file system paths, permissions). @ta knots and @tas terms both fill this role for Hoon.
    > `@ta`'hello'
    ~.hello
Every valid @ta is a valid @t, but @ta does not permit spaces or a number of other characters. (See ++sane, discussed below.)
@tas term
A further tweak of the ASCII-only concept, the @tas term permits only “text constants”, values that are first and foremost only themselves.
[@tas permits only] a restricted text atom for Hoon constants. The only characters permitted are lowercase ASCII letters, -, and 0-9, the latter two of which cannot be the first character. The syntax for @tas is the text itself, always preceded by %. The empty @tas has a special syntax, $.
terms are rarely used for message-like text, but they are used all the time for internal labels in code. They differ from regular text in a couple of key ways that can confuse you until you're used to them.
For instance, a @tas value is also a mold, and the value will only match its own mold, so they are commonly used with type unions to filter for acceptable values.
    > ^-  @tas  %5
    mint-nice
    -need.@tas
    -have.%5
    nest-fail
    dojo: hoon expression failed
    > ^-  ?(%5)  %5
    %5
    > (?(%5) %5)
    %5
For instance, imagine creating a function to ensure that only a certain classical element can pass through a gate. (This gate is superfluous given how molds work, but it shows off a point.)
    |=  input=@t
    =<
      (validate-element input)
    |%
    +$  element  ?(%earth %air %fire %water)
    ++  validate-element
      |=  incoming=@t
      %-  element  incoming
    --
(See how that =< tisgal works with the helper core?)
Text Operations
Text-based data commonly needs to be produced, manipulated, or analyzed (including parsing).
Producing Text
String interpolation puts the result of an expression directly into a tape:
    > "{<(add 5 6)>} is the answer."
    "11 is the answer."
The ++weld function can be used to glue two tapes together:
    > (weld "Hello" "Mars!")
    "HelloMars!"
    |=  [t1=tape t2=tape]
    ^-  tape
    (weld t1 t2)
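As a small sketch of how the two interpolation forms differ: `{...}` splices in a value that is already a tape, while `{<...>}` first renders any value as text. The faces `name` and `n` are hypothetical Dojo variables pinned only for this illustration:

```hoon
::  pin two illustrative values in the Dojo subject
> =name "Mars"
> =n 5
::  a tape can be interpolated directly
> "Hello {name}!"
"Hello Mars!"
::  a non-tape value needs the {<...>} form
> "{<n>} squared is {<(mul n n)>}."
"5 squared is 25."
::  ++weld gives the same result as interpolating a tape
> (weld "Hello " name)
"Hello Mars"
```

Interpolation tends to read better when several pieces are combined, while ++weld is convenient when both pieces already exist as tapes.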
Manipulating Text
If you have text but you need to change part of it or alter its form, you can use standard library list operators like ++flop as well as tape-specific arms.
Applicable list operations, some of which you've seen before, include:
- The ++flop function takes a list and returns it in reverse order:
      > (flop "Hello!")
      "!olleH"
      > (flop (flop "Hello!"))
      "Hello!"
- The ++sort function uses the quicksort algorithm to sort a list. It takes a list to sort and a gate that serves as a comparator. For example, if you want to sort the list ~[37 62 49 921 123] from least to greatest, you would pass that list along with the ++lth gate (for “less than”):
      > (sort ~[37 62 49 921 123] lth)
      ~[37 49 62 123 921]
  To sort the list from greatest to least, use the gth gate ("greater than") as the basis of comparison instead:
      > (sort ~[37 62 49 921 123] gth)
      ~[921 123 62 49 37]
  You can sort letters this way as well:
      > (sort ~['a' 'f' 'e' 'k' 'j'] lth)
      <|a e f j k|>
  The function passed to sort must produce a flag, i.e., ?.
- The ++weld function takes two lists of the same type and concatenates them:
      > (weld "Happy " "Birthday!")
      "Happy Birthday!"
  It does not inject a separator character like a space.
- The ++snag function takes an atom n and a list, and returns the nth item of the list, where 0 is the first item:
      > (snag 3 "Hello!")
      'l'
      > (snag 1 "Hello!")
      'e'
      > (snag 5 "Hello!")
      '!'
  Exercise: ++snag Yourself
  - Without using ++snag, write a gate that returns the nth item of a list. There is a solution at the bottom of the page.
- The ++oust function takes a pair of atoms [a=@ b=@] and a list, and returns the list with b items removed, starting at item a:
      > (oust [0 1] `(list @)`~[11 22 33 44])
      ~[22 33 44]
      > (oust [0 2] `(list @)`~[11 22 33 44])
      ~[33 44]
      > (oust [1 2] `(list @)`~[11 22 33 44])
      ~[11 44]
      > (oust [2 2] "Hello!")
      "Heo!"
- The ++lent function takes a list and returns the number of items in it:
      > (lent ~[11 22 33 44])
      4
      > (lent "Hello!")
      6
  Exercise: Count the Number of Characters in Text
  - There is a built-in ++lent function that counts the number of characters in a tape. Build your own tape-length character counting function without using ++lent.
    You may find the ?~ wutsig rune to be helpful. It tells you whether a value is ~ or not. (How would you do this with a regular ?: wutcol?)
The foregoing are list operations. The following, in contrast, are tape-specific operations:
- The ++crip function converts a tape to a cord (tape→cord).
      > (crip "Mars")
      'Mars'
- The ++trip function converts a cord to a tape (cord→tape).
      > (trip 'Earth')
      "Earth"
- The ++cass function converts upper-case text to lower-case (tape→tape).
      > (cass "Hello Mars")
      "hello mars"
- The ++cuss function converts lower-case text to upper-case (tape→tape).
      > (cuss "Hello Mars")
      "HELLO MARS"
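Because list tools work only on tapes, a common pattern is to unpack a cord with ++trip, transform the resulting tape, and repack it with ++crip. The following Dojo lines are an illustrative sketch of that round trip, using only arms introduced above:

```hoon
::  reverse a cord by way of a tape
> (crip (flop (trip 'Hello')))
'olleH'
::  upper-case a cord
> (crip (cuss (trip 'mars')))
'MARS'
::  count the characters in a cord
> (lent (trip 'Mars'))
4
```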
Analyzing Text
Given a string of text, what can you do with it?
- Search
- Tokenize
- Convert into data
Search
- The ++find function takes [nedl=(list) hstk=(list)] and locates a sublist (nedl, needle) in the list (hstk, haystack). (++find starts counting from zero.)
      > (find "brillig" "'Twas brillig and the slithy toves")
      [~ 6]
  ++find returns a unit, which right now means that we need to distinguish between nothing found (~ null) and zero [~ 0]. units are discussed in more detail in a later lesson.
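To make the difference between "found at position zero" and "not found" concrete, here is a short Dojo sketch; the search strings are chosen only for illustration:

```hoon
::  the needle starts at index 0, so the result is a unit wrapping 0
> (find "Twas" "Twas brillig")
[~ 0]
::  the needle does not occur at all, so the result is the bare null
> (find "mome" "Twas brillig")
~
```

Code that consumes the result of ++find therefore needs to test for ~ before using the index.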
Tokenize/Parse
To tokenize text is to break it into pieces according to some rule. For instance, to count words one needs to break at some delimiter.
    "the sky above the port was the color of television tuned to a dead channel"
    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Hoon has a sophisticated parser built into it that we'll use later. There are a lot of rules to deciding what is and isn't a rune, and how the various parts of an expression relate to each other. We don't need that level of power to work with basic text operations, so we'll instead use basic list tools whenever we need to extract or break text apart for now.
Exercise: Break Text at a Space
Hoon has a very powerful text parsing engine, built to compile Hoon itself. However, it tends to be quite obscure to new learners. We can build a simple one using list tools.
Compose a gate which parses a long tape into smaller tapes by splitting the text at single spaces. For example, given a tape
    "the sky above the port was the color of television tuned to a dead channel"
the gate should yield
    ~["the" "sky" "above" "the" ...]
To complete this, you'll need ++scag and ++slag (who sound like villainous henchmen from a children's cartoon).
    |=  ex=tape
    =/  index  0
    =/  result  *(list tape)
    |-  ^-  (list tape)
    ?:  =(index (lent ex))
      (weld result ~[`tape`(scag index ex)])
    ?:  =((snag index ex) ' ')
      $(index 0, ex `tape`(slag +(index) ex), result (weld result ~[`tape`(scag index ex)]))
    $(index +(index))
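The gate above leans on two list arms: ++scag keeps the first n items of a list, and ++slag drops them and keeps the rest. A brief Dojo sketch of how they cut a tape at the first space (the sample text is just a fragment of the sentence above):

```hoon
::  locate the first space
> (find " " "the sky above")
[~ 3]
::  everything before index 3 is the first word
> (scag 3 "the sky above")
"the"
::  skip the word and the space to get the remainder
> (slag 4 "the sky above")
"sky above"
```

This mirrors what the solution does on each pass: it takes `(scag index ex)` as the next word and continues with `(slag +(index) ex)`.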
Convert
If you have a Hoon value and you want to convert it into text as such, use ++scot and ++scow. These call for a value of type +$dime, a pair of an aura tag (the @tas equivalent of a regular aura, such as %ud) and the atom to render. These are labeled as returning cords (@ts) but in practice seem to return knots (@tas).
- The ++scot function renders a dime as a cord (dime→cord); the user must include any necessary aura transformation.
      > `@t`(scot %ud 54.321)
      '54.321'
      > `@t`(scot %ux 54.000)
      '0xd2f0'
      > (scot %p ~sampel-palnet)
      ~.~sampel-palnet
      > `@t`(scot %p ~sampel-palnet)
      '~sampel-palnet'
- The ++scow function renders a dime as a tape (dime→tape); it is otherwise identical to ++scot.
- The ++sane function checks the validity of a possible text string as a knot or term. The usage of ++sane will feel a bit strange to you: it doesn't apply directly to the text you want to check, but it produces a gate that checks for the aura (as %ta or %tas). (The gate-builder is a fairly common pattern in Hoon that we've started to hint at by using molds.) ++sane is also not infallible yet.
      > ((sane %ta) 'ångstrom')
      %.n
      > ((sane %ta) 'angstrom')
      %.y
      > ((sane %tas) 'ångstrom')
      %.n
      > ((sane %tas) 'angstrom')
      %.y
  Why is this sort of check necessary? Two reasons:
  - @ta knots and @tas terms have strict rules, such as being ASCII-only.
  - Not every sequence of bits has a conversion to a text representation. That is, ASCII and Unicode have structural rules that limit the possible conversions which can be made. If things don't work, you'll get a %bad-text response.
        > 0x1234.5678.90ab.cdef
        0x1234.5678.90ab.cdef
        [%bad-text "[39 239 205 171 144 120 86 52 92 49 50 39 0]"]
There's a minor bug in Hoon that will let you produce an erroneous term (@tas):
    > `@tas`'hello mars'
    %hello mars
Since a @tas cannot include a space, this is formally incorrect, as ++sane reveals:
    > ((sane %tas) 'hello')
    %.y
    > ((sane %tas) 'hello mars')
    %.n
Exercise: Building Your Own Library
Let's take some of the code we've built above for processing text and turn it into a library we can use in another generator.
- Take the space-breaking gate and the element-counting gate from above and include them in a |% barcen core. Save this file as lib/text.hoon in the %base desk of your fakeship and commit.
- Produce a generator gen/text-user.hoon which accepts a tape and returns the number of words in the text (separated by spaces). (How would you obtain this from those two operations?)
Logging
The most time-honored method of debugging is to simply output relevant values at key points throughout a program in order to make sure they are doing what you think they are doing. To this end, we introduced ~& sigpam in the last lesson.
The ~& sigpam rune offers some finer-grained output options than just printing a simple value to the screen. For instance, you can use it with string interpolation to produce detailed error messages.
There are also > modifiers which can be included to mark “debugging levels”, really just color-coding the output:
- No >: regular
- >: information
- >>: warning
- >>>: error
(Since all ~& sigpam output is a side effect of the compiler, it doesn't map to the Unix stdout/stderr streams separately; it's all stdout.)
You can use these to differentiate messages when debugging or otherwise auditing the behavior of a generator or library. Try these in your own Dojo:
    > ~&  'Hello Mars!'  ~
    'Hello Mars!'
    ~
    > ~&  >  'Hello Mars!'  ~
    > 'Hello Mars!'
    ~
    > ~&  >>  'Hello Mars!'  ~
    >> 'Hello Mars!'
    ~
    > ~&  >>>  'Hello Mars!'  ~
    >>> 'Hello Mars!'
    ~
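As a minimal sketch of combining ~& sigpam with string interpolation inside a gate, consider a naked generator that logs its argument before using it. The file name and the face `n` are hypothetical, chosen only for this illustration:

```hoon
::  gen/log-inc.hoon (hypothetical name)
::  log the argument at the "information" level, then increment it
|=  n=@ud
~&  >  "received n={<n>}"
+(n)
```

Saved and committed, this could be run as `+log-inc 5`; the message is printed as a side effect and the gate still produces the incremented value.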
%say Generators
A naked generator is merely a gate: a core with a $ arm that Dojo knows to call. However, we can also invoke a generator which is a cell of a metadata tag and a core. The next level-up for our generator skills is the %say generator, a cell of [%say core] that affords slightly more sophisticated evaluation.
We use %say generators when we want to provide something else in Arvo, the Urbit operating system, with metadata about the generator's output. This is useful when a generator is needed to pipe data to another program, a frequent occurrence.
To that end, %say generators use marks to make it clear, to other Arvo computations, exactly what kind of data their output is. A mark is akin to a MIME type on the Arvo level. A mark describes the data in some way, indicating that it's an %atom, or that it's a standard such as %json, or even that it's an application-specific data structure like %talk-command. marks are not specific to %say generators; whenever data moves between programs in Arvo, that data is marked.
So, more formally, a %say generator is a cell. The head of that cell is the %say tag, and the tail is a gate that produces a cask, a pair of the mark describing the data and the output data itself. Save this example as add.hoon in the /gen directory of your %base desk:
    :-  %say
    |=  *
    :-  %noun
    (add 40 2)
Run it with:
    > |commit %base
    > +add
    42
Notice that we used no argument, something that is possible with %say generators but impossible with naked generators. We'll explain that in a moment. For now, let's focus on the code that is necessary to make something a %say generator.
    :-  %say
Recall that the rune :- colhep produces a cell, with the first following expression as its head and the second following expression as its tail.
The expression above creates a cell with %say as the head. The tail is the |= * expression on the line that follows.
    |=  *
    :-  %noun
    (add 40 2)
|= * constructs a gate that takes a noun. This gate will itself produce a cask, which is a cell formed by the prepending :-. The head of that cask is %noun and the tail is the rest of the program, (add 40 2). The tail of the cask will be our actual data produced by the body of the program: in this case, just adding 40 and 2 together.
A %say generator has access to values besides those passed into it and the Hoon standard subject. Namely, a %say generator knows about our, eny, and now, as well as the current desk:
- our is our current ship identity.
- eny is entropy, a source of randomness.
- now is the current system timestamp.
- bec is the current path (beak).
These values can be stubbed out with * or ^ if they are not needed in a particular generator.
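As a hedged sketch of reading these values, the generator below names the first element of the sample as [now eny bec] and ignores everything else. It assumes that a beak is the usual [ship desk case] triple, so that the ship is available as `p.bec`; the file name is hypothetical:

```hoon
::  gen/whoami.hoon (hypothetical name)
::  produce the current ship and the current time
:-  %say
|=  [[now=@da eny=@uvJ bec=beak] *]
:-  %noun
[p.bec now]
```

Since `eny` is not used here, it could equally be stubbed out, giving a sample of `[[now=@da * bec=beak] *]`.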
%say generators with arguments
We can modify the boilerplate code to allow arguments to be passed into a %say generator, but in a way that gives us more power than we would have if we just used a naked generator.
Naked generators are limited because they have no way of accessing data that exists in Arvo, such as the date and time or pieces of fresh entropy. In %say generators, however, we can access that kind of data by identifying it in the gate's sample, which we only specified as * in the previous few examples. But we can do more with %say generators if we do more with that sample. Any valid sample will follow this 3-tuple scheme:
[[now=@da eny=@uvJ bec=beak] [list of unnamed arguments] [list of named arguments]]
This entire structure is a noun, which is why * is a valid sample if we wish to not use any of the information here in a generator. But let's look at each of these three elements, piece by piece.
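A minimal sketch of the second element in use: the generator below ignores the Arvo data, takes two unnamed arguments as a null-terminated list, and takes no named arguments. The file name and the faces `a` and `b` are illustrative assumptions:

```hoon
::  gen/add-two.hoon (hypothetical name)
::  add two numbers supplied on the command line
:-  %say
|=  [* [a=@ud b=@ud ~] ~]
:-  %noun
(add a b)
```

After committing, this would be invoked as `+add-two 40 2`. The trailing `~` in `[a=@ud b=@ud ~]` is what makes the unnamed arguments a list rather than a bare pair.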
Exercise: The Magic 8-Ball
This Magic 8-Ball generator returns one of a variety of answers in response to a call. In its entirety:
    !:
    :-  %say
    |=  [[* eny=@uvJ *] *]
    :-  %noun
    ^-  tape
    =/  answers=(list tape)
      :~  "It is certain."
          "It is decidedly so."
          "Without a doubt."
          "Yes - definitely."
          "You may rely on it."
          "As I see it, yes."
          "Most likely."
          "Outlook good."
          "Yes."
          "Signs point to yes."
          "Reply hazy, try again"
          "Ask again later."
          "Better not tell you now."
          "Cannot predict now."
          "Concentrate and ask again."
          "Don't count on it."
          "My reply is no."
          "My sources say no."
          "Outlook not so good."
          "Very doubtful."
      ==
    =/  rng  ~(. og eny)
    =/  val  (rad:rng (lent answers))
    (snag val answers)
~(. og eny) starts a random number generator with a seed from the current entropy. Right now we don't know quite enough to interpret this line, but we'll revisit the ++og aspect of this %say generator in the lesson on subject-oriented programming. For now, just know that it allows us to produce a random (unpredictable) integer using ++rad:rng. We slam the ++rad:rng gate, which returns a random number from 0 to n-1 inclusive. This gives us a random value from the list of possible answers.
Since this is a %say generator, we can run it without arguments:
    > +magic-8
    "Ask again later."
If we need to include optional arguments to a generator, we separate them using a , com:
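The same entropy pattern works for any bounded random choice. As a small sketch, here is a six-sided die: since ++rad:rng produces a value from 0 to n-1, the result is incremented to land in the range 1 to 6. The file name is hypothetical:

```hoon
::  gen/d6.hoon (hypothetical name)
::  roll one six-sided die using the generator's entropy
:-  %say
|=  [[* eny=@uvJ *] *]
:-  %noun
=/  rng  ~(. og eny)
+((rad:rng 6))
```

Like the Magic 8-Ball, this takes no arguments, and the result differs from run to run because `eny` is fresh each time.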
+cat /===/gen/cat/hoon, =vane %c
Exercise: Using the Playing Card Library
Recall the playing card library /lib/playing-cards.hoon. Let's use it with a %say generator.
/gen/cards.hoon
    /+  playing-cards
    :-  %say
    |=  [[* eny=@uv *] *]
    :-  %noun
    (shuffle-deck:playing-cards make-deck:playing-cards eny)
Having already saved the library as /lib/playing-cards.hoon, you can import it with the /+ faslus rune. When cards.hoon gets built, the Hoon builder will pull in the requested library and also build that. It will also create a dependency so that if /lib/playing-cards.hoon changes, this file will also get rebuilt.
Below /+ playing-cards, you have the standard %say generator boilerplate that allows us to get a bit of entropy from Arvo when the generator is run. Then we feed the entropy and a deck created by make-deck into shuffle-deck to get back a shuffled deck.
Solutions to Exercises
Roll-Your-Own-++snag
    ::  snag.hoon
    ::
    |=  [a=@ b=(list @)]
    ?~  b  !!
    ?:  =(0 a)  i.b
    $(a (dec a), b t.b)