A Macro Journey

My foray into macros, and why I think they're pretty darn neat

Published on 2021-12-14

A while back I decided to write up a story from work, where I was developing tools for wrangling spreadsheet data, namely CSV sheets. I was using the Racket programming language, which I've really come to enjoy working in. It's simple, easy, and the functionality and flexibility are pretty incredible. It supports multiple paradigms at once, can do many things, can compile its code to contain miniature VMs for multiple platforms, and overall proves useful to me over and over.

But this story didn't start out pretty. I was tasked with handling a lot of random spreadsheet data and trying to create logic to interweave different data sets. I didn't have much in the way of project support, and I didn't want to integrate parts that would make things more complicated. As the sole programmer I could pick whatever environment I wanted to work with, but finding a language that dealt with CSV sheets gracefully was a draining process.

This story is about how I approached this problem using native Racket components, why I went the way I did, and eventually how I ported this code to my open source CSV package, ez-csv.

Structs

A struct can mean different things in different languages, but a struct in Racket is a lot like a list that has named accessor functions. The basic gist of a struct looks like the following:

; defining the name of the struct, and its fields
(struct point (x y z))

; make a struct
(define p1 (point 3 4 5))

; access the Z field
(displayln (point-z p1))

; predicate to determine if a value is a point
(displayln (point? p1))

The struct form is effectively a macro that defines many different functions for us. From that one-line definition it creates a constructor function, accessors for all the fields, and a predicate function. There are other struct mechanisms, like some basic inheritance and data copying, but I won't talk about those quite yet.

So naturally, a struct seems like a decent starting point for reading in some CSV data. A CSV record is a string of N fields separated by commas. The first line of a CSV file holds the headers, while the following lines are all the records. We can create a naive implementation of a CSV reader in a few lines using some basic string functionality.

(define (read-csv csv-file-path)
  (call-with-input-file csv-file-path
    (lambda (input)
      (map
        (lambda (line)
          (string-split line ","))
      (port->lines input)))))

> (read-csv "example.csv")
'(("name" "id_number")
  ("steven" "id123")
  ("thomas" "id457")
  ...

This implements a basic kind of parser. You can take the head of the list and use it as header information, and the rest of the list as the data. You can encapsulate the CSV rows using structs to get easy accessors instead of depending on traversal functions like car, cdr and cadr.

; define an Employee struct
(struct Employee (name id))

; define a mapper tool to iterate over a list of data
(define (lines->Employees listof-records)
  (map (lambda (listof-data) (apply Employee listof-data))
       listof-records))

; create an easy way to convert file to structs
; uses a cdr call to skip the first header row
(define file->Employees
  (compose lines->Employees cdr read-csv))
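
To see the difference between accessors and list traversal without needing a file on disk, here is a small sketch that feeds hand-written rows through the same mapper. The rows are made-up sample data standing in for what read-csv would return.

```racket
#lang racket

(struct Employee (name id))

(define (lines->Employees listof-records)
  (map (lambda (listof-data) (apply Employee listof-data))
       listof-records))

; made-up rows, shaped like the output of read-csv
(define rows
  '(("name" "id_number")
    ("steven" "id123")
    ("thomas" "id457")))

; the head is the header row, the rest are the records
(define header (car rows))
(define employees (lines->Employees (cdr rows)))

; named accessor on the second employee
(Employee-id (cadr employees)) ; "id457"

; the same value through plain list traversal
(cadr (caddr rows)) ; "id457"
```

Both expressions reach the same field, but only the first one says what it is fetching.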

Now we can read a file and turn each row into a struct by using apply. However, the reverse operation is less clear, because the headers were contained inside the data we read. By throwing them away, we have nothing left to write back out when we want to save the records. This may or may not matter depending on the application, but keeping track of the header is now an extra chore to keep in mind.

(define (read-header-and-records fpath)
  (call-with-input-file fpath
    (lambda (in-file)
      ; every row, including the header row, becomes an Employee
      (define all-rows
        (map (lambda (line)
               (apply Employee (string-split line ",")))
             (port->lines in-file)))
      (values (car all-rows)
              (cdr all-rows)))))

(define-values (header records)
  (read-header-and-records "my_employees.csv"))

You can do it by using values to return multiple values, but now the header information is kept apart from the record data. To write a file out, we must remember to pass the header along to generic functions, and we must always remember to use define-values instead of regular define, which is an extra mental step.

Updating records to keep a file up to date, or doing some simple operations on data, is easy with struct-copy, which copies a struct and returns a new one, leaving the original completely unmodified.

; use struct-copy to copy the original struct and apply changes
(define (update-record rec)
  (struct-copy Employee rec
               [name (string-upcase (Employee-name rec))]))

; now all employees will have names with uppercase letters only
; (records comes from the define-values call above)
(define fixed-data (map update-record records))

Now we come to the last part of the problem, where we must write the records to a CSV file. We have to pass in the headers externally if we want to keep them the same.

; we need to convert a record to a string
(define (Employee->string rec)
  (string-join (list (Employee-name rec)
                     (Employee-id rec)) ","))

; iterate over all records and write to file
(define (write-csv out-path header records)
  (call-with-output-file out-path #:exists 'replace
    (lambda (out-port)
      (parameterize ([current-output-port out-port])
        (displayln (Employee->string header))
        (for-each
          (lambda (record)
            (displayln (Employee->string record)))
           records)))))

; write the data to file using the above call
(write-csv "new_data.csv" header fixed-data)

We now have a complete, working means of operating over records. However, there are so many issues I have with all this code that it drives me up the wall. Yes, we can work with this code, and yes, I have been doing it in similar ways using these functions for months now. But unfortunately, it makes me go crazy.

Since header information is part of the spreadsheet, it must be kept either alongside the data, or inside some data type that can hold both. It is significantly easier to not include headers in the struct itself, so you don't have to include a header field each time you declare a new struct definition.
Structs are convenient, but they cannot be written to a file directly without some sort of serialization that displayln can understand. We need the data in a string format, meaning we also need to keep track of a delimiter each time. That's additional information overhead, and not exactly something we want to create an accessor field for in a struct.
Defining all the intermediary functions like struct->string is kind of boring. This part can be done automatically in some sense, but the struct macro will not provide it. You can provide a function by setting prop:custom-write, but again, that's additional overhead you will have to tack on yourself.
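
To give an idea of what the prop:custom-write route involves, here is a minimal sketch. The comma delimiter is hard-coded as an assumption, which is exactly the kind of extra bookkeeping described above.

```racket
#lang racket

; The property value is a procedure taking the struct, an output port,
; and a mode flag (ignored here, so write, display and print all agree).
(struct Employee (name id)
  #:property prop:custom-write
  (lambda (rec port mode)
    (write-string
     (string-join (list (Employee-name rec) (Employee-id rec)) ",")
     port)))

; displayln now emits a CSV line instead of #<Employee>
(displayln (Employee "steven" "id123")) ; prints steven,id123
```

It works, but the property has to be repeated, field by field, for every record type you define.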

I have been writing CSV parsing code for a while now, and these are the three main problems I had. struct got me close, but not close enough to avoid being tiresome in the long run. I had a few choices to look into.

Classes and Objects
Third party libraries
A DIY Racket Macro

Classes and Objects

The reason I don't like the class idea is that it doesn't resonate with me. Racket is largely a functional Lisp-family language, and I use it primarily because I enjoy the functional programming aspect. Going back to an object-oriented style of coding would be a detriment to me, as it means I'd have to introduce OOP-style closures and new methods, and keep up more mental gymnastics.

While I don't think a class/object approach would be nice overall, there are some ideas that mesh rather well. Sometimes I have data in one sheet that needs to be extended and carry over information. Classes can define that relationship well, by having parent and child classes that extend fields.

The downside is that to interact with objects you largely have to use a form called send. send quite literally sends a message to an object, telling it to execute some code that is bound to that instance of data. The method could mutate the object, or it could leave it untouched. It's hard to tell from the call site, as you may or may not get a return value depending on the method's code.
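
For a concrete picture, here is a minimal sketch of the parent/child idea using racket/class. The class names, fields and method names are all hypothetical, made up purely for illustration.

```racket
#lang racket

; Hypothetical parent record: two fields and a CSV writer.
(define contact%
  (class object%
    (init-field name number)
    (super-new)
    (define/public (fields) (list name number))
    (define/public (to-csv-line)
      (string-join (send this fields) ","))))

; Hypothetical child record: extends the parent with one more field.
(define employee%
  (class contact%
    (init-field id)
    (super-new)
    (inherit-field name number)
    (define/override (fields) (list name number id))))

(define e
  (new employee% [name "steven"] [number "555-0100"] [id "id123"]))

; every interaction goes through send and a method name
(send e to-csv-line) ; "steven,555-0100,id123"
```

The inheritance part is pleasant enough, but notice that nothing at the call site tells you whether to-csv-line returns a value or changes e in place.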

It seemed both better and worse in some aspects. I could get OOP relationships in data, but functionality would be all over the place and (probably) broken up across classes that don't really need to be classes. I would have to write new methods that can interact with send, and then I'd have to remember the identifiers that match up to the bound methods. Maybe some methods aren't defined for certain data types, and I forget that, and lose an hour trying to figure out where I messed up if the stack trace errors aren't clear enough. It seemed like more of a net loss than a net gain.

Third Party Libraries

My other solution would be to use other people's approaches to CSV files. From what I have seen, there aren't many projects out there dedicated to this, or maybe some of them are simply too old for me to want to run the risk of using. Here's a sample of libraries I've looked at.

csv - last updated August 2017
csv-reading - how do I even contribute to this??
csv-writing - last updated November 2019
simple-csv - fails to build

I need to write reliable code now, code that I can possibly share with others. CSV files aren't that complex, but if I run into shortcomings in someone else's code, I need to get things working now, not later. I have time restrictions that simply make it impossible for me to make open source contributions that could take who knows how long to get merged.

Therefore, I have published ez-csv, as a means of approaching this weird CSV space that doesn't get enough attention. I am sharing my code that reads and writes CSV files to be helpful and give back to the Racket community.

A DIY Racket Macro

I am vaguely familiar with the notions behind a Racket macro. A Racket macro is essentially a type of Racket function that operates on code as if it were data. Instead of using something like a C preprocessor, which comes with a slew of developer woes, Racket uses its own language as the preprocessor.

The chain of operations looks something like this:

<Racket Macro Code> ; (define x 3)
 |
 -> [Racket Macro Expansion]
    |
    -> <Racket Code> ; (define-values (x) (quote 3))
       |
       -> [Racket Interpreter]
          |
          -> <Results> ; > 3

So writing macro code looks really no different from writing regular Racket code. In fact, you are allowed to write macro and non-macro code in the same space. The difference is that the data we operate on is a little bit different - it's mostly symbols and lists.
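
As a rough illustration of code being just symbols and lists, here is a small sketch that pokes at a quoted definition with ordinary list functions. The variable names are made up for the example.

```racket
#lang racket

; Quoting a piece of code gives back plain data: a list of symbols and numbers.
(define code '(define x 3))

(car code)    ; the symbol define
(length code) ; 3

; New "code" is built with ordinary list operations.
(define wrapped (list 'begin code '(displayln x)))
wrapped ; '(begin (define x 3) (displayln x))

; Macros see the same shape, wrapped up as a syntax object.
(syntax->datum #'(define x 3)) ; '(define x 3)
```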

So no longer are you working with numbers or strings; you're working purely with lists and creating new lists. You take in a list of symbols and inner lists, and output a new list of symbols and lists. Let's look at some simple macros written with define-syntax-rule.

; a simple identity rule - return whatever it gives
(define-syntax-rule (id X) X)

; null - return nothing given something
(define-syntax-rule (null X)
  (void))
  
; duplicate - duplicate whatever is given
(define-syntax-rule (duplicate X)
 `(,X ,X))


> (id 3) ; => 3
> (null 3) ; => <returns void, no value>
> (duplicate 3)
'(3 3)
> (duplicate (displayln "hi"))
hi
hi
'(#<void> #<void>)

The duplicate macro is slightly interesting. Notice how it runs the effect of displayln twice, and then captures the void value of each displayln call? That's because a single piece of code was pasted in, and therefore evaluated, twice! That's the core of all programming right there - doing more with less.

This is the key difference in how macros work versus how ordinary functions work. In a normal function call, the displayln call would have been evaluated once before the function ran, "hi" would have been printed once, and we would get two void values. In a macro, we instead took the displayln expression as if it were an ordinary list of symbols, then returned an expansion in which it appears twice.
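
For comparison, here is the ordinary-function counterpart as a short sketch. The name duplicate-fn is made up for the example.

```racket
#lang racket

; An ordinary function: the argument is evaluated once, before the call,
; and the function only ever sees the resulting value.
(define (duplicate-fn x)
  (list x x))

(duplicate-fn 3)                ; '(3 3)
(duplicate-fn (displayln "hi")) ; prints hi once, returns '(#<void> #<void>)
```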

If we want to avoid this while still using a macro, we need to evaluate the code first, store the result, then duplicate the value afterwards. We could write this as an ordinary function, but it's a neat exercise in understanding macro expansion.

(define-syntax-rule (run-once-dupe X)
  (let ([runs-once X])
    `(,runs-once ,runs-once)))

> (run-once-dupe (displayln "hi"))
hi
'(#<void> #<void>)

At a glance, you might think this doesn't work at all, or that there's no real difference between doing this as a macro and doing it as an ordinary function. On the second point you'd be right: there isn't any. You might also expect it to print twice, since X is pasted in as code and runs-once is used twice, and that's a reasonable line of thinking. But in macro land, things are different.

In that let binding, it appears as if all we're doing is binding the variable runs-once to the input argument X, which is a list of some Racket code. If that were the case, the code would be evaluated twice by the quasi-quoted list. But by the time anything runs, expansion is already finished: the input argument has been pasted into a regular let expression. The expression is evaluated once, at the let binding, and runs-once holds its value rather than the code. The real expansion looks something like this.

(let ([runs-once (displayln "hi")])
  `(,runs-once ,runs-once))

The input code was transformed into a let block, and now there's no multiple displayln expressions. Therefore, displayln only prints once, but we duplicate the #<void> value like before.

This isn't a fancy macro at all, but it's an interesting dive into how macro transformation works. You have to take code literally and visualize it almost like a copy-and-paste transform when doing macro expansion. There are no real tricks to it; it's very simple.

One of the special things we can do with macros is have code defined for us ahead of time, including brand new names. We have to use some more advanced macro tooling, but it's not too hard to follow. Namely, we're going to use syntax-case and with-syntax as special ways of creating new code.

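; format-id comes from racket/syntax, which must be available at compile time
(require (for-syntax racket/syntax))

(define-syntax (make-names stx)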
  (syntax-case stx ()
    [(_ first last)
     (with-syntax
      ([first-name (format-id #'first "first-name")]
       [last-name (format-id #'last "last-name")]
       [full-name (format-id #'first "full-name")])
     #'(begin
        (define first-name first)
        (define last-name last)
        (define full-name (string-append first " " last))))]))

> (make-names "Steven" "Leibrock")
> full-name
"Steven Leibrock"

syntax-case and with-syntax are special forms designed to work with syntax templates and patterns. Without them we would be left destructuring syntax lists ourselves, and that's really not much fun. The final result is pretty cool: from only two strings and one line of code, the macro creates three variable bindings on our behalf, for the first, last and full name. Pretty neat!

This effectively serves as the basis of my new CSV binding code. Using syntax-case and with-syntax, I am able to create a pretty feature-complete portable CSV library that allows users to define their own CSV record type. The code is far too long to actually paste here, but you can see the full thing over at ez-csv.

The #'(begin ...) form is a syntax template: it builds the new block of code that the macro returns for later evaluation. It's close to a regular Racket list; once you unwrap it with syntax->list, you can use cons to attach something to the front, iterate over it, and do many cool things within macros. My CSV library is really only scratching the surface of what one can do with some simple macros. In a lot of my programming, basic macros simply keep things tidy and save me from an insane amount of copy-and-paste.
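
As a small sketch of treating syntax like a list, the hypothetical sum-all macro below unwraps its arguments with syntax->list, conses + onto the front, and turns the result back into code with datum->syntax.

```racket
#lang racket

; Hypothetical macro: (sum-all 1 2 3) expands to (+ 1 2 3).
(define-syntax (sum-all stx)
  (syntax-case stx ()
    [(_ arg ...)
     ; syntax->list gives a plain list of syntax objects
     (let ([args (syntax->list #'(arg ...))])
       ; attach + to the front, then convert the list back to syntax
       (datum->syntax stx (cons #'+ args)))]))

(sum-all 1 2 3) ; 6
```

A template like #'(+ arg ...) would do the same job in one line; the longer form is only here to show the list operations at work.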

The case for with-syntax is that a template can only refer to pattern variables, and syntax-case only binds the ones that appear in its pattern. Newly computed identifiers have to be bound as pattern variables too, which is where with-syntax comes into play. In fact, with-syntax is a specialized form of syntax-case, but it's written a bit differently. format-id is used to create identifiers that we can refer to in code, and it can inject symbols into format patterns. It does a bit more than simple string formatting, such as giving the new identifier a proper lexical context, but that one's above my head.
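
Here is a minimal sketch of format-id injecting a name into a format pattern. The define-greeter macro is hypothetical and exists only to show the mechanism.

```racket
#lang racket
(require (for-syntax racket/syntax))

; Hypothetical macro: (define-greeter steven) defines a function greet-steven.
(define-syntax (define-greeter stx)
  (syntax-case stx ()
    [(_ who)
     ; build the identifier greet-<who>, borrowing the lexical context of who
     (with-syntax ([greet-who (format-id #'who "greet-~a" #'who)])
       #'(define (greet-who)
           (string-append "hello, " (symbol->string 'who))))]))

(define-greeter steven)
(greet-steven) ; "hello, steven"
```

Because the new identifier takes its context from who, code at the use site can call greet-steven directly.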

You can also use things like identifier? or identifier-binding to determine whether a piece of syntax is actually an identifier (or whether it already has a binding), from which you can say "hey, let's not bother continuing, these variables are already in use" by raising a syntax error. Or if you would like to mutate the existing variable using set!, you can do that too!
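
A minimal sketch of that kind of guard, using identifier? and raise-syntax-error, could look like the following. The define-checked macro is a made-up name for illustration.

```racket
#lang racket

; Hypothetical macro: like define, but rejects anything that is not a name.
(define-syntax (define-checked stx)
  (syntax-case stx ()
    [(_ name value)
     (begin
       ; bail out at expansion time with a readable error
       (unless (identifier? #'name)
         (raise-syntax-error 'define-checked
                             "expected an identifier"
                             stx
                             #'name))
       #'(define name value))]))

(define-checked answer 42)
answer ; 42

; (define-checked "answer" 42) would be rejected while the module expands
```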

Macros are pretty funny when you start to get into them, and it's powerful what you can do. Here's a sample snippet of how I use my defrec macro, a macro that auto-creates a ton of CSV handling code for me.

; define a CSV record, which populates many functions for me
(defrec Contact ["Name" "Number"] [name number] ",")

; import my phonebook, each as a Contact record
(define my-contacts (file->Contacts "phonebook.csv"))

; map a displayln over all records using a Contact->string proc
(for-each (compose displayln Contact->string) my-contacts)

There are some more hidden functions, but again, check out the repository to get a better picture of things.
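
To connect this back to syntax-case, with-syntax and format-id, here is a much-reduced sketch in the same spirit. It is not the ez-csv implementation: the macro name define-csv-record, its argument order, and the generated function names are all assumptions made for this example.

```racket
#lang racket
(require (for-syntax racket/syntax))

; Hypothetical cousin of defrec; NOT the ez-csv code.
; Generates a struct plus string conversion functions for it.
(define-syntax (define-csv-record stx)
  (syntax-case stx ()
    [(_ name (field ...) delimiter)
     (with-syntax ([to-string (format-id #'name "~a->string" #'name)]
                   [from-string (format-id #'name "string->~a" #'name)]
                   ; one accessor name per field, e.g. Contact-number
                   [(accessor ...)
                    (for/list ([f (in-list (syntax->list #'(field ...)))])
                      (format-id #'name "~a-~a" #'name f))])
       #'(begin
           (struct name (field ...))
           (define (to-string rec)
             (string-join (list (accessor rec) ...) delimiter))
           (define (from-string line)
             (apply name (string-split line delimiter)))))]))

(define-csv-record Contact (name number) ",")

(define c (string->Contact "steven,555-0100"))
(Contact-number c)  ; "555-0100"
(Contact->string c) ; "steven,555-0100"
```

One definition line replaces the struct declaration, the record-to-string function and the parsing helper that had to be written by hand in the struct section, and the delimiter is remembered along the way.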

Conclusion

I have been working on this macro for what seems like two weeks now, and I am pleased with its progress in my real-life work scenarios. I don't think the other libraries do what this macro is currently doing. I'm excited and also scared at the idea of other people using this and submitting feedback to me. I'm also scared of not doing macros the right way, so I think I still need to seek feedback from more veteran Racket users.

This has been a fun journey and I do love using Racket. It's been a blast, and since I can't share it with most of my real-life friends, I figured I would do my best to share my experience with the world. Thank you for reading!