CSV Macros for Great Good (and sanity)

Why I made my own Racket macro to parse CSV files

Published on 2021-12-08

I know I shouldn't be so fixated on a single language, but I had a hard decision to make between carrying on in my preferred language and handing my problems over to Python and calling it a day. I am, however, happy to announce that I am not admitting defeat and am going to keep going in Racket.

Deep in the CSV-mangling jungle, I was beginning to dread writing any kind of code related to spreadsheet parsing and transforming. It felt more boilerplate-heavy than it was worth, and coming up with solutions felt like a dead end. I had the tools I needed, but new problems kept coming up that required more specific code, and the code I was writing was only slightly different from what I had written before. I tried rewriting my tools into more generic, broader functions, but that didn't work out in all cases for me.

It was here I decided to throw my hat into the world of macros and see what I could come up with.

Structs

I find Racket structs to be completely reasonable to use in this CSV hellhole I find myself trapped in. A struct is a named tuple sort of thing that we can use to encapsulate data and pass it around. By themselves, structs are pretty harmless, but they lack a lot of basic functionality that I would love to have. A lot of times in processing I need to compare a struct to see if it is equal to another.

; a struct representing a 2d point
(struct point (x y))

; okay! struct definitions are in
(point? (point 3 4)) ; works, #t
(point-x (point 3 4)) ; works, 3
(point=? (point 3 4) (point 4 5)) ; doesn't exist, error

; okay let's see if it works with eqv?, maybe?
(eqv? (point 3 4) (point 3 4))   ; #f - um why
(eq? (point 3 4) (point 3 4))    ; #f
(equal? (point 3 4) (point 3 4)) ; #f - this is tiring

Is this a big deal? Well... Sort of, I guess. It is kind of helpful to know. Instead I'm sort of forced to write my own equality methods here, which isn't amazing.

; okay, a two field struct, easy compare method
(struct point (x y))

(define (point=? a b)
  (and (= (point-x a) (point-x b))
       (= (point-y a) (point-y b))))

; good lord help us...
(struct csvrecord (id name desc qty price amznid ebayid ...))
(define (csvrecord=? a b)
  (and (= (csvrecord-id a) (csvrecord-id b))
       ...)) ; this is pain
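
As a side note, when every field should take part in the comparison, Racket can do the work itself: a struct declared with #:transparent is compared field by field by equal?. A minimal sketch, assuming a plain two-field point:

```racket
#lang racket/base

; Assumption: every field should count towards equality.
(struct point (x y) #:transparent)

(equal? (point 3 4) (point 3 4)) ; #t, fields are compared one by one
(equal? (point 3 4) (point 4 5)) ; #f
(eq? (point 3 4) (point 3 4))    ; still #f, these are two separate objects
```

This does not help when only some fields (such as an ID) should decide equality, which is the case discussed next.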

In this case, it's not always necessary to match every field. It might be as simple as matching up the IDs, but even then we still have to write a pretty boilerplate function each time we define a struct.

(struct product (id name blabla))

(define (product=? a b)
  (= (product-id a) (product-id b)))

; refactor to use fewer calls
(define (product=? a b)
  (apply = (map product-id (list a b))))

Looks perfectly reasonable, yes. Except this code has a flaw: the = function only accepts numbers, so it raises an error when handed strings. CSV records are almost always plaintext, so strings are almost everywhere. Now we need to rewrite this with strings in mind! Simple change, but how do we keep track of this? Are structs actually the solution here?
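
One way to sidestep the question of which comparison to use is equal?, which compares strings by content and also accepts numbers. A minimal sketch, assuming a hypothetical two-field product record (the field names are placeholders):

```racket
#lang racket/base

; Hypothetical record type for illustration only.
(struct product (id name))

; equal? works on strings and numbers alike, so the same helper
; keeps working whether or not the IDs have been converted.
(define (product=? a b)
  (equal? (product-id a) (product-id b)))

(product=? (product "A-100" "widget") (product "A-100" "renamed widget")) ; #t
(product=? (product "A-100" "widget") (product "B-200" "widget"))         ; #f
```

Note that equal? is stricter than = on numbers: (equal? 1 1.0) is #f, so mixed exact and inexact IDs would still need care.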

Simply put, a struct does not care about types unless you force them in when you construct it. If you're doing a naked read of a CSV file, everything will be strings. Mapping the values on a spreadsheet read is logic you will have to code in yourself, as there's no cool functionality for that.
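
The kind of mapping logic meant here can be sketched as one converter per column, applied to each row of strings. The helper name coerce-row and the column layout below are assumptions made up for illustration:

```racket
#lang racket/base

; Hypothetical helper: apply one converter per column to a row of strings.
; `converters` and `row` are expected to have the same length.
(define (coerce-row converters row)
  (for/list ([convert (in-list converters)]
             [cell (in-list row)])
    (convert cell)))

; Assumed column layout: id, name, qty, price.
; `values` acts as the identity function and leaves a string untouched.
(define converters (list values values string->number string->number))

(coerce-row converters '("A-100" "widget" "3" "4.50"))
; => '("A-100" "widget" 3 4.5)
```

string->number returns #f for text that is not a number, so a real version would want to decide what to do with bad cells.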

Boilerplate

I published a lot of my CSV workings into a handy package on GitHub called ez-csv. It's a pain-free way of writing a struct that can help you read and write files.

The only issue is it lacks some major features:

no type enforcement of data, it cannot coerce strings into other types
does not keep track of headers, for when you read a file and want to store the headers for later
does not do it lazily, no lazy I/O (this is more a me problem)
doesn't do everything automatically, you must plug some code in a couple of places

There is a disconnect between the struct definition itself and the code to support it. The functions I would like to provide are:

file->structs, a one-line function to import a file as a list of struct records
structs->file, the inverse operation, but it doesn't have headers attached to the structs, must be passed in as an argument
struct=?, I cannot provide this function nakedly without more context, so I simply don't provide it at all

I can do a job that's slightly decent, but it doesn't cover me in many cases. I find struct-copy to be a great function that's very handy, but I have to write very boilerplatey code each time I want to implement a new CSV record with similar-ish logic.

It is here that I had to decide whether it was worth continuing to write these complex CSV programs in a functional language, or whether to hand them over to Python, which has a built-in csv module, and call it a day.

Macros

I vaguely remembered a tutorial that serves as a basic jumping-off point for macros, written by a Racket coder named Greg Hendershott. He wrote "Fear of Macros", a guide to getting started writing your own Racket macros, and it's very easy to understand.

Before going on this adventure, I remembered reading through his guide, and vaguely recalled something about defining your own structs with a macro. I was interested in that, as I wanted to extend the functionality of Racket structs so I could avoid repeated code patterns. If I could squash it all into a macro, I would save a lot of headaches now, and time later.

Internally, a struct definition in Racket is a macro that expands into several definitions based on the layout of the struct. There are several options you can fill out with a Racket struct, like guards and inheritance, but I don't really need all those bits and bobs (or maybe I do, but I can't think of how those special values help me yet). One defined struct equates to several definitions: a constructor, a predicate, field accessors, a way of doing struct-copy, and so on and so forth.

Below is the code from Greg's guide:

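; format-id lives in racket/syntax and is needed at compile time
(require (for-syntax racket/syntax))

(define-syntax (our-struct stx)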
  (syntax-case stx ()
    [(_ id (fields ...))
     ; Guard or "fender" expression:
     (for-each (lambda (x)
                 (unless (identifier? x)
                   (raise-syntax-error #f "not an identifier"
                                       stx x)))
               (cons #'id (syntax->list #'(fields ...))))
     (with-syntax ([pred-id (format-id #'id "~a?" #'id)])
       #`(begin
           ; Define a constructor.
           (define (id fields ...)
             (apply vector (cons 'id  (list fields ...))))
           ; Define a predicate.
           (define (pred-id v)
             (and (vector? v)
                  (eq? (vector-ref v 0) 'id)))
           ; Define an accessor for each field.
           #,@(for/list ([x (syntax->list #'(fields ...))]
                         [n (in-naturals 1)])
                (with-syntax ([acc-id (format-id #'id "~a-~a"
                                                 #'id x)]
                              [ix n])
                  #`(define (acc-id v)
                      (unless (pred-id v)
                        (error 'acc-id "~a is not a ~a struct"
                                       v 'id))
                      (vector-ref v ix))))))]))

The code on its own produces several functions when used as a macro: a constructor, a predicate to check whether a value is our custom struct, and accessor functions to read the fields of our new struct.

First let's go over the basics: what the heck all this means. Let's start from the top: syntax-case. When we create our macro using define-syntax, we define a name for a macro transformer and a parameter that receives a syntax object, namely the stx argument. We pass this strange syntax object to syntax-case, a pattern-matching form that helps us take apart whatever syntax we get.

The syntax-case structure is handled like a cond, but we bind variable names to the overall structure of the syntax received. In the case of (_ id (fields ...)), we are looking for syntax that resembles that shape. In turn we will transform a macro call of (our-struct my-record [field1 field2]) into something that vaguely resembles a Racket-like struct definition after doing some magic.
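
To make the cond-like matching concrete, here is a minimal sketch that is not part of Greg's code. The macro name rev-list is made up for illustration; each clause pairs a pattern with the syntax to return:

```racket
#lang racket/base
(require (for-syntax racket/base))

; Hypothetical macro: clauses are tried top to bottom, like cond.
(define-syntax (rev-list stx)
  (syntax-case stx ()
    ; _ matches the macro name itself; a and b bind to the argument forms.
    [(_ a) #'(list a)]
    [(_ a b) #'(list b a)]))

(rev-list 1)   ; => '(1)
(rev-list 1 2) ; => '(2 1)
```

A call that matches no clause, such as (rev-list 1 2 3), is rejected with a syntax error at compile time.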

A macro takes a syntax object and must return a syntax object. This is where with-syntax comes into play: we must turn the pattern variables we just bound with syntax-case into some kind of code we can return. Arguments like id or fields are no good on their own, and must be transformed into things we can use when we generate new code.

In the case of making an initializing function and a predicate, we can use whatever name we received as the initializer, similar to a struct call. For the predicate, we can append a question mark, but we must do that through a with-syntax binding, using format-id to make a new variable substitute.

If you do (format-id #'id "~a?" #'id), we get back a new identifier (a syntax object, not a string) that will look like my-record? at the end. If we bind this to pred-id inside of with-syntax, we have a new substitute variable we can use.
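
format-id can also be poked at outside of a macro, which makes it easier to see what comes back. A small sketch, assuming my-record as the example name:

```racket
#lang racket/base
(require racket/syntax)

; Build the predicate name from the identifier my-record.
(define pred-id (format-id #'my-record "~a?" #'my-record))

(identifier? pred-id) ; #t, it is a syntax object wrapping a symbol
(syntax-e pred-id)    ; 'my-record?
(string? pred-id)     ; #f
```

The first argument supplies the lexical context for the new identifier, which is why the macro passes #'id there: the generated name then behaves as if it had been written at the place where the struct name was written.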

Inside the with-syntax is a begin expression written as #`(begin ...), which is shorthand for (quasisyntax (begin ...)). It is here we can use those with-syntax variables freely without having to escape them. That's why you see something like (define (pred-id v) ...) without any kind of escaping. It's a direct substitution.

The tricky part is understanding the code at the bottom where it starts with #,@(for/list ...). The #,@ (unsyntax-splicing) escapes out of the template, runs a loop over the field IDs, and splices the resulting list of function definitions back into the begin. All those definitions do is call vector-ref for us on our internal struct representation.
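
The same splicing pattern can be seen in isolation in a much smaller macro. This is a sketch for illustration, not part of Greg's code; define-indexes is a made-up name that defines one numbered constant per field:

```racket
#lang racket/base
(require (for-syntax racket/base racket/syntax))

; Hypothetical macro: (define-indexes prefix (name ...)) defines
; prefix-name as 0, 1, 2, ... for each name, in order.
(define-syntax (define-indexes stx)
  (syntax-case stx ()
    [(_ prefix (names ...))
     #`(begin
         ; The loop runs at compile time and returns a list of
         ; definitions, which #,@ splices into the begin.
         #,@(for/list ([name (syntax->list #'(names ...))]
                       [n (in-naturals)])
              (with-syntax ([index-id (format-id #'prefix "~a-~a" #'prefix name)]
                            [ix n])
                #'(define index-id ix))))]))

(define-indexes col (id name qty))
(list col-id col-name col-qty) ; => '(0 1 2)
```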

Because that's all we're doing! If you've noticed inside (define (id fields ...)), all we're doing is passing a list over to the vector initializer. We're essentially packing all of our information into a vector data type, which is a good way of passing data around.

After all that, you will have a constructor/initializer, a predicate, and N-number of field access methods. Pretty simple stuff!

Extending

Knowing my own non-macro CSV code, I wanted to adjust this and add more functionality. I wanted to include the automatic file-to-struct/struct-to-file code so that's included right out the gate. I want to include headers, so we don't lose that information anywhere and it's neatly bound within the internal vector. The delimiter is also contained because I'm tired of passing that in as an argument everywhere too.

However, there are some things Greg's original code will not take care of for us. If we're parsing CSVs, we must assume a fixed number of fields in each record we receive. His code does not check the bounds of incoming data, so we must do that ourselves. If we have five headers, we need to be certain we take in five pieces of data when building our internal vector.

(define (id fields ...)
  (unless (= (length headers-lst) (length (list fields ...)))
    (error "Whoops! Invalid amount of data chunks!"))
  (apply vector `(id ,headers-lst ,delimiter ,@(list fields ...))))

It's kind of a bare-minimum length check, but at least it's a start. I need to ensure the number of values matches the number of headers.

The next thing is including a simple file->structs helper, which isn't really so hard. This eliminates a lot of boilerplate code, as I often found myself rewriting this function just to bring a string converter and the delimiter back into scope. Since they were never attached in a meaningful way, I was always passing in additional arguments to do this.

But now the macro produces a simple procedure that works for every struct defined with it.

(define (thing->csv fpath listof-v)
  (call-with-output-file #:exists 'replace
    fpath
    (λ (output)
      (parameterize ([current-output-port output])
        (displayln (string-join headers-lst delimiter))
        (for-each
          (λ (v)
            (displayln
              (string-join
                (for/list ([f headers-lst] [n (in-naturals 3)])
                  (vector-ref v n))
              delimiter)))
            listof-v)))))

It's pre-baked to enumerate over the headers and do the vector referencing, so I no longer have to do the (struct-field1 v) (struct-field2 v) ... madness that I have often found myself doing to write out basic CSV files.
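
The reading direction can be sketched in the same spirit. The helper below is an assumption for illustration and not the ez-csv implementation: it splits each line on the delimiter naively and does not understand quoted fields.

```racket
#lang racket/base
(require racket/string)

; Hypothetical reader: one list of strings per line of the input port.
; #:trim? #f keeps empty cells instead of dropping them.
(define (port->rows in delimiter)
  (for/list ([line (in-lines in)])
    (string-split line delimiter #:trim? #f)))

(define rows
  (port->rows (open-input-string "id,name,qty\nA-100,widget,3\n") ","))

(car rows) ; the header row
(cdr rows) ; the data rows, each ready to be applied to a constructor
```

Taking a port rather than a path keeps the sketch easy to try out; a file version could wrap it in call-with-input-file.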

Lastly comes the final issue I am still slightly struggling to come up with an idea for: mimicking struct-copy behavior. struct-copy is incredibly useful for me to iterate over structs and map new values to them, but how do I apply that behavior here in concept? struct-copy has a kind of macro nature to it where it takes actual syntax and applies the transformation properly, but I don't feel comfortable enough creating a sub-macro from a macro. I'm not ready for that life yet.

Instead, I think it might be better to trade out the vector, which is a fixed-size storage type, and use an immutable hash instead. Vectors I feel have great performance, but the functions around updating them are pretty terrible (likely because of how memory is boxed up for the vector; I imagine copying vectors is relatively memory-unfriendly). I don't really care about the memory performance of vectors; what I want is something that is easy to update freely.

For me to get this, I would much prefer working with a hash and using hash-update as a means of updating the struct by mapping new values to their respective keys. I can create an updater function that looks at a list of pairs and applies a hash-update for each pairing. Carrying Greg's code over to an immutable hash will take a small amount of time, but the result will be a more usable data type.
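
A minimal sketch of such an updater, assuming records are immutable hashes keyed by field name. The name record-update is hypothetical, and the sketch uses hash-set rather than hash-update because each pair carries a replacement value, not a function of the old one:

```racket
#lang racket/base

; Hypothetical updater: apply a list of (key . value) pairs to an
; immutable hash, returning a new hash and leaving the original alone.
(define (record-update rec pairs)
  (for/fold ([r rec]) ([p (in-list pairs)])
    (hash-set r (car p) (cdr p))))

(define rec (hash 'id "A-100" 'name "widget" 'qty "3"))
(define rec2 (record-update rec '((qty . "5") (name . "gadget"))))

(hash-ref rec2 'qty)  ; "5"
(hash-ref rec2 'name) ; "gadget"
(hash-ref rec 'qty)   ; "3", the original is unchanged
```

hash-update would fit better where the new value depends on the old one, for example incrementing a quantity.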

Conclusion

I have spent a few days writing this macro for work purposes, and I feel excitement over the idea of using it, and fear over knowing that I'll stop using Racket structs which were completely okay to use before. All mistakes and errors that come forth will be because I went overboard and wrote a crazy macro.

But it's exciting because I want to share it with the world and put it into ez-csv, even if no one uses it. It isn't in what I would call the greatest state, because I don't have a means of exporting all the functions I defined, but I can work on that part and actually get it going. I think this macro is pretty cool and saves me quite a bit of effort and headaches.

I will write a bit more about this macro at a later time, possibly after I publish it into the ez-csv repo where it will be now released into the wild. So far it passes all my basic read/write tests and I have been using it in some programs for a few days now. We shall see how it goes.

Thanks for reading!