Ibex: A Typed DataFrame Language with C++ Code Generation

Anthropic used a swarm of Claude Opus 4.6 agents to build a C compiler for about $20,000 in API costs with minimal intervention. It made me realise that while I have been a happy user of R’s data.table for quite some time, the time to build something better is here. Third-party packages such as data.table, but also pandas, Polars, and the Tidyverse, are bolted onto an existing scripting language. These packages are marvels of engineering, but the hacks this requires have both pros and cons. They can lead to surprising behaviour (often: bugs), and when performance matters, a deep understanding of what happens under the hood is needed. The trade-off seems structural: a language focused on data frames might be a much more powerful tool.

So I, together with my team of LLMs, started building a new language. The result is Ibex. It is far from done, but it already does some useful work:

  • CSV and Parquet reading
  • Select / Update / Filter / Group / Aggregate
  • Regular and table variables
  • Function definitions
  • External C++ interop
  • Basic type inference
  • Basic joins

It can be run through a REPL and can also transpile to C++, so the generated code can be used in larger projects. An example:

$ rlwrap ./build-release/tools/ibex --plugin-path libraries/
[info] Ibex REPL started (verbose=false)
ibex> extern fn read_parquet(path: String) -> DataFrame from "parquet.hpp";
ibex> let flights = read_parquet("data/flights-1m.parquet");
ibex> flights
rows: 1000000
columns: FL_DATE DEP_DELAY ARR_DELAY AIR_TIME DISTANCE DEP_TIME ARR_TIME
2006-01-01 5 19 350 2475 9.083333015441895 12.483333587646484
2006-01-02 167 216 343 2475 11.783333778381348 15.766666412353516
2006-01-03 -7 -2 344 2475 8.883333206176758 12.133333206176758
2006-01-04 -5 -13 331 2475 8.916666984558105 11.949999809265137
2006-01-05 -3 -17 321 2475 8.949999809265137 11.883333206176758
2006-01-06 -4 -32 320 2475 8.933333396911621 11.633333206176758
2006-01-08 -3 -2 346 2475 8.949999809265137 12.133333206176758
2006-01-09 3 0 334 2475 9.050000190734863 12.166666984558105
2006-01-10 -7 -21 334 2475 8.883333206176758 11.816666603088379
2006-01-11 8 -10 321 2475 9.133333206176758 12
… (999990 more rows)
ibex> flights[filter FL_DATE == "2006-01-01"]
rows: 17618
columns: FL_DATE DEP_DELAY ARR_DELAY AIR_TIME DISTANCE DEP_TIME ARR_TIME
2006-01-01 5 19 350 2475 9.083333015441895 12.483333587646484
2006-01-01 3 3 281 2475 9.550000190734863 17.799999237060547
2006-01-01 -1 19 348 2475 11.983333587646484 15.366666793823242
2006-01-01 0 -16 279 2475 12.5 20.549999237060547
2006-01-01 -4 26 516 3784 10.016666412353516 15.116666793823242
2006-01-01 36 8 380 3711 18.516666412353516 5.066666603088379
2006-01-01 -4 24 504 3711 11.850000381469727 16.75
2006-01-01 36 26 393 3784 18.600000381469727 5.699999809265137
2006-01-01 22 15 261 2475 22.866666793823242 6.9666666984558105
2006-01-01 -1 -15 266 2486 23.649999618530273 6.333333492279053
… (17608 more rows)
ibex> flights[filter FL_DATE == "2006-01-01", m = mean(ARR_DELAY)]
error: 1:41: expected clause
ibex> flights[filter FL_DATE == "2006-01-01", select m = mean(ARR_DELAY)]
rows: 1
columns: m
10.206833919854693
ibex> flights[filter FL_DATE == "2006-01-01", select m = mean(ARR_DELAY), by AIR_TIME]
rows: 421
columns: AIR_TIME m
350 23.6
281 11.5
348 41
279 20.363636363636363
516 26
380 8
504 24
393 27.5
261 -0.7777777777777778
266 13.714285714285714
… (411 more rows)

All of these queries run fast, practically instantly on my machine, and in a similar range as single-threaded data.table.

What’s next?

Ibex is still incomplete: data frame manipulation needs to be extended and tuned. Broadcasting operators are a must-have. Dates are still represented as strings. Time series and windowing support seem like a good idea. Multithreading will yield significant speed-ups. Graphing straight from the REPL would be good to have. Since C++ interop is quite straightforward, I want to add numerical methods by importing a BLAS, plus support for random number generation and statistical tests.

Feedback and contributions are welcome. The repository is on GitHub.

AoC 3rd Advent Sunday Wrap Up

Be warned: spoilers ahead.

Days 5 to 11 posed a bit more of a challenge than the first four and gave me the opportunity to explore various parts of R.

Day 5

The actual logic of the puzzle was quite easy:

do_move <- function(stacks, count, from, to, move_fun = identity) {
  stacks[[to]] <- c(stacks[[to]], move_fun(tail(stacks[[from]], count)))
  stacks[[from]] <- stacks[[from]][seq_len(length(stacks[[from]]) - count)]
  stacks
}
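As a quick sanity check, here is do_move applied to a small, made-up set of stacks (the function is repeated so the snippet is self-contained; the stack contents are hypothetical, not the puzzle input):

```r
do_move <- function(stacks, count, from, to, move_fun = identity) {
  stacks[[to]] <- c(stacks[[to]], move_fun(tail(stacks[[from]], count)))
  stacks[[from]] <- stacks[[from]][seq_len(length(stacks[[from]]) - count)]
  stacks
}

# Toy stacks; the top of each stack is the end of the vector.
stacks <- list(c("Z", "N"), c("M", "C", "D"), "P")

# Move two crates from stack 2 to stack 1, one at a time,
# so their order reverses (part 1 of the puzzle).
moved <- do_move(stacks, 2L, from = 2L, to = 1L, move_fun = rev)
moved[[1]]  # "Z" "N" "D" "C"
moved[[2]]  # "M"
```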

where move_fun was either identity or rev. Getting the data into shape was more interesting, and the native pipe could be put to good use, as well as a new experimental feature:

stacks <- gsub("    ", " [_]", parts[[1L]]) |>
  ustrsplit(split = "\n") |>
  stacks => head(stacks, length(stacks) - 1L) |>
  strsplit(split = " ") |>
  data.table::transpose() |>
  lapply(rev) |>
  filter_empty()

The pipebind operator => in the middle of this pipeline can be used to bind the current argument to a name in the middle of a pipe. This allows using the current argument of the pipeline without resorting to ad hoc anonymous functions. Since it’s an experimental feature, it must be activated. This can be done by putting this in your .Rprofile: Sys.setenv("R_USE_PIPEBIND" = TRUE). This also works in RStudio.

Day 6

Day 6 was the easiest puzzle so far, but I learned one small trick: the base function match() has an optional argument nomatch which specifies the return value if no match is found. Here an if statement can be avoided by setting nomatch = 0L. The code below either keeps the part from an index to the end, or keeps the buffer with one char added:

index <- match(char, buffer, nomatch = 0L)    # 0 when char is not yet in the buffer
buffer <- c(buffer, char)
buffer <- buffer[(index + 1L):length(buffer)] # drops nothing when index is 0
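A minimal illustration of the nomatch trick on a toy buffer: when the character is absent, match() returns 0 instead of NA, and slicing from (0 + 1) keeps everything, so no if statement is needed.

```r
buffer <- c("a", "b", "c")

# Seen before: match() returns the position, so we can drop the prefix.
match("b", buffer, nomatch = 0L)  # 2

# Not seen before: 0 instead of NA, and buffer[(0 + 1):length(buffer)]
# keeps the whole buffer unchanged.
match("x", buffer, nomatch = 0L)  # 0
```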

Day 7

This puzzle gave a good reason to start the Dictionary class in recollections! Unlike the built-in list data type, the recollections dictionary can be used by reference, which allowed finding a directory and modifying it directly without the need to copy it back into the directory tree. On top of that, the C++ code that underlies the Dictionary class is more efficient than that of list. With the tree of dictionaries in place, the logic to find the sizes of all leaves in the tree is a standard use of recursion.

Day 8

I’m not completely happy with my solution to this puzzle. After tinkering and looking at profvis output I managed to create a solution that runs in less than 10 seconds on this old machine, but if the forest gets much bigger this code will struggle. Putting the puzzle input into a data.table might not be the most natural thing to do, but in the end it allowed me to write a quite clear solution, so maybe it’s not all bad.

Day 9

In this puzzle we were asked to implement some weird version of snake. Again, this seemed best solved using recollections::Dictionary to keep track of which cells have been visited. My initial solution had quite a bit of logic to determine the moves of the tail, but this Reddit comment simplified the logic quite a bit. Despite the solutions being similar, it’s interesting to see that the Python and R versions use quite different language features.

Day 10

Now we are asked to emulate a simple instruction set. The instruction set is so simple that execution and a history of all states can be handled using just data.table (with a big help from shift() and nafill()).

Day 11

In this exercise it seemed logical to put all properties of the individual monkeys in some kind of class, so this was a good moment to play around with S4 classes. This worked out quite well, but I did notice there is a bit of overhead when one interacts with the slots of the classes. By batching slot manipulation a speed-up of 50% was achieved. This performance improvement is unlikely to be relevant for most uses of R though, as a lot of R code won’t have such tight loops.
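A sketch of what batching means here, using a hypothetical Monkey class (not the actual puzzle code): every @<- assignment in a tight loop pays the slot-access overhead, while the batched version does the work on a plain vector and touches the slot once.

```r
library(methods)

# Hypothetical class for illustration only.
setClass("Monkey", slots = c(items = "numeric"))
m <- new("Monkey", items = numeric(0))

# Unbatched: one slot read + write per iteration.
for (i in 1:5) m@items <- c(m@items, i)

# Batched: work on a local vector, assign the slot a single time.
items <- m@items
items <- c(items, 6:10)
m@items <- items
```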

AoC 2022 2nd Advent Sunday wrap up

This Advent period I’m participating in the Advent of Code (AoC) using base R + data.table and putting the code on GitHub. So far, the given data has been easy to manipulate and there was no need to use conditionals or loops to get to the answers. Just using base R functions or loading and manipulating the data with data.table was sufficient, thanks to tstrsplit() and foverlaps().

CSV benchmarking 1/n

In this series of posts we will be looking at a number of ways to store data using R in as little space as possible and also consider the portability of the different solutions. As an example, the New York City Flights data set for 2013 is used (available through CRAN). The first rows and columns are shown below:

# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
#   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

For portability across programming languages, CSVs, optionally compressed, are a great option, as any platform for doing anything with data should be able to read CSV. Alternative formats such as fst, parquet and ORC have a number of advantages such as smaller sizes, better fidelity and built-in integrity checking. These will be examined in a later post. For now, gzipped CSV is used as the reference. Written with the R data.table package, it takes two cores a bit less than one second to write the file containing the 2013 flights. The file size is 6.6 MB.

library(dplyr)
library(data.table)
# We drop the column with timestamps for reasons explained below.
flights <- nycflights13::flights %>% select(-time_hour)
flightsDt <- as.data.table(flights)
system.time(data.table::fwrite(flightsDt, 'flights.csv.gz'))
    user  system elapsed 
   1.886   0.027   0.984 

If you want to read or write a large amount of data to CSV using R and you want to do it rather quickly, there are two good options at the moment: data.table and vroom.

data.table, and vroom in the current development version, support writing gzip-compressed CSVs as well. Roughly, zip-like compression algorithms work by creating a mapping from longer sequences of bits to shorter ones, in such a way that the mapping plus the mapped input takes up less space than the original. Compression algorithms use clever techniques to create and maintain this mapping, but for reasons of speed and memory use the mapping can’t grow without bounds. We can help the algorithm a bit by first transposing the data.
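The effect is easy to demonstrate with base R’s memCompress() on synthetic data (the codes and counts below are made up): the very same values compress noticeably better when identical entries sit next to each other, because the compressor can emit long back-references over the runs.

```r
set.seed(42)
# 10,000 values drawn from 50 distinct 4-digit codes, in random order.
vals <- sample(rep(sprintf("%04d", 1:50), each = 200))

shuffled <- charToRaw(paste(vals, collapse = ","))
grouped  <- charToRaw(paste(sort(vals), collapse = ","))

length(memCompress(shuffled, type = "gzip"))
length(memCompress(grouped,  type = "gzip"))  # substantially smaller
```

Transposing a table with low-cardinality columns achieves the same grouping: all the year values, then all the month values, and so on, end up adjacent in the byte stream.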

Transpose for a free lunch

There is a lot of repetition in the first three columns year, month and day.

> flightsDt[,    
  lapply(.SD, function(.) length(unique(.))),    
  .SDcols = c('year', 'month', 'day') 
]
    year month day
 1:    1    12  31

If all these are put close together, it helps the compression algorithm a bit:

TimeAndSize <- function(FUN, fileName) {
  filePath <- file.path(tempdir(), fileName)
  on.exit(unlink(filePath))
  timing <- unclass(system.time(FUN(filePath)))
  fileSize <- file.info(filePath)$size
  c(timing, FileSizeInMB = round(fileSize / 1024L / 1024L, 1))
}
library(vroom)
dt_csv_fun <- function(fileName) fwrite(flightsDt, fileName)
dt_gz_fun <- dt_csv_fun  # fwrite compresses based on the .gz extension
vroom_csv_fun <- function(.) vroom_write(flights, .)
vroom_gz_fun <- function(.) vroom_write(flights, pipe(paste('gzip >', .)))
vroom_mgz_fun <- function(.) vroom_write(flights, pipe(paste('pigz >', .)))
vroom_zstd_fun <- function(.) vroom_write(flights, pipe(paste('zstd >', .)))
dt_csv_fun_transposed <- function(.) fwrite(transpose(flightsDt), .)
dt_gz_fun_transposed <- function(.) fwrite(transpose(flightsDt), .)
experiments <- list(
  dt_csv_fun = 'flights.csv', dt_gz_fun = 'flights.csv.gz',
  vroom_csv_fun = 'flights.csv', vroom_gz_fun = 'flights.csv.gz',  
  vroom_mgz_fun = 'flights.csv.gz', vroom_zstd_fun = 'flights.csv.zstd',
  dt_csv_fun_transposed = 'flights.tcsv', 
  dt_gz_fun_transposed = 'flights.tcsv.gz'
 )
PerformExperiments <- function(experiments) {
  t(sapply(names(experiments), function(.) {
    TimeAndSize(match.fun(.), experiments[[.]])
  }))
}
PerformExperiments(experiments)
                  user.self sys.self elapsed user.child sys.child FileSizeInMB
dt_csv_fun            0.403    0.012   0.215      0.000     0.000         22.8
dt_gz_fun             1.891    0.000   0.961      0.000     0.000          7.5
vroom_csv_fun         1.327    0.019   0.360      0.000     0.000         22.9
vroom_gz_fun          1.359    0.030   1.517      1.474     0.024          7.5
vroom_mgz_fun         1.341    0.046   0.943      1.560     0.039          7.5
vroom_zstd_fun        1.364    0.037   0.600      0.181     0.030          6.6
dt_csv_fun_transposed 2.318    0.047   1.964      0.000     0.000         22.8
dt_gz_fun_transposed  3.004    0.003   1.893      0.000     0.000          5.5

Comparing the results of data.table and vroom, we see that timings and file sizes are almost equal: both are equally good at writing CSV files. However, if the table is first transposed and then written, the total process takes a bit longer but the file size is reduced by around 25%!

Warts

The column time_hour needs to be dropped because data.table’s transpose() unpacks this column, which is saved as more data, and transposing it back doesn’t reproduce the original values anyway. Moreover, transposing is slow even with data.table, and handling very wide tables can be problematic for some software packages.

Conclusion

Compressed CSVs are a portable data format that requires significantly less space than their uncompressed counterparts. Their size can be further reduced by transposing the data first and then compressing. Whether this extra compression is worth it depends on the use case.