Ibex: A Typed DataFrame Language with C++ Code Generation

Anthropic used a swarm of Claude Opus 4.6 agents to build a C compiler for about $20,000 in API costs with minimal intervention. That made me realise that, while I have been a happy user of R’s data.table for quite some time, the time to build something better has come. Third-party packages such as data.table, pandas, Polars and the Tidyverse are bolted onto an existing scripting language. These packages are marvels of engineering, but the hacks required to make them fit the host language come at a cost. They can lead to surprising behaviour (often: bugs), and when performance matters, a deep understanding of what happens under the hood is needed. The trade-off seems structural. A language built around data frames might be a much more powerful tool.

So I, together with my team of LLMs, started building a new language. The result is Ibex. It is far from done, but it already does some useful work:

CSV and Parquet reading
Select / Update / Filter / Group / Aggregate
Regular and table variables
Function definitions
External C++ interop
Basic type inference
Basic joins

It can be run through a REPL, and it can also transpile to C++ so that the generated code can be used in larger projects. An example:

$ rlwrap ./build-release/tools/ibex --plugin-path libraries/
[info] Ibex REPL started (verbose=false)
ibex> extern fn read_parquet(path: String) -> DataFrame from "parquet.hpp";
ibex> let flights = read_parquet("data/flights-1m.parquet");
ibex> flights
rows: 1000000
columns: FL_DATE DEP_DELAY ARR_DELAY AIR_TIME DISTANCE DEP_TIME ARR_TIME
2006-01-01 5 19 350 2475 9.083333015441895 12.483333587646484
2006-01-02 167 216 343 2475 11.783333778381348 15.766666412353516
2006-01-03 -7 -2 344 2475 8.883333206176758 12.133333206176758
2006-01-04 -5 -13 331 2475 8.916666984558105 11.949999809265137
2006-01-05 -3 -17 321 2475 8.949999809265137 11.883333206176758
2006-01-06 -4 -32 320 2475 8.933333396911621 11.633333206176758
2006-01-08 -3 -2 346 2475 8.949999809265137 12.133333206176758
2006-01-09 3 0 334 2475 9.050000190734863 12.166666984558105
2006-01-10 -7 -21 334 2475 8.883333206176758 11.816666603088379
2006-01-11 8 -10 321 2475 9.133333206176758 12
… (999990 more rows)
ibex> flights[filter FL_DATE == "2006-01-01"]
rows: 17618
columns: FL_DATE DEP_DELAY ARR_DELAY AIR_TIME DISTANCE DEP_TIME ARR_TIME
2006-01-01 5 19 350 2475 9.083333015441895 12.483333587646484
2006-01-01 3 3 281 2475 9.550000190734863 17.799999237060547
2006-01-01 -1 19 348 2475 11.983333587646484 15.366666793823242
2006-01-01 0 -16 279 2475 12.5 20.549999237060547
2006-01-01 -4 26 516 3784 10.016666412353516 15.116666793823242
2006-01-01 36 8 380 3711 18.516666412353516 5.066666603088379
2006-01-01 -4 24 504 3711 11.850000381469727 16.75
2006-01-01 36 26 393 3784 18.600000381469727 5.699999809265137
2006-01-01 22 15 261 2475 22.866666793823242 6.9666666984558105
2006-01-01 -1 -15 266 2486 23.649999618530273 6.333333492279053
… (17608 more rows)
ibex> flights[filter FL_DATE == "2006-01-01", m = mean(ARR_DELAY)]
error: 1:41: expected clause
ibex> flights[filter FL_DATE == "2006-01-01", select m = mean(ARR_DELAY)]
rows: 1
columns: m
10.206833919854693
ibex> flights[filter FL_DATE == "2006-01-01", select m = mean(ARR_DELAY), by AIR_TIME]
rows: 421
columns: AIR_TIME m
350 23.6
281 11.5
348 41
279 20.363636363636363
516 26
380 8
504 24
393 27.5
261 -0.7777777777777778
266 13.714285714285714
… (411 more rows)

All of these queries run quite fast: practically instantly on my machine, and in a similar range to single-threaded data.table.
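To make the last query concrete, here is a rough C++ sketch of what a filter followed by a grouped mean computes. This is not the code Ibex generates. The `Flights` struct and the `mean_delay_by_air_time` function are hypothetical names made up for illustration, and a single-pass accumulation into a `std::map` is assumed rather than whatever Ibex does internally.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical column-oriented table with only the three columns the query touches.
struct Flights {
    std::vector<std::string> fl_date;
    std::vector<int> air_time;
    std::vector<double> arr_delay;
};

// Rough equivalent of:
//   flights[filter FL_DATE == date, select m = mean(ARR_DELAY), by AIR_TIME]
// One pass over the rows: skip rows that fail the filter, accumulate
// (sum, count) per group key, then divide to get the mean per group.
std::map<int, double> mean_delay_by_air_time(const Flights& t,
                                             const std::string& date) {
    std::map<int, std::pair<double, std::size_t>> acc;  // key -> (sum, count)
    for (std::size_t i = 0; i < t.fl_date.size(); ++i) {
        if (t.fl_date[i] != date) continue;  // filter clause
        auto& [sum, count] = acc[t.air_time[i]];
        sum += t.arr_delay[i];
        ++count;
    }

    std::map<int, double> means;
    for (const auto& [key, sum_count] : acc) {
        means[key] = sum_count.first / static_cast<double>(sum_count.second);
    }
    return means;
}
```

One difference from the REPL output above: `std::map` returns groups sorted by key, while Ibex appears to list them in order of first appearance. Preserving that order would need a hash map plus a separate vector of keys.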

What’s next?

Ibex is still incomplete. Data frame manipulation needs to be extended and tuned. Broadcasting operators are a must-have. Dates are still represented as strings. Time series and windowing support seem like a good idea. Multithreading support should yield significant speed-ups. Graphing straight from the REPL would be good to have. Since C++ interop is quite straightforward, I want to add numerical methods by importing a BLAS library, along with support for random number generation and statistical tests.
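For readers unfamiliar with the term, broadcasting means applying an operator element-wise across whole columns, with a scalar operand reused for every row. Ibex does not have this yet and its eventual syntax is undecided, so the following is only a C++ sketch of the semantics. The `Column` alias and the `subtract` and `divide` helpers are hypothetical names, not part of Ibex.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

using Column = std::vector<double>;  // hypothetical stand-in for a numeric column

// Column minus column: element-wise, so both columns must have the same length.
Column subtract(const Column& a, const Column& b) {
    if (a.size() != b.size()) {
        throw std::invalid_argument("column lengths differ");
    }
    Column out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] - b[i];
    return out;
}

// Column divided by scalar: the scalar is broadcast to every row.
Column divide(const Column& a, double scalar) {
    Column out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] / scalar;
    return out;
}
```

With helpers like these, an expression such as "arrival delay minus departure delay, in hours" becomes `divide(subtract(arr_delay, dep_delay), 60.0)`. A language-level operator would let the same thing be written directly on column names.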

Feedback and contributions are welcome. The repository is on GitHub.

