## Terminating Knight Tours on Infinite Boards

NumberPhile did a video on potentially infinite Knight Tours that I found quite interesting. I wanted to test out literate programming using Quarto to solve it and I did (result) ðŸ™‚

The RStudio integration with Quarto is really good and for a problem like this it really helps to keep your thoughts and code in the same place. Plots like this one that shows the complete tour are displayed inline and the experience is comparable with a Jupyter notebook with the advantage that it is not in the browser but in a polished IDE.

## ChatGPT, Tikz and option strategies

In my previous post I looked at the Tikz capabilities of ChatGPT. The quality of the graphics produced were above my expectations but I wouldn’t call them useful. While doing research for this interesting Quant.SE question I came up with a more interesting question. Can ChatGPT explain and plot option strategies? Turns out it can not but since I’m doing science on this blog, I share with you the results below anyways.

Maybe it can be coaxed into giving better results with better prompts. I’d be happy to hear if someone can. In any case, these were mine:

Please explain me some popular option trading strategies together with Tikz graphics
For each of the graphics can you write Tikz code?
Can you give me the Tikz code for a straddle, a strangle and iron condor as well

For some reason it got distracted when generating the graphs. The results are summarized below. Note again: this is all ChatGPT output, I only combined the graphics and compiled the Tikz.

Sure, I can explain some popular option trading strategies, but as a text-based AI, I’m unable to directly create Tikz graphics. However, I can certainly describe them in a way that can help you visualize and perhaps draw them yourself.

1. Long Call: This is the most basic options strategy. When you buy a call option, you’re buying the right to purchase a stock at a specified price (the strike price) before a specified date (the expiration date). You would use this strategy if you believe the price of the underlying asset is going to rise. Visualization: Imagine a graph with profit/loss on the Y-axis and the price of the underlying asset on the X-axis. The graph would be a diagonal line starting from the negative Y-axis (representing the premium paid for the call) and moving upward as the price of the underlying asset increases.
2. Long Put: This strategy involves buying a put option, which gives you the right to sell a stock at a specified price before a specified date. You would use this strategy if you believe the price of the underlying asset is going to fall. Visualization: The graph for a long put is a mirror image of the long call. It starts at the negative Y-axis (representing the premium paid) and moves downward as the price of the underlying asset increases.
3. Covered Call: This strategy involves owning the underlying asset and selling a call option on that asset. This is typically done when you believe the price of the asset will stay relatively stable, and you want to earn the premium from selling the call. Visualization: The graph starts at the Y-axis at the point where the premium for selling the call is added to the price of the asset. It then moves horizontally to the strike price of the call option, at which point it starts moving downward as the price of the asset increases.
4. Protective Put: This involves owning the underlying asset and buying a put option on that asset. This is done when you want to protect against a possible drop in the price of the asset. Visualization: The graph starts at the Y-axis at the point where the price of the asset is reduced by the premium for the put. It then moves horizontally to the strike price of the put option, at which point it starts moving upward as the price of the asset decreases.
5. Straddle: This strategy involves buying a call and a put option on the same asset with the same strike price and expiration date. This is typically done when you expect a big move in the price of the asset, but you’re not sure in which direction. Visualization: The graph for a straddle looks like a V, with the point of the V at the strike price of the options. The left side of the V represents the profit/loss for the put option and the right side represents the profit/loss for the call option.
6. Strangle: Similar to a straddle, but the call and put options have different strike prices. This is typically done when you expect a big move in the price of the asset, but you’re not sure in which direction, and you want to reduce the cost of the strategy by choosing strike prices that are out of the money. Visualization: The graph for a strangle looks similar to a straddle, but the V is wider due to the different strike prices.
7. Iron Condor: This is a more advanced strategy that involves selling a call spread and a put spread on the same asset. This is typically done when you expect the price of the asset to stay within a certain range. Visualization: The graph for an iron condor looks like a bird in flight, with the body of the bird representing the range within which you profit and

Just as the descriptions the graphics aren’t very good unfortunately. It seems to have to have difficulty with combining the different pay off structures of the parts of the strategy.

## Drawing with ChatGPT and Tikz

Sometime ago through HackerNews I found this experiment on drawing a unicorn with ChatGPT on the blog of Adam K Dean. The results left something to be desired. I prefer using Tikz instead of SVG so I decided to an experiment myself with reasonable results. First I wanted a simple sine wave. In 2014, during an interview, I was asked to sketch a sine wave on a whiteboard as a preliminary step for a problem the interviewers wanted me to tackle. Regrettably, I ended up poorly drawing a cosine wave instead. For ChatGPT however, it’s not a problem:

This is pretty good, I know I would need to some searching to get to this. Next I wanted a house:

This is going better than expected, let’s try to draw a unicorn:

It does try hard to give me an Markowitz efficent frontier together with the Capital Allocation Line and the tangency portfolio but didn’t quite succeed. This is the final result of the prompts: “Now draw an Markowitz efficent frontier with some details”, “Can you add the CAL and the tangency portfolio?” and “Please don’t use path”. The latest command is necessary as my installation of Tikz doesn’t know about it but it appears to work in another version. The result isn’t great, by default the legend is over the points and I need to comment out the location of the tangency portfolio. That said, if one really wants to get this graph ChatGPT gives a great start and it writes around 40 lines of Tikz code in 2 minutes which I’m definitely not able to do.

## AoC 3rd Advent Sunday Wrap Up

Days 5 to 11 posed a bit more challenge than the first four and gave the opportunity to explore various parts of R.

# Day 5

The actual logic of the puzzle was quite easy:

``````do_move <- function(stacks, count, from, to, move_fun = identity) {
stacks[[to]] <- c(stacks[[to]], move_fun(tail(stacks[[from]], count)))
stacks[[from]] <- stacks[[from]][seq_len(length(stacks[[from]]) - count)]
stacks
}``````

where `move_fun` was either `identity()` or `rev()`. Getting the data into shape was more interesting and the native pipe could be put to good use as well as a new experminental feature:

``````stacks <- gsub("    ", " [_]", parts[[1L]]) |>
ustrsplit(split = "\n") |>
stacks => head(stacks, length(stacks) - 1L) |>
strsplit(split = " ") |>
data.table::transpose() |>
lapply(rev) |>
filter_empty()``````

The pipebind operator `=>` in the middle of this pipeline can be used to the current argument to a name in the middle of a pipe. This allows using the current argument of the pipeline without the need to resort to ad hoc anonymous functions. Since it’s an experimental feature, it must be activated. This can be done by putting this in your `.Rprofile`: `Sys.setenv("R_USE_PIPEBIND"=TRUE)`. This also works in RStudio.

# Day 6

Day 6 was the easiest puzzle until now but I learned one small trick: The base function `match()` has an optional argument `nomatch` which specifies the return value if no match is found. In this case an `if` statement can be avoided by setting `nomatch=0`. The code below gets the part from a `index` to the end or keeps the `buffer` with one `char` added:

``````index <- match(char, buffer, nomatch = 0L)
buffer <- c(buffer, char)
buffer <- buffer[(index + 1L):length(buffer)]``````

# Day 7

This puzzle gave a good reason to start the `Dictionary` class in recollections! Unlike the builtin list datatype the recollections dictionary can be used by reference which allowed finding a directory and directly using it without the need to copy it back into the directory tree. On top of that, the C++ code that underlies the `Dictionary` class is more efficient than that of `list`. With the tree of dictionaries in place, the logic to find the sizes of all leaves in the tree is a standard use of recursion.

# Day 8

I’m not completely happy with this solution to this puzzle. After tinkering and looking at profvis output I managed to create a solution that runs in less than 10 seconds on this old machine but if the forest gets much bigger this code will struggle. Putting the puzzle input into a data.table might not be the most natural thing to do but in the end to write a quite clear solution so maybe it’s not all bad.

# Day 9

In this puzzle we were asked to implement some weird version of snake. Again, this seems to be best solved using `recollections::Dictionary` to keep track of which cells have been visited. My initial solution solution had quite a bit of logic to determine the moves in the tail but this Reddit comment that simplified the logic quite a bit. Despite being a similar solution it’s interesting to see that the Python and R solution use quite different language features.

# Day 10

Now we are asked to emulate a simple instruction set. The instruction set is so simple that execution and a history of all states can be handled using just `data.table` (with a big help of `shift()` and `nafill()`.

# Day 11

In this exercise it seems logical to put all properties of the individual monkeys in some kind of class. So, this was a good moment to play around with S4 classes. This worked out quite well but I did notice there is a bit of overhead when one interacts with the slots of the classes. By batching slot manipulation a speed of 50% was achieved. This performance improvement is unlikely to be relevant for most uses of R though as a lot of R code won’t have such tight loops.

## AoC 2022 2nd Advent Sunday wrap up

This Advent period I’m participating in the Advent of Code (AoC) using R base + data.table and put the code on GitHub. Till now, the given data is easily be manipulated and there was even no need to use conditions or loops to get to the answers. Just using the functions in R base or loading and manipulating the data with data.table was sufficient thanks to `tstrsplit()`, and `foverlaps()`.

## CRAN reports UB in your R-package: How to fix it?

CRAN automatically checks C/C++ source code packages for Undefined Behaviour (UB). If the automated checks find UB it might result in the removal of the package from CRAN. This is a good thing, the presence of UB is a potential source of bugs and has no positive side effects. These checks have been improving and found scrypt package actually had some UB which needed to be fixed. This scared me a bit since although I’m a maintainer since the package contains some relatively complicated code which I didn’t write or change. A bit of history: Colin Percival conceived and implemented the algorithm in C and Andrew Kipp ported his code to R. I only revived the package after improved automated checks found problems that needed to be solved.

Reproducing the issue was my first challenge. My machine runs Ubuntu LTS and doesn’t have all the fancy tools installed to detect the UB. If I’m not able to reproduce the issue I can’t know whether I fixed it either so reproducing seemed like a good first step. I asked on Twitter and Dirk Eddelbuettel pointed me to the repo of a Docker image purpose made for this kind of analysis by Winston Chang. The instructions are worth reading but this is all I needed:

``docker run --rm -ti --security-opt seccomp=unconfined -v \$(pwd):/rscrypt wch1/r-debug``

This downloads the docker image if you don’t have it yet and starts it with the current directory mounted to `/rscrypt` inside the container. Within the container you can access versions of R with extra instrumentation. For detecting the UB two versions are available: `RDsan` (san is shorthand for sanitizer) which is compiled with gcc and `RDcsan` which is compiled with clang. So now I could do (some output elided):

``````R> RDcsan CMD INSTALL scrypt_0.1.3.tar.gz  # The last version with the problem
scrypt-1.1.6/lib/crypto/sha256.c:254:24: runtime error: null pointer passed as argument 2, which is declared to never be null
/usr/include/string.h:44:28: note: nonnull attribute specified here
#0 0x7f21fe3bde7f in scrypt_SHA256_Update /tmp/.../sha256.c:254:3
#1 0x7f21fe42b1e6 in scrypt_HMAC_SHA256_Update /tmp/.../sha256.c:335:2
#2 0x7f21fe42b6a1 in PBKDF2_SHA256 /tmp/../sha256.c:377:2
#3 0x7f21fe42c9e6 in crypto_scrypt /tmp/.../crypto_scrypt-ref.c:258:2
#4 0x7f21fe46f822 in getcpuperf(double*) /tmp/.../util.cpp:145:13
#5 0x7f21fe46420c in (anonymous namespace)::getparams(double, double, int*, unsigned int*, unsigned int*) /tmp/.../scrypt.cpp:49:15
#6 0x7f21fe462f81 in hashPassword(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, double, double) /tmp/.../scrypt.cpp:129:15
...
etc``````

The C code of the scrypt package is entered at #7.

The CRAN error report shows that UB was with R compiled by using both gcc and clang but for some reason when I tried the issue was only flagged by `RDcsan`. The important thing is the error can be reproduced and a possible fix can be validated.

In this particular case the error was easy to find. The code that triggers the UB is the following `memcpy()` and from the error message we know that the `src` argument is the culprit:

``memcpy(&ctx->buf[r], src, len);``

Tracing through the call stack I found that the `src` in this case is the salt used when figuring out how many iterations scrypt should perform to perform reasonably well on the current machine. More importantly, for this test the hash has the value `NULL`. This explains the error found: `memcpy()` can’t copy anything from `NULL`.

With the root cause identified, the solution is simple: set the initial salt to an actual value, since this code is only called when checking the CPU performance and since the old behaviour is UB anyway it doesn’t really make a difference what value is used. I choose to set it to all zeroes. After that, the UB disappeared, a commit was pushed and a submission to CRAN was accepted.

## Lessons learned

1. CRAN has been checking packages for UB since 2014 and the checks are still improving, great!
2. Finding UB like this requires help from good tooling which isn’t as easily available as R and CRAN
3. Luckily, smart people have figured out a way to get this tooling on your computer in a few minutes
4. Once I knew were to look, the problem was easily solved

## Scrypt package back on CRAN

The scrypt package is back on CRAN and I have become the maintainer. The package allow password hashing and verification using Colin Percival’s scrypt scheme. The advantage of the scrypt hashing scheme over other cryptographic hash functions such as SHA is that calculation of the hash takes much more time and memory and a random seed is always used. This makes it much more expensive and time-consuming for attackers to retrieve passwords from hashes obtained through database hacks.

Thanks to RStudio and Andy Kipp in particular for doing all the heavy lifting of creating and writing the package and allowing me to take over maintainership of the CRAN package and the GitHub-repo. Issues and patches welcome!

## CSV benchmarking 1/n

In this series of posts we will be looking at a number of ways to store data using R in as little space as possible and also consider the portability of the different solutions. As an example, the New York City Flights data set for 2013 is used (available through CRAN). The first rows and columns are shown below:

``````# A tibble: 336,776 x 19
year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
6  2013     1     1      554            558        -4      740            728
7  2013     1     1      555            600        -5      913            854
8  2013     1     1      557            600        -3      709            723
9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# â€¦ with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
#   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>``````

For portability across programming languages, CSV’s, optionally compressed, area great option as any platform for doing anything with data should be able to read CSV. Alternative formats such as fst, parquet and ORC have a number of advantages such as smaller sizes, better fidelity and built in integrity checking. These will be examined in a later post. For now, gzipped CSV is used as reference. If created from the R data.table package, it will take 2 cores a bit less than one second to write the file containing the 2013 flights. The file size is 6.6mb.

```library(dplyr)
library(data.table)
# We drop the column with timestamps for reasons explained below.
flights <- nycflights13::flights %>% select(-time_hour)
flightsDt <- as.data.table(flights)
system.time(data.table::fwrite(flightsDt, 'flights.csv.gz'))```
```    user  system elapsed
1.886   0.027   0.984 ```

If you want to read or write a large amount of data to CSV using R and you want to do it rather quickly, there are two good options at the moment: data.table and vroom.

data.table and vroom in the current development version support writing gzip-compressed csv’s as well. Roughly, zip-like compression algorithms works by creating a mapping of shorter sequences to longer sequences of bits in such a way that the mapping + the input mapped from the set of longer sequences to the shorter sequences takes up less space than the original. Compression algorithms use clever techniques to create and maintain this mapping but for reasons of speed and memory use this mapping can’t grow without bounds. We can help the algorithm a bit by first transposing the data.

### Transpose for a free lunch

There is a lot of repetition in the first three columns `year`, `month` and `day`.

```> flightsDt[,
lapply(.SD, function(.) length(unique(.))),
.SDcols = c('year', 'month', 'day')
]
year month day
1:    1    12  31```

If all these are put close together, it helps the compression algorithm a bit:

```TimeAndSize <- function(FUN, fileName) {
filePath <- file.path(tempdir(), fileName)
timing <- unclass(system.time(FUN(filePath)))
fileSize <- file.info(filePath)\$size
c(timing, FileSizeInMB = round(fileSize / 1024L / 1024L, 1))
}
dt_csv_fun <- function(fileName) fwrite(flightsDt, fileName)
dt_gz_fun <- dt_fun vroom_csv_fun <- function(.)
vroom_write(flights, .)
vroom_gz_fun <- function(.) vroom_write(flights, pipe(paste('gzip >', .)))
vroom_mgz_fun <- function(.) vroom_write(flights, pipe(paste('pigz >', .)))
vroom_zstd_fun <- function(.) vroom_write(flights, pipe(paste('zstd >', .)))
dt_csv_fun_transposed <- function(.) fwrite(transpose(flightsDt), .)
dt_gz_fun_transposed <- function(.) fwrite(transpose(flightsDt), .)
experiments <- list(
dt_csv_fun = 'flights.csv', dt_gz_fun = 'flights.csv.gz',
vroom_csv_fun = 'flights.csv', vroom_gz_fun = 'flights.csv.gz',
vroom_mgz_fun = 'flights.csv.gz', vroom_zstd_fun = 'flights.csv.zstd',
dt_csv_fun_transposed = 'flights.tcsv',
dt_gz_fun_transposed = 'flights.tcsv.gz'
)
PerformExperiments <- function(experiments) {
t(sapply(names(experiments), function(.) {
TimeAndSize(match.fun(.), experiments[[.]])
}))
}
PerformExperiments(experiments)```
``` `                 user.self sys.self elapsed user.child sys.child FileSizeInMB`
dt_csv_fun            0.403    0.012   0.215      0.000     0.000         22.8
dt_gz_fun             1.891    0.000   0.961      0.000     0.000          7.5
vroom_csv_fun         1.327    0.019   0.360      0.000     0.000         22.9
vroom_gz_fun          1.359    0.030   1.517      1.474     0.024          7.5
vroom_mgz_fun         1.341    0.046   0.943      1.560     0.039          7.5
vroom_zstd_fun        1.364    0.037   0.600      0.181     0.030          6.6
dt_csv_fun_transposed 2.318    0.047   1.964      0.000     0.000         22.8
dt_gz_fun_transposed  3.004    0.003   1.893      0.000     0.000          5.5```

Comparing the results of data.table and vroom we see that timing and file sizes are almost equal. Both are equally good at writing CSV files. However, if the table is first transposed and then written, the total process takes a bit longer but file size is reduce by around 25%!

## Warts

The column `time_hour` needs to be dropped as `data.table`‘s `transpose()` unpacks this column which is saved as more data and transposing these columns back doesn’t result in the same values in any case. In any case, transposing is slow even with `data.table` and handling very wide tables can be problematic for some software packages.

## Conclusion

Compressed csv’s are a portable data format that require significantly less space than their uncompressed counterparts. The size of compressed csv’s can be further reduced by transposing the csv’s first and then compressing. Whether this extra compression is worth it depends on use case.

## FastR sudoku’s

A few days ago @JozefHajnala pointed me to Oracle’s FastR which is an implementation of R on the Java Virtual Machine (GraalVM). This implementation aims to be completely compatible with the GNU R implementation we all know and love.

In September 2018 I gave a talk (GitHub, presentation) on using Rcpp for solving sudoku’s at Amsterdam SatRday. Point of the talk was to show the performance benefits that can be achieved using the Rcpp-package. This post describes my adventure of solving sudoku’s using FastR and compares performance between the Rcpp solver and the plain R sudoku solver ran using FastR.

## Installation of FastR

The easiest way to get started is to obtain the GraalVM community edition from the GitHub releases page. Download the graalvm-ce tar.gz relevant for your platform and untar it and use the gu-tool to install and start R:

`tar xzvf graal-ce-*-19.0.0.tar.gzcd graal-ce-19.0.0/bingu install R./R`

If all went well, you will now see a somewhat familiar sight:

`R version 3.5.1 (FastR)Copyright (c) 2013-19, Oracle and/or its affiliatesCopyright (c) 1995-2018, The R Core TeamCopyright (c) 2018 The R Foundation for Statistical ComputingCopyright (c) 2012-4 Purdue UniversityCopyright (c) 1997-2002, Makoto Matsumoto and Takuji NishimuraAll rights reserved.FastR is free software and comes with ABSOLUTELY NO WARRANTY.You are welcome to redistribute it under certain conditions.Type 'license()' or 'licence()' for distribution details.R is a collaborative project with many contributors.Type 'contributors()' for more information.Type 'q()' to quit R.`

## A quick test drive

Installing packages is as easy as with GNU R so doing the first test doesn’t take long:

```install.packages('microbenchmark')
sqr <- function(x) x * x microbenchmark::microbenchmark(sqr(runif(1000)))
Unit: microseconds
expr min  lq mean median  uq max neval
sqr(runif(1000)) 145 154  179    165 178 653   100 ```

To be fair, this is not very promising. Running the same snippet in GNU 3.4.4 gives much better performance:

`sqr <- function(x) x * xmicrobenchmark::microbenchmark(sqr(runif(1000)))Unit: microseconds             expr min lq mean median uq   max neval sqr(runif(1000))  24 25  152     25 25 11919   100`

Hmm, appears to me they should have called it SlowR instead. But before we dish out our final judgement let’s tweak the benchmark a bit (sqr.R in this gist):

`library(microbenchmark)                                                                                                                                                                      sqr <- function(x) x * xf1 <- function() for (n in 1:1000) sqr(runif(1:n))print(R.version)print(microbenchmark(f1(), control = list(warmup = 100L)))`

Running this snippet in both R’s motivated me to continue my research and actually made me excited to see what FastR has to offer:

```
> source('~/code/FastR/sqr.R')...
language       Rversion.string R version 3.4.4 (2018-03-15)
Unit: milliseconds
expr     min       lq     mean  median       uq      max neval
f1() 16.3209 17.15926 18.16575 17.4746 17.73782 51.27648   100
```
```> source('~/code/FastR/sqr.R')...
language       R
engine         FastR
version.string FastR version 3.5.1 (2018-07-02)
Unit: milliseconds
expr      min       lq     mean   median      uq      max neval
f1() 6.053445 6.374638 8.226994 6.765636 6.99983 30.11883   100

```

## Solving sudoku’s

The sudoku below has been claimed to be the world’s hardest sudoku:

`sudokuTxt <- " 8 0 0  0 0 0  0 0 0 0 0 3  6 0 0  0 0 0 0 7 0  0 9 0  2 0 0 0 5 0  0 0 7  0 0 0 0 0 0  0 4 5  7 0 0 0 0 0  1 0 0  0 3 0 0 0 1  0 0 0  0 6 8 0 0 8  5 0 0  0 1 0 0 9 0  0 0 0  4 0 0"`

On my machine, the R/Rcpp solver finds a solution within a second. The source can be found in the gist as sudoku.cpp and sudoku.R. Microbenchmarking this solution in GNU R gives

`R> source('sudoku.R')R> microbenchmark(cpp = solve2(sudoku, findChoicesCpp))Unit: milliseconds expr      min       lq     mean   median       uq      max neval  cpp 721.5412 735.6565 756.2961 748.2816 773.9773 835.9321   100 `

while plain R run using the FastR-engine is about two and a half times as fast:

`\$> ./R --vm.Xss5000kR> source('plain_R.R')R> microbenchmark::microbenchmark(  solve2(sudoku, findChoices2), control = list(warmup = 20L))Unit: milliseconds                         expr min  lq mean median  uq  max neval solve2(sudoku, findChoices2) 257 260  293    266 271 1217   100 `

## Caveats

Note in the example above the command line parameter to increase the stack size. If the stack size is not increased like that the function will crash.

Another important thing to note is that FastR will use more cores and will need more memory to run than GNU R.

## Conclusion

Oracle FastR is an interesting project that can improve the runtime of R calculations by quite a bit and I will be watching its development closely. Resources use can be a problem, after running the benchmark, almost 4GB of RAM is used against less than 500mb for the process running the Rcpp code.

## Longstaff Schwarz Method: First table

The Longstaff-Schwarz method described in “Valuing American Options by Simulation: A Simple Least Squares Approach” by Francis A. Longstaff and Eduardo S. Schwartz allows valuation of non-European options to a reasonable degree of accuracy using simulations and a clever way to choose between early exercise of the option or continuation. I’ve implemented in Python and put the code on GitHub. After installing the requirements from requirements.txt it should be easy to run the below code in a Jupyter notebook and get results quite close to those presented in the original paper.