Assignment 6. Low-level refactoring and performance

Introduction

This assignment is designed to give you some skills with low-level programming, which is used in later courses like the operating systems class, as well as in real-world applications like the Internet of Things (IoT). You’ll start with a working program, add a few features, and tune and refactor the program to make it better.

Note: Use a private local Git repository (not a repository host like GitHub) to keep track of your work in this assignment when you’re modifying code, data, or notes.txt. Don’t put big output files into your repository; use it only for sources that you maintain by hand.

Useful pointers

Homework: Tuning and refactoring a C program

Keep a log in the file notes.txt of what you do in the homework so that you can reproduce the results later. This should not merely be a transcript of what you typed: it should be more like a true lab notebook, in which you briefly note down what you did and what happened.

You’re trying to generate large quantities of random numbers for use in a machine-learning experiment. You have a program randall that can generate random byte streams, but it has problems. You want it to be (a) faster and (b) better-organized.

You can find a copy of a repository for the randall source code in the tarball randall-git.tgz. Unpack that tarball, clone the repository it contains, and look at the resulting source code. It should contain:

Add notes.txt to your clone of the repository, and commit changes to it as needed while you work on this assignment.
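
For example, assuming the tarball unpacks into a directory named randall-git (the exact name depends on how the tarball was built), one possible sequence is:

    # Sketch only: unpack the tarball, clone the repository it contains,
    # and start tracking notes.txt.  Directory names are illustrative.
    tar -xzf randall-git.tgz
    git clone randall-git randall
    cd randall
    touch notes.txt
    git add notes.txt
    git commit -m "Start a log of this assignment in notes.txt"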

Read and understand the code in randall.c and in Makefile.

Modify the Makefile so that the command ‘make check’ tests your program. You can supply just a simple test, e.g., that the output is the correct length. You’re doing this step first because you believe in test-driven development (TDD).
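
For instance, a rule along the following lines checks only that the output has the expected length; treat it as a sketch rather than a required form, and remember that Makefile recipe lines must begin with a tab.

    # Sketch of a minimal 'make check' that tests only the output length.
    check: randall
        test "$$(./randall 100 | wc -c)" -eq 100 && echo "check passed"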

Next, split the randall implementation by copying its source code into the following modules, which you will likely need to modify to get everything to work:

You may add other modules if you like. Each module should #include only the files that it needs; for example, since rand64-hw.c doesn’t need to do I/O, it shouldn’t #include <stdio.h>. Also, each module should keep as many symbols private as it can.
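
As an illustration of the intent (not the required contents), a rand64-hw-style module might look roughly like the sketch below. The function names are placeholders, and compiling RDRAND code with GCC requires the -mrdrnd flag.

    /* Sketch of a hardware-randomness module.  It does no I/O, so it does not
       #include <stdio.h>, and its helper stays private to this file via 'static'.
       The public declaration of rand64_hw would go in a header such as rand64-hw.h.  */
    #include <immintrin.h>   /* _rdrand64_step */

    /* Private helper: retry until the hardware yields a value.  */
    static unsigned long long
    rdrand64_retry (void)
    {
      unsigned long long x;
      while (!_rdrand64_step (&x))
        continue;
      return x;
    }

    /* The only symbol this module exports.  */
    unsigned long long
    rand64_hw (void)
    {
      return rdrand64_retry ();
    }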

Next, modify the Makefile to compile and link your better-organized program.
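
Here is a rough sketch of what the build rules might look like, assuming illustrative module names (yours may differ); as before, recipe lines need leading tabs.

    # Sketch only; the object list must match your actual split.
    CC = gcc
    CFLAGS = -O2 -Wall
    objs = randall.o options.o output.o rand64-hw.o rand64-sw.o

    randall: $(objs)
        $(CC) $(CFLAGS) -o $@ $(objs)

    %.o: %.c
        $(CC) $(CFLAGS) -c $<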

Next, add some options to your program to help you try to improve its performance. Redo the program so that it has an option ‘-i input’, where input is one of the following:

Also, redo the program so that it has an option -o output, where output is one of the following:

You can use getopt to implement your option processing.
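
A rough sketch of getopt-based parsing follows; the variable names and diagnostics are illustrative, and the code that actually generates random output is omitted.

    /* Sketch of option processing with getopt; validation is minimal.  */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int
    main (int argc, char **argv)
    {
      char const *input = 0;    /* argument of -i, if given */
      char const *output = 0;   /* argument of -o, if given */
      int opt;

      while ((opt = getopt (argc, argv, "i:o:")) != -1)
        switch (opt)
          {
          case 'i': input = optarg; break;
          case 'o': output = optarg; break;
          default:
            fprintf (stderr, "usage: %s [-i input] [-o output] NBYTES\n", argv[0]);
            return 1;
          }

      /* Exactly one non-option operand: the number of bytes to output.  */
      if (optind != argc - 1)
        {
          fprintf (stderr, "%s: missing or extra operand\n", argv[0]);
          return 1;
        }
      long long nbytes = atoll (argv[optind]);

      /* A real implementation would dispatch on these values.  */
      fprintf (stderr, "input=%s output=%s nbytes=%lld\n",
               input ? input : "(default)", output ? output : "(default)", nbytes);
      return 0;
    }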

Using an LLM to develop test cases

Large language models (LLMs) can simulate understanding of code and can generate code, and they can save developers some work so long as their limitations are understood. Although LLMs are sometimes perceived to be intelligent or creative, they are more accurately characterized as bloviating assistants. With this in mind, use an LLM-based tool to create ‘make check’ tests to check your additions to randall.

LLM choices

You may use any single LLM of your own choice. The recommended models with their traditional prompting interfaces are:

You can also access the models via genAI agents such as:

For a discussion of the difference between generative and agentic AI, see §2.1 of:

According to UCLA’s Available GenAI tools, you can freely access Gemini’s prompting interface via your UCLA logon. The other options will likely cost you; if you use one of them, please keep overall costs under $100 (less than the cost of a textbook, which this course doesn’t have), so that you don’t gain an unfair advantage over students who are cost-constrained.

Prompting LLMs

Prompts are the textual queries that you input to LLMs to instruct them to complete a certain task.

Choosing prompts well is important for achieving good accuracy in downstream tasks like this one. In natural language processing, approaches for designing and optimizing LLM prompts are called “prompt engineering”. The most vanilla way to prompt an LLM is to input your request directly, e.g., “Generate a few test cases for me to test my implementation.” To give the LLM more information about the task, you might want to briefly describe the target implementation (e.g., describe what a random number generator is and what language you are coding it in), as well as provide a chunk of your code to the model. This method is also called “zero-shot prompting”, meaning that you instruct the model to complete the task without giving it any worked examples.

To provide the model with more context, consider few-shot prompting, which provides models with additional demonstrations of the task through a number of examples in your input prompt. For instance, give the model one test case that you have in mind and instruct it to generate more to probe for edge cases. For more, see:

For advice tailored for each of the three suggested models, see:

Use the LLM-generated test cases to test your code. If the model’s generated cases are not enough, add test cases of your own to make sure testing is comprehensive. Record the following key information in a file named test-llm.txt.

Model choice
The name and version of the LLM that you used.
Prompts
Full prompts and code that you gave the LLM.
Raw model outputs
Everything that the model output.
Processed model-generated test cases
Test cases that you extracted from model outputs.
Evaluation and brief explanation of generation quality
Explain how well you think the model performed in the task of test case generation. Potential discussion aspects include:
  • Does the model successfully generate test cases?
  • Are the test cases comprehensive?
  • Does the model correctly understand key requirements when building the test cases (e.g., considering all edge cases)?
  • How easy is it to integrate the model-generated test cases into your code framework?
  • (optional) Did you need to refine LLM-generated test cases?
  • What is the behavior of your system using the test cases? Give some examples.

If you use an AI agent, record its equivalents to conventional prompts and responses as best you can, and also document any problems you have in making this record.

Implementation and debugging

With the LLM-generated test cases in hand, you can now complete the implementation.

When debugging, you may find the valgrind program useful. Also, the AddressSanitizer (asan) and the Undefined Behavior Sanitizer (ubsan) may be useful; these can be enabled with the GCC options -fsanitize=address and -fsanitize=undefined, respectively.
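
For example, runs along these lines may help; the file names and byte counts are arbitrary, and you should add whatever compiler flags your Makefile already uses.

    valgrind ./randall 100 >/dev/null
    gcc -g -fsanitize=address,undefined -o randall-checked *.c
    ./randall-checked 100 >/dev/null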

If the program encounters an error of any kind (including option-processing, output, and memory-allocation failures), it should report the error to stderr and exit with status 1; otherwise, the program should succeed and exit with status 0. The program need not report errors that occur when writing to stderr itself.
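
One common pattern, sketched below on the assumption that errno describes the failure, is a small helper that reports to stderr and exits with status 1; note that stdout write errors may not show up until the stream is flushed or closed.

    /* Sketch: report a failure to stderr and exit with status 1.  */
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static void
    fail (char const *context)
    {
      fprintf (stderr, "randall: %s: %s\n", context, strerror (errno));
      exit (1);
    }

    int
    main (void)
    {
      if (putchar ('x') == EOF)     /* stand-in for the real output code */
        fail ("write error");
      if (fclose (stdout) != 0)     /* write errors can be delayed until close */
        fail ("output error");
      return 0;
    }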

Finally, time your implementation as follows ...

    # This is a sanity check to test whether you’re in the right ballpark.
    time dd if=/dev/urandom ibs=8192 obs=8192 count=16384 >/dev/null

    time ./randall 133562368 >/dev/null
    time ./randall 133562368 | cat >/dev/null
    time ./randall 133562368 >rand.data

... except that you may need different numbers if your implementation is faster or slower. Also, you should try various combinations of the above options to see which gives you random data the fastest. One option that you should try is ‘-i /dev/urandom’.
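
For instance, timing runs like the following compare the default configuration against ‘-i /dev/urandom’, using the same byte count as above.

    time ./randall 133562368 >/dev/null
    time ./randall -i /dev/urandom 133562368 >/dev/null
    time ./randall -i /dev/urandom 133562368 >rand.data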

Record your results (including your slow results) in notes.txt.

Submit

Submit three files:

  1. The file randall-submission.tgz, which you can build by running the command ‘make submission-tarball’. Test your tarball before submitting it, by extracting it into a fresh directory and running ‘make check’ there; one way to do this is sketched after this list.
  2. The file randall-git.tgz, which is a gzipped tarball of your private local Git repository and configuration, created by the command ‘make repository-tarball’.
  3. The file test-llm.txt described above.
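
For example, the submission tarball can be smoke-tested roughly as follows; the temporary directory name is arbitrary, and the tarball’s internal layout may place files in a subdirectory.

    # Sketch: extract the submission tarball into a fresh directory and test it.
    mkdir /tmp/randall-check
    tar -xzf randall-submission.tgz -C /tmp/randall-check
    cd /tmp/randall-check && make check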

The first two submitted files should not be all that large, since they should contain only information about source files maintained by hand, as opposed to generated files.

All source files should be ASCII text files, with no carriage returns, and (except for test-llm.txt) with no more than 100 columns per line. The shell commands:

expand Makefile notes.txt *.c *.h | awk '/\r/ || 100 < length'
awk '/\r/' <test-llm.txt

should output nothing.