Test Case Generation with LLM At Runtime


Competition math and programming are hot topics nowadays. When we sat down to plan the internship program, we knew we wanted to do something with competitive programming. After all, that's what X-Camp was created to teach students. However, we didn't know which direction within competitive programming to choose. In fact, since many of the interns' only programming experience was through competitive programming, they first had to learn Git for version control, Python for its rich variety of LLM libraries and resources, Google Colab and Lambda Labs for cloud compute, and VS Code/Cursor as the code editor. This took up the first week or two of the internship. Meanwhile, we brainstormed research directions, and choosing one proved particularly hard.

First, we considered finding a better method to train LLMs to solve competitive programming problems, or designing a novel pipeline. After some cost analysis, we realized this would be extremely expensive. Fine-tuning a model to achieve decent results would require at least a 7B-parameter model; anything smaller would spit out gibberish and get stuck in loops. Fine-tuning on Colab wouldn't cut it: we had to train overnight, and even then it would run out of resources and crash. We could persist the data and training checkpoints, but the iteration time was much too slow. Lambda gave us the option of a multi-GPU cluster, but we found libraries like TRL and VERL quite difficult to get working, with many of the issues caused by configuration errors. A single GPU would work, but iteration was still too slow. Because of all these factors, we decided to take a cheaper approach that didn't require so many iterations or much fine-tuning.

Then our mentor suggested we look at test case generation, pointing out that the latest papers on improving it, particularly TCGBench, were only a few months old. This was a rising area that we thought we could make progress on. Digging deeper, we found papers like TestCase-Eval, rStar-Coder, CodeContests+, CURE, and CodeHacks, all fairly recent and directly related to test case generation. They focused on benchmarking test case generation, using test cases to train models, accurately validating and invalidating code based on test cases, and generating high-quality test case datasets. We scrutinized this area of research for gaps.

Unlike competitive math, competitive programming has the advantage of being easily testable. Contestants always run their code on the sample test cases given in the problem statement, as well as on any test cases they come up with themselves. How often does their code work on the first try? Not very often, especially on harder problems. Just as blindly submitting code to the judge doesn't work for humans, it doesn't work for LLMs either. To fix their code, they first need to know where it goes wrong. To even submit their code, they need to make sure it's correct on all the test cases they know of. All of this rests on test case generation at runtime, for voting and debugging, simulating what humans already do.

To this end, we first considered Codeforces hacking. On Codeforces, users who have solved a problem can view the code of other users. If a submission looks sketchy, they can submit a test case on which its output differs from the ground truth solution's. If the hack succeeds in proving the submission wrong, the test case is run against all submissions. Following the release of the CodeHacks paper during our brainstorming week, we considered the direction of generating failure-inducing test cases, but we ultimately rejected it after deciding it was too hard a problem to solve with our limited resources. Still, generating error-inducing test cases, both to hack submissions and to improve LLM code at runtime, remains a viable and underexplored area. It may even extend to unit test generation for real-world software projects.
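
To illustrate what a successful hack checks, here is a minimal sketch (our hypothetical helper, not Codeforces' actual mechanism): run the suspect submission and a trusted solution on the same generated input and flag any disagreement.

```python
import subprocess
import sys

def hack(candidate_cmd, ground_truth_cmd, test_input, timeout=2.0):
    """Return True if the candidate's output differs from the ground truth's,
    i.e. the generated test input successfully 'hacks' the submission."""
    def run(cmd):
        result = subprocess.run(
            cmd, input=test_input, capture_output=True, text=True, timeout=timeout
        )
        return result.stdout.strip()

    return run(candidate_cmd) != run(ground_truth_cmd)
```

In practice the hard part is finding the `test_input` that exposes the bug, which is exactly what CodeHacks studies.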

We finally settled on generating test cases for diversity and accuracy. Generating input test cases is a field with some existing research. CodeContests+ and its successor Klear-CodeTest use a generator-validator framework to generate inputs. The validator checks that a test case input follows all problem constraints: not just simple bounds like 1 <= n <= 10^6, but also, for example, that graphs are acyclic. The generator doesn't produce test cases directly token by token, as that is slow, costly, and error-prone even for medium-sized inputs. Instead, it generates code that generates the test cases: rather than having the LLM spit out raw tokens, it's better to have code produce inputs that easily follow the constraints. Of course, the validator is still needed to catch mistakes and enforce difficult constraints. Benchmarks that evaluate test-case-generating code include TCGBench and TestCase-Eval, on which LLMs have been shown to do somewhat poorly.
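
The generator-validator split can be sketched as follows, assuming a hypothetical array problem with constraints 1 <= n <= 10^6 and bounded values; in the real frameworks an LLM writes this code per problem rather than a human writing it by hand.

```python
import random

def generate_case(max_n=10**6, max_val=10**9, rng=random):
    """Generator: emit a random raw input for a hypothetical array problem."""
    n = rng.randint(1, max_n)
    values = [rng.randint(1, max_val) for _ in range(n)]
    return f"{n}\n{' '.join(map(str, values))}\n"

def validate_case(raw, max_n=10**6, max_val=10**9):
    """Validator: independently re-check every constraint on the raw text."""
    lines = raw.split("\n")
    n = int(lines[0])
    values = list(map(int, lines[1].split()))
    return (1 <= n <= max_n
            and len(values) == n
            and all(1 <= v <= max_val for v in values))
```

The point of keeping the validator separate is that it catches generator bugs: a generated case only enters the dataset if an independent checker accepts it.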

While we must ensure diversity and cover all edge cases, CodeContests+ and Klear-CodeTest offer a good framework with a high true positive rate (TPR) and true negative rate (TNR), meaning correct code nearly always passes and incorrect code nearly always fails. However, these papers are geared toward creating new datasets, not toward generating test cases at runtime to test code the way humans do. Thus, they simply plug the generated inputs into ground truth solutions to get output values. In contrast to existing work, we needed a way to effectively obtain test case outputs from test case inputs at runtime.
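
For concreteness, TPR and TNR over a batch of judged solutions can be computed like this (a small sketch with our own data layout, not the papers' evaluation code):

```python
def tpr_tnr(results):
    """results: (is_correct_solution, passed_tests) pairs, one per solution.
    TPR = fraction of known-correct solutions that pass the generated tests;
    TNR = fraction of known-incorrect solutions that the tests catch (fail)."""
    tp = sum(1 for correct, passed in results if correct and passed)
    tn = sum(1 for correct, passed in results if not correct and not passed)
    pos = sum(1 for correct, _ in results if correct)
    neg = len(results) - pos
    return tp / pos, tn / neg
```

A high-quality test suite pushes both numbers toward 1.0; weak tests usually show up as a low TNR, since buggy code slips through.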

At first, this task seems impossible. We could obviously ask the LLM to generate a solution, and bam, we have outputs. We could even implement majority voting to pick the most common output. However, if you can generate code that produces correct outputs, you might as well submit that code as the answer to the problem! We needed to find better ways.
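
The majority-voting baseline mentioned above is a one-liner with `collections.Counter`: run several candidates on the same input and keep the most common output.

```python
from collections import Counter

def majority_output(outputs):
    """Pick the most frequent (whitespace-normalized) output among candidates."""
    normalized = [o.strip() for o in outputs]
    return Counter(normalized).most_common(1)[0][0]
```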

Let's say you're solving a competitive programming problem. You'd sit down and work through a test case input to build intuition for the problem. You probably wouldn't follow any specific algorithm, but simply understand why the output is what it is. That's what most research does for runtime output generation: it simply tells the LLM to reason through an input. However, this often leads to incorrect outputs, especially for spatial reasoning problems. Most existing research also doesn't cover extending the intuition from a single test case input to all test case inputs, something humans often exploit. To train a model on this task, we could distill a large reasoning model like DeepSeek R1 and apply RLVR based on how accurate the outputs are.

The second way is to generate brute force code that produces outputs for small inputs. In many competitive programming problems, especially harder ones, the difficulty comes not from getting a correct solution but from getting a fast one. It may be easy to write a brute force solution that passes on small inputs, providing a way to generate test case outputs. Training this would also require RLVR. Since this was a novel idea, we first needed humans to test whether it was possible at all. Team members brute-forced a few randomly selected Codeforces problems, which worked for a reasonable percentage of them, so we decided to give the idea a shot.
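
A classic example of this speed-versus-correctness gap is maximum subarray sum (our choice of illustration, not a problem from the project): the O(n^2) brute force is trivial to write and perfectly adequate for labeling small test inputs, while the judge expects Kadane's O(n) algorithm.

```python
def brute_max_subarray(a):
    """O(n^2) brute force: enumerate every contiguous subarray.
    Far too slow for n around 10^5, but fine for small test inputs."""
    return max(sum(a[i:j]) for i in range(len(a))
               for j in range(i + 1, len(a) + 1))

def fast_max_subarray(a):
    """Kadane's O(n) algorithm: the kind of solution the judge expects."""
    best = cur = a[0]
    for x in a[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best
```

On small inputs the brute force's answers serve as trusted labels, which is precisely what makes it useful for generating test case outputs.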

When we tried to prompt an LLM to generate naive code, however, we ran into numerous problems. First, the LLM wouldn't even follow the format we asked it to write its code in. With a couple days of fine-tuning we mostly fixed this, though it still broke the format about 10% of the time.

The next issue was getting the model to actually write brute force code. Small models like Qwen3-7B simply wouldn't follow instructions and write naive code; they kept writing optimized code, even though we were using the Instruct version. Large reasoning models like DeepSeek R1 were particularly frustrating because OpenRouter kept acting up, spitting out insults in Chinese and Russian for whatever reason. When we finally got it working, it still wrote optimized code: on easy problems, R1 achieved 73.5% accuracy once we filtered out the TLE results, but when we analyzed the solutions, they were all optimized.

Finally, the code execution landscape was quite terrible. To run the code, we first copied from LiveCodeBench and OpenAI's HumanEval, which didn't work well; there were many multiprocessing errors that took over a week to fix for our use case, and the result was still slow and clunky. We then looked for existing sandboxed code executors (sandboxed so our computers don't crash) and found a few. LLM Sandbox uses a Docker container under the hood, but it never started up; we waited more than an hour and still nothing. SandboxFusion by ByteDance had the exact same issue. We began to suspect a machine-specific problem, but it failed on every one of our computers. In the end we resorted to our homemade, cobbled-together code executor to avoid Docker issues. It was half working, at least.
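
The core of an executor like ours boils down to something like the following sketch: run the untrusted code in a child process with a wall-clock timeout. A process boundary is only a weak sandbox, not a security boundary, but it keeps crashes and infinite loops from taking down the parent.

```python
import os
import subprocess
import sys
import tempfile

def run_untrusted(code, stdin_text, timeout=2.0):
    """Execute Python source in a child process with a wall-clock timeout.
    Returns (stdout, returncode), or (None, "timeout") on a hang."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], input=stdin_text,
            capture_output=True, text=True, timeout=timeout,
        )
        return result.stdout, result.returncode
    except subprocess.TimeoutExpired:
        return None, "timeout"
    finally:
        os.unlink(path)  # always clean up the temp file
```

A production sandbox would additionally cap memory, drop filesystem and network access, and isolate via containers, which is what tools like SandboxFusion attempt.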

With these two methods of getting outputs for small inputs, reasoning and naive coding, how do we make them work together, or at least know which one to use? We first considered creating an equal number of reasoning traces and brute force solutions and doing majority voting. However, this was quite costly due to the number of responses required. Moreover, the brute force solver is more accurate on some problems while the reasoner is more accurate on others, so we cannot weigh them equally. For example, a maze problem might be very hard for the reasoner due to spatial reasoning, but easy for the naive solver. Conceptually, you'd expect the reasoner to give less consistent answers than the naive solver on such a problem, so the more accurate naive solver would win the majority vote anyway. We found that this was not the case, though unfortunately we have since lost the data that showed it. After that, we took extra care to keep track of all our data.
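
The unequal weighting we wanted is a small generalization of majority voting, with per-source weights as free parameters (the function and weights here are our own illustration, not a tuned scheme from the project):

```python
from collections import defaultdict

def weighted_vote(reasoner_outputs, solver_outputs, w_reason=1.0, w_solve=1.0):
    """Majority vote where each source's ballots carry a tunable weight."""
    scores = defaultdict(float)
    for o in reasoner_outputs:
        scores[o.strip()] += w_reason
    for o in solver_outputs:
        scores[o.strip()] += w_solve
    return max(scores, key=scores.get)
```

The open question, of course, is how to set the weights per problem, which is what led us to the router idea below.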

We then thought of using a router model to decide whether to trust the reasoner or the naive solver more for a given problem. For a maze problem, it may be easy to decide that the naive solver would do better; for other problems, it's not so clear. We trained an encoder-only transformer for this task with supervised learning, labeling the correct choice as whichever method had the higher accuracy. Overall, this didn't work well: the router chose essentially at random, possibly due to the overall complexity of the problems. We also tried simply asking a generative LLM to choose, and unfortunately that didn't work well either.

After this, we decided to focus on the reasoner and push its accuracy as high as possible. We first built a web app to annotate reasoning traces, categorize the LLM's errors into a few buckets, and write detailed error analysis. This was quite difficult, as at the time we had no web development experience at all. Due to our tight schedule, we relied solely on Cursor to vibe code the web app, which turned out to be a terrible idea. We still knew what a well-structured codebase looked like, and what Cursor generated was atrocious: each file had thousands of lines, with long functions and plain HTML/JS/CSS. It even put our OpenRouter API key in the code itself, though we had already isolated it in .env. The generated code was terribly inefficient, creating multiple copies of the 4GB problem dataset and loading it all into memory. Perhaps more time was spent debugging bad code than testing and iterating! In hindsight, we should've used a JS framework, which we only learned much later, and practiced better prompt engineering.

At the same time, we wrote code to massively parallelize LLM calls and evaluate output correctness. While doing this, we ran into a wall of issues with multiprocessing, async, and the OpenRouter API. We tried a pool of 32 concurrent requests with delays and backoff in case a prompt failed, but OpenRouter still kept returning errors and blank responses. We even accidentally committed the API key and lost a bunch of money, probably to an API key crawler. Overall, working with OpenRouter was quite a terrible experience, but we did manage to get reasoning traces for most of the test case inputs after filtering out the non-responses. We probably should've realized it sooner, but we then hit a major problem: the TACO data had a few small test cases, but all of them already appeared in the problem description. When we tested the reasoning model on small inputs, it could simply copy the correct output from the problem statement and get 100% accuracy, which we observed on basically every problem. Thinking this was a dataset problem, we switched to CodeContests from the AlphaCode paper, which had the exact same issue. We finally realized that the test cases were extracted or slightly modified from the official judging test cases, which contain a few small cases to check the given samples and many huge ones.

To resolve this, we needed to either filter the sample test cases out of the problem description or get new small test cases. For filtering, we tried regex, but problem statement formats were too varied for it to work. We could also have asked an LLM to restate the problem exactly without the samples, but we feared information would be lost in the process, though we may try this in the future. Given these setbacks, we tried the second option: finding additional test cases. With the release of Klear-CodeTest, an open source dataset of additional AI-generated inputs, this option seemed feasible. It turned out not to be, as the inputs were still far too large. They released the code for their generator-validator framework, but at this point we judged the research direction infeasible and dropped the project.
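
For illustration, here is the kind of regex filter we attempted: it handles one common "Sample Input/Output" layout, but real statement formats vary far too much for any single pattern to be reliable, which is exactly why this approach failed.

```python
import re

# Matches a sample-section header plus its block, up to a blank line or EOF.
SAMPLE_BLOCK = re.compile(
    r"(?:Sample|Example)\s+(?:Input|Output)\s*\d*\s*\n.*?(?=\n\s*\n|\Z)",
    re.IGNORECASE | re.DOTALL,
)

def strip_samples(statement):
    """Remove blocks that look like sample I/O sections from a statement."""
    return SAMPLE_BLOCK.sub("", statement).strip()
```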

What can we learn from this? Quite a lot.

Working with a team is difficult, and not because team members are incompetent. It's hard to parallelize tasks, especially when one task blocks another. There were frequent bottlenecks, often extremely time-consuming and hard to resolve. For example, we couldn't analyze the reasoning traces until they were generated correctly, and you have already seen how difficult generating correct reasoning traces was. We split people between the generation (incredibly frustrating) and the web app (also incredibly frustrating). With many people working on these at once, there were many merge conflicts and other Git issues. If we were to do this again, we would first read up on how corporate teams handle such work.

Over the 6 weeks, we wrote work logs every single day. Writing work logs is a hard habit to build; before we realized the benefits, it just seemed like busywork. Writing them at the end of the day doesn't work, as you forget most of what you did. And if you don't write down how you fixed specific bugs, they'll haunt you again and you won't remember the fix. Unfortunately, we didn't stick to logging at every step of the way, so our reflection and iteration were shallower and slower than they could have been. We should've been much stricter with the team about writing work logs as a continuous train of thought.

Coming up with novel ideas is a hard task; that's half the reason research is hard. Our strategy was to have everyone read many papers and discuss possible directions together at the whiteboard. We evaluated how feasible each idea was to implement and whether there was even a valid reason for it. This worked reasonably well, and we did come up with several novel ideas like the naive coder. However, it was hard for the whole team to contribute. To do even better, we could adopt a brainstorming process like the one IDEO demonstrates in their shopping cart design video, as we believe there was unrealized potential.

We should've also known when to move on from an idea. Our original idea of training an LLM to solve competitive programming problems was too costly, and we scrapped it; that was the right call. But with the reasoner and naive solver, much of the infrastructure was simply too poor, and the work was also quite costly. This time, we didn't abandon the idea even after the 6-week internship ended, which turned out to be a mistake, as nothing really worked. While it's good to seek a sense of closure and accomplishment after so much dedicated time, the project should simply have been treated as a learning experience. There were other ideas we'd thought of, particularly interpretability of programming languages versus human languages, but we were too focused on this one task. This drove iteration times up significantly, and we felt burnt out after not getting meaningful results.

Overview of What We Learned

Given that most of the team members had no engineering experience outside of competitive programming, this internship was a huge win. We learned to use Git and GitHub with multiple people at once, forcing us to learn Git's version control and conflict resolution features. We learned Docker and how to boot and use servers, skills that are highly important when working on real projects. We learned Python and many libraries that will be useful in future research, projects, and USAAIO.

Beyond these engineering takeaways, we also delved deep into state-of-the-art AI research and tooling. Each person read a large number of papers, drawing inspiration from them and discussing every aspect. We now read papers much more efficiently and can grasp the key points as inspiration for novel ideas, so in future research we can find new directions much faster. By learning libraries like torch, requests, openai, matplotlib, and numpy in depth, we have a head start for future research and can greatly improve the quality and speed of our software development.

Above all, we learned how to brainstorm and work as a team on a real-life project.

Research Papers Studied

During this research project, we have studied a significant number of research papers focused on competitive programming (CP), large language model (LLM) reasoning, and computational efficiency.

Competitive Programming and Code Generation

  • CodeContests+: (Zihan Wang, Siyao Liu, Yang Sun, Hongyan Li, and Kai Shen, 2025). This paper introduces a generator-validator agent system for creating and validating high-quality test cases for competitive programming.
  • CURE: (Yinjie Wang, Ling Yang, Ye Tian, Ke Shen, and Mengdi Wang, 2025). This work proposes a reinforcement learning framework where an LLM coder and unit tester co-evolve through mutual supervision, significantly improving unit test and code generation accuracy.
  • TCGBench / Can LLMs Generate Reliable Test Case Generators?: (Yuhan Cao, Zian Chen, Kun Quan, Ziliang Zhang, et al., 2025). This study establishes a benchmark to evaluate LLMs on their ability to generate valid and targeted test cases, finding that models generally struggle with targeted cases designed to break incorrect code.
  • LogiCase: (Sicheol Sung, Yo-Sub Han, Sang-Ki Ko, et al., 2025). This methodology translates problem descriptions into formal grammar to improve the systematic generation of high-quality test cases.
  • Self-Edit: (Kechi Zhang, Zhuo Li, Jia Li, Ge Li, and Zhi Jin, 2023). This paper details an automated, fault-aware coding system that allows LLMs to debug their own generated code by analyzing execution error messages.
  • CodeT: (Bei Chen, Fengji Zhang, Anh Nguyen, et al., 2022). This research introduces dual-agreement for code generation, using generated tests to verify the correctness of different code solutions.
  • rStar-Coder: (Yifei Liu, Li Lyna Zhang, Yi Zhu, et al., 2025). This paper discusses scaling competitive code reasoning through the creation of large-scale verified datasets.
  • AlphaCode: (Yujia Li, David Choi, Junyoung Chung, et al., 2022). A seminal paper from DeepMind that pioneered competition-level code generation using large-scale models.
  • MapCoder: (Md Ashraful Islam, Mohammed Eunus Ali, and Md Rizwan Parvez, 2024). This work explores multi-agent code generation for collaborative problem-solving in competitive contexts.
  • Xolver: (Md Tanzib Hosain, Salman Rahman, Md Kishor Morol, and Md Rizwan Parvez, 2025). This research utilizes a multi-agent reasoning pipeline that mimics an Olympiad-style team to solve complex problems.

Reasoning Efficiency and Optimization

  • GRPO (Group Relative Policy Optimization): (DeepSeek-AI, 2024). This paper introduces an RL algorithm that calculates advantages by comparing a group of outputs against their mean reward, eliminating the need for a separate Critic Model to save memory.
  • DAST / Difficulty-Adaptive Slow Thinking: (Brian Zhao et al. reference, 2025). This framework allows models to adapt their reasoning depth based on problem difficulty, reducing token usage by approximately 30% on average.
  • AutoL2S: (Andrew Ho reference, 2025). Similar to DAST, this framework enables models to decide how much time to spend on reasoning based on task difficulty, improving speed by up to 57%.
  • THINK PRUNE: (Andrew Ho reference, 2025). This method reduces reasoning redundancy and overthinking by penalizing overly long responses that do not provide additional reward.
  • DEER / Dynamic Early Exiting in Reasoning: (Aaron Luo reference, 2025). This approach enables models to generate outputs faster by identifying stopping points in reasoning and evaluating confidence in a trial answer.
  • C3oT / Conditioned Compressed Chain-of-Thought: (Aaron Luo reference, 2024). This paper details using a compressor model to train LLMs to reason well with shorter, more concise chains of thought.
  • TokenSkip: (Weibo Zhou reference, 2025). This method reduces token usage by scoring the importance of tokens in reasoning and skipping those with less impact.
  • CODI: (Weibo Zhou reference, 2025). This paper explores compressing Chain-of-Thought into latent hidden representations, matching explicit reasoning performance with significantly fewer tokens.
  • Mixed Distillation: (Weibo Zhou reference, 2023). This research proposes combining Chain-of-Thought and Program of Thought (PoT) to distill superior reasoning abilities from large models into smaller ones.
  • s1 / Simple Test-Time Scaling: (Weibo Zhou reference, 2025). This study uses budget forcing to control the model's thinking process, either terminating it early to save costs or adding "waits" to improve accuracy.
  • S2R / Self-Verify and Self-Correct: (Weibo Zhou reference, 2025). This paper teaches LLMs to review their own reasoning flaws and correct them through RL and curated examples.
  • LIMO / Less is More for Reasoning: (Weibo Zhou reference, 2025). This work demonstrates that models can learn effective reasoning patterns from a very small, high-quality dataset (only 817 examples).

Benchmarks and Specialized Architectures

  • S1-Bench: (Weibo Zhou reference, 2025). A benchmark designed to evaluate "System 1" (fast, intuitive) thinking in large reasoning models.
  • SDBench: (Daniel Yang reference, 2025). A benchmark for multi-speaker ASR and diarization that allows for fast iteration and integration of new systems.
  • KAT-V1: (Daniel Yang reference, 2025). This paper introduces Step-SRPO and AutoThink to automatically switch between deep reasoning and direct response modes.
  • U-Net: (Charlie Huang reference). Describes a network architecture primarily used for medical image segmentation that utilizes skip connections and latent space compression.
  • TECTON: (Aaron Luo reference, 2024). This paper shows how meta-reasoning can improve the selection and use of external tools by LLMs.
Copyright © 2025 X School