Posts

Notes on Dwarkesh-Karpathy

A summary of the Karpathy interview on the Dwarkesh podcast, with my thoughts in italics. I paraphrased what Dwarkesh and Karpathy said; these are not direct quotes. Big insights: LLMs are still cognitively lacking in many ways (no continual learning, not very multimodal, no reflection process, a collapsed distribution of responses). These are hard problems to solve and will take time. Karpathy argues these won’t just be solved by scaling up models and doing RL on different types of tasks. These behaviors are not emergent; they require algorithmic breakthroughs. Humans don’t learn what we think of as intelligent tasks by RL; they seem to learn by something different, more reflective and deliberate. RL is terrible: you’re updating everything in the trajectory of actions, even if intermediate steps were wrong. It’s also a slow, inefficient way of learning. AI progress will come from better everything: data, hardware, kernels, and algorithmic breakthroughs. LLMs are bad a...

Paper Summary: Method to Measure Benchmark Contamination

This is a summary and explanation of a paper that measures dataset contamination using kernel divergence scores (https://arxiv.org/pdf/2502.00678). Benchmarks are a flawed estimate of a model’s real-world capabilities for a variety of reasons. One reason is benchmark dataset contamination: if part of a benchmark’s dataset overlaps with the training data, the model will have been specifically trained to do well on those test examples. The benchmark therefore fails to accurately measure how well the model generalizes to examples it hasn’t seen. This is an interesting paper that uses Kernel Divergence to estimate dataset contamination. Aim: Given a dataset D and model M, the paper aims to create a dataset contamination score S(D,M). The contamination ratio is the proportion of the benchmark dataset that has been seen in training data. This score should be monotonic (datasets with higher contamination should have higher scores) and consistent (datasets with similar contamination should...

Georgia Tech Part 1

My first semester at Georgia Tech is almost over. A lot has happened, but I still can't really believe it's gone this fast. I like it here. Georgia Tech is really beautiful, especially in the fall. I like walking across Tech Green and seeing the trees shed their orange leaves and watching squirrels run across the grass.  People here are, in my experience, very kind. When people say "have a good one" or "thank you" or "bless you", it really does feel like they mean it. It doesn't feel (too) awkward to start conversations with strangers. America has made me more extroverted. College feels like the high school experience I never had. Welcome Week in particular was incredible---an entire week of no classes, where I hung out with new people every day, sat at new tables at lunch. I know that strangers are friendly and everyone would still like talking to new people; however, talking with new people is not the default anymore, and so I do it less ofte...

Have LLMs passed the Turing Test?

What is the Turing Test? The Turing Test aims to find out whether a human interrogator speaking simultaneously to an AI and a human can tell the difference between them. There are many disagreements over what constitutes a true Turing Test. How should we choose the human interrogator? Does the interrogator represent the median human, the median human well-educated on LLMs, an expert on AI, etc.? How long should the human interrogator be allowed to talk with the AI? Have LLMs passed the Turing Test? Jones and Bergen (2025) set up the experiment as follows: Two types of prompts were used. The baseline prompt was minimal: “You are about to participate in a Turing test. Your goal is to convince the interrogator that you are a human.” The second prompt included a persona for the model (a young, slang-using introvert). Participants were split into two groups: UCSD psychology undergrads and Prolific workers. Very interestingly, there were "no consistent effects of any variable on participant accurac...

The Upholding of Proposition 12

Background: Farming Conditions and Prop 12 Currently in the US, most breeding pigs live on factory farms, where they are confined in gestation crates, metal cages so small that the pigs can’t even turn around, while egg-laying hens live in tiny, cramped battery cages that cause a range of psychological and physiological harms. The crowded conditions also pose potential health risks: they increase the stress levels of pigs and weaken their immune systems, which can make them more susceptible to zoonotic diseases that may spread to humans. Starting in the early 2000s, a few animal welfare groups, including the Humane Society of the United States, aimed to ban cage-based farming systems for hens, breeding pigs, and veal calves. In 2008, Proposition 2 passed, putting in place a “production” ban on cages: producers had to ensure pigs, hens, and calves could lie down, turn around, and extend their limbs or wings without hitting the side of an encl...