Vivek Haldar
LLM Agents beat Human Debaters
arxiv.org/abs/2408.04472
github.com/ZhangYiqun018/agent-for-debate
00:00 Introduction to LLM Agent Systems for Debating
00:44 Overview of Competitive Debating Structure
02:01 Four Agents in the Debating System
02:52 The Searcher Agent
03:08 The Analyzer Agent
03:54 The Writer Agent
04:12 The Reviewer/Critic Agent
05:43 Evaluating the Debating System
06:54 Comparison with Baseline and Human Evaluators
07:31 Performance Results: Debatrix Evaluation
08:27 Performance Results: Human Evaluation
09:06 GitHub Repository and Prompts
Views: 176
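For readers who want to see roughly how such a pipeline hangs together, here is a minimal sketch of a four-agent debate pipeline (searcher, analyzer, writer, reviewer) built on the OpenAI chat API. The role prompts, model choice, and helper names are illustrative assumptions, not the authors' actual prompts (those are in the linked repo).

```python
# Minimal sketch of a four-agent debate pipeline: searcher -> analyzer -> writer -> reviewer.
# Role prompts and helper names are illustrative, not taken from the paper's repo.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def run_agent(system_prompt: str, user_content: str) -> str:
    """One agent = one system prompt + one call to a chat model."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
        ],
    )
    return resp.choices[0].message.content

def debate_speech(motion: str, side: str) -> str:
    evidence = run_agent(
        "You are a Searcher. List facts, examples, and statistics relevant to the motion.",
        f"Motion: {motion}\nSide: {side}")
    arguments = run_agent(
        "You are an Analyzer. Turn the evidence into structured arguments and rebuttals.",
        f"Motion: {motion}\nSide: {side}\nEvidence:\n{evidence}")
    draft = run_agent(
        "You are a Writer. Compose a persuasive competitive-debate speech from these arguments.",
        f"Motion: {motion}\nSide: {side}\nArguments:\n{arguments}")
    final = run_agent(
        "You are a Reviewer. Critique the draft for logic and style, then output an improved version.",
        f"Motion: {motion}\nDraft speech:\n{draft}")
    return final

print(debate_speech("This house would ban targeted political advertising", "proposition"))
```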

Videos

Laypeople cannot prompt LLMs
Views: 1.4K · 14 days ago
dl.acm.org/doi/10.1145/3544548.3581388 0:00 - Introduction to Large Language Models and Prompting 2:00 - Overview of the Prompting Tool Used in the Study 4:45 - Study Results: Challenges in Prompting for Non-Experts 6:49 - Fundamental Barriers and Evaluation Issues 8:10 - Conclusion: Difficulties of Prompting for General Users
GraphReader: RAG with multi-step reasoning over graphs
Views: 651 · 1 month ago
arxiv.org/abs/2406.14550v1 Previous video on GraphRAG: ua-cam.com/video/ODomovYfI6I/v-deo.html 00:00 Introduction and Background 00:31 Overview of Graph Reader System 00:55 Advantages over Traditional Models 01:36 Graph Construction Process 02:08 Query and Reasoning Workflow 03:32 Evaluation and Results 05:06 Importance of Knowledge Graph Quality 05:28 Prompts for Graph Reader System 06:20 Comp...
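A rough sketch of the GraphReader-style loop described above: an agent walks a pre-built knowledge graph, takes notes at each node, and asks the LLM at every step whether it can answer or should expand to a neighbor. The node attributes, prompts, and model choice are assumptions for illustration, not the paper's implementation.

```python
# Sketch of GraphReader-style multi-step reasoning over a pre-built knowledge graph.
# Graph format and prompts are illustrative assumptions.
import networkx as nx
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def graph_read(graph: nx.Graph, start_node: str, question: str, max_steps: int = 8) -> str:
    notes, visited, frontier = [], set(), [start_node]
    for _ in range(max_steps):
        if not frontier:
            break
        node = frontier.pop(0)
        if node in visited:
            continue
        visited.add(node)
        notes.append(f"{node}: {graph.nodes[node].get('summary', '')}")
        decision = ask(
            f"Question: {question}\nNotes so far:\n" + "\n".join(notes) +
            "\nReply 'ANSWER: <answer>' if the notes suffice, otherwise reply 'EXPAND'.")
        if decision.startswith("ANSWER:"):
            return decision.removeprefix("ANSWER:").strip()
        frontier.extend(n for n in graph.neighbors(node) if n not in visited)
    return ask(f"Question: {question}\nAnswer using only these notes:\n" + "\n".join(notes))

# Tiny example graph (node summaries stand in for the paper's atomic facts).
g = nx.Graph()
g.add_node("Ada Lovelace", summary="Wrote the first published algorithm for the Analytical Engine.")
g.add_node("Charles Babbage", summary="Designed the Analytical Engine, a mechanical general-purpose computer.")
g.add_edge("Ada Lovelace", "Charles Babbage")
print(graph_read(g, "Ada Lovelace", "Who designed the Analytical Engine?"))
```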
Studying GSM8K Leaderboard
Views: 198 · 1 month ago
paperswithcode.com/sota/arithmetic-reasoning-on-gsm8k arxiv.org/pdf/2404.14963v3.pdf arxiv.org/pdf/2308.07921v1.pdf 0:00 Introduction to GSM 8K Benchmark 0:47 Flattening of Scores in Recent Years 1:20 Top Approach: "Deeply Understanding the Problem" 1:45 Three-Step Problem-Solving Method 3:08 Comparison to Zero-Shot Chain of Thought 3:39 Second Top Approach: Code Generation 4:15 Code Interprete...
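A sketch of the code-generation approach mentioned above: instead of doing arithmetic in text, the model writes a small Python program that is then executed. The prompt, model choice, and fence handling are illustrative, not any specific leaderboard entry.

```python
# Sketch of "generate code instead of doing arithmetic in text" for GSM8K-style problems.
# Prompt wording and sandboxing are simplified for illustration.
from openai import OpenAI

client = OpenAI()

PROMPT = """Solve the grade-school math problem by writing a Python function
`solve()` that returns the numeric answer. Output only the code.

Problem: {problem}"""

def solve_with_code(problem: str) -> float:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": PROMPT.format(problem=problem)}])
    # Naive markdown-fence stripping; a real harness would parse the code block properly.
    code = resp.choices[0].message.content.strip().strip("`").removeprefix("python")
    namespace: dict = {}
    exec(code, namespace)  # NOTE: run untrusted model-generated code in a real sandbox
    return namespace["solve"]()

print(solve_with_code(
    "A baker makes 24 muffins per tray and bakes 7 trays. "
    "He sells 150 muffins. How many are left?"))
```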
Think-and-execute prompting for LLMs
Views: 316 · 1 month ago
arxiv.org/abs/2404.02575 00:00 Introduction 00:18 New Prompting Method for LLMs 00:37 Overview of Chain of Thought and Program of Thought 01:19 Think and Execute Approach Explained 02:06 Detailed Steps of Think and Execute 03:10 Instructor and Reasoner Models 04:02 Creating Pseudo Code in Think Phase 05:00 Experiment Results with Different Models 06:07 Benefits of Code Pre-trained Models 06:33 ...
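A sketch of the two-phase Think-and-Execute idea: an instructor model writes task-level pseudocode once, and a reasoner then simulates that pseudocode on each instance. The prompts here are paraphrased assumptions; the paper's exact prompts are in its appendix.

```python
# Sketch of Think-and-Execute two-phase prompting.
# Prompts are paraphrased for illustration.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def think(task_description: str, examples: str) -> str:
    """THINK phase: produce task-level pseudocode that solves the task in general."""
    return ask(
        f"Task: {task_description}\nExamples:\n{examples}\n"
        "Write commented pseudocode (as a Python-like function) that solves this task "
        "for any input. Do not solve the examples directly.")

def execute(pseudocode: str, instance: str) -> str:
    """EXECUTE phase: the reasoner simulates the pseudocode on one instance."""
    return ask(
        f"Pseudocode:\n{pseudocode}\n\nInput: {instance}\n"
        "Simulate the pseudocode step by step on this input, printing intermediate "
        "variables, then give the final answer.")

plan = think("Determine whether a word is a palindrome.", "level -> yes\nhello -> no")
print(execute(plan, "racecar"))
```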
AutoGen: Programming LLM Agents
Views: 337 · 2 months ago
github.com/microsoft/autogen openreview.net/pdf?id=uAjxFFing2 0:00 Introduction to Agents and LLMs 0:29 Understanding the Need for Agents 1:28 Overview of the Autogen Framework 2:05 Why Agents Work with LLMs 3:17 Autogen Structure and Abstractions 4:27 Example: Math Problem Solving Agents 6:25 Example: Retrieval Augmented Q&A 8:59 Conclusion and Wrap-up vivekhaldar.com x.com/vivekhaldar
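A minimal two-agent AutoGen example in the style of the classic pyautogen API: an assistant that writes code and a user proxy that executes it and feeds results back. The model name in the config is a placeholder, and argument defaults may differ across versions.

```python
# Minimal two-agent AutoGen example (classic pyautogen API).
# The assistant writes code; the user proxy executes it and replies with the results.
import os
import autogen

config_list = [{"model": "gpt-4o-mini", "api_key": os.environ["OPENAI_API_KEY"]}]

assistant = autogen.AssistantAgent(
    name="assistant",
    llm_config={"config_list": config_list},
)
user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",            # fully automatic back-and-forth
    max_consecutive_auto_reply=5,
    code_execution_config={"work_dir": "coding", "use_docker": False},
)

# The user proxy sends the task, runs any code blocks the assistant replies with,
# and loops until the assistant signals it is done.
user_proxy.initiate_chat(
    assistant,
    message="What is the 20th Fibonacci number? Write and run Python to check.",
)
```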
Fine-tuning LLMs encourages hallucinations
Views: 317 · 2 months ago
arxiv.org/abs//2405.05904 0:00 Introduction and recap of previous paper 0:29 Fine-tuning LLMs can lead to hallucination 1:18 Constructing an experiment to test the conjecture 1:51 Categorizing knowledge into four categories 2:59 Fine-tuning with different percentages of unknown examples 3:31 Impact of unknown items on fine-tuning accuracy 4:02 Fine-tuning improves utilization of pre-existing kn...
Fine-tuning or RAG?
Views: 831 · 2 months ago
arxiv.org/abs/2312.05934 0:00 Comparing Fine-tuning and Retrieval Augmented Generation 0:34 Using LLMs for Specialized Domains 1:13 Fine-tuning vs In-context Learning Techniques 2:23 Causes of LLM Factual Errors and Hallucinations 3:50 Constructing the Experiment Dataset 4:45 Models Tested and Accuracy Comparison 5:51 RAG Outperforms Fine-tuning Across Models 6:20 Why RAG Performs Better Than F...
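For contrast with fine-tuning, a minimal sketch of the RAG side of the comparison: retrieve the most similar passages by embedding similarity and put them in the prompt, with no weight updates. The embedding model, documents, and prompt wording are illustrative assumptions.

```python
# Minimal RAG sketch: embed documents once, retrieve top-k by cosine similarity,
# and stuff them into the prompt. No model weights are changed.
import numpy as np
from sentence_transformers import SentenceTransformer
from openai import OpenAI

embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = OpenAI()

docs = [
    "The Wright brothers flew the first powered airplane in 1903.",
    "Photosynthesis converts light energy into chemical energy in plants.",
    "The Treaty of Versailles was signed in 1919, ending World War I.",
]
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def rag_answer(question: str, k: int = 2) -> str:
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    top = np.argsort(doc_vecs @ q_vec)[::-1][:k]   # cosine similarity (vectors are normalized)
    context = "\n".join(docs[i] for i in top)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Answer using only this context:\n{context}\n\nQuestion: {question}"}])
    return resp.choices[0].message.content

print(rag_answer("When did the first powered airplane fly?"))
```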
Fixing RAG with GraphRAG
Views: 7K · 3 months ago
arxiv.org/abs/2404.16130 0:00 Introduction to RAG and its Limitations 1:08 Sense-Making and Graph-Based Approaches to RAG 2:27 Overview of the Graph RAG Pipeline 4:19 Extracting Concepts and Relationships from Documents 5:07 Summarizing Graph Elements and Clustering into Communities 6:32 Answering Queries with Graph RAG 8:58 Evaluating Graph RAG: Datasets and Question Generation 10:42 Comparing...
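A compressed sketch of the pipeline described above: extract entity-relation triples per chunk, build a graph, cluster it into communities, summarize each community, and answer a global question over the summaries. Prompts, output formats, and the clustering algorithm (the paper uses Leiden; greedy modularity is a stand-in here) are assumptions.

```python
# Compressed GraphRAG-style sketch: extract triples per chunk, build a graph,
# cluster into communities, summarize each community, answer over the summaries.
import json
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def build_graph(chunks: list[str]) -> nx.Graph:
    g = nx.Graph()
    for chunk in chunks:
        # Assumes the model returns bare JSON; real code needs error handling.
        triples = json.loads(ask(
            "Extract entity relations from the text as a JSON list of "
            "[source, relation, target] triples. Return only JSON.\nText:\n" + chunk))
        for src, rel, dst in triples:
            g.add_edge(src, dst, relation=rel)
    return g

def answer_global_question(chunks: list[str], question: str) -> str:
    g = build_graph(chunks)
    communities = greedy_modularity_communities(g)   # stand-in for Leiden clustering
    summaries = [ask("Summarize this group of related entities and relations:\n"
                     + "; ".join(f"{u} -[{d['relation']}]-> {v}"
                                 for u, v, d in g.subgraph(c).edges(data=True)))
                 for c in communities]
    return ask(f"Question: {question}\nCommunity summaries:\n" + "\n---\n".join(summaries))
```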
LLMs improve writing-based knowledge work
Views: 282 · 3 months ago
0:00 Introduction: Impact of LLMs on Knowledge Workers 0:33 Experiment Setup: Professionals & Writing Tasks 1:02 Results Overview: Positive Effects of LLMs 2:33 Detailed Results: Time & Grade Improvements 3:29 AI Impact: Lower vs Higher Performers 4:48 Time Allocation: Shifting to Editing with LLMs 5:35 Job Satisfaction: Increased with LLM Use 5:42 Summary of Benefits: Quality & Speed Improveme...
Co-intelligence: book review
Views: 448 · 3 months ago
www.oneusefulthing.org/ www.hbs.edu/ris/Publication Files/24-013_d9b45b68-9e74-42d6-a1c6-c72fb70c7282.pdf ua-cam.com/video/ogQbgdZQiaI/v-deo.html www.nytimes.com/2023/02/16/technology/bing-chatbot-microsoft-chatgpt.html RobertRMorris/status/1611450197707464706 replika.com/ ckrybus.com/static/papers/Bainbridge_1983_Automatica.pdf ua-cam.com/video/bJIhXrfOH58/v-deo.html studio.ribbonf...
Winning prompt! $10k LLM reasoning challenge
Views: 518 · 4 months ago
0:00 Introduction to the AB Problem Challenge 0:29 Overview of the Winning Prompt 1:52 Detailed Mechanics and Problem-Solving Steps 3:43 Few-Shot Prompting with Textual Solution Format 4:53 Key Lessons from the Winning Prompt 5:53 Emphasis on Repetition and Instruction Clarity 6:40 Trial and Error Process for Prompt Creation 7:31 Alternative Approach: LLM-Generated Code Solution 8:59 Concluding...
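Not the actual winning prompt, but a scaffold illustrating the lessons drawn out in the video: spell out the mechanics explicitly, give worked few-shot examples in a fixed textual solution format, and repeat the critical instructions. All placeholder text is hypothetical.

```python
# Illustrative few-shot prompt scaffold (NOT the actual winning prompt).
FEW_SHOT_PROMPT = """You will solve a token-rewriting puzzle.

RULES (read carefully; they are repeated at the end):
1. Work strictly left to right.
2. Apply exactly one rewrite rule per step.
3. Show every intermediate state before giving the final answer.

EXAMPLE 1
Input: <example input 1>
Steps:
  state 0: <...>
  state 1: <...>
Final answer: <...>

EXAMPLE 2
Input: <example input 2>
Steps:
  state 0: <...>
  state 1: <...>
Final answer: <...>

REMEMBER: one rule per step, left to right, show all intermediate states.

Input: {problem}
Steps:"""

print(FEW_SHOT_PROMPT.format(problem="<new problem instance>"))
```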
$10k for LLM reasoning
Views: 897 · 4 months ago
0:00 Introduction and Problem Description 0:23 LLM Reasoning Challenge 0:51 Details of the $10,000 Challenge 1:15 Internet Takes Up the Challenge 2:28 Winning Entry and Success Rates 3:25 LLMs Can Do Reasoning with Prompting 4:10 Benchmarking LLM Reasoning Capabilities 5:07 Boundaries of LLM Reasoning Unclear 5:36 Claude Opus Outperforms GPT-4 6:08 Conclusion and Future Video Plans Original cla...
LLM agents do software engineering
Views: 702 · 4 months ago
0:00 Introduction to Autodev and Agent-based LLM Systems 0:48 Limitations of Co-pilot and the Need for Automation 1:10 Autodev Architecture Overview 2:01 The Role of the Conversation Manager and Agents 3:22 Demonstrating Autodev's End-to-End Flow 5:03 Comparing Autodev's Performance to GPT-4 Baseline 6:03 Autodev's Performance on the Human Eval Benchmark 6:37 Autodev's Performance on Test Gener...
LLM benchmarks
Views: 777 · 4 months ago
How are LLMs evaluated? 00:00 - Introduction and motivation for looking at LLM benchmarks 00:38 HumanEval benchmark for code synthesis 02:27 - Exploring the HumanEval dataset 03:24 - MMLU (Massive Multitask Language Understanding) benchmark 04:37 - Exploring the MMLU dataset 05:58 - BigBench meta-benchmark with 200 tasks 06:50 - Exploring a logical reasoning task in BigBench 08:13 - BigBench Ha...
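A sketch of how a HumanEval-style pass@1 check works: the model completes a function signature plus docstring, and the completion is run against the task's hidden tests. This assumes the `openai_humaneval` dataset on the Hugging Face Hub and uses a bare exec for brevity; a real harness sandboxes the untrusted code.

```python
# Sketch of a HumanEval-style pass@1 check over a few problems.
from datasets import load_dataset
from openai import OpenAI

client = OpenAI()
problems = load_dataset("openai_humaneval", split="test")

def complete(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": "Continue this Python code to complete the function. "
                              "Return only the code that follows, correctly indented.\n" + prompt}])
    return resp.choices[0].message.content

def passes(problem) -> bool:
    # prompt = signature + docstring; test defines check(candidate); entry_point is the function name.
    program = problem["prompt"] + complete(problem["prompt"]) + "\n" + problem["test"]
    env: dict = {}
    try:
        exec(program, env)                      # NOTE: sandbox this in real use
        env["check"](env[problem["entry_point"]])
        return True
    except Exception:
        return False

sample = problems.select(range(5))
print(sum(passes(p) for p in sample), "/ 5 passed")
```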
LLMs eat entry-level SWEs
Views: 1.1K · 5 months ago
LLMs can debug with prints
Views: 471 · 5 months ago
Determinism ⇒ Fast LLMs (Groq)
Views: 605 · 5 months ago
Self-discovery: Choosing the Best Prompt for the Problem
Views: 390 · 5 months ago
Asleep at the wheel: can AI reduce performance?
Views: 129 · 5 months ago
The Future of RAG in the Age of Large Context Windows
Views: 669 · 5 months ago
GPT-4 passed the Turing Test!
Views: 2.2K · 6 months ago
CS higher ed in North America: all the stats you should know
Views: 148 · 6 months ago
I remember @AndrejKarpathy's deleted tweet
Views: 383 · 6 months ago
LLMs for real world knowledge work
Views: 267 · 6 months ago
Watch me build a GPT for journaling
Views: 303 · 7 months ago
LLMs can "breed" their own prompts
Views: 1.2K · 7 months ago
LLMs with infinite context?
Views: 777 · 7 months ago
Can prompt engineering beat fine-tuning?
Views: 706 · 7 months ago
Can LLMs discover new math and CS?
Views: 1.1K · 7 months ago

COMMENTS

  • @somanshukumar1344 · 2 days ago

    Always wanted to see this type of content

  • @rtos · 13 days ago

    Unfortunately, even the so-called power users of LLMs with their own UA-cam channels always seem to have a small set of stock prompts, which get repeated with every new review. If LLMs were trained on these specific questions, then they're going to start appearing super intelligent! Things like "why is the sky blue" or "write a snake game in Python" are hardly a test of machine intelligence, as all that is needed is to be trained on accurate code or factual data.

  • @matty-oz6yd · 16 days ago

    I really value what you do my dude <3

  • @RAHUDAS · 17 days ago

    Bot designer ltd crashed, I'm not able to access it.

  • @ashwinnair5803 · 17 days ago

    Why not just use RAPTOR instead?

  • @yiwensin5913 · 18 days ago

    Excellent! I didn't know you before and I just stumbled upon your video while searching for material on prompting LLMs (for a local LLM project). You now have a new sub :)

  • @user-wr4yl7tx3w · 18 days ago

    Can you discuss DSPy and give your opinion on it, given how it is related to prompting?

  • @arthurdhonneur276 · 18 days ago

    Nice video thank you very much !

  • @Starhopp3r · 26 days ago

    Excellent review! Thank you. I really enjoyed this book and have been recommending it to people in order to help them set expectations about “AI” without excessive optimism or pessimism. Currently reading Deep Utopia; hope to see your review soon!

  • @user-bw6oi5mf9y · 1 month ago

    I think the main problem is how they traverse the graph by asking the LLM at each step. I'm not sure this is feasible in production.

  • @sasha297603ha · 1 month ago

    Great paper, thanks for covering!

  • @user-wr4yl7tx3w · 1 month ago

    Excellent content. Well explained.

  • @PreetiGuptaAvril · 1 month ago

    Sir, do you explore the vision domain as well, or do you know of any such YouTuber whom I can follow for paper explanations?

  • @themax2go · 1 month ago

    very well "ragged"... both on the local domain (details) and global domain (overview of pros-cons) 😉😎

  • @wayneqwele8847 · 1 month ago

    Thank you for the video, that was a great paper to go through. I find RAG research techniques have so much insight to how we can develop and identify our own cognitive impediments to our own judgement. The Comprehensiveness, Diversity of perspective, Empowerment and Directness is such a good mental model to use in our own human judgement.

  • @fintech1378 · 1 month ago

    super excellent video

  • @goelnikhils · 1 month ago

    Amazing explanation Vivek

  • @awakenwithoutcoffee · 1 month ago

    Great presentation Vivek. Some questions:
    - Is GraphRAG production ready? If not, would it be difficult to upgrade RAG methods once we are in production?
    - Is there a RAG provider/stack that you prefer? (DataStax, Pinecone, Weaviate, plus a bunch of others who are all competing for attention)
    - What are your thoughts on LangChain vs LangGraph?

  • @christopherconyers767 · 1 month ago

    Awesome review - thanks for the great work!

  • @brandonheaton6197 · 1 month ago

    can you pontificate on the combination of upcoming transformer inference ASICs with deep agentic workflows employing GraphRAG style strategies? Seems like we will be close to our personal assistants writing a PhD thesis in the background whenever we ask a question. SOHU is reporting 500,000 tokens per second with Llama3 70B....

  • @RoulDukeGonzo · 1 month ago

    Seems clear that for 'current events' rag is going to win, but for broader, domain specific themes or logic, how does fine tuning stack up? E.g. create code using our internal suite of APIs... If context is big enough, icl should be fine, but rag may miss some key docs based on semantic similarity alone... I guess... I should write a paper 😂

  • @sasha297603ha · 1 month ago

    Very interesting papers, thanks for covering!

  • @stevenwatson2927 · 1 month ago

    It's surprising to see ChatGPT achieving below 99% when Wolfram Alpha can basically answer anything just by having specific knowledge. It's also surprising that "playing" with the wording of the prompt does anything at all, let alone gives a better result. It makes no sense, especially when we can clearly see from the research that the information entropy is basically the same between prompts with and without extra steps.

  • @therobotocracy · 1 month ago

    Is it flattening out because it maxes at %100?

    • @VivekHaldar · 1 month ago

      Yes, that too! People have started looking at harder benchmarks like GSM8k-Hard and MATH.

  • @karinlv890 · 1 month ago

    Thank you for saving my group meeting! Your video helps a lot!

  • @wanfuse · 1 month ago

    Wouldn't it cut to the chase to train an LLM on your own data? There's your graph. Use one of these: OpenAI's GPT-3/4, Hugging Face Transformers (e.g., GPT-2, GPT-3 via third-party providers), Google's T5 (Text-to-Text Transfer Transformer), Meta's BART and BlenderBot, Anthropic's Claude. Update the LLM every week. Summarization is the death of real data; better off with one level of summarization? Just a thought!

    • @mccleod6235 · 1 month ago

      Maybe you don't want to send all your valuable business data to third party companies.

    • @wanfuse · 1 month ago

      @mccleod6235 That's true, but it's not necessary; there are open-source models you can train air-gapped on a Jetson.

    • @bohnohboh676 · 1 month ago

      "every week update the llm" yeah no way unless you have tons of cash, compute, and time

    • @wanfuse · 1 month ago

      Maybe, maybe not, I'll let you know! You're probably right; we'll see if my idea pans out.

  • @rafikyahia7100 · 1 month ago

    Excellent content summarizing cutting edge approaches, thank you!

  • @sasha297603ha · 1 month ago

    Very interesting paper! Looks like team lead model and a bunch of juniors 😅 Thanks for covering!

  • @christopherd.winnan8701 · 1 month ago

    Are there any models where we can try this think and exe method for ourselves?

    • @VivekHaldar · 1 month ago

      As described in the paper, the authors tried it with GPT-3.5 and Llama. The prompts are in the paper; you could try it with any LLM of your choice.

  • @vida91963 · 2 months ago

    Nice presentation thank you!

  • @jordycollingwood · 2 months ago

    Really great explanation. I'm currently struggling to decide on my own KG structure for a corpus of 2,000 medical PDFs, so this was very helpful.

    • @awakenwithoutcoffee · 1 month ago

      Same here brother. There are so many techniques; every day I learn something new, which is both good and terrifying, ha. What stack are you thinking of using? We are researching DataStax, Pinecone, Weaviate and are learning to build agents with LangGraph.

  • @kaixiliu7469 · 2 months ago

    Thanks for sharing the review Vivek! Would you mind sharing your book list as well?

    • @VivekHaldar · 1 month ago

      Hey Kaixi! Don't have an explicit list, just pick up what looks interesting at the time... :-)

  • @btscheung · 2 months ago

    Really appreciate your in depth review of the book! This provides more thoughtful reading when I start the book.

  • @thankqwerty · 2 months ago

    Thanks for sharing the paper. In my experience using Llama3-8B on my benchmark dataset, I noticed that the LLM has learned a fact that is incorrect or contradicts my application. I tried to clarify that in the prompt, but noticed the LLM is actually quite stubborn, which leads to quite fragile responses, i.e. the LLM sometimes gets it right and sometimes gets it wrong with minimal changes to the prompt, which could be as small as adding spaces. I wonder if you have come across a similar situation or papers that discuss this behavior. Thanks.

    • @VivekHaldar · 2 months ago

      Yes that kind of brittleness is a common issue unfortunately.

  • @harivarsha4016 · 2 months ago

    I love this kind of content, please never stop !!!

  • @atomwalk · 2 months ago

    Awesome work! Thanks🤗

  • @user-wr4yl7tx3w · 2 months ago

    More agent papers please. Thanks 😊

  • @willtipton1698 · 2 months ago

    Nice video ty

  • @colinwar · 2 months ago

    You ask vanilla questions if you can't un-cloak a machine response! The reasoning is not there with language models; how stupid are people to not be able to ask the right questions? I call lies on these claims. Show the test as proof; I doubt you can or will show the actual test. This is absurd.

  • @gilinachum · 2 months ago

    But why is the paper's fine-tuning different from the original pre-training and alignment fine-tuning that came before it? All of them expose the model to a mix of existing and new data...

    • @VivekHaldar · 2 months ago

      You are correct: in principle, fine-tuning works the same way as pre-training (updating weights), so FT can be thought of as continued PT. The difference is in the data used. One fine-tunes when one has a domain-specific dataset that's very different from the pre-training data.
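A minimal sketch of that point: fine-tuning uses the same mechanism as pre-training (gradient updates on the next-token objective), just continued on a small domain-specific corpus. The model name and corpus here are placeholders.

```python
# Fine-tuning as continued pre-training: same next-token objective, new domain data.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                      # placeholder; any causal LM fine-tunes the same way
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token            # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

domain_corpus = [
    "<domain-specific document 1>",
    "<domain-specific document 2>",
]
loader = DataLoader(domain_corpus, batch_size=2, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):
    for texts in loader:
        batch = tok(list(texts), return_tensors="pt", padding=True, truncation=True)
        labels = batch["input_ids"].clone()
        labels[batch["attention_mask"] == 0] = -100   # ignore padding in the loss
        out = model(**batch, labels=labels)            # same objective as pre-training
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```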

  • @hosseinmohammadi4574 · 2 months ago

    Interesting! Tnx

  • @sasha297603ha · 2 months ago

    Very interesting paper, thanks for covering!

  • @HampusAhlgren · 2 months ago

    Just wanted to say I really appreciate your videos. Everything is short and concise and I love that you’re always using papers as the foundation for the conclusions. Keep it up!

    • @VivekHaldar · 2 months ago

      Thanks for the kind words. That's the idea!

  • @dennyoviedo4102 · 2 months ago

    Good brother 😊 thanks for an excellent explanation, peer-2-peer of BTC formula. I'll eat this info into my brain 🧠 until my neurons start a new circuit. 😂

  • @sasha297603ha · 3 months ago

    Very interesting paper, thanks for covering!

  • @MatySiman · 3 months ago

    Great video! Why was the > 1 a mistake? Didn't it return False as it should?

    • @MatySiman · 3 months ago

      More specifically, I find this sentence a bit weird: "This means that if any element has more than 1 duplicate, the function will return False. However, the task requires that if there are more than 1 duplicate of the same number, the function should return False."

    • @MatySiman · 2 months ago

      @VivekHaldar

  • @christopherd.winnan8701 · 3 months ago

    Does this also mean that experts in their field might choose to wait for improved AI abilities so that they can do more than just superficial improvements? I predict that we will see a tsunami of low quality generations followed by a true paradigm leap in terms of content.

    • @VivekHaldar · 3 months ago

      They don't need to wait. You can get pretty far before hitting the limits of current SOTA models. There is a tiny fraction of writers who are sought out for their unique voice. Everyone else is producing generic sounding copy, ripe for replacement.

    • @christopherd.winnan8701 · 3 months ago

      @VivekHaldar I still cannot find a model that can handle more advanced tasks. Do you have any recs?

    • @VivekHaldar · 3 months ago

      @christopherd.winnan8701 What's an example of an advanced task you see LLMs having trouble with? See the recent videos (two weeks ago) on the channel about the $10k reasoning challenge for an example problem and the resulting prompt that solved it.

    • @christopherd.winnan8701 · 3 months ago

      @VivekHaldar Is it an open-access LLM?