Add Understanding DeepSeek R1

2025-02-09 17:14:59 +00:00 · 2025-02-09 17:14:59 +00:00 · 014bbcc7da
commit 014bbcc7da
parent 7b4f250cad
1 changed files with 92 additions and 0 deletions
--- a/Understanding-DeepSeek-R1.md
+++ b/Understanding-DeepSeek-R1.md
@ -0,0 +1,92 @@
 <br>DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that's been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, but it also comes with fully MIT-licensed weights. This marks it as the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.<br>
 <br>What makes DeepSeek-R1 especially exciting is its openness. Unlike the less-open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper.
 The model is also remarkably cost-effective, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).<br>
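 <br>To get a feel for what those prices mean per request, here is a minimal sketch. The per-million prices are the ones quoted above; the workload (2,000 input tokens, 8,000 output tokens) and the choice of the upper end of R1's input price range are assumptions made purely for illustration.<br>

```python
# Prices in USD per million tokens, taken from the figures quoted above.
R1_INPUT_RANGE = (0.14, 0.55)  # R1 input price range
R1_OUTPUT = 2.19
O1_INPUT = 15.00
O1_OUTPUT = 60.00


def request_cost(input_tokens, output_tokens, input_price, output_price):
    """Cost in USD of one request, given per-million-token prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: reasoning models tend to emit long outputs.
INPUT_TOKENS = 2_000
OUTPUT_TOKENS = 8_000

# Use the upper end of R1's input range to stay conservative.
r1_cost = request_cost(INPUT_TOKENS, OUTPUT_TOKENS, R1_INPUT_RANGE[1], R1_OUTPUT)
o1_cost = request_cost(INPUT_TOKENS, OUTPUT_TOKENS, O1_INPUT, O1_OUTPUT)

print(f"R1: ${r1_cost:.4f}  o1: ${o1_cost:.4f}  ratio: {o1_cost / r1_cost:.1f}x")
```

 <br>Because output tokens dominate the bill for reasoning models, the output price matters far more than the input price in this kind of estimate.<br>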
 <br>Until ~GPT-4, the common wisdom was that better models required more data and compute. While that's still valid, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.<br>
 <br>The Essentials<br>
 <br>The DeepSeek-R1 paper presented several models, but the main ones are R1 and R1-Zero. Following these are a series of distilled models that, while interesting, I won't discuss here.<br>
 <br>DeepSeek-R1 uses two major ideas:<br>
 <br>1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
 2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.<br>
 <br>R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a `<think>` tag, before answering with a final summary.<br>
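 <br>In practice you often want to separate the reasoning from the final summary. The sketch below assumes the reasoning is wrapped in a single `<think>...</think>` block, as in the released R1 models; the function name and the sample string are made up for illustration.<br>

```python
import re

# Matches the first <think>...</think> block, including newlines inside it.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(response: str) -> tuple[str, str]:
    """Return (reasoning, final_answer) for a response whose chain-of-thought
    is wrapped in a <think> block. If no block is found, the whole response
    is treated as the answer."""
    match = THINK_PATTERN.search(response)
    if match is None:
        return "", response.strip()
    reasoning = match.group(1).strip()
    answer = response[match.end():].strip()
    return reasoning, answer


sample = "<think>2 + 2 is 4, and 4 * 3 is 12.</think>The result is 12."
reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```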
 <br>R1-Zero vs R1<br>
 <br>R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base without any supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward.
 R1-Zero achieves excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both correctness and readability.<br>
 <br>It is interesting how some languages may express certain ideas better, which leads the model to choose the most expressive language for the task.<br>
 <br>Training Pipeline<br>
 <br>The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they created such strong reasoning models, and what you can expect from each phase. This includes the problems that the resulting models from each phase have, and how they solved them in the next phase.<br>
 <br>It's interesting that their training pipeline differs from the usual:<br>
 <br>The usual training strategy: Pretraining on a large dataset (train to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
 R1-Zero: Pretrained → RL
 R1: Pretrained → Multistage training pipeline with multiple SFT and RL stages<br>
 <br>1. Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This gives a good model to start RL.
 2. First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model but with weak general capabilities, e.g., poor formatting and language mixing.
 3. Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples.
 4. Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.
 5. Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.
 They also did model distillation for several Qwen and Llama models on the reasoning traces to get distilled-R1 models.<br>
 <br>Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model.
 The teacher is typically a larger model than the student.<br>
 <br>Group Relative Policy Optimization (GRPO)<br>
 <br>The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers.
 They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.<br>
 <br>In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO.
 Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.<br>
 <br>What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions.
 Instead of depending on expensive external models or human-graded examples as in traditional RLHF, the RL used for R1 uses simple criteria: it might give a higher reward if the answer is correct, if it follows the expected `<think>`/`<answer>` format, and if the language of the answer matches that of the prompt.
 Not relying on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.<br>
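 <br>A rule-based reward of this kind can be sketched in a few lines. This is not DeepSeek's actual reward code: the `<think>`/`<answer>` template, the weights (0.5, 1.0, 0.25), the exact-match answer check and the ASCII-share language heuristic are all assumptions chosen to keep the example small.<br>

```python
import re

# Assumed template: reasoning in <think>, final result in <answer>.
FORMAT_PATTERN = re.compile(
    r"^<think>.+?</think>\s*<answer>(.+?)</answer>\s*$", re.DOTALL
)


def is_mostly_ascii(text, threshold=0.9):
    """Crude stand-in for a language check: share of ASCII characters."""
    if not text:
        return True
    return sum(ch.isascii() for ch in text) / len(text) >= threshold


def rule_based_reward(response, reference_answer, prompt):
    """Sum of format, accuracy and language-consistency rewards.
    All weights are illustrative, not taken from the paper."""
    reward = 0.0
    match = FORMAT_PATTERN.match(response.strip())
    if match:
        reward += 0.5  # format reward: expected tags are present
        if match.group(1).strip() == reference_answer.strip():
            reward += 1.0  # accuracy reward: exact match with the reference
    if is_mostly_ascii(prompt) == is_mostly_ascii(response):
        reward += 0.25  # language consistency between prompt and response
    return reward


prompt = "What is 3*4?"
print(rule_based_reward("<think>3*4=12</think><answer>12</answer>", "12", prompt))
print(rule_based_reward("<think>3*4=11</think><answer>11</answer>", "12", prompt))
print(rule_based_reward("12", "12", prompt))
```

 <br>Note that the unformatted "12" gets no accuracy credit here even though it is correct; tying accuracy to the format is one possible design choice, not something the paper prescribes.<br>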
 <br>GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:<br>
 <br>1. For each input prompt, the model generates several different responses.
 2. Each response receives a scalar reward based on factors like accuracy, formatting, and language consistency.
 3. Rewards are adjusted relative to the group's performance, essentially measuring how much better each response is compared to the others.
 4. The model updates its strategy slightly to favor responses with higher relative advantages. It only makes slight adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't stray too far from its original behavior.<br>
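 <br>Steps 3 and 4 can be sketched numerically. The snippet below is a simplified, sequence-level illustration rather than a training loop: it normalizes rewards within a group (using the population standard deviation, which is an assumption), applies a PPO-style clipped term, and computes the per-sample KL estimate I understand the DeepSeekMath paper to use. The function names, `clip_eps=0.2` and the sample rewards are illustrative.<br>

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Step 3: score each response relative to its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(logp_new, logp_old, advantage, clip_eps=0.2):
    """Step 4: PPO-style clipped surrogate for one response."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def kl_penalty(logp_new, logp_ref):
    """Non-negative KL estimate against a frozen reference policy:
    r - log(r) - 1 with r = pi_ref / pi_new."""
    log_ratio = logp_ref - logp_new
    return math.exp(log_ratio) - log_ratio - 1.0


# Hypothetical rewards for four responses to the same prompt.
rewards = [1.75, 0.75, 0.25, 0.75]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f}  advantage={a:+.3f}")
```

 <br>The key point is that no learned critic appears anywhere: the baseline is simply the mean reward of the group, which is what lets GRPO drop the separate value model.<br>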
 <br>A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance awarding a bonus when the model correctly uses the `<think>` syntax, to guide the training.<br>
 <br>While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).<br>
 <br>For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource.
 Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.<br>
 <br>Is RL on LLMs the path to AGI?<br>
 <br>As a final note on explaining DeepSeek-R1 and the methodologies they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.<br>
 <br>These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.<br>
 <br>In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely present in the pretrained model.<br>
 <br>This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities.
 Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge.<br>
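 <br>One way to make this distinction concrete is to compare two metrics over K sampled answers: whether any sample is correct (Pass@K) and whether the most frequent sample is correct (Maj@K). As far as I understand the DeepSeekMath analysis, RL mostly moves the second. The sample answers below are invented to illustrate the pattern, not measured results.<br>

```python
from collections import Counter


def pass_at_k(samples, correct):
    """True if at least one of the K sampled answers is correct."""
    return correct in samples


def maj_at_k(samples, correct):
    """True if the most frequent sampled answer is correct."""
    most_common_answer, _ = Counter(samples).most_common(1)[0]
    return most_common_answer == correct


# Hypothetical samples for one prompt whose correct answer is "12".
base_samples = ["12", "11", "13", "11"]  # correct answer present but rare
rl_samples = ["12", "12", "11", "12"]    # same answer, now the dominant one

for name, samples in [("base", base_samples), ("after RL", rl_samples)]:
    print(name, "pass@4:", pass_at_k(samples, "12"), "maj@4:", maj_at_k(samples, "12"))
```

 <br>In this toy case the base model already "knows" the answer (Pass@K succeeds), and the RL-tuned one differs only in how often it surfaces it.<br>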
 <br>It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!<br>
 <br>Running DeepSeek-R1<br>
 <br>I've used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.<br>
 <br>Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.<br>
 <br>I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments.
 The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.<br>
 <br>671B via llama.cpp<br>
 <br>DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV-cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp.<br>
 <br>29 layers seemed to be the sweet spot given this configuration.<br>
 <br>Performance:<br>
 <br>A r/localllama user described that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU on their local gaming setup.
 Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.<br>
 <br>As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these large models on accessible hardware.<br>
 <br>What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than other models, but their usefulness is also usually higher.
 We need to both maximize usefulness and minimize time-to-usefulness.<br>
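 <br>To put rough numbers on time-to-usefulness at the throughputs mentioned above, here is a small back-of-the-envelope sketch. The response length (2,000 reasoning tokens plus 300 answer tokens) is a made-up example, and prompt processing time is ignored.<br>

```python
def generation_time_seconds(num_tokens, tokens_per_second):
    """Wall-clock time to generate num_tokens at a steady decode speed."""
    return num_tokens / tokens_per_second


# Hypothetical response: the model must finish thinking before it answers.
REASONING_TOKENS = 2_000
ANSWER_TOKENS = 300
total_tokens = REASONING_TOKENS + ANSWER_TOKENS

# Throughputs quoted in the text above.
for label, speed in [("2 tok/s", 2.0), ("3.5 tok/s", 3.5), ("4.25 tok/s", 4.25)]:
    minutes = generation_time_seconds(total_tokens, speed) / 60
    print(f"{label}: {minutes:.1f} min until the full answer is available")
```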
 <br>70B via Ollama<br>
 <br>70.6B params, 4-bit KM quantized DeepSeek-R1 running via Ollama.<br>
 <br>GPU utilization shoots up here, as expected when compared to the mostly CPU-powered run of 671B that I showcased above.<br>
 <br>Resources<br>
 <br>DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
 [2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
 DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube).
 DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs.
 The Illustrated DeepSeek-R1 - by Jay Alammar.
 Explainer: What's R1 & Everything Else? - Tim Kellogg.
 DeepSeek R1 Explained to your grandma - YouTube<br>
 <br>DeepSeek<br>
 <br>- Try R1 at chat.deepseek.com.
 - GitHub - deepseek-ai/DeepSeek-R1.
 - deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
 - DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025): This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It presents a detailed methodology for training such models using large-scale reinforcement learning techniques.
 - DeepSeek-V3 Technical Report (December 2024): This report discusses the implementation of an FP8 mixed precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
 - DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024): This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
 - DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024): This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling.
 - DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024): This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
 - DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024): This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.<br>
 <br>Interesting events<br>
 <br>- Hong Kong University replicates R1 results (Jan 25, '25).
 - Huggingface announces huggingface/open-r1: Fully open reproduction of DeepSeek-R1 to replicate R1, fully open source (Jan 25, '25).
 - OpenAI researcher confirms the DeepSeek team independently found and used some core ideas the OpenAI team used on the way to o1<br>