
Understanding LLM Distillation Techniques 



Modern large language models are no longer trained only on raw internet text. Increasingly, companies are using powerful “teacher” models to help train smaller or more efficient “student” models. This process, broadly known as LLM distillation or model-to-model training, has become a key technique for building high-performing models at lower computational cost. Meta used its massive Llama 4 Behemoth model to help train Llama 4 Scout and Maverick, while Google leveraged Gemini models during the development of Gemma 2 and Gemma 3. Similarly, DeepSeek distilled reasoning capabilities from DeepSeek-R1 into smaller Qwen and Llama-based models.

The core idea is simple: instead of learning solely from human-written text, a student model can also learn from the outputs, probabilities, reasoning traces, or behaviors of another LLM. This allows smaller models to inherit capabilities such as reasoning, instruction following, and structured generation from much larger systems. Distillation can happen during pre-training, where teacher and student models are trained together, or during post-training, where a fully trained teacher transfers knowledge to a separate student model.

In this article, we will explore three major approaches used for training one LLM using another: Soft-label distillation, where the student learns from the teacher’s probability distributions; Hard-label distillation, where the student imitates the teacher’s generated outputs; and Co-distillation, where multiple models learn collaboratively by sharing predictions and behaviors during training.

Soft-Label Distillation

Soft-label distillation is a training technique where a smaller student LLM learns by imitating the output probability distribution of a larger teacher LLM. Instead of training only on the correct next token, the student is trained to match the teacher’s softmax probabilities across the entire vocabulary. For example, if the teacher predicts the next token with probabilities like “cat” = 70%, “dog” = 20%, and “animal” = 10%, the student learns not just the final answer, but also the relationships and uncertainty between different tokens. This richer signal is often called the teacher’s “dark knowledge” because it contains hidden information about reasoning patterns and semantic understanding.
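As a rough sketch (plain NumPy, with a made-up three-token vocabulary, not any particular framework's API), matching the teacher's distribution amounts to minimizing the KL divergence between teacher and student token probabilities, often with a temperature that softens both distributions:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature > 1 softens the distribution, exposing more "dark knowledge"
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def soft_label_loss(student_logits, teacher_logits, temperature=2.0):
    # KL(teacher || student) over the whole vocabulary, not just the top token
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    return float(np.sum(p_t * (np.log(p_t) - np.log(p_s))))

# Toy vocabulary ["cat", "dog", "animal"]; teacher favors "cat"
teacher_logits = [2.0, 0.75, 0.05]
student_logits = [1.0, 1.0, 0.2]
loss = soft_label_loss(student_logits, teacher_logits)
```

The loss is zero only when the student reproduces the teacher's full distribution, so the student is rewarded for learning the relative ranking of "dog" versus "animal" too, not just for picking "cat". (Production recipes often scale this term by the squared temperature; that detail is omitted here for clarity.)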

The biggest advantage of soft-label distillation is that it allows smaller models to inherit capabilities from much larger models while remaining faster and cheaper to deploy. Since the student learns from the teacher’s full probability distribution, training becomes more stable and informative compared to learning from hard one-word targets alone. However, this method also comes with practical challenges. To generate soft labels, you need access to the teacher model’s logits or weights, which is often not possible with closed-source models. In addition, storing probability distributions for every token across vocabularies containing 100k+ tokens becomes extremely memory-intensive at LLM scale, making pure soft-label distillation expensive for trillion-token datasets.

Hard-label distillation

Hard-label distillation is a simpler approach where the student LLM learns only from the teacher model’s final predicted output token instead of its full probability distribution. In this setup, a pre-trained teacher model generates the most likely next token or response, and the student model is trained using standard supervised learning to reproduce that output. The teacher essentially acts as a high-quality annotator that creates synthetic training data for the student. DeepSeek used this approach to distill reasoning capabilities from DeepSeek-R1 into smaller Qwen and Llama 3.1 models.

Unlike soft-label distillation, the student does not see the teacher’s internal confidence scores or token relationships — it only learns the final answer. This makes hard-label distillation computationally much cheaper and easier to implement since there is no need to store massive probability distributions for every token. It is also especially useful when working with proprietary “black-box” models like GPT-4 APIs, where developers only have access to generated text and not the underlying logits. While hard labels contain less information than soft labels, they remain highly effective for instruction tuning, reasoning datasets, synthetic data generation, and domain-specific fine-tuning tasks.
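A minimal sketch of this pattern, assuming a hypothetical `teacher_generate` stand-in for a black-box API (real pipelines would call a chat-completions endpoint and train a full model): the teacher produces text-only labels, and the student is penalized with ordinary cross-entropy against that single chosen token.

```python
import numpy as np

def teacher_generate(prompt):
    # Stand-in for a black-box teacher API: only generated text comes back,
    # never logits or probabilities.
    canned = {"Translate 'chat' to English.": "cat"}
    return canned.get(prompt, "unknown")

def hard_label_loss(student_logits, target_index):
    # Standard cross-entropy against the teacher's single output token:
    # -log p_student(target). No per-token distribution storage needed.
    z = np.asarray(student_logits, dtype=float)
    z -= z.max()
    log_probs = z - np.log(np.exp(z).sum())
    return float(-log_probs[target_index])

vocab = {"cat": 0, "dog": 1, "animal": 2}
prompt = "Translate 'chat' to English."
pair = (prompt, teacher_generate(prompt))          # synthetic training example
loss = hard_label_loss([1.0, 0.5, 0.1], vocab[pair[1]])
```

The student never sees how confident the teacher was between "dog" and "animal"; it only learns that "cat" was the answer, which is exactly the trade-off described above.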

Co-distillation

Co-distillation is a training approach where both the teacher and student models are trained together instead of using a fixed pre-trained teacher. In this setup, the teacher LLM and student LLM process the same training data simultaneously and generate their own softmax probability distributions. The teacher is trained normally using the ground-truth hard labels, while the student learns by matching the teacher’s soft labels along with the actual correct answers. Meta used a form of this approach while training Llama 4 Scout and Maverick alongside the larger Llama 4 Behemoth model.

One challenge with co-distillation is that the teacher model is not fully trained during the early stages, meaning its predictions may initially be noisy or inaccurate. To overcome this, the student is usually trained using a combination of soft-label distillation loss and standard hard-label cross-entropy loss. This creates a more stable learning signal while still allowing knowledge transfer between models. Unlike traditional one-way distillation, co-distillation allows both models to improve together during training, often leading to better performance, stronger reasoning transfer, and smaller performance gaps between the teacher and student models.
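The combined objective described above can be sketched as a weighted sum of a ground-truth cross-entropy term and a soft-label KL term (a simplified NumPy illustration; the `alpha` weight and single-token setup are assumptions for the example, not any lab's published recipe):

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def co_distill_loss(student_logits, teacher_logits, target_index, alpha=0.5):
    # Blend the ground-truth cross-entropy with the soft-label KL term.
    # Early in training the teacher is still noisy, so the hard-label term
    # (weighted by alpha) anchors the student while the KL term transfers
    # whatever knowledge the teacher has acquired so far.
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    ce = -np.log(p_s[target_index])
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)))
    return float(alpha * ce + (1.0 - alpha) * kl)

loss = co_distill_loss([1.0, 0.3, 0.1], [2.0, 0.5, 0.0], target_index=0)
```

In practice `alpha` can be annealed: heavier on the ground truth early on, shifting toward the teacher's soft labels as its predictions stabilize.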

Comparing the Three Distillation Techniques 

Soft-label distillation transfers the richest form of knowledge because the student learns from the teacher’s full probability distribution instead of only the final answer. This helps smaller models capture reasoning patterns, uncertainty, and relationships between tokens, often leading to stronger overall performance. However, it is computationally expensive, requires access to the teacher’s logits or weights, and becomes difficult to scale because storing probability distributions for massive vocabularies consumes enormous memory.

Hard-label distillation is simpler and more practical. The student only learns from the teacher’s final generated outputs, making it much cheaper and easier to implement. It works especially well with proprietary black-box models like GPT-4 APIs where internal probabilities are unavailable. While this approach loses some of the deeper “dark knowledge” present in soft labels, it remains highly effective for instruction tuning, synthetic data generation, and task-specific fine-tuning.

Co-distillation takes a collaborative approach where teacher and student models learn together during training. The teacher improves while simultaneously guiding the student, allowing both models to benefit from shared learning signals. This can reduce the performance gap seen in traditional one-way distillation methods, but it also makes training more complex since the teacher’s predictions are initially unstable. In practice, soft-label distillation is preferred for maximum knowledge transfer, hard-label distillation for scalability and practicality, and co-distillation for large-scale joint training setups.

The post Understanding LLM Distillation Techniques  appeared first on MarkTechPost.

France arrests Tunisian suspected of planning 'jihad-inspired' attack in Paris





France has arrested a 27-year-old Tunisian man suspected of planning attacks “inspired by jihadism”, the anti-terror prosecutor’s office said on Monday. A source said he was targeting a Paris museum and members of the Jewish community.

The competitive advantage of not knowing you’re wrong




When not knowing the odds improves your chances

Bitcoin Tests $82K As Crypto Funds Notch Sixth Straight Week Of Inflows





Crypto investment products absorbed $858 million last week, ahead of the upcoming CLARITY Act markup and Fed chair transition.

If your router or drone maker is banned in the US, it will get an update lifeline until 2029





The FCC extended update support for restricted routers and drones until 2029, aiming to avoid cybersecurity risks caused by unsupported and vulnerable devices.

Sui breaks out on heavy demand – Yet ONE overheating risk is rising




Can Sui’s rising ecosystem demand absorb growing speculative pressure beneath the rally?

How to watch the PGA Championship without missing the early morning tee times




Golf fans are eagerly awaiting the start of the 2026 PGA Championship, which kicks off this week. From May 14 to 17, the 156 biggest names in golf will compete to earn the coveted Wanamaker Trophy.

Last year’s winner, Scottie Scheffler, 29, who took home the trophy for the first time, returns as the defending champion. Other big names include Rory McIlroy, who is coming off two consecutive Masters titles and is chasing his third PGA Championship win and seventh major title. Other star players to watch are Cameron Young, Jon Rahm, and Bryson DeChambeau.

This year, the tournament will take place at Aronimink Golf Club in Pennsylvania, a location that hasn’t hosted the event since the 1960s. According to the PGA website, tickets to the event have sold out for all four days. However, verified resale tickets are now available through SeatGeek. 

For the nearly five million viewers expected to tune in from their living rooms, there are a few ways to watch.

The first round begins Thursday, and coverage starts at 7 a.m. ET on ESPN+, a subscription service that comes in two tiers: ESPN+ (also known as ESPN Select) costs $13 per month or $130 per year, while ESPN Unlimited, the “all-in-one” hub, costs $29.99 per month or $299.99 per year. The main broadcast will move to ESPN at noon and to ESPN2 at 7 p.m. Streaming coverage will be on SiriusXM from 7 a.m. until 9 p.m.

On Friday, main coverage will be featured on ESPN+ from 7 a.m. until noon before moving to ESPN from noon until 8 p.m.

Weekend coverage will follow a slightly different schedule. The main broadcast will begin at 8 a.m. on Saturday on ESPN+, then move to ESPN from 10 a.m. until 1 p.m. After that, CBS will cover the event until 7 p.m. Sunday’s final round will follow the same schedule. On both Saturday and Sunday, streaming coverage will run on SiriusXM radio from 9 a.m. until 9 p.m.; its all-access package, which includes sports, costs $25.99 per month. Paramount+ will also stream the CBS afternoon coverage on both days.

If the main competition itself doesn’t scratch your itch, you can tune in early to catch the pre-championship conferences, which begin on Monday, May 11. Find the full schedule on the PGA Championship’s site. 

Tron in Trouble? ‘Glaring Divergence’ Flagged Behind TRX’s Latest Surge




Tron’s (TRX) performance so far in 2026 has been solid. In the past five months alone, the crypto asset has climbed more than 23%. Despite this, new data suggests that it faces correction risks.

According to CryptoQuant, TRX is showing a “glaring divergence” between its price and on-chain activity despite recently climbing back toward the $0.35 level.

Lack of Fundamental Support

The analytics platform found that while TRX has posted strong price gains over the past month, rising 10%, the network’s “Tokens Transferred (Total)” metric has moved sharply in the opposite direction.

Data revealed that the total volume of transferred tokens declined from nearly 17.3 billion to around 12.2 billion during the same period, even as the asset continued to rally. CryptoQuant said this disconnect has sparked concerns about the sustainability of TRX’s current upward momentum, as healthy price increases are typically accompanied by stronger network usage and utility.

The firm described the divergence as a sign that the latest rally may be driven more by speculation or token hoarding than by genuine user activity on the Tron network. It further warned that the absence of stronger transactional support could leave the $0.35 price level vulnerable if buying pressure weakens. This, in turn, could potentially increase the risk of a correction in the near term.

Justin Sun’s Troubles

TRX’s price has been largely immune to the growing dispute surrounding Tron founder Justin Sun and the Trump-linked crypto project World Liberty Financial, even as the conflict escalated into multiple lawsuits and public accusations. The tensions began in mid-April after WLFI proposed converting more than 62 billion locked tokens into a fixed vesting structure, while holders who rejected the terms risked having their assets remain locked indefinitely.

Sun described the proposal as coercive and argued that dissenting token holders were effectively being punished. He also alleged that his own WLFI tokens, which represented around 4% of the voting power, had been frozen, preventing him from participating in governance decisions. WLFI was also accused of operating through centralized controls hidden behind a decentralized governance structure, and the Tron founder claimed that anonymous parties could freeze assets and override decisions.

Days later, Sun filed a lawsuit in California seeking restoration of his voting rights and token access. WLFI, on the other hand, rejected the allegations and accused Sun of misconduct and spreading false claims. WLFI filed a defamation lawsuit against Sun in Florida this month for allegedly orchestrating a smear campaign against the project and its backers.

The post Tron in Trouble? ‘Glaring Divergence’ Flagged Behind TRX’s Latest Surge appeared first on CryptoPotato.

How to Build Technical Analysis and Backtesting Workflow with pandas-ta-classic, Strategy Signals, and Performance Metrics



In this tutorial, we show how to use pandas-ta-classic to build a complete technical analysis and trading strategy workflow. We start by installing the required libraries, downloading historical OHLCV stock data with yfinance, cleaning the returned data structure, and inspecting the available indicator categories inside the library. We then calculate popular indicators such as SMA, EMA, RSI, ATR, MACD, Bollinger Bands, candlestick patterns, and a custom distance-from-EMA feature. Finally, we combine daily and weekly signals, create entry and exit logic, backtest the strategy with shifted positions, calculate performance metrics, run a parameter sweep, and visualize price action, RSI behavior, trade signals, and equity curves in a structured way.

import subprocess, sys
def _pip(pkgs):
   subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", *pkgs])
_pip(["pandas-ta-classic", "yfinance", "matplotlib"])
import numpy as np
import pandas as pd
import yfinance as yf
import pandas_ta_classic as ta
import matplotlib.pyplot as plt
from itertools import product
pd.set_option("display.max_columns", 80)
pd.set_option("display.width", 200)
TICKER, START, END = "AAPL", "2018-01-01", "2024-12-31"
raw = yf.download(TICKER, start=START, end=END, auto_adjust=True, progress=False)
if isinstance(raw.columns, pd.MultiIndex):
   raw.columns = raw.columns.get_level_values(0)
df = (raw.rename(columns=str.lower)
        [["open", "high", "low", "close", "volume"]]
        .dropna()
        .copy())
df.index.name = "date"
print(f"[data] {TICKER}: {len(df)} rows  "
     f"{df.index.min().date()} → {df.index.max().date()}")
print("[lib]  Categories:", list(ta.Category.keys()))
for cat in ("momentum", "overlap", "trend", "volatility", "volume"):
   names = ta.Category.get(cat, [])
   print(f"[lib]  {cat:<11} ({len(names):>3}): "
         f"{', '.join(names[:8])}{' ...' if len(names) > 8 else ''}")

We install the required packages and import the main libraries needed for technical analysis, data handling, plotting, and parameter combinations. We download Apple’s historical OHLCV data using yfinance, clean the returned DataFrame, and convert column names to lowercase for easier processing. We also review the available pandas-ta-classic indicator categories to understand which technical indicators we can use in the tutorial.

df.ta.sma(length=20,  append=True)
df.ta.sma(length=50,  append=True)
df.ta.ema(length=200, append=True)
df.ta.rsi(length=14,  append=True)
df.ta.atr(length=14,  append=True)
df.ta.macd(append=True)
df.ta.bbands(length=20, std=2.0, append=True)
my_strategy = ta.Strategy(
   name="AdvancedDemo",
   description="Trend + momentum + volume + volatility in one shot",
   ta=[
       {"kind": "hma",   "length": 30},
       {"kind": "adx",   "length": 14},
       {"kind": "aroon", "length": 14},
       {"kind": "stoch", "k": 14, "d": 3},
       {"kind": "obv"},
       {"kind": "mfi",   "length": 14},
       {"kind": "willr", "length": 14},
       {"kind": "cci",   "length": 20},
       {"kind": "kc",    "length": 20, "scalar": 2},
   ],
)
df.ta.strategy(my_strategy)
print(f"[strat] DataFrame now has {df.shape[1]} columns")
df["dist_ema200_pct"] = (df["close"] / df["EMA_200"] - 1.0) * 100
df.ta.cdl_doji(append=True)
df.ta.cdl_inside(append=True)
doji_col = next((c for c in df.columns if c.startswith("CDL_DOJI")), None)
print(f"[cdl]  Doji days detected: {int((df[doji_col] == 100).sum())}")

We apply several commonly used technical indicators directly through the .ta DataFrame extension. We calculate moving averages, RSI, ATR, MACD, Bollinger Bands, and then run a custom multi-indicator strategy using ta.Strategy. We also create a custom EMA-distance feature and detect candlestick patterns such as Doji and Inside candles.

weekly = (df[["open", "high", "low", "close", "volume"]]
         .resample("W-FRI")
         .agg({"open":"first","high":"max","low":"min","close":"last","volume":"sum"})
         .dropna())
weekly["RSI_W_14"] = ta.rsi(weekly["close"], length=14)
df = df.join(weekly[["RSI_W_14"]])
df["RSI_W_14"] = df["RSI_W_14"].ffill().shift(1)
trend     = df["SMA_20"] > df["SMA_50"]
mom_cross = (df["RSI_14"] > 50) & (df["RSI_14"].shift(1) <= 50)
mtf_ok    = df["RSI_W_14"] > 50
exit_cond = (df["RSI_14"] < 45) | (df["SMA_20"] < df["SMA_50"])
position = np.zeros(len(df), dtype=int)
in_pos = False
for i in range(len(df)):
   if not in_pos and trend.iat[i] and mom_cross.iat[i] and bool(mtf_ok.iat[i]):
       in_pos = True
   elif in_pos and exit_cond.iat[i]:
       in_pos = False
   position[i] = 1 if in_pos else 0
df["pos"] = position
df["ret"]       = df["close"].pct_change().fillna(0.0)
df["strat_ret"] = df["pos"].shift(1).fillna(0) * df["ret"]

We create a weekly version of the daily OHLCV data and calculate weekly RSI for higher-timeframe confirmation. We join the weekly RSI back to the daily DataFrame and shift it to avoid using future information in our trading logic. We then define the trend, momentum, multi-timeframe filter, exit condition, position state, daily returns, and strategy returns.
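The shift-by-one-bar idea deserves a tiny illustration. In this toy sketch (made-up prices and a hypothetical signal, not tied to the strategy above), trading on the same bar that generated the signal would use information that was not available at the time; `shift(1)` delays the position by one bar to avoid that look-ahead bias:

```python
import pandas as pd

# Toy series: the signal fires on day 1 (index 1). Without shift(1) the
# backtest would "trade" on the very bar that produced the signal.
close = pd.Series([100.0, 101.0, 103.0, 102.0])
ret = close.pct_change().fillna(0.0)

sig = pd.Series([0, 1, 1, 0])           # hypothetical entry/exit signal
strat = sig.shift(1).fillna(0) * ret    # act one bar AFTER the signal
```

Here the signal day itself earns nothing (`strat[1] == 0`); the strategy only captures the following bar's return, which is how a real trader entering at the next open would experience it.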

def perf(returns, ppy=252):
   r = returns.dropna()
   if len(r) == 0 or r.std() == 0:
       return {}
   cum  = (1 + r).cumprod()
   cagr = cum.iloc[-1] ** (ppy / len(r)) - 1
   vol  = r.std() * np.sqrt(ppy)
   sharpe = (r.mean() / r.std()) * np.sqrt(ppy)
   downside = r[r < 0].std() * np.sqrt(ppy)
   sortino  = (r.mean() * ppy) / downside if downside > 0 else np.nan
   mdd  = (cum / cum.cummax() - 1).min()
   nz   = r[r != 0]
   win  = (nz > 0).mean() if len(nz) else 0.0
   return {"CAGR": cagr, "Vol": vol, "Sharpe": sharpe,
           "Sortino": sortino, "MaxDD": mdd, "WinRate": win,
           "FinalEquity": cum.iloc[-1]}
summary = pd.DataFrame({
   "Buy & Hold": perf(df["ret"]),
   "Strategy":   perf(df["strat_ret"]),
}).T
print("\n[perf] ----------------------------------------")
print(summary.round(4))
def quick_bt(prices, fast, slow, rsi_thr=50):
   if fast >= slow:
       return None
   d = prices.copy()
   d["SMAf"] = ta.sma(d["close"], length=fast)
   d["SMAs"] = ta.sma(d["close"], length=slow)
   d["RSI"]  = ta.rsi(d["close"], length=14)
   sig = ((d["SMAf"] > d["SMAs"]) & (d["RSI"] > rsi_thr)).astype(int)
   sret = sig.shift(1).fillna(0) * d["close"].pct_change().fillna(0)
   return perf(sret)
prices = df[["open", "high", "low", "close", "volume"]]
rows = []
for fast, slow in product([5, 10, 20, 30], [50, 100, 150, 200]):
   m = quick_bt(prices, fast, slow)
   if m:
       rows.append({"fast": fast, "slow": slow, **m})
sweep = (pd.DataFrame(rows)
          .sort_values("Sharpe", ascending=False)
          .reset_index(drop=True))
print("\n[sweep] Top 5 (fast SMA, slow SMA) by Sharpe:")
print(sweep.head().round(4))

We define a performance function that calculates key metrics, including CAGR, volatility, Sharpe ratio, Sortino ratio, maximum drawdown, win rate, and final equity. We compare the strategy performance against a simple buy-and-hold baseline to see whether our signal logic adds value. We also run a parameter sweep across different fast and slow SMA combinations and rank the results by Sharpe ratio.

entries = df.index[(df["pos"].diff() == 1)]
exits   = df.index[(df["pos"].diff() == -1)]
fig, (ax1, ax2, ax3) = plt.subplots(
   3, 1, figsize=(13, 10), sharex=True,
   gridspec_kw={"height_ratios": [3, 1, 2]},
)
ax1.plot(df.index, df["close"],  lw=1.1, color="black", label="Close")
ax1.plot(df.index, df["SMA_20"], lw=0.9, label="SMA 20")
ax1.plot(df.index, df["SMA_50"], lw=0.9, label="SMA 50")
bbu, bbl = "BBU_20_2.0", "BBL_20_2.0"
if bbu in df and bbl in df:
   ax1.fill_between(df.index, df[bbl], df[bbu], alpha=0.12, label="Bollinger 20,2")
ax1.scatter(entries, df.loc[entries, "close"], marker="^", s=70,
           color="green", zorder=5, label="Entry")
ax1.scatter(exits,   df.loc[exits,   "close"], marker="v", s=70,
           color="red",   zorder=5, label="Exit")
ax1.set_title(f"{TICKER} — price, MAs, Bollinger, signals")
ax1.legend(loc="upper left"); ax1.grid(alpha=0.3)
ax2.plot(df.index, df["RSI_14"], lw=0.9, label="RSI 14")
ax2.axhline(70, color="red",   ls="--", lw=0.6)
ax2.axhline(30, color="green", ls="--", lw=0.6)
ax2.set_title("RSI 14"); ax2.legend(loc="upper left"); ax2.grid(alpha=0.3)
ax3.plot(df.index, (1 + df["ret"]).cumprod(),       lw=1.1, label="Buy & Hold")
ax3.plot(df.index, (1 + df["strat_ret"]).cumprod(), lw=1.1, label="Strategy")
ax3.set_title("Equity curves ($1 start)")
ax3.legend(loc="upper left"); ax3.grid(alpha=0.3)
plt.tight_layout(); plt.show()
print("\nTweak TICKER, the Strategy list, or the sweep grid to keep exploring.")

We identify the strategy’s entry and exit points from changes in the position column. We then create a three-panel chart showing price action with moving averages and Bollinger Bands, RSI behavior, and equity curves for both buy-and-hold and the strategy. We use these visuals to understand where trades happen, how momentum behaves, and how the strategy performs over time.

In conclusion, we built an end-to-end technical analysis pipeline that shows how pandas-ta-classic can support both quick indicator generation and more advanced strategy development. We used the library to compute individual indicators and also to create custom strategies, add multi-timeframe confirmation, reduce look-ahead bias, evaluate returns, and compare the strategy against buy-and-hold performance. We also ran a simple parameter sweep to understand how different moving-average combinations affect results and to identify stronger configurations. Altogether, this gives us a foundation for experimenting with technical indicators, trading signals, backtesting logic, performance evaluation, and financial data visualization.


Check out the Codes with Notebook.


Kenya summit represents 'demarcation point': France 'wants to do business' with African continent





Nadia Massih is pleased to welcome Dr. Douglas Yates, author, political scientist, and professor at the American Graduate School of International Relations and Diplomacy. President Emmanuel Macron is co-hosting the Africa Forward Summit alongside Kenyan President William Ruto, entering what Dr. Yates describes as “a demarcation point.” After years of mounting hostility in West Africa, military ruptures in Mali, Burkina Faso and Niger, and the growing presence of Russia and China across the continent, Paris is seeking to redefine its place beyond the historic confines of Françafrique.
