AI & ML interests

None defined yet.

Recent Activity

Sri-Vigneshwar-DJ 
posted an update about 22 hours ago
🦅 Introducing Hawky AI H1 Mini 4B: A Domain-Specific Model for Performance Marketing

Hey HuggingFace community! 👋

We're excited to share our first open-source release: **Hawky AI H1 Mini 4B Experimental** - a Gemma 3 4B model fine-tuned specifically for Meta advertising and performance marketing strategy.

🎯 Why We Built This

At [Hawky.ai](https://hawky.ai), we build AI-powered creative intelligence tools for performance marketers. We work with major agencies (WPP, Madison, GroupM) and brands (TVS Motors, Tanishq, Bajaj Finserv) on campaign optimization.

We wanted to explore: Can a small, domain-specific model provide expert-level guidance on performance marketing?

Specifically, we focused on Meta's Andromeda algorithm - the AI system that now powers ad delivery across Facebook and Instagram. Understanding Andromeda is crucial for modern media buying, but the knowledge is scattered and constantly evolving.

🧠 What Makes This Different

Chain-of-Thought Reasoning
The model doesn't just answer - it **thinks through problems** step-by-step.

Sri-Vigneshwar-DJ/hawky-ai-h1-mini-4b-experimental
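
If you want to try it, here is a minimal sketch using the standard transformers generation API. The prompt and generation settings are illustrative assumptions, and we assume the checkpoint ships a chat template and works with `AutoModelForCausalLM` like other text fine-tunes:

```python
# Minimal sketch: querying the model with transformers.
# Prompt and generation settings are illustrative, not official defaults.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Sri-Vigneshwar-DJ/hawky-ai-h1-mini-4b-experimental"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user",
             "content": "How should I restructure my Meta campaigns for Andromeda?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```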
Sri-Vigneshwar-DJ 
posted an update 5 days ago
Domain-specific reasoning is crucial when working with big-budget campaigns on Meta. That's why we've launched an experimental Chain-of-Thought (CoT) reasoning model for critical thinking, tailored to campaign structuring and optimization under Meta's Andromeda algorithm.

Sri-Vigneshwar-DJ/hawky-ai-h1-mini-1b-experimental
Sri-Vigneshwar-DJ 
posted an update 7 days ago
The recent update to Meta's ad algorithm is very difficult to crack, and even the latest models struggle to keep up with it. To address this, we've created a small experimental dataset for fine-tuning models to better tackle Meta's Andromeda algorithm: Sri-Vigneshwar-DJ/hawky-ai-andromeda-dataset
Sri-Vigneshwar-DJ 
posted an update 11 days ago
Aurelien-Morgan 
posted an update about 1 month ago
flozi00 
posted an update about 1 month ago
We have covered Tensor Parallelism for slicing matrices and Pipeline Parallelism for stacking layers. But what if your model isn't just deep or wide—it's a sprawling Mixture-of-Experts (MoE) architecture like Mixtral or DeepSeek, with trillions of parameters that are mostly idle per token?

Replicating those experts wastes VRAM. Slicing them with TP wastes bandwidth. The solution is Expert Parallelism (EP), which distributes the experts themselves across GPUs and routes tokens to wherever their "chosen" expert lives.

The hardware catch? It is not matrix splitting or pipeline bubbles—it's the "Router's Dilemma." You must shuffle massive volumes of tokens across the cluster using All-to-All communication, and any imbalance can leave expensive GPUs idle.

My latest guide dives into the mechanics of EP and why the interconnect becomes the ultimate bottleneck.

In this breakdown, we explore:

The Token Routing Lifecycle
A four-step hardware flow: Local routing to pick experts, Dispatch (All-to-All shuffle), Expert computation on the "home" GPU, and Combine (another All-to-All to return results).

The All-to-All Primitive
Unlike the ring-based syncs in TP, All-to-All creates a dense mesh of personalized data transfers. We compare it to All-Reduce and show why uneven token distribution (load imbalance) causes network congestion and compute skew.

Load Balancing: The Hardware Nightmare
If one expert gets 90% of the tokens, its GPU bottlenecks while others stall. We discuss mitigation strategies like token dropping and auxiliary losses to keep utilization high.

The article includes a raw PyTorch implementation of an EP layer using torch.distributed.all_to_all_single to reveal exactly how the data shuffles and where the stalls happen.
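
As a taste of the mechanics, here is a stripped-down sketch of the dispatch/compute/combine pattern. This is my own illustration rather than the article's code, assuming an initialized NCCL process group, top-1 routing, and exactly one expert per rank; `expert_fn` stands in for the local expert FFN:

```python
# Conceptual Expert-Parallel step: sort tokens by destination, exchange
# counts, shuffle tokens (All-to-All), compute locally, shuffle back.
import torch
import torch.distributed as dist

def ep_forward(tokens, expert_ids, expert_fn):
    world_size = dist.get_world_size()

    # 1) Local routing: group tokens by destination rank (= expert id here).
    order = torch.argsort(expert_ids)
    tokens = tokens[order]
    send_counts = torch.bincount(expert_ids, minlength=world_size)

    # 2) Exchange counts so each rank knows how much it will receive.
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)

    # 3) Dispatch: the first All-to-All moves tokens to their expert's home GPU.
    recv_tokens = tokens.new_empty(int(recv_counts.sum()), tokens.shape[-1])
    dist.all_to_all_single(recv_tokens, tokens,
                           output_split_sizes=recv_counts.tolist(),
                           input_split_sizes=send_counts.tolist())

    # 4) Expert computation on the "home" GPU.
    expert_out = expert_fn(recv_tokens)

    # 5) Combine: a second All-to-All returns results to the originating ranks.
    out = torch.empty_like(tokens)
    dist.all_to_all_single(out, expert_out,
                           output_split_sizes=send_counts.tolist(),
                           input_split_sizes=recv_counts.tolist())

    # Undo the local sort so outputs align with the original token order.
    unsort = torch.empty_like(order)
    unsort[order] = torch.arange(order.numel(), device=order.device)
    return out[unsort]
```

Note how both All-to-All calls depend on runtime token counts; this is exactly where load imbalance turns into network congestion and compute skew.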

Read the full hardware-centric guide here:
https://flozi.net/en/guides/ai/scaling/expert_parallel
takarajordan 
posted an update about 1 month ago
flozi00 
posted an update about 1 month ago
We recently discussed how Tensor Parallelism slices matrices to reduce latency within a single node. But what happens when you need to scale beyond that, where the bandwidth drops?

That is where Pipeline Parallelism (PP) takes over.

Instead of slicing the operation, PP slices the model depth. It turns your GPU cluster into an assembly line: GPU 0 handles layers 1-12, GPU 1 handles 13-24, and so on.

The hardware challenge here isn't the interconnect speed—it is the "Pipeline Bubble." In a naive setup, expensive H100s sit idle for most of the cycle waiting for data to flow through the chain.

My latest guide breaks down the scheduling strategies used to minimize this idle silicon time.

In this deep dive, we cover:

The Hardware Mechanics: Vertical Slicing
Unlike TP which requires "chatty" All-Reduce operations, PP relies on lightweight Point-to-Point (Send/Recv) communication. This makes it the only viable strategy for crossing node boundaries over Ethernet or InfiniBand.

Fighting the Bubble: 1F1B vs. GPipe
We analyze the scheduling algorithms that keep the GPUs fed:

GPipe: The "flush and fill" approach. Simple, but memory-intensive.
1F1B (One-Forward-One-Backward): The industry standard. By interleaving forward and backward passes, we aggressively free up memory and reduce the bubble size.
The Math of Efficiency
The "Bubble" is a mathematical inevitability. We look at the efficiency formula
M+N−1
M

to understand why you need massive global batch sizes to make PP worth the effort.

The article includes a conceptual PyTorch implementation of the 1F1B state machine to illustrate exactly how the data is handed off between stages.
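
As a toy illustration of the scheduling order (my own pure-Python sketch, not the article's implementation; the warmup formula follows the common Megatron-style convention):

```python
# Non-interleaved 1F1B: warmup forwards, a steady one-forward-one-backward
# phase, then a cooldown of the remaining backwards.
def one_f_one_b_schedule(stage, num_stages, num_microbatches):
    """Return the op sequence for one pipeline stage, e.g. ['F0', 'F1', 'B0', ...]."""
    warmup = min(num_stages - stage - 1, num_microbatches)
    steady = num_microbatches - warmup
    ops, fwd, bwd = [], 0, 0

    # Warmup: fill the pipeline with forwards only.
    for _ in range(warmup):
        ops.append(f"F{fwd}"); fwd += 1
    # Steady state: one forward, one backward, freeing activations early.
    for _ in range(steady):
        ops.append(f"F{fwd}"); fwd += 1
        ops.append(f"B{bwd}"); bwd += 1
    # Cooldown: drain the remaining backwards.
    for _ in range(warmup):
        ops.append(f"B{bwd}"); bwd += 1
    return ops

for stage in range(4):
    print(stage, one_f_one_b_schedule(stage, num_stages=4, num_microbatches=8))
```

Printing the schedules shows why 1F1B frees memory early: each stage holds at most num_stages − stage in-flight activations instead of all M, as GPipe's "flush and fill" would.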

Read the full breakdown here:
https://flozi.net/en/guides/ai/scaling/pipeline_parallel
flozi00 
posted an update about 1 month ago
When models get too large for a single GPU, simply stacking layers vertically (Pipeline Parallelism) isn't always the answer. Sometimes, you need to slice the matrices themselves.

My latest guide breaks down the hardware mechanics of Tensor Parallelism (TP). We look at how to shard individual operations across devices to make a cluster function as one massive accelerator.

This isn't high-level theory—it is a look at the bare metal implementation.

Here is what is covered in the deep dive:

The Strategies: Column vs. Row Parallelism
We analyze how to split weight matrices (W) and inputs (X).

Column-Linear: Splits weights by columns. Requires an All-Gather to reconstruct the output.
Row-Linear: Splits weights by rows. Requires an All-Reduce to sum partial results.
The "Megatron-LM" Optimization
Efficiency comes from minimizing communication. By sandwiching the non-linearity (GeLU) between a Column-Parallel layer and a Row-Parallel layer, we can skip synchronization entirely during the activation phase. This cuts communication events by 50% per block.

The Hardware Reality: The Bandwidth Wall
In TP, the dist.all_reduce operation sits on the critical path. The CUDA cores effectively stall while waiting for the ring-reduce to finish.

Intra-Node: Works well because NVLink provides enough bandwidth to hide this latency.
Inter-Node: Fails at scale. Standard networking (Ethernet/InfiniBand) is too slow for the high-frequency syncs required by TP.
The article includes a raw PyTorch implementation using torch.distributed primitives to show exactly where the data moves and where the bottlenecks sit.
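
To make that concrete, here is a minimal sketch of the two sharding strategies (my own illustration, assuming an initialized process group and weights already sharded per rank; shapes are illustrative):

```python
# Column-parallel: shard the output dimension, All-Gather to reconstruct.
# Row-parallel: shard the input dimension, All-Reduce to sum partials.
import torch
import torch.distributed as dist

def column_parallel_linear(x, w_col_shard):
    # Each rank computes a slice of the output columns...
    y_local = x @ w_col_shard                          # [B, out/world]
    # ...and an All-Gather reconstructs the full output.
    shards = [torch.empty_like(y_local) for _ in range(dist.get_world_size())]
    dist.all_gather(shards, y_local)
    return torch.cat(shards, dim=-1)                   # [B, out]

def row_parallel_linear(x_shard, w_row_shard):
    # Each rank holds a slice of the input features and the matching
    # weight rows, producing a *partial* full-size output...
    y_partial = x_shard @ w_row_shard                  # [B, out]
    # ...and an All-Reduce sums the partials. This call sits on the
    # critical path: CUDA cores stall until the ring-reduce finishes.
    dist.all_reduce(y_partial, op=dist.ReduceOp.SUM)
    return y_partial
```

In the Megatron-LM sandwich, the column-parallel output skips the All-Gather entirely: GeLU is applied element-wise to the local shard, which then feeds the row-parallel layer, so only the final All-Reduce remains.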

Read the full hardware-centric guide here:
https://flozi.net/en/guides/ai/scaling/tensor_parallel
takarajordan 
posted an update about 1 month ago
Two weeks ago I had an engaging discussion with locals in Cockermouth about AI and the broader industry, a reminder that hearing candid perspectives beyond our professional circles is invaluable and something anyone working full-time in this field should make time for.

Thank you!
flozi00 
posted an update about 2 months ago
Running large language models efficiently is more than just raw GPU power. The latest guide breaks down the essential math to determine if your LLM workload is compute-bound or memory-bound.

We apply these principles to a real-world example: Qwen's 32B parameter model on the new NVIDIA RTX PRO 6000 Blackwell Edition.

In this guide, you will learn how to:

Calculate your GPU's operational intensity (Ops:Byte Ratio)
Determine your model's arithmetic intensity
Identify whether your workload is memory-bound or compute-bound
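
As a back-of-the-envelope illustration of the method (the GPU numbers below are placeholders, not the Blackwell card's actual specs):

```python
# Roofline check for single-batch decoding. Substitute the real peak
# FLOPs and bandwidth from your card's spec sheet.
peak_flops = 1.0e15          # placeholder: peak dense FLOP/s of the GPU
mem_bandwidth = 1.8e12       # placeholder: memory bandwidth in bytes/s
ops_to_byte = peak_flops / mem_bandwidth
print(f"GPU ops:byte ratio ~= {ops_to_byte:.0f} FLOPs per byte")

# Batch-1 decode: every generated token reads all weights once
# (~params * bytes_per_param bytes) and does ~2 FLOPs per parameter.
params = 32e9                # e.g. a 32B-parameter model
bytes_per_param = 2          # FP16/BF16 weights
intensity = (2 * params) / (params * bytes_per_param)   # = 1 FLOP/byte
print(f"Decode arithmetic intensity ~= {intensity:.1f} FLOPs per byte")

# If model intensity < GPU ops:byte ratio, the workload is memory-bound.
print("memory-bound" if intensity < ops_to_byte else "compute-bound")
```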

Read the full guide here: https://flozi.net/en/guides/ai/llm-inference-math
flozi00 
posted an update about 2 months ago
takarajordan 
posted an update 2 months ago
🌞 LOVABLE IS CRACKED

Built a golden hour tracker in under 15 minutes with Lovable: uses your phone’s Geolocation API, the SunCalc library, and runs fully client-side with no servers. https://goldenhour.404missing.link
flozi00 
posted an update 2 months ago
I just got asked about Blackwell systems versus Grace Blackwell systems: what's the difference, and how much of a performance gap is there between them?

https://flozi.net/en/hardware/nvidia/benchmarks/b200-vs-gb200-efficiency-comparison

Here's a summary of the key points from the article:

GB200 (Grace Blackwell) is a Superchip: It integrates a Grace CPU and two Blackwell GPUs into a single package.
B200 is a GPU-only module: It's designed to be paired with x86 or ARM CPUs in more traditional server setups.


Performance and Efficiency:

Based on MLPerf Training v5.0 benchmarks, the article concludes:

GB200 systems are approximately 42% more efficient than B200 systems on average. This is especially true in large-scale deployments (100+ GPUs), where the GB200's integrated design and high-speed NVLink interconnect provide a significant advantage.

In smaller, single-node systems (e.g., 8 GPUs), the performance difference is much smaller, around 10-15%.


Use Cases:

Choose GB200 for large-scale AI clusters, training massive models, and when maximum efficiency is the top priority.

Choose B200 for smaller deployments, when you need the flexibility to choose your own CPU, or for mixed AI and HPC workloads.
flozi00 
posted an update 2 months ago
A few weeks ago I decided it was time to leave LinkedIn.
Things had gone quiet around my open-source activities over the last year, so I felt something had to change.

That's why my focus will shift to sharing experiences and insights about hardware, drivers, kernels, and Linux. I won't post about how to use models, build agents, or do prompting. I want to write about the deeper layers that the current hypes are built on.

I will start posting summaries of my articles here on the Hub.

English version:
https://flozi.net/en

German translated version:
https://flozi.net/de

Feel free to reach me if you want to read something specific.
s3nh 
posted an update 3 months ago
EduHelp with more empathy, based on a model fine-tuned on psychotherapeutic preferences, has just landed.

Beck-8B as the base model, 13,000 steps on an educational dataset.
Time to go further and build more 🥰
s3nh/EduHelp_Beck_8B
Thanks to @basilic_ai for computations <3
s3nh 
posted an update 3 months ago
Just tried to create an educational assistant for younger people who may struggle to visualise 'what this sorcery is all about'.
It's the first of my spare-time projects: SFT on Qwen3-8B.

EduHelper is a child-friendly tutoring assistant fine-tuned from the Qwen3-8B base model using parameter-efficient fine-tuning (PEFT) with LoRA on the ajibawa-2023/Education-Young-Children dataset.

s3nh/EduHelp-8B
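
For anyone curious what that PEFT/LoRA setup roughly looks like, here is a minimal sketch (the hyperparameters are illustrative guesses, not the actual EduHelp training config):

```python
# Minimal LoRA sketch for a Qwen3-8B causal LM. Hyperparameters are
# illustrative assumptions, not the values used to train EduHelp.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B")
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the low-rank adapters train
```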

Glad to share my work, have a wonderful day!
Sri-Vigneshwar-DJ 
posted an update 3 months ago
Do you think domain-specific embedding fine-tuners are needed?
I've been working with embeddings for marketing use cases and noticed something: most embedding models don't capture marketing concepts very well, because they're trained on general-purpose data.
The Issue I'm Seeing
When I search marketing content with general embeddings:

"organic growth" returns farming articles
"conversion funnel" matches industrial equipment
"brand lift" doesn't connect to campaign effectiveness
Marketing jargon like CAC, ROAS, and CTR isn't properly understood

My Question
Do you think domain-specific embeddings are needed for marketing?
Some thoughts:

Marketing has its own vocabulary and concept relationships
General models trained on Wikipedia/web crawl miss these nuances
But is fine-tuning worth the effort vs just using more retrieval tricks?

Quick Example
I fine-tuned all-mpnet-base-v2 on ~1000 marketing concept pairs and saw 15-20% better retrieval accuracy. But I'm curious:

Has anyone else tried this for marketing or other domains?
When do you think domain-specific embeddings are actually necessary vs overkill?
Are there better approaches I'm missing?

https://huggingface.co/blog/Sri-Vigneshwar-DJ/why-your-marketing-rag-system-needs-domain-specifi
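
For reference, here is a minimal sketch of the kind of pair-based fine-tune mentioned above (the example pairs and hyperparameters are illustrative, not the actual training setup):

```python
# Pair-based embedding fine-tune with sentence-transformers.
# Pairs and hyperparameters are illustrative placeholders.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
pairs = [  # domain concept pairs: anchor, positive
    InputExample(texts=["organic growth", "unpaid audience growth from content and SEO"]),
    InputExample(texts=["conversion funnel", "stages from ad impression to purchase"]),
    InputExample(texts=["ROAS", "return on ad spend"]),
]
loader = DataLoader(pairs, shuffle=True, batch_size=16)
# MultipleNegativesRankingLoss pulls each pair together and pushes
# apart the other in-batch examples - a good fit for concept pairs.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
model.save("mpnet-marketing-embeddings")
```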
Sri-Vigneshwar-DJ 
posted an update 3 months ago
🚀 Exciting News! We've released a Performance Marketing Expert Dataset from Hawky.ai [www.hawky.ai]

This dataset empowers AI models with cutting-edge strategies for Meta, Google Ads, and TikTok campaigns. It includes:
1. Multi-platform strategies for e-commerce, DTC, B2B, and more
2. Creative optimization and audience targeting insights
3. ROI-driven recommendations based on 2025 best practices

Sri-Vigneshwar-DJ/Performance-Marketing-Data
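
To get started, a quick sketch of loading it with the datasets library (the split name and field access are assumptions; check the dataset card for the actual schema):

```python
# Load the dataset from the Hub and peek at its contents.
from datasets import load_dataset

ds = load_dataset("Sri-Vigneshwar-DJ/Performance-Marketing-Data", split="train")
print(ds)      # inspect features and row count
print(ds[0])   # peek at the first example
```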