AI » LLM Model Parameters

Context Size, for real

In general they are all terrible at long context, except Google Gemini Pro, but also pay particular attention to how the prompt is formulated. Two main tests.

NVIDIA/RULER: This repo contains the source code for RULER: What’s the Real Context Size of Your Long-Context Language Models?

These people claim their model has the best context: Long Context, But Actually. They ran their tests through RULER and included Sonnet, and it turns out Sonnet is excellent too! Their models are available on OpenRouter: AI21 | OpenRouter

The Needle In a Haystack Test: Evaluating the Performance of LLM RAG Systems - Arize AI, and the commentary on it: Long context prompting for Claude 2.1 \ Anthropic

Long Context RAG Performance of LLMs | Databricks Blog: this test has a table at the end where Claude 3.5 and GPT-4o are improved, but still terrible. Still, pay attention to the second text above and Anthropic's comment on what the prompt should look like.
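To make the needle-in-a-haystack idea concrete, here is a minimal sketch of how such a test prompt can be built: a known fact (the needle) is placed at a chosen depth inside long filler text, and the model is asked to retrieve it. The function names, the prompt wording, and the substring-based scoring are assumptions for illustration, not the code used by RULER, Arize, or Anthropic.

```python
def build_haystack(filler: str, needle: str, depth: float, total_chars: int) -> str:
    """Repeat `filler` up to `total_chars`, then insert `needle` at relative `depth`."""
    if not filler:
        raise ValueError("filler must be non-empty")
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    repeats = total_chars // len(filler) + 1
    haystack = (filler * repeats)[:total_chars]
    cut = int(len(haystack) * depth)  # 0.0 = start of context, 1.0 = end
    return haystack[:cut] + " " + needle + " " + haystack[cut:]


def build_prompt(context: str, question: str) -> str:
    """Hypothetical prompt layout: context first, question last."""
    return f"{context}\n\nQuestion: {question}\nAnswer using only the text above."


def is_retrieved(answer: str, expected: str) -> bool:
    """Crude scoring: the expected string must appear in the model's answer."""
    return expected.lower() in answer.lower()


needle = "The secret code is 7421."
context = build_haystack("The sky is blue. ", needle, depth=0.5, total_chars=1000)
prompt = build_prompt(context, "What is the secret code?")
```

Sweeping `depth` and `total_chars` over a grid and recording `is_retrieved` for each cell gives the kind of depth-versus-length picture these tests report.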

Tokens

The number of tokens = length of input + output. Tokens are counted as the prompt plus the result, so everything together.
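The arithmetic is simple enough to sketch. The helper names below are hypothetical, and the 8192-token window is only an illustrative value:

```python
def fits_context(prompt_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """Input and output share one budget: both must fit in the window together."""
    return prompt_tokens + max_output_tokens <= context_window


def max_output_budget(prompt_tokens: int, context_window: int) -> int:
    """Tokens left for the answer once the prompt has been counted."""
    return max(0, context_window - prompt_tokens)


window = 8192  # illustrative window size, an assumption for this example
print(fits_context(7000, 1000, window))
print(max_output_budget(7000, window))
```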

GPT-3 (text-davinci-003) has a 4K-token context window, while the original GPT-4 has 8K.

The limit is currently a technical limitation, but there are often creative ways to solve problems within the limit, e.g. condensing your prompt, breaking the text into smaller pieces, etc. So in a long conversation the chat summarizes earlier content, splits it into parts and so on, but the result still has to fit within the context size.
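Two of these workarounds can be sketched in a few lines: dropping the oldest chat messages and splitting a long text into pieces. This is a rough illustration, not how any particular chat product does it. The function names are hypothetical, and the token estimate assumes about 4 characters per token, which is only a rule of thumb.

```python
def rough_tokens(text: str) -> int:
    """Very rough estimate, assuming ~4 characters per token."""
    return max(1, len(text) // 4)


def trim_history(messages: list[str], budget: int) -> list[str]:
    """Keep the most recent messages that fit in `budget` tokens, oldest dropped first."""
    kept: list[str] = []
    used = 0
    for message in reversed(messages):
        cost = rough_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


def split_into_chunks(text: str, budget: int) -> list[str]:
    """Break a long text into pieces of at most roughly `budget` tokens each."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    max_chars = budget * 4
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

A real system would more likely summarize the dropped messages instead of discarding them, and would split on sentence or paragraph boundaries, not at fixed character offsets.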

One token is generally ~4 characters of text for common English text. For code, it is closer to 3 characters.
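As a back-of-the-envelope sketch, those two ratios turn into a quick estimator. The function name and the `kind` labels are made up for this example, and a real tokenizer will give different counts:

```python
import math

# Rule-of-thumb ratios from the note above; real tokenizers vary by model and text.
CHARS_PER_TOKEN = {"english": 4.0, "code": 3.0}


def estimate_tokens(text: str, kind: str = "english") -> int:
    """Estimate the token count from the character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN[kind])


print(estimate_tokens("a" * 400))          # English prose estimate
print(estimate_tokens("a" * 300, "code"))  # source code estimate
```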

This tokenizer UI works great for all models: Tiktokenizer. Note that cl100k_base, despite the name, is not a Claude-100K tokenizer; it is an encoding from OpenAI's original tiktoken repo, used by GPT-3.5 and GPT-4.

Tiktokenizer is a custom-made UI, with its repo at Tiktokenizer
openai/tiktoken: tiktoken is a fast BPE tokeniser for use with OpenAI’s models. This is the original.
OpenAI API is the official tokenizer playground
What are tokens?
GPT-4 vs. GPT-3
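For exact counts outside the UI, the tiktoken library can be called directly. The sketch below assumes tiktoken is installed and can load the `cl100k_base` encoding. The fallback to the 4-characters-per-token heuristic is an addition for this example, so that the function still returns something when the library or encoding is unavailable.

```python
def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count tokens with tiktoken if possible, else fall back to a rough estimate."""
    try:
        import tiktoken  # third-party: pip install tiktoken

        encoding = tiktoken.get_encoding(encoding_name)
        return len(encoding.encode(text))
    except Exception:
        # Assumption for this sketch: any failure (library missing, encoding
        # not downloadable, special tokens in the text) uses ~4 chars per token.
        return -(-len(text) // 4)  # ceiling division


print(count_tokens("One token is generally about four characters of English text."))
```

Counts from `cl100k_base` apply to the OpenAI models that use that encoding. Other vendors use their own tokenizers, so treat the number as an approximation there.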

Temperature and Top_P for coding and other

Refer to the Cheat Sheet: Mastering Temperature and Top_p with a special focus on coding.
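For intuition about what the two knobs do, here is a small sketch using the standard definitions: temperature divides the logits before the softmax, and top_p keeps the smallest set of most likely tokens whose cumulative probability reaches the threshold. This is not taken from the cheat sheet. The function names and toy logits are illustrative assumptions, and real inference stacks may differ in detail.

```python
import math


def apply_temperature(logits: list[float], temperature: float) -> list[float]:
    """Softmax over logits / temperature; lower values sharpen the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: list[float], top_p: float) -> list[float]:
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: list[int] = []
    cumulative = 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


logits = [2.0, 1.0, 0.0]  # toy scores for three candidate tokens
cold = apply_temperature(logits, 0.5)  # more peaked on the top token
hot = apply_temperature(logits, 2.0)   # flatter, more random
nucleus = top_p_filter([0.5, 0.3, 0.2], 0.7)  # drops the least likely token
```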

Prompt Caching

Context caching (this is what Google calls it)

TTL for cache storage is usually 5 min for Claude and OpenAI and 1 hr for Gemini, but Gemini also charges for storing the cache. With OpenAI, a prompt of 1024 tokens or longer is automatically added to the cache.
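The numbers in this note can be turned into a small lookup to reason about whether a cached prefix is likely still alive. This is a sketch of the bookkeeping only. The helper names are hypothetical, the TTL values are the "usual" defaults quoted above (providers may change them or offer other options), and treating the 1024-token threshold as inclusive is an assumption.

```python
# Typical default TTLs as quoted in this note, in seconds.
CACHE_TTL_SECONDS = {
    "claude": 5 * 60,
    "openai": 5 * 60,
    "gemini": 60 * 60,
}

# Threshold from the note; treated as inclusive here, which is an assumption.
OPENAI_MIN_CACHEABLE_TOKENS = 1024


def is_cache_alive(provider: str, seconds_since_last_use: float) -> bool:
    """True if a cache entry last used this long ago should still exist."""
    return seconds_since_last_use < CACHE_TTL_SECONDS[provider]


def openai_auto_cached(prompt_tokens: int) -> bool:
    """True if the prompt is long enough to be cached automatically."""
    return prompt_tokens >= OPENAI_MIN_CACHEABLE_TOKENS


print(is_cache_alive("claude", 4 * 60))   # within the 5-minute window
print(is_cache_alive("gemini", 30 * 60))  # within the 1-hour window
print(openai_auto_cached(800))            # too short to be cached
```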

But note this regarding prompt caching: "Simply using a longer prompt. Sometimes the simplest solution is the best. If your knowledge base is smaller than 200,000 tokens (about 500 pages of material)"

Google has a catch that makes it less usable: Gemini Introduced Context Caching for AI - It’s 4x Cheaper but No One Will Use It - Reprompt AI.

Prompt Caching | OpenRouter explains how it is billed: it is especially favorable with Claude, although it is best with DeepSeek because there it is automatic and really cheap.

This is an excellent playground for prompts, though it is a commercial product that does everything: Vellum AI. The point is that they have excellent guides: How to use Prompt Caching?

Inference speed

TheFastest.ai
LLM API Provider Leaderboard | Artificial Analysis

date 20. May 2023 | modified 13. Feb 2025

filename: AI » LLM » Model Parameters