-
Concepts
-
Generative AI
- Generative AI is called generative because the AI creates something that didn’t previously exist. That’s what makes it different from discriminative AI, which draws distinctions between different kinds of input. To say it differently, discriminative AI tries to answer a question like “Is this image a drawing of a rabbit or a lion?” whereas generative AI responds to prompts like “Draw me a picture of a lion and a rabbit sitting next to each other.”[www.infoworld.com]
-
Models
-
An AI model is a tool or algorithm trained on a certain data set, through which it can arrive at a decision – all without the need for human interference in the decision-making process.
- A deep learning model, or DL model, is a neural network that has been trained to perform a task, such as recognizing objects in digital images and videos or understanding human speech. Deep learning models are trained using large data sets and algorithms that enable the model to learn the task; as a rule, the more data the model is trained on, the better it performs.
-
Small Language Models
- An SLM is generally five to ten times smaller than an LLM, and many are open-source projects. The smaller size means much lower energy consumption, and an SLM can typically be hosted on a single GPU. SLMs also allow faster training and inference times, which translates to much lower latency.
- Their reduced complexity makes them a good choice for on-prem deployment, meeting strict compliance and data privacy standards.
- Despite their reduced size, SLMs demonstrate capabilities remarkably close to LLMs on various natural language understanding (NLU) tasks. This is especially the case when they are effectively fine-tuned (or retrained) for specialized use cases, such as health care or coding.
- Some of the top SLMs in the market include Llama-2-13b and CodeLlama-7b from Meta, Mistral-7b and Mixtral 8x7b from Mistral, and Phi-2 and Orca-2 from Microsoft.[aibusiness.com]
-
Formats & Quantization Methods
- Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types like 8-bit integer (int8) instead of the usual 32-bit floating point (float32). Reducing the number of bits means the resulting model requires less memory storage, consumes less energy (in theory), and operations like matrix multiplication can be performed much faster with integer arithmetic. It also makes it possible to run models on embedded devices, which sometimes support only integer data types (a short sketch of the idea follows below).[huggingface.co]
- A Guide to Quantization in LLMs | Symbl.ai[symbl.ai]
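- To make the idea concrete, here is a minimal sketch of symmetric int8 quantization in NumPy. This is a toy illustration of the principle, not the scheme any particular library uses; real methods typically quantize per-channel or per-block and calibrate on data.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric quantization: map float32 values onto the int8 range [-127, 127]."""
    scale = np.abs(w).max() / 127.0          # one scale factor for the whole tensor
    q = np.round(w / scale).astype(np.int8)  # 1 byte per weight instead of 4
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 values when needed for computation."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print("max rounding error:", np.abs(w - dequantize_int8(q, scale)).max())
```

- The stored model shrinks roughly 4x (int8 vs float32), at the cost of the small rounding error printed above.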
- Techniques
- GPTQ (General Pre-Trained Transformer Quantization) is a quantization technique designed to reduce the size of models so they can run on a single GPU (a hedged loading sketch appears at the end of this section). [symbl.ai]
- GGML (github) (said to stand for Georgi Gerganov Machine Learning, after its creator, or for GPT-Generated Model Language) is a C-based machine learning library (primarily a tensor library) designed for the quantization of Llama models so they can run on a CPU. More specifically, the library lets you save quantized models in the GGML binary format, which can be executed on a broader range of hardware. [symbl.ai]
- GGUF (GPT-Generated Unified Format), meanwhile, is a successor to GGML and is designed to address its limitations – most notably, enabling the quantization of non-Llama models.[symbl.ai]
- Tools
- Converting HuggingFace Models to GGUF/GGML | Substratus.AI[www.substratus.ai]
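- As a hedged sketch of what these formats look like in use: the `transformers` library can load a pre-quantized GPTQ checkpoint directly. The model id below is just an example of a community-published checkpoint on the Hugging Face Hub, and the `optimum` and `auto-gptq` packages are assumed to be installed.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example (assumption): a community-published 4-bit GPTQ checkpoint on the Hub.
model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# transformers reads the GPTQ config stored with the checkpoint and loads the
# quantized weights, so the model can fit on a single consumer GPU.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("What does quantization trade away?", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```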
-
Model Parameters Count (“Parameters”): Model parameter count or “number of parameters” refers to the number of weights in all layers of a model. As a general rule, the more parameters a model has, the more capable and accurate it is assumed to be.[sambanova.ai]
-
Prompt: An interface where a user can interact with a generative AI model using natural language. Within these prompt interfaces, a user can make a specific request of a model such as “Write a paragraph summary of generative AI” or “Draw a picture with horses running through a field with a forest in the background, during the spring”.[sambanova.ai]
- Giving Large Language Models Context | by Simon Attard | Medium[medium.com]
-
Fine Tuning
-
Application Architecture
-
RAG
- A Simple Guide To Retrieval Augmented Generation Language Models — Smashing Magazine[www.smashingmagazine.com]
- Retrieval augmented generation: Keeping LLMs relevant and current - Stack Overflow[stackoverflow.blog]
- In-context learning (used with RAG) sends context to the pre-trained model as part of the prompt at runtime, to help it understand more about your question. Fine-tuning, by contrast, further trains the pre-trained model to take into account a new set of contextual data, and all future prompts then go directly to your new iteration of the model (see the sketch after this list).[meltano.com]
- What is the value of the LLM if the actual source of truth is the retrieved data source? Mainly the natural language processing: summarizing and blending the retrieved passages.
- The LLM can also help decipher the user's query so that the retriever gets more precise information to retrieve against.
- Remember that the similarity search result can be two or three separate pages of a document, and you don't want to read the whole pages; the LLM presents only the necessary parts.[www.youtube.com]
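- A minimal sketch of the in-context learning side of this. The toy word-overlap retriever stands in for a real embeddings + vector store setup, and the assembled prompt would be sent to any chat/completion API.

```python
def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank chunks by word overlap with the query.
    A real system embeds the query and runs a vector similarity search."""
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(q & set(d.lower().split())))[:k]

# Three "pages" of a document; only the relevant ones should reach the model.
docs = [
    "Page 12: The warranty covers manufacturing defects for two years.",
    "Page 13: Water damage is not covered by the warranty.",
    "Page 40: Returns are accepted within 30 days of purchase.",
]

query = "Is water damage covered by the warranty?"
prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n" + "\n---\n".join(retrieve(query, docs)) +
    f"\n\nQuestion: {query}\nAnswer:"
)
print(prompt)  # sent to the model at runtime; its weights stay frozen (no fine-tuning)
```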
-
Agents
- Unlike conventional chains, where actions are preordained in code, Agents harness the reasoning capabilities of a language model to dynamically determine the next steps and their order.[medium.com]
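- A minimal sketch of that loop follows. The `llm` argument is a hypothetical stand-in for any chat/completion API call; frameworks like LangChain wrap the same pattern.

```python
# Toy tools the agent may invoke. (eval is for this demo only; never eval untrusted input.)
TOOLS = {
    "calculator": lambda expr: str(eval(expr)),
    "search": lambda q: f"(top search result for {q!r})",
}

def run_agent(llm, question: str, max_steps: int = 5) -> str:
    """The model, not hard-coded logic, decides which tool to call next and when to stop.
    By convention here, the model replies either 'CALL <tool>: <input>' or 'FINAL: <answer>'."""
    transcript = f"Question: {question}\nAvailable tools: {list(TOOLS)}\n"
    for _ in range(max_steps):
        reply = llm(transcript)                 # hypothetical model call
        if reply.startswith("FINAL:"):
            return reply[len("FINAL:"):].strip()
        tool, _, arg = reply[len("CALL"):].partition(":")
        observation = TOOLS[tool.strip()](arg.strip())
        transcript += f"{reply}\nObservation: {observation}\n"  # feed the result back
    return "(step limit reached)"
```

- The point of the pattern: the loop body is fixed, but the *sequence* of tool calls is chosen by the model at runtime rather than preordained in code.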
-
General Topics
- How to use with your own data
- How to use with your own tool
- say run code in response to a query in natural language
-
How to use with your own data
-
Tools
- Langchain
- LlamaIndex
- HuggingFace
- Ray Serve
- Open WebUI is an extensible, feature-rich, and user-friendly self-hosted WebUI designed to operate entirely offline. It supports various LLM runners, including Ollama and OpenAI-compatible APIs.[docs.openwebui.com]
-
open source language models
- Ollama[ollama.com]: Run Llama 2, Code Llama, and other models; customize and create your own (see the sketch after this list).
- Llama 2
- GPT4All - runs locally on a Mac (also Windows and Linux)
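- For a feel of what running one of these locally looks like, a small sketch against Ollama's local REST API. It assumes Ollama is running on its default port and that `ollama pull llama2` has already fetched the model.

```python
import requests

# Ollama serves a local HTTP API on port 11434 by default.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])
```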
Ecosystem
- Open-source libraries for LLM inference and serving:
- vLLM, Ray Serve, Text Generation Inference (TGI, from Hugging Face), OpenLLM, etc. (a vLLM sketch appears at the end of this section)
- Reference: Frameworks for Serving LLMs. A comprehensive guide into LLMs inference and serving | by Sergei Savvov | Jul, 2023 | Medium | Better Programming[betterprogramming.pub]
- Comparing LLM serving frameworks — LLMOps | by Thiyagarajan Palaniyappan | Medium[medium.com]
- Text Generation Inference (Hugging Face)
- Run your LLM on Text Generation Inference without the Internet and make your Security team happy! - YouTube[www.youtube.com]
- How to install the Enterprise grade AI Playground from Hugging Face: Text Generation Inference (TGI) - YouTube[www.youtube.com]
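- To make the comparison concrete, a hedged sketch of vLLM's offline batch-inference API (servers like TGI instead expose an HTTP endpoint). The model id below is just an example and needs a GPU with enough memory.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")   # example model id
params = SamplingParams(temperature=0.8, max_tokens=100)

# vLLM batches prompts and manages GPU KV-cache memory (PagedAttention) internally.
outputs = llm.generate(["Explain LLM serving in one paragraph."], params)
print(outputs[0].outputs[0].text)
```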
LLMs on Cloud
- GCP
- Running AI and ML workloads on GCP (with GPUs, TPUs, etc)
- GKE examples
- Guide to Serving Mistral 7B-Instruct v0.1 on GKE Utilizing Nvidia L4-GPUs (HF-text-generation-inference)
- Serving Mistral 7B on L4 GPUs running on GKE | by Javier Cañadillas | Google Cloud - Community | Medium[medium.com]
- Tutorial: Serving Llama 2 70b on GKE L4 GPUs (uses text-generation-inference)
Terms
- Local Inference: Inference means interacting with the model, i.e. asking it questions; local inference means running the model on your own machine while doing so. [medium.com]
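- A tiny concrete example of local inference with the Hugging Face `pipeline` helper; gpt2 is used only because it is small enough to download quickly, and everything runs on your own machine.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # downloads once, then runs locally
print(generator("Local inference means", max_new_tokens=30)[0]["generated_text"])
```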
Resources
- Courses
- Generative AI with Large Language Models — New Hands-on Course by DeepLearning.AI and AWS | AWS News Blog[aws.amazon.com]
- Generative AI with LLMs - DeepLearning.AI[www.deeplearning.ai]
- By DeepLearning.AI (Andrew Ng) and AWS - 16 hours (3 weeks, 5 hrs/week) - Intermediate Level
- Covers scoping the problem, choosing an LLM, adapting the LLM to your domain, optimizing the model for deployment, and integrating it into business applications. The course focuses not only on the practical aspects of generative AI but also highlights the science behind LLMs.
- Describe in detail the transformer architecture that powers LLMs, how they’re trained, and how fine-tuning enables LLMs to be adapted to a variety of specific use cases
- Introduction to Large Language Models | Coursera[www.coursera.org] - Beginner Level - 1 hour (Instructor: Google Cloud Training)
- Introduction to Large Language Models | Google Cloud Skills Boost[www.cloudskillsboost.google]
- Princeton
- How Does ChatGPT Work? An Overview of Large Language Models (Part 1 of 3) - Princeton University Media Central[mediacentral.princeton.edu]
- How Does ChatGPT Work? An Overview of Large Language Models (Part 2 of 3) - Princeton University Media Central[mediacentral.princeton.edu]
- How Does ChatGPT Work? An Overview of Large Language Models (Part 3 of 3) - Princeton University Media Central[mediacentral.princeton.edu]
- Welcome to LLM University![docs.cohere.com]
- Introduction - Hugging Face NLP Course[huggingface.co]
- ChatGPT Prompt Engineering for Developers - DeepLearning.AI[www.deeplearning.ai]
- Misc
- Home - Google Cloud AI Infrastructure - Open LLMs on GKE - Llama 2 and Beyond[cloudonair.withgoogle.com]
- GitHub - GoogleCloudPlatform/ai-on-gke[github.com]
- AI/ML orchestration on GKE documentation | Google Kubernetes Engine (GKE) | Google Cloud[cloud.google.com]
- Llama 2 - Meta AI[ai.meta.com]
- Open source large language models: Benefits, risks and types - IBM Blog[www.ibm.com]
- Navigating the AI Hype and Thinking about Niche LLM Applications | by Hadi Javeed | Better Programming[betterprogramming.pub]
- 12 days of no-cost training to learn generative AI this December (Google)
- AI Demos[ai-demos.dev]
- Neural Networks to LLM
- General