Talk To Your Slide Deck Using Multimodal Foundation Models Hosted On Amazon Bedrock – Part 2 | Amazon Web Services

In Part 1 of this series, we presented a solution that used the Amazon Titan Multimodal Embeddings model to convert individual slides from a slide deck into embeddings. We stored the embeddings in a vector database and then used the Large Language-and-Vision Assistant (LLaVA 1.5-7b) model to generate text responses to user questions based on the most similar slide retrieved from the vector database. We used AWS services including Amazon Bedrock, Amazon SageMaker, and Amazon OpenSearch Serverless in this solution.

In this post, we demonstrate a different approach. We use the Anthropic Claude 3 Sonnet model to generate text descriptions for each slide in the slide deck. These descriptions are then converted into text embeddings using the Amazon Titan Text Embeddings model and stored in a vector database. Then we use the Claude 3 Sonnet model to generate answers to user questions based on the most relevant text description retrieved from the vector database.

You can test both approaches for your dataset and evaluate the results to see which approach gives you the best results. In Part 3 of this series, we evaluate the results of both methods.

Solution overview

The solution provides an implementation for answering questions using information contained in text and visual elements of a slide deck. The design relies on the concept of Retrieval Augmented Generation (RAG). Traditionally, RAG has been associated with textual data that can be processed by large language models (LLMs). In this series, we extend RAG to include images as well. This provides a powerful search capability to extract contextually relevant content from visual elements like tables and graphs along with text.
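To make the retrieval step of RAG concrete, here is a minimal, self-contained sketch that ranks a few toy slide vectors by cosine similarity to a query vector. The three-dimensional vectors and slide names are invented for illustration only; the solution in this post uses 1536-dimensional Amazon Titan embeddings and an OpenSearch Serverless k-NN search rather than this in-memory loop.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: List[float], store: Dict[str, List[float]], k: int = 1) -> List[Tuple[str, float]]:
    """Return the k entries of the store most similar to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in store.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy stand-in for the vector database (hypothetical names and vectors).
toy_store = {
    "slide_01": [1.0, 0.0, 0.0],
    "slide_02": [0.0, 1.0, 0.0],
    "slide_03": [0.7, 0.7, 0.0],
}

top = retrieve([0.9, 0.1, 0.0], toy_store, k=2)
print(top)
```

In the full solution, the text attached to the best match (the slide description) is what gets passed to the LLM as context.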

The solution consists of the following components:

Amazon Titan Text Embeddings is a text embeddings model that converts natural language text, including single words, phrases, or even large documents, into numerical representations that can be used to power use cases such as search, personalization, and clustering based on semantic similarity.
Claude 3 Sonnet is the next generation of state-of-the-art models from Anthropic. Sonnet is a versatile tool that can handle a wide range of tasks, from complex reasoning and analysis to rapid outputs, as well as efficient search and retrieval across vast amounts of information.
OpenSearch Serverless is an on-demand serverless configuration for Amazon OpenSearch Service. We use OpenSearch Serverless as a vector database for storing embeddings generated by the Amazon Titan Text Embeddings model. An index created in the OpenSearch Serverless collection serves as the vector store for our RAG solution.
Amazon OpenSearch Ingestion (OSI) is a fully managed, serverless data collector that delivers data to OpenSearch Service domains and OpenSearch Serverless collections. In this post, we use an OSI pipeline API to deliver data to the OpenSearch Serverless vector store.

The solution design consists of two parts: ingestion and user interaction. During ingestion, we process the input slide deck by converting each slide into an image, generating descriptions and text embeddings for each image. We then populate the vector data store with the embeddings and text description for each slide. These steps are completed prior to the user interaction steps.

In the user interaction phase, a question from the user is converted into text embeddings. A similarity search is run on the vector database to find a text description corresponding to a slide that could potentially contain answers to the user question. We then provide the slide description and the user question to the Claude 3 Sonnet model to generate an answer to the query. All the code for this post is available in the GitHub repo.

The following diagram illustrates the ingestion architecture.

The workflow consists of the following steps:

Slides are converted to image files (one per slide) in JPG format and passed to the Claude 3 Sonnet model to generate a text description of each slide.
The descriptions are sent to the Amazon Titan Text Embeddings model to generate embeddings. In this series, we use the slide deck Train and deploy Stable Diffusion using AWS Trainium & AWS Inferentia from the AWS Summit in Toronto, June 2023 to demonstrate the solution. The sample deck has 31 slides, so we generate 31 sets of vector embeddings, each with 1536 dimensions. We add metadata fields so that we can run rich search queries using OpenSearch's search capabilities.
The embeddings are ingested into an OSI pipeline using an API call.
The OSI pipeline ingests the data as documents into an OpenSearch Serverless index. The index is configured as the sink for this pipeline and is created as part of the OpenSearch Serverless collection.
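The ingestion steps above produce one document per slide. The following is a minimal sketch of assembling such a document, assuming the field names used by the index mapping shown later in this post (vector_embedding, image_path, slide_text, slide_number, metadata). The helper name build_slide_document and the sample S3 path are hypothetical and not part of the original notebooks.

```python
from typing import Any, Dict, List

EMBEDDING_DIMENSION = 1536  # dimension stated in this post for the Titan text embeddings


def build_slide_document(slide_number: int, image_path: str, slide_text: str,
                         embedding: List[float], filename: str) -> Dict[str, Any]:
    """Assemble one index document for a slide, checking the embedding size first."""
    if len(embedding) != EMBEDDING_DIMENSION:
        raise ValueError(
            f"expected {EMBEDDING_DIMENSION} dimensions, got {len(embedding)}"
        )
    return {
        "image_path": image_path,
        "slide_text": slide_text,
        # slide_number is mapped as text in the index, so store it as a string
        "slide_number": str(slide_number),
        "metadata": {"filename": filename, "desc": ""},
        "vector_embedding": embedding,
    }


# Placeholder values for illustration only.
doc = build_slide_document(
    slide_number=1,
    image_path="s3://example-bucket/img/slide_1.jpg",
    slide_text="Title slide describing the session.",
    embedding=[0.0] * EMBEDDING_DIMENSION,
    filename="slide_1.jpg",
)
print(sorted(doc.keys()))
```

Validating the dimension before ingestion is a design choice in this sketch: a k-NN index with a fixed dimension rejects vectors of any other length, so it is cheaper to catch the mismatch early.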

The following diagram illustrates the user interaction architecture.

The workflow consists of the following steps:

The user submits a question related to the slide deck that has been ingested.
The user input is converted into embeddings using the Amazon Titan Text Embeddings model accessed using Amazon Bedrock. An OpenSearch Service vector search is performed using these embeddings. We perform a k-nearest neighbor (k-NN) search to retrieve the most relevant embeddings matching the user query.
The metadata of the response from OpenSearch Serverless contains a path to the image and description corresponding to the most relevant slide.
A prompt is created by combining the user question and the image description. The prompt is provided to Claude 3 Sonnet hosted on Amazon Bedrock.
The result of this inference is returned to the user.
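The k-NN search in the steps above can be sketched as an OpenSearch query body. This is a minimal sketch rather than the notebook's own code: the helper name build_knn_query is hypothetical, and the field name vector_embedding is taken from the index mapping shown later in this post.

```python
from typing import Any, Dict, List


def build_knn_query(embedding: List[float], k: int = 1) -> Dict[str, Any]:
    """Build an approximate k-NN query against the vector_embedding field."""
    return {
        "size": k,
        # return only the stored fields needed to answer the question
        "_source": ["image_path", "slide_text", "slide_number", "metadata"],
        "query": {
            "knn": {
                "vector_embedding": {
                    "vector": embedding,
                    "k": k,
                }
            }
        },
    }


# A short toy vector is used here; a real query vector has 1536 dimensions.
query = build_knn_query([0.1, 0.2, 0.3], k=3)
print(query["query"]["knn"]["vector_embedding"]["k"])

# With an opensearch-py client, the query could then be run as (not executed here):
# response = os_client.search(index=index_name, body=query)
```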

We discuss the steps for both stages in the following sections, and include details about the output.

Prerequisites

To implement the solution provided in this post, you should have an AWS account and familiarity with foundation models (FMs), Amazon Bedrock, SageMaker, and OpenSearch Service.

This solution uses the Claude 3 Sonnet and Amazon Titan Text Embeddings models hosted on Amazon Bedrock. Make sure that these models are enabled for use by navigating to the Model access page on the Amazon Bedrock console.

If the models are enabled, the Access status will state Access granted.

If the models are not available, enable access by choosing Manage model access, selecting the models, and choosing Request model access. The models are enabled for use immediately.

Use AWS CloudFormation to create the solution stack

You can use AWS CloudFormation to create the solution stack. If you have created the solution for Part 1 in the same AWS account, be sure to delete that before creating this stack.

AWS Region	Link
`us-east-1`
`us-west-2`

After the stack is created successfully, navigate to the stack's Outputs tab on the AWS CloudFormation console and note the values for MultimodalCollectionEndpoint and OpenSearchPipelineEndpoint. You use these in the subsequent steps.
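If you prefer to read the stack outputs programmatically rather than from the console, the following sketch shows one way to do it with the CloudFormation DescribeStacks API. The helper name get_stack_outputs and the stack name in the usage comment are placeholders; substitute the name you gave your stack.

```python
from typing import Any, Dict


def get_stack_outputs(cfn_client: Any, stack_name: str) -> Dict[str, str]:
    """Return the outputs of a CloudFormation stack as {OutputKey: OutputValue}."""
    response = cfn_client.describe_stacks(StackName=stack_name)
    outputs = response["Stacks"][0].get("Outputs", [])
    return {item["OutputKey"]: item["OutputValue"] for item in outputs}


# Example usage (requires AWS credentials; "my-stack-name" is a placeholder):
# import boto3
# outputs = get_stack_outputs(boto3.client("cloudformation"), "my-stack-name")
# print(outputs["MultimodalCollectionEndpoint"])
# print(outputs["OpenSearchPipelineEndpoint"])
```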

The CloudFormation template creates the following resources:

IAM roles - The following AWS Identity and Access Management (IAM) roles are created. Update these roles to apply least-privilege permissions, as discussed in Security best practices.
- SMExecutionRole with Amazon Simple Storage Service (Amazon S3), SageMaker, OpenSearch Service, and Amazon Bedrock full access.
- OSPipelineExecutionRole with access to the S3 bucket and OSI actions.
SageMaker notebook - All code for this post is run using this notebook.
OpenSearch Serverless collection - This is the vector database for storing and retrieving embeddings.
OSI pipeline - This is the pipeline for ingesting data into OpenSearch Serverless.
S3 bucket - All data for this post is stored in this bucket.

The CloudFormation template sets up the pipeline configuration required to configure the OSI pipeline with HTTP as source and the OpenSearch Serverless index as sink. The SageMaker notebook 2_data_ingestion.ipynb shows how to ingest data into the pipeline using the Requests HTTP library.

The CloudFormation template also creates the network, encryption, and data access policies required for your OpenSearch Serverless collection. Update these policies to apply least-privilege permissions.

The CloudFormation template name and OpenSearch Service index name are referenced in the SageMaker notebook 3_rag_inference.ipynb. If you change the default names, make sure you update them in the notebook.

Test the solution

After you have created the CloudFormation stack, you can test the solution. Complete the following steps:

On the SageMaker console, choose Notebooks in the navigation pane.
Select MultimodalNotebookInstance and choose Open JupyterLab.
In File Browser, traverse to the notebooks folder to see the notebooks and supporting files.

The notebooks are numbered in the sequence in which they run. Instructions and comments in each notebook describe the actions performed by that notebook. We run these notebooks one by one.

Choose 1_data_prep.ipynb to open it in JupyterLab.
On the Run menu, choose Run All Cells to run the code in this notebook.

This notebook downloads a publicly available slide deck, converts each slide into the JPG file format, and uploads these files to the S3 bucket.

Choose 2_data_ingestion.ipynb to open it in JupyterLab.
On the Run menu, choose Run All Cells to run the code in this notebook.

In this notebook, you create an index in the OpenSearch Serverless collection. This index stores the embeddings data for the slide deck. See the following code:

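import json
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

# g holds the notebook's shared settings (presumably globals.py); host,
# index_name, and logger are assumed to be defined earlier in the notebook.
# Sign requests to OpenSearch Serverless with SigV4 using the notebook's credentials.
session = boto3.Session()
credentials = session.get_credentials()
auth = AWSV4SignerAuth(credentials, g.AWS_REGION, g.OS_SERVICE)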

os_client = OpenSearch(
  hosts = [{'host': host, 'port': 443}],
  http_auth = auth,
  use_ssl = True,
  verify_certs = True,
  connection_class = RequestsHttpConnection,
  pool_maxsize = 20
)

index_body = """
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "vector_embedding": {
        "type": "knn_vector",
        "dimension": 1536,
        "method": {
          "name": "hnsw",
          "engine": "nmslib",
          "parameters": {}
        }
      },
      "image_path": {
        "type": "text"
      },
      "slide_text": {
        "type": "text"
      },
      "slide_number": {
        "type": "text"
      },
      "metadata": { 
        "properties" :
          {
            "filename" : {
              "type" : "text"
            },
            "desc":{
              "type": "text"
            }
          }
      }
    }
  }
}
"""
index_body = json.loads(index_body)
try:
  response = os_client.indices.create(index_name, body=index_body)
  logger.info(f"response received for the create index -> {response}")
except Exception as e:
  logger.error(f"error in creating index={index_name}, exception={e}")

You use the Claude 3 Sonnet and Amazon Titan Text Embeddings models to convert the JPG images created in the previous notebook into vector embeddings. The embeddings are stored in the index together with additional metadata (such as the S3 path and description of the image file). The following code snippet shows how Claude 3 Sonnet generates image descriptions:

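import base64
import json

def get_img_desc(image_file_path: str, prompt: str) -> str:
    # read the file, MAX image size supported is 2048 * 2048 pixels
    # The Claude 3 Messages API expects the image as a base64 string. This
    # assumes image_file_path points to a raw JPG file; if the file already
    # contains base64 text, decode it directly instead of encoding it again.
    with open(image_file_path, "rb") as image_file:
        input_image_b64 = base64.b64encode(image_file.read()).decode('utf-8')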
  
    body = json.dumps(
        {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1000,
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "image",
                            "source": {
                                "type": "base64",
                                "media_type": "image/jpeg",
                                "data": input_image_b64
                            },
                        },
                        {"type": "text", "text": prompt},
                    ],
                }
            ],
        }
    )
    
    response = bedrock.invoke_model(
        modelId=g.CLAUDE_MODEL_ID,
        body=body
    )

    resp_body = json.loads(response['body'].read().decode("utf-8"))
    resp_text = resp_body['content'][0]['text'].replace('"', "'")

    return resp_text
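The comment in the snippet above mentions a maximum image size of 2048 x 2048 pixels. As a hedged sketch, the following helper computes the dimensions an oversized slide image would need to be scaled to in order to fit within that limit while preserving its aspect ratio. The helper name fit_within_limit is hypothetical, and the actual resizing (for example with an imaging library) is left out.

```python
from typing import Tuple

MAX_SIDE = 2048  # limit taken from the comment in the snippet above


def fit_within_limit(width: int, height: int, max_side: int = MAX_SIDE) -> Tuple[int, int]:
    """Return (width, height) scaled down, if needed, so both sides are <= max_side."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # never scale up; only shrink when a side exceeds the limit
    scale = min(1.0, max_side / width, max_side / height)
    new_width = min(max_side, max(1, round(width * scale)))
    new_height = min(max_side, max(1, round(height * scale)))
    return new_width, new_height


print(fit_within_limit(4096, 2304))
print(fit_within_limit(1920, 1080))
```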

The image descriptions are passed to the Amazon Titan Text Embeddings model to generate vector embeddings. These embeddings are stored in the index along with additional metadata (such as the S3 path and description of the image file). The following code snippet shows the call to the Amazon Titan Text Embeddings model:

def get_text_embedding(bedrock: botocore.client, prompt_data: str) -> np.ndarray:
    body = json.dumps({
        "inputText": prompt_data,
    })    
    try:
        response = bedrock.invoke_model(
            body=body, modelId=g.TITAN_MODEL_ID, accept=g.ACCEPT_ENCODING, contentType=g.CONTENT_ENCODING
        )
        response_body = json.loads(response['body'].read())
        embedding = response_body.get('embedding')
    except Exception as e:
        logger.error(f"exception={e}")
        embedding = None

    return embedding

The data is ingested into the OpenSearch Serverless index by making an API call to the OSI pipeline. The following code snippet shows the call made using the Requests HTTP library:

data = json.dumps([{
    "image_path": input_image_s3, 
    "slide_text": resp_text, 
    "slide_number": slide_number, 
    "metadata": {
        "filename": obj_name, 
        "desc": "" 
    }, 
    "vector_embedding": embedding
}])

r = requests.request(
    method='POST', 
    url=osi_endpoint, 
    data=data,
    auth=AWSSigV4('osis'))

Choose 3_rag_inference.ipynb to open it in JupyterLab.
On the Run menu, choose Run All Cells to run the code in this notebook.

This notebook implements the RAG solution: you convert the user question into embeddings, find a similar image description from the vector database, and provide the retrieved description to Claude 3 Sonnet to generate an answer to the user question. You use the following prompt template:

  llm_prompt: str = """

  Human: Use the summary to provide a concise answer to the question to the best of your abilities. If you cannot answer the question from the context then say I do not know, do not make up an answer.
  <question>
  {question}
  </question>

  <summary>
  {summary}
  </summary>

  Assistant:"""
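To see what the model actually receives, the following sketch fills in the template with a sample question and a made-up one-line summary. The template text is copied from above; the summary string is illustrative only and is not the real slide description produced by Claude 3 Sonnet.

```python
# Copy of the prompt template shown above, so this example runs on its own.
LLM_PROMPT = """

H: Use the summary to provide a concise answer to the question to the best of your abilities. If you cannot answer the question from the context then say I do not know, do not make up an answer.
<question>
{question}
</question>

<summary>
{summary}
</summary>

A:"""

sample_question = "How does Inf2 compare in performance to comparable EC2 instances? I need numbers."
# Illustrative summary, not actual model output.
sample_summary = "The slide states up to 4x higher throughput and up to 10x lower latency for Inf2."

filled_prompt = LLM_PROMPT.format(question=sample_question, summary=sample_summary)
print(filled_prompt)
```

Because str.format only interprets braces in the template itself, curly braces that happen to appear in a question or summary are inserted as-is.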

The following code snippet provides the RAG workflow:

def get_llm_response(bedrock: botocore.client, question: str, summary: str) -> str:
    prompt = llm_prompt.format(question=question, summary=summary)
    
    body = json.dumps(
    {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    })
        
    try:
        response = bedrock.invoke_model(
        modelId=g.CLAUDE_MODEL_ID,
        body=body)

        response_body = json.loads(response['body'].read().decode("utf-8"))
        llm_response = response_body['content'][0]['text'].replace('"', "'")
        
    except Exception as e:
        logger.error(f"exception while slide_text={summary[:10]}, exception={e}")
        llm_response = None

    return llm_response


from IPython.display import Image, display

# create prompt and convert to embeddings
question: str = "How does Inf2 compare in performance to comparable EC2 instances? I need numbers."
text_embedding = get_text_embedding(bedrock, question)

# vector db search (find_similar_data is defined in the notebook)
vector_db_response: Dict = find_similar_data(text_embedding)

# download image for local notebook display
s3_img_path = vector_db_response.get('hits', {}).get('hits')[0].get('_source').get('image_path')
logger.info(f'going to answer the question="{question}" using the image "{s3_img_path}"')

# Jupyter shell command: copy the matched slide image from S3 to the notebook
!aws s3 cp {s3_img_path} .
local_img_path = os.path.basename(s3_img_path)
display(Image(filename=local_img_path))

# Ask Claude 3 Sonnet
slide_text = vector_db_response.get('hits', {}).get('hits')[0].get('_source').get('slide_text')

llm_response = get_llm_response(bedrock, question, slide_text)
print(llm_response)
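The workflow above indexes directly into the first search hit, which raises an error if the search returns nothing. A slightly more defensive variant could look like the following sketch. The helper name top_hit_source is hypothetical, and the sample response only mimics the general shape of an OpenSearch search response with placeholder values.

```python
from typing import Any, Dict, Optional


def top_hit_source(vector_db_response: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the _source of the best hit, or None when there are no hits."""
    hits = vector_db_response.get("hits", {}).get("hits", [])
    if not hits:
        return None
    return hits[0].get("_source")


# Placeholder response for illustration only.
sample_response = {
    "hits": {
        "hits": [
            {
                "_score": 0.91,
                "_source": {
                    "image_path": "s3://example-bucket/img/slide_7.jpg",
                    "slide_text": "Description of slide 7.",
                },
            }
        ]
    }
}

source = top_hit_source(sample_response)
print(source["image_path"] if source else "no matching slide found")
```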

Results

The following table contains some user questions and responses generated by our implementation. The Question column captures the user question, and the Answer column is the textual response generated by Claude 3 Sonnet. The Image column shows the k-NN slide match returned by the OpenSearch Serverless vector search.

Multimodal RAG results

Question	Answer	Image
How does Inf2 compare in performance to comparable EC2 instances? I need numbers.	Based on the information provided in the summary, the key performance comparison numbers given for the Amazon EC2 Inf2 instances compared to other instances are: 1. Up to 4x higher throughput. 2. Up to 10x lower latency. The summary does not provide specific numbers comparing the Inf2 instances to other EC2 instance types. However, it highlights the throughput and latency improvements as the main performance advantages of the Inf2 instances over comparable instances for running generative AI models.
As per the AI/ML flywheel, what do the AWS AI/ML services provide?	Based on the summary, the AWS AI/ML services provide a range of capabilities that fuel an AI/ML flywheel. This flywheel consists of innovation in developing new and better algorithms, enabling more use-cases and applications, driving wider adoption and demand, which in turn leads to further investment in research and optimization to drive more innovation. The AWS AI/ML services seem to offer the tools, resources, and infrastructure to support this continuous cycle of innovation, application development, adoption, and reinvestment in the field of artificial intelligence and machine learning.
Compared to GPT-2, how many more parameters does GPT-3 have? What is the numerical difference between the parameter size of GPT-2 and GPT-3?	According to the information provided in the summary, GPT-3 from 2020 had 175B (175 billion) parameters, while GPT-2 from 2019 had 1.5B (1.5 billion) parameters. Therefore, the numerical difference between the parameter size of GPT-2 and GPT-3 is 173.5 billion parameters. GPT-3 has approximately 115 times more parameters than GPT-2.
What are quarks in particle physics?	Unfortunately, the given summary does not contain any information about quarks in particle physics. The summary describes an image related to the progression of natural language processing and generative AI technologies, but it does not mention anything about particle physics or the concept of quarks.
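The figures in the GPT-2 and GPT-3 row can be checked with a few lines of arithmetic. The difference of 173.5 billion matches the generated answer, while the ratio works out to roughly 116.7, so the model's "approximately 115 times" is a slight underestimate.

```python
# Parameter counts in billions, as quoted in the generated answer above.
gpt2_params_b = 1.5
gpt3_params_b = 175.0

difference_b = gpt3_params_b - gpt2_params_b
ratio = gpt3_params_b / gpt2_params_b

print(f"difference = {difference_b} billion, ratio = {ratio:.1f}x")
```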

Query your index

You can use OpenSearch Dashboards to interact with the OpenSearch API to run quick tests on your index and ingested data.
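As an example of such a quick test, the following sketch builds a simple full-text query against the slide_text field and prints it in a form that can be pasted into the Dev Tools console in OpenSearch Dashboards. The index name placeholder and the search term are assumptions; use the index name configured in your notebook.

```python
import json

# Full-text match on the generated slide descriptions; returns a few small fields only.
query = {
    "size": 3,
    "_source": ["slide_number", "image_path", "slide_text"],
    "query": {"match": {"slide_text": "Inferentia"}},
}

# Replace <your-index-name> with the index created by 2_data_ingestion.ipynb.
print("GET <your-index-name>/_search")
print(json.dumps(query, indent=2))
```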

Clean up

To avoid incurring future charges, delete the resources. You can do this by deleting the stack using the AWS CloudFormation console.
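If you would rather script the cleanup than use the console, the following sketch deletes the stack and waits for the deletion to finish using the CloudFormation DeleteStack API and the stack_delete_complete waiter. The helper name delete_stack_and_wait and the stack name in the usage comment are placeholders.

```python
from typing import Any


def delete_stack_and_wait(cfn_client: Any, stack_name: str) -> None:
    """Request deletion of a CloudFormation stack and block until it completes."""
    cfn_client.delete_stack(StackName=stack_name)
    waiter = cfn_client.get_waiter("stack_delete_complete")
    waiter.wait(StackName=stack_name)


# Example usage (requires AWS credentials; "my-stack-name" is a placeholder):
# import boto3
# delete_stack_and_wait(boto3.client("cloudformation"), "my-stack-name")
```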

Conclusion

Enterprises generate new content all the time, and slide decks are a common way to share and disseminate information internally within the organization and externally with customers or at conferences. Over time, rich information can remain buried and hidden in non-text modalities like graphs and tables in these slide decks.

You can use this solution and the power of foundation models such as Amazon Titan Text Embeddings and the multimodal Claude 3 Sonnet to discover new information or uncover new perspectives on content in slide decks. You can try different Claude models available on Amazon Bedrock by updating the CLAUDE_MODEL_ID in the globals.py file.
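As a rough illustration of that change, the model settings in globals.py might look like the following. The exact contents of the file are an assumption here; only the names CLAUDE_MODEL_ID and TITAN_MODEL_ID appear in the code in this post. The model identifiers shown are Amazon Bedrock model IDs, but confirm which models are available in your Region before switching.

```python
# Hypothetical excerpt of globals.py; the real file may contain other settings.

# Text embeddings model (1536-dimensional output, matching the index mapping).
TITAN_MODEL_ID = "amazon.titan-embed-text-v1"

# Default model used in this post.
CLAUDE_MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"

# To try a different Claude 3 model, replace the ID above, for example:
# CLAUDE_MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"

print(f"Using {CLAUDE_MODEL_ID} for descriptions and answers")
```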

This is Part 2 of a three-part series. We used the Amazon Titan Multimodal Embeddings and the LLaVA model in Part 1. In Part 3, we will compare the approaches from Part 1 and Part 2.

Portions of this code are released under the Apache 2.0 License.

About the authors

Amit Arora is an AI and ML Specialist Architect at Amazon Web Services, helping enterprise customers use cloud-based machine learning services to rapidly scale their innovations. He is also an adjunct lecturer in the MS data science and analytics program at Georgetown University in Washington D.C.

Manju Prasad is a Senior Solutions Architect at Amazon Web Services. She focuses on providing technical guidance in a variety of technical domains, including AI/ML. Prior to joining AWS, she designed and built solutions for companies in the financial services sector and also for a startup. She is passionate about sharing knowledge and fostering interest in emerging talent.

Archana Inapudi is a Senior Solutions Architect at AWS, supporting a strategic customer. She has over a decade of cross-industry expertise leading strategic technical initiatives. Archana is an aspiring member of the AI/ML technical field community at AWS. Prior to joining AWS, Archana led a migration from traditional siloed data sources to Hadoop at a healthcare company. She is passionate about using technology to accelerate growth, provide value to customers, and achieve business outcomes.

Antara Raisa is an AI and ML Solutions Architect at Amazon Web Services, supporting strategic customers based out of Dallas, Texas. She also has previous experience working with large enterprise partners at AWS, where she worked as a Partner Success Solutions Architect for digital-centered customers.

Source: https://aws.amazon.com/blogs/machine-learning/talk-to-your-slide-deck-using-multimodal-foundation-models-hosted-on-amazon-bedrock-and-amazon-sagemaker-part-2/