Cortex Parse Document Guide & Best Practices

This article provides a step-by-step guide and best practices for Snowflake Cortex’s Parse Document function. If you’re interested in an overview of this functionality, you can read more here.

Function Overview 

SNOWFLAKE.CORTEX.PARSE_DOCUMENT() enables automated document processing and text extraction within Snowflake’s Cortex framework. It supports multiple document formats and two parsing modes, LAYOUT and OCR. The function is currently in preview, and you can access the product documentation here.

Syntax and Parameters 

SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
    location STRING,     -- Stage location
    file_name STRING,    -- Document to process
    options OBJECT       -- Configuration options, e.g. {'mode': 'LAYOUT'}
)

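Options Object Parameters

The options object takes a 'mode' key, set to either LAYOUT or OCR, which determines what the function returns:

  • LAYOUT mode: returns JSON with a “content” key containing Markdown, including any tables extracted from the document.
  • OCR mode: returns JSON with a “content” key containing the plain text content of the document.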
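
Because both modes return a JSON object, client code that consumes the result usually starts by reading the “content” key. The Python sketch below is only an illustration of that shape: it assumes the result has already been fetched as a JSON string, and the sample string and the helper name read_parse_result are hypothetical, not part of Snowflake.

```python
import json
from typing import Optional, Tuple

# Hypothetical raw result, shaped like the JSON described above.
raw_result = '{"content": "Invoice Number: INV-33800", "metadata": {"pageCount": 1}}'


def read_parse_result(raw: str) -> Tuple[str, Optional[int]]:
    """Return (content, page_count) from a PARSE_DOCUMENT-style JSON string."""
    parsed = json.loads(raw)
    # "content" holds the Markdown (LAYOUT) or plain text (OCR).
    content = parsed.get("content", "")
    # "metadata" may be absent, so fall back to an empty dict.
    page_count = (parsed.get("metadata") or {}).get("pageCount")
    return content, page_count


content, page_count = read_parse_result(raw_result)
print(content)
print(page_count)
```

The same helper works for either mode, since only the text inside “content” differs.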

Step-By-Step Guide

Step 1 – Create a stage with Snowflake server-side encryption in your preferred database and schema.

CREATE STAGE CORTEX_AI.MAIN.input_stage 
    DIRECTORY = ( ENABLE = true ) 
    ENCRYPTION = ( TYPE = 'SNOWFLAKE_SSE' );

Step 2 – Load your PDFs into that stage. The examples in this guide use a sample invoice PDF named Coffee_Invoices.pdf.

Step 3 – Run the PARSE_DOCUMENT() function, passing parameters according to the syntax above. Below are examples and their outputs in each mode.

Layout mode:

SELECT 
  SNOWFLAKE.CORTEX.PARSE_DOCUMENT( 
    @CORTEX_AI.MAIN.input_stage, 
    'Coffee_Invoices.pdf', 
    {'mode': 'LAYOUT'} 
  ) AS layout; 
 
OUTPUT: 
{ "content": "
Invoice Number: INV-33800 Date: 2024-12-06 Customer: J ane S mith

Coffee Details:

|Type|Quantity (kg)|Price per kg ($)|Subtotal ($)|
| :---: | :---: | :---: | :---: |
|Arabica Beans|0.82|20.00|13.74|

Tax: $7.86 Shipping: $11.21 Total: $73.55",
  "metadata": { "pageCount": 1 }
}
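
The Markdown table inside the LAYOUT “content” can be turned into rows with a few lines of string handling. The following Python sketch is one possible approach, not an official utility: the function name parse_markdown_table is hypothetical, and it assumes the content holds a single pipe-delimited table like the one above.

```python
from typing import Dict, List

# Sample "content" value, copied from the LAYOUT output above.
layout_content = """
Invoice Number: INV-33800 Date: 2024-12-06 Customer: J ane S mith

Coffee Details:

|Type|Quantity (kg)|Price per kg ($)|Subtotal ($)|
| :---: | :---: | :---: | :---: |
|Arabica Beans|0.82|20.00|13.74|

Tax: $7.86 Shipping: $11.21 Total: $73.55
"""


def split_row(row: str) -> List[str]:
    """Split one pipe-delimited row into trimmed cell values."""
    return [cell.strip() for cell in row.strip().strip("|").split("|")]


def parse_markdown_table(content: str) -> List[Dict[str, str]]:
    """Return the table rows in `content` as dicts keyed by column header."""
    # Keep only lines that look like table rows (start and end with a pipe).
    rows = [
        line.strip()
        for line in content.splitlines()
        if line.strip().startswith("|") and line.strip().endswith("|")
    ]
    if len(rows) < 2:
        return []
    header = split_row(rows[0])
    records = []
    for row in rows[1:]:
        cells = split_row(row)
        # Skip the alignment row, whose cells contain only ':' and '-'.
        if all(cell and set(cell) <= set(":-") for cell in cells):
            continue
        records.append(dict(zip(header, cells)))
    return records


records = parse_markdown_table(layout_content)
print(records)
```

Values are kept as strings here; converting quantities and prices to numbers is left to the caller.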

OCR mode:

SELECT 
  SNOWFLAKE.CORTEX.PARSE_DOCUMENT( 
    @CORTEX_AI.MAIN.input_stage, 
    'Coffee_Invoices.pdf', 
    {'mode': 'OCR'} 
  ) AS ocr;

OUTPUT: 
Invoice Number: INV-33800 

Date: 2024-12-06 

Customer: J ane S mith 

Coffee Details: 

Type Quantity (kg) Price per kg ($) Subtotal ($) 

Arabica Beans 0.82 20.00 13.74 

Tax: $7.86 

Shipping: $11.21 

Total: $73.55

Step 4 – Parse the extracted text with regular expressions and load the results into a table. This gives you structured, queryable output directly from your PDFs.
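
The article does not show the exact expressions used, so the following is only a rough Python sketch of the idea in Step 4. The regex patterns, the invoices table, and its column names are all assumptions made for illustration, and an in-memory SQLite database stands in for a Snowflake table so the example can run anywhere.

```python
import re
import sqlite3

# Sample OCR-mode text, copied from the Step 3 output.
ocr_text = """Invoice Number: INV-33800
Date: 2024-12-06
Customer: J ane S mith
Coffee Details:
Type Quantity (kg) Price per kg ($) Subtotal ($)
Arabica Beans 0.82 20.00 13.74
Tax: $7.86
Shipping: $11.21
Total: $73.55"""

# Hypothetical patterns; adjust them to the layout of your own documents.
TEXT_PATTERNS = {
    "invoice_number": r"Invoice Number:\s*(INV-\d+)",
    "invoice_date": r"Date:\s*(\d{4}-\d{2}-\d{2})",
}
AMOUNT_PATTERNS = {
    "tax": r"Tax:\s*\$(\d+(?:\.\d+)?)",
    "shipping": r"Shipping:\s*\$(\d+(?:\.\d+)?)",
    "total": r"Total:\s*\$(\d+(?:\.\d+)?)",
}


def extract_fields(text: str) -> dict:
    """Pull invoice fields out of the text; missing fields become None."""
    fields = {}
    for name, pattern in TEXT_PATTERNS.items():
        match = re.search(pattern, text)
        fields[name] = match.group(1) if match else None
    for name, pattern in AMOUNT_PATTERNS.items():
        match = re.search(pattern, text)
        fields[name] = float(match.group(1)) if match else None
    return fields


# SQLite is used here purely as a stand-in for the target table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices ("
    "invoice_number TEXT, invoice_date TEXT, tax REAL, shipping REAL, total REAL)"
)
row = extract_fields(ocr_text)
conn.execute(
    "INSERT INTO invoices VALUES "
    "(:invoice_number, :invoice_date, :tax, :shipping, :total)",
    row,
)
rows = conn.execute("SELECT * FROM invoices").fetchall()
print(rows)
```

Keeping the patterns in dictionaries makes it easy to add or change fields without touching the extraction logic.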

Best Practices 

As a Snowflake Select partner, we recommend the following best practices when using Cortex Parse Document:

  • Optimize file formats 
  • Implement error handling 
  • Validate outputs 
  • Manage permissions 
  • Follow data governance 
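
As one illustration of the error-handling and output-validation points, the Python sketch below checks a result before it is used downstream. The function name validate_parse_result and the specific checks are assumptions chosen for this example. They are based on the “content” and “metadata” keys shown in the Step 3 output.

```python
import json


def validate_parse_result(raw: str) -> dict:
    """Return the parsed result, or raise ValueError if it looks unusable."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Result is not valid JSON: {exc}") from exc
    if not isinstance(parsed, dict):
        raise ValueError("Result is not a JSON object")

    # The extracted text must be a non-empty string.
    content = parsed.get("content")
    if not isinstance(content, str) or not content.strip():
        raise ValueError("Result has no text in 'content'")

    # If a page count is reported, it should be a positive integer.
    metadata = parsed.get("metadata") or {}
    if not isinstance(metadata, dict):
        raise ValueError("'metadata' is not a JSON object")
    page_count = metadata.get("pageCount")
    if page_count is not None and (not isinstance(page_count, int) or page_count < 1):
        raise ValueError(f"Unexpected pageCount: {page_count!r}")
    return parsed


good = validate_parse_result('{"content": "Total: $73.55", "metadata": {"pageCount": 1}}')
print(good["content"])

try:
    validate_parse_result('{"content": "   "}')
except ValueError as err:
    print(f"Rejected: {err}")
```

Raising a single exception type keeps the calling code simple: it can log the failure and move on to the next document.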

Summary

Snowflake’s Cortex Parse Document function enables automated document processing and text extraction within Snowflake’s Cortex framework, supporting various document formats and two parsing modes, LAYOUT and OCR. The workflow is to load documents into an encrypted stage, extract structured data (such as tables and text) with the function, and work with the results using SQL queries.
