The purpose of this article is to provide a step-by-step guide and best practices for Snowflake Cortex’s Parse Document functionality. If you’re interested in an overview of this functionality, you can read more here.
Function Overview
SNOWFLAKE.CORTEX.PARSE_DOCUMENT() enables automated document processing and text extraction in Snowflake’s Cortex framework. It supports multiple document formats and two parsing modes, LAYOUT and OCR. The function is currently in preview, and you can access the product documentation here.
Syntax and Parameters
SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
    location STRING,   -- Stage location
    file_name STRING,  -- Document to process (path within the stage)
    options OBJECT     -- Configuration options, e.g. {'mode': 'LAYOUT'}
)
Options Object Parameters
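The options object takes a 'mode' key, which selects what the function returns:
- LAYOUT mode: returns JSON with key “content” containing Markdown, including tables extracted from the document.
- OCR mode: returns JSON with key “content” containing the plain text content of the document.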
Step-By-Step Guide
Step 1 – Create a stage with Snowflake server-side encryption in your preferred database and schema.
CREATE STAGE CORTEX_AI.MAIN.input_stage
DIRECTORY = ( ENABLE = true )
ENCRYPTION = ( TYPE = 'SNOWFLAKE_SSE' );
Step 2 – Load your PDFs into that stage. The examples in this guide use a sample invoice named Coffee_Invoices.pdf.
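One way to do the upload is the PUT command, run from SnowSQL or another client that supports it. The sketch below assumes a hypothetical local path, /tmp/Coffee_Invoices.pdf. AUTO_COMPRESS is turned off so the PDF is stored as-is rather than gzipped. Files can also be uploaded through the Snowsight UI.

```sql
-- Upload the PDF from the local machine (the local path is an assumption).
PUT file:///tmp/Coffee_Invoices.pdf @CORTEX_AI.MAIN.input_stage
    AUTO_COMPRESS = FALSE;

-- Refresh the stage's directory table so the new file is registered.
ALTER STAGE CORTEX_AI.MAIN.input_stage REFRESH;

-- Confirm the file is present before parsing it.
SELECT relative_path, size, last_modified
FROM DIRECTORY(@CORTEX_AI.MAIN.input_stage);
```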
Step 3 – Run the PARSE_DOCUMENT() function, passing parameters according to the syntax above. Below are examples and their outputs in each mode.
Layout mode:
SELECT
SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
@CORTEX_AI.MAIN.input_stage,
'Coffee_Invoices.pdf',
{'mode': 'LAYOUT'}
) AS layout;
OUTPUT:
{ "content": "
Invoice Number: INV-33800 Date: 2024-12-06 Customer: J ane S mith
Coffee Details:
|Type|Quantity (kg)|Price per kg ($)|Subtotal ($)|
| :---: | :---: | :---: | :---: |
|Arabica Beans|0.82|20.00|13.74|
Tax: $7.86 Shipping: $11.21 Total: $73.55",
"metadata": { "pageCount": 1 }
}
OCR mode:
SELECT
SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
@CORTEX_AI.MAIN.input_stage,
'Coffee_Invoices.pdf',
{'mode': 'OCR'}
) AS ocr;
OUTPUT:
Invoice Number: INV-33800
Date: 2024-12-06
Customer: J ane S mith
Coffee Details:
Type Quantity (kg) Price per kg ($) Subtotal ($)
Arabica Beans 0.82 20.00 13.74
Tax: $7.86
Shipping: $11.21
Total: $73.55
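In practice it is often more convenient to pull the individual fields out of the returned JSON than to work with the whole object. The following is a minimal sketch, assuming the function returns a semi-structured value with the “content” and “metadata” keys shown in the layout output above. The column aliases are illustrative.

```sql
SELECT
    TO_VARCHAR(doc:content)         AS content,     -- extracted text
    doc:metadata:pageCount::NUMBER  AS page_count   -- pages in the source PDF
FROM (
    SELECT
        SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
            @CORTEX_AI.MAIN.input_stage,
            'Coffee_Invoices.pdf',
            {'mode': 'OCR'}
        ) AS doc
);
```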
Step 4 – Parse the extracted text with regular expressions and load the results into a table. This lets you produce structured rows directly from PDFs.
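As a rough illustration of this step, the sketch below parses every PDF in the stage in OCR mode and pulls the header fields out with REGEXP_SUBSTR. The target table invoice_summary and its columns are hypothetical, and the patterns assume every invoice follows the same layout as the sample output above. Adjust both for your own documents.

```sql
-- Hypothetical target table; names and types are illustrative.
CREATE TABLE IF NOT EXISTS CORTEX_AI.MAIN.invoice_summary (
    file_name      STRING,
    invoice_number STRING,
    invoice_date   DATE,
    tax            NUMBER(10, 2),
    shipping       NUMBER(10, 2),
    total          NUMBER(10, 2)
);

INSERT INTO CORTEX_AI.MAIN.invoice_summary
SELECT
    file_name,
    -- The 'e' parameter returns the capture group (group 1), not the whole match.
    REGEXP_SUBSTR(content, 'Invoice Number:\\s*([A-Z]+-[0-9]+)', 1, 1, 'e', 1),
    TRY_TO_DATE(REGEXP_SUBSTR(content, 'Date:\\s*([0-9]{4}-[0-9]{2}-[0-9]{2})', 1, 1, 'e', 1)),
    TRY_TO_NUMBER(REGEXP_SUBSTR(content, 'Tax:\\s*[$]([0-9.]+)', 1, 1, 'e', 1), 10, 2),
    TRY_TO_NUMBER(REGEXP_SUBSTR(content, 'Shipping:\\s*[$]([0-9.]+)', 1, 1, 'e', 1), 10, 2),
    TRY_TO_NUMBER(REGEXP_SUBSTR(content, 'Total:\\s*[$]([0-9.]+)', 1, 1, 'e', 1), 10, 2)
FROM (
    -- Parse each PDF listed in the stage's directory table.
    SELECT
        relative_path AS file_name,
        TO_VARCHAR(
            SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
                @CORTEX_AI.MAIN.input_stage,
                relative_path,
                {'mode': 'OCR'}
            ):content
        ) AS content
    FROM DIRECTORY(@CORTEX_AI.MAIN.input_stage)
    WHERE relative_path ILIKE '%.pdf'
) parsed;
```

The TRY_ conversion functions return NULL instead of raising an error when a field is missing or malformed, so one bad document does not fail the whole load.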
Best Practices
As a Snowflake Select partner, we recommend the following best practices when using Cortex Parse Document:
- Optimize file formats
- Implement error handling
- Validate outputs
- Manage permissions
- Follow data governance
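Two of these practices, managing permissions and validating outputs, can be sketched in SQL. The role name invoice_reader is hypothetical, and the validation rule (non-empty content that contains an invoice number label) is only an assumption based on the sample invoice. Substitute checks that match your own documents.

```sql
-- Manage permissions: grant only what a parsing role needs.
-- invoice_reader is a hypothetical role name.
GRANT DATABASE ROLE SNOWFLAKE.CORTEX_USER TO ROLE invoice_reader;
GRANT USAGE ON DATABASE CORTEX_AI TO ROLE invoice_reader;
GRANT USAGE ON SCHEMA CORTEX_AI.MAIN TO ROLE invoice_reader;
GRANT READ ON STAGE CORTEX_AI.MAIN.input_stage TO ROLE invoice_reader;

-- Validate outputs: list files whose parsed text is empty or lacks an
-- expected label, so they can be reviewed by hand.
-- Note: this calls PARSE_DOCUMENT once per file in the stage.
SELECT file_name
FROM (
    SELECT
        relative_path AS file_name,
        TO_VARCHAR(
            SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
                @CORTEX_AI.MAIN.input_stage,
                relative_path,
                {'mode': 'OCR'}
            ):content
        ) AS content
    FROM DIRECTORY(@CORTEX_AI.MAIN.input_stage)
    WHERE relative_path ILIKE '%.pdf'
) parsed
WHERE content IS NULL
   OR LENGTH(TRIM(content)) = 0
   OR content NOT ILIKE '%Invoice Number:%';
```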
Summary
Snowflake Cortex’s Parse Document function enables automated document processing and text extraction within Snowflake’s Cortex framework. It supports various document formats and two parsing modes, LAYOUT and OCR. Users load documents into an encrypted stage, call the function from a SQL query, and extract structured data such as tables and text from the result.