\n",
+ " \n",
+ " \n",
+ " # \n",
+ " | \n",
+ " \n",
+ " Description of the item \n",
+ " | \n",
+ " \n",
+ " Entity name \n",
+ " | \n",
+ " \n",
+ " Occurrence Type \n",
+ " | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " 1. \n",
+ " | \n",
+ " \n",
+ " Total line item (includes Check Number, check date and check amount, check description) [Parent] \n",
+ " | \n",
+ " \n",
+ " check_item \n",
+ " | \n",
+ " \n",
+ " Optional multiple \n",
+ " | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " 2. \n",
+ " | \n",
+ " \n",
+ " Check number [child] \n",
+ " | \n",
+ " \n",
+ " check_number \n",
+ " | \n",
+ " \n",
+ " Optional once \n",
+ " | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " 3. \n",
+ " | \n",
+ " \n",
+ " Check date [child] \n",
+ " | \n",
+ " \n",
+ " check_date \n",
+ " | \n",
+ " \n",
+ " Optional once \n",
+ " | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " 4 \n",
+ " | \n",
+ " \n",
+ " Check amount [child] \n",
+ " | \n",
+ " \n",
+ " check_amount \n",
+ " | \n",
+ " \n",
+ " Optional once \n",
+ " | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " 5. \n",
+ " | \n",
+ " \n",
+ " Check Description [child] \n",
+ " | \n",
+ " \n",
+ " check_desc \n",
+ " | \n",
+ " \n",
+ " Optional once \n",
+ " | \n",
+ "
\n",
+ "
\n",
+ "\n",
+ "* *Sample Schema* \n",
+ "
+
+ Bank Statement parser output entity type Before post processing |
+ After post processing |
+
+
+ account_number |
+ account_0_number account_1_number ..etc |
+
+
+ account_type |
+ account_0_name account_1_name ..etc |
+
+
+ starting_balance |
+ account_0_beggining_balance account_1_beggining_balance ..etc |
+
+
+ ending_balance |
+ account_0_ending_balance account_1_ending_balance ..etc |
+
+
+ table_item/transaction_deposit_date |
+ account_0_transaction/deposit_date account_1_transaction/deposit_date ..etc |
+
+
+ table_item/transaction_deposit_description |
+ account_0_transaction/deposit_description account_1_transaction/deposit_description ..etc |
+
+
+ table_item/transaction_deposit |
+ account_0_transaction/deposit account_1_transaction/deposit ..etc |
+
+
+ table_item/transaction_withdrawal_date |
+ account_0_transaction/withdrawal_date account_1_transaction/withdrawal_date ..etc |
+
+
+ table_item/transaction_withdrawal_description |
+ account_0_transaction/withdrawal_description account_1_transaction/withdrawal_description ..etc |
+
+
+ table_item/transaction_withdrawal |
+ account_0_transaction/withdrawal account_1_transaction/withdrawal ..etc |
+
+
+ table_item |
+ account_0_trasaction account_1_transaction ..etc |
+
+
diff --git a/incubator-tools/categorizing_bank_statement_transactions_by_account_number/categorizing_bank_statement_transactions_by_account_number.ipynb b/incubator-tools/categorizing_bank_statement_transactions_by_account_number/categorizing_bank_statement_transactions_by_account_number.ipynb
new file mode 100644
index 000000000..1a248439b
--- /dev/null
+++ b/incubator-tools/categorizing_bank_statement_transactions_by_account_number/categorizing_bank_statement_transactions_by_account_number.ipynb
@@ -0,0 +1,750 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Categorizing Bank Statement Transactions by Account Number"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "* Author: docai-incubator@google.com"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Disclaimer\n",
+ "\n",
+ "This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Objective\n",
+ "This document guides to categorize the transactions for each account number from the bank statement parsed json.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Prerequisite\n",
+ "* Python : Jupyter notebook (Vertex) \n",
+ "* GCS storage bucket\n",
+ "* Bank Statement Parser"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Step by Step Procedure"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 1. Import ModulesPpackages"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Note: you may need to restart the kernel to use updated packages.\n",
+ "Note: you may need to restart the kernel to use updated packages.\n"
+ ]
+ }
+ ],
+ "source": [
+ "%pip install google-cloud-documentai --quiet\n",
+ "%pip install google-cloud-documentai-toolbox --quiet"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "--2024-01-12 12:45:47-- https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py\n",
+ "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n",
+ "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n",
+ "HTTP request sent, awaiting response... 200 OK\n",
+ "Length: 29735 (29K) [text/plain]\n",
+ "Saving to: ‘utilities.py’\n",
+ "\n",
+ "utilities.py 100%[===================>] 29.04K --.-KB/s in 0.002s \n",
+ "\n",
+ "2024-01-12 12:45:47 (13.3 MB/s) - ‘utilities.py’ saved [29735/29735]\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import re\n",
+ "from collections import Counter, defaultdict\n",
+ "from difflib import SequenceMatcher\n",
+ "from typing import Dict, List, Union\n",
+ "\n",
+ "from google.cloud import documentai_v1beta3 as documentai\n",
+ "from google.cloud.documentai_toolbox import gcs_utilities\n",
+ "\n",
+ "import utilities"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 2. Input Details"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "`gcs_input_path`: Input GCS path which contains bank statement parser JSON files \n",
+ "`gcs_output_path`: GCS path to store post processed(JSON) results"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Bank statement parser jsons path\n",
+ "gcs_input_path = \"gs://bucket/path_to/pre/input\"\n",
+ "# post process json path\n",
+ "gcs_output_path = \"gs://bucket/path_to/post/output/\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 3. Run Below Code-cells"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Categorizing Bank Statement Transactions by Account Number Process Started...\n",
+ "\tFile: 1941000828-0.json\n",
+ "\t\tPost processed data uploaded to gs://siddamv/categorizing_bank_statement_transactions_by_account_number/post/output/1941000828-0.json\n",
+ "\tFile: 2016398000-0.json\n",
+ "\t\tPost processed data uploaded to gs://siddamv/categorizing_bank_statement_transactions_by_account_number/post/output/2016398000-0.json\n",
+ "\tFile: 2016654464-0.json\n",
+ "\t\tPost processed data uploaded to gs://siddamv/categorizing_bank_statement_transactions_by_account_number/post/output/2016654464-0.json\n",
+ "\tFile: 2017496199-0.json\n",
+ "\t\tPost processed data uploaded to gs://siddamv/categorizing_bank_statement_transactions_by_account_number/post/output/2017496199-0.json\n",
+ "\tFile: 2024616717-0.json\n",
+ "\t\tPost processed data uploaded to gs://siddamv/categorizing_bank_statement_transactions_by_account_number/post/output/2024616717-0.json\n",
+ "\tFile: SampleBank-0.json\n",
+ "\t\tPost processed data uploaded to gs://siddamv/categorizing_bank_statement_transactions_by_account_number/post/output/SampleBank-0.json\n",
+ "Process Completed\n"
+ ]
+ }
+ ],
+ "source": [
+ "def del_ent_attrs(ent: documentai.Document.Entity) -> None:\n",
+ " \"\"\"To delete empty attributes of Entity object\n",
+ "\n",
+ " Args:\n",
+ " ent (documentai.Document.Entity): DocumentAI doc-proto object\n",
+ " \"\"\"\n",
+ "\n",
+ " if not ent.normalized_value:\n",
+ " del ent.normalized_value\n",
+ " if not ent.confidence:\n",
+ " del ent.confidence\n",
+ " if not ent.page_anchor:\n",
+ " del ent.page_anchor\n",
+ " if not ent.id:\n",
+ " del ent.id\n",
+ " if not ent.mention_text:\n",
+ " del ent.mention_text\n",
+ " if not ent.text_anchor:\n",
+ " del ent.text_anchor\n",
+ "\n",
+ "\n",
+ "def boundary_markers(doc: documentai.Document) -> documentai.Document:\n",
+ " \"\"\"It will rename all entities & child_entities type_\n",
+ "\n",
+ " Args:\n",
+ " doc (documentai.Document): DocumentAI Doc-Proto object\n",
+ "\n",
+ " Returns:\n",
+ " documentai.Document: It returns DocumentAI Doc-Proto object with new entity-type\n",
+ " \"\"\"\n",
+ "\n",
+ " # find ent_ids of Json\n",
+ " ent_ids = defaultdict(list)\n",
+ " all_entities = []\n",
+ " for idx, entity in enumerate(doc.entities):\n",
+ " if entity.id:\n",
+ " ent_ids[idx].append(int(entity.id))\n",
+ " all_entities.append(entity)\n",
+ " for prop in entity.properties:\n",
+ " ent_ids[idx].append(int(prop.id))\n",
+ " all_entities.append(prop)\n",
+ " all_entities = sorted(all_entities, key=lambda x: x.id)\n",
+ " # Single Level Entities file : json_dict\n",
+ " json_dict = defaultdict(list)\n",
+ " for entity in all_entities:\n",
+ " json_dict[\"confidence\"].append(entity.confidence)\n",
+ " json_dict[\"id\"].append(entity.id)\n",
+ " json_dict[\"mentionText\"].append(entity.mention_text)\n",
+ " json_dict[\"normalizedValue\"].append(entity.normalized_value)\n",
+ " json_dict[\"pageAnchor\"].append(entity.page_anchor)\n",
+ " json_dict[\"textAnchor\"].append(entity.text_anchor)\n",
+ " json_dict[\"type\"].append(entity.type_)\n",
+ "\n",
+ " acc_dict = {}\n",
+ " idx = 0\n",
+ " for ent in doc.entities:\n",
+ " if ent.type_ != \"account_number\":\n",
+ " continue\n",
+ " pg_no = ent.page_anchor.page_refs[0].page\n",
+ " y_min = min(\n",
+ " vertex.y\n",
+ " for vertex in ent.page_anchor.page_refs[0].bounding_poly.normalized_vertices\n",
+ " )\n",
+ " acn = re.sub(\"\\D\", \"\", ent.mention_text.strip(\".#:' \"))\n",
+ " acc_dict[idx] = {\"page\": pg_no, \"account_number\": acn, \"min_y\": y_min}\n",
+ " idx += 1\n",
+ "\n",
+ " sorted_data = sorted(acc_dict.values(), key=lambda x: (int(x[\"page\"]), x[\"min_y\"]))\n",
+ " # acns -> acns\n",
+ " acns = {}\n",
+ " idx = 0\n",
+ " for data in sorted_data:\n",
+ " acn = data[\"account_number\"]\n",
+ " if acn not in acns and len(acn) > 6:\n",
+ " acns[acn] = f\"account_{idx}_number\"\n",
+ " idx += 1\n",
+ " acn_dict = {}\n",
+ " acn_page_dict = {}\n",
+ " for key, value in acns.items():\n",
+ " si_ei_pn = []\n",
+ " pg_nos = set()\n",
+ " zip_data = zip(\n",
+ " json_dict[\"mentionText\"], json_dict[\"pageAnchor\"], json_dict[\"textAnchor\"]\n",
+ " )\n",
+ " for mt, pa, ta in zip_data:\n",
+ " if re.sub(\"\\D\", \"\", mt.strip(\".#:' \")) == key:\n",
+ " page = pa.page_refs[0].page\n",
+ " ts = ta.text_segments[0]\n",
+ " si_ei_pn.append((ts.start_index, ts.end_index, page))\n",
+ " pg_nos.add(page)\n",
+ " acn_page_dict[value] = pg_nos\n",
+ " acn_dict[value] = si_ei_pn\n",
+ "\n",
+ " page_no = set(range(len(doc.pages)))\n",
+ " pages_temp = set()\n",
+ " for pn_set in acn_page_dict.values():\n",
+ " page_no = page_no & pn_set\n",
+ " if page_no:\n",
+ " pages_temp = page_no\n",
+ " page_no = list(pages_temp)\n",
+ " for value in acn_dict.values():\n",
+ " value.sort(key=lambda x: x[2])\n",
+ "\n",
+ " acns_to_delete = []\n",
+ " for key, value in acn_dict.items():\n",
+ " if key != \"account_0_number\":\n",
+ " min_si = min_ei = min_page = float(\"inf\")\n",
+ " data_to_rm = []\n",
+ " if len(value) <= 1:\n",
+ " acns_to_delete.append(key)\n",
+ " continue\n",
+ " for si_ei_pn in value:\n",
+ " check_length = len(data_to_rm) < len(value) - 1\n",
+ " check_if = (si_ei_pn[2] in page_no) and (si_ei_pn[2] < 3)\n",
+ " if check_if and check_length:\n",
+ " data_to_rm.append(si_ei_pn)\n",
+ " continue\n",
+ " min_si = min(min_si, si_ei_pn[0])\n",
+ " min_ei = min(min_ei, si_ei_pn[1])\n",
+ " min_page = min(min_page, si_ei_pn[2])\n",
+ " data_to_rm.append(si_ei_pn)\n",
+ " for k in data_to_rm:\n",
+ " value.remove(k)\n",
+ " acn_dict[key] = [(min_si, min_ei, min_page)]\n",
+ " continue\n",
+ " min_si = min_ei = 0\n",
+ " min_page = float(\"inf\")\n",
+ " data_to_rm = []\n",
+ " for si_ei_pn in value:\n",
+ " if si_ei_pn[2] != page_no[0]:\n",
+ " continue\n",
+ " min_si = max(min_si, si_ei_pn[0])\n",
+ " min_ei = max(min_ei, si_ei_pn[1])\n",
+ " min_page = min(min_page, si_ei_pn[2])\n",
+ " data_to_rm.append(si_ei_pn)\n",
+ "\n",
+ " for k in data_to_rm:\n",
+ " value.remove(k)\n",
+ " acn_dict[\"account_0_number\"] = [(min_si, min_ei, min_page)]\n",
+ "\n",
+ " for i in acns_to_delete:\n",
+ " del acn_dict[i]\n",
+ "\n",
+ " txt_len = len(doc.text)\n",
+ " if len(acns) > 1:\n",
+ " border_idx = []\n",
+ " for si_ei_pn in acn_dict.values():\n",
+ " border_idx.append((si_ei_pn[0][0], si_ei_pn[0][1]))\n",
+ "\n",
+ " region_splitter = []\n",
+ " for bi in border_idx:\n",
+ " region_splitter.append(bi[0])\n",
+ "\n",
+ " region_splitter_dict = {}\n",
+ " for idx, rs in enumerate(region_splitter):\n",
+ " region_splitter_dict[rs] = f\"account_{idx}\"\n",
+ " region_splitter_dict[txt_len] = \"last_index\"\n",
+ " else:\n",
+ " region_splitter_dict = dict([(txt_len, \"account_0\")])\n",
+ " region_splitter_dict[txt_len + 1] = \"last_index\"\n",
+ "\n",
+ " for i, _ in enumerate(json_dict[\"id\"]):\n",
+ " sub_str = re.sub(\"\\D\", \"\", json_dict[\"mentionText\"][i].strip(\".#:' \"))\n",
+ " ent_type = json_dict[\"type\"][i]\n",
+ " if ent_type == \"account_number\" and len(sub_str) > 5:\n",
+ " json_dict[\"type\"][i] = acns[sub_str]\n",
+ "\n",
+ " TYPE_MAPPING = {\n",
+ " \"starting_balance\": \"_beginning_balance\",\n",
+ " \"ending_balance\": \"_ending_balance\",\n",
+ " \"table_item/transaction_deposit_date\": \"_transaction/deposit_date\",\n",
+ " \"table_item/transaction_deposit_description\": \"_transaction/deposit_desc\",\n",
+ " \"table_item/transaction_deposit\": \"_transaction/deposit_amount\",\n",
+ " \"table_item/transaction_withdrawal_date\": \"_transaction/withdraw_date\",\n",
+ " \"table_item/transaction_withdrawal_description\": \"_transaction/withdraw_desc\",\n",
+ " \"table_item/transaction_withdrawal\": \"_transaction/withdraw_amount\",\n",
+ " }\n",
+ " for i, _id in enumerate(json_dict[\"id\"]):\n",
+ " try:\n",
+ " si = json_dict[\"textAnchor\"][i].text_segments[0].start_index\n",
+ " except IndexError:\n",
+ " # To skip entity type checking if there is no TextAnchor object in Doc Proto\n",
+ " continue\n",
+ " ent_type = json_dict[\"type\"][i]\n",
+ " keys = list(region_splitter_dict.keys())\n",
+ " for j in range(1, len(region_splitter_dict)):\n",
+ " if ent_type in TYPE_MAPPING and si < keys[j]:\n",
+ " json_dict[\"type\"][i] = (\n",
+ " region_splitter_dict[keys[j - 1]] + TYPE_MAPPING[ent_type]\n",
+ " )\n",
+ " break\n",
+ "\n",
+ " new_entities = []\n",
+ " for i, _ in enumerate(all_entities):\n",
+ " entity = documentai.Document.Entity(\n",
+ " confidence=json_dict[\"confidence\"][i],\n",
+ " id=json_dict[\"id\"][i],\n",
+ " mention_text=json_dict[\"mentionText\"][i],\n",
+ " normalized_value=json_dict[\"normalizedValue\"][i],\n",
+ " page_anchor=json_dict[\"pageAnchor\"][i],\n",
+ " text_anchor=json_dict[\"textAnchor\"][i],\n",
+ " type_=json_dict[\"type\"][i],\n",
+ " )\n",
+ " new_entities.append(entity)\n",
+ " new_entities_to_id_dict = {}\n",
+ " for ent in new_entities:\n",
+ " new_entities_to_id_dict[int(ent.id)] = ent\n",
+ " all_entities_new = [\"\"] * len(ent_ids)\n",
+ " for i, _ids in ent_ids.items():\n",
+ " if len(_ids) == 1:\n",
+ " all_entities_new[i] = new_entities_to_id_dict[_ids[0]]\n",
+ " continue\n",
+ " sub_entities = []\n",
+ " for _id in _ids:\n",
+ " sub_entities.append(new_entities_to_id_dict[_id])\n",
+ " all_entities_new[i] = doc.entities[i]\n",
+ " all_entities_new[i].properties = sub_entities\n",
+ " for ent in all_entities_new:\n",
+ " del_ent_attrs(ent)\n",
+ " for child_ent in ent.properties:\n",
+ " del_ent_attrs(child_ent)\n",
+ " for i in all_entities_new:\n",
+ " if i.type_ == \"table_item\":\n",
+ " i.type_ = i.properties[0].type_.split(\"/\")[0]\n",
+ "\n",
+ " doc.entities = all_entities_new\n",
+ " return doc\n",
+ "\n",
+ "\n",
+ "def match_ent_type(doc: documentai.Document, ent_type: str) -> Dict[str, str]:\n",
+ " \"\"\"It will look for provided `ent_type` with all entities in doc-proto object & clean its matched mention_text\n",
+ "\n",
+ " Args:\n",
+ " doc (documentai.Document): DocumentAI doc-proto object\n",
+ " ent_type (str): A string-data to look in all entities\n",
+ "\n",
+ " Returns:\n",
+ " Dict[str, str]: All matched entity-types with provided `ent_type` as key and Its most-frequent mention_text as value\n",
+ " \"\"\"\n",
+ "\n",
+ " types = set()\n",
+ " for entity in doc.entities:\n",
+ " if ent_type in entity.type_:\n",
+ " types.add(entity.type_)\n",
+ " types_dict = {}\n",
+ " for unique_type in types:\n",
+ " cleaned_mts = []\n",
+ " for entity in doc.entities:\n",
+ " if unique_type == entity.type_:\n",
+ " cleaned_mts.append(entity.mention_text.strip(\"$#\"))\n",
+ " data = Counter(cleaned_mts).most_common(1)[0][0]\n",
+ " types_dict[unique_type] = data\n",
+ " return types_dict\n",
+ "\n",
+ "\n",
+ "def fix_account_balance(doc: documentai.Document) -> documentai.Document:\n",
+ " \"\"\"It will fix account balance for doc-proto entities whose entity-types matches with `beginning_balance` or `ending_balance`\n",
+ "\n",
+ " Args:\n",
+ " doc (documentai.Document): DocumentAI doc-proto object\n",
+ "\n",
+ " Returns:\n",
+ " documentai.Document: It returns updated DocumentAI Doc-Proto object\n",
+ " \"\"\"\n",
+ "\n",
+ " beg_end_dict = dict()\n",
+ " beg_end_dict.update(match_ent_type(doc, \"beginning_balance\"))\n",
+ " beg_end_dict.update(match_ent_type(doc, \"ending_balance\"))\n",
+ " for entity in doc.entities:\n",
+ " mt = entity.mention_text.strip(\"$#\")\n",
+ " et = entity.type_\n",
+ " keys = list(beg_end_dict.keys())\n",
+ " values = list(beg_end_dict.values())\n",
+ " if et in beg_end_dict:\n",
+ " if mt != beg_end_dict[et] and mt in values:\n",
+ " entity.type_ = keys[values.index(mt)]\n",
+ " elif mt != beg_end_dict[et]:\n",
+ " doc.entities.remove(entity)\n",
+ " return doc\n",
+ "\n",
+ "\n",
+ "def find_account_number(\n",
+ " data: List[Dict[str, Union[int, float]]], page_no: int, y_coord: float\n",
+ ") -> Union[None, str]:\n",
+ " \"\"\"It will look for nearest account_number in provided page number based on y_coord\n",
+ "\n",
+ " Args:\n",
+ " data (List[Dict[str, Union[int, float]]]): It contains account-numbers and its corresponding page_no & y-coordinate\n",
+ " page_no (int): Page number to look for account-number\n",
+ " y_coord (float): minimum y-coordinate of token which matches with r\"\\sstatement\"\n",
+ "\n",
+ " Returns:\n",
+ " Union[None,str]: It returns either None or closest account number from given `page_no`\n",
+ " \"\"\"\n",
+ " closest_acc = None\n",
+ " min_dst = float(\"inf\")\n",
+ " for acn, page_info_list in data.items():\n",
+ " for page_info in page_info_list:\n",
+ " page = page_info.get(\"page\")\n",
+ " y = page_info.get(\"y\")\n",
+ " dst = abs(y_coord - y)\n",
+ " if page == page_no and dst < min_dst:\n",
+ " min_dst = dst\n",
+ " closest_acc = acn\n",
+ " return closest_acc\n",
+ "\n",
+ "\n",
+ "def detials_account(\n",
+ " doc: documentai.Document, account_type: str\n",
+ ") -> List[\n",
+ " Dict[\n",
+ " str,\n",
+ " Dict[str, Union[str, int, documentai.Document.TextAnchor.TextSegment, float]],\n",
+ " ]\n",
+ "]:\n",
+ " \"\"\"It will look for entities whose type_ matches with `account_type`\n",
+ "\n",
+ " Args:\n",
+ " doc (documentai.Document): DocumentAI doc-proto object\n",
+ " account_type (str): String data to match with individual entity.type_\n",
+ "\n",
+ " Returns:\n",
+ " List[Dict[str,Dict[str, Union[str,int,documentai.Document.TextAnchor.TextSegment, float]]]]:\n",
+ " it returnsList which has dictionary of mention_text and its id, page_number, text_segment, x_max & y_max\n",
+ " \"\"\"\n",
+ " acc_dict_lst = []\n",
+ " for ent in doc.entities:\n",
+ " if ent.properties:\n",
+ " continue\n",
+ " match_ratio = SequenceMatcher(None, ent.type_, account_type).ratio()\n",
+ " if match_ratio >= 0.9:\n",
+ " id1 = ent.id\n",
+ " page1 = ent.page_anchor.page_refs[0].page\n",
+ " text_segment = ent.text_anchor.text_segments[0]\n",
+ " x_coords = []\n",
+ " y_coords = []\n",
+ " nvs = ent.page_anchor.page_refs[0].bounding_poly.normalized_vertices\n",
+ " for nv in nvs:\n",
+ " x_coords.append(nv.x)\n",
+ " y_coords.append(nv.y)\n",
+ " x_max = max(x_coords, default=\"\")\n",
+ " y_max = max(y_coords, default=\"\")\n",
+ " acc_dict_lst.append(\n",
+ " {\n",
+ " ent.mention_text: {\n",
+ " \"id\": id1,\n",
+ " \"page\": page1,\n",
+ " \"textSegments\": text_segment,\n",
+ " \"x_max\": x_max,\n",
+ " \"y_max\": y_max,\n",
+ " }\n",
+ " }\n",
+ " )\n",
+ " return acc_dict_lst\n",
+ "\n",
+ "\n",
+ "def accounttype_change(doc: documentai.Document) -> documentai.Document:\n",
+ " \"\"\"It will rename entity type_ for all target entities in doc-proto object\n",
+ "\n",
+ " Args:\n",
+ " doc (documentai.Document): DocumentAI doc-proto object\n",
+ "\n",
+ " Returns:\n",
+ " documentai.Document: It returns updated doc-proto object\n",
+ " \"\"\"\n",
+ "\n",
+ " acc_name_dict = detials_account(doc, \"account_type\")\n",
+ " acn_dict = detials_account(doc, \"account_i_number\")\n",
+ " temp_del = []\n",
+ " for item in acc_name_dict:\n",
+ " for key in item:\n",
+ " if re.search(\"\\sstatement\", key, re.IGNORECASE):\n",
+ " temp_del.append(key)\n",
+ " for idx, item in enumerate(acc_name_dict):\n",
+ " for key in item:\n",
+ " for m in temp_del:\n",
+ " if key == m:\n",
+ " del acc_name_dict[idx]\n",
+ " acc_comp = []\n",
+ " for name_item in acc_name_dict:\n",
+ " for acn_item in acn_dict:\n",
+ " for key, value in name_item.items():\n",
+ " for acn, value_2 in acn_item.items():\n",
+ " y_diff = abs(value[\"y_max\"] - value_2[\"y_max\"])\n",
+ " acc_comp.append({key: {acn: y_diff}})\n",
+ "\n",
+ " ymin_dict = {}\n",
+ " for entry in acc_comp:\n",
+ " for acc_type, account_info in entry.items():\n",
+ " # acn -> account_number\n",
+ " for acn, miny in account_info.items():\n",
+ " if acn in ymin_dict:\n",
+ " curr_min = ymin_dict[acn][\"min_value\"]\n",
+ " if miny < curr_min:\n",
+ " ymin_dict[acn] = {\"account_type\": acc_type, \"min_value\": miny}\n",
+ " else:\n",
+ " ymin_dict[acn] = {\"account_type\": acc_type, \"min_value\": miny}\n",
+ "\n",
+ " # Extract one account name based on min y\n",
+ " result_dict = {acn: data[\"account_type\"] for acn, data in ymin_dict.items()}\n",
+ " acn_ymin = {}\n",
+ " map_acc_type = {}\n",
+ " for ent in doc.entities:\n",
+ " match_ratio = SequenceMatcher(None, ent.type_, \"account_i_number\").ratio()\n",
+ " if match_ratio > 0.8:\n",
+ " acc_num1 = re.sub(\"\\D\", \"\", ent.mention_text.strip(\".#:' \"))\n",
+ " if len(acc_num1) > 5:\n",
+ " nvs = ent.page_anchor.page_refs[0].bounding_poly.normalized_vertices\n",
+ " min_y1 = min(nv.y for nv in nvs)\n",
+ " page = ent.page_anchor.page_refs[0].page\n",
+ " if acc_num1 in acn_ymin.keys():\n",
+ " acn_ymin[acc_num1].append({\"y\": min_y1, \"page\": page})\n",
+ " else:\n",
+ " acn_ymin[acc_num1] = [{\"y\": min_y1, \"page\": page}]\n",
+ " cond1 = ent.mention_text in result_dict.keys()\n",
+ " cond2 = ent.mention_text not in map_acc_type.keys()\n",
+ " if cond1 and cond2:\n",
+ " map_acc_type[ent.mention_text] = ent.type_\n",
+ "\n",
+ " for ent in doc.entities:\n",
+ " cond1 = ent.type_ == \"account_type\"\n",
+ " cond2 = re.search(\"\\sstatement\", ent.mention_text, re.IGNORECASE)\n",
+ " if cond1 and cond2:\n",
+ " doc.entities.remove(ent)\n",
+ " elif cond1:\n",
+ " nvs = ent.page_anchor.page_refs[0].bounding_poly.normalized_vertices\n",
+ " ymin_2 = min(nv.y for nv in nvs)\n",
+ " page = ent.page_anchor.page_refs[0].page\n",
+ " x1 = find_account_number(acn_ymin, page, ymin_2)\n",
+ " try:\n",
+ " data = map_acc_type[x1].split(\"_\")[1]\n",
+ " except KeyError:\n",
+ " continue\n",
+ " else:\n",
+ " ent.type_ = f\"account_{data}_name\"\n",
+ " return doc\n",
+ "\n",
+ "\n",
+ "input_bucket, _ = gcs_utilities.split_gcs_uri(gcs_input_path)\n",
+ "output_bucket, output_files_dir = gcs_utilities.split_gcs_uri(gcs_output_path)\n",
+ "_, file_dict = utilities.file_names(gcs_input_path)\n",
+ "print(f\"Categorizing Bank Statement Transactions by Account Number Process Started...\")\n",
+ "for fn, fp in file_dict.items():\n",
+ " print(f\"\\tFile: {fn}\")\n",
+ " doc = utilities.documentai_json_proto_downloader(input_bucket, fp)\n",
+ " try:\n",
+ " doc = boundary_markers(doc)\n",
+ " except Exception as e:\n",
+ " doc = doc\n",
+ " print(\"Unable to update the account details because of {}\".format(e.args))\n",
+ " try:\n",
+ " doc = fix_account_balance(doc)\n",
+ " except Exception as e:\n",
+ " print(\n",
+ " \"Unable to update the starting and ending balance because of {}\".format(\n",
+ " e.args\n",
+ " )\n",
+ " )\n",
+ " try:\n",
+ " doc = accounttype_change(doc)\n",
+ " except Exception as e:\n",
+ " print(\"Unable to update the account type because of {}\".format(e).args)\n",
+ " str_data = documentai.Document.to_json(\n",
+ " doc,\n",
+ " use_integers_for_enums=False,\n",
+ " including_default_value_fields=False,\n",
+ " preserving_proto_field_name=False,\n",
+ " )\n",
+ " output_file_path = f\"{output_files_dir.rstrip('/')}/{fn}\"\n",
+ " target_path = output_file_path if output_files_dir else fn\n",
+ " utilities.store_document_as_json(str_data, output_bucket, target_path)\n",
+ " print(f\"\\t\\tPost processed data uploaded to gs://{output_bucket}/{target_path}\")\n",
+ "print(f\"Process Completed\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 4. Output Details"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The bank statement parser entities for transactions will be mapped relating to the account \n",
+ "Mapping as below \n",
+ "\n",
+ " \n",
+ " Bank Statement parser output entity type Before post processing | \n",
+ " After post processing | \n",
+ "
\n",
+ " \n",
+ " account_number | \n",
+ " account_0_number account_1_number ..etc | \n",
+ "
\n",
+ " \n",
+ " account_type | \n",
+ " account_0_name account_1_name ..etc | \n",
+ "
\n",
+ " \n",
+ " starting_balance | \n",
+ " account_0_beggining_balance account_1_beggining_balance ..etc | \n",
+ "
\n",
+ " \n",
+ " ending_balance | \n",
+ " account_0_ending_balance account_1_ending_balance ..etc | \n",
+ "
\n",
+ " \n",
+ " table_item/transaction_deposit_date | \n",
+ " account_0_transaction/deposit_date account_1_transaction/deposit_date ..etc | \n",
+ "
\n",
+ " \n",
+ " table_item/transaction_deposit_description | \n",
+ " account_0_transaction/deposit_description account_1_transaction/deposit_description ..etc | \n",
+ "
\n",
+ " \n",
+ " table_item/transaction_deposit | \n",
+ " account_0_transaction/deposit account_1_transaction/deposit ..etc | \n",
+ "
\n",
+ " \n",
+ " table_item/transaction_withdrawal_date | \n",
+ " account_0_transaction/withdrawal_date account_1_transaction/withdrawal_date ..etc | \n",
+ "
\n",
+ " \n",
+ " table_item/transaction_withdrawal_description | \n",
+ " account_0_transaction/withdrawal_description account_1_transaction/withdrawal_description ..etc | \n",
+ "
\n",
+ " \n",
+ " table_item/transaction_withdrawal | \n",
+ " account_0_transaction/withdrawal account_1_transaction/withdrawal ..etc | \n",
+ "
\n",
+ " \n",
+ " table_item | \n",
+ " account_0_trasaction account_1_transaction ..etc | \n",
+ "
\n",
+ "
\n"
+ ]
+ }
+ ],
+ "metadata": {
+ "environment": {
+ "kernel": "conda-root-py",
+ "name": "workbench-notebooks.m113",
+ "type": "gcloud",
+ "uri": "gcr.io/deeplearning-platform-release/workbench-notebooks:m113"
+ },
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel) (Local)",
+ "language": "python",
+ "name": "conda-root-py"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.13"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/incubator-tools/docai_pdf_clustering_analysis_tool/README.md b/incubator-tools/docai_pdf_clustering_analysis_tool/README.md
new file mode 100644
index 000000000..04a165d66
--- /dev/null
+++ b/incubator-tools/docai_pdf_clustering_analysis_tool/README.md
@@ -0,0 +1,22 @@
+# DocAI PDF Clustering Analysis Tool
+
+## Objective
+
+The tool is designed to perform advanced image analysis and clustering on PDF documents.
+It utilizes the VGG16 deep learning model to extract and process image features from PDF pages,
+applies PCA for dimensionality reduction, and employs K-Means clustering to categorize the images into distinct groups.
+The tool aims to facilitate efficient organization and analysis of visual data contained in large sets of PDF files.
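+
+A minimal sketch of that pipeline is shown below. It is illustrative only: the file names, image size,
+PCA component count, and cluster count are assumptions rather than the tool's actual interface or defaults
+(rendering PDF pages with `pdf2image` additionally requires poppler to be installed).
+
+```python
+import numpy as np
+from pdf2image import convert_from_path
+from sklearn.cluster import KMeans
+from sklearn.decomposition import PCA
+from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
+
+# VGG16 without its classification head acts as a fixed feature extractor (512-d vector per image).
+model = VGG16(weights="imagenet", include_top=False, pooling="avg")
+
+
+def pdf_page_features(pdf_path: str) -> np.ndarray:
+    """Render each page of a PDF to an image and extract one VGG16 feature vector per page."""
+    pages = convert_from_path(pdf_path, dpi=100)
+    features = []
+    for page in pages:
+        img = np.asarray(page.convert("RGB").resize((224, 224)), dtype=np.float32)
+        features.append(model.predict(preprocess_input(img[np.newaxis]), verbose=0)[0])
+    return np.vstack(features)
+
+
+# Hypothetical input files; in practice this would be the full PDF corpus.
+pdf_paths = ["statement_a.pdf", "statement_b.pdf"]
+features = np.vstack([pdf_page_features(path) for path in pdf_paths])
+
+# Reduce dimensionality, then group visually similar pages (the cluster count is a tunable assumption).
+reduced = PCA(n_components=min(50, features.shape[0])).fit_transform(features)
+labels = KMeans(n_clusters=min(5, len(reduced)), random_state=0).fit_predict(reduced)
+print(labels)  # one cluster id per PDF page
+```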
+
+## Practical Application
+This tool was created to aid in extracting tables from documents with varied layouts, responding to a
+customer's need to handle hundreds of uniquely formatted documents efficiently. By using clustering
+analysis, it helps in categorizing documents to facilitate easier management and analysis. This enables
+users to better understand their document variations and streamline the extraction process, making it
+highly beneficial for those looking to efficiently manage and analyze a large volume of PDF documents.
+
+
+## Clustering Analysis Output
+
+