Technique For Embedding And Extracting A Watermark In A Text Document

Abstract: A method, apparatus, and computer-readable code for embedding and extracting a watermark in a text document using digital watermarking processes are disclosed. To watermark the text document, the following steps are performed. The pages of the text document are transformed into corresponding images. The margins on each image are then detected and cropped to generate cropped images. The cropped images are segmented into blocks, some of which are selected based on the content of each block. The watermark is embedded in the selected blocks using a digital watermarking process. To extract the watermark from the watermarked text document, the block information from the watermark-embedding process is used to select the blocks of the watermarked text document from which the watermark is to be extracted. REF FIG: 1

Patent Information

Application #

Filing Date

23 September 2013

Publication Number

13/2015

Publication Type

INA

Invention Field

COMPUTER SCIENCE

Status

Parent Application

Applicants

INFOSYS LIMITED

IP CELL, PLOT NO.44, ELECTRONIC CITY, HOSUR ROAD, BANGALORE - 560 100

Inventors

1. SACHIN MEHTA

S/O SH. NARESH MEHTA, WARD NO.-1, KAISTHWARI ROAD, NAGROTA BAGWAN, DISTT - 176 047

2. DR. RAJARATHNAM NALLUSAMY

C/O N. MANIYASEKARAN, NORTH STREET, CHITRAPPATTY, THURAIYUR TALUK, TRICHY DIST 621 010

Specification

TECHNIQUE FOR EMBEDDING AND EXTRACTING A WATERMARK IN A TEXT DOCUMENT

FIELD OF THE INVENTION

[0001] The invention relates to the field of watermarking technology, and more particularly to a technique for embedding and extracting a watermark in a text document.

BACKGROUND

[0002] Advances in technology, especially innovations related to information dissemination and connectivity, have led to the development of portable and web-enabled devices. However, these advances have also increased Intellectual Property Rights (IPR) violations. As a means of distributing digital documents securely and protecting text documents from IPR violations, watermarking of text documents is gaining interest. Watermarking has emerged as a prominent solution for the protection of digital media (text documents, videos, audio, and images). However, watermarking text documents is very different from watermarking other digital media, since text documents lack the rich gray scale or color texture information that is abundantly available in digital images and videos.

[0003] Commonly used text watermarking methods include the Character Feature method, the Open Space method, the Zero Watermarking method, the Content Watermarking method, the Syntax Watermarking method, and the like. In the Character Feature method, features of characters such as shape, size, or position are manipulated. In the Open Space method, the watermark is embedded by modulating the inter-line, inter-word, or inter-character spacing. In the Zero Watermarking method, instead of embedding a watermark inside the text document, the watermark is generated from the features of the text document. In the Content Watermarking method, words are replaced by their synonyms, or sentences are transformed via suppression or inclusion of noun phrases. In the Syntax Watermarking method, marking is achieved by changing the structure of sentences. There are other watermarking methods wherein the watermark is embedded visually as an image.
Most of these methods carry very little information, which limits their applicability to document authentication, copyright protection, and tamper proofing. Additionally, some of these methods rely on the specific characteristics of a particular language, which makes them very difficult to apply to documents in other languages. Thirdly, syntax and semantic methods are based on substitution. Substitution may sometimes change the meaning of a sentence; hence, every watermarked document needs to be manually inspected. This is a tedious process and makes the method practically infeasible.

[0004] Watermarking of text documents is a less mature area than watermarking of digital images and videos. A significant amount of work has been done for digital images and videos, ranging from copyright protection to traitor tracing. Since text documents lack rich gray scale or color texture information, watermarking text documents is very different from watermarking other digital media.

[0005] Though techniques may exist to address the problem of watermarking text documents, the existing techniques do not leverage digital image and video watermarking methods for text documents.

[0006] Therefore, there is a general need for a technique that can use any digital image or video watermarking method to watermark text documents. Several aspects of the present disclosure provide a method and a system for embedding and extracting a watermark in a text document.

SUMMARY

[0007] Accordingly, the present disclosure is directed to a system, a computer program product, and a method for embedding a watermark in a text document, comprising receiving a watermark and the text document containing one or more pages, and transforming the pages of the text document into corresponding images. The margins on each image are detected and cropped to generate a cropped image. The cropped image is segmented into a plurality of blocks.
One or more blocks are selected from the plurality of blocks using selection protocols, and the watermark is embedded in each of the selected blocks. The watermark-embedded blocks are superimposed onto the corresponding blocks of the one or more images, and these images are converted into pages of the text document with the watermark embedded.

[0008] In one embodiment, the margins are detected by applying a discrete differentiation operator over the images and computing the distance of the first white pixel from each side of the images. The cropped images are then generated by cropping the one or more images from the sides based on the computed distances. In another embodiment, one or more blocks are selected from the plurality of blocks by applying a discrete cosine transform (DCT) on each of the plurality of blocks to compute a DCT coefficient of the block, and classifying the plurality of blocks into texture blocks or non-texture blocks using the DCT coefficient. The blocks are classified by comparing the DCT coefficient of each block with content thresholds. In yet another embodiment, the watermarking process is either an image or a video watermarking process.

[0009] Further, the present disclosure is directed to a system, a computer program product, and a method for extracting a watermark from a watermarked text document, comprising receiving the watermarked text document comprising one or more pages and block information, wherein the block information comprises segmentation process details such as block size, an original cropped image size, and content thresholds. The pages are converted into corresponding images. The margins on each image are detected and cropped to generate a cropped image. The cropped images are resized based on the received original cropped image size. Then, each cropped image is segmented into a plurality of blocks based on the segmentation process details.
One or more blocks are selected from the plurality of blocks based on the content thresholds, and the watermark is extracted from the selected blocks. In one embodiment, resizing of the one or more cropped images is based on an interpolation process. However, other resizing methods can be used.

[0010] Further, in another embodiment, the margins are detected by applying a discrete differentiation operator over the images and computing the distance of the first white pixel from each side of the images. The cropped images are then generated by cropping the one or more images from the sides based on the computed distances. In another embodiment, one or more blocks are selected from the plurality of blocks by applying a discrete cosine transform (DCT) on each of the plurality of blocks to compute a DCT coefficient of the block, and classifying the plurality of blocks into texture blocks or non-texture blocks using the DCT coefficient. The blocks are classified by comparing the DCT coefficient of each block with content thresholds. In yet another embodiment, the watermarking process is either an image or a video watermarking process.

BRIEF DESCRIPTION OF THE DRAWINGS

[0011] FIG. 1 illustrates a process flow for embedding a watermark in a text document.

[0012] FIG. 2 illustrates an embodiment depicting the manner of generating a cropped image.

[0013] FIG. 3 illustrates exemplary blocks of the cropped page image.

[0014] FIG. 4 illustrates a process flow for extracting a watermark from a watermarked text document.

[0015] FIG. 5 illustrates system blocks depicting the manner of embedding a watermark in a text document.

[0016] FIG. 6 illustrates system blocks depicting the manner of extracting a watermark from a watermarked text document.

[0017] FIG. 7 shows an exemplary computing device useful for performing the processes disclosed herein.
DETAILED DESCRIPTION

[0018] The following description is the full and informative description of the best method and system presently contemplated for carrying out the present invention, as known to the inventors at the time of filing the patent application. Of course, many modifications and adaptations will be apparent to those skilled in the relevant arts in view of the following description and the accompanying drawings. While the invention described herein is provided with a certain degree of specificity, the present technique may be implemented with either greater or lesser specificity, depending on the needs of the user. Further, some of the features of the present technique may be used to advantage without the corresponding use of other features described in the following paragraphs. As such, the present description should be considered as merely illustrative of the principles of the present technique and not in limitation thereof.

[0019] FIG. 1 illustrates a process flow for embedding a watermark in a text document. As used herein, a "text document" refers to any structured or unstructured document that comprises text, graphics, or a combination thereof. The text document could be in any file format, such as Word, PDF, Excel, PPT, CHM, TXT, and the like. The text document comprises one or more pages. At step 110, the text document in which the watermark needs to be embedded is received by a watermark embedding computing device. The text document could be selected by a user. Further, the watermark embedding computing device receives the watermark that needs to be embedded in the pages of the document. For the purpose of illustration, assume that the text document has P pages of dimension N X M and that the watermark is of dimension n X m. The watermark can be either static (for applications such as copyright protection) or dynamic (for applications such as traitor tracing). A dynamic watermark is generated on the fly.
[0020] At step 120, the pages of the text document are transformed into an image format such as TIFF, GIF, JPEG, and the like. For the purpose of illustration, the text document is converted into images I such that each page is represented by one image, and the dimension of each page image is N X M. As will be appreciated by a person skilled in the art, the conversion of the text document into an image format can be performed using any known technique or tool.

[0021] Typically, the pages of a text document contain margins. As used herein, "margins" refers to the blank space at the top, bottom, and sides of the page that frames the body of written, typed, or printed matter (which includes text, graphics, or a combination thereof). At step 130, the margins of each page of the text document are detected. As shown in FIG. 2, document 220a shows margins d_l, d_t, d_r, and d_b for the left, top, right, and bottom sides of the page, respectively. At step 140, the detected margins are cropped to generate a cropped image C of each page of the text document. The method of cropping the margins from the image is explained in detail with respect to FIG. 2.

[0022] At step 150, the cropped image C is segmented into blocks. Assume that the cropped image C is divided into b blocks of dimension b1 X b2. At step 160, one or more blocks are selected from the b blocks. For selecting the blocks, a discrete cosine transform (DCT) is applied on each of the blocks to compute a DCT coefficient of each block. Let the transformed block be represented as b_dct. Then, at step 160, the blocks are classified into texture blocks or non-texture blocks based on the value of the DCT coefficient of each block, as explained below.

[0023] A block is considered a non-texture block if b_dct(0,0) > T1, wherein b_dct represents the transformed block after applying the DCT on a block.

[0024] A texture block can be completely text, completely graphics, partial text, or partial graphics and partial text.
Texture blocks are classified as:

[0025] T1, T2, T3, T4, and T5 are the content thresholds used to classify the blocks. The DCT coefficient of each block is compared with the content thresholds. The different types of blocks, based on their content, are illustrated and explained with respect to FIG. 3. b_txt herein represents the blocks that are classified as texture blocks. The content thresholds may be set by a user (fixed) or calculated automatically by a system (adaptive).

[0026] At step 170, the watermark is embedded in the selected texture blocks. The watermark can be embedded, using any image or video watermarking algorithm, in the blocks that are classified as texture blocks. The watermark is embedded in texture blocks so that it remains imperceptible. A watermark embedded in a non-texture block is likely to be either perceptible or lost. For instance, a completely white block, which is a non-texture block, has pixels with the value 255. If a watermark is added to such a block using an image or video watermarking algorithm, the pixel values in that block would increase beyond 255. Since a pixel can only take a value between 0 and 255, the values are truncated to 255, which effectively removes the watermark. The watermarked texture blocks are superimposed onto the corresponding blocks of the image to obtain the watermarked image, and the watermarked image is then converted back into the text document to obtain the watermarked text document.

[0027] FIG. 2 illustrates an embodiment depicting the manner of generating a cropped image. The cropped page image 230 is generated by detecting the margins on the page image and then cropping the margins. For the purpose of illustration, assume that the page image 210 is of dimension N X M. To detect the margins on the page image 210, a discrete differentiation operator such as Sobel or Scharr is applied.
The differentiation operator finds regions of high intensity variation in the page image, such as the text area (including images, equations, etc.). The output of the discrete differentiation operator is image 220 in FIG. 2. From each side of the image 220, the first white pixel is identified to determine the margins. Let the distances of the first white pixel from the top, bottom, left, and right be d_t, d_b, d_l, and d_r, respectively, as shown in 220a (an expanded view of 220). After the margin distances are determined, the text area is cropped from page image I 210 to generate cropped page image C 230.

[0028] FIG. 3 illustrates exemplary blocks of the cropped page image. The cropped page image is segmented into b blocks of dimension b1 X b2. Among these b blocks, one or more blocks are selected using selection protocols for embedding the watermark. For selecting the blocks, a discrete cosine transform (DCT) is applied on each of the blocks to compute a DCT coefficient of each block. Let the transformed block be represented as b_dct. The b_dct value is then compared with the different content thresholds to classify the block as a texture block or a non-texture block. Block 310 in FIG. 3 is a block with minimal text; this block may be classified as a non-texture block. Block 320 is a complete texture block. Block 330 is a block with partial text and partial empty space. 340 shows two blocks that contain partial text and partial graphics. The classification of blocks 310, 320, 330, and 340 as texture or non-texture blocks depends on the content thresholds.

[0029] A block is considered a non-texture block if b_dct(0,0) > T1, wherein b_dct represents the transformed block after applying the DCT on a block.

[0030] A texture block can be completely text, completely graphics, partial text, or partial graphics and partial text.
Texture blocks are classified as:

[0031] b_txt =
    Partial text, if T2 < b_dct(0,0) < T1
    Complete text, if T3
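The margin detection of paragraph [0027] and the non-texture test of paragraph [0029] can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation contemplated by the specification. The function names, the edge threshold of 128, and the choice to drop partial blocks at the right and bottom edges are all hypothetical. The DCT is assumed to be the orthonormal 2-D DCT-II, whose (0,0) coefficient equals the sum of the block's pixels divided by sqrt(b1 x b2), so only that one value is computed. Only the T1 test is implemented, because the conditions involving T2 to T5 are not fully reproduced above.

```python
import math

def sobel_edges(img, edge_thresh=128):
    """Binary map, True where the Sobel gradient magnitude exceeds edge_thresh.
    These are the "white" pixels of the operator output (threshold is assumed)."""
    h, w = len(img), len(img[0])
    edges = [[False] * w for _ in range(h)]
    for y in range(1, h - 1):          # border pixels are left as False
        for x in range(1, w - 1):
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
            edges[y][x] = math.hypot(gx, gy) > edge_thresh
    return edges

def margin_distances(edges):
    """Distances (d_t, d_b, d_l, d_r) of the first white pixel from each side."""
    h, w = len(edges), len(edges[0])
    rows = [y for y in range(h) if any(edges[y])]
    cols = [x for x in range(w) if any(edges[y][x] for y in range(h))]
    if not rows:
        raise ValueError("blank page: no edge pixels found")
    return rows[0], h - 1 - rows[-1], cols[0], w - 1 - cols[-1]

def crop(img, d_t, d_b, d_l, d_r):
    """Remove the detected margins from the page image."""
    h, w = len(img), len(img[0])
    return [row[d_l:w - d_r] for row in img[d_t:h - d_b]]

def dc_coefficient(block):
    """b_dct(0,0) under the assumed orthonormal 2-D DCT-II."""
    b1, b2 = len(block), len(block[0])
    return sum(sum(row) for row in block) / math.sqrt(b1 * b2)

def texture_block_origins(img, b1, b2, t1):
    """Top-left corners of the b1 x b2 blocks that pass the texture test.
    A block is non-texture if b_dct(0,0) > T1; partial edge blocks are dropped."""
    origins = []
    for y in range(0, len(img) - b1 + 1, b1):
        for x in range(0, len(img[0]) - b2 + 1, b2):
            block = [row[x:x + b2] for row in img[y:y + b1]]
            if dc_coefficient(block) <= t1:
                origins.append((y, x))
    return origins

# Hypothetical 10 x 12 white page with a dark "text" region in the middle.
page = [[255] * 12 for _ in range(10)]
for y in range(3, 7):
    for x in range(4, 9):
        page[y][x] = 0

d_t, d_b, d_l, d_r = margin_distances(sobel_edges(page))
cropped = crop(page, d_t, d_b, d_l, d_r)
print((d_t, d_b, d_l, d_r), len(cropped), len(cropped[0]))
```

Because the Sobel response extends one pixel beyond the dark region, the crop in this sketch keeps a one-pixel white border around the content. Under the assumed normalization, a completely white 8-bit block has a (0,0) coefficient of 255 x sqrt(b1 x b2), which gives a sense of scale when choosing T1.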

Documents

Application Documents

#	Name	Date
1	4299-CHE-2013 FORM-3 23-09-2013.pdf	2013-09-23
2	4299-CHE-2013-AbandonedLetter.pdf	2020-02-11
3	4299-CHE-2013-FER.pdf	2019-08-09
4	4299-CHE-2013 FORM-2 23-09-2013.pdf	2013-09-23
5	4299-CHE-2013 FORM-18 17-11-2014.pdf	2014-11-17
6	4299-CHE-2013 FORM-1 23-09-2013.pdf	2013-09-23
7	4299-CHE-2013 FORM-1 30-07-2014.pdf	2014-07-30
8	4299-CHE-2013 DRAWINGS 23-09-2013.pdf	2013-09-23
9	4299-CHE-2013 DESCRIPTION (COMPLETE) 23-09-2013.pdf	2013-09-23
10	4299-CHE-2013 CORRESPONDENC OTHERS 30-07-2014.pdf	2014-07-30
11	abstract4299-CHE-2013.jpg	2014-07-08
12	4299-CHE-2013 CLAIMS 23-09-2013.pdf	2013-09-23
13	4299-CHE-2013 ABSTRACT 23-09-2013.pdf	2013-09-23

Search Strategy

1	search_08-08-2019.pdf