Forensic steganalysis for identification of steganography software tools using multiple format image


INTRODUCTION
Image steganalysis is the art of uncovering the presence of secret data in an otherwise mundane image. The digital era has made numerous freeware steganography tools available online, which let novice users embed data easily without any prior knowledge of steganographic algorithms [1]-[3]. This makes covert communication trivial even for an illicit user. A simple analysis of existing steganography tools reveals that most of them use lossless 24-bit image formats, commonly BMP, GIF, PNG and TIFF. Among these, BMP is the most widely used because, although its data is uncompressed, it offers a large hiding area (implying a large payload) with a low probability of detection (a low pixel change rate). Lossy JPEG images are the least preferred because they are easily distorted (low payload; high pixel change rate), which makes detection much simpler. Thus, in general, the steganalysis process depends on the cover image format and the payload.
Most of the work in the literature concentrates on finding the artefacts that a steganographic algorithm produces in the cover image data as a result of embedding. Universal identification of steganographic software tools is scarcely reported in the current literature. This is because every steganographic software tool has an underlying algorithm, and detecting the algorithm is sufficient to separate cover images from stego images. However, a good many steganographic software tools use least significant bit (LSB) encoding, simple as it is, for digital image steganography. This makes tool identification challenging, since the algorithm-level steganalysis methods found in the literature cannot tell apart tools that share the same algorithm [4]-[15].
A wide range of simple targeted approaches to steganographic software tool identification is reported in the literature. The pioneering work in the field reported that some tools leave a signature which can be exploited to identify the tool and the stego image [16]. This was demonstrated for a set of steganographic software tools (S-Tools, Syscop, MandelSteg, etc.) which used palette and fractal images. Westfeld and Pfitzmann [17] used tools such as EzStego v2.0b3, Jsteg v4, Steganos v1.5 and S-Tools v4.0 for detection by statistical steganalysis. Provos and Honeyman [18] later developed StegDetect to identify steganographic content in JPEG images and StegBreak to launch a dictionary attack on that content and retrieve what is hidden. Geetha et al. [19] identified watermarking and steganographic tools using a genetic-X-means classifier. Verma et al. [20] proposed a steganalysis technique based on statistical observations of difference image histograms (DIH) for the reliable detection of classical LSB steganography; it measured the weak correlation between successive bit planes to construct a classifier that discriminates between stego images and cover images. Sloan and Hernandez-Castro [21] reported the identification of OpenPuff in video steganalysis. All of these approaches were targeted (specific to a single tool) and required patient manual scrutiny of the images, which was time consuming and error prone.
An important work in this field is [22], which designed a fully automated, blind, media-type agnostic approach to steganalysis based on bitwise analysis of header data and generated a signature for each tool. Although the approach was reported as media-type agnostic, the results were based on seven tools, of which five worked on JPEG, one on MP3 and one on GIF, and a minimum of 10 stego images per tool were used to generate that tool's signature. This provided an insight into building a universal steganalyser for steganographic software tool identification that exploits artefacts in both the image data and its metadata. Almost all current steganalysers require at least some knowledge of the steganographic software tools used. Supplying this information can be a mammoth task given the number of tools available [23]. So a payload-independent universal steganalyser is proposed that first exploits the macroscopically changing fields of the image metadata, and information derived from the metadata, to identify the tools by clustering. This segregates most of the tools, while those whose stego images resemble cover images are processed further by forming, for each such tool, a signature from the artefacts present in the image data of its stego images. The scope of the proposed universal steganalyser is limited to stego tools that work on five image formats, namely BMP, GIF, PNG, TIFF and JPEG.
The structure of this paper is as follows: Section 2 describes the structure of the universal steganalyser; Section 3 presents the steganalysis using the image header; Section 4 extends the steganalysis by comparing the signature generated for the steganographic software tool in the image data; Section 5 measures and evaluates the technique experimentally; Section 6 concludes the work, and is followed by appendices and references.

PROPOSED UNIVERSAL STEGANALYSIS
The steganalysis of stego images from different steganographic software tools is done in two phases. In the first phase, the stego images are first separated according to which of the five image formats they use. Then, for each image format, certain fields of the header data, or information derived from those fields, are extracted. These features are then subjected to unsupervised clustering by means of extended K-Means. Extended K-Means clustering acts as a pattern matching template that identifies different steganographic software tools uniquely. Although this initial clustering identifies most of the steganographic software tools, it cannot identify tools that take care not to disturb the metadata while processing. The stego images from these tools resemble cover images and are placed in the same cluster as the cover. The second phase of steganalysis starts from these clusters. As a prerequisite for this phase, a signature is generated for each tool from the artefacts in the image data. This signature is compared against the bits found in the stego images of the cluster. If a signature match is found, the tool is identified. The block diagram of the proposed steganalyser is given in Figure 1.
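The first step of phase one, separating images by format, can be sketched with the well-known leading byte signatures of the five formats. This is a minimal illustration, not the procedure used in the experiments; the name `detect_format` is hypothetical.

```python
from typing import Optional

# Leading byte signatures ("magic numbers") of the five formats in scope.
MAGIC = [
    (b"BM", "BMP"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"II*\x00", "TIFF"),       # little-endian TIFF
    (b"MM\x00*", "TIFF"),       # big-endian TIFF
    (b"\xff\xd8\xff", "JPEG"),  # SOI marker followed by another marker
]


def detect_format(data: bytes) -> Optional[str]:
    """Return the format name for a file's bytes, or None if unrecognised."""
    for magic, name in MAGIC:
        if data.startswith(magic):
            return name
    return None
```

Each image would then be routed to the header feature extractor for its format before clustering.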

STEGANALYSIS USING IMAGE HEADER ARTEFACTS
A large portion of the literature in digital image forensics exploits header data for various purposes [24]. Here it is used for steganalysis of steganographic software tools.

Fields considered in each image format
As mentioned earlier, the steganalyser exploits the vulnerable fields of the image header to identify the tool. The fields that may lead to identification of the tools are detailed for each format [25]-[29].

BMP image format
BMP images have a fixed byte format. The fields bits per pixel, image data padding (the last two bytes of the 4-byte SizeofBitmap field), horizontal resolution and vertical resolution are used, since most steganographic software tools modify them. In addition, the actual size of the BMP file as derived from the header fields is used. These five values form the features used for identifying the tool in BMP images.
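The five BMP features can be read directly from the fixed-offset header. The sketch below assumes the common 40-byte BITMAPINFOHEADER layout and uncompressed pixel data; the function name, the dictionary keys and the reading of "last two bytes" as the upper 16 bits of the little-endian bitmap size field are assumptions made for illustration.

```python
import struct


def bmp_features(data: bytes) -> dict:
    """Extract five header features from a BMP with a 40-byte info header."""
    if data[:2] != b"BM" or len(data) < 54:
        raise ValueError("not a BMP with a BITMAPINFOHEADER")
    # File header: declared file size, reserved, offset of the pixel data.
    _file_size, _reserved, data_offset = struct.unpack_from("<III", data, 2)
    # Info header (40 bytes) starts at offset 14.
    (_hdr_size, width, height, _planes, bpp, _compression, image_size,
     x_ppm, y_ppm, _used, _important) = struct.unpack_from("<IiiHHIIiiII", data, 14)
    # Rows are padded to a multiple of 4 bytes.
    stride = ((width * bpp + 31) // 32) * 4
    return {
        "bits_per_pixel": bpp,
        # Assumption: the "last two bytes" are the upper half of the size field.
        "size_field_high_bytes": image_size >> 16,
        "horizontal_resolution": x_ppm,
        "vertical_resolution": y_ppm,
        # File size implied by the header fields (uncompressed data assumed).
        "derived_file_size": data_offset + stride * abs(height),
    }
```

Comparing `derived_file_size` with the real file length is one simple way to notice bytes appended after the pixel data.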

GIF image format
GIF has two versions, 87a and 89a. The trailer field is used to find camouflage steganographic software tools that make no change to the image itself but insert the secret data after the image data. The version field in the file header, the packed field of the logical screen descriptor (which describes the global colour table) and the packed field of the local image descriptor (which carries the image and local colour table information) are also used. The presence of graphic control, comment and plain text extension blocks, and the sizes of the global and local colour tables, are further features that help identify a tool. Together these fields form the features used to identify tools in GIF format.
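A short block walker is enough to collect these GIF fields. The following is a sketch under the assumption of a well-formed file (a truncated file raises `IndexError`); the function and key names are hypothetical.

```python
def _skip_sub_blocks(data: bytes, pos: int) -> int:
    """Skip a chain of size-prefixed sub-blocks; return the index after the 0 terminator."""
    while True:
        size = data[pos]
        pos += 1
        if size == 0:
            return pos
        pos += size


def gif_features(data: bytes) -> dict:
    """Collect header and block-level features from a well-formed GIF."""
    if data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF file")
    feats = {"version": data[3:6].decode("ascii"), "screen_packed": data[10],
             "global_table_size": 0, "local_packed": [], "graphic_control": False,
             "comment": False, "plain_text": False, "bytes_after_trailer": None}
    pos = 13  # 6-byte header + 7-byte logical screen descriptor
    if data[10] & 0x80:  # global colour table flag
        feats["global_table_size"] = 2 << (data[10] & 7)  # 2^(n+1) entries
        pos += 3 * feats["global_table_size"]
    while pos < len(data):
        block = data[pos]
        if block == 0x3B:  # trailer: anything after it is appended data
            feats["bytes_after_trailer"] = len(data) - pos - 1
            break
        if block == 0x21:  # extension introducer, followed by a label byte
            label = data[pos + 1]
            feats["graphic_control"] |= label == 0xF9
            feats["comment"] |= label == 0xFE
            feats["plain_text"] |= label == 0x01
            pos = _skip_sub_blocks(data, pos + 2)
        elif block == 0x2C:  # image descriptor (10 bytes including introducer)
            packed = data[pos + 9]
            feats["local_packed"].append(packed)
            pos += 10
            if packed & 0x80:  # local colour table present
                pos += 3 * (2 << (packed & 7))
            pos = _skip_sub_blocks(data, pos + 1)  # +1 skips the LZW code size byte
        else:
            raise ValueError("unexpected block type")
    return feats
```

A non-zero `bytes_after_trailer` corresponds to the camouflage case described above, where data is appended after the image.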

JPEG image format
The JPEG format depends on the quality factor of the JPEG compression, and its header can therefore be used to distinguish not only tools but also algorithms. The fields exploited are: the JFIF version, the density unit field in the JFIF header, the presence of data after the last end of image (EOI) marker, the presence of a comment marker (COM), the quantisation table length and the location of the Huffman table. These six pieces of header information form the JPEG feature vector.
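These six JPEG features can be gathered by walking the marker segments that precede the scan data. The sketch below is simplified: it assumes every segment before start of scan carries a length field, ignores fill bytes, and locates the last EOI with a plain byte search, which appended data containing the bytes FF D9 would mislead. Function and key names are hypothetical.

```python
import struct


def jpeg_features(data: bytes) -> dict:
    """Walk JPEG marker segments up to start of scan and collect header features."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG file")
    feats = {"jfif_version": None, "density_unit": None, "has_comment": False,
             "dqt_lengths": [], "dht_offsets": [], "bytes_after_eoi": None}
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("marker expected")
        marker = data[pos + 1]
        (length,) = struct.unpack_from(">H", data, pos + 2)  # includes its own 2 bytes
        payload = data[pos + 4:pos + 2 + length]
        if marker == 0xE0 and payload[:5] == b"JFIF\x00":  # APP0 / JFIF header
            feats["jfif_version"] = (payload[5], payload[6])
            feats["density_unit"] = payload[7]
        elif marker == 0xFE:  # COM
            feats["has_comment"] = True
        elif marker == 0xDB:  # DQT
            feats["dqt_lengths"].append(length)
        elif marker == 0xC4:  # DHT
            feats["dht_offsets"].append(pos)
        elif marker == 0xDA:  # SOS: entropy-coded data follows, stop walking
            break
        pos += 2 + length
    eoi = data.rfind(b"\xff\xd9")  # simplification: last FF D9 in the file
    if eoi != -1:
        feats["bytes_after_eoi"] = len(data) - (eoi + 2)
    return feats
```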

PNG image format
The PNG file format supports a number of chunks which help in tool identification. The presence of auxiliary chunk types such as the time chunk (time of last modification) and the text extension chunks, together with their cyclic redundancy check (CRC) values, is used to cluster tools. The end of file is checked against the IEND chunk. The feature vector for the PNG format is taken from these fields.
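Because every PNG chunk has the same layout (length, type, data, CRC), the features above can be collected in one pass. The sketch below is illustrative only; it treats tIME as the time chunk and tEXt, zTXt and iTXt as the text chunks, and the function and key names are hypothetical.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_features(data: bytes) -> dict:
    """List chunk types, verify chunk CRCs and measure data after IEND."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    types, crc_ok, after_iend = [], True, None
    pos = len(PNG_SIGNATURE)
    while pos + 12 <= len(data):
        length, ctype = struct.unpack_from(">I4s", data, pos)
        end = pos + 12 + length  # 4 length + 4 type + data + 4 CRC
        if end > len(data):
            break  # truncated chunk: stop walking
        body = data[pos + 8:pos + 8 + length]
        (stored_crc,) = struct.unpack_from(">I", data, pos + 8 + length)
        # The PNG CRC covers the chunk type and the chunk data.
        if zlib.crc32(ctype + body) & 0xFFFFFFFF != stored_crc:
            crc_ok = False
        types.append(ctype.decode("latin-1"))
        pos = end
        if ctype == b"IEND":
            after_iend = len(data) - pos
            break
    return {
        "chunk_types": types,
        "has_time": "tIME" in types,
        "has_text": any(t in types for t in ("tEXt", "zTXt", "iTXt")),
        "crc_ok": crc_ok,
        "bytes_after_iend": after_iend,
    }
```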

TIFF image format
TIFF supports two byte orderings, little endian and big endian; the ordering forms the first feature. Here again, the presence of additional tags such as Artist, Copyright, HostComputer, Make, Model, Software or DateTime indicates a tool. The presence of a NewSubFileType or SubFileType tag can contribute significantly to tool identification. In addition, the number of tags in the image file directory, the number of StripOffsets, and information derived from the RowsPerStrip, StripOffsets, StripByteCounts and DataType fields, which indicates data embedded at the end of the image, are exploited as features for TIFF images.
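The byte order, tag count and tag presence features can be read from the first image file directory (IFD). The sketch below uses the standard TIFF tag numbers but covers only the first IFD and does not compute the end-of-image check derived from the strip fields; function and key names are hypothetical.

```python
import struct

# Standard TIFF tag numbers for the optional tags named in the text.
HINT_TAGS = {254: "NewSubfileType", 255: "SubfileType", 271: "Make",
             272: "Model", 305: "Software", 306: "DateTime", 315: "Artist",
             316: "HostComputer", 33432: "Copyright"}
STRIP_OFFSETS = 273


def tiff_features(data: bytes) -> dict:
    """Read byte order and first-IFD tag information from a TIFF file."""
    if data[:2] == b"II":
        endian, order = "<", "little"
    elif data[:2] == b"MM":
        endian, order = ">", "big"
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack_from(endian + "HI", data, 2)
    if magic != 42:
        raise ValueError("bad TIFF magic number")
    (n_tags,) = struct.unpack_from(endian + "H", data, ifd_offset)
    tags = {}
    for i in range(n_tags):
        # Each IFD entry is 12 bytes: tag, data type, value count, value/offset.
        tag, dtype, count, _value = struct.unpack_from(
            endian + "HHI4s", data, ifd_offset + 2 + 12 * i)
        tags[tag] = (dtype, count)
    return {
        "byte_order": order,
        "n_tags": n_tags,
        "hint_tags": sorted(HINT_TAGS[t] for t in tags if t in HINT_TAGS),
        # Number of strips, taken from the value count of StripOffsets.
        "strip_offset_count": tags.get(STRIP_OFFSETS, (0, 0))[1],
    }
```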

Clustering algorithm
When the labels of the given data are unknown, unsupervised learning takes place through clustering. One simple form of clustering is K-Means, which requires the number of clusters (K) as input. Providing the number of clusters in advance is not possible in a practical scenario. So K-Means is extended by repeating the algorithm with an increasing number of clusters until the distance of every sample to its centroid is zero. Thus, the optimal number of clusters is determined. The pseudo code for the algorithm is given below. A within-cluster distance of zero means an exact match. Thus, the algorithm helps in identifying each tool's exact header signature, which is later correlated with the tool.
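The stopping rule described above can be illustrated with a small pure-Python sketch. It is not the authors' pseudo code: the deterministic initialisation from the first K distinct points, the tolerance `tol` and all function names are assumptions made so that the example is self-contained and repeatable.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def _sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Point], k: int, iters: int = 100) -> Tuple[List[int], float]:
    """Plain Lloyd's K-Means; returns labels and total within-cluster distance."""
    # Assumption: seed centroids with the first k distinct points (deterministic).
    centroids = list(dict.fromkeys(points))[:k]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(len(centroids)), key=lambda c: _sq_dist(p, centroids[c]))
                  for p in points]
        updated = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(tuple(sum(col) / len(members) for col in zip(*members))
                           if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    total = sum(_sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, total


def extended_kmeans(features: Sequence[Sequence[float]],
                    tol: float = 1e-9) -> Tuple[int, List[int]]:
    """Increase K until every sample coincides with its centroid."""
    points = [tuple(f) for f in features]
    n_distinct = len(set(points))
    if n_distinct == 0:
        raise ValueError("no feature vectors given")
    for k in range(1, n_distinct + 1):
        labels, total = kmeans(points, k)
        if total <= tol:
            return k, labels
    return n_distinct, labels
```

With this stopping rule, K ends up equal to the number of distinct header feature vectors, so each cluster corresponds to one exact header pattern, which matches the "exact match" reading in the text.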

STEGANALYSIS BY ARTEFACTS IN IMAGE DATA
The stego images of steganographic software tools that do not modify the header or the metadata of the cover image cannot be detected by the above process. In order to identify those stego images and ultimately reveal the tool, the artefacts left by the steganographic software tool in the image data are considered. The metadata of the steganographic process (such as the stego key) is in some way hidden inside the image data [16].
This metadata is hidden either sequentially at the start or at the end of the image file, or at random positions. Even though the metadata may vary at byte level, it is found that at bit level some things do not change [22]. What stays constant may be either a bit value or its position. These constant bits are captured as the signature of the tool by examining its stego image data at bit level. The signature is generated from either the first 100 pixels or the last 100 pixels, depending on the tool, and saved in a signature library. This signature is compared with the bits of the stego image to be tested. A match implies that the tool was used. The characteristic signature of the WB Stego tool, taken as an average of the last 30 pixels over 50 images, is shown in Figure 2. Thus, an automated approach for universal steganalysis of software tools is achieved by tracing the artefacts left by the tool in both the image header and the image data.
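One way to picture a bit-level signature is as the set of bit positions whose value is identical across all training stego images of a tool. The sketch below follows that reading, which is an assumption and not necessarily the averaging used for Figure 2. It expects equal-length byte strings (for example, the raw bytes of the first or last 100 pixels), and the function names are hypothetical.

```python
from typing import Dict, List


def bits_of(chunk: bytes) -> List[int]:
    """Expand bytes into a list of bits, most significant bit first."""
    return [(byte >> (7 - i)) & 1 for byte in chunk for i in range(8)]


def build_signature(samples: List[bytes]) -> Dict[int, int]:
    """Map each bit position that is constant across all samples to its value."""
    columns = zip(*(bits_of(s) for s in samples))
    return {pos: col[0] for pos, col in enumerate(columns) if len(set(col)) == 1}


def matches(signature: Dict[int, int], chunk: bytes) -> bool:
    """True if the chunk agrees with every fixed bit of a non-empty signature."""
    bits = bits_of(chunk)
    return bool(signature) and all(
        pos < len(bits) and bits[pos] == bit for pos, bit in signature.items())
```

In practice, bit positions that are also constant in cover images would have to be excluded before matching, otherwise ordinary images could match the signature by coincidence.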

EXPERIMENTAL RESULTS AND DISCUSSION
No benchmark set of steganographic tools exists for steganalysis. So, to create a repository of stego images, steganographic software tools are downloaded from the sites referred to in [23] and applied to images from the BOSSbase and McGill databases. Table 1 lists the different steganographic software tools used. For cover images, images from both a clean source and the internet are used. The McGill Image database [30], which proves to be a challenging cover source for steganalysis, is taken for the clean images. Thus, the cover image database consists of 1000 images, with a random 500 from each source. They are basically either TIFF or BMP format images. They are resized to 512 × 512 for simplicity. The cover images are then converted to the five image formats, namely BMP, TIFF, PNG, GIF and JPEG. For JPEG images, a 100% compression ratio is used. A random 100 images from the cover source are chosen for each steganographic software tool to make the stego images for each format. Thus, a total of 6,500 (25×100 BMP, 7×100 GIF, 12×100 JPEG, 13×100 PNG, 6×100 TIFF) stego images are created. The secret data is random data ranging from 1 byte to the maximum payload the tool allows. 100 random cover images (CO) for each format are also taken (though a single cover image is enough).
The experiment is carried out on this database. The results of the first phase of steganalysis are shown in Figure 3. In this phase, it is noted that the steganographic software tools that hide data at the end of the image file (DE, V, OS) are all clustered in separate groups and are identified regardless of format. Tools specific to one format (GS, IP, WB, BS) are largely difficult to identify, since the tool takes care to leave no trace in the header data. Among the steganographic software tools that support more than one format, at least one format is insecure (HR, HI, IS, ST, JH, OP, SL). Only one system (hs) that supports multiple formats is not detectable in any of the formats. It is also verified that, of all formats, identification of the tool is most difficult in BMP because of its simple and short header; the other formats carry more information in the header, which in turn leads to loopholes or vulnerabilities. For the second phase, the images that resemble cover (stego images clustered along with the cover) are fed as input to the steganalyser, and the results are tabulated in Table 2.

CONCLUSION
In almost all the formats, identifying the tool from its stego image independently of the payload is a major contribution of the proposed steganalyser over statistical steganalysers, for which detecting stego images with a payload below 5% of the maximum capacity is an arduous task. The other conclusions that can be drawn from the experimentation are as follows. Almost all tools leave a trace in either the header or the data of the stego image. The BMP format has the least vulnerable header of all the image formats. The size of the secret payload is irrelevant, because detection is related not to the image statistics but to the tool signature. Thus, this universal blind structural steganalyser is capable of identifying tools which leave their trace in the stego images irrespective of the size of the secret payload. Conversely, this means that the method will not operate at all against implementations of algorithms that do not produce characteristic irregularities in the header (simple LSB batchwise processing) or that store their metadata in the image BS. However, at large, a match against a stego signature can provide a useful indication that a particular tool may have been used and, consequently, that the file may contain steganography. Also, this steganalyser produces a 100% match for a tool irrespective of payload, which cannot be the case for statistical steganalysers. So such universal structural steganalysers can effectively be deployed as a pre-processing stage for existing steganalysis techniques and help to improve overall accuracy.