Dejan Lukan is a security researcher for InfoSec Institute and a penetration tester from Slovenia. He is very interested in finding new bugs in real-world software products through source code analysis, fuzzing and reverse engineering. He also has a great passion for developing his own simple scripts for security-related problems and for learning about new hacking techniques. He knows a great deal about programming languages and can write in a couple of dozen of them. He is also passionate about antivirus bypass techniques, malware research and operating systems, mainly Linux, Windows and BSD. He also has his own blog.
Architecture

In order to get the most out of PDFBox it is necessary to understand how a PDF document is organized, as PDFBox was architected around the concepts laid out in the ISO-32000 (PDF) Specification.

Quick Introduction to the PDF format

A PDF file is made up of a sequence of bytes. These bytes, grouped into tokens, make up the basic objects upon which higher level objects and structures are built (see ISO-32000 7.3). PDFBox makes these basic objects available in the org.apache.pdfbox.cos
package (The COS Model). The organization of these objects, how they are read and how they are written, is defined in the file structure of the PDF (see ISO-32000 7.5). In addition, a file can be encrypted to protect the document's content (see ISO-32000 7.6). PDFBox handles the reading in the org.apache.pdfbox.pdfparser package. Writing of PDF files is handled in the org.apache.pdfbox.pdfwriter package.

Within the file structure, basic objects are used to create a document structure, building higher level objects such as pages, bookmarks and annotations (see ISO-32000 7.7). PDFBox makes these higher level objects available through the org.apache.pdfbox.pdmodel package (The PD Model). In addition, there is a COS representation available for the PD model if there is a need to inspect the underlying structure or to handle special cases where the higher level PD model doesn't provide the functionality needed.
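To make the idea of bytes grouped into tokens more concrete, here is a minimal sketch of a tokenizer for a small fragment of PDF syntax. It uses only the Java standard library and is not how PDFBox's parser works; the class name `PdfTokenSketch` and its `tokenize` method are hypothetical names chosen for illustration. It assumes plain text input and deliberately ignores comments, hex strings, string escapes and stream data.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not part of PDFBox.
public class PdfTokenSketch {

    // White-space characters of PDF syntax: NUL, TAB, LF, FF, CR, SPACE.
    private static boolean isWhitespace(char c) {
        return c == 0 || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
    }

    // Delimiter characters of PDF syntax.
    private static boolean isDelimiter(char c) {
        return "()<>[]{}/%".indexOf(c) >= 0;
    }

    // Splits a fragment into tokens: "<<", ">>", "[", "]", literal strings,
    // names (starting with '/') and regular tokens such as numbers or keywords.
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        int n = input.length();
        while (i < n) {
            char c = input.charAt(i);
            if (isWhitespace(c)) {
                i++;
            } else if (c == '<' && i + 1 < n && input.charAt(i + 1) == '<') {
                tokens.add("<<");
                i += 2;
            } else if (c == '>' && i + 1 < n && input.charAt(i + 1) == '>') {
                tokens.add(">>");
                i += 2;
            } else if (c == '[' || c == ']') {
                tokens.add(String.valueOf(c));
                i++;
            } else if (c == '(') {
                // Literal string: read up to the balancing ')' (escapes are not handled).
                int start = i;
                int depth = 0;
                do {
                    char d = input.charAt(i);
                    if (d == '(') {
                        depth++;
                    } else if (d == ')') {
                        depth--;
                    }
                    i++;
                } while (i < n && depth > 0);
                tokens.add(input.substring(start, i));
            } else {
                // Name or regular token: consume the first character, then read
                // until the next white-space or delimiter character.
                int start = i;
                i++;
                while (i < n && !isWhitespace(input.charAt(i)) && !isDelimiter(input.charAt(i))) {
                    i++;
                }
                tokens.add(input.substring(start, i));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // A dictionary written in PDF syntax, split into its tokens.
        System.out.println(tokenize("<< /Type /XObject /Name (Name)/Size 1 >>"));
    }
}
```

Each token produced this way corresponds to the start, end or content of one of the basic object types; a real parser would go on to assemble them into objects.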
It's always the COS model which is represented in the PDF file.

The COS Model

As outlined above, the basic PDF objects are represented in PDFBox in the org.apache.pdfbox.cos package.

| PDF Type | Description | Example | PDFBox class | ISO 32000 |
|---|---|---|---|---|
| Boolean | Standard True/False values | true | org.apache.pdfbox.cos.COSBoolean | 7.3.2 |
| Number | Integer and floating point numbers | 1 2.3 | org.apache.pdfbox.cos.COSInteger, org.apache.pdfbox.cos.COSFloat | 7.3.3 |
| String | A sequence of characters | (This is a string) | org.apache.pdfbox.cos.COSString | 7.3.4 |
| Name | A predefined value in a PDF document, typically used as a key in a dictionary | /Type | org.apache.pdfbox.cos.COSName | 7.3.5 |
| Array | Arrays are one-dimensional lists of objects accessed by a numeric index. Within an array each basic object is permitted as an entry. | [549 3.14 false (Ralph) /SomeName] | org.apache.pdfbox.cos.COSArray | 7.3.6 |
| Dictionary | A map of name/value pairs | << /Type /XObject /Name (Name) /Size 1 >> | org.apache.pdfbox.cos.COSDictionary | 7.3.7 |
| Stream | A stream of data, typically compressed. This is used for page contents, images and embedded font streams. | 12 0 obj << /Type /XObject >> stream 04040404 endstream | org.apache.pdfbox.cos.COSStream | 7.3.8 |
| Object | A wrapper to any of the other objects; this can be used to reference an object multiple times. An object is referenced by using two numbers, an object number and a generation number. Initially the generation number will be zero unless the object got replaced later in the stream. | 12 0 obj << /Type /XObject >> endobj | org.apache.pdfbox.cos.COSObject | |

A page in a PDF document is represented with a COSDictionary. The entries that are available for a page can be seen in the PDF Reference, and an example of reading one of them looks like this.

```java
COSDictionary page = ...;
// MediaBox holds four numbers: lower-left x, lower-left y, upper-right x, upper-right y
COSArray mediaBox = (COSArray) page.getDictionaryObject("MediaBox");
// Index 2 is the upper-right x, which equals the width when the lower-left x is 0
System.out.println("Width:" + mediaBox.get(2));
```

As can be seen from that little example, the COS model provides a low level API to access information within the PDF. In order to use the COS model successfully, a good knowledge of the PDF specification is needed.

The PD Model

The COS Model allows access to all aspects of a PDF document. This type of programming is tedious and error prone, though, because the user must know all of the names of the parameters and no helper methods are available. The PD Model was created to help alleviate this problem. Each type of object (page, font, image) has a set of defined attributes that can be available in the dictionary. A PD Model class is available for each of these so that strongly typed methods are available to access the attributes.
The same code from above to get the page width can be rewritten to use PD Model classes.

```java
PDPage page = ...;
PDRectangle mediaBox = page.getMediaBox();
System.out.println("Width:" + mediaBox.getWidth());
```

PD Model objects sit on top of the COS model. Typically, the classes in the PD Model will only store a COS object, and all setter/getter methods will read or modify data that is stored in the COS object. For example, when you call PDPage.getLastModified() the method will do a lookup in the COSDictionary with the key "LastModified"; if it is found, the value is then converted to a java.util.Calendar.
When PDPage.setLastModified( Calendar ) is called, the Calendar is converted to a string and stored in the COSDictionary.
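The layering described above can be sketched without PDFBox itself. The example below is only an illustration of the pattern, assuming a plain `Map<String, Object>` as a stand-in for a COSDictionary; the names `PdLayerSketch` and `PageView` are hypothetical, and the date layout is a simplified, UTC-only form of the PDF date string rather than whatever conversion PDFBox performs internally.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TimeZone;

// Hypothetical illustration only; this is not part of PDFBox.
public class PdLayerSketch {

    /** Typed wrapper that stores nothing itself except the underlying dictionary. */
    public static final class PageView {
        private final Map<String, Object> dict; // plays the role of a COSDictionary

        public PageView(Map<String, Object> dict) {
            this.dict = dict;
        }

        // Simplified PDF date layout (UTC only), for example D:20240131120000Z.
        private static SimpleDateFormat pdfDateFormat() {
            SimpleDateFormat format = new SimpleDateFormat("'D:'yyyyMMddHHmmss'Z'", Locale.ROOT);
            format.setTimeZone(TimeZone.getTimeZone("UTC"));
            return format;
        }

        /** Setter: converts the Calendar to a string and stores it in the dictionary. */
        public void setLastModified(Calendar when) {
            dict.put("LastModified", pdfDateFormat().format(when.getTime()));
        }

        /** Getter: looks up the key and converts the string back; null if absent. */
        public Calendar getLastModified() {
            Object raw = dict.get("LastModified");
            if (raw == null) {
                return null;
            }
            try {
                Calendar result = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
                result.setTime(pdfDateFormat().parse((String) raw));
                return result;
            } catch (ParseException e) {
                throw new IllegalStateException("Malformed LastModified entry: " + raw, e);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Object> dict = new HashMap<>();
        PageView page = new PageView(dict);

        Calendar when = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
        when.clear();
        when.set(2024, Calendar.JANUARY, 31, 12, 0, 0);
        page.setLastModified(when);

        // The low-level view holds only a string; the typed view hands back a Calendar.
        System.out.println(dict.get("LastModified"));
        System.out.println(page.getLastModified().getTime());
    }
}
```

Because the wrapper keeps no state of its own, any change made through the typed methods is immediately visible to code that inspects the dictionary directly, which is the property the PD Model relies on.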