Some questions about converting HTML to PDF using HTMLWorker



I am developing a program to convert HTML source to PDF.

I searched the mailing list and found the HTMLWorker and HTMLParser classes.

HTMLParser does not seem to support CJK strings (when I tested it, all CJK strings came out blank), so I decided to use HTMLWorker.

My code is as follows (I am using iTextSharp 3.1.5):

===============================================================================
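    ' Needed at the top of the file:
    Imports System.IO
    Imports System.Collections
    Imports iTextSharp.text
    Imports iTextSharp.text.pdf

    Private Sub Test_HTMLWorker()
        ' Read the whole HTML file into memory using the system default encoding.
        Dim fs As New FileStream("test.html", FileMode.Open, FileAccess.Read, FileShare.ReadWrite)
        Dim sr As New StreamReader(fs, System.Text.Encoding.Default)
        Dim sReader As New StringReader(sr.ReadToEnd)
        sr.Close()
        fs.Close()

        ' A4 page with a margin of 20 on every side.
        Dim document As Document = New Document(PageSize.A4, 20, 20, 20, 20)

        PdfWriter.GetInstance(document, New FileStream("test_output.pdf", FileMode.Create))

        ' Register the Gulim font for the CJK text.
        ' (VB string literals do not need escaped backslashes.)
        FontFactory.Register("c:\windows\fonts\gulim.ttc")

        ' Apply the font, encoding and leading to everything inside <body>.
        Dim st As StyleSheet = New StyleSheet
        st.LoadTagStyle("body", "face", "Gulim")
        st.LoadTagStyle("body", "encoding", "Identity-H")
        st.LoadTagStyle("body", "leading", "12,0")

        document.Open()

        Dim worker As html.simpleparser.HTMLWorker = New html.simpleparser.HTMLWorker(document)

        ' Parse the HTML into a list of elements.
        Dim p As ArrayList = worker.ParseToList(sReader, st)

        ' Add each element, followed by an empty paragraph as a separator.
        For k As Integer = 0 To p.Count - 1
            document.Add(p.Item(k))
            document.Add(New Paragraph(vbCrLf))
        Next

        document.Close()
        sReader.Close()
    End Sub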
=================================================================================

This code works fine for HTML sources that contain only text.

However, it does not work properly for HTML sources with img tags: the layout of the generated PDF file differs from that of the original HTML.

Also, if I do not specify the width and height attributes on an img tag, the image is not inserted into the generated PDF file at all.

I suspect the cause is that HTMLWorker does not take the space occupied by an image into account, especially when the img tag is inside a <p> tag.
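
To see which img tags are affected by the missing width/height behaviour, a check along the following lines could be run over the HTML before conversion. This is only a minimal sketch: FindImgTagsWithoutSize is a hypothetical helper name, it uses only System.Text.RegularExpressions (no iTextSharp calls), and the regular expressions assume simple markup in which attribute values do not contain ">".

```vbnet
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module ImgTagCheck
    ' Hypothetical helper (not part of iTextSharp): returns every <img ...> tag
    ' in the HTML that lacks a width attribute, a height attribute, or both.
    Public Function FindImgTagsWithoutSize(ByVal html As String) As List(Of String)
        Dim result As New List(Of String)
        For Each m As Match In Regex.Matches(html, "<img\b[^>]*>", RegexOptions.IgnoreCase)
            Dim tag As String = m.Value
            ' Only real attributes are counted; "width:" inside a style value is ignored.
            Dim hasWidth As Boolean = Regex.IsMatch(tag, "\swidth\s*=", RegexOptions.IgnoreCase)
            Dim hasHeight As Boolean = Regex.IsMatch(tag, "\sheight\s*=", RegexOptions.IgnoreCase)
            If Not (hasWidth AndAlso hasHeight) Then
                result.Add(tag)
            End If
        Next
        Return result
    End Function
End Module
```

Calling FindImgTagsWithoutSize on the text read from test.html lists the tags that would need explicit width and height attributes for the images to appear.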


I then tried inserting blank space equal to the height of the image, but the position of the image did not change (I did manage to locate the Chunk objects that contain the images).


I have attached a sample HTML file and the generated PDF file for your testing.

If you could take a few minutes to answer my questions, I would really appreciate it.

Best regards,

S. H. Park

Test HTML sample


1

2

3

4

asdf

sdf

sdf

sdf

dfdf

dfdf

dfdf

Google

Attachment: test_output.pdf
Description: Adobe PDF document
