Single-page PDF containing one table
com.aistrong.analysis.pdf.service
Class ReaderTextService
Method Detail:
public ArrayList<List<WordWithTextPositions>> readWordWithTextPositions(String path)
Arguments:
path - storage path of the PDF file
Returns:
ArrayList<List<WordWithTextPositions>>
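Each WordWithTextPositions object stores all the characters of one line (see Note). Each character corresponds to one TextPosition object, which holds all information about that character, including the character itself and its coordinates. For details, see Class TextPosition in the PDFBox API documentation.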
Example:
package com;

import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.text.PDFLocalStripper.WordWithTextPositions;
import org.apache.pdfbox.text.TextPosition;

import com.aistrong.analysis.pdf.service.ReaderTextService;

public class TestReadWordWithTextPositions {
    public static void main(String[] args) throws IOException {
        ReaderTextService rts = new ReaderTextService();
        // Outer list: one List<WordWithTextPositions> per entry returned by the service.
        for (List<WordWithTextPositions> l : rts.readWordWithTextPositions("/Users/hhhtide/Desktop/PDF_Extract/Data/table1.pdf")) {
            // Each WordWithTextPositions holds the characters of one line.
            for (WordWithTextPositions wwtp : l) {
                // Each TextPosition is a single character; print it with its X coordinate.
                for (TextPosition tp : wwtp.getTextPositions()) {
                    System.out.println("word:" + tp.getUnicode() + " X:" + tp.getX());
                }
            }
        }
    }
}
Output:
word:表 X:48.0
word:1 X:64.5
word:2 X:75.5
word:0 X:81.0
word:0 X:86.5
word:9 X:92.0
word:~ X:97.5
word:2 X:108.5
word:0 X:114.0
word:1 X:119.5
word:1 X:125.0
word:年 X:136.0
word:安 X:147.0
word:慶 X:158.0
word:市 X:169.0
.
.
.
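The per-character output above can be regrouped into the text of a line. The following is a minimal sketch, not part of ReaderTextService: Glyph is a hypothetical stand-in for TextPosition (only the character and its X coordinate) so that the example runs without PDFBox, and sampleLine() reuses the nine characters "2009~2011" and their X values from the output above.

```java
import java.util.ArrayList;
import java.util.List;

public class LineTextSketch {

    // Hypothetical stand-in for TextPosition: one character plus its X coordinate.
    static final class Glyph {
        final String unicode;
        final float x;

        Glyph(String unicode, float x) {
            this.unicode = unicode;
            this.x = x;
        }
    }

    // Concatenate the characters of one line in their stored order.
    static String joinLine(List<Glyph> line) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : line) {
            sb.append(g.unicode);
        }
        return sb.toString();
    }

    // Sample data copied from the printed output (characters "2009~2011").
    static List<Glyph> sampleLine() {
        String[] chars = {"2", "0", "0", "9", "~", "2", "0", "1", "1"};
        float[] xs = {75.5f, 81.0f, 86.5f, 92.0f, 97.5f, 108.5f, 114.0f, 119.5f, 125.0f};
        List<Glyph> line = new ArrayList<>();
        for (int i = 0; i < chars.length; i++) {
            line.add(new Glyph(chars[i], xs[i]));
        }
        return line;
    }

    public static void main(String[] args) {
        List<Glyph> line = sampleLine();
        // Print the joined text and the X coordinate of the first character.
        System.out.println(joinLine(line) + " startX=" + line.get(0).x);
    }
}
```

With real data, the same loop would presumably call tp.getUnicode() on each element of wwtp.getTextPositions() instead of reading from Glyph.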
Note:
One line