Tuesday, August 4, 2015

Parse line of text after multiline regex pattern

I am attempting to parse fields from a PDF file converted to text with PDFBox. One field I need to extract is labelled "BUYER NAME AND ADDRESS:". These documents often contain translations, so the colon can appear a variable number of characters after "BUYER NAME AND ADDRESS", possibly on a later line. Example below.

Txt file..
BUYER NAME AND ADDRESS / NOMBRE Y
DIRECCIÓN DEL COMPRADOR:
Name of buyer here
Txt continues..

Here is my attempted pattern / scanning code.

Scanner sc = new Scanner(txtFile);
Pattern p = Pattern.compile("BUYER NAME AND ADDRESS:.*", Pattern.MULTILINE);
sc.findWithinHorizon(p, 0);
String buyer = sc.nextLine();
buyer = sc.nextLine();
System.out.println("Buyer Name: "+buyer);

This works when the text file is English-only, e.g. "BUYER NAME AND ADDRESS:", but it fails when additional characters or line breaks appear between the label and the colon. How can I fix the pattern?
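One possible direction, offered as a sketch rather than a definitive fix: in java.util.regex, Pattern.MULTILINE only changes how ^ and $ behave, and . does not match line terminators unless Pattern.DOTALL is set, so the original pattern cannot bridge the line break before the colon. A negated character class such as [^:]* does match line breaks, so it can absorb the translated text up to the first colon. The class name BuyerFieldParser and the method extractBuyer are hypothetical names used only for illustration, and the sketch assumes the buyer name always sits on the line immediately after the colon.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

class BuyerFieldParser {
    // [^:]* matches any run of non-colon characters, line breaks included,
    // so the label may be followed by a translation spread over several lines.
    private static final Pattern LABEL =
            Pattern.compile("BUYER NAME AND ADDRESS[^:]*:");

    // Returns the line after the label's colon, or null if it cannot be found.
    // Assumption: the buyer name is on the line directly below the colon.
    static String extractBuyer(String text) {
        try (Scanner sc = new Scanner(text)) {
            if (sc.findWithinHorizon(LABEL, 0) == null) {
                return null; // label not present
            }
            if (!sc.hasNextLine()) {
                return null; // input ends right after the colon
            }
            sc.nextLine(); // discard whatever remains on the colon's line
            return sc.hasNextLine() ? sc.nextLine().trim() : null;
        }
    }

    public static void main(String[] args) {
        // \u00D3 is the accented O in DIRECCION, escaped to avoid source-encoding issues.
        String sample = "Txt file..\n"
                + "BUYER NAME AND ADDRESS / NOMBRE Y\n"
                + "DIRECCI\u00D3N DEL COMPRADOR:\n"
                + "Name of buyer here\n"
                + "Txt continues..\n";
        System.out.println("Buyer Name: " + extractBuyer(sample));
    }
}
```

Because [^:]* is unbounded, a label that is missing its colon would let the match run on to the next colon anywhere in the file. Bounding it, for example with [^:]{0,200}, may be safer; the limit of 200 is an arbitrary illustrative value. The same pattern should also work with new Scanner(txtFile) as in the original code.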
