Utility to extract text from .rtf files?

Started by Ned L, July 24, 2018, 12:54:04 PM

Ned L

I have a client who receives files in .rtf format.  I need to extract the plain text from each file, without any formatting information.  I found a VERY old (20+ years?) specification of the .rtf syntax (Rev 1.5) and tried to use its basic formatting rules to filter out formatting and graphic information, but apparently newer features don't conform to the original syntax.

Does anyone have experience pulling text from .rtf files?  Just a reasonably recent specification of the .rtf format would probably be enough.  Or, if a utility already exists to do what I want, that would be great, too.  My client is running PxPlus under Linux.

- Ned Lee
QA Solutions Division
AMS Software

Devon Austen

Hi Ned,

Here is a link to the RTF specification (https://www.microsoft.com/en-ca/download/details.aspx?id=10725). It is the current version, at least according to Wikipedia (https://en.wikipedia.org/wiki/Rich_Text_Format).

Another approach is to use the COM interface to open the RTF file in MS Word and then read back the text. If you are using PxPlus 2017 or higher, this is even easier because you can use the new *obj/word object (https://manual.pvxplus.com/?PxPlus%20User%20Guide/External%20Components/PxPlus%20COM%20Support/Word%20Object.htm).
Principal Software Engineer for PVX Plus Technologies LTD.

Ken Sproul

The cvs() function has an option to strip the RTF formatting and return just the plain text: plain_text$=cvs(rtf_text$,"rtf:native")
Ken Sproul
DPI Information Service, Inc.
Pivotal Systems LLC

Devon Austen

The CVS RTF conversion was written to convert the simple RTF produced by PxPlus RTF Multilines into text. It is not guaranteed to work with RTF generated elsewhere.

In my testing, converting an RTF file created by MS WordPad returned the text with some extra newlines, but converting RTF generated by MS Word 2010 returned no text at all.

I think the easiest solution is to let Microsoft's own tools handle the conversion and to let MS Word do the work.
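
Where Word is not available, for example on the Linux host mentioned at the top of the thread, a small external filter is another possibility. The following is only a rough sketch in Python, not anything shipped with PxPlus. The function name rtf_to_text, the list of skipped destinations, and the cp1252 default code page are all assumptions made for illustration. It tracks group nesting, drops well-known non-text destinations (font table, pictures, and so on) along with any group marked with \*, and decodes \'hh and \uN escapes.

```python
import re

# Destinations whose contents are not body text (assumed, partial list).
SKIP_DESTINATIONS = {
    "fonttbl", "colortbl", "stylesheet", "info", "pict", "object",
    "header", "footer", "footnote", "themedata", "datastore",
    "latentstyles", "generator", "listtable", "listoverridetable", "rsidtbl",
}

# Control words that stand for a literal character.
SPECIAL = {
    "par": "\n", "line": "\n", "sect": "\n", "tab": "\t",
    "emdash": "\u2014", "endash": "\u2013", "bullet": "\u2022",
    "lquote": "\u2018", "rquote": "\u2019",
    "ldblquote": "\u201c", "rdblquote": "\u201d",
}

TOKEN = re.compile(
    r"\\([a-zA-Z]+)(-?\d+)? ?"    # control word, optional number, optional delimiter space
    r"|\\'([0-9a-fA-F]{2})"       # \'hh : one byte in the document code page
    r"|\\([^a-zA-Z'])"            # control symbol such as \\ \{ \} \~ \*
    r"|([{}])"                    # group open / close
    r"|[\r\n]+"                   # raw line breaks carry no meaning in RTF
    r"|([^\\{}\r\n]+)"            # plain text run
)


def rtf_to_text(rtf, codepage="cp1252"):
    """Return the visible text of an RTF string (hypothetical helper, a sketch only)."""
    out = []
    stack = []        # saved (skipping, uc) state for each open group
    skipping = False  # True while inside a destination we do not want
    uc = 1            # number of fallback characters that follow each \uN
    pending = 0       # fallback characters still to be dropped
    for m in TOKEN.finditer(rtf):
        word, param, hexbyte, symbol, brace, text = m.groups()
        if brace == "{":
            stack.append((skipping, uc))
        elif brace == "}":
            if stack:
                skipping, uc = stack.pop()
        elif word:
            if word in SKIP_DESTINATIONS:
                skipping = True
            elif skipping:
                continue
            elif word == "uc" and param is not None:
                uc = int(param)
            elif word == "u" and param is not None:
                code = int(param)
                if code < 0:          # \uN is written as a signed 16-bit value
                    code += 65536
                out.append(chr(code))
                pending = uc
            elif word in SPECIAL:
                out.append(SPECIAL[word])
        elif hexbyte:
            if pending:
                pending -= 1          # this byte is the fallback for a preceding \uN
            elif not skipping:
                out.append(bytes.fromhex(hexbyte).decode(codepage, errors="replace"))
        elif symbol:
            if symbol == "*":
                skipping = True       # {\*\dest ...} marks an ignorable destination
            elif skipping:
                continue
            elif symbol in "\\{}":
                out.append(symbol)
            elif symbol == "~":
                out.append("\u00a0")  # non-breaking space
            elif symbol in "\r\n":
                out.append("\n")      # backslash + newline acts like \par
        elif text:
            if pending:
                text, pending = text[pending:], 0
            if not skipping:
                out.append(text)
    return "".join(out)


if __name__ == "__main__":
    sample = r"{\rtf1\ansi\deff0 {\fonttbl{\f0 Arial;}}\f0 Hello, \b world\b0 !\par}"
    print(rtf_to_text(sample))
```

A filter like this does not handle everything Word can emit (embedded binary data via \bin, field instructions, and characters outside the Basic Multilingual Plane are not treated properly), so the output should be checked against real samples from the client before relying on it.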
Principal Software Engineer for PVX Plus Technologies LTD.