[C#] Reading text from MS Word files


Question

Hey guys. How can I read the text from an MS Word file, and possibly other MS Office files, while using the least possible resources? If someone could share some code they already have for this, it would be really sweet :p

Basically, what I'm trying to do is build a desktop search application in C#, like the ones from MSN, Google, and others, for a class project. Mine doesn't have to be as complex or as feature-rich, just a basic version of what they do.

Any other tips would also be appreciated.

Thanks

Danish

PS: I'm storing the data I'm indexing in an MS Access file, which seems inefficient to me. Is there a better way to do that?

https://www.neowin.net/forum/topic/316480-c-reading-text-from-ms-word-files/
  • 0

I have reproduced the application error with a minimal amount of code. I get the same error whether or not I release the COM object. I don't get the error if I use the IFilter for Office documents, only with the Adobe one:

using System;
using System.Text;
using System.Runtime.InteropServices;

namespace TestError
{
	/// <summary>
	/// Minimal repro: create the filter COM object, release it, and exit.
	/// </summary>
	class Class1
	{
		/// <summary>
		/// The main entry point for the application.
		/// </summary>
		[STAThread]
		static void Main(string[] args)
		{
			// Instantiate the COM class and query it for IFilter.
			IFilter f = (IFilter)new CFilter();
			// Explicitly drop the runtime callable wrapper's reference.
			Marshal.ReleaseComObject(f);
			f = null;
			Console.WriteLine("finished");

			Console.ReadLine();
		}
	}

	[ComImport]
	[Guid("89BCB740-6119-101A-BCB7-00DD010655AF")]
	[InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
	public interface IFilter
	{
		void Init([MarshalAs(UnmanagedType.U4)] IFILTER_INIT grfFlags,
			uint cAttributes,
			[MarshalAs(UnmanagedType.LPArray, SizeParamIndex=1)] FULLPROPSPEC[] aAttributes,
			ref uint pdwFlags);

		void GetChunk([MarshalAs(UnmanagedType.Struct)] out STAT_CHUNK pStat);

		[PreserveSig] int GetText(ref uint pcwcBuffer, [MarshalAs(UnmanagedType.LPWStr)] StringBuilder buffer);

		void GetValue(ref UIntPtr ppPropValue);

		void BindRegion([MarshalAs(UnmanagedType.Struct)] FILTERREGION origPos, ref Guid riid, ref UIntPtr ppunk);
	}

	[ComImport]
	[Guid("4C904448-74A9-11d0-AF6E-00C04FD8DC02")]
	public class CFilter
	{
	}

	[Flags]
	public enum IFILTER_INIT
	{
		NONE                   = 0,
		CANON_PARAGRAPHS       = 1,
		HARD_LINE_BREAKS       = 2,
		CANON_HYPHENS          = 4,
		CANON_SPACES           = 8,
		APPLY_INDEX_ATTRIBUTES = 16,
		APPLY_OTHER_ATTRIBUTES = 32,
		INDEXING_ONLY          = 64,
		SEARCH_LINKS           = 128,
		APPLY_CRAWL_ATTRIBUTES = 256,
		FILTER_OWNED_VALUE_OK  = 512
	}


	[StructLayout(LayoutKind.Sequential)]
	public struct STAT_CHUNK
	{
		public uint idChunk;
		[MarshalAs(UnmanagedType.U4)] public CHUNK_BREAKTYPE breakType;
		[MarshalAs(UnmanagedType.U4)] public CHUNKSTATE flags;
		public uint locale;
		[MarshalAs(UnmanagedType.Struct)] public FULLPROPSPEC attribute;
		public uint idChunkSource;
		public uint cwcStartSource;
		public uint cwcLenSource;
	}

	[StructLayout(LayoutKind.Sequential)]
	public struct FILTERREGION
	{
		public uint idChunk;
		public uint cwcStart;
		public uint cwcExtent;
	}

	public enum CHUNKSTATE
	{
		CHUNK_TEXT               = 0x1,
		CHUNK_VALUE              = 0x2,
		CHUNK_FILTER_OWNED_VALUE = 0x4
	}

	[StructLayout(LayoutKind.Sequential)]
	public struct FULLPROPSPEC
	{
		public Guid guidPropSet;
		public PROPSPEC psProperty;
	}

	public enum CHUNK_BREAKTYPE
	{
		CHUNK_NO_BREAK = 0,
		CHUNK_EOW      = 1,
		CHUNK_EOS      = 2,
		CHUNK_EOP      = 3,
		CHUNK_EOC      = 4
	}

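	// The native PROPSPEC is a ULONG followed by a union of PROPID and LPOLESTR,
	// so both alternatives share a single pointer-sized slot.
	[StructLayout(LayoutKind.Sequential)]
	public struct PROPSPEC
	{
		public uint ulKind;
		// Holds the property ID (low 32 bits) when ulKind is PRSPEC_PROPID (1),
		// or a pointer to the property name when ulKind is PRSPEC_LPWSTR (0).
		public IntPtr propidOrLpwstr;
	}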


}

  • 0

Have you resolved this issue yet? I get the same error.

This works too, but with every approach I have tried so far I get an application error, and only with PDF files:

(ReadFile.exe is the name of my assembly)

Font Capture: ReadFile.exe - Application Error

The instruction at "0x030a61b3" referenced memory at "0x03a823e8". The memory could not be "read"

This always happens when my program closes. It works perfectly fine until I exit Main()...

I wonder if this has something to do with the Adobe IFilter not being released properly?


  • 0
No, I haven't found a solution for the problem yet, and since I have no experience with COM programming I probably won't :)


I wasn't able to fix the error, but I stopped the error dialog from being displayed by calling:

SetErrorMode(SEM_NOGPFAULTERRORBOX);

Place the call in your main thread.
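
For anyone wanting to try this from C#, here is a minimal sketch of the P/Invoke declaration. `SetErrorMode` lives in kernel32.dll and `SEM_NOGPFAULTERRORBOX` is 0x0002 in the Windows headers; the class and method names (`ErrorModeHelper`, `SuppressCrashDialog`) are made up for illustration. Note that this only hides the dialog, as the post above says. It does not fix whatever is faulting.

```csharp
using System;
using System.Runtime.InteropServices;

static class ErrorModeHelper
{
    // Value from the Windows headers: do not show the
    // general-protection-fault error box for this process.
    public const uint SEM_NOGPFAULTERRORBOX = 0x0002;

    // UINT SetErrorMode(UINT uMode); returns the previous error mode.
    [DllImport("kernel32.dll")]
    static extern uint SetErrorMode(uint uMode);

    // Hypothetical helper: adds the flag while keeping any flags
    // that were already set for the process.
    public static void SuppressCrashDialog()
    {
        uint previous = SetErrorMode(SEM_NOGPFAULTERRORBOX);
        SetErrorMode(previous | SEM_NOGPFAULTERRORBOX);
    }
}
```

Calling `ErrorModeHelper.SuppressCrashDialog();` as the first statement of `Main()` should be enough, since the error mode applies to the whole process.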

  • 0
Though it works great for .doc files, it does not work for .docx files (the default MS Office Word 2007 format). Any suggestions?

.docx files are ZIP archives containing XML files. If you change the .docx extension to .zip, then WinRAR, WinZip, etc. can open the "document" and you can browse and extract the XML files. Indexing them is as simple as extracting them and then using XPath :)
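
A rough sketch of that idea in C#, in case it helps. It assumes .NET Framework 4.5 or later (for `System.IO.Compression.ZipArchive`) and uses LINQ to XML rather than XPath. The main body text of a .docx lives in the `word/document.xml` part, with the visible text in `w:t` elements inside `w:p` paragraphs. `DocxText` and `Extract` are illustrative names only.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;
using System.Xml.Linq;

static class DocxText
{
    // Main WordprocessingML namespace used by word/document.xml.
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the body text of a .docx stream, one line per paragraph.
    public static string Extract(Stream docx)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, true))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found");

            XDocument xml;
            using (Stream part = entry.Open())
                xml = XDocument.Load(part, LoadOptions.PreserveWhitespace);

            var sb = new StringBuilder();
            foreach (XElement p in xml.Descendants(W + "p"))
            {
                // Each w:t element holds one run of text in the paragraph.
                sb.AppendLine(string.Concat(
                    p.Descendants(W + "t").Select(t => t.Value)));
            }
            return sb.ToString();
        }
    }
}
```

Usage would be something like `DocxText.Extract(File.OpenRead(path))`. This ignores headers, footers, footnotes and comments (they are stored in separate parts), and paragraphs nested inside text boxes may show up twice, so treat it as a starting point for indexing rather than a complete extractor.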

  • 0

Microsoft recently released the file specifications for all the Microsoft Office file formats (.doc, .xls, etc). If you want to use minimal resources, your best bet is to study the .doc file format and write code to parse it yourself. Not fun at all, but it would be the only way to do this without using a library or the Word Object Model.

Check out the fun bedtime reading.
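
If you do go down the parse-it-yourself road, a first step is working out which container you are looking at. The sketch below only checks the leading bytes: the binary Office formats are stored in an OLE compound file, which starts with the signature D0 CF 11 E0 A1 B1 1A E1, while the newer formats are ZIP archives starting with "PK" 03 04. It does not parse any Word structures, and `OfficeSniffer` and `OfficeContainer` are names I made up.

```csharp
using System;
using System.IO;

enum OfficeContainer { Unknown, CompoundFile, Zip }

static class OfficeSniffer
{
    // Header signature of an OLE compound file (.doc, .xls, .ppt).
    static readonly byte[] CompoundMagic =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // Local file header signature at the start of a ZIP archive.
    static readonly byte[] ZipMagic = { 0x50, 0x4B, 0x03, 0x04 };

    public static OfficeContainer Detect(Stream s)
    {
        var header = new byte[8];
        int read = 0;
        // Stream.Read may return fewer bytes than requested, so loop.
        while (read < header.Length)
        {
            int n = s.Read(header, read, header.Length - read);
            if (n == 0) break;
            read += n;
        }
        if (StartsWith(header, read, CompoundMagic)) return OfficeContainer.CompoundFile;
        if (StartsWith(header, read, ZipMagic)) return OfficeContainer.Zip;
        return OfficeContainer.Unknown;
    }

    static bool StartsWith(byte[] data, int length, byte[] magic)
    {
        if (length < magic.Length) return false;
        for (int i = 0; i < magic.Length; i++)
            if (data[i] != magic[i]) return false;
        return true;
    }
}
```

A match only tells you the container type, not that the file is really a Word document: any compound file or any ZIP archive will pass the same check.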

Edit: didn't realize what an old thread this was. Oops.

Edited by boogerjones
This topic is now closed to further replies.