[C#] Reading text from MS Word files


Question

Hey guys. How can I read the text from an MS Word file, and possibly other MS Office files, while using the least possible resources? If someone could share some code they already have for this, it would be really sweet :p

Basically, what I'm trying to do is build a desktop search application like MSN's, Google's and the others, in C#, for a class project. Mine doesn't have to be as complex or as feature-rich, just a basic version of what they do.

Any other tips would also be appreciated.

Thanks

Danish

PS: I'm storing the data I'm indexing in an MS Access file. That seems inefficient to me. Is there a better way to do that?

https://www.neowin.net/forum/topic/316480-c-reading-text-from-ms-word-files/

  • 0

I have reproduced the application error with a minimal amount of code. I get the same error whether or not I release the COM object. I don't get the error with the IFilter for Office documents, only with the Adobe one:

using System;
using System.Text;
using System.Runtime.InteropServices;

namespace TestError
{
    /// <summary>
    /// Summary description for Class1.
    /// </summary>
    class Class1
    {
        /// <summary>
        /// The main entry point for the application.
        /// </summary>
        [STAThread]
        static void Main(string[] args)
        {
            IFilter f = (IFilter)new CFilter();
            Marshal.ReleaseComObject(f);
            f = null;
            Console.WriteLine("finished");

            Console.ReadLine();
        }
    }

    [ComImport]
    [Guid("89BCB740-6119-101A-BCB7-00DD010655AF")]
    [InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
    public interface IFilter
    {
        void Init([MarshalAs(UnmanagedType.U4)] IFILTER_INIT grfFlags,
            uint cAttributes,
            [MarshalAs(UnmanagedType.LPArray, SizeParamIndex=1)] FULLPROPSPEC[] aAttributes,
            ref uint pdwFlags);

        void GetChunk([MarshalAs(UnmanagedType.Struct)] out STAT_CHUNK pStat);

        [PreserveSig] int GetText(ref uint pcwcBuffer, [MarshalAs(UnmanagedType.LPWStr)] StringBuilder buffer);

        void GetValue(ref UIntPtr ppPropValue);

        void BindRegion([MarshalAs(UnmanagedType.Struct)]FILTERREGION origPos, ref Guid riid, ref UIntPtr ppunk);
    }

    [ComImport]
    [Guid("4C904448-74A9-11d0-AF6E-00C04FD8DC02")]
    public class CFilter
    {
    }

    [Flags]
    public enum IFILTER_INIT
    {
        NONE                   = 0,
        CANON_PARAGRAPHS       = 1,
        HARD_LINE_BREAKS       = 2,
        CANON_HYPHENS          = 4,
        CANON_SPACES           = 8,
        APPLY_INDEX_ATTRIBUTES = 16,
        APPLY_CRAWL_ATTRIBUTES = 256,
        APPLY_OTHER_ATTRIBUTES = 32,
        INDEXING_ONLY          = 64,
        SEARCH_LINKS           = 128,
        FILTER_OWNED_VALUE_OK  = 512
    }

    [StructLayout(LayoutKind.Sequential)]
    public struct STAT_CHUNK
    {
        public uint idChunk;
        [MarshalAs(UnmanagedType.U4)]     public CHUNK_BREAKTYPE breakType;
        [MarshalAs(UnmanagedType.U4)]     public CHUNKSTATE flags;
        public uint locale;
        [MarshalAs(UnmanagedType.Struct)] public FULLPROPSPEC attribute;
        public uint idChunkSource;
        public uint cwcStartSource;
        public uint cwcLenSource;
    }

    [StructLayout(LayoutKind.Sequential)]
    public struct FILTERREGION
    {
        public uint idChunk;
        public uint cwcStart;
        public uint cwcExtent;
    }

    public enum CHUNKSTATE
    {
        CHUNK_TEXT               = 0x1,
        CHUNK_VALUE              = 0x2,
        CHUNK_FILTER_OWNED_VALUE = 0x4
    }

    [StructLayout(LayoutKind.Sequential)]
    public struct FULLPROPSPEC
    {
        public Guid guidPropSet;
        public PROPSPEC psProperty;
    }

    public enum CHUNK_BREAKTYPE
    {
        CHUNK_NO_BREAK = 0,
        CHUNK_EOW      = 1,
        CHUNK_EOS      = 2,
        CHUNK_EOP      = 3,
        CHUNK_EOC      = 4
    }

    [StructLayout(LayoutKind.Sequential)]
    public struct PROPSPEC
    {
        public uint ulKind;
        public uint propid;
        public IntPtr lpwstr;
    }
}

  • 0

Have you resolved this issue yet? I get the same error.

This works too, but with every approach I have tried so far I get an application error, though only with PDF files:

(ReadFile.exe is the name of my assembly)

Font Capture: ReadFile.exe - Application Error

The instruction at "0x030a61b3" referenced memory at "0x03a823e8". The memory could not be "read"

This always happens when my program closes. It works perfectly fine until I exit Main()...

I wonder if this has something to do with the Adobe IFilter not being released properly?

  • 0
No, I haven't found a solution to the problem yet. And since I have no experience with COM programming, I probably won't :)

I wasn't able to fix the error, but I prevented the error dialog from displaying by calling:

SetErrorMode(SEM_NOGPFAULTERRORBOX);

Call it from your main thread.
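For anyone wanting to try this from C#, here is a rough sketch of how the call could be wired up through P/Invoke. SetErrorMode lives in kernel32.dll and SEM_NOGPFAULTERRORBOX is 0x0002 in the Win32 headers; the class name ErrorModeHelper and the method SuppressCrashDialog are made up for illustration. Note that this only hides the dialog, the access violation itself still happens.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper class, not part of any library.
public static class ErrorModeHelper
{
    // Value taken from the Win32 headers.
    public const uint SEM_NOGPFAULTERRORBOX = 0x0002;

    // UINT SetErrorMode(UINT uMode); returns the previous error mode.
    [DllImport("kernel32.dll")]
    private static extern uint SetErrorMode(uint uMode);

    // Adds SEM_NOGPFAULTERRORBOX to the process error mode.
    // Returns false (and does nothing) when not running on Windows.
    public static bool SuppressCrashDialog()
    {
        if (Environment.OSVersion.Platform != PlatformID.Win32NT)
            return false;

        // The first call sets the flag and hands back the old mode;
        // the second call restores the old flags with ours added.
        uint previous = SetErrorMode(SEM_NOGPFAULTERRORBOX);
        SetErrorMode(previous | SEM_NOGPFAULTERRORBOX);
        return true;
    }
}
```

Calling ErrorModeHelper.SuppressCrashDialog() as the first statement in Main() should match the "main thread" advice above.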

  • 0
Though it works great for .doc files, it does not work for .docx files (the default format in MS Office Word 2007). Any suggestions?

.docx files are ZIP archives containing XML files. If you change the .docx extension to .zip, then WinRAR, WinZip, etc. can open the "document" and you can browse and extract the XML files. Indexing them is as simple as extracting and then using XPath :)
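The extract-then-XPath idea could look something like the sketch below. It assumes a .NET version that ships System.IO.Compression.ZipArchive, and it assumes the main body text sits in the word/document.xml entry inside w:t elements grouped by w:p paragraphs. The class name DocxText and the method Extract are made up for illustration. Headers, footers and footnotes live in other parts of the archive and are ignored here.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Xml;

// Hypothetical helper class, not part of any library.
public static class DocxText
{
    // WordprocessingML main namespace (the "w:" prefix in document.xml).
    private const string WordNs =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the body text of a .docx stream, one line per paragraph.
    public static string Extract(Stream docx)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, true))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found");

            var xml = new XmlDocument();
            using (Stream s = entry.Open())
                xml.Load(s);

            var ns = new XmlNamespaceManager(xml.NameTable);
            ns.AddNamespace("w", WordNs);

            var text = new StringBuilder();
            // Every paragraph, then every text run inside it.
            foreach (XmlNode para in xml.SelectNodes("//w:p", ns))
            {
                foreach (XmlNode run in para.SelectNodes(".//w:t", ns))
                    text.Append(run.InnerText);
                text.AppendLine();
            }
            return text.ToString();
        }
    }
}
```

To use it on a file, something like DocxText.Extract(File.OpenRead(path)) should work. Paragraphs nested inside other paragraphs (text boxes, for example) would be emitted twice by this simple query, so a real indexer may want a stricter XPath.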

  • 0

Microsoft recently released the file specifications for all the Microsoft Office file formats (.doc, .xls, etc.). If you want to use minimal resources, your best bet is to study the .doc file format and write code to parse it yourself. Not fun at all, but it would be the only way to do this without using a library or the Word Object Model.

Check out the fun bedtime reading.

Edit: didn't realize what an old thread this was. Oops.

Edited by boogerjones
This topic is now closed to further replies.