• 0

[C#] Reading text from MS Word files


Question

hey guys. how can i read the text from an MS word, and possibly other Ms Office files while using the least possible resources. If someone could share some code they already have for this, it wud be really sweet :p

basically what im trying to do is build a desktop searching application like MSNs googles and the others in C# for a class project. but mine doesnt have to be as complex or as feature rich. just a basic version of what they do.

any other tips wud also be appreciated.

thanks

danish

ps: im storing the data im indexing in an MS Access file. seems inefficient to me. any better way to do that?

Link to comment
https://www.neowin.net/forum/topic/316480-c-reading-text-from-ms-word-files/
Share on other sites

Recommended Posts

  • 0

I have reproduced the application error with a minimum amount of code. I get the same error no matter if I release the COM object or not. I don't get the error if I use the IFilter for office documents, only adobe...:

using System;
using System.Text;
using System.Runtime.InteropServices;

namespace TestError
{


	/// <summary>
	/// Summary description for Class1.
	/// </summary>
	class Class1
	{
  /// <summary>
  /// The main entry point for the application.
  /// </summary>
  [STAThread]
  static void Main(string[] args)
  {
  	IFilter f = (IFilter)new CFilter();
  	Marshal.ReleaseComObject(f);
  	f = null;
  	Console.WriteLine("finished");
  	
  	Console.ReadLine();
  }
	}

	[ComImport]

	[Guid("89BCB740-6119-101A-BCB7-00DD010655AF")]

	[InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]

	public interface IFilter

	{

  void Init([MarshalAs(UnmanagedType.U4)] IFILTER_INIT grfFlags, 

  	uint cAttributes,

  	[MarshalAs(UnmanagedType.LPArray, SizeParamIndex=1)] FULLPROPSPEC[] aAttributes,

  	ref uint pdwFlags);



  void GetChunk([MarshalAs(UnmanagedType.Struct)] out STAT_CHUNK pStat);



  [PreserveSig] int GetText(ref uint pcwcBuffer, [MarshalAs(UnmanagedType.LPWStr)] StringBuilder buffer);

        

  void GetValue(ref UIntPtr ppPropValue);



  void BindRegion([MarshalAs(UnmanagedType.Struct)]FILTERREGION origPos, ref Guid riid, ref UIntPtr ppunk);

	}



	[ComImport]
	[Guid("4C904448-74A9-11d0-AF6E-00C04FD8DC02")]
	public class CFilter

	{

	}

	[Flags]

	public enum IFILTER_INIT

	{

  NONE                   = 0,

  CANON_PARAGRAPHS       = 1,

  HARD_LINE_BREAKS       = 2,

  CANON_HYPHENS          = 4,

  CANON_SPACES           = 8,

  APPLY_INDEX_ATTRIBUTES = 16,

  APPLY_CRAWL_ATTRIBUTES = 256,

  APPLY_OTHER_ATTRIBUTES = 32,

  INDEXING_ONLY          = 64,

  SEARCH_LINKS           = 128,        

  FILTER_OWNED_VALUE_OK  = 512

	}


	[StructLayout(LayoutKind.Sequential)]

	public struct STAT_CHUNK

	{

  public uint  idChunk;

  [MarshalAs(UnmanagedType.U4)]     public CHUNK_BREAKTYPE breakType;

  [MarshalAs(UnmanagedType.U4)]     public CHUNKSTATE flags;

  public uint locale;

  [MarshalAs(UnmanagedType.Struct)] public FULLPROPSPEC attribute;

  public uint idChunkSource;

  public uint cwcStartSource;

  public uint cwcLenSource;

	}

    

	[StructLayout(LayoutKind.Sequential)]

	public struct FILTERREGION

	{

  public uint idChunk;

  public uint cwcStart;

  public uint cwcExtent;

	}

    

	public enum CHUNKSTATE
	{

  CHUNK_TEXT               = 0x1,

  CHUNK_VALUE              = 0x2,

  CHUNK_FILTER_OWNED_VALUE = 0x4

	}

	[StructLayout(LayoutKind.Sequential)]
	public struct FULLPROPSPEC
	{

  public Guid guidPropSet;

  public PROPSPEC psProperty;

	}

	public enum CHUNK_BREAKTYPE
	{

  CHUNK_NO_BREAK = 0,

  CHUNK_EOW      = 1,

  CHUNK_EOS      = 2,

  CHUNK_EOP      = 3,

  CHUNK_EOC      = 4

	}

	[StructLayout(LayoutKind.Sequential)]
	public struct PROPSPEC

	{

  public uint ulKind;

  public uint propid;

  public IntPtr lpwstr;

	}


}

  • 0

Have you resolved this issue yet? I get the same error.

This works too, but in all the approaches I have tried so far, I always get an application error, but only with pdf files:

(ReadFile.exe is the name of my assembly)

Font Capture: ReadFile.exe - Application Error

The instruction at "0x030a61b3" referenced memory at "0x03a823e8". The memory could not be "read"

This always happens when my program closes - it works perefctly fine until I exit Main()...

I wonder if this has something to do with the Adobe IFilter not being released properly?

586137106[/snapback]

  • 0
No I haven't found a solution for the problem yet. And since I have no experience with COM programming I probably won't :)

586248556[/snapback]

I wasn't able to fix the error, but I prevented the error from displaying by using:

SetErrorMode(SEM_NOGPFAULTERRORBOX);

place it in your main thread.

  • 0
Though it works great for .doc files, it does not work for .docx(default MS Office Word 2007 format) files. Any suggestions?

docx files are compressed xml files. If you change the .docx extension to .zip then WinRar and WinZip, etc. can open the "document" and you can browse and extract the xml files. Indexing them is as simple as extracting and then using xpath :)

  • 0

Microsoft recently released the file specifications for all the Microsoft Office file formats (.doc, .xls, etc). If you want to use minimal resources, your best bet is study the .doc file format and write code to parse it yourself. Not fun at all, but it would be the only way to do this without using a library or the Word Object Model.

Check out the fun bedtime reading.

Edit: didn't realize what an old thread this was. Oops.

Edited by boogerjones
This topic is now closed to further replies.
  • Recently Browsing   0 members

    • No registered users viewing this page.
  • Posts

    • Vivaldi version 8.0.4033.50 released June 17: https://vivaldi.com/blog/desktop/minor-update-eight-8-0/
    • The Online part hasn't even been announced and probably won't be included on day one. This is a massive singleplayer game.
    • While I agree with all that, it just proves there's an a** built for every seat.
    • Lol are you mad because I'm not using AI? I'd rather pay people than lose a bunch of potential customers and get humilated because I used AI. A lot of people won't purchase a game if it used AI during development.
    • LibreWolf 152.0-1 by Razvan Serea LibreWolf is an independent “fork” of Firefox, with the primary goals of privacy security and user freedom. It is the community run successor to LibreFox. LibreWolf is designed to increase protection against tracking and fingerprinting techniques, while also including a few security improvements. This is achieved through our privacy and security oriented settings and patches. LibreWolf also aims to remove all the telemetry, data collection and annoyances, as well as disabling anti-freedom features like DRM. LibreWolf features: Latest Firefox — LibreWolf is compiled directly from the latest build of Firefox Stable. You will have the the latest features, and security updates. Independent Build — LibreWolf uses a build independent of Firefox and has its own settings, profile folder and installation path. As a result, it can be installed alongside Firefox or any other browser. No phoning home — Embedded server links and other calling home functions are removed. In other words, minimal background connections by default. User settings updates Extensions firewall: limit internet access for extensions. Multi-platform (Windows/Linux/Mac/and soon Android) Community-Driven Dark theme (classic and advanced) LibreWolf privacy features: Delete cookies and website data on close. Include only privacy respecting search engines like DuckDuckGo and Searx. Include uBlockOrigin with custom default filter lists, and Tracking Protection in strict mode, to block trackers and ads. Strip tracking elements from URLs, both natively and through uBO. Enable dFPI, also known as Total Cookie Protection. Enable RFP which is part of the Tor Uplift project. RFP is considered the best in class anti-fingerprinting solution, and its goal is to make users look the same and cover as many metrics as possible, in an effort to block fingerprinting techniques. Always display user language as en-US to websites, in order to protect the language used in the browser and in the OS. Disable WebGL, as it is a strong fingerprinting vector. Prevent access to the location services of the OS, and use Mozilla's location API instead of Google's API. Limit ICE candidates generation to a single interface when sharing video or audio during a videoconference. Force DNS and WebRTC inside the proxy, when one is being used. Trim cross-origin referrers, so that they don't include the full URI. Disable link prefetching and speculative connections. Disable disk cache and clear temporary files on close. Disable form autofill. Disable search and form history...and more. LibreWolf 152.0-1 changelog: Upstream release, see the Firefox 152.0 Release Notes Notable changes: The AppImages are now built on Codeberg along with the other releases We have decided to wait a bit longer to enable the settings redesign, due to use being aware of multiple upstream issues Download: LibreWolf 64-bit | Portable 64-bit | ~100.0 MB (Open Source) Download: ARM64 | Portable ARM64 Links: LibreWolf Home Page | Addons | Screenshot | Reddit Get alerted to all of our Software updates on Twitter at @NeowinSoftware
  • Recent Achievements

    • Week One Done
      Huge Trailer earned a badge
      Week One Done
    • Week One Done
      Classifyskilleducation earned a badge
      Week One Done
    • One Month Later
      eurospharma62 earned a badge
      One Month Later
    • Week One Done
      With What earned a badge
      Week One Done
    • Week One Done
      Harris Gilbert earned a badge
      Week One Done
  • Popular Contributors

    1. 1
      +primortal
      560
    2. 2
      +Edouard
      169
    3. 3
      PsYcHoKiLLa
      73
    4. 4
      Michael Scrip
      64
    5. 5
      ATLien_0
      64
  • Tell a friend

    Love Neowin? Tell a friend!