• 0

[C#] Reading text from MS Word files


Question

Hey guys. How can I read the text from an MS Word file, and possibly other MS Office files, while using the least possible resources? If someone could share some code they already have for this, it would be really sweet :p

Basically, what I'm trying to do is build a desktop search application in C#, like MSN's, Google's and the others, for a class project. Mine doesn't have to be as complex or as feature-rich, just a basic version of what they do.

Any other tips would also be appreciated.

thanks

danish

PS: I'm storing the data I'm indexing in an MS Access file. That seems inefficient to me. Is there a better way to do that?

https://www.neowin.net/forum/topic/316480-c-reading-text-from-ms-word-files/

  • 0

I have reproduced the application error with a minimal amount of code. I get the same error whether or not I release the COM object. I don't get the error if I use the IFilter for Office documents, only with Adobe's:

using System;
using System.Text;
using System.Runtime.InteropServices;

namespace TestError
{
    /// <summary>
    /// Summary description for Class1.
    /// </summary>
    class Class1
    {
        /// <summary>
        /// The main entry point for the application.
        /// </summary>
        [STAThread]
        static void Main(string[] args)
        {
            IFilter f = (IFilter)new CFilter();
            Marshal.ReleaseComObject(f);
            f = null;
            Console.WriteLine("finished");

            Console.ReadLine();
        }
    }

    [ComImport]
    [Guid("89BCB740-6119-101A-BCB7-00DD010655AF")]
    [InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
    public interface IFilter
    {
        void Init([MarshalAs(UnmanagedType.U4)] IFILTER_INIT grfFlags,
            uint cAttributes,
            [MarshalAs(UnmanagedType.LPArray, SizeParamIndex=1)] FULLPROPSPEC[] aAttributes,
            ref uint pdwFlags);

        void GetChunk([MarshalAs(UnmanagedType.Struct)] out STAT_CHUNK pStat);

        [PreserveSig] int GetText(ref uint pcwcBuffer, [MarshalAs(UnmanagedType.LPWStr)] StringBuilder buffer);

        void GetValue(ref UIntPtr ppPropValue);

        void BindRegion([MarshalAs(UnmanagedType.Struct)]FILTERREGION origPos, ref Guid riid, ref UIntPtr ppunk);
    }

    [ComImport]
    [Guid("4C904448-74A9-11d0-AF6E-00C04FD8DC02")]
    public class CFilter
    {
    }

    [Flags]
    public enum IFILTER_INIT
    {
        NONE                   = 0,
        CANON_PARAGRAPHS       = 1,
        HARD_LINE_BREAKS       = 2,
        CANON_HYPHENS          = 4,
        CANON_SPACES           = 8,
        APPLY_INDEX_ATTRIBUTES = 16,
        APPLY_CRAWL_ATTRIBUTES = 256,
        APPLY_OTHER_ATTRIBUTES = 32,
        INDEXING_ONLY          = 64,
        SEARCH_LINKS           = 128,
        FILTER_OWNED_VALUE_OK  = 512
    }

    [StructLayout(LayoutKind.Sequential)]
    public struct STAT_CHUNK
    {
        public uint idChunk;
        [MarshalAs(UnmanagedType.U4)]     public CHUNK_BREAKTYPE breakType;
        [MarshalAs(UnmanagedType.U4)]     public CHUNKSTATE flags;
        public uint locale;
        [MarshalAs(UnmanagedType.Struct)] public FULLPROPSPEC attribute;
        public uint idChunkSource;
        public uint cwcStartSource;
        public uint cwcLenSource;
    }

    [StructLayout(LayoutKind.Sequential)]
    public struct FILTERREGION
    {
        public uint idChunk;
        public uint cwcStart;
        public uint cwcExtent;
    }

    public enum CHUNKSTATE
    {
        CHUNK_TEXT               = 0x1,
        CHUNK_VALUE              = 0x2,
        CHUNK_FILTER_OWNED_VALUE = 0x4
    }

    [StructLayout(LayoutKind.Sequential)]
    public struct FULLPROPSPEC
    {
        public Guid guidPropSet;
        public PROPSPEC psProperty;
    }

    public enum CHUNK_BREAKTYPE
    {
        CHUNK_NO_BREAK = 0,
        CHUNK_EOW      = 1,
        CHUNK_EOS      = 2,
        CHUNK_EOP      = 3,
        CHUNK_EOC      = 4
    }

    [StructLayout(LayoutKind.Sequential)]
    public struct PROPSPEC
    {
        public uint ulKind;
        public uint propid;
        public IntPtr lpwstr;
    }
}

  • 0

Have you resolved this issue yet? I get the same error.

This works too, but with every approach I have tried so far I always get an application error, though only with PDF files:

(ReadFile.exe is the name of my assembly)

Font Capture: ReadFile.exe - Application Error

The instruction at "0x030a61b3" referenced memory at "0x03a823e8". The memory could not be "read"

This always happens when my program closes. It works perfectly fine until I exit Main()...

I wonder if this has something to do with the Adobe IFilter not being released properly?


  • 0
No I haven't found a solution for the problem yet. And since I have no experience with COM programming I probably won't :)


I wasn't able to fix the error, but I prevented the error from displaying by using:

SetErrorMode(SEM_NOGPFAULTERRORBOX);

Place it in your main thread.
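For anyone calling this from C#, here is a rough sketch of the P/Invoke declaration it needs. SetErrorMode lives in kernel32.dll and SEM_NOGPFAULTERRORBOX is 0x0002 in the Win32 headers; the class name ErrorModeNative and the helper SuppressCrashDialog are made up for illustration. Note that this only hides the dialog, it does not fix whatever the filter is doing wrong at shutdown.

```csharp
using System;
using System.Runtime.InteropServices;

static class ErrorModeNative
{
    // Win32 flag: do not show the general-protection-fault error box.
    public const uint SEM_NOGPFAULTERRORBOX = 0x0002;

    // Win32 SetErrorMode from kernel32.dll; returns the previous error mode.
    [DllImport("kernel32.dll")]
    public static extern uint SetErrorMode(uint uMode);

    // Hypothetical helper: adds the flag while keeping any flags already set.
    public static void SuppressCrashDialog()
    {
        uint previous = SetErrorMode(SEM_NOGPFAULTERRORBOX);
        SetErrorMode(previous | SEM_NOGPFAULTERRORBOX);
    }
}
```

Calling ErrorModeNative.SuppressCrashDialog() as the first statement in Main() should be enough, since the error mode applies to the whole process.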

  • 0
Though it works great for .doc files, it does not work for .docx (the default MS Office Word 2007 format) files. Any suggestions?

.docx files are ZIP archives of XML files. If you change the .docx extension to .zip, then WinRAR, WinZip, etc. can open the "document" and you can browse and extract the XML files. Indexing them is as simple as extracting and then using XPath :)
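A minimal sketch of that extract-then-XPath idea, assuming a framework recent enough to have System.IO.Compression.ZipArchive (.NET 4.5 or later, so newer than this thread). The main body of a Word 2007 document is stored in the archive as word/document.xml, with text in w:t elements inside w:p paragraphs. The class name DocxText and the method Extract are hypothetical, and headers, footers and footnotes live in other parts of the archive that this does not read.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Xml;

static class DocxText
{
    // WordprocessingML main namespace used by word/document.xml.
    const string W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: returns the body text, one line per paragraph.
    public static string Extract(Stream docx)
    {
        // leaveOpen = true so the caller keeps ownership of the stream.
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, true))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found");

            var xml = new XmlDocument();
            using (Stream s = entry.Open())
                xml.Load(s);

            var ns = new XmlNamespaceManager(xml.NameTable);
            ns.AddNamespace("w", W);

            var sb = new StringBuilder();
            // Each w:p is a paragraph; its text sits in descendant w:t elements.
            foreach (XmlNode p in xml.SelectNodes("//w:p", ns))
            {
                foreach (XmlNode t in p.SelectNodes(".//w:t", ns))
                    sb.Append(t.InnerText);
                sb.AppendLine();
            }
            return sb.ToString();
        }
    }
}
```

To use it, open the file with File.OpenRead and pass the stream to DocxText.Extract. Loading the whole part into an XmlDocument is simple but not the lightest option; for large documents a forward-only XmlReader would probably use less memory.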

  • 0

Microsoft recently released the file specifications for all the Microsoft Office file formats (.doc, .xls, etc). If you want to use minimal resources, your best bet is to study the .doc file format and write code to parse it yourself. Not fun at all, but it would be the only way to do this without using a library or the Word Object Model.

Check out the fun bedtime reading.

Edit: didn't realize what an old thread this was. Oops.

Edited by boogerjones