[C#] Reading text from MS Word files


Question

Hey guys. How can I read the text from an MS Word file, and possibly other MS Office files, while using the least possible resources? If someone could share some code they already have for this, it would be really sweet :p

Basically, what I'm trying to do is build a desktop search application like MSN's, Google's and the others, in C#, for a class project. Mine doesn't have to be as complex or as feature-rich, just a basic version of what they do.

Any other tips would also be appreciated.

Thanks

danish

PS: I'm storing the data I'm indexing in an MS Access file. That seems inefficient to me. Is there a better way to do that?

https://www.neowin.net/forum/topic/316480-c-reading-text-from-ms-word-files/

  • 0

I have reproduced the application error with a minimum amount of code. I get the same error whether or not I release the COM object. I don't get the error if I use the IFilter for Office documents, only with the Adobe one:

using System;
using System.Text;
using System.Runtime.InteropServices;

namespace TestError
{
    /// <summary>
    /// Minimal repro: create the filter COM object, release it, and exit.
    /// </summary>
    class Class1
    {
        /// <summary>
        /// The main entry point for the application.
        /// </summary>
        [STAThread]
        static void Main(string[] args)
        {
            // Instantiate the COM class and ask for its IFilter interface.
            IFilter f = (IFilter)new CFilter();

            // Release the COM reference explicitly; the error occurs with or without this.
            Marshal.ReleaseComObject(f);
            f = null;
            Console.WriteLine("finished");

            Console.ReadLine();
        }
    }

    // Managed declaration of the IFilter COM interface.
    [ComImport]
    [Guid("89BCB740-6119-101A-BCB7-00DD010655AF")]
    [InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
    public interface IFilter
    {
        void Init([MarshalAs(UnmanagedType.U4)] IFILTER_INIT grfFlags,
            uint cAttributes,
            [MarshalAs(UnmanagedType.LPArray, SizeParamIndex=1)] FULLPROPSPEC[] aAttributes,
            ref uint pdwFlags);

        void GetChunk([MarshalAs(UnmanagedType.Struct)] out STAT_CHUNK pStat);

        [PreserveSig] int GetText(ref uint pcwcBuffer, [MarshalAs(UnmanagedType.LPWStr)] StringBuilder buffer);

        void GetValue(ref UIntPtr ppPropValue);

        void BindRegion([MarshalAs(UnmanagedType.Struct)]FILTERREGION origPos, ref Guid riid, ref UIntPtr ppunk);
    }

    // COM class of the filter being tested.
    [ComImport]
    [Guid("4C904448-74A9-11d0-AF6E-00C04FD8DC02")]
    public class CFilter
    {
    }

    [Flags]
    public enum IFILTER_INIT
    {
        NONE                   = 0,
        CANON_PARAGRAPHS       = 1,
        HARD_LINE_BREAKS       = 2,
        CANON_HYPHENS          = 4,
        CANON_SPACES           = 8,
        APPLY_INDEX_ATTRIBUTES = 16,
        APPLY_OTHER_ATTRIBUTES = 32,
        INDEXING_ONLY          = 64,
        SEARCH_LINKS           = 128,
        APPLY_CRAWL_ATTRIBUTES = 256,
        FILTER_OWNED_VALUE_OK  = 512
    }

    [StructLayout(LayoutKind.Sequential)]
    public struct STAT_CHUNK
    {
        public uint idChunk;
        [MarshalAs(UnmanagedType.U4)]     public CHUNK_BREAKTYPE breakType;
        [MarshalAs(UnmanagedType.U4)]     public CHUNKSTATE flags;
        public uint locale;
        [MarshalAs(UnmanagedType.Struct)] public FULLPROPSPEC attribute;
        public uint idChunkSource;
        public uint cwcStartSource;
        public uint cwcLenSource;
    }

    [StructLayout(LayoutKind.Sequential)]
    public struct FILTERREGION
    {
        public uint idChunk;
        public uint cwcStart;
        public uint cwcExtent;
    }

    public enum CHUNKSTATE
    {
        CHUNK_TEXT               = 0x1,
        CHUNK_VALUE              = 0x2,
        CHUNK_FILTER_OWNED_VALUE = 0x4
    }

    [StructLayout(LayoutKind.Sequential)]
    public struct FULLPROPSPEC
    {
        public Guid guidPropSet;
        public PROPSPEC psProperty;
    }

    public enum CHUNK_BREAKTYPE
    {
        CHUNK_NO_BREAK = 0,
        CHUNK_EOW      = 1,
        CHUNK_EOS      = 2,
        CHUNK_EOP      = 3,
        CHUNK_EOC      = 4
    }

    [StructLayout(LayoutKind.Sequential)]
    public struct PROPSPEC
    {
        public uint ulKind;
        public uint propid;
        public IntPtr lpwstr;
    }
}

  • 0

Have you resolved this issue yet? I get the same error.

This works too, but with every approach I have tried so far I get an application error, though only with PDF files:

(ReadFile.exe is the name of my assembly)

Font Capture: ReadFile.exe - Application Error

The instruction at "0x030a61b3" referenced memory at "0x03a823e8". The memory could not be "read"

This always happens when my program closes. It works perfectly fine until I exit Main()...

I wonder if this has something to do with the Adobe IFilter not being released properly?


  • 0
No, I haven't found a solution to the problem yet. And since I have no experience with COM programming, I probably won't :)


I wasn't able to fix the error, but I stopped the error dialog from being displayed by calling:

SetErrorMode(SEM_NOGPFAULTERRORBOX);

Place the call in your main thread.
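For anyone calling this from C#: SetErrorMode is a Win32 function, so it needs a P/Invoke declaration. Here is a minimal sketch, assuming Windows and kernel32.dll. The class name ErrorModeNative and the helper SuppressFaultDialog are made up for illustration; only SetErrorMode and SEM_NOGPFAULTERRORBOX come from the Windows API. Note that this only hides the dialog; whatever faults at exit still faults.

```csharp
using System;
using System.Runtime.InteropServices;

static class ErrorModeNative
{
    // Value of SEM_NOGPFAULTERRORBOX from the Windows SDK headers.
    public const uint SEM_NOGPFAULTERRORBOX = 0x0002;

    // SetErrorMode is exported by kernel32.dll and returns the previous error mode.
    [DllImport("kernel32.dll")]
    public static extern uint SetErrorMode(uint uMode);

    // Hypothetical helper: adds the flag to the existing mode instead of
    // overwriting whatever flags the process already had set.
    public static void SuppressFaultDialog()
    {
        uint previous = SetErrorMode(SEM_NOGPFAULTERRORBOX);
        SetErrorMode(previous | SEM_NOGPFAULTERRORBOX);
    }
}
```

Calling ErrorModeNative.SuppressFaultDialog() as the first statement in Main() should be enough, since the error mode applies to the whole process.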

  • 0
Though it works great for .doc files, it does not work for .docx files (the default MS Office Word 2007 format). Any suggestions?

.docx files are ZIP-compressed XML files. If you change the .docx extension to .zip, then WinRAR, WinZip, etc. can open the "document" and you can browse and extract the XML files. Indexing them is as simple as extracting the XML and then using XPath :)
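A rough sketch of that idea in C#, in case it helps. It assumes .NET 4.5 or later (for System.IO.Compression.ZipArchive), and that the body text lives in the word/document.xml part as w:t elements inside w:p paragraphs. The class name DocxText and the method Extract are made up for this example. It ignores headers, footers, footnotes and so on, and nested paragraphs (text boxes, for instance) may come out twice.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Xml;

static class DocxText
{
    // WordprocessingML main namespace used by the w: prefix.
    const string WordNs = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the body text of a .docx stream, one line per paragraph.
    public static string Extract(Stream docx)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, true))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found");

            var xml = new XmlDocument();
            xml.PreserveWhitespace = true; // keep spaces inside w:t runs
            using (Stream s = entry.Open())
                xml.Load(s);

            var ns = new XmlNamespaceManager(xml.NameTable);
            ns.AddNamespace("w", WordNs);

            var sb = new StringBuilder();
            foreach (XmlNode para in xml.SelectNodes("//w:p", ns))
            {
                // Concatenate every text run inside this paragraph.
                foreach (XmlNode run in para.SelectNodes(".//w:t", ns))
                    sb.Append(run.InnerText);
                sb.AppendLine();
            }
            return sb.ToString();
        }
    }
}
```

Usage would be something like: using (var fs = File.OpenRead(path)) { string text = DocxText.Extract(fs); } and then feeding the text to the indexer. No Word installation or IFilter is needed for this route.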

  • 0

Microsoft recently released the file specifications for all the Microsoft Office file formats (.doc, .xls, etc.). If you want to use minimal resources, your best bet is to study the .doc file format and write code to parse it yourself. Not fun at all, but it would be the only way to do this without using a library or the Word Object Model.

Check out the fun bedtime reading.

Edit: didn't realize what an old thread this was. Oops.

Edited by boogerjones
This topic is now closed to further replies.