[C#] Reading text from MS Word files


  • This topic is locked
35 replies to this topic

#1 dtmunir

dtmunir

    Neowinian

  • Joined: 22-April 03

Posted 04 May 2005 - 17:25

Hey guys. How can I read the text from an MS Word file, and possibly other MS Office files, while using the fewest possible resources? If someone could share some code they already have for this, it would be really sweet :p

Basically, what I'm trying to do is build a desktop search application in C#, like MSN's, Google's and the others, for a class project. Mine doesn't have to be as complex or as feature-rich, just a basic version of what they do.

Any other tips would also be appreciated.

Thanks
Danish

PS: I'm storing the data I'm indexing in an MS Access file, which seems inefficient to me. Is there a better way to do that?
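
For a class-project-sized index, one possible alternative to a database table is an in-memory inverted index: a map from each word to the set of files that contain it. The sketch below is only an illustration of that idea. The class name `TinyIndex`, its methods and the separator list are invented for this example, it assumes a framework version that has `HashSet<T>` (.NET 3.5 or later), and saving the index to disk is left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: maps each word (case-insensitively) to the files containing it.
public class TinyIndex
{
    private readonly Dictionary<string, HashSet<string>> postings =
        new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);

    private static readonly char[] Separators =
        { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?', '(', ')', '"' };

    // Splits the extracted text into words and records the file under each one.
    public void Add(string filePath, string text)
    {
        foreach (string word in text.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
        {
            HashSet<string> files;
            if (!postings.TryGetValue(word, out files))
            {
                files = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
                postings[word] = files;
            }
            files.Add(filePath);
        }
    }

    // Returns the files that contain the given word (empty if none do).
    public ICollection<string> Search(string word)
    {
        HashSet<string> files;
        return postings.TryGetValue(word, out files)
            ? (ICollection<string>)files
            : new string[0];
    }
}
```

Whatever extracts the text from a file would call `Add(path, text)` once per file, and a query is then a dictionary lookup rather than a scan over stored documents.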


#2 Gooey

Gooey

    Taoian

  • Joined: 07-September 01

Posted 06 May 2005 - 08:27

What you'll probably have to do is add a COM reference to the Microsoft Word Object Library and access Word that way, but the Word API isn't the friendliest thing in the world. This link should help though:

http://www.codeproje...application.asp

However, a better method might be to dump Access and use MS SQL Server if you can, as that will let you store Office documents inside it, index them for you, and let you search against their contents. Effectively this will do all the hard work for you. Not sure if MSDE / SQL Express also does this.

#3 OP dtmunir

dtmunir

    Neowinian

  • Joined: 22-April 03

Posted 06 May 2005 - 12:16

Hey man, thanks for the response.

OK, the API is a bit resource-intensive, since it involves loading an instance of the MS Word app into memory. Since I want to index files quickly, and there will be plenty of file types to index, the overhead of opening each MS Office app will be too much. Isn't there some DLL or something that can be used to extract the text from Office files?

Secondly, regarding MS SQL, the problem is that this has to be a redistributable desktop application. With MS Access, I can easily bundle an .mdb file with empty tables in the project. But with MS SQL, I can't even be sure whether the client PCs will have it or not.

Thanks for all your help
Danish

#4 Gooey

Gooey

    Taoian

  • Joined: 07-September 01

Posted 06 May 2005 - 12:43

If there is a DLL that lets you peek into the file then I haven't heard of it; all the things I've done with Word were through the API. I see your issue with SQL Server. The other option would have been using MSDE / SQL Express, which you could include in your app, but unfortunately these don't include full-text search for size reasons. The last I heard, this wasn't being reconsidered. :(

Perhaps your best option is to use the IFilter interface (http://msdn.microsof...refint_9sfm.asp), which I'm led to believe is what MSN Desktop Search uses to look inside files. I can't offer much info on these as they are new to me as of ten minutes ago, but I'd imagine there would be Word and other Office filters already floating around somewhere which you could use.

Hope this helps...

#5 idbuythatforadollar

idbuythatforadollar

    Swwwwwwweet!

  • Joined: 07-January 03
  • Location: Boink

Posted 06 May 2005 - 12:47

Very easy (copied from an app I made). It's in VB.NET but that shouldn't stop you. You need references to the Word interop DLLs:

            'Cheap way of opening Word docs: open as a doc, save as a text file, then open the text file!
            Dim wWordApp As Word.Application = New Word.Application
            wWordApp.DisplayAlerts = Word.WdAlertLevel.wdAlertsNone

            Dim dFile As Word.Document = wWordApp.Documents.Open(CType(sFilename, Object))

            dFile.SaveAs(Path.GetDirectoryName(Application.ExecutablePath) + "\temp.txt", Word.WdSaveFormat.wdFormatText)
            dFile.Close()
            'Quit Word as well, otherwise a hidden WINWORD.EXE process stays in memory.
            wWordApp.Quit()




#6 OP dtmunir

dtmunir

    Neowinian

  • Joined: 22-April 03

Posted 06 May 2005 - 14:26

Hey man, thanks for that tip. But you see, it involves creating an object of the Word application class, which takes up a lot of resources. Imagine I'm indexing files on the fly as they get modified by the user. I might then have a scenario where the user is working on a Word doc, an Excel spreadsheet and a PowerPoint presentation at the same time, making changes to all three.

I would have to create instances of the Word, Excel and PowerPoint applications over and over again, which would just make the resource consumption ghastly :p

Thanks for replying though. Really appreciate it :)
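
For the on-the-fly scenario described here, .NET's `FileSystemWatcher` can report which files changed, so only those need to be re-read. The following is a rough sketch under several assumptions: the class `ChangeQueue`, its method names and the extension list are invented for illustration, and `HashSet<T>` and lambdas need a newer compiler and framework than the ones discussed in this thread. Collecting paths in a set means several saves of the same file collapse into a single re-index.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: collects changed Office files so each is re-indexed once per pass.
public class ChangeQueue
{
    private static readonly string[] Extensions = { ".doc", ".xls", ".ppt" };
    private readonly HashSet<string> pending =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase);
    private readonly object gate = new object();

    // True for the three file types in question; skips Office's "~$" owner/lock files.
    public static bool ShouldIndex(string path)
    {
        string name = Path.GetFileName(path);
        if (name.StartsWith("~$", StringComparison.Ordinal)) return false;
        string ext = Path.GetExtension(path);
        return Array.Exists(Extensions,
            e => string.Equals(e, ext, StringComparison.OrdinalIgnoreCase));
    }

    // Starts watching a folder tree; the caller must keep the returned watcher alive.
    public FileSystemWatcher Watch(string folder)
    {
        FileSystemWatcher watcher = new FileSystemWatcher(folder);
        watcher.IncludeSubdirectories = true;
        watcher.NotifyFilter = NotifyFilters.LastWrite | NotifyFilters.FileName;
        watcher.Changed += (s, e) => Enqueue(e.FullPath);
        watcher.Created += (s, e) => Enqueue(e.FullPath);
        watcher.Renamed += (s, e) => Enqueue(e.FullPath);
        watcher.EnableRaisingEvents = true;
        return watcher;
    }

    // Remembers a changed path if it is a file type worth indexing.
    public void Enqueue(string path)
    {
        if (!ShouldIndex(path)) return;
        lock (gate) { pending.Add(path); }
    }

    // Hands back everything changed since the last call and clears the set.
    public List<string> Drain()
    {
        lock (gate)
        {
            List<string> batch = new List<string>(pending);
            pending.Clear();
            return batch;
        }
    }
}
```

A timer could call `Drain()` every few seconds and pass each path to whatever extracts the text, which keeps the expensive work off the watcher's event thread.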

#7 Pyth007

Pyth007

    Resident One Post Wonder

  • Joined: 17-May 05

Posted 17 May 2005 - 22:58

I'm not sure how well this would work, but... If you were to open a Word .doc in a plain text editor (e.g. Notepad), you'd see a lot of unrecognizable characters plus the actual characters of the text that was written in Word. Since it sounds as though you are only interested in the text, and not any of the formatting, you may try opening and reading the Word .doc as a text file in your program (see the TextReader class). Because Word may have stored newlines differently, you may be stuck with using only the Read() or ReadToEnd() methods. Perhaps store the result of ReadToEnd() in a string and apply a regular expression (Regex class) to keep only "real" characters (e.g. \w and \s, which match word characters (letters, digits and underscore) and whitespace characters (mostly " " spaces, since Word would probably have goofed-up tabs, newlines, etc.)). I'm not sure how well this would work for other Office documents, however....
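
The idea above could be sketched roughly as follows. This is a crude illustration rather than a real parser: the class and method names are made up, the bytes are decoded as ISO-8859-1 purely so that every byte maps to some character, and stripping NUL characters is an assumed shortcut for text that Word stored as two-byte characters. Expect leftover junk, since \w also matches accented letters that binary data happens to decode to.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper for the "read it as text and filter with a regex" approach.
public static class CrudeTextScraper
{
    // Returns runs of word and whitespace characters that are at least minLength long.
    public static List<string> ExtractRuns(string raw, int minLength)
    {
        List<string> runs = new List<string>();
        Regex pattern = new Regex(@"[\w\s]{" + minLength + @",}");
        foreach (Match m in pattern.Matches(raw))
        {
            string run = m.Value.Trim();
            if (run.Length > 0) runs.Add(run); // drop runs that were only whitespace
        }
        return runs;
    }

    // Reads the whole file, decodes it byte-for-byte and filters it with ExtractRuns.
    public static List<string> ExtractRunsFromFile(string path, int minLength)
    {
        byte[] bytes = File.ReadAllBytes(path);
        string raw = Encoding.GetEncoding("iso-8859-1").GetString(bytes);
        // Assumption: dropping NULs approximates decoding two-byte ASCII-range text.
        raw = raw.Replace("\0", "");
        return ExtractRuns(raw, minLength);
    }
}
```

Requiring a minimum run length (say 4 or more) filters out most of the short accidental matches, at the cost of losing very short words that sit between binary fragments.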

#8 OP dtmunir

dtmunir

    Neowinian

  • Joined: 22-April 03

Posted 18 May 2005 - 19:39

Hey everybody, thanks a lot for all the responses.
I've finally managed to figure this one out. The answer lies in the use of IFilters, as Gooey suggested.
After a lot of searching on the net and a bit of tweaking, I've made a C# class that can extract the text from .doc, .xls and .ppt files. I'll post the code shortly, but I must warn that this class has very primitive error checking, and although it hardly ever crashes, it's not fit for distribution, I would think. If somebody ever improves on this, please do post a version here, or mail it to me. Thanks a lot.

#9 riffas

riffas

    Resident One Post Wonder

  • Joined: 30-May 05

Posted 30 May 2005 - 17:10

Hi!

I'm working on a similar project, but for a document management repository. I also need to parse .doc files, although I'm using the Lucene.Net project for storing and retrieving indexes.

Can you please post the C# class in its current state?

Thanks in Advance,
Tiago





#10 OP dtmunir

dtmunir

    Neowinian

  • Joined: 22-April 03

Posted 30 May 2005 - 18:51

Hey,
I tried to post the code earlier on, but the Neowin server kept giving me errors. Sorry about that. I'm trying again now; hopefully it works this time.

Edited:
It works!!
OK, this is how to use it:
Add a new code file to your project and copy and paste all of this code into it.
Create an OfficeFileReader.OfficeFileReader object and call its GetText method.
The syntax is as follows:

public static void Main()
{
  OfficeFileReader.OfficeFileReader objOFR = new OfficeFileReader.OfficeFileReader();
  string output="";
  objOFR.GetText("C:\\MyWordFile.Doc", ref output);
  Console.WriteLine(output);
}


///==============================================================

/// Office File Reader

///==============================================================

using System;

using System.Text;

using System.Runtime.InteropServices;



namespace OfficeFileReader
{
    #region Stuff you Dont even need to look at
    [Flags]

    public enum IFILTER_INIT
    {

        NONE = 0,

        CANON_PARAGRAPHS = 1,

        HARD_LINE_BREAKS = 2,

        CANON_HYPHENS = 4,

        CANON_SPACES = 8,

        APPLY_INDEX_ATTRIBUTES = 16,

        APPLY_CRAWL_ATTRIBUTES = 256,

        APPLY_OTHER_ATTRIBUTES = 32,

        INDEXING_ONLY = 64,

        SEARCH_LINKS = 128,

        FILTER_OWNED_VALUE_OK = 512

    }



    [Flags]

    public enum IFILTER_FLAGS
    {

        OLE_PROPERTIES = 1

    }



    public enum CHUNK_BREAKTYPE
    {

        CHUNK_NO_BREAK = 0,

        CHUNK_EOW = 1,

        CHUNK_EOS = 2,

        CHUNK_EOP = 3,

        CHUNK_EOC = 4

    }



    [Flags]

    public enum CHUNKSTATE
    {

        CHUNK_TEXT = 0x1,

        CHUNK_VALUE = 0x2,

        CHUNK_FILTER_OWNED_VALUE = 0x4

    }



    public enum PSKIND
    {

        LPWSTR = 0,

        PROPID = 1

    }



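    [StructLayout(LayoutKind.Sequential)]

    public struct PROPSPEC
    {

        public uint ulKind;

        // The native PROPSPEC keeps propid and lpwstr in a union, so they share one
        // pointer-sized slot: a PROPID when ulKind is PSKIND.PROPID, otherwise a
        // pointer to a wide string. Declaring them as two separate fields makes the
        // struct too large on 32-bit and shifts the fields that follow it in STAT_CHUNK.
        public IntPtr propidOrLpwstr;

    }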



    [StructLayout(LayoutKind.Sequential)]

    public struct FULLPROPSPEC
    {

        public Guid guidPropSet;

        public PROPSPEC psProperty;

    }



    [StructLayout(LayoutKind.Sequential)]

    public struct STAT_CHUNK
    {

        public uint idChunk;

        [MarshalAs(UnmanagedType.U4)]
        public CHUNK_BREAKTYPE breakType;

        [MarshalAs(UnmanagedType.U4)]
        public CHUNKSTATE flags;

        public uint locale;

        [MarshalAs(UnmanagedType.Struct)]
        public FULLPROPSPEC attribute;

        public uint idChunkSource;

        public uint cwcStartSource;

        public uint cwcLenSource;

    }



    [StructLayout(LayoutKind.Sequential)]

    public struct FILTERREGION
    {

        public uint idChunk;

        public uint cwcStart;

        public uint cwcExtent;

    }


    #endregion

    [ComImport]

    [Guid("89BCB740-6119-101A-BCB7-00DD010655AF")]

    [InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]

    public interface IFilter
    {

        void Init([MarshalAs(UnmanagedType.U4)] IFILTER_INIT grfFlags,

                  uint cAttributes,

                  [MarshalAs(UnmanagedType.LPArray, SizeParamIndex = 1)] FULLPROPSPEC[] aAttributes,

                  ref uint pdwFlags);



        void GetChunk([MarshalAs(UnmanagedType.Struct)] out STAT_CHUNK pStat);



        [PreserveSig]
        int GetText(ref uint pcwcBuffer, [MarshalAs(UnmanagedType.LPWStr)] StringBuilder buffer);



        void GetValue(ref UIntPtr ppPropValue);



        void BindRegion([MarshalAs(UnmanagedType.Struct)]FILTERREGION origPos, ref Guid riid, ref UIntPtr ppunk);

    }



    [ComImport]

    [Guid("f07f3920-7b8c-11cf-9be8-00aa004b9986")]

    public class CFilter
    {

    }





    public class Constants
    {

        public const uint PID_STG_DIRECTORY = 0x00000002;



        public const uint PID_STG_CLASSID = 0x00000003;

        public const uint PID_STG_STORAGETYPE = 0x00000004;



        public const uint PID_STG_VOLUME_ID = 0x00000005;

        public const uint PID_STG_PARENT_WORKID = 0x00000006;

        public const uint PID_STG_SECONDARYSTORE = 0x00000007;



        public const uint PID_STG_FILEINDEX = 0x00000008;

        public const uint PID_STG_LASTCHANGEUSN = 0x00000009;

        public const uint PID_STG_NAME = 0x0000000a;

        public const uint PID_STG_PATH = 0x0000000b;



        public const uint PID_STG_SIZE = 0x0000000c;

        public const uint PID_STG_ATTRIBUTES = 0x0000000d;

        public const uint PID_STG_WRITETIME = 0x0000000e;

        public const uint PID_STG_CREATETIME = 0x0000000f;

        public const uint PID_STG_ACCESSTIME = 0x00000010;

        public const uint PID_STG_CHANGETIME = 0x00000011;



        public const uint PID_STG_CONTENTS = 0x00000013;

        public const uint PID_STG_SHORTNAME = 0x00000014;



        public const int FILTER_E_END_OF_CHUNKS = (unchecked((int)0x80041700));

        public const int FILTER_E_NO_MORE_TEXT = (unchecked((int)0x80041701));

        public const int FILTER_E_NO_MORE_VALUES = (unchecked((int)0x80041702));



        public const int FILTER_E_NO_TEXT = (unchecked((int)0x80041705));

        public const int FILTER_E_NO_VALUES = (unchecked((int)0x80041706));



        public const int FILTER_S_LAST_TEXT = (unchecked((int)0x00041709));

       
    }
    public class OfficeFileReader
    {
        // path is the path of the .doc, .xls or .ppt file.
        // text receives all the text extracted from the file.
        public void GetText(String path, ref string text)
        {
            StringBuilder result = new StringBuilder();
            IFilter ifilt = null;
            try
            {
                ifilt = (IFilter)(new CFilter());
                // The ComTypes namespace only exists from .NET 2.0 onwards. On .NET 1.1, use this line instead:
                //System.Runtime.InteropServices.UCOMIPersistFile ipf = (System.Runtime.InteropServices.UCOMIPersistFile)(ifilt);
                System.Runtime.InteropServices.ComTypes.IPersistFile ipf = (System.Runtime.InteropServices.ComTypes.IPersistFile)(ifilt);
                ipf.Load(path, 0);
                uint i = 0;
                STAT_CHUNK ps = new STAT_CHUNK();
                ifilt.Init(IFILTER_INIT.NONE, 0, null, ref i);

                // GetChunk is declared without [PreserveSig], so the end of the document
                // arrives as a COMException carrying FILTER_E_END_OF_CHUNKS.
                while (true)
                {
                    ifilt.GetChunk(out ps);
                    if (ps.flags != CHUNKSTATE.CHUNK_TEXT) continue;

                    int hr2 = 0;
                    // Read until the filter reports that this chunk has no more text.
                    while (hr2 == 0)
                    {
                        uint pcwcBuffer = 1000;
                        StringBuilder sbBuffer = new StringBuilder((int)pcwcBuffer);
                        hr2 = ifilt.GetText(ref pcwcBuffer, sbBuffer);
                        // Success codes (including FILTER_S_LAST_TEXT) carry text; failure codes do not.
                        if (hr2 >= 0) result.Append(sbBuffer.ToString(0, (int)pcwcBuffer));
                    }
                }
            }
            catch (COMException myE)
            {
                // Running out of chunks is the normal way out of the loop; anything else is a real error.
                if (myE.ErrorCode != Constants.FILTER_E_END_OF_CHUNKS)
                    Console.WriteLine(myE.ErrorCode + "\n" + myE.Message + "\n");
            }
            finally
            {
                // Release the COM filter so the file is not kept open.
                if (ifilt != null) Marshal.ReleaseComObject(ifilt);
            }

            text = result.ToString();
        }
    }

}


Edited by dtmunir, 30 May 2005 - 18:56.
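
The class above creates one specific filter through the hard-coded CLSID on `CFilter`. A possible variation is to let Windows pick the registered filter for each file through the `LoadIFilter` function exported by query.dll. The sketch below is a hedged illustration: the wrapper class `FilterLoader` and its `TryLoad` method are invented names, and the P/Invoke signature is one plausible way to declare the function. It returns a plain `object` so that the block compiles on its own.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical wrapper around LoadIFilter from query.dll (Windows only).
public static class FilterLoader
{
    // HRESULT LoadIFilter(PCWSTR pwcsPath, IUnknown* pUnkOuter, void** ppIUnk)
    [DllImport("query.dll", CharSet = CharSet.Unicode)]
    private static extern int LoadIFilter(string pwcsPath, IntPtr pUnkOuter, out IntPtr ppIUnk);

    // Returns the COM filter object for the file, or null when no filter could be loaded.
    public static object TryLoad(string path)
    {
        IntPtr unknown = IntPtr.Zero;
        try
        {
            int hr = LoadIFilter(path, IntPtr.Zero, out unknown);
            if (hr < 0 || unknown == IntPtr.Zero) return null;
            // The runtime-callable wrapper takes its own reference to the COM object.
            return Marshal.GetObjectForIUnknown(unknown);
        }
        catch (DllNotFoundException) { return null; }          // query.dll is not available
        catch (EntryPointNotFoundException) { return null; }   // export missing from the DLL
        finally
        {
            // Drop the reference handed out by LoadIFilter itself.
            if (unknown != IntPtr.Zero) Marshal.Release(unknown);
        }
    }
}
```

In the reader class, the result of `FilterLoader.TryLoad(path)` would be cast to the `IFilter` interface declared in the post before calling `Init`, in place of the `new CFilter()` and `IPersistFile.Load` steps, assuming `LoadIFilter` has already bound the filter to the file as its documentation describes.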


#11 dannysmurf

dannysmurf

    Neowinian

  • Joined: 10-December 01

Posted 02 June 2005 - 06:43

Very nice code, works perfectly. :D Thanks for posting.

#12 upake

upake

    Neowinian

  • Joined: 25-June 05

Posted 25 June 2005 - 15:07

Hi, I get these errors while trying to build a class library. Any ideas?


Preparing resources...
Updating references...
Performing main compilation...
e:\documents and settings\upake\my documents\visual studio projects\classlibrary2\officefilereader.cs(308,36): error CS0234: The type or namespace name 'ComTypes' does not exist in the class or namespace 'System.Runtime.InteropServices' (are you missing an assembly reference?)

e:\documents and settings\upake\my documents\visual studio projects\classlibrary2\officefilereader.cs(309,5): error CS0246: The type or namespace name 'ipf' could not be found (are you missing a using directive or an assembly reference?)

e:\documents and settings\upake\my documents\visual studio projects\classlibrary2\officefilereader.cs(338,27): error CS0117: 'System.Runtime.InteropServices.COMException' does not contain a definition for 'Data'

e:\documents and settings\upake\my documents\visual studio projects\classlibrary2\officefilereader.cs(349,23): error CS0117: 'System.Runtime.InteropServices.COMException' does not contain a definition for 'Data'



Please help

#13 OP dtmunir

dtmunir

    Neowinian

  • Joined: 22-April 03

Posted 26 June 2005 - 07:26

Hey upake, which version of the .NET Framework are you using to compile this?

#14 upake

upake

    Neowinian

  • Joined: 25-June 05

Posted 27 June 2005 - 02:40

Hi dtmunir,

I am using Visual Studio 2003, and the framework version is v1.1


Upake






#15 OP dtmunir

dtmunir

    Neowinian

  • Joined: 22-April 03

Posted 27 June 2005 - 07:15

OK, I don't have VS 2003 or .NET 1.1.
I built this on VS 2005 and the .NET 2.0 beta.
But I don't think that should be a problem, because the code isn't mine, and the site I took it from didn't build it on .NET 2.0.

Since I can't figure out what the problem is, if you want, I could compile this code into a DLL that you could use.

Or if anyone else has managed to make this code work on VS 2003, please share....
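
The compiler errors upake posted point at two things that only arrived with .NET 2.0: the `System.Runtime.InteropServices.ComTypes` namespace and the `Exception.Data` property. On 1.1, the `UCOMIPersistFile` line that is commented out in the posted code is the older equivalent. Another workaround that might help is to declare the COM interface by hand so that the code does not depend on either framework type. The sketch below assumes that approach; the name `IPersistFileCompat` is invented here, while the GUID is the standard IID of IPersistFile.

```csharp
using System;
using System.Runtime.InteropServices;

// Hand-written declaration of the COM IPersistFile interface (hypothetical name).
// Method order matters: it has to follow the native vtable layout.
[ComImport]
[Guid("0000010b-0000-0000-C000-000000000046")]
[InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
public interface IPersistFileCompat
{
    // Inherited from IPersist.
    void GetClassID(out Guid pClassID);

    [PreserveSig]
    int IsDirty();

    void Load([MarshalAs(UnmanagedType.LPWStr)] string pszFileName, int dwMode);

    void Save([MarshalAs(UnmanagedType.LPWStr)] string pszFileName,
              [MarshalAs(UnmanagedType.Bool)] bool fRemember);

    void SaveCompleted([MarshalAs(UnmanagedType.LPWStr)] string pszFileName);

    void GetCurFile([MarshalAs(UnmanagedType.LPWStr)] out string ppszFileName);
}

// Intended use inside GetText, in place of the ComTypes cast:
//     IPersistFileCompat ipf = (IPersistFileCompat)ifilt;
//     ipf.Load(path, 0);
```

For the two `Data` errors, printing `myE.ErrorCode` (the HRESULT, available on `COMException` in both framework versions) in place of `myE.Data` should be enough to get past the compiler.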