• 0

Determine if string can be split up into words


Question

I need to write an algorithm using dynamic programming to take an input in the form of a string without spaces or punctuation and return

a) if it can be split up into a sequence of words. "themailer" -> true; "thdfapple" -> false

b) the sequence produced. "the", "mailer"

In the dynamic programming approach, you have to find a subproblem... I've tried several with no avail... i feel like I'm closer right now with saying the subproblem is the longest word adjacent to the current position i'm testing

My general approach is to fill a table that is n x n long where n is the length of the string, the horizontal axis is the START of the substring and the vertical axis is the END of the substring (that i am testing)

Can anyone offer some insight / help on doing this? I realize this is highly esoteric and probably not explained very well... Sorry

9 answers to this question

Recommended Posts

  • 0

Hmm, that looks to be quite a challenge. You'd need some sort of dictionary to compare valid words to, but it might get a little interesting when you encounter words which could exist within other words, e.g. "ten" is a valid word, but so is "tenth". Unless you have a defined set of words and your work backwards (longest -> shortest) and break the string down that way? Never done anything like this before, so it would be interesting to see how you do it...

  • 0

If you want to do this with DYNAMIC PROGRAMMING you basically have to keep pointers of where your current words start/end and try to find a matching item. If the last word doesn't fit anywhere then try to move the previous one and so on.

  • Like 1
  • 0
  On 03/03/2010 at 20:38, Argote said:

If you want to do this with DYNAMIC PROGRAMMING you basically have to keep pointers of where your current words start/end and try to find a matching item. If the last word doesn't fit anywhere then try to move the previous one and so on.

That sounds like a good approach! Rather you than me though! Sounds very tricky indeed :rolleyes:

  • 0

The constraint is that it has to be done in O(n^2) time. Not quite sure what you mean Argote, but if your algorithm has to go back, then you might end up with n^n time because for every letter, you can have n mistakes.

I'm glad you seem to be familiar with dynamic programming though. I think I have a solution; perhaps you can verify that what I have runs in O(n^2) and is "dynamic programming"

pseudo python... the check to see if something is a word is considered to take unit time (that's just how it is.)

for start in range(0, length):
     for end in range(start+1, length):
         isWord = #check the current substring based on start and end.
         if isWord: table[start][end] = substring length

This should take n^2 time, according to a source, but it takes a little less since it only covers half of a table

due to the fact that end >= start is a constraint. This basically goes through all the substrings:

[a]bcd

[ab]cd

[abc]d

acd

a[bc]d (etc)

and when it hits one that is a word, it puts the length of that word into that cell (so to get the word, you trace the table UP that many rows)

I then propose to do a DEPTH FIRST SEARCH from the end of this table back, (as if it were a graph).

If I can get to the start state, then I can split the string up (and will have actually built it by then).

Depth First Search is O(|V| + |E|)... |V| = n, and |E| seems to be constrained to be n(n-1)=n^2-n in this case.

You have n nodes, and each node can have n-1 edges since it's basically an undirected graph.

so I have O(n^2 + n^2 -n) = O(n^2) runtime, I THINK....

please feel free to correct me or comment on it. I'm not sure if this is actually dynamic programming since I have to do a multipass (the DFS after I fill in the table)

This is my test string, by the way.... I hammered it out by combining interesting cases... it covers things such as choices like "than" or "thank" or "an"... "mail" or "mailer" or "email"

  thankillthemailer
t 00000000000000000
h 00000000000000000
a 00000000000000000
n 40000000000000000
k 50000000000000000
i 00000000000000000
l 00000000000000000
l 00004300000000000
t 00000000000000000
h 00000000000000000
e 00000000320000000
m 00000000430000000
a 00000000000000000
i 00000000000000000
l 00000000005400000
e 00000000000000000
r 00000000000600000

so the columns represent the START and the rows represent the END from my algorithm, so if you look at table['k','l'], you get a 4, which means that the 'l' is the end of a 4 letter word. (yes, i realize there are two l's, but it's easier to explain that way. i'm obviously talking about the second l)

  • 0

There is a much less complex and more elegant solution.

You propose to finish with a DFS working backwards to locate all solutions (paths through the directed graph back to the start). This itself is very close to the more elegant solution I will attempt to describe. We are indeed going to be starting at the end of the string rather than the beginning.

Let us call a string that has been successfully split into words a "sentence". Consider that every sentence must end with a word. Working from the end of the string forward, we will construct all valid sentences recursively.

Rather than iterating through all end and then all start positions, let us consider our end position static - we simply always end with the end of our "current" string as we are not interested in words ending before the end of our current string because those solutions don't provide a valid word at the end of the sentence. We will thus iterate only the start position, working backwards from end-1 to 0. If we ever get to start = 0 without finding a valid word, then there aren't any valid sentences because there are no valid words that the sentence can end with. We don't need to iterate again with a new end position, that would be wasted computation.

As we work backwards by start position, each time we encounter a valid word we will recurse with the substring preceding the word, and then prepend each result from the recursion to our "current" valid word to generate an array of our valid solution sentences. We don't recurse if start is 0 as we have reached the beginning of the string and can simply add the valid word to our array of solutions and return.

In pseudo-code (I have a running solution in PHP that I can generalize and post if there is interest, though it is much better left as an exercise for the reader)

 function find_sentences(string) {
     sentences = [];
     end = length(string) - 1;
     for (start = end; start >= 0; start--) {
         sublength = end - start + 1;
         substring = substring(string, start, sublength);
         if (is_word(substring)) {
             if (start > 0) {
                 next = substring(string, 0, start);
                 recurse = find_sentences(next);
                 foreach (recurse as result) {
                     sentences[] = result + ' ' + substring;
                 }
             } else {
                 sentences[] = substring;
             }
         }
     }
     return sentences;
 }

Enjoy! Any questions, I'll be happy to answer as best I can.

Also, thanks to ArmedMonkey - when I originally began on a solution for a similar problem, it was your solution that 'tipped' me off to the solution I described above.

This topic is now closed to further replies.
  • Recently Browsing   0 members

    • No registered users viewing this page.
  • Posts

    • Ea and Ubisoft's already have a 3rd party client mode where it removes the store features from the client and only uses 60mb of ram rather than 700.... I use playnite for gog and epic games because the the third party legendary client etc Combined in playnite epic and gog only use 200mb/ram while also showing my entire library from every other platform...
    • Also, this domain thing only works with Pro version of Windows, not the home version
    • More useless bloat to kill your pc by accident
    • Windows 11 gets more customization, a Recall home page, and more in new builds by Taras Buria Microsoft kicks off this Monday with a duo of nearly identical builds for Windows Insiders in the Dev and Beta Channels. Build 26200.5661 (Dev) and 26120.4452 (Beta) are now available for download with two big changes: a new home page for Recall and the recently spotted ability to customize where system indicators appear on the screen. The new Recall home page features more personalized content to help you get back to recent activities. It displays your latest snapshots and a curated view of the top three applications and websites you have spent the most time on in the past 24 hours. Here is what it looks like: In addition, Recall received a new nav bar on the left side of the screen with quick links to Home, Timeline, Feedback, and Settings. The next big addition is the ability to change where system indicators (brightness, volume, and more) appear on the screen. Now, you can set these at the top-left corner or top-center. To adjust this, go to Settings > System > Notifications > Position of the onscreen pop-up. Here are other changes included in today's builds: [Start menu] We are adding a Boolean to the Configure Start Pins policy to allow admins to apply Start menu pins once. This means that a user will receive admin pins on day 0 but can then make any changes to their Start pinned layout and have those safeguarded. These changes can be optionally applied through the existing configuration service provider (CSP). [File Explorer] We are restarting the roll out of AI actions in File Explorer that began rolling out with Build 26120.4151. Some Insiders may have seen the feature disappear. [Settings] In the most recently flights, we have added the country or region selected during device setup under Settings > Time & language > Language & region. Here is what was fixed: [General] Fixed the issue causing the Windows Vista boot sound to play instead of the Windows 11 boot sound. Fixed an issue where the option to reset your PC under Settings > System > Recovery wasn’t working on the previous build. Fixed an underlying issue leading to certain KVM virtual machines unexpectedly failing to boot, showing “UNSUPPORTED_PROCESSOR”. The Dev build has an extra fix: Fixed the issue causing a small number of Insiders to experience repeated bugchecks with KERNEL_SECURITY_CHECK_FAILURE after upgrading to most current Dev Channel builds. Known issues include the following: [General] [IMPORTANT NOTE] When joining the Beta Channel on Windows 11, version 24H2 – you will be offered Build 26120.4250 After installing Build 26120.4250, you will be offered the most recent update available. This 2-hop experience to get onto the latest flight in the Beta Channel is just temporary. After you do a PC reset under Settings > System > Recovery, your build version may incorrectly show as Build 26100 instead of Build 26120. This will not prevent you from getting future Beta Channel updates, which will resolve this issue. Some Windows Insiders may experience a rollback trying to install this update with a 0x80070005 in Windows Update. We’re working on a fix for Windows Insiders impacted. [Start menu] The following are known issues for Windows Insiders with the new Start menu: Using touch to navigate the new Start menu may not work reliably. For example, it currently does not support the swipe-up gesture. Drag and drop capabilities are limited from “All” to “Pinned.” In some cases, duplicate entries may appear in folders on the Start menu. [Xbox Controllers] Some Insiders are experiencing an issue where using their Xbox Controller via Bluetooth is causing their PC to bugcheck. Here is how to resolve the issue. Open Device Manager by searching for it via the search box on your taskbar. Once Device Manager is open, click on “View” and then “Devices by Driver”. Find the driver named “oemXXX.inf (XboxGameControllerDriver.inf)” where the “XXX” will be a specific number on your PC. Right-click on that driver and click “Uninstall”. [Click to Do (Preview)] The following known issues will be fixed in future updates to Windows Insiders: Windows Insiders on AMD or Intel™-powered Copilot+ PCs may experience long wait times on the first attempt to perform intelligent text actions in Click to Do after a new build or model update. [File Explorer] The following are known issues for AI actions in File Explorer: Narrator scan mode may not work properly in the action result canvas window for the Summarize AI action for Microsoft 365 files when reading bulleted lists. As a workaround, you can use Caps + Right key to navigate. [Widgets] Until we complete support for pinning in the new widgets board experience, pinning reverts you back to the previous experience You can find the announcement for the Dev build here and for the Beta build here.
  • Recent Achievements

    • One Month Later
      adnan.hebibovic earned a badge
      One Month Later
    • Week One Done
      adnan.hebibovic earned a badge
      Week One Done
    • Dedicated
      tesla maxwell earned a badge
      Dedicated
    • Dedicated
      Camlann earned a badge
      Dedicated
    • Week One Done
      fredss earned a badge
      Week One Done
  • Popular Contributors

    1. 1
      +primortal
      629
    2. 2
      Michael Scrip
      224
    3. 3
      ATLien_0
      220
    4. 4
      +FloatingFatMan
      145
    5. 5
      Xenon
      134
  • Tell a friend

    Love Neowin? Tell a friend!