• 0

Determine if string can be split up into words


Question

I need to write an algorithm using dynamic programming to take an input in the form of a string without spaces or punctuation and return

a) if it can be split up into a sequence of words. "themailer" -> true; "thdfapple" -> false

b) the sequence produced. "the", "mailer"

In the dynamic programming approach, you have to find a subproblem... I've tried several with no avail... i feel like I'm closer right now with saying the subproblem is the longest word adjacent to the current position i'm testing

My general approach is to fill a table that is n x n long where n is the length of the string, the horizontal axis is the START of the substring and the vertical axis is the END of the substring (that i am testing)

Can anyone offer some insight / help on doing this? I realize this is highly esoteric and probably not explained very well... Sorry

9 answers to this question

Recommended Posts

  • 0

Hmm, that looks to be quite a challenge. You'd need some sort of dictionary to compare valid words to, but it might get a little interesting when you encounter words which could exist within other words, e.g. "ten" is a valid word, but so is "tenth". Unless you have a defined set of words and your work backwards (longest -> shortest) and break the string down that way? Never done anything like this before, so it would be interesting to see how you do it...

  • 0

If you want to do this with DYNAMIC PROGRAMMING you basically have to keep pointers of where your current words start/end and try to find a matching item. If the last word doesn't fit anywhere then try to move the previous one and so on.

  • Like 1
  • 0
  On 03/03/2010 at 20:38, Argote said:

If you want to do this with DYNAMIC PROGRAMMING you basically have to keep pointers of where your current words start/end and try to find a matching item. If the last word doesn't fit anywhere then try to move the previous one and so on.

That sounds like a good approach! Rather you than me though! Sounds very tricky indeed :rolleyes:

  • 0

The constraint is that it has to be done in O(n^2) time. Not quite sure what you mean Argote, but if your algorithm has to go back, then you might end up with n^n time because for every letter, you can have n mistakes.

I'm glad you seem to be familiar with dynamic programming though. I think I have a solution; perhaps you can verify that what I have runs in O(n^2) and is "dynamic programming"

pseudo python... the check to see if something is a word is considered to take unit time (that's just how it is.)

for start in range(0, length):
     for end in range(start+1, length):
         isWord = #check the current substring based on start and end.
         if isWord: table[start][end] = substring length

This should take n^2 time, according to a source, but it takes a little less since it only covers half of a table

due to the fact that end >= start is a constraint. This basically goes through all the substrings:

[a]bcd

[ab]cd

[abc]d

acd

a[bc]d (etc)

and when it hits one that is a word, it puts the length of that word into that cell (so to get the word, you trace the table UP that many rows)

I then propose to do a DEPTH FIRST SEARCH from the end of this table back, (as if it were a graph).

If I can get to the start state, then I can split the string up (and will have actually built it by then).

Depth First Search is O(|V| + |E|)... |V| = n, and |E| seems to be constrained to be n(n-1)=n^2-n in this case.

You have n nodes, and each node can have n-1 edges since it's basically an undirected graph.

so I have O(n^2 + n^2 -n) = O(n^2) runtime, I THINK....

please feel free to correct me or comment on it. I'm not sure if this is actually dynamic programming since I have to do a multipass (the DFS after I fill in the table)

This is my test string, by the way.... I hammered it out by combining interesting cases... it covers things such as choices like "than" or "thank" or "an"... "mail" or "mailer" or "email"

  thankillthemailer
t 00000000000000000
h 00000000000000000
a 00000000000000000
n 40000000000000000
k 50000000000000000
i 00000000000000000
l 00000000000000000
l 00004300000000000
t 00000000000000000
h 00000000000000000
e 00000000320000000
m 00000000430000000
a 00000000000000000
i 00000000000000000
l 00000000005400000
e 00000000000000000
r 00000000000600000

so the columns represent the START and the rows represent the END from my algorithm, so if you look at table['k','l'], you get a 4, which means that the 'l' is the end of a 4 letter word. (yes, i realize there are two l's, but it's easier to explain that way. i'm obviously talking about the second l)

  • 0

There is a much less complex and more elegant solution.

You propose to finish with a DFS working backwards to locate all solutions (paths through the directed graph back to the start). This itself is very close to the more elegant solution I will attempt to describe. We are indeed going to be starting at the end of the string rather than the beginning.

Let us call a string that has been successfully split into words a "sentence". Consider that every sentence must end with a word. Working from the end of the string forward, we will construct all valid sentences recursively.

Rather than iterating through all end and then all start positions, let us consider our end position static - we simply always end with the end of our "current" string as we are not interested in words ending before the end of our current string because those solutions don't provide a valid word at the end of the sentence. We will thus iterate only the start position, working backwards from end-1 to 0. If we ever get to start = 0 without finding a valid word, then there aren't any valid sentences because there are no valid words that the sentence can end with. We don't need to iterate again with a new end position, that would be wasted computation.

As we work backwards by start position, each time we encounter a valid word we will recurse with the substring preceding the word, and then prepend each result from the recursion to our "current" valid word to generate an array of our valid solution sentences. We don't recurse if start is 0 as we have reached the beginning of the string and can simply add the valid word to our array of solutions and return.

In pseudo-code (I have a running solution in PHP that I can generalize and post if there is interest, though it is much better left as an exercise for the reader)

 function find_sentences(string) {
     sentences = [];
     end = length(string) - 1;
     for (start = end; start >= 0; start--) {
         sublength = end - start + 1;
         substring = substring(string, start, sublength);
         if (is_word(substring)) {
             if (start > 0) {
                 next = substring(string, 0, start);
                 recurse = find_sentences(next);
                 foreach (recurse as result) {
                     sentences[] = result + ' ' + substring;
                 }
             } else {
                 sentences[] = substring;
             }
         }
     }
     return sentences;
 }

Enjoy! Any questions, I'll be happy to answer as best I can.

Also, thanks to ArmedMonkey - when I originally began on a solution for a similar problem, it was your solution that 'tipped' me off to the solution I described above.

This topic is now closed to further replies.
  • Recently Browsing   0 members

    • No registered users viewing this page.
  • Posts

    • Steam Summer Sale 2025 is here offering weeks of massive discounts for PC gamers by Pulasthi Ariyasinghe It is time to jump into another Steam Summer Sale. The 2025 edition of the yearly massive sale at the biggest PC gaming store on the market has just kicked off, and it is touting thousands of discounts for everything from the oldest classics to the newest releases and everything in between. The Steam servers predictably wobbled for a while at the launch of the sale, but things finally seem to be calming down enough to browse the latest sales. The front page is the place to be for anyone looking for recommendations, with it putting the spotlight on fresh games every day. However, keep in mind that the discounts themselves will not be changing and will remain static throughout the sale. Some recent blockbusters like Metaphor: ReFantazio, Monster Hunter Wilds, Sid Meier's Civilization VII, Black Myth: Wukong, Call of Duty: Black Ops 6, Clair Obscur: Expedition 33, Warhammer 40,000: Space Marine 2, Indiana Jones and the Great Circle, Blue Prince, Kingdom Come: Deliverance II, and much more are currently discounted. Valve has also brought back the special "Deep Discounts" section, letting Steam users find the most deeply discounted games quickly straight from the front page. The page has a selection of "all-time greats" with 85% to 95% cuts to their standard prices. This time, it includes titles like Black Mesa, Little Nightmares II, Kerbal Space Program, Steins Gate, Furi, Crash Bandicoot N. Sane Trilogy, Star Wars Battlefront II, Robocop Rogue City, and much more for just a few dollars each. As for the duration, the Steam Summer Sale of 2025 will be open for business until July 10, giving all of you two whole weeks to try and keep your wallets closed tight. As always, be sure to read our Weekend PC Game Deals special this coming Saturday to check out the biggest highlights from the sale. If you miss this store-wide promotion, Valve's next major sale will land this September as part of a new time slot for the Steam Autumn Sale. Head over here to see the complete 2025 sales roadmap for the platform.
    • Great way to get an entire physical model of the products (i.e. end users) and better market them...definitely not something I'd want on ANY of my personal devices, but can be useful in retail shops for sure.
    • Hi all. ISP pushing me from 800Mbps to  1200Mbps, however, I want to buy my own modem. Theirs have been a bit wonky, would rather just handle it myself. Data capped at 1.2TB, haven't gone over yet (although pretty close at times), paying $15 rental, some have suggested a DIY solution. Anyway, I'm in need of a few upgrades and need some suggestions. My onboard LAN is capped at 1Gbit (I219-v and a Killer E2500), choose your poison? Anyway, an add-on NIC might be in order, not sure what to go with. Then there's the modem and router. A lot of these modems are very close in price, some supporting 2.3 - 2.5Gbit, others a flat 1Gbit that wouldn't let me touch my 1200Mbit that I'm paying. Boils down to a Netgear CM3000 (for below cost due to built up rewards) or Motorola B12. Then of course, there's my router. I bought a Nighthawk X10 a while ago for $10 at thrift, before this whole upgrade thing came up. Its ports are capped at 1Gbit, but it DOES have a 10Gbit SFP+ port. I could possibly get a NIC that has sfp+ port, or a switch that has 2.5Gbit ports. I'm the only wired PC in the house, everything else is wifi, no foreseeable new wired connections in the far off future. Or, I could get an RS100 router that would let me use my 1200Mbit connection. I've been subscribed for about 6 years, so have paid about $1080 in rental since then. It's too bad I didn't DIY since the beginning, it would have been paid of long ago -_- better late than never I guess. What's your thought on this?
    • Always nice to have an updated official windows installer / ISO.
  • Recent Achievements

    • Conversation Starter
      Kavin25 earned a badge
      Conversation Starter
    • One Month Later
      Leonard grant earned a badge
      One Month Later
    • Week One Done
      pcdoctorsnet earned a badge
      Week One Done
    • Rising Star
      Phillip0web went up a rank
      Rising Star
    • One Month Later
      Epaminombas earned a badge
      One Month Later
  • Popular Contributors

    1. 1
      +primortal
      550
    2. 2
      ATLien_0
      206
    3. 3
      +FloatingFatMan
      180
    4. 4
      Michael Scrip
      147
    5. 5
      Som
      119
  • Tell a friend

    Love Neowin? Tell a friend!