• 0

Determine if string can be split up into words


Question

I need to write an algorithm using dynamic programming to take an input in the form of a string without spaces or punctuation and return

a) if it can be split up into a sequence of words. "themailer" -> true; "thdfapple" -> false

b) the sequence produced. "the", "mailer"

In the dynamic programming approach, you have to find a subproblem... I've tried several with no avail... i feel like I'm closer right now with saying the subproblem is the longest word adjacent to the current position i'm testing

My general approach is to fill a table that is n x n long where n is the length of the string, the horizontal axis is the START of the substring and the vertical axis is the END of the substring (that i am testing)

Can anyone offer some insight / help on doing this? I realize this is highly esoteric and probably not explained very well... Sorry

9 answers to this question

Recommended Posts

  • 0

Hmm, that looks to be quite a challenge. You'd need some sort of dictionary to compare valid words to, but it might get a little interesting when you encounter words which could exist within other words, e.g. "ten" is a valid word, but so is "tenth". Unless you have a defined set of words and your work backwards (longest -> shortest) and break the string down that way? Never done anything like this before, so it would be interesting to see how you do it...

  • 0

If you want to do this with DYNAMIC PROGRAMMING you basically have to keep pointers of where your current words start/end and try to find a matching item. If the last word doesn't fit anywhere then try to move the previous one and so on.

  • Like 1
  • 0
  On 03/03/2010 at 20:38, Argote said:

If you want to do this with DYNAMIC PROGRAMMING you basically have to keep pointers of where your current words start/end and try to find a matching item. If the last word doesn't fit anywhere then try to move the previous one and so on.

That sounds like a good approach! Rather you than me though! Sounds very tricky indeed :rolleyes:

  • 0

The constraint is that it has to be done in O(n^2) time. Not quite sure what you mean Argote, but if your algorithm has to go back, then you might end up with n^n time because for every letter, you can have n mistakes.

I'm glad you seem to be familiar with dynamic programming though. I think I have a solution; perhaps you can verify that what I have runs in O(n^2) and is "dynamic programming"

pseudo python... the check to see if something is a word is considered to take unit time (that's just how it is.)

for start in range(0, length):
     for end in range(start+1, length):
         isWord = #check the current substring based on start and end.
         if isWord: table[start][end] = substring length

This should take n^2 time, according to a source, but it takes a little less since it only covers half of a table

due to the fact that end >= start is a constraint. This basically goes through all the substrings:

[a]bcd

[ab]cd

[abc]d

acd

a[bc]d (etc)

and when it hits one that is a word, it puts the length of that word into that cell (so to get the word, you trace the table UP that many rows)

I then propose to do a DEPTH FIRST SEARCH from the end of this table back, (as if it were a graph).

If I can get to the start state, then I can split the string up (and will have actually built it by then).

Depth First Search is O(|V| + |E|)... |V| = n, and |E| seems to be constrained to be n(n-1)=n^2-n in this case.

You have n nodes, and each node can have n-1 edges since it's basically an undirected graph.

so I have O(n^2 + n^2 -n) = O(n^2) runtime, I THINK....

please feel free to correct me or comment on it. I'm not sure if this is actually dynamic programming since I have to do a multipass (the DFS after I fill in the table)

This is my test string, by the way.... I hammered it out by combining interesting cases... it covers things such as choices like "than" or "thank" or "an"... "mail" or "mailer" or "email"

  thankillthemailer
t 00000000000000000
h 00000000000000000
a 00000000000000000
n 40000000000000000
k 50000000000000000
i 00000000000000000
l 00000000000000000
l 00004300000000000
t 00000000000000000
h 00000000000000000
e 00000000320000000
m 00000000430000000
a 00000000000000000
i 00000000000000000
l 00000000005400000
e 00000000000000000
r 00000000000600000

so the columns represent the START and the rows represent the END from my algorithm, so if you look at table['k','l'], you get a 4, which means that the 'l' is the end of a 4 letter word. (yes, i realize there are two l's, but it's easier to explain that way. i'm obviously talking about the second l)

  • 0

There is a much less complex and more elegant solution.

You propose to finish with a DFS working backwards to locate all solutions (paths through the directed graph back to the start). This itself is very close to the more elegant solution I will attempt to describe. We are indeed going to be starting at the end of the string rather than the beginning.

Let us call a string that has been successfully split into words a "sentence". Consider that every sentence must end with a word. Working from the end of the string forward, we will construct all valid sentences recursively.

Rather than iterating through all end and then all start positions, let us consider our end position static - we simply always end with the end of our "current" string as we are not interested in words ending before the end of our current string because those solutions don't provide a valid word at the end of the sentence. We will thus iterate only the start position, working backwards from end-1 to 0. If we ever get to start = 0 without finding a valid word, then there aren't any valid sentences because there are no valid words that the sentence can end with. We don't need to iterate again with a new end position, that would be wasted computation.

As we work backwards by start position, each time we encounter a valid word we will recurse with the substring preceding the word, and then prepend each result from the recursion to our "current" valid word to generate an array of our valid solution sentences. We don't recurse if start is 0 as we have reached the beginning of the string and can simply add the valid word to our array of solutions and return.

In pseudo-code (I have a running solution in PHP that I can generalize and post if there is interest, though it is much better left as an exercise for the reader)

 function find_sentences(string) {
     sentences = [];
     end = length(string) - 1;
     for (start = end; start >= 0; start--) {
         sublength = end - start + 1;
         substring = substring(string, start, sublength);
         if (is_word(substring)) {
             if (start > 0) {
                 next = substring(string, 0, start);
                 recurse = find_sentences(next);
                 foreach (recurse as result) {
                     sentences[] = result + ' ' + substring;
                 }
             } else {
                 sentences[] = substring;
             }
         }
     }
     return sentences;
 }

Enjoy! Any questions, I'll be happy to answer as best I can.

Also, thanks to ArmedMonkey - when I originally began on a solution for a similar problem, it was your solution that 'tipped' me off to the solution I described above.

This topic is now closed to further replies.
  • Recently Browsing   0 members

    • No registered users viewing this page.
  • Posts

    • Hundreds of printer models with weak password algorithms exposed, no firmware patch possible by Usama Jawad Many people don't think or care about it, but printer security is a pretty important avenue when it comes to evaluating your cybersecurity posture. Last month, it was found that the companion software for Procolored printers was distributing malware. Now, it has been revealed that hundreds of printer models all over the globe are susceptible to attacks targeting their admin credentials. Bleeping Computer has reported that CVE-2024-51978 is a one of the eight printer vulnerabilities recently discovered by security researchers. Combined, these allow authenticated and unauthenticated attackers to discover the default admin password, perform remote code execution (RCE), crash the printer, and leak other sensitive information. Severity ratings go from a score of 5.3 (medium) to 9.8 (critical), indicating that these are pretty severe vulnerabilities. The most dangerous vulnerability in there exposes the default admin password, and primarily affects Brother printers. This is because Brother utilizes a rather weak password generation algorithm that is highly dependent upon the device's serial number and a static salt table. Analysis of the code revealed that the first 16 characters of the serial number are appended with eight bytes from a static salt table, with the results being hashed with SHA256 and Base64-encoded. Finally, the first eight characters are then taken and some of them are replaced with special characters to form the password. The static nature of the password generation algorithm means that an attacker can chain various existing vulnerabilities to get access to your serial number and eventually your default admin password. It is important to note that not all printer models are affected by all of these flaws, but the default admin password exposure does affect 695 models. The breakdown for the number of printer model affected by the eight vulnerabilities is as follows: Brother: 689 Fujifilm: 46 Konica Minolta: 6 Ricoh: 5 Toshiba: 2 Brother has informed the security researchers that it cannot fully remediate the password generation vulnerability through a firmware patch. It can only fix the issue in its next printer models by patching the problem during the manufacturing process. This makes it crucial for customers of affected models to change their default admin password as soon as possible, which is good practice anyway.
    • Win11 being 2.3x faster than Win10, so that means without changing any hardware, Doom Eternal will run at over 2x the framerate? Or that any computational stuff will be done 2x faster just because of OS? Press X for doubt. I wonder where they pulled these metrics out of coz they make no sense. Not to mention my laptop with Ryzen 2500U that used to run Windows 11 simply CAN'T anymore because of BS arbitrary requirements Microsoft has imposed in Win11 that weren't there intiially. So, that laptop is now running Ubuntu or Fedora and runs at 1000000000000000000000x faster speeds because Windows 11 just doesn't work at all so it gets DNF.
    • “Ask Photos” is coming to more Google Photos users in the US by Paul Hill Google has announced several improvements to a Gemini-powered Google Photos feature called Ask Photos. The feature launched last year in early access and it was great for longer search requests, however, when users typed “Dog” or “Beach” looking for those types of pictures, Ask Photos underperformed. Now, Google has addressed this and is bringing the feature to more eligible users in the United States. The search giant said that users enjoyed asking queries such as “suggest photos that’d make great phone backgrounds” or “what did I eat on my trip to Barcelona?” and getting Gemini-powered responses to these complex questions. Unfortunately, simple searches like “beach” or “dogs” were generating less than optimal results and people complained. To remedy the situation, Google has brought the best of Google Photos’ classic search feature to Ask Photos to improve latency. Easy, short requests will be dealt with by the old search mechanism and when you ask complex queries, it will switch over to the Gemini-powered search results. Now that Google has addressed this main issue of simple searches by integrated functionality from the previous search model, the company feels more confident to open up beyond early access and deliver it to more users in the United States. To be eligible, you must be 18+, be in the US, have your Google Account language set to English (United States), and have Face Groups turned on. This feature is only available on Android and iOS. To start using it, open Google Photos and tap on Ask at the bottom, then press Try now and agree to the terms. If you do not see the Ask button, it means you’re not yet eligible. The feature also can’t be used on computers yet. Hopefully, Google will quickly expand this feature outside the US so that international users can try it out too.
    • When are MS going to learn. Traditional home users who email, watch cat videos and perform basic tasks don't care if W11 is slightly faster...assuming it's true. Few home users care about faster speed and more secure claims. In any event, it's irrelevant. MS stopped caring about home users years ago as evident by their hostile, anti choice home user policies.
    • They have. https://www.xda-developers.com...-25h2-update-kind-of-small/
  • Recent Achievements

    • Week One Done
      suprememobiles earned a badge
      Week One Done
    • Week One Done
      Marites earned a badge
      Week One Done
    • One Year In
      runge100 earned a badge
      One Year In
    • One Month Later
      runge100 earned a badge
      One Month Later
    • One Month Later
      jfam earned a badge
      One Month Later
  • Popular Contributors

    1. 1
      +primortal
      567
    2. 2
      +FloatingFatMan
      177
    3. 3
      ATLien_0
      169
    4. 4
      Michael Scrip
      127
    5. 5
      Xenon
      119
  • Tell a friend

    Love Neowin? Tell a friend!