[regex] specify character's neighbour without matching said neighbour


#1

Hello everyone

I was doing some challenges in Python when I hit a problem I couldn’t figure out how to solve:

How can I match a certain character only when it’s next to another character, without also matching that other character?

For example, let’s say I want to split the string "I don't know what 'python' means " into separate words:

I want my regex to match the quote inside the word don't since it’s part of the word, but I wouldn’t want to match the quotes around python since those are meta-characters, not part of a word.

My first guess would be to only match a quotation mark when it’s surrounded by alpha characters, but I wouldn’t know how to do that without also matching the characters that surround the quote:

For example: r"\w'\w" would match n't, while r"[\w'\w]" would match any single quotation mark or word character.

Any help is appreciated, thanks a lot for reading!


#2

So you want to match surrounding characters, but not. That rules out all solutions because it is a contradiction. You’ll need to reconsider what your purpose is. For example, if you want to find a location, that doesn’t mean that you have to have a 1-character match.
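
One way to read "find a location" is a zero-width assertion. The following is a minimal sketch, assuming Python's re module and its lookbehind/lookahead syntax; the variable names are illustrative and not from the original question. The neighbours are checked but never become part of the match:

```python
import re

sentence = "I don't know what 'python' means"

# (?<=\w) and (?=\w) are zero-width: they test the characters on either
# side of the apostrophe without including them in the match.
inner_apostrophe = re.compile(r"(?<=\w)'(?=\w)")

# Collect (position, matched text) for every hit.
matches = [(m.start(), m.group()) for m in inner_apostrophe.finditer(sentence)]
print(matches)  # [(5, "'")]
```

Only the apostrophe in don't is reported; the quotes around python each have a space on one side, so one of the two assertions fails.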


#3

Let me rephrase: I want to match the quote only if it’s surrounded by two alpha characters, so that the quotes around 'python' wouldn’t get matched whereas the apostrophe inside the word don't would get matched. Sorry if it wasn’t clear.


#4

Your example satisfies that


#5

I get the feeling that you haven’t explained what problem you’re trying to solve, instead you’ve already picked the solution, except it can’t satisfy your problem (because of the contradiction you introduced), and even if it could it may still not be the best or only solution to whatever the problem is.
You need to take a few steps back and refer to what you’re really trying to do.

Match the whole thing you’re looking for and figure out where in that the apostrophe is (optionally by using a capture group, but that’s certainly not required).
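
A rough sketch of that idea, assuming Python's re module (the names here are illustrative only): the pattern matches the apostrophe together with its neighbours, and a capture group reports where the apostrophe itself sits.

```python
import re

sentence = "I don't know what 'python' means"

# The full match is letter + apostrophe + letter; group 1 is the apostrophe alone.
pattern = re.compile(r"\w(')\w")

for m in pattern.finditer(sentence):
    # m.group(0) is the whole match, m.start(1) is the apostrophe's index.
    print(m.group(0), m.start(1))  # n't 5

apostrophe_positions = [m.start(1) for m in pattern.finditer(sentence)]
```

The match still consumes the neighbouring letters, which is fine when all you need is the position.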

Or if you really need a pattern that behaves like that, then it’s probably because you’re trying to plug it in to something else, in which case it’s very relevant what that something else is and what problem you’re trying to solve with it.


#6

… as demonstrated in JS,

const regex = /\w'\w/;
console.log(regex.test("don't"));       // <- true
console.log(regex.test("'Python'"));    // <- false

#7

I get the feeling that you haven’t explained what problem you’re trying to solve

Yeah, I’m sorry if I’m not clear enough. I’m probably not using the right terminology, so I’ll try to start from the beginning:

I’m doing a challenge where I have to define a function that will break a sentence into a list of words, so

I'm a sentence

after being passed through the function should become

["i'm","a","sentence"]

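My first instinct was to use split, so I did the following:

import re
def word_break(sentence):
    # Split on runs of separators; lowercase to match the expected output.
    return re.split(r"[ ,;:_\-\"]+", sentence.strip().lower())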

However, this doesn’t work with a sentence that contains quotation marks, since

"I'm not that 'crazy'"

would return

["i'm",'not','that',"'crazy'"]

And of course I can’t add ' to the pattern, since it would split "i'm" into two words.
And I’m definitely lost on this one :confused:

Anyway, thanks for your patience, sorry I wasn’t clear from the start.


#8

Nothing to apologize for. I suppose I’m debugging your reasoning/question more than anything else. Frankly that’s all I ever do because otherwise there’d be little to ask.

Pattern matching on English is a mess. Or English is a mess, rather. And it’ll be a whole lot messier if you try to cram in a bunch of different rules into the same pattern.
If you can get away with it, split on whitespace (the str class’s split does this by default) and then clean up the words later with a bunch of special cases tested in sequence (one at a time, one after the other). It won’t be perfect, but it’s not going to be anyway. Best you can do with reasonable effort is make it simple.
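
Something along these lines, as a sketch only. It assumes the separator characters from the pattern in the question and lowercased output as in the expected results; the individual rules are just examples and you would add your own:

```python
def word_break(sentence):
    """Split on whitespace, then clean each word with simple rules in sequence."""
    words = []
    for raw in sentence.split():      # str.split() with no argument splits on whitespace
        word = raw.lower()            # assumption: output is lowercased, as in the examples
        word = word.strip(",;:_-\"")  # rule 1: drop separator punctuation at the edges
        word = word.strip("'")        # rule 2: drop quotes wrapping the word
        if word:                      # rule 3: skip tokens that were only punctuation
            words.append(word)
    return words


print(word_break("I'm not that 'crazy'"))  # ["i'm", 'not', 'that', 'crazy']
```

Because strip only touches the ends of each word, the apostrophe inside i'm survives. It still gets things wrong (a plural possessive like dogs' loses its apostrophe, and hyphenated words stay in one piece), but each rule is easy to read and to change.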

