Używam metody typu open source, która analizuje tekst HTML na NSString.Klasa parsowania html o otwartym kodzie źródłowym niepoprawnie analizuje spacje między akapitami
Wynikowe ciągi mają duże ilości białego odstępu między pierwszymi akapitami, ale tylko jeden wiersz odstępu dla kolejnych akapitów. Oto przykład wyjścia.
Poniżej znajduje się metoda, do której dzwonię. Zmieniłem tylko dwie linie kodu. W przypadku
stopCharacters
i newLineAndWhitespaceCharacters
usunąłem /n
z zestawu znaków, ponieważ po jej dołączeniu cały tekst był jednym długim akapitem.
- (NSString *)stringByConvertingHTMLToPlainText {
// Pool
NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
// Character sets
NSCharacterSet *stopCharacters = [NSCharacterSet characterSetWithCharactersInString:[NSString stringWithFormat:@"< \t\r%C%C%C%C", 0x0085, 0x000C, 0x2028, 0x2029]];
NSCharacterSet *newLineAndWhitespaceCharacters = [NSCharacterSet characterSetWithCharactersInString:[NSString stringWithFormat:@" \t\r%C%C%C%C", 0x0085, 0x000C, 0x2028, 0x2029]];
NSCharacterSet *tagNameCharacters = [NSCharacterSet characterSetWithCharactersInString:@"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"];
// Scan and find all tags
NSMutableString *result = [[NSMutableString alloc] initWithCapacity:self.length];
NSScanner *scanner = [[NSScanner alloc] initWithString:self];
[scanner setCharactersToBeSkipped:nil];
[scanner setCaseSensitive:YES];
NSString *str = nil, *tagName = nil;
BOOL dontReplaceTagWithSpace = NO;
do {
// Scan up to the start of a tag or whitespace
if ([scanner scanUpToCharactersFromSet:stopCharacters intoString:&str]) {
[result appendString:str];
str = nil; // reset
}
// Check if we've stopped at a tag/comment or whitespace
if ([scanner scanString:@"<" intoString:NULL]) {
// Stopped at a comment or tag
if ([scanner scanString:@"!--" intoString:NULL]) {
// Comment
[scanner scanUpToString:@"-->" intoString:NULL];
[scanner scanString:@"-->" intoString:NULL];
} else {
// Tag - remove and replace with space unless it's
// a closing inline tag then dont replace with a space
if ([scanner scanString:@"/" intoString:NULL]) {
// Closing tag - replace with space unless it's inline
tagName = nil; dontReplaceTagWithSpace = NO;
if ([scanner scanCharactersFromSet:tagNameCharacters intoString:&tagName]) {
tagName = [tagName lowercaseString];
dontReplaceTagWithSpace = ([tagName isEqualToString:@"a"] ||
[tagName isEqualToString:@"b"] ||
[tagName isEqualToString:@"i"] ||
[tagName isEqualToString:@"q"] ||
[tagName isEqualToString:@"span"] ||
[tagName isEqualToString:@"em"] ||
[tagName isEqualToString:@"strong"] ||
[tagName isEqualToString:@"cite"] ||
[tagName isEqualToString:@"abbr"] ||
[tagName isEqualToString:@"acronym"] ||
[tagName isEqualToString:@"label"]);
}
// Replace tag with string unless it was an inline
if (!dontReplaceTagWithSpace && result.length > 0 && ![scanner isAtEnd]) [result appendString:@" "];
}
// Scan past tag
[scanner scanUpToString:@">" intoString:NULL];
[scanner scanString:@">" intoString:NULL];
}
} else {
// Stopped at whitespace - replace all whitespace and newlines with a space
if ([scanner scanCharactersFromSet:newLineAndWhitespaceCharacters intoString:NULL]) {
if (result.length > 0 && ![scanner isAtEnd]) [result appendString:@" "]; // Dont append space to beginning or end of result
}
}
} while (![scanner isAtEnd]);
// Cleanup
[scanner release];
// Decode HTML entities and return
NSString *retString = [[result stringByDecodingHTMLEntities] retain];
[result release];
// Drain
[pool drain];
// Return
return [retString autorelease];
}
EDIT:
Oto NSLog napisu. I tylko wklejony pierwszych kilku akapitach
Mitt Romney spent the past six years running for president. After his loss to President Barack Obama, he'll have to chart a different course.
His initial plan: spend time with his family. He has five sons and 18 grandchildren, with a 19th on the way.
"I don't look at postelection to be a time of regrouping. Instead it's a time of forward focus," Romney told reporters aboard his plane Tuesday evening as he returned to Boston after the final campaign stop of his political career. "I have, of course, a family and life important to me, win or lose."
The most visible member of that family — wife Ann Romney — says neither she nor her husband will seek political office again.
itp ....
for (int j = 25; j< 50; j++) {
char test = [completeTrimmed characterAtIndex:([completeTrimmed rangeOfString:@"chart a different course."].location + j)];
NSLog(@"%hhd", test);
}
012-11-11 17:15:57.668 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.669 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.669 LMU_LAL_LAUNCHER[5431:c07] 10
2012-11-11 17:15:57.670 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.670 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.670 LMU_LAL_LAUNCHER[5431:c07] 10
2012-11-11 17:15:57.671 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.671 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.671 LMU_LAL_LAUNCHER[5431:c07] 10
2012-11-11 17:15:57.672 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.672 LMU_LAL_LAUNCHER[5431:c07] 72
2012-11-11 17:15:57.672 LMU_LAL_LAUNCHER[5431:c07] 105
2012-11-11 17:15:57.673 LMU_LAL_LAUNCHER[5431:c07] 115
2012-11-11 17:15:57.673 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.673 LMU_LAL_LAUNCHER[5431:c07] 105
2012-11-11 17:15:57.673 LMU_LAL_LAUNCHER[5431:c07] 110
2012-11-11 17:15:57.674 LMU_LAL_LAUNCHER[5431:c07] 105
2012-11-11 17:15:57.674 LMU_LAL_LAUNCHER[5431:c07] 116
2012-11-11 17:15:57.674 LMU_LAL_LAUNCHER[5431:c07] 105
2012-11-11 17:15:57.675 LMU_LAL_LAUNCHER[5431:c07] 97
2012-11-11 17:15:57.675 LMU_LAL_LAUNCHER[5431:c07] 108
2012-11-11 17:15:57.675 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.675 LMU_LAL_LAUNCHER[5431:c07] 112
2012-11-11 17:15:57.676 LMU_LAL_LAUNCHER[5431:c07] 108
2012-11-11 17:15:57.676 LMU_LAL_LAUNCHER[5431:c07] 97
Prawdopodobnie musisz użyć metody stringByReplacingCharactersInRange, aby usunąć dodatkowe spacje z końcowego łańcucha. – iDev
Ja już wypróbowane 'completeTrimmed = [completeTrimmed stringByReplacingOccurrencesOfString: @ " "withString: @" "];', ale to niczego nie – Mahir
zrobić @"" powinna mieć około 15 przestrzenie pomiędzy nimi, ale komentarz autocorrects go 1 spacja – Mahir