Add tweet source for Twikit.Tweet #315
Reviewer's Guide by Sourcery
This pull request adds a source property to the Tweet class.
Updated class diagram for the Tweet class
classDiagram
class Tweet {
-_data: dict
+id: int | None
+created_at: datetime | None
+text: str | None
+user: User | None
+entities: dict | None
+extended_entities: dict | None
+reply_count: int | None
+retweet_count: int | None
+favorite_count: int | None
+quote_count: int | None
+in_reply_to_status_id: int | None
+in_reply_to_user_id: int | None
+is_quote_status: bool | None
+lang: str | None
+possibly_sensitive: bool | None
+scopes: list | None
+card: dict | None
+conversation_id: int | None
+view_count_state(): str | None
+has_community_notes(): bool
+source: str | None
+quote: Tweet | None
}
Hey @nennneko5787 - I've reviewed your changes - here's some feedback:
Overall Comments:
- Consider stripping HTML tags from the source property for cleaner output.
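A minimal sketch of what stripping the tags could look like, using only the standard library's html.parser. The helper name strip_tags and the sample value are assumptions for illustration; neither is part of twikit or of this PR.

```python
from __future__ import annotations

from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        # Called for text between tags; the tags themselves are ignored.
        self.parts.append(data)


def strip_tags(raw: str | None) -> str | None:
    """Hypothetical helper: return the text content of `raw`, or None."""
    if raw is None:
        return None
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return ''.join(parser.parts).strip()


# Assumed sample of the raw value; real payloads may differ.
sample = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
print(strip_tags(sample))
```

Whether this belongs inside the property or in a separate helper is a design choice; keeping the raw value available avoids losing the link target.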
Here's what I looked at during the review
- 🟢 General issues: all looks good
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟢 Complexity: all looks good
- 🟢 Documentation: all looks good
Actionable comments posted: 2
🧹 Nitpick comments (1)
twikit/tweet.py (1)
201-204: Consider renaming the property to be more specific.
The name "source" is ambiguous and could be confused with source code. Consider a more descriptive name like client_source or posting_client to clearly indicate its purpose.
-    def source(self) -> str | None:
-        return self._data.get('source')
+    def client_source(self) -> str | None:
+        return self._data.get('source')
    @property
    def source(self) -> str | None:
        return self._data.get('source')
🛠️ Refactor suggestion
Add documentation for the source property.
The property is missing from the class docstring. Add documentation to describe its purpose and return type.
Add the following to the class docstring after line 93:
urls : :class:`list`
Information about URLs contained in the tweet.
+ source : :class:`str` | None
+ The HTML source indicating the client/device used to post the tweet.
full_text : :class:`str` | None
The full text of the tweet.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    @property
    def source(self) -> str | None:
        return self._data.get('source')
    urls : :class:`list`
        Information about URLs contained in the tweet.
    source : :class:`str` | None
        The HTML source indicating the client/device used to post the tweet.
    full_text : :class:`str` | None
        The full text of the tweet.
💡 Verification agent
🧩 Analysis chain
Add HTML sanitization for the source property.
The property returns raw HTML which could be unsafe if used without sanitization. Consider adding HTML parsing/sanitization to extract just the client name or sanitize the HTML to prevent XSS attacks.
Let's check if there are any HTML sanitization libraries available for Python:
Consider using an HTML sanitization library like bleach to safely handle the HTML content. Add it as a dependency in setup.py and implement sanitization in the property.
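As a rough illustration of the risk and of a dependency-free mitigation, the sketch below escapes the raw value with the standard library's html.escape before it is embedded in a page. Note that this is escaping, not the allowlist-based sanitization that a library such as bleach performs, and that the function name render_source and the sample payload are assumptions rather than twikit code.

```python
import html


def render_source(raw):
    """Hypothetical helper: make a raw source value safe to embed in HTML.

    html.escape turns the markup into inert text instead of filtering it.
    """
    if raw is None:
        return ''
    return html.escape(raw, quote=True)


# Assumed hostile payload, for illustration only.
malicious = '<script>alert(1)</script>'
print(render_source(malicious))  # &lt;script&gt;alert(1)&lt;/script&gt;
```

Escaping at render time leaves the property's return value unchanged, so callers that want the original markup can still get it.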
🌐 Web query:
What are the recommended Python libraries for HTML sanitization?
💡 Result:
Python offers several robust libraries for HTML sanitization, each with distinct features catering to different security and performance needs:
1. pybluemonday (GitHub)
A high-performance library leveraging Go's bluemonday via bindings:
- Features:
- Whitelist-based sanitization with policies (UGCPolicy, StrictPolicy)
- Globally allow/deny attributes per element
- Benchmarked as ~60x faster than alternatives like Bleach[1]
- Example:
    from pybluemonday import UGCPolicy
    s = UGCPolicy()
    print(s.sanitize("<script>alert(1)</script><b class='stuff'>test</b>"))
    # <b>test</b>
2. nh3 (GitHub)
Rust-based bindings for Ammonia, optimized for speed:
- Features:
- Sanitizes HTML via strict allowlists
- Benchmarked at ~20x faster than Bleach[10]
- Installation:
pip install nh3
3. Bleach (Docs)
A deprecated but widely used library:
- Features:
- Configurable allowlists for tags/attributes via the Cleaner class[4][9]
- Supports CSS sanitization and protocol restrictions (e.g., http, https)
- Limitations: Slower than Rust/Go alternatives[1][10], no longer actively maintained[9].
4. html-sanitizer (PyPI)
Opinionated library for strict cleanup:
- Features:
- Converts inline styles to semantic tags (e.g., <span style="font-weight: bold"> → <strong>)[8]
- Merges redundant tags and normalizes whitespace
Honorable Mentions: Custom Solutions
- BeautifulSoup/lxml: Often used with custom rules but require manual implementation[6].
- Django utilities: Built-in tools for basic sanitization in web frameworks[5][7].
Recommendations: For critical applications requiring speed and security, pybluemonday or nh3 are optimal. For legacy systems or simpler needs, Bleach or html-sanitizer remain viable despite lower performance.
Citations:
- 1: https://github.com/ColdHeat/pybluemonday
- 2: https://pythonhosted.org/bleach/clean.html
- 3: https://www.linode.com/community/questions/3374/python-way-to-keep-out-xss
- 4: https://bleach.readthedocs.io/en/latest/clean.html
- 5: https://forum.djangoproject.com/t/tools-to-sanitize-html/6284
- 6: https://www.zenrows.com/blog/python-html-parser
- 7: https://codingnomads.com/python-flask-sanitize-html
- 8: https://pypi.org/project/html-sanitizer/
- 9: https://pypi.org/project/bleach/
- 10: https://github.com/messense/nh3
Sanitize HTML output in the source property (twikit/tweet.py: 201-204)
The source property currently returns raw HTML, which poses an XSS risk if rendered unprocessed. Given recent findings, consider integrating an HTML sanitization library. You have a few good options:
- pybluemonday or nh3 are high-performance, secure choices for sanitizing HTML.
- If you prefer a more legacy approach, Bleach remains an option, but note that it is slower and less actively maintained.
Please add the chosen library as a dependency in setup.py and update the source property to sanitize the HTML, either by extracting and returning just the client name or by properly cleaning the HTML content.
Add Tweet.source to retrieve the client (device) a tweet was posted from.
At this time, the HTML contained in the data is returned as-is, without parsing. (If it needs to be parsed, I will add a step to parse it.)
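To illustrate the behaviour described above, here is a minimal stand-in for the class, assuming only what the diff shows: a _data dict and a source property that reads its 'source' key. The constructor and the sample payload are assumptions; twikit's real Tweet is constructed differently.

```python
from __future__ import annotations


class Tweet:
    """Stand-in for twikit's Tweet, reduced to the new property."""

    def __init__(self, data: dict) -> None:  # assumed constructor
        self._data = data

    @property
    def source(self) -> str | None:
        # Same body as in the PR: raw value, no HTML parsing.
        return self._data.get('source')


# Assumed sample payload; the anchor markup is returned untouched.
tweet = Tweet({'source': '<a href="https://example.com" rel="nofollow">Example Client</a>'})
print(tweet.source)
print(Tweet({}).source)  # None when the key is absent
```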
Summary by Sourcery
New Features:
- Add a source property to the Tweet class, allowing retrieval of the device used to post the tweet.