
Stabilized v1.1.2 #90

Merged · 52 commits · Oct 16, 2024
Commits
5c637e8
Bumped rolling version
OSINT-TECHNOLOGIES Oct 1, 2024
aae975d
Added support for adminpanels_dorking.sb
OSINT-TECHNOLOGIES Oct 3, 2024
891e10e
Added support for adminpanels_dorking.db
OSINT-TECHNOLOGIES Oct 3, 2024
ac0ebdf
Added support for adminpanels_dorking.db
OSINT-TECHNOLOGIES Oct 3, 2024
eee9dc2
Add files via upload
OSINT-TECHNOLOGIES Oct 3, 2024
d40eea1
Added config parameters for Google Dorking module
OSINT-TECHNOLOGIES Oct 3, 2024
12923b4
Added some logging and dorking delay
OSINT-TECHNOLOGIES Oct 3, 2024
b8da955
Modified dorking_delay and delay_step transfer
OSINT-TECHNOLOGIES Oct 3, 2024
f32d0ce
Added pring&return function
OSINT-TECHNOLOGIES Oct 3, 2024
07eeed5
Updated CLI visual
OSINT-TECHNOLOGIES Oct 3, 2024
c283244
Reactivated some part of settings menu, added possibility to view and…
OSINT-TECHNOLOGIES Oct 3, 2024
f6189ed
Fixed DPULSE stuck by config file absence
OSINT-TECHNOLOGIES Oct 3, 2024
2957d35
Added new settings point menu
OSINT-TECHNOLOGIES Oct 8, 2024
a447d91
Added clear journal content logic
OSINT-TECHNOLOGIES Oct 8, 2024
a1bbc03
Delete dorking/basic_dorking.db
OSINT-TECHNOLOGIES Oct 8, 2024
d249c57
Updated basic_dorking content
OSINT-TECHNOLOGIES Oct 8, 2024
4073fe3
Added webstructure dorking database
OSINT-TECHNOLOGIES Oct 8, 2024
e61fe1a
Added new dorking table support
OSINT-TECHNOLOGIES Oct 8, 2024
d98dfb2
Added new dorking table support
OSINT-TECHNOLOGIES Oct 8, 2024
4e4c211
Added new dorking table support
OSINT-TECHNOLOGIES Oct 8, 2024
a00526a
Fixed SSL/TLS security fix
OSINT-TECHNOLOGIES Oct 14, 2024
fcbda25
Fixed incomplete URL substring sanitization issue
OSINT-TECHNOLOGIES Oct 14, 2024
c2373e9
Fixed 'No such file or directory' error when dorking mode is set to None
OSINT-TECHNOLOGIES Oct 14, 2024
4ab7b1b
Fixed 'No such file or directory' error when dorking mode is set to None
OSINT-TECHNOLOGIES Oct 14, 2024
fa17445
Updated README.md (removed some badges)
OSINT-TECHNOLOGIES Oct 14, 2024
6f829e9
Update README.md (added new mentions)
OSINT-TECHNOLOGIES Oct 14, 2024
673c64c
Added support of custom Dorking DB generation
OSINT-TECHNOLOGIES Oct 14, 2024
20d78e7
Reactivated "Generate custom Dorking DB" menu point
OSINT-TECHNOLOGIES Oct 14, 2024
c5b4ffb
Added separate module to handle custom dorking db generation
OSINT-TECHNOLOGIES Oct 14, 2024
a7b606d
Added custom Dorking DB usage support
OSINT-TECHNOLOGIES Oct 15, 2024
60bb93f
Added custom Dorking DB usage support
OSINT-TECHNOLOGIES Oct 15, 2024
4c2cf67
Added custom Dorking DB usage support
OSINT-TECHNOLOGIES Oct 15, 2024
fe8361d
Added custom Dorking DB usage support
OSINT-TECHNOLOGIES Oct 15, 2024
76d6a73
Update README.md
OSINT-TECHNOLOGIES Oct 15, 2024
711fa9d
Added check on existent custom DB name
OSINT-TECHNOLOGIES Oct 15, 2024
ae7f2f9
Updated CLI visual
OSINT-TECHNOLOGIES Oct 15, 2024
ce8ea93
Updated CLI visual
OSINT-TECHNOLOGIES Oct 15, 2024
acecfbb
Updated CLI visual
OSINT-TECHNOLOGIES Oct 15, 2024
fc6b78a
Updated CLI visual
OSINT-TECHNOLOGIES Oct 15, 2024
f84f2b4
Code refactored
OSINT-TECHNOLOGIES Oct 15, 2024
607dd31
Delete dorking/adminpanels_dorking.db
OSINT-TECHNOLOGIES Oct 15, 2024
3b90293
Updated table name in adminpanels_dorking.db
OSINT-TECHNOLOGIES Oct 15, 2024
ba2c61f
Code refactored / admins_dorks table support
OSINT-TECHNOLOGIES Oct 15, 2024
a3e8e1c
Code refactored
OSINT-TECHNOLOGIES Oct 15, 2024
276a9e3
Code refactored
OSINT-TECHNOLOGIES Oct 15, 2024
e8d1177
Delete dorking/webstructure_dorking.db
OSINT-TECHNOLOGIES Oct 15, 2024
870b09b
Updated web_dorks table
OSINT-TECHNOLOGIES Oct 15, 2024
3f7a48a
Added web_dorks table support
OSINT-TECHNOLOGIES Oct 15, 2024
3337e97
Code refactored
OSINT-TECHNOLOGIES Oct 15, 2024
bb16fd8
Update pyproject.toml with 1.1.2 version
OSINT-TECHNOLOGIES Oct 16, 2024
8a955f5
Update poetry.lock with 1.1.2 version
OSINT-TECHNOLOGIES Oct 16, 2024
da06773
Merge branch 'main' into rolling
OSINT-TECHNOLOGIES Oct 16, 2024
4 changes: 2 additions & 2 deletions README.md
@@ -143,8 +143,8 @@ If you have problems with starting installer.sh, you should try to use `dos2unix`

 # Tasks to complete before new release
 - [x] Rework Google Dorking module in separate mode
-- [ ] Rework Google Dorks list into separate databases with different pre-configured dorks for various purposes
-- [ ] Allow user to create their own dorks DB
+- [x] Rework Google Dorks list into separate databases with different pre-configured dorks for various purposes
+- [x] Allow user to create their own dorks DB
 - [ ] Add separate API search mode with different free APIs

 # DPULSE mentions in social medias
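As a side note on the two dorking-DB items checked off above: a minimal sketch of how such a per-purpose dorks database could be laid out in SQLite. Only the table names admins_dorks and web_dorks appear in the commit messages; the column name, file name, and sample dorks below are illustrative assumptions, not DPULSE's actual schema.

import sqlite3

# Hypothetical layout: one table per purpose, one column holding the dork strings.
# The table name is taken from the commit log above; everything else is assumed.
conn = sqlite3.connect('custom_dorking.db')
conn.execute('CREATE TABLE IF NOT EXISTS admins_dorks (dork TEXT NOT NULL)')
conn.executemany(
    'INSERT INTO admins_dorks (dork) VALUES (?)',
    [('inurl:admin intitle:login',), ('intitle:"admin panel"',)],
)
conn.commit()
conn.close()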
43 changes: 23 additions & 20 deletions datagather_modules/crawl_processor.py
@@ -113,25 +113,27 @@
                          'VKontakte': [], 'YouTube': [], 'Odnoklassniki': [], 'WeChat': []}

     for link in links:
-        if 'facebook.com' in link:
+        parsed_url = urlparse(link)
+        hostname = parsed_url.hostname
+        if hostname and hostname.endswith('facebook.com'):
             categorized_links['Facebook'].append(urllib.parse.unquote(link))
-        elif 'twitter.com' in link:
+        elif hostname and hostname.endswith('twitter.com'):

Check failure (Code scanning / CodeQL): Incomplete URL substring sanitization (High)

The string twitter.com may be at an arbitrary position in the sanitized URL.

Copilot Autofix (AI, 9 months ago)

To fix the problem, we need to ensure that the hostname check is robust against subdomain attacks. Instead of using hostname.endswith('twitter.com'), we should check if the hostname is exactly twitter.com or ends with .twitter.com. This will prevent URLs like malicious-twitter.com from passing the check.

  • Parse the URL to extract the hostname.
  • Check if the hostname is exactly twitter.com or ends with .twitter.com.
  • Apply similar changes to other social media domain checks to ensure consistency and security.
Suggested changeset 1
datagather_modules/crawl_processor.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/datagather_modules/crawl_processor.py b/datagather_modules/crawl_processor.py
--- a/datagather_modules/crawl_processor.py
+++ b/datagather_modules/crawl_processor.py
@@ -117,22 +117,22 @@
         hostname = parsed_url.hostname
-        if hostname and hostname.endswith('facebook.com'):
-            categorized_links['Facebook'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('twitter.com'):
-            categorized_links['Twitter'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('instagram.com'):
-            categorized_links['Instagram'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('t.me'):
-            categorized_links['Telegram'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('tiktok.com'):
-            categorized_links['TikTok'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('linkedin.com'):
-            categorized_links['LinkedIn'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('vk.com'):
-            categorized_links['VKontakte'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('youtube.com'):
-            categorized_links['YouTube'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('wechat.com'):
-            categorized_links['WeChat'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('ok.ru'):
-            categorized_links['Odnoklassniki'].append(urllib.parse.unquote(link))
+        if hostname and (hostname == 'facebook.com' or hostname.endswith('.facebook.com')):
+            categorized_links['Facebook'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 'twitter.com' or hostname.endswith('.twitter.com')):
+            categorized_links['Twitter'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 'instagram.com' or hostname.endswith('.instagram.com')):
+            categorized_links['Instagram'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 't.me' or hostname.endswith('.t.me')):
+            categorized_links['Telegram'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 'tiktok.com' or hostname.endswith('.tiktok.com')):
+            categorized_links['TikTok'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 'linkedin.com' or hostname.endswith('.linkedin.com')):
+            categorized_links['LinkedIn'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 'vk.com' or hostname.endswith('.vk.com')):
+            categorized_links['VKontakte'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 'youtube.com' or hostname.endswith('.youtube.com')):
+            categorized_links['YouTube'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 'wechat.com' or hostname.endswith('.wechat.com')):
+            categorized_links['WeChat'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 'ok.ru' or hostname.endswith('.ok.ru')):
+            categorized_links['Odnoklassniki'].append(urllib.parse.unquote(link))
 
EOF
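To make the alert concrete: a small standalone sketch (not part of the PR) of why the original substring test is bypassable and how the exact-or-subdomain hostname check behaves; the helper name is_allowed_host is hypothetical.

from urllib.parse import urlparse

def is_allowed_host(url, domain):
    # Accept only the exact domain or one of its subdomains.
    hostname = urlparse(url).hostname
    return bool(hostname) and (hostname == domain or hostname.endswith('.' + domain))

link = 'https://twitter.com.evil.example/profile'
print('twitter.com' in link)                                            # True: the substring check is fooled
print(is_allowed_host(link, 'twitter.com'))                             # False: the hostname check rejects it
print(is_allowed_host('https://mobile.twitter.com/profile', 'twitter.com'))  # True: legitimate subdomain still passes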
             categorized_links['Twitter'].append(urllib.parse.unquote(link))
-        elif 'instagram.com' in link:
+        elif hostname and hostname.endswith('instagram.com'):
             categorized_links['Instagram'].append(urllib.parse.unquote(link))
-        elif 't.me' in link:
+        elif hostname and hostname.endswith('t.me'):
             categorized_links['Telegram'].append(urllib.parse.unquote(link))
-        elif 'tiktok.com' in link:
+        elif hostname and hostname.endswith('tiktok.com'):

Check failure (Code scanning / CodeQL): Incomplete URL substring sanitization (High)

The string tiktok.com may be at an arbitrary position in the sanitized URL.

Copilot Autofix (AI, 9 months ago)

To fix the problem, we need to ensure that the hostname check is robust and cannot be bypassed by embedding the target string in an unexpected location. The best way to achieve this is to use a more precise check that ensures the hostname is exactly the expected domain or a subdomain of it.

  • We will modify the hostname.endswith checks to ensure that the hostname is either exactly the target domain or a subdomain of it.
  • This involves checking if the hostname is equal to the target domain or ends with . followed by the target domain.
  • We will make these changes in the sm_gather function within the datagather_modules/crawl_processor.py file.
Suggested changeset 1
datagather_modules/crawl_processor.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/datagather_modules/crawl_processor.py b/datagather_modules/crawl_processor.py
--- a/datagather_modules/crawl_processor.py
+++ b/datagather_modules/crawl_processor.py
@@ -117,22 +117,22 @@
         hostname = parsed_url.hostname
-        if hostname and hostname.endswith('facebook.com'):
-            categorized_links['Facebook'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('twitter.com'):
-            categorized_links['Twitter'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('instagram.com'):
-            categorized_links['Instagram'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('t.me'):
-            categorized_links['Telegram'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('tiktok.com'):
-            categorized_links['TikTok'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('linkedin.com'):
-            categorized_links['LinkedIn'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('vk.com'):
-            categorized_links['VKontakte'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('youtube.com'):
-            categorized_links['YouTube'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('wechat.com'):
-            categorized_links['WeChat'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('ok.ru'):
-            categorized_links['Odnoklassniki'].append(urllib.parse.unquote(link))
+        if hostname and (hostname == 'facebook.com' or hostname.endswith('.facebook.com')):
+            categorized_links['Facebook'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 'twitter.com' or hostname.endswith('.twitter.com')):
+            categorized_links['Twitter'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 'instagram.com' or hostname.endswith('.instagram.com')):
+            categorized_links['Instagram'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 't.me' or hostname.endswith('.t.me')):
+            categorized_links['Telegram'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 'tiktok.com' or hostname.endswith('.tiktok.com')):
+            categorized_links['TikTok'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 'linkedin.com' or hostname.endswith('.linkedin.com')):
+            categorized_links['LinkedIn'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 'vk.com' or hostname.endswith('.vk.com')):
+            categorized_links['VKontakte'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 'youtube.com' or hostname.endswith('.youtube.com')):
+            categorized_links['YouTube'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 'wechat.com' or hostname.endswith('.wechat.com')):
+            categorized_links['WeChat'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 'ok.ru' or hostname.endswith('.ok.ru')):
+            categorized_links['Odnoklassniki'].append(urllib.parse.unquote(link))
 
EOF
             categorized_links['TikTok'].append(urllib.parse.unquote(link))
-        elif 'linkedin.com' in link:
+        elif hostname and hostname.endswith('linkedin.com'):
             categorized_links['LinkedIn'].append(urllib.parse.unquote(link))
-        elif 'vk.com' in link:
+        elif hostname and hostname.endswith('vk.com'):

Check failure (Code scanning / CodeQL): Incomplete URL substring sanitization (High)

The string vk.com may be at an arbitrary position in the sanitized URL.

Copilot Autofix (AI, 9 months ago)

To fix the problem, we need to ensure that the hostname is checked correctly to prevent malicious URLs from bypassing the security check. The best way to do this is to ensure that the hostname matches the exact domain or a subdomain of the allowed host. This can be achieved by checking if the hostname ends with the allowed domain and is preceded by a dot or is exactly the allowed domain.

  • We will modify the hostname.endswith checks to ensure that the hostname either matches the allowed domain exactly or is a subdomain of the allowed domain.
  • Specifically, we will update the checks to use a more secure method that ensures the hostname is either the allowed domain or ends with .<allowed domain>.
Suggested changeset 1
datagather_modules/crawl_processor.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/datagather_modules/crawl_processor.py b/datagather_modules/crawl_processor.py
--- a/datagather_modules/crawl_processor.py
+++ b/datagather_modules/crawl_processor.py
@@ -117,22 +117,22 @@
         hostname = parsed_url.hostname
-        if hostname and hostname.endswith('facebook.com'):
-            categorized_links['Facebook'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('twitter.com'):
-            categorized_links['Twitter'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('instagram.com'):
-            categorized_links['Instagram'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('t.me'):
-            categorized_links['Telegram'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('tiktok.com'):
-            categorized_links['TikTok'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('linkedin.com'):
-            categorized_links['LinkedIn'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('vk.com'):
-            categorized_links['VKontakte'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('youtube.com'):
-            categorized_links['YouTube'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('wechat.com'):
-            categorized_links['WeChat'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('ok.ru'):
-            categorized_links['Odnoklassniki'].append(urllib.parse.unquote(link))
+        if hostname and (hostname == 'facebook.com' or hostname.endswith('.facebook.com')):
+            categorized_links['Facebook'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 'twitter.com' or hostname.endswith('.twitter.com')):
+            categorized_links['Twitter'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 'instagram.com' or hostname.endswith('.instagram.com')):
+            categorized_links['Instagram'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 't.me' or hostname.endswith('.t.me')):
+            categorized_links['Telegram'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 'tiktok.com' or hostname.endswith('.tiktok.com')):
+            categorized_links['TikTok'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 'linkedin.com' or hostname.endswith('.linkedin.com')):
+            categorized_links['LinkedIn'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 'vk.com' or hostname.endswith('.vk.com')):
+            categorized_links['VKontakte'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 'youtube.com' or hostname.endswith('.youtube.com')):
+            categorized_links['YouTube'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 'wechat.com' or hostname.endswith('.wechat.com')):
+            categorized_links['WeChat'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 'ok.ru' or hostname.endswith('.ok.ru')):
+            categorized_links['Odnoklassniki'].append(urllib.parse.unquote(link))
 
EOF
             categorized_links['VKontakte'].append(urllib.parse.unquote(link))
-        elif 'youtube.com' in link:
+        elif hostname and hostname.endswith('youtube.com'):

Check failure (Code scanning / CodeQL): Incomplete URL substring sanitization (High)

The string youtube.com may be at an arbitrary position in the sanitized URL.

Copilot Autofix (AI, 9 months ago)

Copilot could not generate an autofix suggestion for this alert. Try pushing a new commit or if the problem persists contact support.

             categorized_links['YouTube'].append(urllib.parse.unquote(link))
-        elif 'wechat.com' in link:
+        elif hostname and hostname.endswith('wechat.com'):

Check failure (Code scanning / CodeQL): Incomplete URL substring sanitization (High)

The string wechat.com may be at an arbitrary position in the sanitized URL.

Copilot Autofix (AI, 9 months ago)

To fix the problem, we need to ensure that the URL hostname is correctly parsed and checked to prevent subdomain attacks. The best way to do this is to use the urlparse function to extract the hostname and then verify that it matches the expected domain exactly or as a subdomain. We will modify the checks to ensure that the hostname ends with the correct domain and is preceded by a dot or is the exact domain.

Suggested changeset 1
datagather_modules/crawl_processor.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/datagather_modules/crawl_processor.py b/datagather_modules/crawl_processor.py
--- a/datagather_modules/crawl_processor.py
+++ b/datagather_modules/crawl_processor.py
@@ -117,22 +117,22 @@
         hostname = parsed_url.hostname
-        if hostname and hostname.endswith('facebook.com'):
-            categorized_links['Facebook'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('twitter.com'):
-            categorized_links['Twitter'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('instagram.com'):
-            categorized_links['Instagram'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('t.me'):
-            categorized_links['Telegram'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('tiktok.com'):
-            categorized_links['TikTok'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('linkedin.com'):
-            categorized_links['LinkedIn'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('vk.com'):
-            categorized_links['VKontakte'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('youtube.com'):
-            categorized_links['YouTube'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('wechat.com'):
-            categorized_links['WeChat'].append(urllib.parse.unquote(link))
-        elif hostname and hostname.endswith('ok.ru'):
-            categorized_links['Odnoklassniki'].append(urllib.parse.unquote(link))
+        if hostname and (hostname == 'facebook.com' or hostname.endswith('.facebook.com')):
+            categorized_links['Facebook'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 'twitter.com' or hostname.endswith('.twitter.com')):
+            categorized_links['Twitter'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 'instagram.com' or hostname.endswith('.instagram.com')):
+            categorized_links['Instagram'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 't.me' or hostname.endswith('.t.me')):
+            categorized_links['Telegram'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 'tiktok.com' or hostname.endswith('.tiktok.com')):
+            categorized_links['TikTok'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 'linkedin.com' or hostname.endswith('.linkedin.com')):
+            categorized_links['LinkedIn'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 'vk.com' or hostname.endswith('.vk.com')):
+            categorized_links['VKontakte'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 'youtube.com' or hostname.endswith('.youtube.com')):
+            categorized_links['YouTube'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 'wechat.com' or hostname.endswith('.wechat.com')):
+            categorized_links['WeChat'].append(urllib.parse.unquote(link))
+        elif hostname and (hostname == 'ok.ru' or hostname.endswith('.ok.ru')):
+            categorized_links['Odnoklassniki'].append(urllib.parse.unquote(link))
 
EOF
             categorized_links['WeChat'].append(urllib.parse.unquote(link))
-        elif 'ok.ru' in link:
+        elif hostname and hostname.endswith('ok.ru'):
             categorized_links['Odnoklassniki'].append(urllib.parse.unquote(link))

     if not categorized_links['Odnoklassniki']:
@@ -211,25 +213,26 @@

     for inner_list in subdomain_socials_grouped:
         for link in inner_list:
-            if 'facebook.com' in link:
+            hostname = urlparse(link).hostname
+            if hostname and hostname.endswith('facebook.com'):

Check failure (Code scanning / CodeQL): Incomplete URL substring sanitization (High)

The string facebook.com may be at an arbitrary position in the sanitized URL.

Copilot Autofix (AI, 9 months ago)

To fix the problem, we need to ensure that the hostname is correctly validated to prevent malicious URLs from being accepted. The best way to do this is to check that the hostname is either exactly the allowed domain or a subdomain of it. This can be achieved by ensuring the hostname ends with the allowed domain preceded by a dot or is exactly the allowed domain.

  • We will modify the checks to ensure that the hostname is either the exact allowed domain or a subdomain of it.
  • We will update the code in the file datagather_modules/crawl_processor.py to implement these changes.
  • No new imports or definitions are needed as the required functionality is already available through urlparse.
Suggested changeset 1
datagather_modules/crawl_processor.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/datagather_modules/crawl_processor.py b/datagather_modules/crawl_processor.py
--- a/datagather_modules/crawl_processor.py
+++ b/datagather_modules/crawl_processor.py
@@ -215,23 +215,23 @@
         for link in inner_list:
-            hostname = urlparse(link).hostname
-            if hostname and hostname.endswith('facebook.com'):
-                sd_socials['Facebook'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('twitter.com'):
-                sd_socials['Twitter'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('instagram.com'):
-                sd_socials['Instagram'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('t.me'):
-                sd_socials['Telegram'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('tiktok.com'):
-                sd_socials['TikTok'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('linkedin.com'):
-                sd_socials['LinkedIn'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('vk.com'):
-                sd_socials['VKontakte'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('youtube.com'):
-                sd_socials['YouTube'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('wechat.com'):
-                sd_socials['WeChat'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('ok.ru'):
-                sd_socials['Odnoklassniki'].append(urllib.parse.unquote(link))
+            hostname = urlparse(link).hostname
+            if hostname and (hostname == 'facebook.com' or hostname.endswith('.facebook.com')):
+                sd_socials['Facebook'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'twitter.com' or hostname.endswith('.twitter.com')):
+                sd_socials['Twitter'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'instagram.com' or hostname.endswith('.instagram.com')):
+                sd_socials['Instagram'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 't.me' or hostname.endswith('.t.me')):
+                sd_socials['Telegram'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'tiktok.com' or hostname.endswith('.tiktok.com')):
+                sd_socials['TikTok'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'linkedin.com' or hostname.endswith('.linkedin.com')):
+                sd_socials['LinkedIn'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'vk.com' or hostname.endswith('.vk.com')):
+                sd_socials['VKontakte'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'youtube.com' or hostname.endswith('.youtube.com')):
+                sd_socials['YouTube'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'wechat.com' or hostname.endswith('.wechat.com')):
+                sd_socials['WeChat'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'ok.ru' or hostname.endswith('.ok.ru')):
+                sd_socials['Odnoklassniki'].append(urllib.parse.unquote(link))
 
EOF
                 sd_socials['Facebook'].append(urllib.parse.unquote(link))
-            elif 'twitter.com' in link:
+            elif hostname and hostname.endswith('twitter.com'):

Check failure (Code scanning / CodeQL): Incomplete URL substring sanitization (High)

The string twitter.com may be at an arbitrary position in the sanitized URL.

Copilot Autofix (AI, 9 months ago)

To fix the problem, we need to ensure that the hostname of the URL ends with the correct domain, preceded by a dot. This will prevent subdomain attacks where an attacker could use a domain like malicious-twitter.com to bypass the check.

  • We will modify the code to check if the hostname ends with .twitter.com instead of twitter.com.
  • This change will be applied to all similar checks for other social media domains to ensure consistency and security.
  • The changes will be made in the file datagather_modules/crawl_processor.py from lines 217 to 236.
Suggested changeset 1
datagather_modules/crawl_processor.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/datagather_modules/crawl_processor.py b/datagather_modules/crawl_processor.py
--- a/datagather_modules/crawl_processor.py
+++ b/datagather_modules/crawl_processor.py
@@ -216,22 +216,22 @@
             hostname = urlparse(link).hostname
-            if hostname and hostname.endswith('facebook.com'):
-                sd_socials['Facebook'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('twitter.com'):
-                sd_socials['Twitter'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('instagram.com'):
-                sd_socials['Instagram'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('t.me'):
-                sd_socials['Telegram'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('tiktok.com'):
-                sd_socials['TikTok'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('linkedin.com'):
-                sd_socials['LinkedIn'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('vk.com'):
-                sd_socials['VKontakte'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('youtube.com'):
-                sd_socials['YouTube'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('wechat.com'):
-                sd_socials['WeChat'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('ok.ru'):
-                sd_socials['Odnoklassniki'].append(urllib.parse.unquote(link))
+            if hostname and hostname.endswith('.facebook.com'):
+                sd_socials['Facebook'].append(urllib.parse.unquote(link))
+            elif hostname and hostname.endswith('.twitter.com'):
+                sd_socials['Twitter'].append(urllib.parse.unquote(link))
+            elif hostname and hostname.endswith('.instagram.com'):
+                sd_socials['Instagram'].append(urllib.parse.unquote(link))
+            elif hostname and hostname == 't.me':
+                sd_socials['Telegram'].append(urllib.parse.unquote(link))
+            elif hostname and hostname.endswith('.tiktok.com'):
+                sd_socials['TikTok'].append(urllib.parse.unquote(link))
+            elif hostname and hostname.endswith('.linkedin.com'):
+                sd_socials['LinkedIn'].append(urllib.parse.unquote(link))
+            elif hostname and hostname.endswith('.vk.com'):
+                sd_socials['VKontakte'].append(urllib.parse.unquote(link))
+            elif hostname and hostname.endswith('.youtube.com'):
+                sd_socials['YouTube'].append(urllib.parse.unquote(link))
+            elif hostname and hostname.endswith('.wechat.com'):
+                sd_socials['WeChat'].append(urllib.parse.unquote(link))
+            elif hostname and hostname == 'ok.ru':
+                sd_socials['Odnoklassniki'].append(urllib.parse.unquote(link))
 
EOF
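Worth noting: unlike the earlier suggestions, this variant accepts only subdomains for most sites (hostname.endswith('.twitter.com') with no bare-domain case), so a link whose hostname is exactly twitter.com would no longer be categorized. A quick illustration, assuming standard urllib behaviour:

from urllib.parse import urlparse

hostname = urlparse('https://twitter.com/someuser').hostname
print(hostname == 'twitter.com' or hostname.endswith('.twitter.com'))  # True  (earlier suggestions match)
print(hostname.endswith('.twitter.com'))                               # False (this variant skips the link)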
                 sd_socials['Twitter'].append(urllib.parse.unquote(link))
-            elif 'instagram.com' in link:
+            elif hostname and hostname.endswith('instagram.com'):

Check failure (Code scanning / CodeQL): Incomplete URL substring sanitization (High)

The string instagram.com may be at an arbitrary position in the sanitized URL.

Copilot Autofix (AI, 9 months ago)

To fix the problem, we need to ensure that the hostname check is robust against subdomain attacks. Instead of using hostname.endswith('instagram.com'), we should check that the hostname is exactly instagram.com or ends with .instagram.com. This ensures that only Instagram domains and their subdomains are matched.

  • We will modify the checks for each social media platform to ensure they handle subdomains correctly.
  • Specifically, we will update the hostname.endswith checks to use a more secure method that verifies the hostname is either the exact domain or a subdomain of the intended domain.
  • The changes will be made in the file datagather_modules/crawl_processor.py around lines 216-236.
Suggested changeset 1
datagather_modules/crawl_processor.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/datagather_modules/crawl_processor.py b/datagather_modules/crawl_processor.py
--- a/datagather_modules/crawl_processor.py
+++ b/datagather_modules/crawl_processor.py
@@ -215,23 +215,23 @@
         for link in inner_list:
-            hostname = urlparse(link).hostname
-            if hostname and hostname.endswith('facebook.com'):
-                sd_socials['Facebook'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('twitter.com'):
-                sd_socials['Twitter'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('instagram.com'):
-                sd_socials['Instagram'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('t.me'):
-                sd_socials['Telegram'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('tiktok.com'):
-                sd_socials['TikTok'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('linkedin.com'):
-                sd_socials['LinkedIn'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('vk.com'):
-                sd_socials['VKontakte'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('youtube.com'):
-                sd_socials['YouTube'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('wechat.com'):
-                sd_socials['WeChat'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('ok.ru'):
-                sd_socials['Odnoklassniki'].append(urllib.parse.unquote(link))
+            hostname = urlparse(link).hostname
+            if hostname and (hostname == 'facebook.com' or hostname.endswith('.facebook.com')):
+                sd_socials['Facebook'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'twitter.com' or hostname.endswith('.twitter.com')):
+                sd_socials['Twitter'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'instagram.com' or hostname.endswith('.instagram.com')):
+                sd_socials['Instagram'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 't.me' or hostname.endswith('.t.me')):
+                sd_socials['Telegram'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'tiktok.com' or hostname.endswith('.tiktok.com')):
+                sd_socials['TikTok'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'linkedin.com' or hostname.endswith('.linkedin.com')):
+                sd_socials['LinkedIn'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'vk.com' or hostname.endswith('.vk.com')):
+                sd_socials['VKontakte'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'youtube.com' or hostname.endswith('.youtube.com')):
+                sd_socials['YouTube'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'wechat.com' or hostname.endswith('.wechat.com')):
+                sd_socials['WeChat'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'ok.ru' or hostname.endswith('.ok.ru')):
+                sd_socials['Odnoklassniki'].append(urllib.parse.unquote(link))
 
EOF
                 sd_socials['Instagram'].append(urllib.parse.unquote(link))
-            elif 't.me' in link:
+            elif hostname and hostname.endswith('t.me'):
                 sd_socials['Telegram'].append(urllib.parse.unquote(link))
-            elif 'tiktok.com' in link:
+            elif hostname and hostname.endswith('tiktok.com'):
                 sd_socials['TikTok'].append(urllib.parse.unquote(link))
-            elif 'linkedin.com' in link:
+            elif hostname and hostname.endswith('linkedin.com'):

Check failure

Code scanning / CodeQL

Incomplete URL substring sanitization High

The string linkedin.com may be at an arbitrary position in the sanitized URL.

Copilot Autofix

AI 9 months ago

To fix the problem, we need to ensure that the hostname is correctly validated as belonging to the intended domain. This can be done by checking that the hostname is either exactly the domain or ends with the domain preceded by a dot, which allows subdomains.

  • We will modify the code to check if the hostname ends with .linkedin.com or is exactly linkedin.com.
  • This change will be applied to all similar checks for other social media domains to ensure consistency and security.
  • The changes will be made in the file datagather_modules/crawl_processor.py from lines 217 to 236.
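For illustration only (this helper is hypothetical and not part of the suggested patch), the exact-domain-or-subdomain rule can be factored into one small function instead of being repeated in every branch:

def is_domain_or_subdomain(hostname, domain):
    # Accept the exact domain or any of its subdomains, nothing else.
    return hostname is not None and (hostname == domain or hostname.endswith('.' + domain))

# is_domain_or_subdomain('www.linkedin.com', 'linkedin.com')  -> True
# is_domain_or_subdomain('evil-linkedin.com', 'linkedin.com') -> False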
Suggested changeset 1
datagather_modules/crawl_processor.py

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/datagather_modules/crawl_processor.py b/datagather_modules/crawl_processor.py
--- a/datagather_modules/crawl_processor.py
+++ b/datagather_modules/crawl_processor.py
@@ -216,22 +216,22 @@
             hostname = urlparse(link).hostname
-            if hostname and hostname.endswith('facebook.com'):
-                sd_socials['Facebook'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('twitter.com'):
-                sd_socials['Twitter'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('instagram.com'):
-                sd_socials['Instagram'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('t.me'):
-                sd_socials['Telegram'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('tiktok.com'):
-                sd_socials['TikTok'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('linkedin.com'):
-                sd_socials['LinkedIn'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('vk.com'):
-                sd_socials['VKontakte'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('youtube.com'):
-                sd_socials['YouTube'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('wechat.com'):
-                sd_socials['WeChat'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('ok.ru'):
-                sd_socials['Odnoklassniki'].append(urllib.parse.unquote(link))
+            if hostname and (hostname == 'facebook.com' or hostname.endswith('.facebook.com')):
+                sd_socials['Facebook'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'twitter.com' or hostname.endswith('.twitter.com')):
+                sd_socials['Twitter'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'instagram.com' or hostname.endswith('.instagram.com')):
+                sd_socials['Instagram'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 't.me' or hostname.endswith('.t.me')):
+                sd_socials['Telegram'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'tiktok.com' or hostname.endswith('.tiktok.com')):
+                sd_socials['TikTok'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'linkedin.com' or hostname.endswith('.linkedin.com')):
+                sd_socials['LinkedIn'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'vk.com' or hostname.endswith('.vk.com')):
+                sd_socials['VKontakte'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'youtube.com' or hostname.endswith('.youtube.com')):
+                sd_socials['YouTube'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'wechat.com' or hostname.endswith('.wechat.com')):
+                sd_socials['WeChat'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'ok.ru' or hostname.endswith('.ok.ru')):
+                sd_socials['Odnoklassniki'].append(urllib.parse.unquote(link))
 
EOF
                 sd_socials['LinkedIn'].append(urllib.parse.unquote(link))
-            elif 'vk.com' in link:
+            elif hostname and hostname.endswith('vk.com'):

Check failure

Code scanning / CodeQL

Incomplete URL substring sanitization High

The string vk.com may be at an arbitrary position in the sanitized URL.
                 sd_socials['VKontakte'].append(urllib.parse.unquote(link))
-            elif 'youtube.com' in link:
+            elif hostname and hostname.endswith('youtube.com'):

Check failure

Code scanning / CodeQL

Incomplete URL substring sanitization High

The string youtube.com may be at an arbitrary position in the sanitized URL.

Copilot Autofix

AI 9 months ago

To fix the problem, we need to ensure that the hostname is correctly validated to belong to the intended domain. Instead of using hostname.endswith('youtube.com'), we should check if the hostname is exactly 'youtube.com' or a subdomain of 'youtube.com'. This can be done by ensuring the hostname ends with '.youtube.com' or is exactly 'youtube.com'.

  • Parse the URL using urlparse to extract the hostname.
  • Check if the hostname is either 'youtube.com' or ends with '.youtube.com'.
  • Apply similar checks for other social media domains to ensure consistency and security.
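A quick sketch of why the bare endswith check is unsafe (the hostnames below are made up for the example):

from urllib.parse import urlparse

hostname = urlparse('https://notyoutube.com/watch?v=x').hostname        # 'notyoutube.com'
print(hostname.endswith('youtube.com'))                                 # True  -> old check is fooled
print(hostname == 'youtube.com' or hostname.endswith('.youtube.com'))   # False -> stricter check rejects it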
Suggested changeset 1
datagather_modules/crawl_processor.py

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/datagather_modules/crawl_processor.py b/datagather_modules/crawl_processor.py
--- a/datagather_modules/crawl_processor.py
+++ b/datagather_modules/crawl_processor.py
@@ -216,22 +216,22 @@
             hostname = urlparse(link).hostname
-            if hostname and hostname.endswith('facebook.com'):
-                sd_socials['Facebook'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('twitter.com'):
-                sd_socials['Twitter'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('instagram.com'):
-                sd_socials['Instagram'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('t.me'):
-                sd_socials['Telegram'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('tiktok.com'):
-                sd_socials['TikTok'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('linkedin.com'):
-                sd_socials['LinkedIn'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('vk.com'):
-                sd_socials['VKontakte'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('youtube.com'):
-                sd_socials['YouTube'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('wechat.com'):
-                sd_socials['WeChat'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('ok.ru'):
-                sd_socials['Odnoklassniki'].append(urllib.parse.unquote(link))
+            if hostname and (hostname == 'facebook.com' or hostname.endswith('.facebook.com')):
+                sd_socials['Facebook'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'twitter.com' or hostname.endswith('.twitter.com')):
+                sd_socials['Twitter'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'instagram.com' or hostname.endswith('.instagram.com')):
+                sd_socials['Instagram'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 't.me' or hostname.endswith('.t.me')):
+                sd_socials['Telegram'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'tiktok.com' or hostname.endswith('.tiktok.com')):
+                sd_socials['TikTok'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'linkedin.com' or hostname.endswith('.linkedin.com')):
+                sd_socials['LinkedIn'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'vk.com' or hostname.endswith('.vk.com')):
+                sd_socials['VKontakte'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'youtube.com' or hostname.endswith('.youtube.com')):
+                sd_socials['YouTube'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'wechat.com' or hostname.endswith('.wechat.com')):
+                sd_socials['WeChat'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'ok.ru' or hostname.endswith('.ok.ru')):
+                sd_socials['Odnoklassniki'].append(urllib.parse.unquote(link))
 
EOF
                 sd_socials['YouTube'].append(urllib.parse.unquote(link))
-            elif 'wechat.com' in link:
+            elif hostname and hostname.endswith('wechat.com'):

Check failure

Code scanning / CodeQL

Incomplete URL substring sanitization High

The string wechat.com may be at an arbitrary position in the sanitized URL.

Copilot Autofix

AI 9 months ago

To fix the problem, we need to make the hostname check robust enough that it cannot be bypassed by malicious URLs. The best way to achieve this is a stricter check that accepts only the expected domain itself or one of its subdomains: parse the URL with urlparse and verify that the hostname is exactly the expected domain or ends with the domain preceded by a dot.

  • Modify the hostname.endswith checks to ensure that the hostname is either exactly the expected domain or a subdomain of it.
  • Update the code in the datagather_modules/crawl_processor.py file, specifically lines 217-236, to implement this stricter check.
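As a hypothetical alternative (not what the patch below does), the repeated branches could also be driven by a single domain-to-platform mapping that applies the same exact-or-subdomain rule:

import urllib.parse
from urllib.parse import urlparse

SOCIAL_DOMAINS = {
    'facebook.com': 'Facebook', 'twitter.com': 'Twitter', 'instagram.com': 'Instagram',
    't.me': 'Telegram', 'tiktok.com': 'TikTok', 'linkedin.com': 'LinkedIn',
    'vk.com': 'VKontakte', 'youtube.com': 'YouTube', 'wechat.com': 'WeChat',
    'ok.ru': 'Odnoklassniki',
}

def classify_social_link(link, sd_socials):
    # Append the link to the matching platform list, using the strict hostname rule.
    hostname = urlparse(link).hostname
    if not hostname:
        return
    for domain, platform in SOCIAL_DOMAINS.items():
        if hostname == domain or hostname.endswith('.' + domain):
            sd_socials[platform].append(urllib.parse.unquote(link))
            break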
Suggested changeset 1
datagather_modules/crawl_processor.py

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/datagather_modules/crawl_processor.py b/datagather_modules/crawl_processor.py
--- a/datagather_modules/crawl_processor.py
+++ b/datagather_modules/crawl_processor.py
@@ -216,22 +216,22 @@
             hostname = urlparse(link).hostname
-            if hostname and hostname.endswith('facebook.com'):
-                sd_socials['Facebook'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('twitter.com'):
-                sd_socials['Twitter'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('instagram.com'):
-                sd_socials['Instagram'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('t.me'):
-                sd_socials['Telegram'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('tiktok.com'):
-                sd_socials['TikTok'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('linkedin.com'):
-                sd_socials['LinkedIn'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('vk.com'):
-                sd_socials['VKontakte'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('youtube.com'):
-                sd_socials['YouTube'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('wechat.com'):
-                sd_socials['WeChat'].append(urllib.parse.unquote(link))
-            elif hostname and hostname.endswith('ok.ru'):
-                sd_socials['Odnoklassniki'].append(urllib.parse.unquote(link))
+            if hostname and (hostname == 'facebook.com' or hostname.endswith('.facebook.com')):
+                sd_socials['Facebook'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'twitter.com' or hostname.endswith('.twitter.com')):
+                sd_socials['Twitter'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'instagram.com' or hostname.endswith('.instagram.com')):
+                sd_socials['Instagram'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 't.me' or hostname.endswith('.t.me')):
+                sd_socials['Telegram'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'tiktok.com' or hostname.endswith('.tiktok.com')):
+                sd_socials['TikTok'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'linkedin.com' or hostname.endswith('.linkedin.com')):
+                sd_socials['LinkedIn'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'vk.com' or hostname.endswith('.vk.com')):
+                sd_socials['VKontakte'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'youtube.com' or hostname.endswith('.youtube.com')):
+                sd_socials['YouTube'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'wechat.com' or hostname.endswith('.wechat.com')):
+                sd_socials['WeChat'].append(urllib.parse.unquote(link))
+            elif hostname and (hostname == 'ok.ru' or hostname.endswith('.ok.ru')):
+                sd_socials['Odnoklassniki'].append(urllib.parse.unquote(link))
 
EOF
                 sd_socials['WeChat'].append(urllib.parse.unquote(link))
-            elif 'ok.ru' in link:
+            elif hostname and hostname.endswith('ok.ru'):
                 sd_socials['Odnoklassniki'].append(urllib.parse.unquote(link))

sd_socials = {k: list(set(v)) for k, v in sd_socials.items()}
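The final comprehension above only deduplicates each platform's collected links while keeping the keys; a minimal illustration:

sd_socials = {'YouTube': ['https://youtube.com/a', 'https://youtube.com/a']}
sd_socials = {k: list(set(v)) for k, v in sd_socials.items()}
print(sd_socials)   # {'YouTube': ['https://youtube.com/a']}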
91 changes: 34 additions & 57 deletions datagather_modules/data_assembler.py
@@ -22,36 +22,46 @@
sys.exit()

def establishing_dork_db_connection(dorking_flag):
if dorking_flag == 'basic':
dorking_db_path = 'dorking//basic_dorking.db'
table = 'basic_dorks'
elif dorking_flag == 'iot':
dorking_db_path = 'dorking//iot_dorking.db'
table = 'iot_dorks'
elif dorking_flag == 'files':
dorking_db_path = 'dorking//files_dorking.db'
table = 'files_dorks'
dorking_db_paths = {
'basic': 'dorking//basic_dorking.db',
'iot': 'dorking//iot_dorking.db',
'files': 'dorking//files_dorking.db',
'admins': 'dorking//adminpanels_dorking.db',
'web': 'dorking//webstructure_dorking.db',
}
dorking_tables = {
'basic': 'basic_dorks',
'iot': 'iot_dorks',
'files': 'files_dorks',
'admins': 'admins_dorks',
'web': 'web_dorks',
}
if dorking_flag in dorking_db_paths:
dorking_db_path = dorking_db_paths[dorking_flag]
table = dorking_tables[dorking_flag]
elif dorking_flag.startswith('custom'):
lst = dorking_flag.split('+')
dorking_db_name = lst[1]
dorking_db_path = 'dorking//' + dorking_db_name
table = 'dorks'
else:
raise ValueError(f"Invalid dorking flag: {dorking_flag}")
return dorking_db_path, table

class DataProcessing():
def report_preprocessing(self, short_domain, report_file_type):
report_ctime = datetime.now().strftime('%d-%m-%Y, %H:%M:%S')
files_ctime = datetime.now().strftime('(%d-%m-%Y, %Hh%Mm%Ss)')
files_body = short_domain.replace(".", "") + '_' + files_ctime
if report_file_type == 'pdf':
casename = files_body + '.pdf'
elif report_file_type == 'xlsx':
casename = files_body + '.xlsx'
elif report_file_type == 'html':
casename = files_body + '.html'
casename = f"{files_body}.{report_file_type}"
foldername = files_body
db_casename = short_domain.replace(".", "")
now = datetime.now()
db_creation_date = str(now.year) + str(now.month) + str(now.day)
report_folder = "report_{}".format(foldername)
robots_filepath = report_folder + '//01-robots.txt'
sitemap_filepath = report_folder + '//02-sitemap.txt'
sitemap_links_filepath = report_folder + '//03-sitemap_links.txt'
report_folder = f"report_{foldername}"
robots_filepath = os.path.join(report_folder, '01-robots.txt')
sitemap_filepath = os.path.join(report_folder, '02-sitemap.txt')
sitemap_links_filepath = os.path.join(report_folder, '03-sitemap_links.txt')
os.makedirs(report_folder, exist_ok=True)
return casename, db_casename, db_creation_date, robots_filepath, sitemap_filepath, sitemap_links_filepath, report_file_type, report_folder, files_ctime, report_ctime

@@ -129,20 +139,9 @@ def data_gathering(self, short_domain, url, report_file_type, pagesearch_flag, k
pass

if dorking_flag == 'none':
pass
dorking_status = 'Google Dorking mode was not selected for this scan'
dorking_file_path = 'Google Dorking mode was not selected for this scan'
elif dorking_flag == 'basic':
dorking_db_path, table = establishing_dork_db_connection(dorking_flag.lower())
print(Fore.LIGHTMAGENTA_EX + f"\n[EXTENDED SCAN START: {dorking_flag.upper()} DORKING]\n" + Style.RESET_ALL)
dorking_status, dorking_file_path = dp.save_results_to_txt(report_folder, table, dp.get_dorking_query(short_domain, dorking_db_path, table))
print(Fore.LIGHTMAGENTA_EX + f"\n[EXTENDED SCAN END: {dorking_flag.upper()} DORKING]\n" + Style.RESET_ALL)
elif dorking_flag == 'iot':
dorking_db_path, table = establishing_dork_db_connection(dorking_flag.lower())
print(Fore.LIGHTMAGENTA_EX + f"\n[EXTENDED SCAN START: {dorking_flag.upper()} DORKING]\n" + Style.RESET_ALL)
dorking_status, dorking_file_path = dp.save_results_to_txt(report_folder, table, dp.get_dorking_query(short_domain, dorking_db_path, table))
print(Fore.LIGHTMAGENTA_EX + f"\n[EXTENDED SCAN END: {dorking_flag.upper()} DORKING]\n" + Style.RESET_ALL)
elif dorking_flag == 'files':
else:
dorking_db_path, table = establishing_dork_db_connection(dorking_flag.lower())
print(Fore.LIGHTMAGENTA_EX + f"\n[EXTENDED SCAN START: {dorking_flag.upper()} DORKING]\n" + Style.RESET_ALL)
dorking_status, dorking_file_path = dp.save_results_to_txt(report_folder, table, dp.get_dorking_query(short_domain, dorking_db_path, table))
@@ -180,23 +179,12 @@ def data_gathering(self, short_domain, url, report_file_type, pagesearch_flag, k
pass

if dorking_flag == 'none':
pass
dorking_status = 'Google Dorking mode was not selected for this scan'
dorking_results = 'Google Dorking mode was not selected for this scan'
elif dorking_flag == 'basic':
dorking_db_path, table = establishing_dork_db_connection(dorking_flag.lower())
print(Fore.LIGHTMAGENTA_EX + f"\n[EXTENDED SCAN START: {dorking_flag.upper()} DORKING]\n" + Style.RESET_ALL)
dorking_status, dorking_results = dp.transfer_results_to_xlsx(table, dp.get_dorking_query(short_domain, dorking_db_path, table))
print(Fore.LIGHTMAGENTA_EX + f"\n[EXTENDED SCAN END: {dorking_flag.upper()} DORKING]\n" + Style.RESET_ALL)
elif dorking_flag == 'iot':
dorking_db_path, table = establishing_dork_db_connection(dorking_flag.lower())
print(Fore.LIGHTMAGENTA_EX + f"\n[EXTENDED SCAN START: {dorking_flag.upper()} DORKING]\n" + Style.RESET_ALL)
dorking_status, dorking_results = dp.transfer_results_to_xlsx(table, dp.get_dorking_query(short_domain, dorking_db_path, table))
print(Fore.LIGHTMAGENTA_EX + f"\n[EXTENDED SCAN END: {dorking_flag.upper()} DORKING]\n" + Style.RESET_ALL)
elif dorking_flag == 'files':
dorking_file_path = 'Google Dorking mode was not selected for this scan'
else:
dorking_db_path, table = establishing_dork_db_connection(dorking_flag.lower())
print(Fore.LIGHTMAGENTA_EX + f"\n[EXTENDED SCAN START: {dorking_flag.upper()} DORKING]\n" + Style.RESET_ALL)
dorking_status, dorking_results = dp.transfer_results_to_xlsx(table, dp.get_dorking_query(short_domain, dorking_db_path, table))
dorking_status, dorking_file_path = dp.save_results_to_txt(report_folder, table, dp.get_dorking_query(short_domain, dorking_db_path, table))
print(Fore.LIGHTMAGENTA_EX + f"\n[EXTENDED SCAN END: {dorking_flag.upper()} DORKING]\n" + Style.RESET_ALL)

data_array = [ip, res, mails, subdomains, subdomains_amount, social_medias, subdomain_mails, sd_socials,
@@ -234,20 +222,9 @@ def data_gathering(self, short_domain, url, report_file_type, pagesearch_flag, k
pass

if dorking_flag == 'none':
pass
dorking_status = 'Google Dorking mode was not selected for this scan'
dorking_file_path = 'Google Dorking mode was not selected for this scan'
elif dorking_flag == 'basic':
dorking_db_path, table = establishing_dork_db_connection(dorking_flag.lower())
print(Fore.LIGHTMAGENTA_EX + f"\n[EXTENDED SCAN START: {dorking_flag.upper()} DORKING]\n" + Style.RESET_ALL)
dorking_status, dorking_file_path = dp.save_results_to_txt(report_folder, table, dp.get_dorking_query(short_domain, dorking_db_path, table))
print(Fore.LIGHTMAGENTA_EX + f"\n[EXTENDED SCAN END: {dorking_flag.upper()} DORKING]\n" + Style.RESET_ALL)
elif dorking_flag == 'iot':
dorking_db_path, table = establishing_dork_db_connection(dorking_flag.lower())
print(Fore.LIGHTMAGENTA_EX + f"\n[EXTENDED SCAN START: {dorking_flag.upper()} DORKING]\n" + Style.RESET_ALL)
dorking_status, dorking_file_path = dp.save_results_to_txt(report_folder, table, dp.get_dorking_query(short_domain, dorking_db_path, table))
print(Fore.LIGHTMAGENTA_EX + f"\n[EXTENDED SCAN END: {dorking_flag.upper()} DORKING]\n" + Style.RESET_ALL)
elif dorking_flag == 'files':
else:
dorking_db_path, table = establishing_dork_db_connection(dorking_flag.lower())
print(Fore.LIGHTMAGENTA_EX + f"\n[EXTENDED SCAN START: {dorking_flag.upper()} DORKING]\n" + Style.RESET_ALL)
dorking_status, dorking_file_path = dp.save_results_to_txt(report_folder, table, dp.get_dorking_query(short_domain, dorking_db_path, table))
1 change: 1 addition & 0 deletions datagather_modules/networking_processor.py
@@ -40,6 +40,7 @@ def get_ssl_certificate(short_domain, port=443):
try:
logging.info('SSL CERTIFICATE GATHERING: OK')
context = ssl.create_default_context()
context.minimum_version = ssl.TLSVersion.TLSv1_2
conn = socket.create_connection((short_domain, port))
sock = context.wrap_socket(conn, server_hostname=short_domain)
cert = sock.getpeercert()
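The added minimum_version line makes the handshake fail for servers that only speak TLS 1.0/1.1 instead of silently downgrading; a minimal sketch of the same pattern (example.com is a placeholder host):

import socket
import ssl

context = ssl.create_default_context()
context.minimum_version = ssl.TLSVersion.TLSv1_2      # reject anything below TLS 1.2

with socket.create_connection(('example.com', 443)) as conn:
    with context.wrap_socket(conn, server_hostname='example.com') as sock:
        cert = sock.getpeercert()                      # certificate dict, as in get_ssl_certificate()
        print(sock.version())                          # e.g. 'TLSv1.2' or 'TLSv1.3'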
Binary file added dorking/adminpanels_dorking.db
Binary file modified dorking/basic_dorking.db
36 changes: 36 additions & 0 deletions dorking/db_creator.py
@@ -0,0 +1,36 @@
import sqlite3
from colorama import Fore
import os

def manage_dorks(db_name):
db_prep_string = str(db_name) + '.db'
if os.path.exists('dorking//' + db_prep_string):
print(Fore.RED + f"Sorry, but {db_prep_string} database is already exists. Choose other name for your custom DB")
pass
else:
conn = sqlite3.connect('dorking//' + str(db_prep_string))
cursor = conn.cursor()

cursor.execute('''
CREATE TABLE IF NOT EXISTS dorks (
dork_id INTEGER PRIMARY KEY,
dork TEXT NOT NULL
)
''')
conn.commit()

def add_dork(dork_id, dork):
try:
cursor.execute('INSERT INTO dorks (dork_id, dork) VALUES (?, ?)', (dork_id, dork))
conn.commit()
print(Fore.GREEN + "Successfully added new dork")
except sqlite3.IntegrityError:
print(Fore.RED + "Attention, dork_id variable must be unique")

while True:
dork_id = input(Fore.YELLOW + "Enter dork_id (or 'q' to quit this mode and save changes) >> ")
if dork_id.lower() == 'q':
break
dork = input(Fore.YELLOW + "Enter new dork >> ")
add_dork(int(dork_id), dork)
conn.close()
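For context, a database created this way is what the custom dorking flag later points at; a minimal sketch (the dorking directory is assumed to exist, and the name mydorks is made up) of creating such a DB and the flag that establishing_dork_db_connection() would resolve:

import sqlite3

db_path = 'dorking//mydorks.db'                        # hypothetical custom DB
conn = sqlite3.connect(db_path)
conn.execute('CREATE TABLE IF NOT EXISTS dorks (dork_id INTEGER PRIMARY KEY, dork TEXT NOT NULL)')
conn.execute('INSERT OR IGNORE INTO dorks (dork_id, dork) VALUES (?, ?)', (1, 'inurl:admin'))
conn.commit()
conn.close()

dorking_flag = 'custom+mydorks.db'
# establishing_dork_db_connection(dorking_flag) -> ('dorking//mydorks.db', 'dorks')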